Extracting links from text

Typing hyperlinks by hand, or carefully selecting and copying them, is error-prone. I'd rather have a tool that extracts the URLs from a blob of text. Some applications put several formats on the clipboard when copying; a web browser, for example, includes an HTML version of the copied text alongside the plain text. Sometimes the text I copy also contains several URLs that I might want to use.

Wikipedia entry discussing data formats in clipboards

Text 2 links reads a block of text from STDIN. The text is expected to come from an imprecise source with lots of redundant information, such as a file or the clipboard. It uses Linkify to find all the valid URLs in the text and prints them to STDOUT. The URLs are separated by newlines by default, but the delimiter can be customised.

text2links github repo

linkifyjs
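
The STDIN-to-STDOUT flow described above can be sketched in a few lines of Python. This is only an illustration, not the text2links source: a deliberately simplified regular expression stands in for Linkify, the short TLD list is an assumption, and the names `extract_links`, `text_to_links` and `main` are hypothetical.

```python
import re
import sys

# Simplified stand-in for Linkify (an assumption, not its real rules):
# an optional http(s) protocol, a dotted host ending in one of a few
# assumed TLDs, and an optional path, query string or fragment.
URL_PATTERN = re.compile(
    r"(?:https?://)?(?:[a-z0-9-]+\.)+(?:com|org|net|io|dev)\b"
    r"(?:[/?#][^\s<>\"']*)?",
    re.IGNORECASE,
)


def extract_links(text, default_protocol="http://"):
    """Return every URL found in text, in order of appearance."""
    links = []
    for match in URL_PATTERN.finditer(text):
        # Drop sentence punctuation that got swept up with the URL.
        url = match.group(0).rstrip(".,;:!?)")
        # Bare domains get a default protocol prepended.
        if not re.match(r"https?://", url, re.IGNORECASE):
            url = default_protocol + url
        links.append(url)
    return links


def text_to_links(text, delimiter="\n"):
    """Join the extracted URLs with a customisable delimiter."""
    return delimiter.join(extract_links(text))


def main():
    # Read the whole blob from STDIN and print one URL per line.
    sys.stdout.write(text_to_links(sys.stdin.read()) + "\n")
```

Calling `main()` at the end of a script turns this into a filter that a file or the clipboard contents can be piped into.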

One idea is to grab some text with an OCR tool such as Tesseract, then feed that text into this tool. Unfortunately, the tool has no fuzzy URL detection or path-only detection. A space before the .com makes the URL undetectable, while a space in the protocol (or an unknown protocol) means the protocol is ignored and a default "http://" is used instead. OCR output can contain exactly these errors, so in some cases the tool won't be helpful.

Tesseract OCR library
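
One possible workaround for the two failure cases above is to clean up the OCR text before handing it to the link extractor. The sketch below is hypothetical and not part of text2links: `repair_ocr_urls` is an invented name, it only knows the http and https protocols and three assumed TLDs, and it may wrongly join ordinary prose that happens to look like a broken URL.

```python
import re


def repair_ocr_urls(text):
    """Undo two common OCR breakages in URLs (best-effort heuristic)."""
    # Rejoin a protocol split by stray spaces: "https ://" or "http: //".
    text = re.sub(
        r"\b(https?)\s*:\s*//\s*", r"\1://", text, flags=re.IGNORECASE
    )
    # Remove whitespace before an assumed TLD: "example .com" -> "example.com".
    text = re.sub(
        r"(?<=[a-z0-9])\s+\.(com|org|net)\b", r".\1", text, flags=re.IGNORECASE
    )
    return text


print(repair_ocr_urls("visit https ://example .com/docs now"))
```

The repaired text could then be piped into the tool as usual. Unknown protocols and path-only fragments are left untouched, so those cases remain unsolved.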

I'm not sure this tool is more useful than something that speeds up writing a URL manually. I imagine shortcuts that autocomplete the protocol and TLD could make typing easier, as could presenting long paths and query strings clearly so that errors are easy to spot. In any case, this project leads nicely into another one, linkinfo to HTML, which converts a specifically formatted file type into an HTML page that is just a list of links.

Convert linkinfo files to HTML