♾️🧞‍♀️✨ — Extracting a list from a webpage

10 Sep 2019

The problem

You have a webpage with a list of things: values, prices, emails, or links. You want to copy that list into a string you can use elsewhere, such as in a spreadsheet or as data.

Table of names

There’s no API you can use to fetch these, but you know that you can construct a CSS3 selector to match them all. So you can open the browser’s developer tools (a.k.a. F12) and use JavaScript on the console tab as your ‘API’.

Extracting the list

You look at the page in your browser’s inspector and the email addresses you want to pull out are coded as:

<table>
<tr><td><a class="email" href="mailto:a@b.tld">a@b.tld</a></td></tr>
<tr><td><a class="email" href="mailto:e@m.tld">e@m.tld</a></td></tr>
</table>

The CSS3 selector is 'a.email'. That is, you want to pull every A element with the class name email out of the current page. Each of those A elements has an href of the form mailto:name@example.tld.

So we’ll get the list and iterate over it, chopping up the href values and turning them into a list of addresses.
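A mailto: link can also carry a query string, such as mailto:a@b.tld?subject=Hi, and splitting on ':' alone would keep the ?subject=Hi part. A minimal sketch of a slightly more defensive chop, assuming a hypothetical helper named addressFromHref that is not part of the one-liner in this post:

```javascript
// Hypothetical helper (an illustration, not from the original post):
// pull the address out of a mailto: href, dropping any ?subject=... part.
function addressFromHref(href) {
  const withoutScheme = href.replace(/^mailto:/i, '');
  return withoutScheme.split('?')[0];
}

console.log(addressFromHref('mailto:a@b.tld'));            // a@b.tld
console.log(addressFromHref('mailto:e@m.tld?subject=Hi')); // e@m.tld
```

For the plain mailto:name@example.tld links on this page, the simpler split(':')[1] gives the same result.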

We open the JavaScript console on the page and run this one-liner.

document.querySelectorAll('a.email') // <= the console's $() would return only the first match
.map((el) => { return el.href.split(':')[1]; })
.join('\n');

But the browser reports an error here, because the result of the selector query is a node list, not an array, and node lists have no map() method.

You can use Array.from() to make that node list into an array.

Array.from(document.querySelectorAll('a.email'))
.map((el) => {
    return el.href.split(':')[1];
})
.join('\n')

Now you’ll get a list of email addresses, unsorted, and with duplicates.

e@m.tld
a@b.tld
c@d.tld
a@b.tld
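The conversion step can be tried outside the browser too. The sketch below uses a plain array-like object as a stand-in for the node list (an assumption for illustration, with hrefs made up to match the sample output above). It also shows that Array.from() accepts a mapping function as its second argument, which saves the separate map() call:

```javascript
// Stand-in for a node list: indexed entries plus a length, but no map().
const fakeNodeList = {
  0: { href: 'mailto:e@m.tld' },
  1: { href: 'mailto:a@b.tld' },
  2: { href: 'mailto:c@d.tld' },
  3: { href: 'mailto:a@b.tld' },
  length: 4,
};

console.log(typeof fakeNodeList.map); // "undefined", so calling .map() would throw

// Array.from(arrayLike, mapFn) converts and maps in one step.
const rawEmails = Array.from(fakeNodeList, (el) => el.href.split(':')[1]);
console.log(rawEmails.join('\n'));
```

The result is the same unsorted list, duplicates included.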

You could clean that up in a text editor but let’s go further.

Cleaning the list

Sorting is simple.

Array.from(document.querySelectorAll('a.email'))
.map((el) => {
    return el.href.split(':')[1];
})
.sort()
.join('\n')
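One caveat: with no arguments, sort() compares strings by UTF-16 code unit, so every uppercase letter sorts before every lowercase one. If the page mixes cases, a comparison function helps. A minimal sketch, with made-up addresses:

```javascript
const mixedCase = ['b@x.tld', 'Z@x.tld', 'a@x.tld'];

// Default sort: "Z" (code 90) comes before "a" (code 97).
const byCodeUnit = [...mixedCase].sort();
console.log(byCodeUnit.join(', ')); // Z@x.tld, a@x.tld, b@x.tld

// Compare lowercased copies so case is ignored.
const ignoringCase = [...mixedCase].sort((a, b) => {
  const x = a.toLowerCase();
  const y = b.toLowerCase();
  return x < y ? -1 : x > y ? 1 : 0;
});
console.log(ignoringCase.join(', ')); // a@x.tld, b@x.tld, Z@x.tld
```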

That doesn’t get rid of the duplicates.

JavaScript supplies the filter method, but the obvious way to de-dupe with it is to track the values already seen in an accumulator defined on a separate line, so we don’t get a nice, context-minimal one-liner.
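For comparison, a sketch of what the filter route might look like, using made-up data. The seen accumulator is the extra line that spoils the one-liner:

```javascript
const withDuplicates = ['e@m.tld', 'a@b.tld', 'c@d.tld', 'a@b.tld'];

// The accumulator has to exist before filter() runs.
const seen = [];
const withoutDuplicates = withDuplicates.filter((email) => {
  if (seen.includes(email)) {
    return false; // already kept once, drop the repeat
  }
  seen.push(email);
  return true;
});

console.log(withoutDuplicates.join('\n'));
```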

ES6 provides a new object, Set. Sets don’t allow duplicate values, and the Set constructor takes any iterable as its input.

new Set([1, 1, 2, 2, 3]) // => Set(3) [1, 2, 3]
new Set('committee') // => Set(6) [ "c", "o", "m", "i", "t", "e" ]

So we can de-dupe the list using that, and turn it back into an array to sort and join it into a string.

But what does Set use to de-dupe?

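It turns out that new Set(*node list*) doesn’t remove any duplicates. Set decides whether two values are the same by comparing them directly (the SameValueZero rule, close to ===), and every element in the node list is a distinct object, even when two links point at the same address.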
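Set compares values with the SameValueZero rule: primitives such as strings match by value, but an object only matches itself. A small sketch, using plain objects as stand-ins for element nodes (an assumption, since there is no page to query here):

```javascript
// Two stand-in "elements" with identical content are still distinct objects.
const firstLink = { href: 'mailto:a@b.tld' };
const secondLink = { href: 'mailto:a@b.tld' };

const setOfObjects = new Set([firstLink, secondLink]);
console.log(setOfObjects.size); // 2, nothing was de-duplicated

// Strings are primitives, so equal text collapses to one entry.
const setOfStrings = new Set([firstLink.href, secondLink.href]);
console.log(setOfStrings.size); // 1
```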

So you have to process the list into an array of strings before you turn it into a set.

Array.from(new Set(Array.from(document.querySelectorAll('a.email'))
.map((el) => {
    return el.href.split(':')[1];
})));

Then you can sort the array of unique text values and join it into a string.

The complete one-liner, formatted for legibility, is:

Array.from(new Set(Array.from(document.querySelectorAll('a.email'))
.map((el) => {
    return el.href.split(':')[1];
})))
.sort()
.join('\n');

Which will return:

a@b.tld
c@d.tld
e@m.tld
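The same pipeline works for other lists on a page, such as links or prices, if the selector and the extraction step are swapped out. A minimal sketch, assuming a hypothetical helper named uniqueSorted and stand-in objects in place of real elements:

```javascript
// Hypothetical helper (an illustration, not from the original post): takes
// any iterable or array-like collection plus a function that picks the
// text out of each item, and returns the unique values, sorted, one per line.
function uniqueSorted(items, pick) {
  return Array.from(new Set(Array.from(items, pick))).sort().join('\n');
}

// Stand-ins for the elements a selector would return.
const fakeLinks = [
  { href: 'mailto:e@m.tld' },
  { href: 'mailto:a@b.tld' },
  { href: 'mailto:c@d.tld' },
  { href: 'mailto:a@b.tld' },
];

console.log(uniqueSorted(fakeLinks, (el) => el.href.split(':')[1]));
```

In the console, the first argument would be the node list for whichever selector fits the page, and the second would return whatever text you are after.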