Automate your SEO tasks with custom extraction | Max Coupland at BrightonSEO 2019

As this was my first time speaking at BrightonSEO, the team at Type A decided to document the event, capture the feelings and emotions on the day, and record my entire talk.

The text below is the script I roughly followed for my BrightonSEO talk. Each break in the text represents a move to the next slide, so if you wish, you can follow the slides above in unison with what I said and when!


Max discusses how you can use XPaths and REGEX for custom data extraction and site crawling on large and complicated websites.


At Type A Media, we work a 4-day week, and we can get away with this because we’ve learnt to be as streamlined as possible wherever we can.

We’ve built tools, processes and sheets, like some other companies are doing too,  

which allow us to leave the heavy lifting of a lot of our day-to-day tasks to the machines

so that we can focus on applying our skills to the analysis of the data.  

Whenever we find ourselves doing manual data collection, we stop and think ‘how can we automate this?’.

Now that’s great, but why the hell do you care how many days a week I work? Please raise your hand if any of these next statements apply to you: as a full-time, permanently contracted worker, do you work 3 days a week? 4 days? 5 days? I thought as much; most of you are on 5 days. So, my talk may be titled ‘Automating your SEO tasks’

but what I'm really trying to do today is get you back the most precious resource in our lives: time. More specifically, I'm going to help you get back 1 day a week, so that you have more time to do the things that matter to you.

It’s completely up to you what you do with that extra day a week. If that means you can do more work for your clients, or spend more time with your family instead of staying in the office until 9pm most weekdays, or if you simply want to spend more time in the pub, I'm not here to judge; that is entirely up to you! So, whatever you want to do with your free time, I hope that using some of the concepts we’re about to discuss can help you get back your day a week, the same way it’s helped us at Type A.

Automating your processes is not something we do across the board because some things in SEO just can’t be categorised into a few neat buckets, unfortunately. 

However, a huge array of repeatable SEO tasks can be automated while still collecting 100% reliable data (also eliminating the human error that creeps in from staring at the same Google Sheet all day, which will inevitably lead you to miss something), with the added bonus of saving you potentially days of work,

giving you more time to play Candy Crush while convincing your bosses that you’re hard at work.

It’s not necessary to build your company’s own proprietary tool to be able to save yourself days of SEO data collection. All it takes is a very basic understanding of two things: XPaths and Regular Expressions (or REGEX, as it’s abbreviated to). Before I briefly explain these two concepts, I need to tell you why you might care about them.

They’re both used to extract information from the web or a document that is tailored to your exact needs, no matter how specific. Understanding these two concepts will give you virtually any information from any website, even Google’s SERPs, at any scale, in the palm of your hands within a matter of minutes. If that’s not cool enough for you, how would you like to be able to crawl a website with millions of URLs without having to wait a week for the crawl to finish (that’s if the crawler doesn’t crash in that time)?

Hopefully some of you here are familiar with tools like Screaming Frog and DeepCrawl. They are brilliant tools, and they come with a vast array of default extractions for a website, e.g. title tags, HREFLANG tags, canonical tags and, very recently, structured data. I’m sure everyone here will know this already.

However, not every website is built the same way, so it’s impossible for them to offer every possible SEO element for you to extract, because sometimes they aren’t just for SEO. That’s why these tools offer people

like you and me the opportunity to use ‘custom extraction’.   

Either in the form of XPaths, which Sabine defined in her great talk,

or REGEX. Having a very basic understanding of REGEX and XPaths will allow you to get almost any information you could possibly want from a website (not just your own) en masse.

So in the next 15 minutes we’re going to scrape the surface of how custom extraction can be used to save yourself a day a week. You’ll see how it can be used to help you crawl a website efficiently,

how you can leverage Google’s SERP data with it, how it can be used in content optimisation,  

and even how it can help you in your content marketing. Basically, it can be used to help you no matter what area of SEO you’re working on, and I'm going to end with some great use cases of custom extraction that I've seen in the wild. So let’s crack on:

Let’s start with crawling large websites. Now, there's no way I can expect my crawler to crawl this site without running out of memory and taking 2 days to run, so I need to find a way of reducing the size of this crawl to make it more manageable. Enter custom extraction;

One way I could do this is to use an exclude rule and strip all parameters from the crawl using a simple REGEX string. Full parameter stripping can be a risky business on the whole, as you do sometimes need to be aware of what parameters your site is using and how they change the content, but for now it’s OK: we’re new to the site and we just want to see its breadth (i.e. how many subfolders and sub-subfolders it has).

So you can put this REGEX straight into your crawler, and it’ll crawl the entire website; whenever it reaches a parameter it will simply ignore it and carry on with the crawl.
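To make the exclude rule concrete, here’s a minimal Python sketch of what the crawler does with it. The exact string from the slide isn’t reproduced in this transcript, so the pattern below (match anything containing a query string) is an illustrative stand-in you can sanity-check against a handful of URLs before trusting it:

```python
import re

# Illustrative exclude rule: any URL containing a query string is skipped.
# Crawlers that support exclude rules take a regex much like this one.
EXCLUDE_PARAMS = re.compile(r"\?")

def should_crawl(url):
    """True if the URL survives the exclude rule (i.e. has no parameters)."""
    return not EXCLUDE_PARAMS.search(url)

urls = [
    "https://example.com/learn-chemistry/",
    "https://example.com/shop?colour=red&size=m",
]
kept = [u for u in urls if should_crawl(u)]
```

In the crawler’s own exclude field you’d typically write a full-match version such as `.*\?.*`; the Python above just lets you test the idea locally first.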

As you can see here, some parameter URLs can wreak havoc on the size of your crawl. Now imagine a website where no enforced URL structure is used, so these 48 parameters can appear in different orders, compounding the number of possible combinations/URLs on your site.

If you want to include parameters in your crawl but you don’t want them getting out of control, then you could also use this REGEX string, which will only include URLs that have three or fewer parameters. Once again, this is great to do as an initial crawl of the site, but be wary of never crawling parameters at all. If you manage an e-commerce website with loads of product filters, then you know just how many URLs this creates on your website.
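The ‘three or fewer parameters’ rule can be expressed as a single pattern. The exact string from the slide isn’t in the transcript, so this is one possible version, sketched and sanity-checked in Python:

```python
import re

# One way to match URLs with at most three query parameters: allow an
# optional query string with at most two '&'-separated extras after the first.
MAX_THREE_PARAMS = re.compile(r"^[^?]+(\?[^&]+(&[^&]+){0,2})?$")

def within_limit(url):
    """True if the URL has three or fewer query parameters."""
    return bool(MAX_THREE_PARAMS.match(url))
```

Used as an include rule, anything with a fourth parameter simply never enters the crawl queue.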

Great, so using very basic custom extraction rules, I've now been able to drastically reduce the size of the crawl, and I can see all the various subfolders being used on the website. Now, if I want to drill down into a specific subfolder and just crawl the URLs under that one directory, that’s also not a problem: I can use an include rule on top of my exclude rule to just crawl, let’s say, the Learn Chemistry subfolder.
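Layering an include rule on top of the exclude rule looks like this in Python terms (the subfolder path is hypothetical, matching the Learn Chemistry example; a crawler only queues URLs that pass both rules):

```python
import re

EXCLUDE = re.compile(r"\?")                 # drop any URL with parameters
INCLUDE = re.compile(r"/learn-chemistry/")  # restrict to one directory

def in_scope(url):
    """True only for parameter-free URLs inside the target subfolder."""
    return bool(INCLUDE.search(url)) and not EXCLUDE.search(url)
```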

Right, crawl done for now. 

What next? So now I’ve crawled the Learn Chemistry section of the website, and I can see educational pages in Learn Chemistry which, I've found out from speaking to the client, are a key area of the site for them. But oh heavens, all of this content is framed content. What an absolute nightmare; that is atrocious user experience, and it’s very unlikely I'm going to rank this page for much, am I? Anyway, I could spend all day flicking through these URLs to see which pages have similar framed content,

or I could just use our trusted ally, custom extraction, to automatically pull all the pages on this website using this framed content and give them to us in a single exportable view. I won’t spend too long on this section of custom extraction, as Sabine’s fantastic talk went into great detail about this type of extraction. All we need to do for the machines is find the class being used in the HTML to display that content, so: inspect element,

select the class we see being used and pull its XPath, then let Screaming Frog do all the hard work through its custom extraction feature, which is on the screen.
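Under the hood, the extraction is just an XPath query against each page’s HTML. A minimal, self-contained sketch; the class name `framed-content` is invented for illustration, and inspect-element gives you the real one:

```python
import xml.etree.ElementTree as ET

# A toy page with one block of framed content and one normal article block.
PAGE = """<html><body>
  <div class="framed-content"><iframe src="/legacy/quiz.html" /></div>
  <div class="article">Normal content</div>
</body></html>"""

# The XPath a crawler's custom extraction field would take:
#   //div[@class='framed-content']
tree = ET.fromstring(PAGE)
matches = tree.findall(".//div[@class='framed-content']")
```

Run against every crawled URL, any page where this returns a match goes on the fix-list.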

So now we can just leave this crawl running for a bit. While that crawl is running, I’m just going to take a quick detour to give you

other examples of how this type of automatic custom extraction can be helpful to you, 

so it could also be used to grab every page missing GA code, find iframes in the head of the page, or rogue HREFLANG tags in the body, etc. All you need to do is find the elements they are wrapped in, use their XPaths in custom extraction, and you’ll be saved all this time and hassle. Grab that free day a week.
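As a sketch of those checks, here are the kinds of XPath expressions you could drop into a custom extraction field, tried out against a deliberately broken toy page (the markup here is invented for illustration):

```python
import xml.etree.ElementTree as ET

# Toy page with two of the problems mentioned above: an iframe in the <head>
# and an hreflang link in the <body> (hreflang tags belong in the <head>).
PAGE = """<html>
  <head><iframe src="/oops" /></head>
  <body><link rel="alternate" hreflang="fr" href="/fr/" /></body>
</html>"""

tree = ET.fromstring(PAGE)
head_iframes = tree.findall("./head//iframe")             # //head//iframe
rogue_hreflang = tree.findall("./body//link[@hreflang]")  # //body//link[@hreflang]
# Pages missing a GA snippet work the other way round: extract the snippet
# on every page, then filter for pages where the extraction came back empty.
```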

If you’re looking to run RegEx then DeepCrawl is a great tool for using RegEx to crawl your site or extract important information and it even comes with a live RegEx tester so you can see if your string is functional before implementing it in the wild. 

So back to our task at hand: now we have our crawler doing the custom hard work for us, we can sit back and wait. Maybe go grab yourself a coffee, or watch Netflix on your phone while your boss isn’t watching. Or we may be panicking that we still have shitloads to do, so we might use this new free time to crack on with some keyword research.

So at the same time as the custom extraction crawl, we can do some keyword research for our Learn Chemistry section to identify any key search terms we aren’t targeting with our existing content. I’m sure we’re all familiar with standard keyword research procedure here: the typical practices like gathering competitor keywords, using SEMrush to pull ranking positions, GKP for volumes, ATP for query-based keywords, and also

gathering People Also Ask (PAA) queries and related searches which are being displayed on the SERP for your keyword list...? Is that last one raising a few eyebrows? It’s definitely information we want access to,

but is there actually a way of seeing what PAA queries exist for your, let’s say, 2,000-strong keyword list without manual work? Well yes, there is, and it’s once again our friend custom extraction that’s going to help us find them, for all 2,000 of our keywords, while we do next to no work.

By now we’ve done our introductory keyword research, so the previous Screaming Frog custom crawl has finished; we’ve saved that in our documents and we can now start a new crawl.

Google’s SERP is a webpage, and like most webpages, it uses classes, IDs and elements to structure its pages. That’s all we need for our new custom extraction knowledge. So, let’s inspect this PAA box here and see what class Google uses for its PAA boxes.

Once again, we extract that class,  

and we’re going to combine it with this specially created URL for our keywords. This is Google’s root URL for search, so we just have to recreate how the URL would appear were we to conduct a Google search for each of our keywords.

And combine it with our keyword list and crawl away.  
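Constructing those URLs is a one-liner per keyword: take Google’s search root and URL-encode the query. A quick sketch (the keywords are invented examples):

```python
from urllib.parse import quote_plus

# Recreate the URL a Google search for each keyword would produce,
# ready to feed to the crawler as a URL list.
def serp_url(keyword):
    return "https://www.google.com/search?q=" + quote_plus(keyword)

keywords = ["gcse chemistry revision", "what is a mole in chemistry"]
serp_urls = [serp_url(k) for k in keywords]
```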

Now we can navigate to the custom extraction tab and see that, for our list of keywords, Screaming Frog is pulling through the PAA queries listed on the SERP for each one.

The day is still young, and there’s plenty more work we can be getting on with, saving us time in the future. 

Now you have a huge list of keywords and related queries you can use in your content. That’s great!

Now you’ll need to vet them for competitor brand names and for user intent, right? But identifying user intent is usually guesswork, especially at scale. You can do it manually if you’re working on niche or small sites, although that’s still not you optimising your time! Has anyone here ever looked at a keyword and thought ‘that obviously has informational intent’ and then been dead wrong? I have, on multiple occasions. Google defines intent based on what it believes users are searching for, but it’s still Google deciding what the intent is, and to put it bluntly, that’s all that matters. Sometimes Google is a bit wrong, and sometimes it’s us. So how can we not only get around having to guess what Google believes the user intent behind a keyword to be, but also cut down the time spent manually vetting these keywords at scale?

Custom extraction, did I hear someone say?! Indeed, once again custom extraction can help us here, but before I go into this idea I need to credit it properly. I so wish I could take the credit for this idea, because it’s absolutely genius, and it also shows that the limits of how custom extraction can help you are basically only restricted by your own imagination. It’s the fantastic work of Rory Truesdale from Conductor, whom I've linked/credited here.

He took custom extraction a stage further and combined the extraction with formulas in Excel. 

You can pull the title tags from every SERP listing ranking for your keyword at the same time, automatically, using custom extraction, and here’s the simple XPath!
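The transcript doesn’t reproduce the XPath from the slide, but at the time organic result titles sat in `<h3>` elements, so a pattern as simple as `//h3` gets you most of the way (Google’s markup changes often, so verify with inspect-element). A toy version:

```python
import xml.etree.ElementTree as ET

# A stripped-down stand-in for a SERP: two results, titles in <h3>.
SERP = """<html><body>
  <div class="result"><h3>Buy lab goggles online | ExampleShop</h3></div>
  <div class="result"><h3>What is a mole in chemistry? | ExampleEdu</h3></div>
</body></html>"""

tree = ET.fromstring(SERP)
titles = [h3.text for h3 in tree.findall(".//h3")]  # XPath: //h3
```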

I’m using Screaming Frog to do this extraction so here are a few adjustments to the crawl you’ll need to be aware of for this to work, 

by the way, if you don’t have Screaming Frog but you’d like to give this a go, you can download the free version, which allows you to crawl 500 URLs for free; beyond that you’ll need a paid licence.

We just need to construct our URL for Google search, the same way we did for the PAA custom extraction 

And we can see the title tags of all the top-ranking results for each keyword.

If you export that data into Excel

and use a few nifty formulas to identify words such as ‘what’, ‘how’, ‘why’, etc. in the title tags, then you can mark them up in the formula; likewise a formula for commercial terms.

If a keyword is flagged by the informational filter, then we can see the keyword has informational intent! If, let’s say, most of the top 10 SERP titles for your keyword include the word ‘buy’, then that keyword is likely going to be commercial, you’d think, right?
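The same title-flagging logic, sketched in Python rather than Excel; the trigger-word lists here are illustrative, not Rory’s actual ones:

```python
# Flag a keyword's likely intent from the words in its top-ranking SERP titles.
INFORMATIONAL = {"what", "how", "why", "guide"}
COMMERCIAL = {"buy", "price", "cheap", "deal"}

def classify(titles):
    """Return a coarse intent label for one keyword's SERP titles."""
    words = {w.strip("?.,!|").lower() for t in titles for w in t.split()}
    if words & COMMERCIAL:
        return "commercial"
    if words & INFORMATIONAL:
        return "informational"
    return "intent not found"  # outliers still need a manual look
```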

You can then use this data to segment your keyword list and show the opportunity in each segment of the client’s search space. 

I've purposefully left a few of these rows as ‘intent not found’. This is because it’s not a bulletproof technique; things rarely are in SEO, no matter how genius they appear to be. You’ll still need to vet your keyword list for the outliers, but would you rather be manually vetting 2,000 keywords for intent, or 20?

At the end of this, you’ll have a list of intent-categorised keywords.

So let’s recap what we’ve done so far: crawled an 8-million-page website,

Identified content errors on our website and been able to provide a full list of individual instances to correct 

Pulled PAA queries for our 2,000 strong keyword list 

And used Rory’s intent example to categorise our list at scale.

And we’ve done all this before 4pm. Crazy.

Those poor poor machines doing all our work for us 

So have we got back our day a week yet?  

Well, not quite, but we’re close 

We all know about the 3 pillars of SEO, and I promised we could use custom extraction for all 3,  

But there’s still one missing: outreach.

The second great use case of custom extraction is from Type A Media, shock horror.  

As a result of one of our ideation sessions, we set out to create a tool which would work out what salary a parent would earn if they got paid to raise their children. This tool needed to combine publicly available salary information from jobs such as chefs, taxi drivers, cleaners, educators, etc. The trouble came with where we could get our data from, so that it was completely legitimate and not at all based on guesswork. To make matters more difficult, we needed to not only find the data but also be able to cut and clean it so that it was usable. Obviously, manually writing down each salary we came across was not an option, as we were aiming to split the salary data by region.

So we took to a job listings site where, thanks to their very on-point URL architecture, it was easy to crawl the jobs we needed, from the regions we needed,

and extract a range of salary information 

 from each job 

listing using XPaths.

The end result was over 300,000 different job pages, with an average of 20 job listings on each page, giving our data a large enough sample size (6 million listings) to be taken seriously by regional and national publications.

We then cleaned the data so it was split by region and profession 
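The cleaning step boiled down to pulling numbers out of free-text salary strings so they could be aggregated per region and profession. A hedged sketch (the listing format shown is invented):

```python
import re

# Job listings give salaries as free text; a regex pulls the figures out.
SALARY_RE = re.compile(r"£([\d,]+)")

def parse_salaries(text):
    """Extract all pound amounts from a listing as plain integers."""
    return [int(m.replace(",", "")) for m in SALARY_RE.findall(text)]

listing = "Chef - London - £25,000 - £30,000 a year"
salaries = parse_salaries(listing)
midpoint = sum(salaries) / len(salaries)
```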

And you have yourself a ready-to-go outreachable asset.

To end: you can also use Google’s dataset search engine to find websites offering stats and data on the subject you’re looking to build your outreach off the back of, and see how you could scrape that information for your benefit. Just be sure to clearly credit your sources where necessary.

So, some parting points: we’ve covered all 3 pillars of SEO,

We've let the machines do all the work for us 

The limits to what can be done with custom extraction are purely down to the limits of your imagination, like Rory showed us 

And hopefully you’ll be able to use this to get yourself back a day a week