Here’s a question: What would happen if we ran the entire online British news media through machine learning algorithms designed to detect meaning and values within each article?
What undiscovered treasures and trends would we find?
What previously hidden learnings would reveal themselves to us?
What golden nuggets of insight would we uncover?
Over a decade ago, in 2008, Google released the Google Cloud platform, a “world-class infrastructure and robust set of solutions to build, operate, and grow your business in today’s complex, multi-cloud environment”, as they eloquently put it.
In other words: they make tools to store, process and do all manner of smart things with data.
Earlier this year, Google launched the AI Platform, a new set of tools within the Cloud platform at large that let you do really smart things with data, things such as image recognition, language translation, text-to-speech, and, what caught our attention: natural language processing.
“Natural language processing” – sounds kinda fancy. Let’s break it down.
“Natural language” can be thought of as any language that humans use to communicate with one another. Things like recorded conversations, text messages, tweets, and so on are all examples of natural language.
If I were to give you a piece of text and ask you to point out the most important nouns and adjectives within it, you could be said to be “processing” the meaning of that text. If we were to give an algorithm that very same task, what you’d have is a computer processing natural language.
This is easy for a human; hard for a computer.
Smash these two concepts together and what you get is natural language processing (NLP for short): you give the computer a piece of text, it will spit out a bunch of words and phrases (or “entities”) it believes the text is about…
…or so the theory goes.
We wanted to test it out. Specifically, we wanted to use Google’s Cloud Platform to process the natural language of the online British news media.
We’re still pretty broad, so let’s get even more specific: Let’s use Google’s NLP API to parse 400 articles written about “Boris Johnson”.
Why Boris? Because he’s topical and a ham for publicity. Why 400? Well, we need to start somewhere…
…so let’s begin.
At the end of this sprint, we want to have two datasets:
The first: the body text of 400 news articles written about Boris Johnson.
The second: for each of these 400 articles, a list of entities derived from Google’s NLP API.
To get the text of 400 news articles, one could navigate to 400 individual URLs, highlight the entire article, and copy/paste the text into a cell in a spreadsheet.
But this would be way too manual, fiddly and awkward.
To get the text of 400 news articles, one could write a web scraper using Python’s requests and BeautifulSoup modules, which together would programmatically copy the article text into a data file.
One limitation of web scraping is that the target URL’s HTML needs to be well formatted and structured. A brief look at the page structure of Guardian.com articles showed there was indeed a salient, easily identifiable HTML tag that contained all the text of the article (<div class="content article-body">). This could be our route in for the web scraper.
However, that same tag contained a lot of information that wasn’t the direct article text, things like the <aside> tags that linked to other articles on the site.
This noise in the data would inevitably pollute the results of the NLP. A web scraper could be written to handle these roadblocks and feed back clean data, but the moment the page structure changed, or a new and unforeseen HTML element was encountered, the web scraper would break.
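To make that fragility concrete, here is a minimal sketch of the "grab the article body, skip the <aside> blocks" idea. It is not the scraper we considered writing: it swaps BeautifulSoup for Python's built-in html.parser so it runs with no installs, and the class name, function name and the "article-body" default are all our own assumptions for illustration.

```python
from html.parser import HTMLParser


class ArticleTextExtractor(HTMLParser):
    """Collect text inside the article body <div>, skipping <aside> blocks."""

    # Void elements never get a closing tag, so they must not change depth.
    VOID_TAGS = {"br", "img", "hr", "input", "meta", "link", "source", "wbr"}

    def __init__(self, body_class="article-body"):  # class name is an assumption
        super().__init__()
        self.body_class = body_class
        self.body_depth = 0   # > 0 while inside the article body div
        self.aside_depth = 0  # > 0 while inside an <aside>
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in self.VOID_TAGS:
            return
        if self.body_depth:
            self.body_depth += 1
            if tag == "aside" or self.aside_depth:
                self.aside_depth += 1
        elif tag == "div":
            classes = (dict(attrs).get("class") or "").split()
            if self.body_class in classes:
                self.body_depth = 1

    def handle_endtag(self, tag):
        if tag in self.VOID_TAGS or not self.body_depth:
            return
        self.body_depth -= 1
        if self.aside_depth:
            self.aside_depth -= 1

    def handle_data(self, data):
        # Keep text only when inside the body and outside any <aside>.
        if self.body_depth and not self.aside_depth and data.strip():
            self.chunks.append(data.strip())


def extract_article_text(html, body_class="article-body"):
    """Return the article text as one string, with <aside> content removed."""
    parser = ArticleTextExtractor(body_class)
    parser.feed(html)
    parser.close()
    return " ".join(parser.chunks)
```

Note that the depth counting assumes every tag is properly closed. One unclosed <p> and the counts drift, which is exactly the kind of breakage described above.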
To get the text of 400 news articles, one could use The Guardian’s news API which would return pre-cleaned article data prepped and ready for Google’s NLP.
But this, of course, would limit us to just articles from a single source.
There was another alternative.
EventsRegistry: a searchable news database with over 30,000 sources.
Like The Guardian’s API, the EventsRegistry API gave us access to clean article data without the hassle of web scraping… and it gave us access to a vast number of potential news sources.
We decided to get our 400 Boris Johnson articles from these two APIs: 200 direct from The Guardian and 200 from EventsRegistry.
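For the curious, the Guardian half of that request can be sketched as a simple URL builder. This is an illustration rather than our actual script: the function name, the page size of 50 and the placeholder key are assumptions, and the query parameters (q, show-fields, page-size, page, api-key) are the ones we understand the Guardian's content search endpoint to accept. Check their documentation before relying on them.

```python
from urllib.parse import urlencode

# The Guardian's content search endpoint.
GUARDIAN_SEARCH_URL = "https://content.guardianapis.com/search"


def build_guardian_query(term, api_key, page=1, page_size=50):
    """Build a search URL asking for the plain body text of matching articles.

    The function name and defaults are hypothetical; only the URL is built here,
    no request is sent.
    """
    params = {
        "q": f'"{term}"',            # quoted so the phrase is matched as a whole
        "show-fields": "bodyText",   # ask for the article text, not just metadata
        "page-size": page_size,
        "page": page,
        "api-key": api_key,
    }
    return f"{GUARDIAN_SEARCH_URL}?{urlencode(params)}"


if __name__ == "__main__":
    # "YOUR_API_KEY" is a placeholder, not a real key.
    print(build_guardian_query("Boris Johnson", "YOUR_API_KEY"))
```

Paging through the results (page=1, page=2, and so on) until 200 articles are collected would be the natural next step.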
After pinging these APIs, we got back a nice, clean JSON file with 400 objects, each object containing the three variables:
Now we had our data – our natural language – it was time to process it.
Although the statistics behind NLP are complex, the idea behind Google’s NLP API itself is fairly simple: give it text, and it will give you the list of entities it deems that text to represent.
You can actually have a go yourself on their demo page.
For example, if we take this tweet from @BorisJohnson…
…and run it through the NLP demo page…
…Google spits out seven entities:
Seems about right!
(Google also feeds back other useful info such as an “entity category”, which is like a super-entity, and a “salience score”, a number between 0 and 1 indicating how important that entity is to the text as a whole. Read more about these here.)
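Behind the demo page sits a plain REST call. The sketch below shows roughly what goes in and what comes out, assuming the v1 analyzeEntities endpoint and its JSON request body. The two helper functions are hypothetical names of our own, and the sample response at the bottom is made up purely to show the shape, not real output from Google.

```python
# REST endpoint for entity analysis (v1 of the Natural Language API).
ANALYZE_ENTITIES_URL = "https://language.googleapis.com/v1/documents:analyzeEntities"


def build_entity_request(text):
    """Build the JSON body for an analyzeEntities call on plain text."""
    return {
        "document": {"type": "PLAIN_TEXT", "content": text},
        "encodingType": "UTF8",
    }


def summarise_entities(response):
    """Return (name, category, salience) tuples, most salient first."""
    entities = [
        (e["name"], e.get("type", "UNKNOWN"), e.get("salience", 0.0))
        for e in response.get("entities", [])
    ]
    return sorted(entities, key=lambda e: e[2], reverse=True)


if __name__ == "__main__":
    # Invented values, for illustration only.
    fake_response = {
        "entities": [
            {"name": "Brexit", "type": "EVENT", "salience": 0.2},
            {"name": "Boris Johnson", "type": "PERSON", "salience": 0.7},
        ]
    }
    for name, category, salience in summarise_entities(fake_response):
        print(name, category, salience)
```

Sending the body is then a single authenticated POST to ANALYZE_ENTITIES_URL. We have left that part out because it depends on how your credentials are set up.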
Using custom-written Python scripts, we passed each of our 400 Boris Johnson articles through the backend version of this same API; here’s what it returned:
On the left, you have the plain old article text. On the right, you have the list of entities Google returned for that article. Why do we have so many entities now? Because: more text, more entities.
And for each of these 400 articles (saved in the aptly named articles.json), we have a single set of entities (saved in the equally aptly named entities.json).
A quick count of the total number of entities available to us: 55,824.
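A count like that takes only a few lines. The sketch below assumes entities.json holds a list with one inner list of entities per article. That layout is our assumption for illustration, so adjust the loop if your file is shaped differently.

```python
import json
from pathlib import Path


def count_entities(path):
    """Sum the number of entities across every article in the file.

    Assumes the file is a JSON list of lists: one inner list per article.
    """
    entities_by_article = json.loads(Path(path).read_text(encoding="utf-8"))
    return sum(len(entities) for entities in entities_by_article)


if __name__ == "__main__":
    # Only runs the count if the file is actually present.
    if Path("entities.json").exists():
        print(count_entities("entities.json"))
```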
These two datasets were literally made for each other. So join us next week, when they’ll meet for the very first time inside the wonderful JupyterLab for a spot of PDA and a spoonful of EDA: exploratory data analysis.