Automating the retrieval of content from HTML pages doesn’t come up often, but it does happen. Sometimes people even screen scrape their own systems as a means of integration (which is not always the best idea).
Retrieving web content automatically and processing it for another purpose is often referred to as screen scraping. Usually, screen scraping involves capturing text content, but it can also include the processing of images and videos.
Probably the most common use of this technology is building a search engine, which crawls the web to examine the content that exists.
There are a lot of ways to do screen scraping. A straightforward example is writing some custom code to perform HTTP requests and then manually parse the responses. That would be a way to do it, but luckily, there are many better options out there.
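To make the "custom code" approach concrete, here is a minimal sketch of fetching a page and pulling out its links by hand. The function names `extractLinks` and `fetchLinks` are made up for illustration, and the sketch assumes Node 18 or later for the global `fetch`. The regular expression is deliberately naive, which is exactly why this approach gets painful quickly.

```javascript
// Naive link extraction with a regular expression.
// Assumption: hrefs are quoted and appear inside <a ...> tags; real-world HTML often breaks this.
function extractLinks(html, baseUrl) {
  const links = [];
  const hrefPattern = /<a\s[^>]*?href\s*=\s*["']([^"']+)["']/gi;
  for (const match of html.matchAll(hrefPattern)) {
    try {
      // Resolve relative links against the page URL.
      links.push(new URL(match[1], baseUrl).href);
    } catch {
      // Skip values that cannot be parsed as URLs.
    }
  }
  return links;
}

// Fetch a page over HTTP and return the links found in its raw HTML.
// Content rendered by client-side JavaScript will not be present in the response.
async function fetchLinks(url) {
  const response = await fetch(url); // global fetch, Node 18+
  if (!response.ok) {
    throw new Error(`Request for ${url} failed with status ${response.status}`);
  }
  return extractLinks(await response.text(), url);
}
```

This works for simple static pages, but it never runs the page's scripts and it breaks on markup the pattern did not anticipate.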
Perhaps unexpectedly, Chromium, the open-source core of Google Chrome, powers a lot of software beyond the browser itself. The Visual Studio Code editor is built on Electron, which embeds Chromium, so it is, to a degree, Chrome under the hood. Microsoft Teams also relies on Chromium under the hood.
But another use for Chrome under the hood is as a screen scraping engine.
Google maintains Puppeteer, a Node.js package that drives Chrome, by default in headless (no UI) mode. This package can automate various tasks such as screen scraping. And since Puppeteer controls a real web browser, things such as opening pages and clicking on content are pretty straightforward.
In this post, we will assume we want to scrape some blog posts. We will start by scraping a list page and continue to follow “older” links until we run out of links. Then we will go through each of those list pages and open each individual blog post. We will create a directory for each of those blog posts and write those posts’ contents to disk.
We’ll start by installing Puppeteer.
npm install puppeteer
Then we do some quick setup.
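A minimal sketch of that setup might look like the following. The helper name `withPage` is hypothetical; it launches the browser, opens a tab, hands the tab to a callback, and always closes the browser afterwards. The second parameter is there only so a stand-in can be supplied when Puppeteer itself is not available.

```javascript
// Runs `task(page)` against a fresh browser tab and always closes the browser.
// `withPage` is an illustrative helper, not part of Puppeteer.
// The `puppeteer` parameter defaults to the real package and is loaded only when needed.
async function withPage(task, puppeteer = require('puppeteer')) {
  const browser = await puppeteer.launch(); // headless unless configured otherwise
  try {
    const page = await browser.newPage();
    return await task(page);
  } finally {
    // Close the browser even if the task throws, so no Chrome process is left behind.
    await browser.close();
  }
}
```

Usage would be along the lines of `withPage(async (page) => { /* scrape here */ })`.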
Then we will need to navigate to the list page.
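As a rough sketch, navigation comes down to a single `page.goto` call. The function name `openListPage` is an assumption for illustration, and `page` is expected to be a Puppeteer page that has already been created.

```javascript
// Navigates an existing Puppeteer page to the list page and returns the page title.
// `openListPage` is an illustrative name; `listUrl` is whatever blog index you want to scrape.
async function openListPage(page, listUrl) {
  // 'networkidle2' waits until the network has mostly gone quiet,
  // which helps with pages that load content after the initial HTML.
  const response = await page.goto(listUrl, { waitUntil: 'networkidle2' });
  // goto can resolve to null (for example on same-page anchor navigation), so guard before checking.
  if (response && !response.ok()) {
    throw new Error(`Loading ${listUrl} failed with status ${response.status()}`);
  }
  return page.title();
}
```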
Next, we will build a list of the list pages.
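One way to sketch this is to keep following the "older" link until there isn't one. The selector `a.older-posts` is purely a placeholder, since the real one depends on the blog's markup, and `collectListPages` and `maxPages` are names invented here.

```javascript
// Follows the "older" link from page to page and returns every list page URL visited.
// Assumptions: `page` is a Puppeteer page; `olderSelector` is a placeholder selector
// that must be adapted to the target site; `maxPages` is a safety limit.
async function collectListPages(page, startUrl, olderSelector = 'a.older-posts', maxPages = 100) {
  const listPages = [];
  const seen = new Set(); // guards against pagination loops
  let nextUrl = startUrl;
  while (nextUrl && !seen.has(nextUrl) && listPages.length < maxPages) {
    seen.add(nextUrl);
    listPages.push(nextUrl);
    await page.goto(nextUrl, { waitUntil: 'domcontentloaded' });
    // This callback runs inside the browser; it returns the absolute URL of the
    // "older" link, or null when the link is missing (the last list page).
    nextUrl = await page.evaluate((selector) => {
      const link = document.querySelector(selector);
      return link ? link.href : null;
    }, olderSelector);
  }
  return listPages;
}
```

Reading the link's `href` and navigating to it, rather than clicking it, keeps the loop simple because there is no need to wait for a click-triggered navigation.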
For each list page, navigate to each blog post.
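A possible sketch of that loop is below. The selector `article h2 a` is a guess at how post links might be marked up and will need adjusting, and `visitPosts` and `onPost` are hypothetical names. The post URLs are read from the list page first, because element handles stop being valid once the page navigates away.

```javascript
// Opens every list page, collects the post links on it, then visits each post once.
// Assumptions: `page` is a Puppeteer page, `listPages` is an array of list page URLs,
// `onPost(page, postUrl)` is a caller-supplied callback, and `postLinkSelector` is a placeholder.
async function visitPosts(page, listPages, onPost, postLinkSelector = 'article h2 a') {
  const visited = new Set();
  for (const listUrl of listPages) {
    await page.goto(listUrl, { waitUntil: 'domcontentloaded' });
    // $$eval runs the callback in the browser against all matching elements.
    const postUrls = await page.$$eval(postLinkSelector, (anchors) => anchors.map((a) => a.href));
    for (const postUrl of postUrls) {
      if (visited.has(postUrl)) continue; // the same post may be linked from several list pages
      visited.add(postUrl);
      await page.goto(postUrl, { waitUntil: 'domcontentloaded' });
      await onPost(page, postUrl);
    }
  }
  return [...visited];
}
```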
Parse an individual blog post and save it to disk.
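Here is a hedged sketch of that last step. It assumes the post body lives in an `article` element, which is a placeholder selector, and the names `slugFromUrl` and `savePost`, the `posts` output folder, and the `index.html` and `meta.json` file names are all choices made for this example.

```javascript
const fs = require('node:fs/promises');
const path = require('node:path');

// Turns a post URL into a directory-friendly name, e.g. ".../2020/05/hello/" -> "2020-05-hello".
function slugFromUrl(postUrl) {
  const { pathname } = new URL(postUrl);
  const slug = pathname
    .split('/')
    .filter(Boolean)
    .join('-')
    .replace(/[^a-zA-Z0-9._-]/g, '_'); // keep only characters that are safe in file names
  return slug || 'index';
}

// Saves the post's HTML and a little metadata into its own directory and returns that directory.
// Assumptions: `page` is a Puppeteer page; `contentSelector` is a placeholder for the post body.
async function savePost(page, postUrl, outputRoot = 'posts', contentSelector = 'article') {
  if (page.url() !== postUrl) {
    await page.goto(postUrl, { waitUntil: 'domcontentloaded' });
  }
  const title = await page.title();
  // $eval throws if nothing matches the selector, which surfaces a wrong selector early.
  const html = await page.$eval(contentSelector, (element) => element.innerHTML);

  const postDir = path.join(outputRoot, slugFromUrl(postUrl));
  await fs.mkdir(postDir, { recursive: true });
  await fs.writeFile(path.join(postDir, 'index.html'), html, 'utf8');
  await fs.writeFile(
    path.join(postDir, 'meta.json'),
    JSON.stringify({ title, url: postUrl }, null, 2),
    'utf8'
  );
  return postDir;
}
```

Deriving the directory name from the URL path rather than the title avoids collisions between posts that share a title.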