Learn Basic Scraping with Puppeteer

By downloading meme templates

7 min read · Dec 28, 2021


⚠️ We will create a script to download meme templates from imgflip. Read their terms carefully before using any content from the site. ⚠️

Introduction

A while ago I wanted to run a meme contest with some friends, so I searched the web for a “pack” of meme templates to download. To my surprise, I found very few results. I was pretty confident I would find a huge amount of material, but I found only a handful of packs, and I wasn’t really satisfied… So I decided to look into web scraping to download some templates myself.

There is a good chance that plenty of material actually exists, but after some disappointing searches this idea came to mind and took over, so I stopped looking.

Today I want to walk through a small script I wrote to try out basic scraping. I’m going to use Puppeteer, but the concepts carry over to other libraries too (it will probably just be a matter of syntax if you use a different one).

I’m a learn-by-doing person, so to learn basic scraping I decided to create a script that downloads meme templates from the great imgflip website (I have no intention of harming them; it’s a site I use a lot, and if you need to create memes I highly recommend it).

The code in this script uses some ‘modern’ JavaScript features like async/await. I won’t cover language details in this article.

What is scraping

In the web ecosystem, scraping refers to a technique in which (usually) automated processes fetch and copy data from a website. In our case we want to create a script that opens the imgflip meme templates page, downloads every template on the page, navigates to the next page, downloads every template there, and so on until the last page.

What is Puppeteer

Puppeteer is a library that lets you control Chrome (the browser) through a high-level API. If you want to read more, there is no better source than its repository. I find it pretty intuitive and easy to use, so I think it fits my use case nicely.

Code

To use Puppeteer you either create an npm project and install it as a dependency, or install it globally and use it wherever you want.

a) To set up the npm project

code to create an npm project with puppeteer

b) To install it globally and use it wherever you want

code to install puppeteer globally

Now we are ready to actually write down some code because we don’t need any other dependency for this small project.

Create a JavaScript file in your project and open it with your favourite editor/IDE (I suggest VS Code if you don’t have one yet).

code to create an index.js file

Now let’s write some code:

First, we import our dependencies

code to import dependencies

We import puppeteer for obvious reasons and fs because we need to access the file system to save the images.

Also create the folder where we’ll store the downloaded images; I’ll call it memes.

The next thing we have to do is probably the most important and also the most boring: get all the selectors that we need to navigate through the website and to get the images. We need selectors to:
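If you prefer not to create the folder by hand, here is a minimal sketch that does it from the script with Node’s built-in fs module. The ensureFolder name is my own and is not part of the original script.

```javascript
const fs = require('fs');
const path = require('path');

// Hypothetical helper (not from the original script): create the output
// folder if it is missing and return its absolute path.
function ensureFolder(folderName) {
  const folderPath = path.resolve(folderName);
  // `recursive: true` means no error is thrown when the folder already exists.
  fs.mkdirSync(folderPath, { recursive: true });
  return folderPath;
}
```

Calling ensureFolder('memes') once at startup should be enough; calling it again is harmless.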

  • navigate to the next page
  • detect when we are on the last page
  • get the meme image
  • get the meme title (we will use this to name the images)

To find them I inspected the page with the Chrome DevTools. I will spare you that part and show you the code directly.

code to setup constants

Now let’s write the code to set up Puppeteer and handle the navigation between pages.

code of the main function

The comments should be self-explanatory. Basically, we set up Puppeteer, open the imgflip templates page, and, while there is still a “next page”, download all the memes we find on the current page. At the end we close the browser.
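To make that flow concrete, here is a rough sketch of the loop. It is not the exact code from the script: the URL and the selector value are assumptions you should verify in DevTools, forEachPage and handlePage are names I made up, and I use a single “next” selector both to navigate and to detect the last page.

```javascript
// Assumed values: check the real URL and selector in DevTools before relying on them.
const TEMPLATES_URL = 'https://imgflip.com/memetemplates';
const NEXT_PAGE_SELECTOR = '.pager-next';

// Visit every page, calling `handlePage` on each one, until no "next" link
// is left. Returns the number of pages visited.
async function forEachPage(page, handlePage) {
  let pagesVisited = 0;
  while (true) {
    await handlePage(page);
    pagesVisited += 1;

    // page.$ resolves to null when nothing matches: our "last page" signal.
    const nextLink = await page.$(NEXT_PAGE_SELECTOR);
    if (nextLink === null) break;

    // Start waiting for the navigation before clicking, so we cannot miss it.
    await Promise.all([page.waitForNavigation(), page.click(NEXT_PAGE_SELECTOR)]);
  }
  return pagesVisited;
}

async function main(handlePage) {
  // Required lazily so forEachPage can be exercised without a browser.
  const puppeteer = require('puppeteer');
  const browser = await puppeteer.launch();
  try {
    const page = await browser.newPage();
    await page.goto(TEMPLATES_URL);
    return await forEachPage(page, handlePage);
  } finally {
    // Close the browser even if something above throws.
    await browser.close();
  }
}
```

Keeping the loop separate from the browser setup is a design choice of this sketch: it lets you try the pagination logic with a fake page object before pointing it at the real site.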

Let’s see how we download all the memes on a page

code to handle the list of memes

This one is a little more complicated. First, we define a function that makes the script wait. As the comment says, this is to avoid flooding imgflip with requests. Then, in the main function, we retrieve the list of memes using the getMemeList function (which we will implement soon). For each meme we open the page for its image (the image URL of the meme), wait 1 second (because we are nice), and then check the headers of the response.

Puppeteer lets you inspect different parts of the response for a page: we can access the headers by calling the headers method, or read the response body by calling the buffer method. See more on the docs page.

The headers of the response contain the content-type property, which should indicate an image (we opened the URL of the meme image). While testing, though, I found that some images (at full size, we will talk about this in a moment) are nowhere to be found (or, more probably, I’m looking for them in the wrong spot) and an XML error page is returned instead. This is why the ‘if’ is there: if we get an XML response, the image is not there and we just ignore it. Otherwise we call the ‘downloadImage’ function, which handles the download. Afterwards we return to the previous page by calling ‘page.goBack’.
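Put together, the wait, the header check and the loop could look roughly like this. Treat it as a sketch under my own assumptions rather than the script itself: isXmlResponse and downloadMemesOnPage are hypothetical names, the list of memes and the download function are passed in as arguments, and each meme is assumed to carry an imageUrl property.

```javascript
// Resolve after `ms` milliseconds; used to pause between requests.
const wait = (ms) => new Promise((resolve) => setTimeout(resolve, ms));

// True when the content-type header says the server answered with XML
// (the error page) instead of an image.
function isXmlResponse(headers) {
  const contentType = headers['content-type'] || '';
  return contentType.toLowerCase().includes('xml');
}

// Open each meme image, pause, and hand valid responses to `downloadImage`.
// Returns how many images were handed over.
async function downloadMemesOnPage(page, memes, downloadImage, pauseMs = 1000) {
  let downloaded = 0;
  for (const meme of memes) {
    const response = await page.goto(meme.imageUrl);
    await wait(pauseMs);
    // page.goto can resolve to null, so guard before reading the headers.
    if (response !== null && !isXmlResponse(response.headers())) {
      await downloadImage(response, meme);
      downloaded += 1;
    }
    // Return to the templates page before moving on to the next meme.
    await page.goBack();
  }
  return downloaded;
}
```

Note that this check would also skip an SVG image (its content type is image/svg+xml), which is fine for the JPG and PNG templates we are after.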

code to download the image from the response

This is pretty easy: we get the response body by calling the buffer method, then write it to the file system. Nothing fancy.

Now we just need to implement the function that retrieves all the memes on a page.
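A minimal version of that idea might look like the following. The way I build the file name (sanitized title plus the extension taken from the image URL) and the folder parameter are my own assumptions, not necessarily what the original script does.

```javascript
const fs = require('fs');
const path = require('path');

// Write the body of a Puppeteer response to `<folder>/<title><extension>`.
async function downloadImage(response, meme, folder = 'memes') {
  // buffer() resolves to a Node.js Buffer holding the raw response body.
  const body = await response.buffer();
  // Assumption: reuse the extension from the image URL (for example ".jpg").
  const extension = path.extname(new URL(meme.imageUrl).pathname);
  const filePath = path.join(folder, `${meme.title}${extension}`);
  await fs.promises.writeFile(filePath, body);
  return filePath;
}
```

The folder has to exist already, otherwise writeFile rejects with an error.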

code to get the list of memes in the page

Let’s try to better understand this one:

We declare some new selectors that we need in this function. In the function, we first run ‘page.evaluate’. This is an API provided by Puppeteer that lets you run a function inside the page context: the callback we provide is not run by Puppeteer in the Node.js environment, but is executed by the browser that Puppeteer drives for us. Running code in the page context is how we read information that lives inside the web page. Because the callback runs in the browser, it cannot see variables defined in our script. To pass in the selectors we defined outside, we add them as extra arguments to ‘page.evaluate’; they are handed to the callback as parameters, so in our example MEME_BOX_SELECTOR becomes boxSelector (see this small article).
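Here is a tiny illustration of that argument passing, separate from the real function. The selector value is a placeholder I picked for the example and countMemeBoxes is a hypothetical name; only the MEME_BOX_SELECTOR and boxSelector names come from the script.

```javascript
// Placeholder value: look up the real class name in DevTools.
const MEME_BOX_SELECTOR = '.mt-box';

// Count the meme boxes on the current page.
async function countMemeBoxes(page) {
  // The callback is serialized and executed in the browser, so it cannot
  // reach MEME_BOX_SELECTOR through a closure. Every argument after the
  // callback is passed to it as a parameter instead.
  return page.evaluate(
    (boxSelector) => document.querySelectorAll(boxSelector).length,
    MEME_BOX_SELECTOR
  );
}
```

Whatever the callback returns has to be serializable (numbers, strings, plain objects and arrays are fine), because the value travels back from the browser to Node.js.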

The code inside the callback is (terrible and) simple: we query the DOM for all the elements with the specified class, and for each of them we get the title (I replace all the white space with underscores and turn it to lowercase) and the imageId from the ‘href’. Let’s dig into that a bit:

Imgflip seems to have a service that delivers images cropped to different sizes (this is a common approach to give users the best experience and to put less load on the servers). If we inspect the URL of one of the images on that page we get something like https://i.imgflip.com/4/1ur9b0.jpg: i.imgflip.com is the domain, 4 is a crop (https://i.imgflip.com/2/1ur9b0.jpg works too, but it returns a different, smaller image) and 1ur9b0 is, I assume, the id of the meme.

We extract the id and set the imageUrl to https://i.imgflip.com/<imageId> (not setting the crop is, I think, asking for the full-size image).

We then return a list of these objects, each containing the imageUrl, the id and the sanitized title.
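The string handling described above can be sketched as plain functions. This is my own reconstruction and it assumes the input is a thumbnail URL shaped like the example (the script itself reads an ‘href’); sanitizeTitle, toFullSizeUrl and toMeme are hypothetical names.

```javascript
// "Drake Hotline Bling" -> "drake_hotline_bling"
function sanitizeTitle(title) {
  // Runs of white space collapse into a single underscore.
  return title.trim().replace(/\s+/g, '_').toLowerCase();
}

// Drop the crop segment:
// "https://i.imgflip.com/4/1ur9b0.jpg" -> "https://i.imgflip.com/1ur9b0.jpg"
function toFullSizeUrl(thumbnailUrl) {
  // The base URL lets protocol-relative inputs ("//i.imgflip.com/...") parse too.
  const url = new URL(thumbnailUrl, 'https://i.imgflip.com');
  const fileName = url.pathname.split('/').pop(); // "1ur9b0.jpg"
  const id = fileName.replace(/\.[^.]+$/, ''); // "1ur9b0"
  return { id, imageUrl: `https://i.imgflip.com/${fileName}` };
}

// Build one entry of the list: { title, id, imageUrl }.
function toMeme(title, thumbnailUrl) {
  return { title: sanitizeTitle(title), ...toFullSizeUrl(thumbnailUrl) };
}
```

Plain functions like these are easy to try out in Node.js, but code running inside ‘page.evaluate’ cannot call them directly; there the same logic has to live inside the callback, or be applied to the raw values after they come back.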

Conclusion

Beware that this is just a script I created to do some basic scraping; it’s not ‘production ready’ or anything you should rely on. There are a lot of things that could be improved, but I was able to download all the meme templates with it (and learn something new along the way) to have some fun with my friends.

You can find the entire script in this gist.
