Build a web crawler with Queues and Browser Rendering
Example of how to use Queues and Browser Rendering to power a web crawler.
This tutorial explains how to build and deploy a web crawler with Queues, Browser Rendering, and Puppeteer.
Puppeteer is a high-level library used to automate interactions with Chrome/Chromium browsers. On each submitted page, the crawler will find the number of links to cloudflare.com and take a screenshot of the site, saving results to Workers KV.
You can use Puppeteer to request all images on a page, save the colors used on a site, and more.
Use a Node version manager like Volta ↗ or nvm ↗ to avoid permission issues and change Node.js versions. Wrangler, discussed later in this guide, requires a Node version of 16.17.0 or later.
1. Create new Workers application
To get started, create a Worker application using the create-cloudflare CLI ↗. Open a terminal window and run the following command:
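A typical invocation looks like the following sketch; the project name queues-web-crawler is only an example, and when prompted, a basic "Hello World" Worker with TypeScript works for this tutorial.

```sh
# Scaffold a new Worker project (the project name is an example)
npm create cloudflare@latest -- queues-web-crawler
```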
Then, add a Browser Rendering binding, which gives the Worker access to a headless Chromium instance that you will control with Puppeteer.
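Assuming you configure the project through wrangler.toml, the binding could look like the sketch below; the binding name CRAWLER_BROWSER is an assumption and must match the name you reference in your code.

```toml
# Browser Rendering binding (the binding name is an example)
[browser]
binding = "CRAWLER_BROWSER"
```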
4. Set up a Queue
Now, we need to set up the Queue.
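One way to create the Queue is with Wrangler; the queue name queues-web-crawler below is an assumption and must match the name used in your bindings.

```sh
# Create the queue (the name is an example)
npx wrangler queues create queues-web-crawler
```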
Add Queue bindings to wrangler.toml
Then, in your wrangler.toml file, add the following:
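A minimal sketch of the producer and consumer configuration, assuming the queue name from the previous step and a CRAWLER_QUEUE binding name:

```toml
# Producer binding used by the fetch() handler to send URLs
[[queues.producers]]
queue = "queues-web-crawler"
binding = "CRAWLER_QUEUE"

# Consumer configuration for the same queue
[[queues.consumers]]
queue = "queues-web-crawler"
max_batch_timeout = 60
```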
Setting a max_batch_timeout of 60 seconds on the consumer is important because Browser Rendering has a limit of two new browsers per minute per account. With this timeout, the consumer waits up to a minute to collect queue messages into a single batch, which keeps the Worker under the browser invocation limit.
Your final wrangler.toml file should look similar to the one below.
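Roughly, and with placeholder names, IDs, and dates, the assembled configuration could look like this sketch; the two KV namespace bindings assume you have created namespaces for link counts and screenshots.

```toml
name = "queues-web-crawler"
main = "src/index.ts"
compatibility_date = "2024-01-01" # use the date generated for your project
compatibility_flags = ["nodejs_compat"] # in case dependencies rely on Node.js built-ins

# KV namespaces for crawl results (binding names and IDs are placeholders)
kv_namespaces = [
  { binding = "CRAWLER_LINKS_KV", id = "<links-namespace-id>" },
  { binding = "CRAWLER_SCREENSHOTS_KV", id = "<screenshots-namespace-id>" },
]

[browser]
binding = "CRAWLER_BROWSER"

[[queues.producers]]
queue = "queues-web-crawler"
binding = "CRAWLER_QUEUE"

[[queues.consumers]]
queue = "queues-web-crawler"
max_batch_timeout = 60
```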
5. Add bindings to environment
Add the bindings to the environment interface in src/index.ts, so TypeScript correctly types the bindings. Type the queue as Queue<any>. The following step will show you how to change this type.
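A sketch of the interface, assuming the binding names from the configuration above:

```ts
export interface Env {
  // Queue producer binding; typed as Queue<any> for now and refined in the next step
  CRAWLER_QUEUE: Queue<any>;
  // Browser Rendering binding passed to puppeteer.launch()
  CRAWLER_BROWSER: Fetcher;
  // KV namespaces for link counts and screenshots
  CRAWLER_LINKS_KV: KVNamespace;
  CRAWLER_SCREENSHOTS_KV: KVNamespace;
}
```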
6. Submit links to crawl
Add a fetch() handler to the Worker to submit links to crawl.
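A minimal sketch of the handler, assuming the CRAWLER_QUEUE binding and a request body that contains the URL to crawl:

```ts
export default {
  async fetch(req: Request, env: Env): Promise<Response> {
    // The request body is expected to be the URL to crawl; since bodies are
    // strings, the queue binding can now be typed as Queue<string> in Env
    await env.CRAWLER_QUEUE.send(await req.text());
    return new Response("Success!");
  },
  // The queue() handler is added in the next step
};
```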
This handler accepts requests to any subpath and forwards the request’s body to the queue to be crawled. It expects the request body to contain only a URL. In production, you should check that the request is a POST request and that its body contains a well-formed URL; this has been omitted here for simplicity.
7. Crawl with Puppeteer
Add a queue() handler to the Worker to process the links you send.
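A sketch of that skeleton, assuming the robots-parser package is installed (npm install robots-parser) and that message bodies are URL strings:

```ts
import puppeteer from "@cloudflare/puppeteer";
import robotsParser from "robots-parser";

export default {
  // ...fetch() handler from the previous step...

  async queue(batch: MessageBatch<string>, env: Env): Promise<void> {
    let browser;
    try {
      browser = await puppeteer.launch(env.CRAWLER_BROWSER);
    } catch {
      // Launching can fail, for example if the per-account browser limit was hit; retry the whole batch
      batch.retryAll();
      return;
    }

    for (const message of batch.messages) {
      const url = message.body;

      // Fetch and parse robots.txt to check whether crawling this URL is allowed
      let isAllowed = true;
      try {
        const robotsUrl = new URL("/robots.txt", url).href;
        const robotsTxt = await (await fetch(robotsUrl)).text();
        isAllowed = robotsParser(robotsUrl, robotsTxt).isAllowed(url) ?? true;
      } catch {
        // If robots.txt cannot be fetched or parsed, continue and crawl anyway
      }
      if (!isAllowed) {
        // Crawling is disallowed: acknowledge the message so it is not retried
        message.ack();
        continue;
      }

      // Crawling is allowed; the crawl itself is added in the following snippets
    }

    await browser.close();
  },
};
```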
This is a skeleton for the crawler. It launches the Puppeteer browser and iterates through the Queue’s received messages. It fetches the site’s robots.txt and uses robots-parser to check that this site allows crawling. If crawling is not allowed, the message is ack’ed, removing it from the Queue. If crawling is allowed, you can continue to crawl the site.
The call to puppeteer.launch() is wrapped in a try...catch so that the whole batch can be retried if the browser launch fails, for example because the limit on the number of browsers per account has been exceeded.
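A sketch of that helper, defined inside the queue() handler after the browser is launched so it can use browser and env; the cloudflare.com matching rule and the return shape are assumptions.

```ts
// Defined inside queue(), after puppeteer.launch() and before the message loop
const crawlPage = async (url: string) => {
  const page = await browser.newPage();
  await page.goto(url, { waitUntil: "load" });

  // Collect every link on the page; $$eval runs the callback in the browser context
  const links: string[] = await page.$$eval("a", (anchors) =>
    anchors.map((a) => (a as HTMLAnchorElement).href)
  );

  // Count links pointing at cloudflare.com; some hrefs may not be valid URLs, hence the try...catch
  const numCloudflareLinks = links.filter((link) => {
    try {
      return new URL(link).hostname.endsWith("cloudflare.com");
    } catch {
      return false;
    }
  }).length;

  // Take a full-page screenshot at a fixed viewport size
  await page.setViewport({ width: 1920, height: 1080 });
  const screenshot = await page.screenshot({ fullPage: true });

  await page.close();
  return { numCloudflareLinks, links, screenshot };
};
```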
This helper function opens a new page in Puppeteer and navigates to the provided URL. numCloudflareLinks uses Puppeteer’s $$eval (equivalent to document.querySelectorAll) to find the number of links to a cloudflare.com page. Checking if the link’s href is to a cloudflare.com page is wrapped in a try...catch to handle cases where hrefs may not be URLs.
Then, the function sets the browser viewport size and takes a screenshot of the full page. The screenshot is returned as a Buffer so it can be converted to an ArrayBuffer and written to KV.
To crawl links recursively, add a snippet after the Cloudflare link count that sends the discovered links from the queue consumer back to the queue itself. Recursing too deep, as is possible when crawling, causes a Durable Object “Subrequest depth limit exceeded.” error. If one occurs, it is caught, but the links are not retried.
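A sketch of that snippet, placed inside crawlPage right after the link count; it assumes env is in scope and that links holds the hrefs gathered above.

```ts
// Send every discovered link back to the queue so it is crawled too
try {
  // Note: sendBatch accepts a limited number of messages per call; a real crawler may need to chunk links
  await env.CRAWLER_QUEUE.sendBatch(links.map((link) => ({ body: link })));
} catch {
  // Recursing too deep can throw "Subrequest depth limit exceeded."; drop these links rather than retrying
}
```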
Then, in the queue handler, call crawlPage on the URL.
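A sketch of that step, placed in the message loop where the skeleton noted that crawling is allowed; the key format and metadata are assumptions.

```ts
try {
  const { numCloudflareLinks, screenshot } = await crawlPage(url);
  const timestamp = Date.now();
  const key = `${url}-${timestamp}`;

  // Store the link count and the screenshot under the same key; KV accepts ArrayBufferView values
  await env.CRAWLER_LINKS_KV.put(key, numCloudflareLinks.toString(), {
    metadata: { date: timestamp },
  });
  await env.CRAWLER_SCREENSHOTS_KV.put(key, screenshot, {
    metadata: { date: timestamp },
  });

  message.ack();
} catch {
  // Something unexpected failed; retry so the message is delivered again later
  message.retry();
}
```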
This snippet saves the results from crawlPage into the appropriate KV namespaces. If an unexpected error occurs, the message is retried and delivered to the queue again later.
Saving the timestamp of the crawl in KV helps you avoid crawling too frequently.
Add a snippet before the robots.txt check that looks in KV for a crawl of the same URL within the last hour. It lists all KV keys beginning with that URL (crawls of the same page) and checks whether any of them happened within the last hour. If one did, the message is ack’ed and not retried.
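A sketch of that check, placed just before the robots.txt fetch in the message loop; it assumes the key format and the date metadata used when saving results above.

```ts
// Skip URLs that were already crawled within the last hour
const oneHourAgo = Date.now() - 60 * 60 * 1000;
const previousCrawls = await env.CRAWLER_LINKS_KV.list<{ date: number }>({ prefix: url });
const crawledRecently = previousCrawls.keys.some(
  (key) => (key.metadata?.date ?? 0) > oneHourAgo
);
if (crawledRecently) {
  // Crawled recently: acknowledge the message so it is not retried
  message.ack();
  continue;
}
```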
The final script combines all of the snippets above.
8. Deploy your Worker
To deploy your Worker, run the following command:
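```sh
npx wrangler deploy
```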
You have successfully created a Worker which can submit URLs to a queue for crawling and save results to Workers KV.
To test your Worker, you could use the following cURL request to take a screenshot of this documentation page.
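For example, something along these lines; the Worker hostname is a placeholder for your deployed workers.dev URL, and the request body is the URL of the page to crawl.

```sh
# Replace the hostname with your Worker's deployed URL; the body is the page to crawl
curl \
  --data "https://developers.cloudflare.com/queues/" \
  "https://queues-web-crawler.<YOUR_SUBDOMAIN>.workers.dev"
```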