Web Scraping and Next.js

Web Scraping Data for Content

Many sites depend on data scraped from external pages. Whether we need to gather listings for jobs or real estate ads, collect and summarize reviews, or compare prices, the possible use cases are nearly endless. Despite this task being so common, there is no standard way to approach it, and each approach comes with its own problems. In this post we'll see how Next.js and Incremental Static Regeneration provide a great solution.

Approaches to Web Scraping

When choosing an approach for web scraping, we have to consider and balance a number of factors.

  • Timeliness: Do we need current data or is it okay if our data is a few seconds/minutes/hours/days/weeks old?
  • Request Count: How many requests can we send to the scraped site(s)? Will we run into problems when calling external sites too frequently?
  • Performance: What is the performance impact on our site if we need to fetch external data?
  • Complexity: How complex is our solution?

There are three common approaches to scraping and displaying content. They differ in when the external data is fetched and how it is stored:

Scrape...

  1. On every request.
  2. Once and build the site.
  3. Periodically, store the data locally, and serve from there.

Let's look at each of these in detail.

1. Scrape on every request

In this approach, each client request results in our web application fetching and processing external data before returning a response.

As a result, data displayed to the users of our application will always be up to date. However, scraping external pages on each request has drawbacks that make this approach infeasible in most cases.

Scraping a page takes time, which significantly slows down our responses. If our site experiences a lot of traffic, we would call the external page too often and run the risk of getting blocked. And even when we are not blocked, the external page could be down at that moment, leaving our page unable to fulfill the request.
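
In Next.js terms this approach would correspond to scraping inside getServerSideProps, which runs on every incoming request. A minimal sketch, with ourScraper standing in for a hypothetical scraper module:

export async function getServerSideProps() {
  // runs on every request: each page view triggers
  // a fresh call to the external site
  const scrapedData = await ourScraper.scrape()
  return {
    props: {
      scrapedData,
    },
  }
}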

2. Scrape once and build

In this approach we scrape the target sites once and use the results to build our page. As a result, the number of requests to the external pages is minimal and our page can be served quickly. However, one problem we face in this case is that the data displayed on our site can become outdated after some time. To solve this problem we would have to periodically scrape the targets and then rebuild and redeploy our site.
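
In Next.js this corresponds to a plain getStaticProps without revalidation: the data is fetched exactly once, at build time. Again a sketch with the hypothetical ourScraper:

export async function getStaticProps() {
  // called once during yarn build; the scraped data is baked
  // into the generated page until the next build and deploy
  const scrapedData = await ourScraper.scrape()
  return {
    props: {
      scrapedData,
    },
  }
}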

3. Scrape periodically, store locally

Similar to what we do in the second approach, we could scrape the targets only periodically, but instead of rebuilding the page, we would store the results (in a database or by any other means). Our page would then use this data to build its contents (either on the server side or via an API call).

While this is the most flexible approach and quite performant, it suffers from high complexity and many moving parts: a scheduler, the scraper itself, a data store, and the code that reads from it.
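
To get a feeling for those moving parts, here is a rough sketch: a standalone script run by a scheduler (for example cron) persists the results to a file, and a Next.js API route serves them. The scraper, file names, and paths here are all illustrative:

// scrape-job.js - run periodically, e.g. via cron
const fs = require('fs')

async function run() {
  const scrapedData = await ourScraper.scrape() // hypothetical scraper
  fs.writeFileSync('./data/scraped.json', JSON.stringify(scrapedData))
}

run()

// pages/api/scraped.js - a Next.js API route reading the stored data
import { promises as fs } from 'fs'

export default async function handler(req, res) {
  const data = await fs.readFile('./data/scraped.json', 'utf8')
  res.status(200).json(JSON.parse(data))
}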

How Next.js can help us

So how can we do better? And how can Next.js help us to do better?

The answer lies in Next.js's getStaticProps method.

In its basic form this method is called at build time of a page. However, Next.js 9.5 introduced Incremental Static Regeneration, which lets us make use of the revalidate attribute.

Setting a revalidate time in seconds tells our Next.js application at which interval getStaticProps should be recomputed. So if we perform our scraping calls in this method, we have a built-in mechanism to periodically re-scrape and regenerate our pages!

export async function getStaticProps() {
  // call our (hypothetical) scraper function and wait for the result
  const scrapedData = await ourScraper.scrape()
  return {
    props: {
      scrapedData,
    },
    revalidate: 3600, // rerun scraping every hour (3600 seconds)
  }
}

Scraping in getStaticProps with revalidate

Example

For our example, let's display the title of the latest xkcd comic on our page. We will use cheerio to parse the external HTML we fetch via axios.

Let's start with a Next.js application and a page like the following:

import styles from '../styles/Home.module.css'
export default function Home(props) {
  return (
    <div className={styles.container}>
      <main className={styles.main}>
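        {/* placeholders for now; we fill in the real props below */}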
        <div>Latest Comic: {`title`}</div>
        <div>Last scraped: {`date`}</div>
      </main>
    </div>
  )
}

Since we need to fetch and parse the external page, we want to add both axios and cheerio.

yarn add axios cheerio
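
To quickly verify that fetching and parsing work, we can first try them in a small standalone script (#ctitle is the id xkcd uses for the comic title element):

// check.js - run with: node check.js
const axios = require('axios')
const cheerio = require('cheerio')

axios.get('https://xkcd.com/').then(({ data }) => {
  const $ = cheerio.load(data)
  // print the title of the latest comic
  console.log($('#ctitle').text())
})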

With these libraries we can create our getStaticProps() function:

import styles from '../styles/Home.module.css'
import cheerio from 'cheerio'
import axios from 'axios'

export default function Home(props) {
  // ...
}

export async function getStaticProps() {
  const { data } = await axios.get('https://xkcd.com/')
  const $ = cheerio.load(data)
  const title = $('#ctitle').text()
  const lastScraped = new Date().toISOString()
  return {
    props: { title, lastScraped },
    revalidate: 10, // rerun after 10 seconds
  }
}

Here we fetch the HTML from https://xkcd.com and then parse it using cheerio to extract the title of the comic. Both the title and the date of the function's last invocation are then passed to the page as props.

revalidate is a property used by Next.js; it tells the framework to regard our props as "stale" after the specified number of seconds and to rerun getStaticProps() to generate a new version of our page.

With this in place we can finally use and display the properties in our page:

import styles from '../styles/Home.module.css'
import cheerio from 'cheerio'
import axios from 'axios'

export default function Home(props) {
  return (
    <div className={styles.container}>
      <main className={styles.main}>
        <div>Latest Comic: {props.title}</div>
        <div>Last scraped: {props.lastScraped}</div>
      </main>
    </div>
  )
}

export async function getStaticProps() {
  const { data } = await axios.get('https://xkcd.com/')
  const $ = cheerio.load(data)
  const title = $('#ctitle').text()
  const lastScraped = new Date().toISOString()
  return {
    props: { title, lastScraped },
    revalidate: 10,
  }
}

Now if we build (yarn build) and start (yarn start) our page and refresh it over time, we will see that the lastScraped date never changes more often than every 10 seconds.
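
One way to observe this from the command line, assuming the app runs on the default port 3000:

curl -s http://localhost:3000 | grep 'Last scraped'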

Note

If we run Next.js in development mode, getStaticProps() is called on every request. Incremental Static Regeneration is only enabled in production mode.