Scraping with JavaScript and Node.js


With the advent of Node.js, the development of JavaScript as one of the most powerful and user-friendly languages for web scraping and data parsing has accelerated significantly. Node.js is one of the most popular and fastest growing software platforms, and its main purpose is to execute JavaScript code without the participation of a browser.

Another advantage that speeds up the development process is support for npm (the Node.js package manager), which provides access to a huge registry of additional libraries and simplifies development, including building scraping bots.

Getting started with Node.js is pretty easy: the platform is designed so that novice developers can pick it up quickly. After all, basic knowledge of JavaScript is enough to start learning.

This article is a guide to creating a simple Node.js project that automatically collects the information you need from a target website.

Creating a project and defining dependencies

Setting up a development environment

To get started, you need to install Node.js itself. Links to installation guides for all popular operating systems and the latest versions of the platform can be found on the official website. You can check that the installation succeeded by requesting the Node.js and npm versions from the terminal:

$ node -v
v14.15.5
$ npm -v
6.14.11

Additionally, prepare a text editor you are comfortable writing code in. Any editor will do, even the standard one that ships with your OS. For convenience, however, it is better to use one of the popular code editors, for example VS Code, Sublime Text, Atom, or WebStorm.

Initializing the project

Before creating the project, select or create a folder that will store everything the scraping bot needs to run. Then open a terminal in that folder and run the command to create a Node.js project:

$ npm init -y
Wrote to /path.../package.json:
{
  "name": "scrape_proj",
  "version": "1.0.0",
  "description": "",
  "main": "index.js",
  "scripts": {
    "test": "echo \"Error: no test specified\" && exit 1"
  },
  "keywords": [],
  "author": "",
  "license": "ISC"
}

This command creates a package.json file in your project describing the basic configuration. This file can be supplemented by filling in the author, description and keywords fields yourself.
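
For example, after filling in these fields, the corresponding lines of package.json might look like this (the values are purely illustrative):

  "description": "A simple scraping bot built with Node.js",
  "keywords": ["scraping", "nodejs"],
  "author": "Your Name",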

The project has been created; now let's install the additional libraries that will greatly simplify setting up all the basic processes required for scraping. For this example, we will use the popular axios, cheerio, and json2csv libraries. Each of them handles one of the three main steps discussed in the next chapter.

Installing new libraries in Node.js is very easy thanks to the package manager we installed earlier along with Node.js:

$ npm install axios cheerio json2csv

This command creates a node_modules folder in your project, into which all the packages required by these libraries are downloaded, and also updates the package.json file, recording the new dependencies in it.
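
After installation, package.json will also contain a dependencies section roughly like the one below (the exact version numbers depend on when you run the install; these are illustrative):

"dependencies": {
  "axios": "^0.21.1",
  "cheerio": "^1.0.0-rc.10",
  "json2csv": "^5.0.6"
}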

Three stages of web-scraping

In any project, including one built with Node.js, the scraping process consists of three stages:

  1. Sending an HTTP request to the desired page and receiving a response from the server;
  2. Parsing the necessary data out of the server response;
  3. Saving the received data in a convenient form (a file or a database).

The Axios library will help us implement the first stage of the project. Other solutions can be found on the web, but Axios is arguably the most up-to-date and reliable option. With it, you can generate HTTP requests and process responses from web servers.

In the second step, we will be using Cheerio. It is a powerful library that makes it easy to parse HTML pages. It will be especially useful for those who feel confident working with jQuery, because the syntax for accessing individual elements of a web page in Cheerio and jQuery is very similar.

The third step, when using JavaScript, is even simpler than the previous ones and can be easily handled with json2csv. With it, we will convert the parsed results, collected as an array of JSON objects, into a CSV file.
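
Putting the three stages and the three libraries together, the skeleton of a typical scraping script looks roughly like this (a minimal sketch; the .some-item selector, url argument, and result.csv file name are placeholders rather than parts of the example project):

const axios = require("axios");
const cheerio = require("cheerio");
const { Parser } = require("json2csv");
const fs = require("fs");

async function scrape(url) {
  // Stage 1: send the HTTP request and receive the response
  const response = await axios.get(url);
  // Stage 2: parse the data we need out of the HTML
  const $ = cheerio.load(response.data);
  const rows = [];
  $(".some-item").each(function () { // ".some-item" is a placeholder selector
    rows.push({ text: $(this).text().trim() });
  });
  // Stage 3: save the result to a CSV file
  fs.writeFileSync("./result.csv", new Parser().parse(rows));
}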

Page analysis and choosing selectors

Most often, data is collected from online stores and other e-commerce sites. For practice, however, it is best to use a site created specifically for testing scraping programs.

Such sites include Quotes to Scrape and Books to Scrape. They exist precisely so that anyone can practice scraping without getting blocked or violating a site's terms of service.

If you're wondering how legal and ethical web scraping is, you can read our article on this topic.

It is convenient to use CSS selectors to simplify parsing the requested web pages. With them, you can reliably locate the HTML element containing the required information, no matter how deeply the tags are nested.

It's pretty easy to get the selector for any element on a page. Let's take the Quotes to Scrape home page heading as an example.

  1. Right-click on the page title and select "Inspect" in the context menu. This will open the developer tools in your browser;
  2. In the Elements panel, find the same heading inside the h1 tag and right-click on it;
  3. In the menu that opens, select Copy;
  4. Choose Copy selector from the offered options.

The resulting selector will look like this:

body > div > div.row.header-box > div.col-md-8 > h1

Using such selectors is effective when parsing large, complex web pages. In simpler cases, you can analyze the structure of the page and get by with a shorter selector. For example, there is only one h1 element on this home page, so there is no need to work with the long selector above.
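
For example, both of the following Cheerio queries return the same heading on the Quotes to Scrape home page (shown purely for comparison; here $ is assumed to be a Cheerio document loaded as in the next section):

const longWay = $("body > div > div.row.header-box > div.col-md-8 > h1").text();
const shortWay = $("h1").text(); // the page contains only one h1, so this is enough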

Practical examples of scraping with Node.js

Getting started simple: requesting the page heading

First, let's load the functionality of the external modules cheerio and axios into the main project file and save them in variables for further reference:

const cheerio = require("cheerio");
const axios = require("axios");

Let's save the address of the target page in another variable:

const url = "https://quotes.toscrape.com/";

To send a GET request, the Axios library provides the get() method. It runs asynchronously, so the call must be prefixed with await (and therefore placed inside an async function, as in the full example below):

const response = await axios.get(url);

As the second argument of this method, you can pass additional request parameters: HTTP headers, authorization data, a proxy connection, the network protocol, the server response timeout, and more. For example, a request through a proxy looks like this:

const response = await axios.get(url, {
  proxy: {
    host: "proxy-IP",
    port: "proxy-port",
    auth: {
      username: "proxy-login",
      password: "proxy-password",
    },
  },
});
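
Other options are passed in the same way. For instance, a request with custom HTTP headers and a response timeout might look like this (a sketch; the header values are purely illustrative):

const response = await axios.get(url, {
  headers: {
    "User-Agent": "Mozilla/5.0 (compatible; my-scraper/1.0)", // illustrative value
    "Accept-Language": "en-US,en;q=0.9",
  },
  timeout: 10000, // fail if the server does not respond within 10 seconds
});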

To get only the useful information from the response object (the requested web page itself, without the HTTP headers and response status), let's use the Cheerio library:

const $ = cheerio.load(response.data);

It is also very easy to parse the required data from an element by its selector:

const title = $("h1").text()

To see the result of the script's work, print the value of the title variable to the console:

console.log(title);

Working with HTTP requests can lead to errors: the target site's server may have problems, access to it may be restricted, or the connection on the Node.js side may be unstable. Therefore, the GET request should be wrapped in a try-catch block, which lets you print information about the type of error to the console and keep the program running. As a result, the full program code looks like this:

const cheerio = require("cheerio");
const axios = require("axios");
const url = "https://quotes.toscrape.com/";
async function getTitle() {
  try {
    const response = await axios.get(url);
    const document = cheerio.load(response.data);
    const title = document("h1").text();
    console.log(title);
  } catch (error) {
    console.error(error);
  }
}
getTitle();

The code is ready; all that remains is to run it with Node.js:

$ node index.js

As a result of executing the program, the site title will appear in the console:

Quotes to Scrape

Congratulations! You have successfully written your first JavaScript scraping program.

Retrieving product data

In most cases, web scraping involves collecting current prices for certain goods or services. As an example, let's collect book price data from the books.toscrape.com practice site. We will take the data from the page with books in the Mystery genre.

To understand how to get the book data from this page, let's analyze its HTML code. Each book is contained within an <article> tag, so you can use Cheerio to loop through all of these elements and extract the information you need. To run the loop, we will use the each() method. In simplified form, the loop looks like this:

const books = $("article"); //Selector to get all books
books.each(function () 
           { //running a loop
        title = $(this).find("h3 a").text(); //extracting book title
        console.log(title);//print the book title
            });

Clearly, it will be more convenient to store the target information about the books as objects in a separate array while the loop is running.

Thus, the final code will look like this:

const cheerio = require("cheerio");
const axios = require("axios");
const mystery = "http://books.toscrape.com/catalogue/category/books/mystery_3/index.html";
const books_data = [];
async function getBooks(url) {
  try {
    const response = await axios.get(url);
    const $ = cheerio.load(response.data);
    const books = $("article");
    books.each(function () {
      const title = $(this).find("h3 a").text();
      const price = $(this).find(".price_color").text();
      const stock = $(this).find(".availability").text().trim();
      books_data.push({ title, price, stock }); //store in array
    });
    console.log(books_data);//print the array
  } catch (err) {
    console.error(err);
  }
}
getBooks(mystery);

All you need to do is save the code to a separate books.js file and run it with the node command:

$ node books.js

As a result of the code execution, all information about the books will be displayed in the console. However, here we only got books from one page. In the next chapter, we will figure out what to do if pagination is configured on the site and you need to load data from several pages at once.

Scraping pages with pagination

Online stores and other similarly structured websites use pagination when displaying a large number of items. At the same time, the logic of splitting products across pages differs from site to site. The main nuance to take into account is therefore the boundary case of the last page, where the loop has to stop.

The solution is quite simple: using a selector, check whether the element containing the link to the next page exists. If it does, recursively call our function with the new URL.

It is worth remembering that the element found during parsing contains a relative link. To get an absolute link to pass as the function argument, you need to append the relative link to the variable containing the base URL. Our code uses the baseUrl variable for this.
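
Alternatively, Node.js's built-in URL class can resolve the relative link against the address of the page that was just fetched, which is a bit more robust than string concatenation (a sketch assuming url is the argument passed to getBooks, as in the code above):

const href = $(".next a").attr("href"); // relative link, e.g. "page-2.html"
const nextPage = new URL(href, url).href; // absolute link resolved against the current page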

Thus, to handle pagination, it is enough to add a new condition inside getBooks, right after the each() loop:

if ($(".next a").length > 0) {
      next_page = baseUrl + $(".next a").attr("href");
      getBooks(next_page); 
}

When the last page is reached, this condition is no longer met and the recursion stops. Only the last step remains: saving the results to a file!

Writing data into a CSV file

It is inconvenient and impractical to dump large amounts of data to the console, because such output cannot be used for further analysis. Therefore, data is usually saved to a file or a database. Saving data to a CSV file with JavaScript is very easy: we'll use the fs and json2csv modules. fs is built into Node.js, and we already installed json2csv earlier.

To use them in our code, we import them and store the resulting objects in variables:

const j2cp = require("json2csv").Parser;
const fs = require("fs");

We will call the following functions after the main part of the code has run, when the data has already been collected into the array. They simply create a file and write the contents of the array into it:

const parser = new j2cp();
const csv = parser.parse(books_data);
fs.writeFileSync("./books.csv", csv);
const fs = require("fs");
const j2cp = require("json2csv").Parser;
const axios = require("axios");
const cheerio = require("cheerio");
const mystery = "http://books.toscrape.com/catalogue/category/books/mystery_3/index.html";
const books_data = [];
async function getBooks(url) {
  try {
    const response = await axios.get(url);
    const $ = cheerio.load(response.data);
    const books = $("article");
    books.each(function () {
      const title = $(this).find("h3 a").text();
      const price = $(this).find(".price_color").text();
      const stock = $(this).find(".availability").text().trim();
      books_data.push({ title, price, stock });
    });
    // console.log(books_data);
    const baseUrl = "http://books.toscrape.com/catalogue/category/books/mystery_3/";
    if ($(".next a").length > 0) {
      next = baseUrl + $(".next a").attr("href");
      getBooks(next);
    } else {
      const parser = new j2cp();
      const csv = parser.parse(books_data);
      fs.writeFileSync("./books.csv", csv);
    }
  } catch (err) {
    console.error(err);
  }
}
getBooks(mystery);

Run it as always from the console using Node.js:

$ node books.js

After the program completes successfully, a books.csv file will appear in your project directory. It can be opened with any application that supports the CSV format, for example Microsoft Excel.
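
Judging by the object keys we pushed into books_data, the file should start with a header row similar to the one below, followed by one line per book (the actual rows depend on the site's current data):

"title","price","stock"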

Conclusion

In this article, we looked at a simple example of using JavaScript for scraping and parsing data on the web. We used the Node.js platform and the Axios, Cheerio, and json2csv libraries.

If you would like to learn more about crawling, scraping, and website parsing, we have prepared other articles on this topic.
