Crawling a Website’s HTML Text using JSDOM

Bhavya Dhiman
2 min read · Feb 26, 2023


jsdom is a very popular library for parsing the static HTML of a website. I used it at my previous organisation to translate text from one language to another, and it handled content in multiple languages within a single HTML document. I remember how impressed the client was, saying “That’s fantastic!”.

Crawling static text content from a website with jsdom is a piece of cake. Here is an example:

const axios = require('axios');
const { JSDOM } = require('jsdom');

// Recursively print the text content of each element, along with its depth.
const fetchChildren = (children, depth) => {
  for (const child of children) {
    // print the child's text content and its depth in the tree
    console.log(child.textContent, depth + 1);
    // recurse only when the element actually has child elements
    if (child.children.length > 0) {
      fetchChildren(child.children, depth + 1);
    }
  }
};

// fetch the HTML content of a website such as Google
axios.get('https://www.google.com')
  .then((response) => {
    const { document } = new JSDOM(response.data).window;
    // print the body's text content
    console.log(document.body.textContent);
    // print the children's text content recursively
    fetchChildren(document.body.children, 0);
  })
  .catch((error) => {
    // report network or HTTP errors instead of failing silently
    console.error('Request failed:', error.message);
  });

You can target a particular HTML tag, class, or id by modifying the code as required.

But wait. This approach has two major drawbacks as well.

  1. The deeper the tree, the more memory the recursion uses and the greater the chance of exhausting the call stack (a rare case, in which the HTML has a very complicated tree structure).
  2. It is static. HTML can change dynamically through client-side code, or through data fetched from the server on a button click or any other event. Such changes might not be captured in the HTML that JSDOM parses.
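One possible way around the first point is to replace the recursion with an explicit stack, so that depth is limited by available heap memory rather than by the call stack. The sketch below is an assumption-laden illustration, not part of the original code: the helper name `collectText` is hypothetical, and the small plain-object tree only stands in for a real DOM so the snippet runs on its own.

```javascript
// Hypothetical helper: walk an element tree iteratively and return each
// node's text together with its depth, in document order.
const collectText = (root) => {
  const results = [];
  const stack = [{ node: root, depth: 0 }];

  while (stack.length > 0) {
    const { node, depth } = stack.pop();
    results.push({ depth, text: node.textContent });

    // Push children in reverse so they are popped in document order.
    const children = Array.from(node.children || []);
    for (let i = children.length - 1; i >= 0; i--) {
      stack.push({ node: children[i], depth: depth + 1 });
    }
  }
  return results;
};

// Stand-in tree (an assumption for the demo); any object exposing
// `textContent` and `children` works, including jsdom elements.
const sampleTree = {
  textContent: 'Hello World',
  children: [
    { textContent: 'Hello', children: [] },
    { textContent: 'World', children: [] },
  ],
};

for (const { depth, text } of collectText(sampleTree)) {
  console.log(depth, text);
}
```

With jsdom, the same function should accept `document.body` directly, since DOM elements expose both `textContent` and `children`.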
