Web Scraping with JavaScript 101

How not to get blocked forever from the internet

How to Scrape Websites even if they block scrapers

A few months ago, I wanted to solve a problem with my phone service provider. However, customer support was unavailable (as always), and the website FAQ was hard to navigate.

I had an idea. What if I built my own LLM-based chat using their support pages?

The setup:

  • Next.js OpenAI Doc Search Starter

  • Puppeteer

  • Bright Data

The first step is setting up the starter repo. If you follow the official docs, that's pretty straightforward.

Then comes the web scraping part. I chose Puppeteer for this. I had heard a lot about Puppeteer at the time, and I had little experience with browser automation, so I thought it’d be fun to try it.

Using Puppeteer feels like coding in Scratch.

You give step-by-step instructions, and the headless browser does exactly what you say: nothing less, nothing more. Here’s an example of opening a webpage and extracting the text content of the page:

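const puppeteer = require('puppeteer');

(async () => {
  // Launch a headless browser and open a new tab
  const browser = await puppeteer.launch();
  const page = await browser.newPage();

  // Set the viewport before navigating so the page renders at this size
  await page.setViewport({width: 1080, height: 1024});
  await page.goto('https://example.com/');

  // Grab the rendered text of the whole page.
  // $eval runs the callback on the first element matching the selector.
  const extractedText = await page.$eval('body', (el) => el.innerText);

  // Log or save to file
  console.log(extractedText);

  await browser.close();
})();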

But no one likes having their data scraped.

So site owners build all kinds of measures into their websites to prevent it. Most “scrape-worthy” websites can easily detect bots and headless browsers from the User-Agent header and usage patterns, and they throw CAPTCHAs and other challenges at anything that looks suspicious.
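
To make the User-Agent part concrete, here is a minimal sketch of the kind of check a website could run on its side. The function name looksLikeHeadlessChrome is made up for illustration, and real bot detection combines many more signals. It relies on one thing: headless Chrome has traditionally put a "HeadlessChrome" token into its default User-Agent string.

```javascript
// Hypothetical server-side check, for illustration only.
// Real detection systems look at far more than the User-Agent header.
function looksLikeHeadlessChrome(userAgent) {
  // Treat a missing header as "not headless" in this simplified sketch
  return /HeadlessChrome/i.test(userAgent ?? '');
}

// Example User-Agent strings (version numbers are arbitrary)
const headlessUA =
  'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) HeadlessChrome/120.0.0.0 Safari/537.36';
const regularUA =
  'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36';

console.log(looksLikeHeadlessChrome(headlessUA)); // true
console.log(looksLikeHeadlessChrome(regularUA)); // false
```

A one-line check like this is enough to catch a default headless setup, which is why the User-Agent is usually the first thing that gives a scraper away.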

That’s where Bright Data joins the chat.

Bright Data can provide you with a Scraping Browser or a proxy that you can use to “hide” the fact that your bot is acting like, well, a bot.

Connecting Bright Data to Puppeteer is way simpler than I initially thought. You need to register an account and set up a Proxy Manager or Scraping Browser. Then, connect the Puppeteer headless browser using the API keys you get from Bright Data, and they will take care of the rest.

const puppeteer = require('puppeteer');

(async () => {
  const browser = await puppeteer.launch({
    headless: false,
    // Send all browser traffic through the proxy listening on this address
    args: ['--proxy-server=127.0.0.1:24000']
  });

  const page = await browser.newPage();

  // Open a test page and save a screenshot to check that the proxy is in use
  await page.goto('http://lumtest.com/myip.json');
  await page.screenshot({path: 'example.png'});

  await browser.close();
})();
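
If you don't want the proxy address hardcoded, the same setup can be sketched with a small helper. Everything named here is my own assumption, not something from Bright Data's docs: the helper names buildLaunchOptions and fetchThroughProxy, and the PROXY_HOST and PROXY_PORT environment variables. The defaults simply mirror the address used in the snippet above.

```javascript
// Hypothetical helper: builds Puppeteer launch options for a proxy.
// PROXY_HOST and PROXY_PORT are made-up variable names for this sketch.
function buildLaunchOptions(env = process.env) {
  const host = env.PROXY_HOST || '127.0.0.1';
  const port = env.PROXY_PORT || '24000';
  return {
    headless: true,
    args: [`--proxy-server=${host}:${port}`],
  };
}

// Hypothetical helper: loads a URL through the proxy and returns its HTML.
async function fetchThroughProxy(url) {
  // Required lazily so buildLaunchOptions works without Puppeteer installed
  const puppeteer = require('puppeteer');
  const browser = await puppeteer.launch(buildLaunchOptions());
  try {
    const page = await browser.newPage();
    await page.goto(url, { waitUntil: 'networkidle2' });
    return await page.content(); // full HTML of the rendered page
  } finally {
    await browser.close(); // always release the browser, even if goto() throws
  }
}
```

Wrapping the browser in try/finally matters more than it looks: a scraper that crashes halfway through a run otherwise leaves Chrome processes hanging around.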

Make vector embeddings to do vector search.

The last part of the work is to make the data searchable. For this, I chose Supabase, as they support pgvector, which lets Postgres store and search vector embeddings (and they offer a fully-fledged starter).
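
If vector search sounds abstract, here is a toy sketch of the idea in plain JavaScript. It is not how the starter or pgvector is implemented; pgvector does the comparison inside Postgres and supports several distance measures, cosine distance among them. The function names and the tiny three-dimensional "embeddings" are invented for illustration.

```javascript
// Cosine similarity: 1 means same direction, 0 means unrelated (orthogonal)
function cosineSimilarity(a, b) {
  if (a.length !== b.length) throw new Error('Vectors must have the same length');
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

// Rank documents by similarity to the query and keep the best k
function topMatches(queryEmbedding, documents, k = 3) {
  return documents
    .map((doc) => ({ ...doc, score: cosineSimilarity(queryEmbedding, doc.embedding) }))
    .sort((x, y) => y.score - x.score)
    .slice(0, k);
}

// Toy 3-dimensional embeddings; real embedding vectors have many more dimensions
const docs = [
  { id: 'billing', embedding: [1, 0, 0] },
  { id: 'roaming', embedding: [0, 1, 0] },
  { id: 'sim-card', embedding: [0.9, 0.1, 0] },
];

console.log(topMatches([1, 0, 0], docs, 2).map((d) => d.id)); // [ 'billing', 'sim-card' ]
```

The chat then only needs to embed the user's question, find the closest page sections this way, and pass them to the LLM as context.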

To create the embeddings, I saved the scraped data as HTML into .mdx files. The starter repo is already set up to generate embeddings from MDX files. I just needed to make sure that I used the right extension when writing the data from the scraper.
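
The file-writing step could look roughly like this. It is a sketch under my own assumptions: the helper names urlToMdxFileName and saveAsMdx and the scraped-pages folder are made up, so point the output at whatever folder your copy of the starter actually reads from.

```javascript
const fs = require('node:fs');
const path = require('node:path');

// Hypothetical helper: turns a page URL into a safe file name ending in .mdx
function urlToMdxFileName(pageUrl) {
  const { hostname, pathname } = new URL(pageUrl);
  const slug = `${hostname}${pathname}`
    .toLowerCase()
    .replace(/[^a-z0-9]+/g, '-') // collapse anything non-alphanumeric into a dash
    .replace(/^-+|-+$/g, ''); // trim leading and trailing dashes
  return `${slug}.mdx`;
}

// Hypothetical helper: writes scraped content to outDir and returns the file path.
// 'scraped-pages' is an assumed folder name, not one the starter defines.
function saveAsMdx(pageUrl, content, outDir = 'scraped-pages') {
  fs.mkdirSync(outDir, { recursive: true });
  const filePath = path.join(outDir, urlToMdxFileName(pageUrl));
  fs.writeFileSync(filePath, content, 'utf8');
  return filePath;
}
```

For example, saveAsMdx('https://example.com/support/faq/', text) would write the content to scraped-pages/example-com-support-faq.mdx.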

And with that, I had my own always-available, chat-based customer support agent.

Curated

👉 Tools - Treasure Trove Finds

Tempo - A modern, ~5kB ESM alternative to date-fns and moment.js. Lately, it feels like FormKit ships a new open-source library every month.

MagicUI - Copy-paste trending components to build “Linear-like” websites with React, Tailwind, and Framer Motion.

Supabase - They like to refer to themselves as the open-source Firebase alternative, but it’s so much more!

Cal.com AI - Cal.com is not just a Calendly alternative; it’s also one of the most mature open-source Next.js apps. Now AI-enhanced.

👉 Binge-worthy

Bringing React Components to AI - Streaming UI components opens many possibilities. This video provides some examples and a short tutorial on how to make your own.

JavaScript Visualized - Promise Execution - Even after working with JavaScript for the past 12 years, I always like to watch some stuff about promises, closures, binding, and the JS “this” keyword.

Until next time,
David
