How we outperform traditional web scraping with 0 AI

Jan 25, 2022

I've seen a lot of community interest in web scraping technology for capturing online content. Many of you are aware of the difficulties with dynamic page loads, IP bans, page structure changes, non-browser user agents… the list goes on.

I thought I'd share one simple technique we used at Zenfetch to capture web content which gets around most, if not all, of these issues.

This is especially helpful if you do a lot of web scraping locally and Selenium isn't cutting it for you.

The problem with traditional scrapers

Traditional web scrapers, including ones we built at Stripe for ToS violation detection, make website requests from a server. This approach works by impersonating a browser client, gathering the static page's content, and then running analysis on the retrieved information.

This might work for some, but it falls short in several ways: you run the risk of bot detection if you haven't gotten explicit buy-in from the source, and some cloud providers like AWS may even detect that you're running a web scraping service and shut you down entirely.

One logical follow-up might be setting up a Selenium-based web scraper locally. Even so, there is still a severe limitation: bypassing authentication pages.

Many websites require authentication upfront to access content. Even with login credentials, it's unlikely you'll reliably detect the right fields to enter them into, especially if the website changes its page layout.

This can lead to a lot of maintenance on the actual web scraping technology just to access the DOM.

Congratulations, you have officially entered the game of cat and mouse with web scraping 🥳

As an alternative, we wondered whether we could leverage the user's existing browser session, which already persists their authentication.

Client side scraping

This exploration led us to native client-side web scraping with Chrome extensions.

Chrome extension web scrapers carry a significantly lower risk of bot detection, handle dynamic page loads, leverage browser cookies to maintain logged-in sessions, and don't perform any user-agent impersonation!

Combine Chrome extensions with some of the tools listed below, and you'll have a bona fide web scraper that scales to thousands of pages.
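As a rough sketch, a Manifest V3 extension wired for this kind of scraping might declare permissions like the following (the `app.example.com` origin is a placeholder for whichever page should be allowed to message the extension):

```json
{
  "manifest_version": 3,
  "name": "Scraper Sketch",
  "version": "0.1.0",
  "permissions": ["offscreen", "alarms"],
  "host_permissions": ["<all_urls>"],
  "externally_connectable": {
    "matches": ["https://app.example.com/*"]
  },
  "background": { "service_worker": "background.js" }
}
```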

Avoiding thousands of tabs

Don't want to open thousands of tabs to scrape content? Consider using Chrome's Offscreen API to render pages in hidden iframes, which can then be used for DOM scraping. The Offscreen API even explicitly lists DOM_SCRAPING as a use case for these iframes.
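A minimal sketch of what that looks like from a Manifest V3 service worker. The `offscreen.html` page and the `scrape-url` message type are names we made up for this example; the offscreen page's own script would load the URL in an iframe and reply with the scraped DOM:

```javascript
// background.js (MV3 service worker) — a sketch, not Zenfetch's actual code

async function ensureOffscreenDocument() {
  // Only one offscreen document may exist at a time
  if (await chrome.offscreen.hasDocument()) return;
  await chrome.offscreen.createDocument({
    url: 'offscreen.html',
    reasons: ['DOM_SCRAPING'], // the API's own reason for this use case
    justification: 'Render pages in hidden iframes to scrape their DOM',
  });
}

async function scrapeInBackground(url) {
  await ensureOffscreenDocument();
  // offscreen.html's script listens for this message, loads `url` in an
  // iframe, and responds with whatever it extracts
  return chrome.runtime.sendMessage({ type: 'scrape-url', url });
}
```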

Scheduling the scrape

Manually activating Chrome extension actions doesn't scale. We use Chrome's external messaging API to signal to Zenfetch when it's time to scrape a large set of links, such as when a user imports their bookmarks. You can also use the Alarms API if you'd like to scrape on a schedule.
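Both triggers can be wired up in a few lines. In this sketch, `scrapeLinks` is a hypothetical function that queues URLs for scraping, and the external message only arrives from origins allowlisted under `externally_connectable` in the manifest:

```javascript
// background.js sketch — `scrapeLinks` and the message type are assumptions

function registerScrapeTriggers(scrapeLinks) {
  // On demand: an allowlisted web page messages the extension,
  // e.g. right after the user imports their bookmarks
  chrome.runtime.onMessageExternal.addListener((msg, _sender, sendResponse) => {
    if (msg.type === 'scrape-links') {
      scrapeLinks(msg.urls);
      sendResponse({ queued: msg.urls.length });
    }
  });

  // On a schedule: fire once an hour via the Alarms API
  chrome.alarms.create('scheduled-scrape', { periodInMinutes: 60 });
  chrome.alarms.onAlarm.addListener((alarm) => {
    if (alarm.name === 'scheduled-scrape') scrapeLinks([]);
  });
}
```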

Consider opening a persistent connection (such as a WebSocket) to programmatically trigger web scraping in the browser.
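A sketch of that persistent connection, with `wss://example.com/scrape-control` as a placeholder endpoint and `scrapeLinks` again a hypothetical queueing function:

```javascript
// background.js sketch — endpoint and message shape are made up

let controlSocket;

function connectControlChannel(scrapeLinks) {
  controlSocket = new WebSocket('wss://example.com/scrape-control');
  controlSocket.onmessage = (event) => {
    // Server pushes { "urls": [...] } whenever there's work to do
    const { urls } = JSON.parse(event.data);
    scrapeLinks(urls);
  };
  // Reconnect on drop so the server can always reach this browser
  controlSocket.onclose = () =>
    setTimeout(() => connectControlChannel(scrapeLinks), 5000);
}
```

One caveat worth knowing: in Manifest V3 the background script is a service worker that Chrome can suspend, so a long-lived socket needs care (recent Chrome versions extend the worker's lifetime while a WebSocket is actively exchanging messages).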

Parsing the DOM

It's still a massive pain to parse HTML in the DOM. Consider using Mozilla's Readability library to capture the information you care about (turns out you don't always need LLMs to extract titles, authors, and text content from websites).
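A minimal sketch of a content-script helper, assuming `@mozilla/readability` (or the standalone Readability.js) is bundled alongside it so that `Readability` is in scope; `extractArticle` is a name we made up:

```javascript
// content-script.js sketch — assumes Readability is bundled with this script

function extractArticle(doc) {
  // Readability mutates the document it parses, so work on a clone
  const article = new Readability(doc.cloneNode(true)).parse();
  if (!article) return null; // page wasn't article-like enough to parse
  return {
    title: article.title,
    byline: article.byline,    // author, when detectable
    text: article.textContent, // plain-text body, no markup
    excerpt: article.excerpt,
  };
}
```

Called as `extractArticle(document)` on the page you're scraping, this covers titles, authors, and body text without any model in the loop.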

Want to capture web content for personal use?

If you’re looking for a way to capture the content you come across on the web, we’ve already done the heavy lifting of building a powerful web clipper that stores online information in a personal library powered by GPT-4.

You can get started with Zenfetch for free 🙂

Would love to hear feedback on the UX and which integrations you’d want. Feel free to send me an email.

© Copyright 2024 Zenfetch.
