- You can’t reliably block all bots: even basic scrapers can mimic browsers well enough to get through.
- Defense is about friction, not perfection: aim to make scraping slow, expensive, and annoying - not impossible. Most attackers give up when it stops being worth the effort.
- Defend in layers: block as early in the stack as possible; the deeper a request gets, the more it costs you to handle.
- Some scrapers will always win: especially those using headless browsers and residential IP pools. But most move on when friction or cost gets high.
Introduction
For nearly two years, I ran a side project powered by a long-running, 24/7 scraping job. It targeted a well-known marketplace that I won’t name here. What matters is that the company I was scraping had significantly more resources than I did and employed far more experienced engineers. It also had every incentive to block my attempts, because access to their dataset made it clear they weren’t exactly what they claimed to be.
Yet despite their efforts, I built a scraping infrastructure from scratch with no prior experience. I ran my own proxy pool and an army of headless browsers, and kept the operation running while interning full-time, all for under $50/month.
This isn’t a humblebrag. It’s the point of the story.
Web scraping at scale isn't as difficult as it may seem. If you have enough time or money, nearly any site can be scraped. LinkedIn is a perfect example: despite aggressive login walls and CAPTCHAs, there are countless services whose entire business model depends on scraping LinkedIn data.
This post isn’t about novel ways to scrape or block scrapers. It’s a practical guide from someone who has operated on both sides. Now that I find myself trying to prevent scraping, I’ve realized how useful that prior experience really was. Most people don’t get to see both sides, and that perspective makes a real difference when you're on defense.
Two core principles
You can’t reliably block scrapers.
And there is plenty of evidence of this: a lot of successful services run on data scraped from big, well-funded companies.
You can scrape any website. The only real constraints are cost and time. Go slow, and you can stay under the radar with minimal infrastructure. Go fast, and you'll need to invest in a more advanced, expensive setup.
So the real goal isn’t to block scrapers. It’s to make scraping your service expensive and time-consuming.
Defense is harder than attack.
As an attacker, your feedback loop is tight. Your scraper breaks, and you immediately see the errors; you can quickly diagnose what's wrong and adjust. Plus, the target's frontend runs in your browser, so you see exactly how it behaves.
As a defender, on the other hand, you have a much harder job. You first need to detect the scraping, which means building and maintaining monitoring systems. Then you must craft and deploy defenses carefully, ensuring you don’t block legitimate users in the process. So have a monitoring system in place early on.
Understanding automated traffic
It helps to distinguish between two types of automated traffic.
The first one is vulnerability scanners. These bots scan for known exploits: .env files, admin panels, leaked API keys, etc. They hit your server in short bursts at high RPS. They’re noisy, but mostly harmless. A modern server can shrug off thousands of 404s per second. The main cost is bandwidth and log volume.
2025-07-10 18:14:31.000 13.**.**.*** - - [10/Jul/2025:17:14:31 +0000] "GET /wp-admin/css/admin.php HTTP/1.1" 301 178 "-" "-"
2025-07-10 18:14:31.000 13.**.**.*** - - [10/Jul/2025:17:14:31 +0000] "GET /Marvins.php HTTP/1.1" 301 178 "-" "-"
2025-07-10 18:14:31.000 13.**.**.*** - - [10/Jul/2025:17:14:31 +0000] "GET /randkeyword.php HTTP/1.1" 200 15886 "-" "-"
Example: Log extract from a short-lived vulnerability scan.
Scraper traffic is a different beast. These are the real problem. They run continuously, often for days, harvesting structured data from your API or pages. They consume bandwidth, CPU, memory, and database I/O. On small servers, they can cause latency spikes or downtime. On the cloud, they rack up egress and query costs. And most importantly, they leak your data.
How to block scraper traffic?
Or more precisely: how to make scraping your service slow, expensive, and frustrating.
We'll go through this in layers, starting at the earliest point of contact and moving deeper into the stack.
Level 0: Anti-scraping by design
Scraping happens in two phases:
- Breadth-first discovery: the attacker needs to list all available entities in your service. Let's say you run a marketplace. The first phase will be to get all available listings on your platform. The scraper will do a lot of broad searches.
- Depth-first extraction: once the entities are listed, the scraper will pull detailed data from each one. Sticking with the marketplace example, now that they have the listing URLs, they'll start scraping the actual product details. (Both phases are sketched in code below.)
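To make the attacker's side concrete, here is a rough sketch of what such a two-phase crawler often looks like. Everything in it is hypothetical (the target, the endpoints, the response shape); a real operation would add proxies, retries, and randomized delays.

```python
import time
import requests

BASE = "https://marketplace.example.com"   # hypothetical target

def discover_listings(queries):
    """Phase 1 - breadth-first discovery: broad searches to enumerate entities."""
    urls = set()
    for q in queries:
        page = 1
        while True:
            resp = requests.get(f"{BASE}/api/search", params={"q": q, "page": page})
            results = resp.json().get("results", [])
            if not results:
                break
            urls.update(item["url"] for item in results)
            page += 1
            time.sleep(1)   # crude politeness delay; real scrapers randomize this
    return urls

def extract_details(urls):
    """Phase 2 - depth-first extraction: pull the full record for each entity."""
    for url in urls:
        yield requests.get(f"{BASE}{url}").json()
        time.sleep(1)
```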
So your first line of defense is at the product level. Design your system in a way that's hard to scrape.
Dunster is intense. Not only is there no public directory, there's no directory at all. You have to do searches, and if your search returns more than 20 matches, nothing gets returned. And once you do get results, they don't link directly to the images; they link to a PHP page that redirects or something. Weird. This may be difficult. I'll come back to it later.
— Paraphrased from the hacking scene in The Social Network (2010)
- Don’t expose full indexes or public directories.
- Disable or severely limit pagination.
- For user-facing identifiers, use UUIDs instead of sequential IDs. Sqids are a good alternative: they're short enough for URLs while still being hard to enumerate (see the sketch after this list).
- Avoid exposing advanced query filters in search. Instead, use abstract query tokens like query=... that you resolve or interpret server-side.
- Return only essential fields on item views.
- Avoid general-purpose endpoints that leak full objects.
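To illustrate the identifier and query-token points above, here is a minimal Python sketch. It assumes the sqids package for non-sequential public IDs; the SECRET_KEY and the token format are made up for the example.

```python
import base64
import hashlib
import hmac
import json

from sqids import Sqids

sqids = Sqids(min_length=8)   # short, URL-safe, non-sequential public IDs

def to_public_id(internal_id: int) -> str:
    # yields a short scrambled string instead of the enumerable /listings/123
    return sqids.encode([internal_id])

def to_internal_id(public_id: str) -> int:
    return sqids.decode(public_id)[0]

SECRET_KEY = b"change-me"   # hypothetical server-side secret

def issue_query_token(filters: dict) -> str:
    """Hand the client an opaque, signed token instead of raw query filters."""
    payload = json.dumps(filters, sort_keys=True).encode()
    sig = hmac.new(SECRET_KEY, payload, hashlib.sha256).digest()[:16]
    return base64.urlsafe_b64encode(payload + sig).decode()

def resolve_query_token(token: str) -> dict:
    """Interpret the token server-side; reject anything tampered with."""
    raw = base64.urlsafe_b64decode(token.encode())
    payload, sig = raw[:-16], raw[-16:]
    expected = hmac.new(SECRET_KEY, payload, hashlib.sha256).digest()[:16]
    if not hmac.compare_digest(sig, expected):
        raise ValueError("invalid query token")
    return json.loads(payload)
```

Note that Sqids only obfuscates the underlying number; if identifiers must be truly unguessable, random UUIDs are the safer default.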
Level 1: Network level (CDN / Proxy)
The best way to stop malicious traffic is to prevent it from reaching your server in the first place.
A CDN or reverse proxy like Cloudflare acts as your first line of defense. It hides your origin IP and lets you configure WAF rules for more advanced protections. Most of them include built-in bot-blocking features, though these are often tied to paid plans and can be imprecise.
The goal here isn't sophistication; it's elimination: most scrapers will hit these basic network limits and move on. One thing to keep in mind: once you're behind a CDN or proxy, your origin sees the proxy's address, so any IP-based rules further down the stack need the real client IP from the X-Forwarded-For header.
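A minimal sketch of restoring that client address, assuming the trusted proxy ranges come from your provider's published list (the range below is just an example):

```python
import ipaddress

# Example only: use the ranges published by your CDN/proxy provider.
TRUSTED_PROXIES = [ipaddress.ip_network("173.245.48.0/20")]

def client_ip(peer_addr: str, xff_header: str | None) -> str:
    """Return the address to apply IP-based rules to.

    Only trust X-Forwarded-For when the TCP peer is one of our own proxies;
    otherwise a scraper can spoof the header and dodge per-IP limits.
    """
    peer = ipaddress.ip_address(peer_addr)
    if xff_header and any(peer in net for net in TRUSTED_PROXIES):
        # Take the last hop appended by the trusted proxy.
        return xff_header.split(",")[-1].strip()
    return peer_addr
```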
Level 2: Web server level
At the web server level, you're catching what slipped through the CDN. The key is handling obvious patterns without complex logic.
Many scrapers start sloppy, hitting endpoints from a single IP, with no user-agent spoofing, no delay between requests. Those are easy wins. If your server logs are reasonably set up, you’ll start noticing these trends fast.
Some defenses are (sketched in code after this list):
- Block requests missing common headers (User-Agent, Accept).
- Rate limit by IP at the connection level.
- Set up fail2ban rules based on 404 patterns or request frequency.
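Here is what those checks look like as a sketch, written in Python for clarity even though in practice they usually live in your web server config or a fail2ban jail. The thresholds are placeholders.

```python
import time
from collections import defaultdict, deque

BAN_THRESHOLD = 20       # 404s per IP within the window before a temporary ban
WINDOW_SECONDS = 60
_not_found = defaultdict(deque)
_banned: dict[str, float] = {}

def is_suspicious(ip: str, headers: dict, status: int | None = None) -> bool:
    """Cheap checks for obviously automated traffic."""
    if ip in _banned and time.time() < _banned[ip]:
        return True

    # Requests missing headers that every real browser sends.
    if not headers.get("User-Agent") or not headers.get("Accept"):
        return True

    # fail2ban-style rule: too many 404s in a short window -> temporary ban.
    if status == 404:
        now = time.time()
        hits = _not_found[ip]
        hits.append(now)
        while hits and now - hits[0] > WINDOW_SECONDS:
            hits.popleft()
        if len(hits) > BAN_THRESHOLD:
            _banned[ip] = now + 3600   # ban for an hour
            return True
    return False
```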
I don't stack too many rules here. Once a request passes basic checks, I let the application layer handle nuanced detection. Still, it's worth picking off the low-hanging fruit early.
Level 3: Application level
At the application level, you finally have full context and can implement more complex defenses.
The foundation is rate limiting with multiple time windows. Set both burst limits (number of requests per hour) and sustained limits (number of requests per day). This creates a dilemma for attackers: either slow their crawl to a month-long operation, or burn through expensive IP addresses and accounts every day.
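A minimal sketch of the idea with fixed windows and an in-memory store. A real deployment would back this with Redis or similar so counters are shared across instances and survive restarts; the limits here are placeholders.

```python
import time
from collections import defaultdict

# Placeholder limits: a burst window and a sustained window per key.
LIMITS = [
    (3600, 300),     # at most 300 requests per hour
    (86400, 2000),   # at most 2000 requests per day
]
_counters = defaultdict(dict)   # key -> {window_seconds: (window_start, count)}

def allow(key: str) -> bool:
    """key can be an IP, an account ID, or a combination of both."""
    now = time.time()
    state = _counters[key]
    # Reset expired windows and check every limit before recording the hit.
    for window, limit in LIMITS:
        start, count = state.get(window, (now, 0))
        if now - start >= window:
            start, count = now, 0
        state[window] = (start, count)
        if count + 1 > limit:
            return False
    for window, _ in LIMITS:
        start, count = state[window]
        state[window] = (start, count + 1)
    return True
```

Keying on both IP and account forces attackers to rotate both at once, which is exactly the kind of cost you want to impose.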
Well-funded scrapers with a big enough pool of user accounts or IP addresses can still operate below your thresholds. But for most situations, this is enough to filter out a big chunk of automated traffic.
There are other techniques that can help spot bots: browser fingerprinting, behavioral analysis, anomaly detection. But these are complex topics, often with high maintenance cost, and they’re outside the scope of this post.
CAPTCHAs can also be used. They can introduce just enough friction to make scraping less attractive, but they come at the cost of user experience, so they should be used sparingly and strategically.
Why it's not enough
With this multi-layered strategy, you're already ahead of most websites. Yet it's still not enough to completely stop scrapers, only to slow them down. That may sound defeatist, but it's realistic.
Still, raising the cost of scraping, in time, money, or effort, is often enough. Many scrapers move on the moment they hit resistance. Think less about automated perfection, and more about friction. Defend in layers, monitor actively, and make life harder for the lazy bots. That’s usually enough.