
How leading AI labs structure their web crawl pipelines. We analyzed 30 open-source crawlers and surveyed 20 teams building proprietary LLM training datasets.
Methodology & test setup
This study ran 300,000 requests per proxy type across 500+ unique target domains over a 90-day period from January to March 2025. Targets were segmented into three tiers based on their anti-bot sophistication: Tier-1 (no protection), Tier-2 (basic fingerprinting), and Tier-3 (advanced fingerprinting + behavioral analysis).
All requests were made using a standardized headless Chromium instance with playwright-stealth patches applied. The only variable changed between test runs was the proxy configuration — type, rotation strategy, and pool size. Infrastructure was identical across all test groups.
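The rotation strategies varied between runs ("per_request", "session", "sticky_5min") are named in the harness but not defined in this write-up. As a rough sketch of what they could mean, the hypothetical `ProxyRotator` below cycles through a pool and decides when to switch proxies. The class name, the `session_length` default, and the injectable `clock` are assumptions for illustration, not the study's actual helpers.

```python
import itertools
import time
from typing import Callable, Optional

VALID_ROTATIONS = {"per_request", "session", "sticky_5min"}


class ProxyRotator:
    """Hypothetical helper: picks the next proxy for a given rotation strategy."""

    def __init__(
        self,
        pool: list[str],
        rotation: str,
        session_length: int = 10,        # assumed number of requests per session
        sticky_seconds: float = 300.0,   # "sticky_5min" = 5 minutes
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        if not pool:
            raise ValueError("proxy pool must not be empty")
        if rotation not in VALID_ROTATIONS:
            raise ValueError(f"unknown rotation strategy: {rotation!r}")
        self._cycle = itertools.cycle(pool)
        self.rotation = rotation
        self.session_length = session_length
        self.sticky_seconds = sticky_seconds
        self._clock = clock
        self._current: Optional[str] = None
        self._uses = 0
        self._started = 0.0

    def next_proxy(self) -> dict:
        """Return a proxy dict with a "server" key, the shape Playwright's launch() accepts."""
        now = self._clock()
        if self._current is None or self.rotation == "per_request":
            rotate = True
        elif self.rotation == "session":
            rotate = self._uses >= self.session_length
        else:  # "sticky_5min": keep the same IP until the time window expires
            rotate = now - self._started >= self.sticky_seconds
        if rotate:
            self._current = next(self._cycle)
            self._uses = 0
            self._started = now
        self._uses += 1
        return {"server": self._current}
```

Injecting the clock keeps the time-based strategy testable without sleeping; a real pool would also need health checks and eviction of banned IPs, which this sketch leaves out.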
# Test harness (simplified)
import asyncio
import random
from dataclasses import dataclass, field

from playwright.async_api import async_playwright
from playwright_stealth import stealth_async


@dataclass
class TestConfig:
    proxy_type: str  # "mobile" | "residential" | "datacenter" | "isp"
    rotation: str    # "per_request" | "session" | "sticky_5min"
    pool_size: int
    targets: list[str] = field(default_factory=list)


# get_proxy_pool() and get_next_proxy() are the study's own helpers, defined elsewhere.
# Playwright expects the proxy as a dict such as {"server": "http://host:port"}.
async def run_test(config: TestConfig, n_requests: int = 1000) -> dict:
    results = {"success": 0, "blocked": 0, "error": 0}
    pool = get_proxy_pool(config.proxy_type, config.pool_size)
    async with async_playwright() as p:
        for _ in range(n_requests):
            proxy = get_next_proxy(pool, config.rotation)
            target = random.choice(config.targets)
            browser = None
            try:
                browser = await p.chromium.launch(proxy=proxy)
                page = await (await browser.new_context()).new_page()
                await stealth_async(page)
                resp = await page.goto(target, wait_until="networkidle")
                # goto() may return None; treat a missing response as blocked
                ok = resp is not None and resp.ok
                results["success" if ok else "blocked"] += 1
            except Exception:
                results["error"] += 1
            finally:
                # Close the browser even when the request fails, so processes do not leak
                if browser is not None:
                    await browser.close()
    return results
Key findings
The data reveals a clear hierarchy: mobile carrier IPs consistently outperform every other proxy type across all target tiers. The gap widens dramatically at Tier-3 targets — where mobile proxies achieve a 99.4% success rate versus 62.1% for datacenter proxies.
Proxy type breakdown
Not all proxy types are equal. The chart below shows average success rates by proxy type across Tier-3 targets, the most demanding category. Mobile and residential proxies dominate; datacenter proxies fall well behind, succeeding on only 62.1% of requests.
Success Rate by Proxy Type
Mobile proxies lead with a 99.4% average success rate on Tier-3 targets
Fig 1. Average success rate across 500+ targets over 90 days (n=300,000 requests per proxy type). Tier-3 anti-bot targets only.
Target tier analysis
Tier classification matters enormously. At Tier-1 targets (no anti-bot), even datacenter proxies achieve 98%+ success. But as target sophistication increases, the gap between mobile and datacenter proxies grows from about 1 percentage point to over 37 percentage points.
- Tier-1: Static HTML, no bot detection. All proxy types achieve 95%+ success rates. Cost optimization is the primary driver here.
- Tier-2: Basic fingerprinting — JA3/JA4, User-Agent matching. Residential and mobile proxies maintain high success; datacenter begins to falter (~78%).
- Tier-3: Full behavioral analysis, TLS fingerprinting, and IP reputation scoring. Only mobile and residential proxies maintain acceptable success rates.
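The per-tier comparison above comes down to grouping request outcomes by proxy type and tier, then subtracting success rates. A minimal sketch of that aggregation follows; the record shape `(proxy_type, tier, succeeded)` and both function names are assumptions, not the study's actual tooling.

```python
from collections import defaultdict
from typing import Iterable


def success_rates(records: Iterable[tuple[str, int, bool]]) -> dict[tuple[str, int], float]:
    """Return the success rate in percent for each (proxy_type, tier) pair."""
    totals: dict[tuple[str, int], int] = defaultdict(int)
    wins: dict[tuple[str, int], int] = defaultdict(int)
    for proxy_type, tier, succeeded in records:
        key = (proxy_type, tier)
        totals[key] += 1
        wins[key] += bool(succeeded)
    return {key: 100.0 * wins[key] / totals[key] for key in totals}


def gap_pp(rates: dict[tuple[str, int], float], tier: int,
           a: str = "mobile", b: str = "datacenter") -> float:
    """Percentage-point gap between two proxy types at one tier."""
    return rates[(a, tier)] - rates[(b, tier)]
```

Plugging in the reported Tier-3 rates (99.4% mobile, 62.1% datacenter) gives a gap of 37.3 percentage points, which matches the "over 37" figure quoted in this section.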
Cost-per-success analysis
Raw success rate tells only part of the story. When you factor in price-per-GB and normalize for successful requests only, mobile proxies are often cheaper than residential on Tier-3 targets — because fewer requests are wasted on blocks and retries.
Teams scraping Tier-3 targets at scale that switched from residential to mobile proxies reduced their effective cost per successful request by 22%, despite a higher nominal price per GB.
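The normalization described here can be sketched as a short calculation. It assumes every attempt, successful or not, consumes the same bandwidth, which is a simplification since blocked responses are often smaller. The function names and parameters are illustrative; the study's prices and per-request sizes are not published in this write-up, so none are filled in.

```python
def cost_per_success(price_per_gb: float, gb_per_request: float, success_rate: float) -> float:
    """Effective cost of one successful request.

    Each success takes 1 / success_rate attempts on average, and every
    attempt is assumed to be billed for gb_per_request of traffic.
    """
    if not 0.0 < success_rate <= 1.0:
        raise ValueError("success_rate must be in (0, 1]")
    return price_per_gb * gb_per_request / success_rate


def relative_saving(baseline_cost: float, new_cost: float) -> float:
    """Fractional reduction in cost when moving from baseline_cost to new_cost."""
    return 1.0 - new_cost / baseline_cost
```

Under this model a pricier proxy type comes out cheaper per success whenever its success rate is higher by a larger factor than its price per GB.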
Recommendations
Based on our data, here is the recommended proxy strategy by use case:
- E-commerce scraping: Mobile proxies with per-request rotation. Most major retailers are now Tier-3 targets.
- Price monitoring: Residential with session rotation (5–10 requests per session). Balances cost and success at Tier-2 targets.
- LLM training data: Residential at scale with per-request rotation. Volume matters more than individual success rate.
- Brand protection / SERP: Mobile proxies. Search engines are the strictest Tier-3 targets in our dataset.
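For teams that want these recommendations in code, one possible encoding is a small lookup table. The use-case keys, `ProxyStrategy`, and `recommend` are hypothetical names; the rotation is left as `None` where the list above does not name one.

```python
from typing import NamedTuple, Optional


class ProxyStrategy(NamedTuple):
    proxy_type: str           # "mobile" | "residential" | "datacenter" | "isp"
    rotation: Optional[str]   # None where no rotation is specified above


RECOMMENDED = {
    "ecommerce_scraping": ProxyStrategy("mobile", "per_request"),
    "price_monitoring": ProxyStrategy("residential", "session"),
    "llm_training_data": ProxyStrategy("residential", "per_request"),
    "brand_protection_serp": ProxyStrategy("mobile", None),
}


def recommend(use_case: str) -> ProxyStrategy:
    """Look up the recommended strategy, failing loudly on unknown use cases."""
    try:
        return RECOMMENDED[use_case]
    except KeyError:
        raise ValueError(f"unknown use case: {use_case!r}") from None
```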
Conclusion
The proxy landscape in 2025 is more stratified than ever. As anti-bot systems grow more sophisticated, the gap between proxy types has widened — making proxy selection a first-order infrastructure decision, not an afterthought.
Mobile carrier IPs are the clear leaders for demanding targets. For teams scaling web data collection, the shift to mobile proxies with per-request rotation delivers the highest success rates and, counterintuitively, a lower effective cost-per-successful-request at Tier-3 targets.