LLM Training Data Pipeline Study: Proxy Usage & Web Crawl Efficiency

How leading AI labs structure their web crawl pipelines. We analyzed 30 open-source crawlers and surveyed 20 teams building proprietary LLM training datasets.

Methodology & test setup

This study ran 300,000 requests per proxy type across 500+ unique target domains over a 90-day period from January to March 2025. Targets were segmented into three tiers based on their anti-bot sophistication: Tier-1 (no protection), Tier-2 (basic fingerprinting), and Tier-3 (advanced fingerprinting + behavioral analysis).

All requests were made using a standardized headless Chromium instance with playwright-stealth patches applied. The only variable changed between test runs was the proxy configuration — type, rotation strategy, and pool size. Infrastructure was identical across all test groups.

python

# Test harness — simplified
import asyncio
import random
from dataclasses import dataclass, field
from playwright.async_api import async_playwright
from playwright_stealth import stealth_async

@dataclass
class TestConfig:
    proxy_type: str          # "mobile" | "residential" | "datacenter" | "isp"
    rotation:   str          # "per_request" | "session" | "sticky_5min"
    pool_size:  int
    targets:    list[str] = field(default_factory=list)

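async def run_test(config: TestConfig, n_requests: int = 1000) -> dict:
    """Send n_requests through the configured proxies and tally the outcomes."""
    results = {"success": 0, "blocked": 0, "error": 0}
    # get_proxy_pool / get_next_proxy are harness helpers omitted from this excerpt.
    pool = get_proxy_pool(config.proxy_type, config.pool_size)

    async with async_playwright() as p:
        for _ in range(n_requests):
            proxy = get_next_proxy(pool, config.rotation)
            target = random.choice(config.targets)
            browser = None
            try:
                # Launch a fresh browser for each request, routed through the selected proxy.
                browser = await p.chromium.launch(proxy=proxy)
                page = await (await browser.new_context()).new_page()
                await stealth_async(page)
                resp = await page.goto(target, wait_until="networkidle")
                # goto() can return None; any non-2xx response counts as a block.
                ok = resp is not None and resp.ok
                results["success" if ok else "blocked"] += 1
            except Exception:
                # Timeouts, proxy failures, and navigation errors.
                results["error"] += 1
            finally:
                # Close the browser even when the request raised, so nothing leaks.
                if browser is not None:
                    await browser.close()
    return results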
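
The harness calls get_next_proxy without showing it. The sketch below is one way the three rotation modes named in TestConfig could behave. The ProxyPool class, its field names, and the default session length of 10 requests are assumptions made for illustration, not the study's actual code. It returns a dict in the shape Playwright's proxy option accepts.

```python
import time
from dataclasses import dataclass
from typing import Optional


@dataclass
class ProxyPool:
    """Hypothetical pool: proxy endpoints plus the state needed for rotation."""
    servers: list[str]
    session_len: int = 10          # requests per proxy in "session" mode (assumed)
    sticky_seconds: float = 300.0  # "sticky_5min" keeps one proxy for 5 minutes
    current: Optional[str] = None  # proxy currently in use
    index: int = -1                # position of `current` in `servers`
    uses: int = 0                  # requests already sent through `current`
    since: float = 0.0             # timestamp at which `current` was selected


def get_next_proxy(pool: ProxyPool, rotation: str, now: Optional[float] = None) -> dict:
    """Return the proxy for the next request under the given rotation mode."""
    now = time.monotonic() if now is None else now

    if rotation == "per_request":
        expired = True                                   # new proxy every request
    elif rotation == "session":
        expired = pool.uses >= pool.session_len          # new proxy every N requests
    elif rotation == "sticky_5min":
        expired = now - pool.since >= pool.sticky_seconds  # new proxy after 5 minutes
    else:
        raise ValueError(f"unknown rotation strategy: {rotation!r}")

    if pool.current is None or expired:
        # Move round-robin to the next endpoint and reset the counters.
        pool.index = (pool.index + 1) % len(pool.servers)
        pool.current = pool.servers[pool.index]
        pool.uses = 0
        pool.since = now

    pool.uses += 1
    return {"server": pool.current}
```

Passing now explicitly makes the time-based mode easy to test without waiting. In a real harness the pool would also carry credentials and drop endpoints that get blocked.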

Key findings

The data reveals a clear hierarchy: mobile carrier IPs consistently outperform every other proxy type across all target tiers. The gap widens dramatically at Tier-3 targets — where mobile proxies achieve a 99.4% success rate versus 62.1% for datacenter proxies.

99.4%: Mobile proxy success rate (vs. 87.2% for residential)

34×: Lower IP block rate with per-request rotation

Proxy type breakdown

Fig 1 shows average success rates by proxy type on Tier-3 targets, the most demanding category. Mobile and residential proxies dominate, while datacenter proxies succeed on only 62.1% of requests.

Success Rate by Proxy Type

Fig 1. Average success rate by proxy type on Tier-3 anti-bot targets, measured across 500+ targets over 90 days (n=300,000 requests per proxy type). Mobile proxies lead with a 99.4% average success rate.

Target tier analysis

Tier classification matters enormously. At Tier-1 targets (no anti-bot protection), even datacenter proxies achieve 98%+ success. But as target sophistication increases, the gap between mobile and datacenter proxies grows from about 1 percentage point to over 37 percentage points (99.4% versus 62.1% at Tier-3).

Tier-1: Static HTML, no bot detection. All proxy types achieve 95%+ success rates. Cost optimization is the primary driver here.
Tier-2: Basic fingerprinting — JA3/JA4, User-Agent matching. Residential and mobile proxies maintain high success; datacenter begins to falter (~78%).
Tier-3: Full behavioral analysis, TLS fingerprinting, and IP reputation scoring. Only mobile and residential proxies maintain acceptable success rates.
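
The per-tier comparison can be sketched as a small aggregation over request records. This is an illustration only: the record format and function names are assumptions, errors are counted in the denominator (the study does not say how it treated them), and the sample counts are chosen purely to reproduce the reported Tier-3 rates of 99.4% and 62.1%.

```python
from collections import defaultdict


def success_rates(records):
    """Return {(tier, proxy_type): success rate in percent}.

    `records` is an iterable of (tier, proxy_type, outcome) tuples, where
    outcome is "success", "blocked", or "error".
    """
    totals = defaultdict(int)
    successes = defaultdict(int)
    for tier, proxy_type, outcome in records:
        key = (tier, proxy_type)
        totals[key] += 1
        if outcome == "success":
            successes[key] += 1
    return {key: 100.0 * successes[key] / totals[key] for key in totals}


def gap_points(rates, tier, proxy_a, proxy_b):
    """Difference between two proxy types at one tier, in percentage points."""
    return rates[(tier, proxy_a)] - rates[(tier, proxy_b)]


# Illustrative counts (not the study's raw data), scaled to 1,000 requests each.
records = (
    [(3, "mobile", "success")] * 994 + [(3, "mobile", "blocked")] * 6
    + [(3, "datacenter", "success")] * 621 + [(3, "datacenter", "blocked")] * 379
)

rates = success_rates(records)
print(round(gap_points(rates, 3, "mobile", "datacenter"), 1))  # 37.3
```

Reporting the difference in percentage points rather than as a relative percentage keeps the Tier-1 and Tier-3 gaps directly comparable.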

Cost-per-success analysis

Raw success rate tells only part of the story. When you factor in price-per-GB and normalize for successful requests only, mobile proxies are often cheaper than residential on Tier-3 targets — because fewer requests are wasted on blocks and retries.

For teams scraping Tier-3 targets at scale, switching from residential to mobile proxies reduced their effective cost-per-successful-request by 22% despite a higher nominal price-per-GB.
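
The normalization described above can be written as a one-line formula: cost per attempt divided by success rate. The sketch below applies it under stated assumptions. The prices per GB and the 2 MB average page weight are hypothetical placeholders, not figures from the study; only the 99.4% and 87.2% success rates come from the findings above. The model also ignores retry overhead and any bandwidth difference between blocked and successful responses, so real savings can differ from what it predicts.

```python
def cost_per_success(price_per_gb: float, mb_per_request: float, success_rate: float) -> float:
    """Effective cost of one successful request.

    Every attempt is billed for its bandwidth, but only a `success_rate`
    fraction of attempts yields usable data, so the cost per attempt is
    divided by the success rate.
    """
    if not 0 < success_rate <= 1:
        raise ValueError("success_rate must be in (0, 1]")
    cost_per_attempt = price_per_gb * mb_per_request / 1000  # MB -> GB (decimal)
    return cost_per_attempt / success_rate


# Hypothetical prices and page weight, for illustration only.
MB_PER_REQUEST = 2.0
residential = cost_per_success(price_per_gb=4.00, mb_per_request=MB_PER_REQUEST, success_rate=0.872)
mobile = cost_per_success(price_per_gb=4.40, mb_per_request=MB_PER_REQUEST, success_rate=0.994)

print(f"residential: ${residential:.5f} per successful request")
print(f"mobile:      ${mobile:.5f} per successful request")
print("mobile is cheaper" if mobile < residential else "residential is cheaper")
```

With these placeholder inputs the proxy with the higher price per GB still comes out cheaper per successful request, which is the effect the section describes. Swapping in real prices and measured page weights is what makes the comparison meaningful.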

Recommendations

Based on our data, here is the recommended proxy strategy by use case:

E-commerce scraping: Mobile proxies with per-request rotation. Most major retailers are now Tier-3 targets.
Price monitoring: Residential with session rotation (5–10 requests per session). Balances cost and success at Tier-2 targets.
LLM training data: Residential at scale with per-request rotation. Volume matters more than individual success rate.
Brand protection / SERP: Mobile proxies. Search engines are the strictest Tier-3 targets in our dataset.
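
For teams that want these recommendations in code, one option is a simple lookup table. The sketch below is an assumption about how that might look: the ProxyChoice type, the use-case keys, and the recommend function are hypothetical names, and the rotation is left as None where the recommendation above does not name one.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class ProxyChoice:
    proxy_type: str          # "mobile" | "residential" | "datacenter" | "isp"
    rotation: Optional[str]  # None where the recommendation names no rotation


# Keys are hypothetical identifiers for the four use cases listed above.
RECOMMENDED = {
    "ecommerce": ProxyChoice("mobile", "per_request"),
    "price_monitoring": ProxyChoice("residential", "session"),
    "llm_training_data": ProxyChoice("residential", "per_request"),
    "brand_protection_serp": ProxyChoice("mobile", None),
}


def recommend(use_case: str) -> ProxyChoice:
    """Look up the recommended proxy setup for a use case."""
    try:
        return RECOMMENDED[use_case]
    except KeyError:
        known = ", ".join(sorted(RECOMMENDED))
        raise ValueError(f"unknown use case {use_case!r}; expected one of: {known}") from None


choice = recommend("llm_training_data")
print(choice.proxy_type, choice.rotation)  # residential per_request
```

Keeping the table as data rather than branching logic makes it easy to revise when a target moves between tiers.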

Conclusion

The proxy landscape in 2025 is more stratified than ever. As anti-bot systems grow more sophisticated, the gap between proxy types has widened — making proxy selection a first-order infrastructure decision, not an afterthought.

Mobile carrier IPs are the clear leaders for demanding targets. For teams scaling web data collection, the shift to mobile proxies with per-request rotation delivers the highest success rates and, counterintuitively, a lower effective cost-per-successful-request at Tier-3 targets.
