Accepted at IEEE S&P 2026

WebCloak: Characterizing and Mitigating the Threats of LLM-Driven Web Agents as Intelligent Scrapers

Visitors   License: CC BY-NC-SA 4.0   GitHub stars

Nanyang Technological University, Hong Kong Polytechnic University, §University of Hawaii at Manoa
Agent Mechanism

Mechanism of LLM-driven scrapers

WebCloak Teaser Image

Design overview of WebCloak

Multi-Modal Protection

WebCloak provides comprehensive protection against malicious scraping of various content types. Beyond visual assets, WebCloak also effectively protects text content and audio assets from being extracted by LLM-driven web agents, as demonstrated in the videos below.

Unprotected (Text + Image + Audio)

Note: LibriVox is a popular website for free recordings.

WebCloak (Text + Image + Audio)

 

Unprotected (Text + Image)

WebCloak (Text + Image)


Abstract

The rise of web agents powered by large language models (LLMs) is reshaping the landscape of human-computer interaction, enabling users to automate complex web tasks with natural language commands. However, this progress introduces serious, yet largely unexplored security concerns: adversaries can easily employ such web agents to conduct advanced web scraping, particularly of rich visual content. This paper presents the first systematic characterization of the danger represented by such LLM-driven web agents as intelligent scrapers. We develop LLMCrawlBench, a large test set of 237 extracted real-world webpages (10,895 images) from 50 popular high-traffic websites in 5 critical categories, designed specifically for adversarial image extraction evaluation. Our wide-ranging metrics across over 32 various scraper implementations, including LLM-to-Script (L2S), LLM-Native Crawlers (LNC), and LLM-based web agents (LWA), demonstrate that while some tools exhibit working issues, most sophisticated LLM-powered frameworks significantly lower the bar for effective scraping.

Such new agent-as-attacker threats motivate us to introduce WebCloak, an effective, lightweight defense that specifically targets the main weakness of LLM crawler agents' fundamental "Parse-then-Interpret" mechanism. Our key idea is dual-layered: (1) Dynamic Structural Obfuscation, which not only randomizes structural cues but also restores visual content client-side using non-traditional methods less amenable to direct LLM exploitation, and (2) Optimized Semantic Labyrinth to mislead the central LLM interpretation of the agent through added harmless-yet-misleading contextual clues, all while not sacrificing perfect visual quality for legitimate users. Our evaluations demonstrate that WebCloak significantly reduces scraping recall rates from 88.7% to 0% against leading LLM-driven scraping agents, offering a robust and practical countermeasure.


LLMCrawlBench Dataset

LLMCrawlBench is a comprehensive benchmark dataset featuring 237 real-world webpages with 10,895 high-quality images extracted from 50 popular high-traffic websites across 5 critical categories (Marketplaces, Social Media, News, Education, and Entertainment).

Originally designed to systematically evaluate LLM-driven web scraping threats and illicit visual asset extraction, LLMCrawlBench has the potential to support multiple research domains:

  • Web Agent Security & Robustness Testing: Benchmark LLM-driven agents' behavior on protected vs. unprotected content
  • Web IP Protection Research: Develop and evaluate novel defense mechanisms against automated scraping
  • LLM Behavior Analysis: Study how large language models interpret and interact with complex web structures
  • Crawler Detection & Defense: Train and test anti-bot systems using realistic adversarial scenarios
  • Web Structure Analysis: Analyze diverse HTML patterns and DOM structures from real production websites
  • Visual Content Extraction Evaluation: Assess computer vision and multimodal models on real-world webpage layouts
  • Adversarial ML Research: Create obfuscation techniques and study model robustness against perturbations
  • Accessibility & Web Standards: Evaluate how automated tools handle diverse web technologies and standards

Each webpage in the dataset is carefully annotated with ground-truth image locations, categories, and metadata. We are also expanding the dataset with additional large-scale text and audio content, which will be released soon.

Dataset

Pipeline

LLM-driven scraping agents often reply on webpage parsing and interpretation. To mitigate such evolving threats, we introduce dual-layer WebCloak with (1) dynamic obfuscation and (2) semantic labyrinth, as shown below. WebCloak aims to be a lightweight, “in-page” solution that transforms a standard webpage into a self-protecting asset, without relying on external tools or heavy server-side interventions.

Dynamic Structural Obfuscation

WebCloak S1

Optimized Semantic Labyrinth

WebCloak S2

Demonstration

As demonstrated by these videos below, after being protected by WebCloak, LLM-driven web agents can no longer extract any useful assets like images from the webpage, while the experience for real users remains intact.

Unprotected

WebCloak Protected