Dynamic Structural Obfuscation

WebCloak provides comprehensive protection against malicious scraping of various content types. Beyond visual assets, WebCloak also effectively protects text content and audio assets from being extracted by LLM-driven web agents, as demonstrated in the videos below.
Note: LibriVox is a popular website for free recordings.
The rise of web agents powered by large language models (LLMs) is reshaping the landscape of human-computer interaction, enabling users to automate complex web tasks with natural language commands. However, this progress introduces serious, yet largely unexplored security concerns: adversaries can easily employ such web agents to conduct advanced web scraping, particularly of rich visual content. This paper presents the first systematic characterization of the danger represented by such LLM-driven web agents as intelligent scrapers. We develop LLMCrawlBench, a large test set of 237 extracted real-world webpages (10,895 images) from 50 popular high-traffic websites in 5 critical categories, designed specifically for adversarial image extraction evaluation. Our wide-ranging metrics across over 32 various scraper implementations, including LLM-to-Script (L2S), LLM-Native Crawlers (LNC), and LLM-based web agents (LWA), demonstrate that while some tools exhibit working issues, most sophisticated LLM-powered frameworks significantly lower the bar for effective scraping.
Such new agent-as-attacker threats motivate us to introduce WebCloak, an effective, lightweight defense that specifically targets the main weakness of LLM crawler agents' fundamental "Parse-then-Interpret" mechanism. Our key idea is dual-layered: (1) Dynamic Structural Obfuscation, which not only randomizes structural cues but also restores visual content client-side using non-traditional methods less amenable to direct LLM exploitation, and (2) Optimized Semantic Labyrinth to mislead the central LLM interpretation of the agent through added harmless-yet-misleading contextual clues, all while not sacrificing perfect visual quality for legitimate users. Our evaluations demonstrate that WebCloak significantly reduces scraping recall rates from 88.7% to 0% against leading LLM-driven scraping agents, offering a robust and practical countermeasure.
LLMCrawlBench is a comprehensive benchmark dataset featuring 237 real-world webpages with 10,895 high-quality images extracted from 50 popular high-traffic websites across 5 critical categories (Marketplaces, Social Media, News, Education, and Entertainment).
Originally designed to systematically evaluate LLM-driven web scraping threats and illicit visual asset extraction, LLMCrawlBench has the potential to support multiple research domains:
Each webpage in the dataset is carefully annotated with ground-truth image locations, categories, and metadata. We are also expanding the dataset with additional large-scale text and audio content, which will be released soon.
LLM-driven scraping agents often reply on webpage parsing and interpretation. To mitigate such evolving threats, we introduce dual-layer WebCloak with (1) dynamic obfuscation and (2) semantic labyrinth, as shown below. WebCloak aims to be a lightweight, “in-page” solution that transforms a standard webpage into a self-protecting asset, without relying on external tools or heavy server-side interventions.
As demonstrated by these videos below, after being protected by WebCloak, LLM-driven web agents can no longer extract any useful assets like images from the webpage, while the experience for real users remains intact.