The war over web data just got messier. Cloudflare has accused Perplexity AI, a fast-growing AI-powered search engine, of secretly crawling websites to collect content, even when those websites have explicitly instructed bots to stay out. According to Cloudflare, Perplexity has been disguising its identity, rotating IP addresses, and ignoring robots.txt files, the standard mechanism websites use to tell crawlers “do not scrape my content.”
If the allegations hold, Perplexity is accessing and using content without permission, raising serious questions about how AI companies gather web data and whether they are playing by the internet’s established rules.
The Core Allegation
According to Cloudflare, Perplexity initially identifies itself correctly when crawling sites. However, when faced with network blocks or restrictions via robots.txt files, it allegedly switches tactics by:
- Modifying its user agent to disguise crawling activity
- Rotating IP addresses and ASNs to bypass restrictions
- Using undeclared crawlers in addition to its public bots (PerplexityBot and Perplexity-User)
- Ignoring robots.txt directives or, in some cases, never requesting the file at all (a sketch of the check a compliant crawler performs follows this list)
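For context, honoring robots.txt is straightforward; Python even ships a parser for it in the standard library. The sketch below shows the check a compliant crawler performs before fetching a page. The domain and URL are illustrative, and the two user agents are the crawlers Perplexity publicly declares.

```python
# Sketch of the robots.txt check a well-behaved crawler performs before
# fetching a page, using Python's standard library. The domain and URL are
# illustrative; "PerplexityBot" and "Perplexity-User" are the crawlers
# Perplexity publicly declares.
from urllib import robotparser

parser = robotparser.RobotFileParser()
parser.set_url("https://example.com/robots.txt")
parser.read()  # fetch and parse the site's robots.txt

page = "https://example.com/articles/some-page.html"
for agent in ("PerplexityBot", "Perplexity-User"):
    verdict = "allowed" if parser.can_fetch(agent, page) else "blocked"
    print(f"{agent}: {verdict} for {page}")

# A compliant crawler requests the page only when can_fetch() returns True.
# The behavior Cloudflare alleges either skips this step or ignores its result.
```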
Evidence from Cloudflare’s Investigation
Cloudflare claims it launched an investigation after receiving multiple complaints from customers who had explicitly prohibited Perplexity’s crawlers through robots.txt and Web Application Firewall (WAF) rules. Despite these measures, customers reported that Perplexity continued accessing their content.
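A user-agent based block, the kind of rule a WAF or origin server typically applies on top of robots.txt, can be as simple as the following generic Python sketch. This is not Cloudflare's actual rule engine, and the user-agent strings are representative rather than exact; the point is that such rules only work when crawlers identify themselves honestly.

```python
# Generic illustration of a user-agent based block. This is not Cloudflare's
# rule engine; it only shows why such rules depend on crawlers identifying
# themselves honestly. User-agent strings are representative, not exact.
BLOCKED_BOTS = ("perplexitybot", "perplexity-user")  # Perplexity's declared crawlers

def should_block(user_agent: str) -> bool:
    """Return True when the User-Agent header matches a blocked crawler."""
    ua = user_agent.lower()
    return any(bot in ua for bot in BLOCKED_BOTS)

print(should_block("Mozilla/5.0 (compatible; PerplexityBot/1.0; +https://perplexity.ai/perplexitybot)"))
# True: the declared crawler is stopped
print(should_block("Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 Chrome/124.0.0.0 Safari/537.36"))
# False: a browser-style user agent slips through
```

A rule like this stops a crawler only if it announces itself, which is why the alleged user-agent spoofing matters.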
To verify, Cloudflare:
- Created controlled test environments using brand-new domains, implementing strict robots.txt rules to block all bots
- Observed that Perplexity’s bots still retrieved restricted content
- Detected attempts by Perplexity to impersonate a generic browser, presenting a user agent that mimics Google Chrome on macOS when its declared crawler was blocked (see the sketch after this list)
- Traced undeclared crawlers using machine learning and network signal analysis across tens of thousands of domains and millions of requests per day
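Cloudflare's actual detection relies on machine learning over many network signals, but the underlying intuition can be conveyed with a toy heuristic: a request that claims to be Chrome on macOS yet lacks headers a real browser routinely sends looks automated. The header names below are illustrative assumptions, not Cloudflare's signals.

```python
# Toy heuristic conveying the idea behind fingerprint-based detection: a
# request that claims to be Chrome on macOS but lacks headers a real browser
# routinely sends looks automated. Cloudflare's actual detection uses machine
# learning over many network signals; the header names here are assumptions.
EXPECTED_BROWSER_HEADERS = {"accept-language", "accept-encoding", "sec-ch-ua"}

def looks_like_spoofed_browser(headers: dict) -> bool:
    ua = headers.get("user-agent", "").lower()
    claims_mac_chrome = "chrome" in ua and "macintosh" in ua
    present = {name.lower() for name in headers}
    return claims_mac_chrome and bool(EXPECTED_BROWSER_HEADERS - present)

suspect_request = {
    "user-agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) "
                  "AppleWebKit/537.36 (KHTML, like Gecko) Chrome/124.0.0.0 Safari/537.36",
    "accept": "*/*",
}
print(looks_like_spoofed_browser(suspect_request))  # True: browser UA, sparse headers
```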
The Technical Breakdown
Cloudflare’s findings show that Perplexity’s undeclared crawlers were:
- Using IP addresses outside Perplexity’s officially published ranges (a simple version of this check is sketched below)
- Rotating through these IPs and switching ASNs to avoid detection
- Scraping at large scale, spanning tens of thousands of domains and millions of requests per day
In addition, Cloudflare reports that Perplexity continued to return detailed responses about the content of the restricted test domains, even though its crawlers were explicitly blocked from accessing them.
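The IP-range point is easy to illustrate: verified crawlers publish the ranges they operate from, so a request arriving from outside those ranges cannot be attributed to the declared bot. The ranges in this sketch are RFC 5737 documentation placeholders, not Perplexity's actual published ranges.

```python
# Sketch of an IP-range check using the standard ipaddress module. Verified
# crawlers publish the ranges they crawl from; a request from outside those
# ranges cannot be attributed to the declared bot. The ranges below are
# RFC 5737 documentation placeholders, not Perplexity's published ranges.
import ipaddress

PUBLISHED_RANGES = [ipaddress.ip_network(r) for r in ("192.0.2.0/24", "198.51.100.0/24")]

def from_published_range(source_ip: str) -> bool:
    ip = ipaddress.ip_address(source_ip)
    return any(ip in network for network in PUBLISHED_RANGES)

print(from_published_range("192.0.2.17"))   # True: inside a declared range
print(from_published_range("203.0.113.9"))  # False: an undeclared source, as in Cloudflare's findings
```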
Why This Matters
For decades, the internet has operated on an implicit foundation of trust between site owners and automated crawlers. Protocols like robots.txt exist to balance functionality and fairness, ensuring that sites can manage automated access without resorting to aggressive measures.
Cloudflare’s statement underscores this principle:
“The Internet as we have known it for the past three decades is rapidly changing, but one thing remains constant: it is built on trust.”
Violations of this trust, the company warns, undermine the principles that allow the web, and by extension the AI systems built on top of it, to function transparently.
The Broader Implications
This controversy isn’t just about one company. It raises broader questions about:
- Ethical AI Development: Should AI companies honor standard web protocols, or is aggressive data acquisition a necessary evil in the race for better models?
- Data Ownership and Consent: Who controls the content that AI scrapes, and how should consent be enforced?
- Industry Regulation: Will this prompt calls for stronger governance and legal frameworks around AI-driven crawling?
Perplexity’s Response?
As of this writing, Perplexity has not issued an official statement addressing Cloudflare’s claims. The company, known for its rapid rise as a conversational AI competitor, now faces scrutiny not only from the tech community but potentially from regulators concerned with compliance and data ethics.
Bottom Line
The Cloudflare-Perplexity standoff signals the beginning of a larger battle over how AI companies acquire data, and whether transparency will remain a cornerstone of the internet or become collateral damage in the AI arms race.