Diffbot
What is Diffbot?
Diffbot is an AI-powered web scraping and knowledge extraction platform based in Menlo Park, California. Founded in 2008, Diffbot uses machine learning and computer vision technologies to automatically extract and structure data from websites.
Diffbot functions as a sophisticated web crawler that parses and interprets web content much as a human reader would. Unlike traditional web scrapers that rely on predefined rules or patterns, Diffbot uses visual learning algorithms to recognize standard page types (such as articles, products, or people profiles) and extract structured data from them.
The technology identifies itself in server logs with the user-agent string Diffbot and is designed to crawl websites to extract specific types of information based on the page type it detects. Diffbot's crawlers are programmed to understand web page layouts and can distinguish between primary content and supplementary elements like navigation, advertisements, and footers.
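To see whether Diffbot has visited a site, one option is to scan the access log for its user-agent string. The following is a minimal sketch, assuming logs in the common "combined" format where the user agent is the last quoted field; the function names and the sample log lines are made up for illustration, and real Diffbot requests may carry a longer user-agent string that merely contains "Diffbot".

```python
import re
from collections import Counter

# Assumption: "combined" log format, where the user agent is the last
# double-quoted field on each line.
UA_FIELD = re.compile(r'"([^"]*)"\s*$')
# The request field looks like "GET /path HTTP/1.1".
REQUEST_FIELD = re.compile(r'"[A-Z]+ (\S+) [^"]*"')


def is_diffbot(log_line: str) -> bool:
    """Return True if the line's user-agent field mentions Diffbot."""
    match = UA_FIELD.search(log_line)
    return bool(match) and "diffbot" in match.group(1).lower()


def count_diffbot_paths(lines):
    """Count the paths requested on lines attributed to Diffbot."""
    counts = Counter()
    for line in lines:
        if is_diffbot(line):
            request = REQUEST_FIELD.search(line)
            if request:
                counts[request.group(1)] += 1
    return counts


# Hypothetical sample lines (documentation-range IP addresses).
SAMPLE = [
    '203.0.113.5 - - [10/Oct/2024:13:55:36 +0000] "GET /products/1 HTTP/1.1" 200 512 "-" "Diffbot"',
    '198.51.100.7 - - [10/Oct/2024:13:55:40 +0000] "GET /about HTTP/1.1" 200 1024 "-" "Mozilla/5.0 (X11; Linux x86_64)"',
]

if __name__ == "__main__":
    for path, hits in count_diffbot_paths(SAMPLE).items():
        print(path, hits)
```

Matching on a case-insensitive substring rather than an exact string keeps the check tolerant of version suffixes or surrounding browser-style tokens.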
Diffbot's distinguishing feature is its automatic extraction, which turns unstructured web content into structured data. This makes it valuable for organizations that need to collect and analyze web information systematically.
Why is Diffbot crawling my site?
Diffbot may be crawling your site for several reasons, primarily to extract and structure data for its clients who use the Diffbot API or services. These clients range from businesses conducting market research to developers building applications that require web data.
The bot typically looks for specific types of content depending on what Diffbot's clients are interested in, such as:
- Product information (prices, specifications, availability)
- News and article content
- Business and people profiles
- Job listings
- Discussion forum content
Crawling frequency varies based on the importance of your site to Diffbot's clients and how often your content changes. High-value or frequently updated sites may see more regular visits.
Diffbot's crawling is generally initiated when a client specifically requests data from your site or when Diffbot is building or updating its knowledge graph. While Diffbot is a legitimate service used by many businesses, its crawling is not always explicitly authorized by website owners, though it attempts to be a good web citizen by respecting standard crawling protocols.
What is the purpose of Diffbot?
Diffbot's core purpose is to transform the unstructured web into structured, machine-readable data that can be used for analysis, integration, and application development. It serves as an extraction and knowledge management system that helps organizations make sense of web content at scale.
The service supports various use cases including:
- Competitive intelligence gathering
- Product and pricing monitoring
- Content aggregation and syndication
- Research and data analysis
- Building knowledge graphs and databases
- Powering AI applications that require web data
The data collected by Diffbot is processed, structured, and made available to its clients through APIs and other services. While website owners don't directly benefit from Diffbot's crawling, the technology does contribute to the broader web ecosystem by making information more accessible and usable.
For websites being crawled, the primary consideration is the increased server load from Diffbot's visits, though the company aims to maintain reasonable crawl rates to minimize impact.
How do I block Diffbot?
Diffbot respects the robots.txt protocol, making it relatively straightforward to control its access to your site. To block Diffbot completely, add these directives to your robots.txt file:
User-agent: Diffbot
Disallow: /
To block Diffbot from specific sections of your site:
User-agent: Diffbot
Disallow: /private-section/
Disallow: /members-only/
Allow: /
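Before deploying rules like these, it can help to check how a standards-following parser reads them. The sketch below uses Python's standard-library urllib.robotparser for that check; the helper name diffbot_allowed and the example.com URLs are placeholders, and the result shows how Python's parser interprets the rules, not necessarily how Diffbot's own crawler does.

```python
from urllib.robotparser import RobotFileParser

# The section-level rules from the example above.
ROBOTS_TXT = """\
User-agent: Diffbot
Disallow: /private-section/
Disallow: /members-only/
Allow: /
"""


def diffbot_allowed(robots_txt: str, url: str) -> bool:
    """Return True if the given rules let the Diffbot user agent fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch("Diffbot", url)


if __name__ == "__main__":
    # example.com is a placeholder domain.
    for url in (
        "https://example.com/private-section/report",
        "https://example.com/blog/post",
    ):
        print(url, diffbot_allowed(ROBOTS_TXT, url))
```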
If you prefer more granular control, you can use HTTP response headers or robots meta tags to indicate how crawlers such as Diffbot should handle specific pages. You can also add a "nofollow" attribute to individual links you don't want crawled.
For websites using server-side controls, you can identify Diffbot by its user-agent string and implement conditional access rules.
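As one illustration of a server-side rule, the sketch below wraps a Python WSGI application so that any request whose User-Agent header mentions Diffbot receives a 403 response. It is a minimal example under stated assumptions: the names block_diffbot and hello_app are hypothetical, and the same check could equally be written as a web-server or firewall rule.

```python
def block_diffbot(app):
    """Wrap a WSGI app so requests whose User-Agent mentions Diffbot get a 403."""

    def middleware(environ, start_response):
        # WSGI exposes the User-Agent header as HTTP_USER_AGENT.
        user_agent = environ.get("HTTP_USER_AGENT", "")
        if "diffbot" in user_agent.lower():
            body = b"Forbidden"
            start_response(
                "403 Forbidden",
                [
                    ("Content-Type", "text/plain"),
                    ("Content-Length", str(len(body))),
                ],
            )
            return [body]
        # Any other client is passed through to the wrapped application.
        return app(environ, start_response)

    return middleware


def hello_app(environ, start_response):
    """Hypothetical stand-in for the site's real application."""
    start_response("200 OK", [("Content-Type", "text/plain")])
    return [b"Hello"]


application = block_diffbot(hello_app)
```

Because a user-agent header can be set to anything by the client, a rule like this only filters crawlers that identify themselves honestly.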
If you're experiencing excessive crawling that's impacting site performance, you can also reach out to Diffbot directly to discuss crawl rate adjustments.
Blocking Diffbot will prevent your content from being included in their knowledge graph and any services their clients use. This may reduce your content's visibility in certain contexts but will also decrease the server load from their crawling activities.
User agent: Diffbot