news-please bot
What is news-please?
news-please is an open-source news crawler and information extractor specifically designed for news articles. Developed by researchers at the University of Konstanz in Germany, it was first released in 2017 and is available on GitHub under the Apache 2.0 license. Technically, news-please is a combined web crawler and content extractor that automates the collection and processing of news data from online sources.
The system works by crawling news websites to download articles' HTML, then extracting structured information from each article, including the title, lead paragraph, main text, publication date, authors, and main image. news-please combines the results of multiple state-of-the-art extractors to achieve higher extraction quality than any individual extractor. It identifies itself in user-agent strings as news-please.
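In library mode, a single call downloads and extracts one article. The following is a minimal sketch, assuming news-please has been installed with pip and using a placeholder article URL; the printed attributes correspond to the extracted fields described above:

from newsplease import NewsPlease

# Download one article and run the extraction pipeline on it (placeholder URL)
article = NewsPlease.from_url("https://www.example.com/some-news-article")

print(article.title)         # headline
print(article.description)   # lead paragraph / description
print(article.maintext)      # main article text
print(article.date_publish)  # publication date, if detected
print(article.authors)       # list of author names
print(article.image_url)     # URL of the main image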
A distinctive feature of news-please is its ability to perform full website extraction with minimal configuration—requiring only the root URL of a news outlet. It supports multiple crawling techniques including RSS feed analysis, recursive link following, sitemap analysis, and an automatic mode that combines these approaches. This allows it to retrieve both historical articles and newly published content.
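For a full-site crawl in CLI mode, the outlets to crawl are listed in a sitelist configuration file that news-please creates in its configuration directory on first run. The sketch below illustrates that format with placeholder outlets; the field names and crawler class names are taken from the project's documented examples and should be checked against the current README before use:

{
  "base_urls": [
    {
      # Automatic mode: news-please chooses a crawling strategy for this outlet
      "url": "https://www.example.com/"
    },
    {
      # Explicitly request RSS-based crawling for this outlet
      "url": "https://news.example.org/",
      "crawler": "RssCrawler"
    }
  ]
}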
Why is news-please crawling my site?
If you notice news-please crawling your website, it's most likely because your site publishes news content that researchers, data scientists, or analysts are interested in collecting for analysis. news-please typically targets news websites and looks for articles containing text content, publication dates, author information, and images.
The frequency of visits depends entirely on how the tool has been configured by its user. It might perform a one-time crawl to collect historical data, or it could be set up to regularly monitor your site for new articles. The crawling is triggered by the specific needs of the researcher or organization using the tool, who would have pointed news-please at your website's root URL.
It's important to note that news-please is a tool that can be used by anyone, so crawling may be authorized or unauthorized depending on who is using it and for what purpose. Academic researchers often use it for legitimate research purposes, but the tool itself doesn't enforce any particular usage policy.
What is the purpose of news-please?
news-please serves as a data collection and extraction tool primarily for research purposes. It supports various disciplines such as social sciences, linguistics, and media studies by enabling researchers to compile comprehensive datasets of news articles. The tool addresses the challenge of large-scale news data collection, which was previously cumbersome due to a lack of generic tools for crawling and extracting such data.
The collected data is typically used for research analyses, such as studying media framing, content analysis, or tracking how certain topics are covered across different news outlets. Unlike commercial crawlers, news-please is designed with academic and research needs in mind, offering a straightforward way to gather structured news data without requiring extensive technical expertise.
For website owners, news-please generally doesn't provide direct value, as it's primarily a data collection tool rather than a service that drives traffic or engagement. However, the research facilitated by news-please may contribute to broader understanding of media coverage and news dissemination patterns.
How do I block news-please?
news-please respects the robots.txt protocol, which means you can control its access to your website by adding appropriate directives to your robots.txt file. To completely block news-please from crawling your site, add the following to your robots.txt file:
User-agent: news-please
Disallow: /
Alternatively, if you want to allow news-please to access your site but restrict it from certain sections, you can specify particular paths to disallow:
User-agent: news-please
Disallow: /private/
Disallow: /members-only/
Allow: /
If you want to specifically allow news-please to crawl your site while blocking other crawlers, you can pair an allow rule for news-please with a wildcard rule that disallows everyone else:
User-agent: news-please
Allow: /

User-agent: *
Disallow: /
Blocking news-please will prevent researchers from automatically collecting data from your site using this tool, which may limit the inclusion of your content in research studies or analyses. However, it won't affect your site's visibility to regular users or search engines. Keep in mind that while news-please respects robots.txt, some users of the tool might modify it to ignore these restrictions, though this would be against the intended usage of the tool.
Data collector, AI model training
User Agent: news-please