Kangaroo Bot

What is Kangaroo Bot?

Kangaroo Bot is a specialized web crawler developed and operated by Kangaroo LLM, an Australian AI consortium. This data scraper systematically collects textual content from Australian websites to build a comprehensive dataset that captures the unique linguistic patterns and cultural nuances of Australian English. The bot identifies itself in server logs with the user-agent string "Kangaroo Bot" and primarily targets Australian domains and servers physically located within Australia.

Unlike global AI scrapers, Kangaroo Bot focuses exclusively on Australian content, employing geographic filtering mechanisms to ensure data relevance. It follows standard web crawling etiquette by respecting robots.txt exclusion protocols, limiting request rates to approximately 2.3 requests per second per domain, and implementing exponential backoff when encountering server errors. The crawler rotates its requests across IP addresses drawn from Australian ASNs (Autonomous System Numbers) to maintain data provenance while minimizing the risk of IP-based blocking.
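
The pacing and retry behavior described above can be sketched in Python. This is a minimal illustration, not Kangaroo Bot's actual code: the function names, the 1-second base delay, the doubling factor, the 60-second cap, and the `fetch` callable are all assumptions. Only the figure of roughly 2.3 requests per second comes from the description above.

```python
import time


def min_interval(requests_per_second=2.3):
    """Seconds to leave between requests to one domain at the given rate."""
    return 1.0 / requests_per_second


def backoff_delays(base=1.0, factor=2.0, cap=60.0, retries=5):
    """Waits before each retry: base, base*factor, base*factor**2, ... capped at cap.

    base, factor, cap and retries are illustrative values, not published settings.
    """
    return [min(cap, base * factor ** attempt) for attempt in range(retries)]


def fetch_with_backoff(fetch, url, retries=5, sleep=time.sleep):
    """Call fetch(url) until it returns a non-5xx status or retries run out.

    fetch is a hypothetical callable that returns an HTTP status code.
    """
    status = fetch(url)
    for delay in backoff_delays(retries=retries):
        if status < 500:
            break
        sleep(delay)  # wait longer after each consecutive server error
        status = fetch(url)
    return status
```

A crawler built this way would sleep for `min_interval()` between ordinary requests and switch to the growing delays only after a server error, so a struggling site sees progressively less traffic.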

Why is Kangaroo Bot crawling my site?

If Kangaroo Bot is visiting your website, it's likely because your site contains Australian content that could contribute to training Australia's first open-source large language model. The bot prioritizes websites based on several factors:

  • Lexical density (words per page)
  • Update frequency
  • Presence of user-generated content
  • Domain authority metrics

Kangaroo Bot particularly targets content-rich resources like forums, news outlets, and educational institutions that demonstrate high levels of vernacular language use. If your site uses an Australian domain (.au) or is hosted on servers physically located in Australia, it's more likely to be visited by this crawler.

The frequency of visits depends on your site's content volume and update patterns. Sites with regularly refreshed, text-heavy content may experience more frequent crawling sessions than static sites with minimal textual information.
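
To see how often the crawler actually visits, you can count requests in your access logs whose user agent contains "Kangaroo Bot". The sketch below assumes logs in the common Combined Log Format, where the timestamp sits in square brackets and the user agent is the last quoted field. The regular expression and the function name are illustrative and may need adjusting for your server's log layout.

```python
import re
from collections import Counter

# Assumes Combined Log Format: [day/month/year:time zone] ... "user agent" at line end.
LOG_LINE = re.compile(r'\[(?P<day>[^:\]]+):[^\]]*\].*"(?P<agent>[^"]*)"\s*$')


def kangaroo_hits_per_day(lines, agent_token="Kangaroo Bot"):
    """Count requests per day whose user agent contains agent_token (case-insensitive)."""
    hits = Counter()
    for line in lines:
        match = LOG_LINE.search(line)
        if match and agent_token.lower() in match.group("agent").lower():
            hits[match.group("day")] += 1
    return hits
```

Typical usage would be to open the log file and pass the file object directly, for example `kangaroo_hits_per_day(open("access.log"))`, where the file name is whatever your server writes to.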

What is the purpose of Kangaroo Bot?

Kangaroo Bot's primary purpose is to collect data for the VegeMighty dataset, which will power Australia's first open-source large language model. This initiative aims to create an AI that truly understands Australian language and culture, rather than relying on models trained predominantly on American or global English content.

The project emphasizes data sovereignty by processing and storing all information within Australian infrastructure while adhering to ethical guidelines. Unlike many global AI data collection efforts, Kangaroo Bot implements an opt-out-plus model where website owners can control how their content is used.

The collected data undergoes a multi-stage processing workflow that includes:

  1. De-identification of personal information
  2. Preservation of uniquely Australian cultural references
  3. Privacy-enhancing techniques during dataset compilation
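
As a rough illustration of the first two steps, the Python sketch below masks two kinds of pattern-based identifiers while leaving the surrounding vernacular untouched. It is not the VegeMighty pipeline, whose implementation is not described here: the patterns, placeholder tokens, and function name are assumptions, and real de-identification would also need to handle names, addresses, and other identifiers that simple patterns miss.

```python
import re

# Illustrative patterns only: email addresses and Australian mobile numbers (04xx xxx xxx).
EMAIL = re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")
AU_MOBILE = re.compile(r"\b04\d{2}[ -]?\d{3}[ -]?\d{3}\b")


def deidentify(text):
    """Replace matched identifiers with placeholder tokens; leave all other text as-is."""
    text = EMAIL.sub("[EMAIL]", text)
    text = AU_MOBILE.sub("[PHONE]", text)
    return text
```

For example, `deidentify("Ring 0412 345 678 or email shazza@example.com.au this arvo")` masks the number and the address but keeps "this arvo" intact.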

The ultimate goal is to boost Australian AI innovation, create specialized tech jobs, and ensure Australia's digital future remains in Australian hands by developing language models that accurately represent the country's linguistic diversity.

How do I block Kangaroo Bot?

Kangaroo Bot respects the robots.txt protocol, making it straightforward to control its access to your website. To completely block the bot, add the following directives to your robots.txt file:

User-agent: Kangaroo Bot
Disallow: /

If you want to allow Kangaroo Bot to access only specific sections of your site, you can use more targeted directives:

User-agent: Kangaroo Bot
Allow: /public/
Disallow: /
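
Before deploying rules like these, you can check how a parser interprets them. The sketch below uses Python's standard `urllib.robotparser` module. The helper name `is_allowed` and the test paths are illustrative, and the result shows how Python's parser reads the rules, which is not a guarantee of how any particular crawler behaves.

```python
from urllib.robotparser import RobotFileParser

# The partial-access rules from above, as they would appear in robots.txt.
ROBOTS_TXT = """\
User-agent: Kangaroo Bot
Allow: /public/
Disallow: /
"""


def is_allowed(robots_txt, user_agent, path):
    """Return True if robots_txt permits user_agent to fetch path."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, path)
```

With these rules, `is_allowed(ROBOTS_TXT, "Kangaroo Bot", "/public/page.html")` is true, while any path outside /public/ is refused for Kangaroo Bot. Other user agents are unaffected because the group names only Kangaroo Bot. Some parsers apply the first matching rule rather than the most specific one, so it is worth keeping the Allow line ahead of the Disallow line.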

Beyond robots.txt controls, Kangaroo LLM offers additional options through their opt-out-plus model:

  1. Complete blocking via robots.txt
  2. Allowing crawling while requesting content exclusion from the final dataset
  3. Participating in selective content contribution

While blocking Kangaroo Bot won't negatively impact your site's visibility in traditional search engines, it does mean your content won't contribute to the development of Australian-specific AI models. By allowing access, you're helping create technology that better understands and represents Australian language and culture.


Operated by: Kangaroo LLM (data collector)
AI model training: Used to train AI or LLMs
Acts on behalf of user: No, operates independently of any user action
Obeys directives: Yes, obeys robots.txt rules
User Agent: Kangaroo Bot