There are plenty of tools that help website creators create and analyze robots.txt files. On the other hand, there aren't many resources to help people who build web crawlers and scrapers properly respect the rules that website creators set.
That's where Can I scrape comes in. You can use this tool to look up whether you're allowed to scrape a specific page, and use the API to programmatically check whether the website creator allows you to scrape and index any page on the fly.
The Robots Exclusion Protocol is a way for website owners to tell web robots and crawlers which pages should and should not be crawled and indexed.
There are three ways websites can set rules for robots: the robots.txt file, an X-Robots-Tag header, and the robots meta tag. If the website doesn’t have a rule against crawling or indexing a page, then your robot is ok to crawl and index it!
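As a sketch of how a bot might check the first of these, Python's standard-library `urllib.robotparser` can parse a robots.txt file and answer per-URL questions. The robots.txt content, bot name, and URLs below are made up for illustration:

```python
from urllib import robotparser

# A hypothetical robots.txt for example.com (illustrative, not real)
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Allow: /
"""

rp = robotparser.RobotFileParser()
rp.parse(ROBOTS_TXT.splitlines())

# Ask whether a given user agent may fetch a given URL
rp.can_fetch("MyBot", "https://example.com/blog/post")      # allowed
rp.can_fetch("MyBot", "https://example.com/private/secret") # disallowed
```

In a real crawler you would call `rp.set_url(...)` and `rp.read()` to download the site's actual robots.txt before checking any URLs.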
If you are crawling, indexing, or scraping content, you should honor the website’s rules. If you are acting purely on behalf of humans, however, it might make sense to ignore the rules.
While no laws enforce these rules, following them is part of being a good digital citizen and stating that you follow them can establish a positive reputation. This internet standard is followed by major search engines, including Google, Bing, and DuckDuckGo.
Some websites, like LinkedIn, also have protections in place against robots that don't follow the rules established in the robots.txt file. Crawling web pages that are protected by robots.txt can quickly get your robot rate-limited or blocked.
Sometimes, however, it makes sense to ignore these rules. For example, Slack states that they “do not currently honor robots.txt files” because their robot only visits a page when a human specifically links to it in a Slack team, so it isn't really a crawler.
In sum, it depends on what your robot is doing and why. If your bot crawls, indexes, or scrapes content in bulk or stores it for some other purpose, then you should probably honor the website's rules. If your bot only acts on behalf of a human, one page at a time (like Slack's), then you might decide to ignore the rules entirely.
There are a bunch of rules, called directives, that websites can set. Most importantly, your bot should not crawl or index pages if there are directives against it.
The other directives depend on why your bot is collecting links and content; not every rule will be relevant to your bot.
All bots should respect the directive of whether or not to crawl a web page.
For a bot, crawling is the equivalent of a human visiting a web page: to access the content, the bot has to crawl it, so every bot crawls pages. For example, bots that power enrichment tools like Clearbit and Hunter crawl and scrape data, and search engine bots crawl pages to get the content to search and to generate the snippet previews you see underneath each link. At the end of the day, every bot should respect whether or not a web page should be crawled.
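A well-behaved crawl also identifies itself, so site owners can see who is visiting. A minimal sketch using Python's `urllib.request`, where the bot name and URLs are placeholders:

```python
import urllib.request

# Identify the bot with a descriptive User-Agent string.
# "MyBot" and both URLs are placeholders for illustration.
req = urllib.request.Request(
    "https://example.com/blog/post",
    headers={"User-Agent": "MyBot/1.0 (+https://example.com/bot-info)"},
)

# html = urllib.request.urlopen(req).read()  # the actual fetch (needs network)
```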
If you are creating a bot that gathers a list of links, you should honor directives about indexing, following links, and displaying snippets.
Indexing is when you compile a list of links for some later use. Search engines are a great example: when Google indexes a page, its bot first crawls the page and then adds it to a database so it can be displayed later when someone searches for it. However, while crawling the page, Google might come across a directive that says the page can't be indexed. If it finds that rule, it won't add the page to the database, and that page won't show up in search results.
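That crawl-first, then-decide order can be sketched as a small function. The `fetch_page` and `has_noindex` helpers here are hypothetical stand-ins, not a real search engine's API:

```python
def maybe_index(url, fetch_page, has_noindex, index_db):
    """Crawl a page, then index it only if no directive forbids it.

    fetch_page and has_noindex are hypothetical helpers passed in
    for illustration; index_db is any dict-like store.
    """
    html = fetch_page(url)   # the bot must crawl first to see the directives
    if has_noindex(html):    # a noindex directive was found on the page
        return False         # do not add it to the index
    index_db[url] = html     # otherwise store it for later searches
    return True
```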
Other directives set how long the text snippet should be, and how large of an image to use when displaying the link in your index. These directives can help you gather a better index of links as well as generate high-quality snippets and previews.
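Some of these directives carry arguments, such as a maximum snippet length or image-preview size. A small sketch of splitting such a directive string into names and optional values; the directive names in the example are the commonly used ones, but the parser itself is illustrative:

```python
def parse_directives(value):
    """Split a directive string like 'noindex, max-snippet:50'
    into a dict mapping each directive name to its optional argument."""
    out = {}
    for part in value.split(","):
        part = part.strip().lower()
        if not part:
            continue
        name, _, arg = part.partition(":")
        out[name.strip()] = arg.strip() or None
    return out

parse_directives("noindex, max-snippet:50, max-image-preview:large")
```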
Website creators can share their preferences about the web scraping and crawling capabilities of their site. Let's dive into the possibilities.
The robots.txt file defines whether or not a web robot should crawl and access a file. The access can be configured for a specific user agent, or set across the board. While not enforced through legal methods, following these preferences is an important part of being a good web citizen. Curious about why?
Once a web robot crawls a web page, there are additional instructions, called directives, about how the web page should be indexed. The website owner sets these rules through the robots meta tags and X-Robots-Tag headers. If you're scraping pages but not indexing them, these most likely don't apply to you.
These directives relate to whether the web robot should index the given page and images, and whether it should follow other links on the page.
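One place these directives appear is the robots meta tag in the page's HTML. A minimal sketch using Python's standard-library `html.parser` to collect them; the sample HTML is made up for illustration:

```python
from html.parser import HTMLParser

class RobotsMetaParser(HTMLParser):
    """Collect the directives from any <meta name="robots"> tags."""

    def __init__(self):
        super().__init__()
        self.directives = set()

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "meta" and (attrs.get("name") or "").lower() == "robots":
            content = attrs.get("content") or ""
            self.directives |= {
                d.strip().lower() for d in content.split(",") if d.strip()
            }

parser = RobotsMetaParser()
parser.feed('<html><head><meta name="robots" content="noindex, nofollow"></head></html>')
# parser.directives now holds the page's robots directives
```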
Website creators can set their preferences for how the page is stored and modified once it is indexed by your web crawler.
The snippet and preview directives allow website owners to specify their preferences for how the link to this specific page is displayed. Like the caching and availability directives, these only apply if the page is indexed.