Can I scrape

Can I scrape is a way to check whether robots are allowed to crawl and index web pages. Use the API to validate scraping permissions for pages on the fly.

About Can I scrape

There are plenty of tools for creating and analyzing robots.txt for website creators. On the other hand, there aren't many resources to help people making web crawlers and scrapers properly respect the rules that website creators set.

That’s where Can I scrape comes in. You can use this tool to look up whether you’re able to scrape a specific page, and use the API to programmatically check whether the website creator allows you to scrape and index any page on the fly.


How do you know if you can scrape a website?

The Robots Exclusion Protocol is a way for website owners to tell web robots and crawlers which pages should and should not be crawled and indexed.

There are three ways websites can set rules for robots: the robots.txt file, an X-Robots-Tag header, and the robots meta tag. If the website doesn’t have a rule against crawling or indexing a page, then your robot is ok to crawl and index it!
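
To make the robots.txt mechanism concrete, here is a minimal Python sketch using the standard library's urllib.robotparser; the site URL, page URL, and user agent are placeholders, and a full check would also look at the meta tag and header described below.

from urllib import robotparser

# Placeholder values used purely for illustration.
ROBOTS_URL = "https://www.example.com/robots.txt"
PAGE_URL = "https://www.example.com/some/page"
USER_AGENT = "my-crawler"

# Fetch and parse the site's robots.txt file.
rules = robotparser.RobotFileParser()
rules.set_url(ROBOTS_URL)
rules.read()

# Ask whether this user agent may crawl the page.
if rules.can_fetch(USER_AGENT, PAGE_URL):
    print("robots.txt allows crawling this page")
else:
    print("robots.txt disallows crawling this page")

Checking the meta tag and the X-Robots-Tag header requires actually fetching the page, which is why those directives mostly matter once you start indexing rather than just crawling.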

Should you honor these rules?

If you are crawling, indexing, or scraping content, you should honor the website’s rules. If you are acting purely on behalf of humans, however, it might make sense to ignore the rules.

While no laws enforce these rules, following them is part of being a good digital citizen and stating that you follow them can establish a positive reputation. This internet standard is followed by major search engines, including Google, Bing, and DuckDuckGo.

Some websites, like LinkedIn, also have protections in place against robots that don’t follow the rules established in the robots.txt file. Crawling web pages that are disallowed in the robots.txt file can quickly get your robot rate-limited or blocked.

Sometimes, however, it makes sense to ignore these rules. For example, Slack states that they “do not currently honor robots.txt files” because their robot only visits pages when a human specifically links to one in a Slack team, and so isn’t a crawler.

In sum, it depends on what your robot is doing and why. If your bot crawls, indexes, or scrapes content in bulk or repurposes it, then you should probably honor the website’s rules. If your bot only acts on behalf of a human for one page at a time (like Slack’s), then you might decide to ignore the rules entirely.

What rules should your robot follow?

There are a bunch of rules, called directives, that websites can set. Most importantly, your bot should not crawl or index pages if there are directives against it.

The other directives depend on why your bot is collecting links and content. Not all rules will be relevant for your bot.

Crawling

All bots should respect the directive of whether or not to crawl a web page.

Crawling is a bot’s equivalent of a human visiting a web page: to access the content, the bot has to crawl it, so every bot crawls. For example, bots that power enrichment tools like Clearbit and Hunter crawl and scrape data, and search engine bots crawl pages to fetch the content they search and to generate the snippet previews you see underneath each link. At the end of the day, every bot should listen to whether or not a web page should be crawled.
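
As a rough sketch of what respecting that directive can look like in practice (placeholder URLs and bot name, Python standard library only), a bot can check robots.txt for its own user agent and only download the page if crawling is allowed:

from urllib import request, robotparser

USER_AGENT = "example-enrichment-bot"  # placeholder bot name
PAGE_URL = "https://www.example.com/pricing"

# Check the site's robots.txt for this specific user agent first.
rules = robotparser.RobotFileParser("https://www.example.com/robots.txt")
rules.read()

if rules.can_fetch(USER_AGENT, PAGE_URL):
    # Only now crawl the page, identifying the bot honestly in the request.
    req = request.Request(PAGE_URL, headers={"User-Agent": USER_AGENT})
    with request.urlopen(req) as response:
        html = response.read().decode("utf-8", errors="replace")
    print(f"Crawled {len(html)} characters from {PAGE_URL}")
else:
    print(f"Skipping {PAGE_URL}: robots.txt disallows it for {USER_AGENT}")

Sending an honest User-Agent header alongside the check also lets site owners write rules that target your bot specifically.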

Indexing

If you are creating a bot that gathers a list of links, you should honor directives about indexing, following links, and displaying snippets.

Indexing is when you compile a list of links for some later use. Search engines are a great example of this. When Google indexes a page, their bot first crawls the page, then it adds it to their database, so they can display it at a later date when someone searches for it. However, after Google crawls the page they might come across a directive that says they can’t index it. If they find that rule, then they won’t add it to the database and that page won’t show up in search results.

Other directives set how long the text snippet should be, and how large of an image to use when displaying the link in your index. These directives can help you gather a better index of links as well as generate high-quality snippets and previews.
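
As a rough illustration of reading those directives (placeholder HTML, Python standard library only), an indexing bot might collect the values from the robots meta tag after crawling a page; the X-Robots-Tag header variant is covered in the reference below.

from html.parser import HTMLParser

# Placeholder HTML standing in for a page the bot has already crawled.
PAGE_HTML = """
<html><head>
  <meta name="robots" content="noindex, nofollow, max-snippet:50">
</head><body>...</body></html>
"""

class RobotsMetaParser(HTMLParser):
    """Collects the comma-separated directives from robots meta tags."""
    def __init__(self):
        super().__init__()
        self.directives = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "meta" and attrs.get("name", "").lower() == "robots":
            content = attrs.get("content", "")
            self.directives += [d.strip().lower() for d in content.split(",")]

parser = RobotsMetaParser()
parser.feed(PAGE_HTML)

print("directives found:", parser.directives)
print("may index:", "noindex" not in parser.directives)
print("may follow links:", "nofollow" not in parser.directives)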

Reference

Website creators can share their preferences for how their site should be scraped and crawled. Let's dive into the possibilities.

robots.txt

The robots.txt file defines whether or not a web robot may crawl and access a given URL. Access can be configured for a specific user agent or set across the board. While not legally enforced, following these preferences is an important part of being a good web citizen; see “Should you honor these rules?” above for why.

Examples

Open the whole site to crawling
User-agent: *
Disallow:

Block the Bing crawler
User-agent: bingbot
Disallow: /

Block jpg images from being crawled
User-agent: *
Disallow: /*.jpg$

Standard WordPress robots.txt
User-agent: *
Disallow: /wp-admin/
Allow: /wp-admin/admin-ajax.php
Sitemap: https://www.example.com/sitemap_index.xml

robots <meta/> tags and X-Robots-Tag headers

Once a web robot crawls a web page, there are additional instructions, called directives, about how the web page should be indexed. The website owner sets these rules through the robots <meta/> tags and X-Robots-Tag headers. If you’re scraping pages but not indexing them, these most likely don’t apply to you.
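
For the header variant, a minimal sketch (placeholder URL and bot name, Python standard library only) can read the X-Robots-Tag headers returned with the response; this header is often used for non-HTML files such as PDFs.

from urllib import request

# Placeholder URL used purely for illustration.
URL = "https://www.example.com/report.pdf"

req = request.Request(URL, headers={"User-Agent": "example-bot"})
with request.urlopen(req) as response:
    # A response may carry several X-Robots-Tag headers; collect them all.
    header_values = response.headers.get_all("X-Robots-Tag") or []

directives = [d.strip().lower() for value in header_values for d in value.split(",")]
print("X-Robots-Tag directives:", directives)
print("may index:", "noindex" not in directives)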

Indexing and following

These directives relate to whether the web robot should index the given page and images, and whether it should follow other links on the page.

noindex
Whether or not the page should be indexed.
<meta name="robots" content="noindex" />
noimageindex
Whether or not the images on the page should be indexed.
<meta name="robots" content="noimageindex" />
nofollow
Whether or not the web crawler should follow links it finds on this page.
<meta name="robots" content="nofollow" />

Caching and availability

Website creators can set their preferences for how the page is stored and modified once it is indexed by your web crawler.

unavailable_after
When the page should be de-indexed (see the sketch after this block).
<meta name="robots" content="unavailable_after: Sunday, 01-Sep-24 01:00:00 PDT" />
noarchive
Whether or not the web crawler should archive or cache this page.
<meta name="robots" content="noarchive" />
notranslate
Whether or not the web crawler should translate this page to other languages.
<meta name="robots" content="notranslate" />
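
As a small sketch of acting on unavailable_after (placeholder timestamp, Python standard library only): email.utils can usually parse the RFC 850-style date shown above, which the bot then compares against the current time to decide whether to drop the page from its index.

from datetime import datetime, timezone
from email.utils import parsedate_to_datetime

# Placeholder directive value, copied from the example above.
unavailable_after = "Sunday, 01-Sep-24 01:00:00 PDT"

cutoff = parsedate_to_datetime(unavailable_after)
if datetime.now(timezone.utc) >= cutoff:
    print("Past the cutoff: drop this page from the index")
else:
    print("Still indexable until", cutoff.isoformat())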

Snippets and previews

The snippet and preview directives allow website owners to specify their preferences for how the link to this specific page is displayed. Like the caching and availability directives, these only apply if the page is indexed.

nosnippet
Whether or not a snippet should be shown for a link.
<meta name="robots" content="nosnippet" />
max-snippet
Maximum number of characters that should be taken from the page for the snippet.
<meta name="robots" content="max-snippet:50" />
max-image-preview
Size of the image that should be used for the preview.
Values: none, standard, large
<meta name="robots" content="max-image-preview:standard" />
max-video-preview
Maximum number of seconds to use from a video for the video preview.
<meta name="robots" content="max-video-preview:5" />