There are plenty of tools that help website creators create and analyze robots.txt files. On the other hand, there aren't many resources to help people who build web crawlers and scrapers properly respect the rules that website creators set.
That's where Can I scrape comes in. You can use this tool to look up whether you're allowed to scrape a specific page, and use the API to programmatically check whether the website creator allows you to scrape and index any page on the fly.
The Robots Exclusion Protocol is a way for website owners to tell web robots and crawlers which pages should and should not be crawled and indexed.
There are three ways websites can set rules for robots: the robots.txt file, an X-Robots-Tag header, and the robots meta tag. If the website doesn’t have a rule against crawling or indexing a page, then your robot is ok to crawl and index it!
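As a sketch of how a bot might check the first of these, Python's standard-library `urllib.robotparser` can parse a robots.txt file and answer per-URL questions. The robots.txt content, bot name, and URLs below are made up for illustration:

```python
from urllib import robotparser

# A hypothetical robots.txt for example.com (illustrative, not real)
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Allow: /
"""

rp = robotparser.RobotFileParser()
rp.parse(ROBOTS_TXT.splitlines())

# Ask whether a given user agent may fetch a given URL
rp.can_fetch("MyBot", "https://example.com/blog/post")      # allowed
rp.can_fetch("MyBot", "https://example.com/private/secret") # disallowed
```

In a real crawler you would call `rp.set_url(...)` and `rp.read()` to download the site's actual robots.txt before checking any URLs.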
If you are crawling, indexing, or scraping content, you should honor the website’s rules. If you are acting purely on behalf of humans, however, it might make sense to ignore the rules.
While no laws enforce these rules, following them is part of being a good digital citizen and stating that you follow them can establish a positive reputation. This internet standard is followed by major search engines, including Google, Bing, and DuckDuckGo.
Some websites, like LinkedIn, also have protections in place against robots that don't follow the rules established in the robots.txt file. Crawling web pages that are protected by robots.txt can quickly get your robot rate-limited or blocked.
Sometimes, however, it makes sense to ignore these rules. For example, Slack states that they “do not currently honor robots.txt files” because their robot only visits a page when a human specifically links to it in a Slack team, so it isn't really a crawler.
In sum, it depends on what your robot is doing and why. If your bot crawls, indexes, or scrapes content in bulk or stores it for some other purpose, then you should probably honor the website's rules. If your bot only acts on behalf of a human, one page at a time (like Slack's), then you might decide to ignore the rules entirely.
There are a bunch of rules, called directives, that websites can set. Most importantly, your bot should not crawl or index pages if there are directives against it.
The other directives depend on why your bot is collecting links and content; not every rule will be relevant to your bot.
All bots should respect the directive of whether or not to crawl a web page.
For a bot, crawling is the equivalent of a human visiting a web page: to access the content, the bot has to crawl it, so every bot crawls pages. For example, bots that power enrichment tools like Clearbit and Hunter crawl and scrape data, and search engine bots crawl pages to get the content to search and to generate the snippet previews you see underneath each link. At the end of the day, every bot should respect whether or not a web page should be crawled.
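A well-behaved crawl also identifies itself, so site owners can see who is visiting. A minimal sketch using Python's `urllib.request`, where the bot name and URLs are placeholders:

```python
import urllib.request

# Identify the bot with a descriptive User-Agent string.
# "MyBot" and both URLs are placeholders for illustration.
req = urllib.request.Request(
    "https://example.com/blog/post",
    headers={"User-Agent": "MyBot/1.0 (+https://example.com/bot-info)"},
)

# html = urllib.request.urlopen(req).read()  # the actual fetch (needs network)
```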
If you are creating a bot that gathers a list of links, you should honor directives about indexing, following links, and displaying snippets.
Indexing is when you compile a list of links for some later use. Search engines are a great example: when Google indexes a page, its bot first crawls the page and then adds it to a database so it can be displayed later when someone searches for it. However, while crawling the page, Google might come across a directive that says the page can't be indexed. If it finds that rule, it won't add the page to the database, and that page won't show up in search results.
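That crawl-first, then-decide order can be sketched as a small function. The `fetch_page` and `has_noindex` helpers here are hypothetical stand-ins, not a real search engine's API:

```python
def maybe_index(url, fetch_page, has_noindex, index_db):
    """Crawl a page, then index it only if no directive forbids it.

    fetch_page and has_noindex are hypothetical helpers passed in
    for illustration; index_db is any dict-like store.
    """
    html = fetch_page(url)   # the bot must crawl first to see the directives
    if has_noindex(html):    # a noindex directive was found on the page
        return False         # do not add it to the index
    index_db[url] = html     # otherwise store it for later searches
    return True
```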
Other directives set how long the text snippet should be, and how large of an image to use when displaying the link in your index. These directives can help you gather a better index of links as well as generate high-quality snippets and previews.
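Some of these directives carry arguments, such as a maximum snippet length or image-preview size. A small sketch of splitting such a directive string into names and optional values; the directive names in the example are the commonly used ones, but the parser itself is illustrative:

```python
def parse_directives(value):
    """Split a directive string like 'noindex, max-snippet:50'
    into a dict mapping each directive name to its optional argument."""
    out = {}
    for part in value.split(","):
        part = part.strip().lower()
        if not part:
            continue
        name, _, arg = part.partition(":")
        out[name.strip()] = arg.strip() or None
    return out

parse_directives("noindex, max-snippet:50, max-image-preview:large")
```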
Website creators can share their preferences about the web scraping and crawling capabilities of their site. Let's dive into the possibilities.
The robots.txt file defines whether or not a web robot should crawl and access a file. The access can be configured for a specific user agent, or set across the board. While not enforced through legal methods, following these preferences is an important part of being a good web citizen. Curious about why?
Once a web robot crawls a web page, there are additional instructions, called directives, about how the web page should be indexed. The website owner sets these rules through the robots meta tags and X-Robots-Tag headers. If you're scraping pages but not indexing them, these most likely don't apply to you.
These directives relate to whether the web robot should index the given page and images, and whether it should follow other links on the page.
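One place these directives appear is the robots meta tag in the page's HTML. A minimal sketch using Python's standard-library `html.parser` to collect them; the sample HTML is made up for illustration:

```python
from html.parser import HTMLParser

class RobotsMetaParser(HTMLParser):
    """Collect the directives from any <meta name="robots"> tags."""

    def __init__(self):
        super().__init__()
        self.directives = set()

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "meta" and (attrs.get("name") or "").lower() == "robots":
            content = attrs.get("content") or ""
            self.directives |= {
                d.strip().lower() for d in content.split(",") if d.strip()
            }

parser = RobotsMetaParser()
parser.feed('<html><head><meta name="robots" content="noindex, nofollow"></head></html>')
# parser.directives now holds the page's robots directives
```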
Website creators can set their preferences for how the page is stored and modified once it is indexed by your web crawler.
The snippet and preview directives allow website owners to specify their preferences for how the link to this specific page is displayed. Like the caching and availability directives, these only apply if the page is indexed.