A Fundamental Understanding Of robots.txt – A Beginner’s Guide
In the previous chapter – 12 Tips To Make A Mobile Friendly Website – we learned what makes a website design mobile-friendly. In this chapter – A Fundamental Understanding Of robots.txt – we will learn how to read, write, and use robots.txt for your website. We will also understand its role in SEO.
- What is the use of “robots.txt” file?
- What does the file “robots.txt” contain?
- Example contents of a “robots.txt” file and their meanings
- Where is the file “robots.txt” present in our website?
- The advantage of having a “robots.txt” file
- How to create a robots.txt file for your website?
- What happens to the links present in the page blocked by “robots.txt”?
- What is the difference between “robots.txt” and the “meta-robots” tag?
- A Few Important Points To Be Noted:
“robots.txt” is a text file that instructs search engine bots (also called “crawlers”, “spiders”, or “User-agents”) on whether they are allowed or disallowed to crawl and index certain parts of the website.
Because the “robots.txt” file contains instructions on how search engine crawlers should crawl the entire website, it is the first file that crawlers look for when they arrive at a website to crawl and index it.
NOTE: The name of the file should be “robots.txt” only, and not “Robot.txt” or “ROBOT.txt” or “Robot.TXT” or anything else.
Every basic “robots.txt” file contains the following two elements, called “directives” or “attributes”, each followed by a colon (:).
1. User-agent – The name of the search engine bot for which the instructions are being given
2. Disallow – A directive which instructs the User-agent as to which part of the website structure is disallowed for crawling
A more advanced “robots.txt” file may contain three more elements apart from the above two.
3. Allow – A directive which instructs the User-agent as to which part of the website structure is allowed for crawling, typically used to carve out an exception within a disallowed section. It was popularized by Google’s Googlebot and is now also recognized by other major crawlers such as Bingbot.
4. Crawl-delay – A directive which instructs the User-agent how many seconds to wait before loading and crawling the content of a web page. This directive is not part of the standard and is ignored by Google’s Googlebot spider.
5. Sitemap – A directive which informs the User-agent of the location of the website’s XML sitemap. This directive is supported by Google, Ask, Bing, and Yahoo, and SEO professionals consider it good practice to always include the URL of the website’s XML sitemap.
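A minimal illustration of the Sitemap directive (the sitemap URL is a placeholder):

```
# The Sitemap directive may appear anywhere in the file,
# independent of any User-agent group.
Sitemap: https://www.example.com/sitemap.xml
```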
a. The below content informs “Googlebot” not to crawl the ‘login’ directory of the website
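A minimal example matching this description (the ‘login’ directory name follows the text above):

```
User-agent: Googlebot
Disallow: /login/
```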
b. The below content indicates that none of the web crawlers should crawl the login directory of the website. Star (*) indicates all.
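A minimal example matching this description:

```
User-agent: *
Disallow: /login/
```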
c. The below content indicates that none of the web crawlers should crawl the ‘login’ directory of the website, except for the ‘search’ directory within ‘login’, which they are allowed to crawl.
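A minimal example matching this description:

```
User-agent: *
Disallow: /login/
# Exception: the 'search' directory inside 'login' may be crawled
Allow: /login/search/
```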
d. The below content indicates that none of the web crawlers are allowed to crawl any parts of the website
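A minimal example matching this description:

```
User-agent: *
# A lone forward slash disallows the entire site
Disallow: /
```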
e. The below content indicates that all web crawlers are allowed to crawl all parts of the website
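A minimal example matching this description:

```
User-agent: *
# An empty Disallow value permits crawling of the whole site
Disallow:
```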
f. The below content has two instruction sets for two different User-agents. The first applies to all web crawlers; the second is specifically for Google’s “Googlebot”. Because a separate instruction set is defined for “Googlebot”, it ignores the first set meant for all crawlers.
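A minimal example matching this description (the Disallow paths here are illustrative assumptions):

```
# First set: applies to all crawlers
User-agent: *
Disallow: /login/

# Second set: Googlebot follows only this group and ignores the one above
User-agent: Googlebot
Disallow: /archive/
```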
g. The below content instructs all web crawlers not to crawl URLs ending with “.gif”. Note that in the below content ‘$’ is a pattern-matching element that matches the end of the URL.
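A minimal example matching this description:

```
User-agent: *
# '*' matches any sequence of characters; '$' anchors the match
# to the end of the URL, so only URLs ending in .gif are blocked
Disallow: /*.gif$
```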
The file “robots.txt” must be present in the root directory of your domain. If you have multiple sub-domains, then a “robots.txt” file has to be present in the root directory of each sub-domain.
To check if you have a “robots.txt” already on your website, type the URL (replacing example.com with your domain): https://www.example.com/robots.txt
or, if it is on your subdomain, type the URL: https://subdomain.example.com/robots.txt
If present, the browser will show a file with some content; otherwise, it will result in a 404 – Page Not Found error. That is why I personally recommend every webmaster always have a robots.txt file on their site, even if it is blank – to avoid search engines encountering a 404 error when they first visit your site.
“Let us make the first impression the best impression”
There is no point in placing the “robots.txt” file anywhere else in the website’s directory structure. If “robots.txt” is present anywhere other than the root directory, web crawlers assume the file is missing and crawl the entire website.
You can find real-world sample robots.txt files by opening the robots.txt file of any major website, for example https://www.google.com/robots.txt.
Some of the advantages of having “robots.txt” file on our website include:
- Help keep business-related information and documents on your website out of search results
- Discourage User-agents (crawlers or spiders) from crawling pages containing sensitive information, such as customer login pages (note that robots.txt is a publicly readable request honored by compliant crawlers, not a security mechanism)
- Prevent duplicate content on your website from appearing on search engine results pages
You can either create “robots.txt” manually or use an online tool that will auto-generate one for you based on your requirements. After creating it, you can upload it to your website’s root folder through your cPanel account.
Refer to this document for detailed information: Create A robots.txt file
The links present on pages blocked by “robots.txt” will not be followed by crawlers, so no link equity (or “link juice”) is passed on to the linked destination page. However, if the destination page is also linked from another page that is not blocked by “robots.txt”, link equity will be passed through that page.
“robots.txt” controls crawler activity (or behavior) across the pages of the whole website, whereas the “meta-robots” tag granularly controls crawler activity for individual pages.
There is only one “robots.txt” for a website, but each individual page can have a “meta-robots” tag to specify how crawlers should treat and index the links on that particular page.
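As an illustration, a typical “meta-robots” tag is placed in the HTML head of an individual page (covered in detail in the next chapter):

```
<head>
  <!-- Ask crawlers not to index this page and not to follow its links -->
  <meta name="robots" content="noindex, nofollow">
</head>
```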
- Files not explicitly excluded in “robots.txt” will be crawled and indexed by search engines.
- Directive values (URL paths) are case-sensitive: for example, ‘Disallow: /Login/’ does not block ‘/login/’. Directive names such as ‘User-agent’ and ‘Disallow’ are conventionally capitalized, although major crawlers treat the names themselves case-insensitively.
- By convention, a space follows the colon in each directive. For example: “Allow: /tmp”
- To block an entire directory, place a forward slash before and after the directory name. For example: Disallow: /secretfolder/
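As a sketch of the difference the trailing slash makes (the directory name is illustrative):

```
User-agent: *
# Without a trailing slash, this matches any path beginning with
# the string, e.g. /secretfolder/, /secretfolder/page.html,
# and even /secretfolder-old/
Disallow: /secretfolder
# With a trailing slash, only paths inside the directory match,
# e.g. /secretfolder/page.html
Disallow: /secretfolder/
```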
Congratulations! You are done with the fourteenth chapter, “A Fundamental Understanding Of robots.txt”. Hope you enjoyed reading it.
All the best for your next chapter, “A Fundamental Understanding Of the Meta Robots Tag”. In the next chapter, you will learn the role of the meta-robots tag in SEO and how to use it to control how search engines index individual pages.
Feel free to comment below on whether this blog post was useful. If it was, please do me a favor by sharing it with others who might benefit.
Interested In Full Time Digital Marketing Course?
Feel free to check out the modules covered in DIGITAL MARKETING TRAINING
Interested In SEO Course?
Feel free to check out the modules covered in SEO TRAINING
Subhash.K.U is a professional programmer turned digital marketing enthusiast. He is one of the most sought-after marketing consultants for small and medium-scale businesses. He founded Subhash Digital Academy to teach professional digital marketing skills to students, entrepreneurs, and working professionals. He holds a Bachelor’s degree in Electrical Engineering and is an Oracle Certified Programmer. He also holds certifications in Google AdWords, Facebook Blueprint, and HubSpot Marketing. He is the co-author of the best-selling book Cracking The C, C++ and Java Interview, published by McGraw Hill. He is now penning another book on the subject of marketing and entrepreneurship.