
When it comes to robots.txt, most people simply grab an outdated sample file and copy-paste its contents without taking into account their own website and the platform it runs on. Everything flows, everything changes. In this piece, we will show you how to adapt and optimize robots.txt specifically for your store running on Magento 2.
I won’t reinvent the wheel by saying that robots.txt sits in the website’s root directory and contains special instructions for web crawlers. So if you need to stop certain categories or pages from being indexed, set the right domain mirror, or ask web crawlers to keep a certain time frame between server file downloads, creating robots.txt is a must for you.
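For instance, a minimal sketch covering those three tasks; every value below is a placeholder rather than a recommendation:
User-agent: *
Disallow: /private-category/
Crawl-delay: 10
Host: www.domain.com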
Most websites keep such a robots.txt file in the website root folder, where it can be checked at http://<yourwebsitename>/robots.txt.
However, Magento 2 does not ship with a robots.txt file by default. It needs to be created, and that does take some time.
Robots.txt and its Role in Improving Magento
Several factors make robots.txt that important:
• SEO is king today, and robots.txt does a good job of preventing duplicate content issues.
• it keeps error logs and reports, core files, .svn files, and the like out of search results.
• it helps avoid unintended indexing, which makes your Magento store less exposed to attackers probing for sensitive paths.
The Route to Robots.txt in Magento 2
After logging in to the Magento 2 Admin Panel, go to Content > Design > Configuration:
You will arrive at the Design Configuration grid; click Edit there:
Scroll down and click Search Engine Robots:
This is exactly what you need:
Let’s now take a look at the must-know elements of robots.txt:
Host and Sitemap Directives
I’d like to draw your attention to the ‘host’ and ‘sitemap’ directives. There’s basically no difference between placing them at the top or at the bottom of this block; what is essential is not to leave them out:
• Host: (www.)domain.com. The host directive points web crawlers to the main mirror of the website, which can be listed either with ‘www.’ or without it, with ‘http’ or ‘https’. (Historically, only Yandex supported this directive.)
• Sitemap: http://www.domain.com/sitemap_en.xml. The sitemap directive points crawlers to the sitemap location, i.e. it is used to call out the location of any XML sitemap(s) associated with the URL. Note that only Google, Ask, Bing, and Yahoo support this directive.
In fact, when it comes to Google or Bing, these settings can be made through Search Console or Bing Webmaster Tools respectively. Other search engines, however, do not offer such functionality, so it is highly advisable to specify this data in the ‘Edit Custom Instruction of robots.txt file’ field.
While ‘host’ and ‘sitemap’ directives appear only once on a website, other directives are used in blocks.
Blocks of directives for specific crawlers
Robots.txt is made up of directives grouped into blocks, such as:
User-agent:
It is used for all agents (*) or a specific agent, such as Googlebot, and defines which bots the following directives refer to.
User-agent: * is often used to show that the directives apply to all bots.
Top 10 widespread bots:
1. Googlebot,
2. Baiduspider,
3. MSN Bot/Bingbot,
4. Yandex Bot,
5. Soso Spider,
6. Exabot,
7. Sogou,
8. Google Plus Share,
9. Facebook External Hit,
10. Google Feedfetcher.
Then another bot and its directives can be listed below, forming another block.
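For illustration, a sketch of two such blocks; the disallowed paths are placeholders rather than recommendations:
# rules for all bots
User-agent: *
Disallow: /checkout/
# a separate block that applies to Googlebot only
User-agent: Googlebot
Disallow: /checkout/
Disallow: /catalogsearch/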
Disallow:
It is used for all the URLs (*) or specific URLs with a certain URL path.
Let me clarify: these directives block or unblock not the files or directories themselves, but the URLs used to reach them. In other words, they block or unblock the set of URLs that begin with a specified URL path.
As a result, one and the same web page can be available under one URL and blocked under a different one.
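For example, if a product is reachable both at a rewritten URL such as /bag.html and at its native Magento URL /catalog/product/view/id/123 (the paths here are illustrative), the following rule blocks only the second URL while the first one stays crawlable:
Disallow: /catalog/product/view/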
Allow:
It is used for all the URLs (*) or specific URLs with a certain URL path.
Originally, only the disallow directive was supported, which is probably why you won’t find much documentation on the allow directive. Nevertheless, it is supported by some search engines, though handled differently. Studies showed that the number of characters in the path is what decides a conflict with ‘disallow’: an ‘allow’ rule beats a ‘disallow’ rule only if its path has as many or more characters.
Sometimes Magento 2 store owners use the following lines, where the longer ‘Allow’ paths win over ‘Disallow: /’ and keep product and CMS images crawlable:
User-agent: Googlebot-Image
Disallow: /
Allow: /media/catalog/product/
Allow: /media/wysiwyg/
Allow: /media/images/
Crawl Delay
You can also manage crawl priorities. Google does not support the ‘crawl-delay’ command directly, but you can lower the crawl rate inside Google Search Console. Considering Google’s market share, it is usually rational not to change your Google crawl priority.
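Some other crawlers, Bingbot among them, do honor a Crawl-delay directive placed straight in robots.txt; a sketch, with the 10-second value serving purely as an illustration:
User-agent: Bingbot
Crawl-delay: 10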
Advice on Filling In the ‘Edit Custom Instruction of robots.txt file’ Field
1. Disallowing System Files and Directories
You’ve probably seen plenty of examples and recommendations with ‘Disallow’ lines like these:
# Directories
Disallow: /app/
Disallow: /bin/
Disallow: /dev/
Disallow: /lib/
Disallow: /phpserver/
Disallow: /pkginfo/
Disallow: /report/
Disallow: /setup/
Disallow: /update/
Disallow: /var/
Disallow: /vendor/
# Files
Disallow: /composer.json
Disallow: /composer.lock
Disallow: /CONTRIBUTING.md
Disallow: /CONTRIBUTOR_LICENSE_AGREEMENT.html
Disallow: /COPYING.txt
Disallow: /Gruntfile.js
Disallow: /LICENSE.txt
Disallow: /LICENSE_AFL.txt
Disallow: /nginx.conf.sample
Disallow: /package.json
Disallow: /php.ini.sample
Disallow: /RELEASE_NOTES.txt
In fact, there is NO need to specify ‘disallow’ for all these system files and directories. Web crawlers learned to tell them apart a long while ago, and the last thing they want is to stuff their servers with terabytes of data on millions of identical CMS-related files.
2. Personal data
This might be obvious, but never ever make personal data easily accessible: your customer base, account profiles, orders, payment details, and so on. Make sure to block crawler access to such pages if you don’t want this data exposed.
So, stop crawling for:
- User account and checkout pages
Disallow: /checkout/
Disallow: /onestepcheckout/
Disallow: /customer/
Disallow: /customer/account/
Disallow: /customer/account/login/
- Search Pages
Make sure to block search pages from crawling in your online store. If you don’t, the number of indexed pages in your store may increase dramatically. If you run a Magento 2 search extension, check which URL its search results use. The following rule is most commonly used:
Disallow: /catalogsearch/
- Native Magento 2 catalog URLs, as opposed to the rewritten, SEO-friendly URLs that customers actually see.
- Pages with parameters, which often appear once Layered Navigation is set up. Webmasters sometimes block such filter and sorting pages:
Disallow: /*?dir*
Disallow: /*?dir=desc
Disallow: /*?dir=asc
Disallow: /*?limit=all
Disallow: /*?mode*
- Pages generated by specific extensions as part of their functionality.
Disallow: /tag/
Disallow: /customize/
Disallow: /sendfriend/
Disallow: /ajax/
Disallow: /sales/guest/form/
- Duplicate content
I’d highly recommend checking for duplicate content with Google Search Console and Bing Webmaster Tools, and adding ‘disallow’ directives for the duplicates you find:
Disallow: /review/
Disallow: /quickview/
Disallow: /productalert/
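Putting the pieces together, here is a minimal sketch of a custom instruction block assembled from the examples above; every path must be adapted to your own store, and the sitemap URL is a placeholder:
User-agent: *
Disallow: /checkout/
Disallow: /customer/
Disallow: /catalogsearch/
Disallow: /*?dir*
Disallow: /*?limit=all
Disallow: /sendfriend/
Disallow: /review/
Sitemap: http://www.domain.com/sitemap_en.xml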
Now, with all of the above in mind, it’s high time to create robots.txt for your Magento 2 store. Don’t forget to double-check it with the robots.txt Tester in Google Search Console.
In fact, blocking URLs with robots.txt doesn’t guarantee that those pages won’t appear as URL-only listings in search results. If you want to keep a page out of the index completely, consider using the noindex meta tag on a per-page basis. Placing the following lines in the HTML head of a document tells crawlers not to index the page or follow its links. Note, however, that if a URL is disallowed in robots.txt, crawlers never fetch the page and so never see its meta tags, so don’t rely on both methods for the same page.
Here’s the code:
· <meta name="robots" content="noindex"> – the page is not indexed, but its links are still followed,
· <meta name="robots" content="noindex,nofollow"> – the page is not indexed and its links are not followed.
Robots.txt Creation: 2 Variants to Consider
There are probably a great many ways to edit robots.txt. I can vouch for two of them:
1) Edit robots.txt manually. If you have access to the website’s server, you can create robots.txt in ‘Notepad’ or any other plain-text editor and add it to the site’s root directory. To do that:
· open ‘Notepad’ and paste your directives,
· save the file as robots.txt,
· upload the file to the web hosting root folder.
2) Use pre-tailored solutions. If you are looking for a less time-consuming option, make use of an SEO Magento 2 extension. Luckily, there is a wide choice on the market; just pick one that offers robots.txt editing functionality.
Bottom Line
There’s probably one more thing to add. Always keep in mind that a robots.txt file covers only one domain. Thus, Magento websites with multiple domains or subdomains should carry a separate robots.txt file for each of them.
I hope you find this piece helpful!