
How to Optimize Robots.txt for Magento 2 Store

When it comes to robots.txt, many people grab an out-of-date sample file and simply copy-paste its contents without taking into account their own website and the platform it runs on. Everything flows, everything changes. In this piece, we will show you how to adapt and optimize robots.txt specifically for a store that runs on Magento 2.

I won’t reinvent the wheel by saying that robots.txt is located in the website’s root directory and stores special instructions for web crawlers. So, if you need to stop certain categories or pages from being indexed, set the preferred domain mirror, or ask web crawlers to keep a certain time frame between requests to your server, then creating a robots.txt file is a must for you.

Most websites keep a robots.txt file in the website root folder, so it can be checked at http://<yourwebsitename>/robots.txt.
However, Magento 2 does not ship with a robots.txt file by default. Thus, it needs to be created, and that does take some time.
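If you want to verify what your store currently returns, you can simply request the file over HTTP; the sketch below uses yourstore.example.com as a placeholder for your own domain, and a 404 response means the file still has to be created:

curl -I https://yourstore.example.com/robots.txt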

Robots.txt and its Role in Improving Magento

There are several factors that make robots.txt important:
• SEO is king today, and robots.txt does a good job of preventing duplicate content issues.
• It hides error logs and reports, core files, .svn files, etc.
• It helps avoid unintended indexing, which helps protect your Magento store from hacker attacks.

The Route to Robots.txt in Magento 2

After logging in to the Magento 2 Admin Panel, go to Content > Design > Configuration:

Magento 2 robots.txt admin panel
You will arrive at the Design Configuration section; then click Edit:

robots.txt editing in Magento 2
Scroll down and click Search Engine Robots:

Magento 2 robots.txt settings

This is exactly what you need:

Magento 2 search engine robots.txt file

Let’s now take a look at the must-know elements of robots.txt:

Host and Sitemap Directives

I’d like to draw your attention to the host and sitemap directives. Basically, there’s no big difference between placing them at the top or at the bottom of this block. What is essential is not to leave them out:

Host: (www.)domain.com. The host directive tells web crawlers the preferred version of the website, which can be listed either with ‘www.’ or without it, and with ‘http’ or ‘https’.
Sitemap: http://www.domain.com/sitemap_en.xml. The sitemap directive points crawlers to the sitemap location, i.e. it is used to call out the location of any XML sitemap(s) associated with a given URL. Note that only Google, Ask, Bing and Yahoo support this directive.
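Placed together in robots.txt, the two directives might look like the sketch below; the domain and sitemap filename are placeholders to replace with your own:

Host: www.domain.com
Sitemap: http://www.domain.com/sitemap_en.xml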

In fact, when it comes to Google or Bing, these settings can be configured through Search Console or Bing Webmaster Tools respectively. However, other search engines do not have such functionality, so it is highly advisable to specify this data in the ‘Edit Custom Instruction of robots.txt file’ field.

While ‘host’ and ‘sitemap’ directives appear only once on a website, other directives are used in blocks.

Blocks of directives for specific crawlers

Robots.txt contains directives and several blocks of directives, such as:

User-agent:

It is used for all agents (*) or specific agents, such as Googlebot, and defines which bots the following directives refer to.

User-agent: * is often used to show that the directives apply to all bots.

Top 10 widespread bots:

1. Googlebot,
2. Baiduspider,
3. MSN Bot/Bingbot,
4. Yandex Bot,
5. Soso Spider,
6. Exabot,
7. Sogou,
8. Google Plus Share,
9. Facebook External Hit,
10. Google Feedfetcher.

Then another bot and its directives can be listed, which will make another block.
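For illustration, two such blocks might look like the following sketch; the disallowed paths here are only placeholders, and each block applies to the agent named in its User-agent line:

User-agent: *
Disallow: /checkout/
Disallow: /customer/

User-agent: Yandex
Disallow: /catalogsearch/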

Disallow:

It is used for all the URLs (*) or specific URLs with a certain URL path.

Let me clarify. The various directives block or unblock not the files or directories themselves, but the URLs used to reach them. What is meant here is blocking or unblocking sets of URLs that contain a specified URL path.
One and the same web page can be available under one URL and blocked under a different one.

Allow:

It is used for all the URLs (*) or specific URLs with a certain URL path.

In fact, only the disallow directive was originally supported. That’s probably why you won’t find much documentation on the allow directive. Nevertheless, it is supported by some search engines and is handled differently. Studies showed that the number of characters used in the allow directive is critical in comparison to ‘disallow’: the ‘allow’ directive will beat ‘disallow’ only if it has as many or more characters in its path.
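To illustrate the rule with placeholder paths: in the sketch below the allow path /media/catalog/ (15 characters) is longer than the disallow path /media/ (7 characters), so search engines that honor this longest-match logic keep URLs under /media/catalog/ crawlable while blocking the rest of /media/:

User-agent: *
Disallow: /media/
Allow: /media/catalog/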

Sometimes Magento 2 store owners use the following lines to keep image crawlers out of everything except product and CMS media; since each Allow path is longer than the single-character Disallow: / path, the Allow rules win for those folders:

User-agent: Googlebot-Image
Disallow: /
Allow: /media/catalog/product/
Allow: /media/wysiwyg/
Allow: /media/images/

Crawl Delay

You can also set crawl priorities. The ‘crawl-delay’ directive is not supported by Google directly, but there is a possibility to lower the crawl rate inside Google Search Console. Considering Google’s market share, it’d be rational not to lower your Google crawl rate.
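For crawlers that do honor the directive – Bingbot and, historically, Yandex – a rough sketch of limiting them to about one request every ten seconds looks like this:

User-agent: Bingbot
Crawl-delay: 10

User-agent: Yandex
Crawl-delay: 10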

Pieces of Advice on Filling In the ‘Edit Custom Instruction of robots.txt file’ Field

1. System File Indexing Disallow

You’ve probably seen a great deal of such examples and recommendations for ‘Disallow’ lines:

# Directories

Disallow: /app/
Disallow: /bin/
Disallow: /dev/
Disallow: /lib/
Disallow: /phpserver/
Disallow: /pkginfo/
Disallow: /report/
Disallow: /setup/
Disallow: /update/
Disallow: /var/
Disallow: /vendor/

# Files

Disallow: /composer.json
Disallow: /composer.lock
Disallow: /CONTRIBUTING.md
Disallow: /CONTRIBUTOR_LICENSE_AGREEMENT.html
Disallow: /COPYING.txt
Disallow: /Gruntfile.js
Disallow: /LICENSE.txt
Disallow: /LICENSE_AFL.txt
Disallow: /nginx.conf.sample
Disallow: /package.json
Disallow: /php.ini.sample
Disallow: /RELEASE_NOTES.txt

In fact, there is NO need to specify ‘disallow’ for all the system files and directories. Web crawlers learned to recognize them a long while ago, and the last thing they want is to stuff their servers with terabytes of data on millions of identical CMS-related files.

2. Personal data

This might be evident, but never make your personal data easily accessible: your customer base, personal profiles, orders, credit card info, etc. Make sure to block access to such data if you don’t want it exposed.

So, stop crawling for:

  • User account and checkout pages

Disallow: /checkout/
Disallow: /onestepcheckout/
Disallow: /customer/
Disallow: /customer/account/
Disallow: /customer/account/login/

  • Search Pages

Make sure to block search pages from crawling in your online store. If you don’t, the number of indexed pages in your store may increase dramatically. Magento 2 search extensions come in handy here to check your search results URL. The following solution is most commonly used:

Disallow: /catalogsearch/

  • Native catalog pages in Magento 2 – the default URLs rather than the rewritten ones that customers actually see.
  • Pages with parameters that often appear after setting up Layered Navigation.
  • Sometimes webmasters also block pages with filters, for example:

Disallow: /*?dir*
Disallow: /*?dir=desc
Disallow: /*?dir=asc
Disallow: /*?limit=all
Disallow: /*?mode*

  • Pages that appear as a result of specific extensions and their functionality, for example:

Disallow: /tag/
Disallow: /customize/
Disallow: /sendfriend/
Disallow: /ajax/
Disallow: /sales/guest/form/

  • Duplicate content

I’d highly recommend checking for duplicate content with Google Search Console and Bing Webmaster Tools, and adding ‘disallow’ directives for the duplicates you find, for example:

Disallow: /review/
Disallow: /quickview/
Disallow: /productalert/

Now, with all the above-mentioned info in mind, it’s high time to create robots.txt for your Magento 2 store. Don’t forget to double check it with the robots.txt tester in Google Search Console.

test robots.txt in Google Search Console

In fact, using robots.txt to block URLs doesn’t guarantee that those pages won’t be shown as URL-only listings in search results. If you want to keep them out of the index completely, consider using the noindex meta tag on a per-page basis. Adding the following lines to the HTML head of a document tells crawlers not to index certain pages or follow certain links. Note that if you use both robots.txt and meta tags, the robots.txt block takes precedence: a crawler that is disallowed from a page never sees its meta tags, so leave such pages crawlable if you rely on noindex.

Here’s the code:
· <meta name="robots" content="noindex"> – this variant keeps the page out of the index but still lets crawlers follow its links,
· <meta name="robots" content="noindex,nofollow"> – this variant keeps the page out of the index and tells crawlers not to follow its links either.

Robots.txt Creation: 2 Variants to Consider

There are probably a great many ways to edit robots.txt. I can vouch for two of them:

1) Edit robots.txt manually. If you’ve got access to the website’s server, you can edit robots.txt in Notepad or any other plain-text editor and add it to the root directory. To do that:
· open Notepad and paste in your directives,
· save the file as robots.txt,
· upload the file to the web hosting root folder.

2) Use pre-tailored solutions. If you are looking for a less time-consuming option, you can make use of an SEO Magento 2 extension. Luckily, there is a wide range of choice on the market; just pick one that offers robots.txt editing functionality.
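Whichever variant you choose, the end result might resemble the sketch below, which pulls together the directives discussed above; the domain, sitemap filename and exact Disallow list are placeholders to adapt to your own catalog and extensions:

User-agent: *
Disallow: /checkout/
Disallow: /customer/
Disallow: /catalogsearch/
Disallow: /sendfriend/
Disallow: /*?dir*
Disallow: /*?limit=all

Host: www.domain.com
Sitemap: http://www.domain.com/sitemap_en.xml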

Bottom Line

There’s probably one thing to be added. You should always keep in mind that a robots.txt file covers only one domain. Thus, Magento websites with multiple domains or subdomains should each carry their own robots.txt file.

We hope you find this piece helpful!

 
