Why are only a few of my website’s pages being crawled?
If you’ve noticed that only 4-6 pages of your website are being crawled (your homepage, sitemap URLs, and robots.txt), this is most likely because our bot couldn’t find outgoing internal links on your homepage. Below you will find the possible reasons for this issue.
There might be no outgoing internal links on the main page, or they might be wrapped in JavaScript. If you have a Pro subscription, our bot won’t parse JavaScript content, so if your homepage’s links to the rest of your site are hidden in JavaScript elements, we will not read them or crawl those pages.
Although crawling JavaScript content is only available for Guru and Business users, we can crawl the HTML of a page with JS elements, and we can review the parameters of your JS and CSS files with our Performance checks regardless of your subscription type (Pro, Guru, or Business).
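To see whether your internal links are visible without JavaScript rendering, you can fetch your homepage’s raw HTML and count the anchor tags it contains. The sketch below is a minimal, illustrative example only (not part of the Semrush toolset); it assumes the `requests` and `beautifulsoup4` packages are installed and uses a placeholder domain.

```python
# Minimal sketch: count internal links present in the raw
# (non-JavaScript-rendered) HTML of a homepage. Placeholder domain below.
from urllib.parse import urljoin, urlparse

import requests
from bs4 import BeautifulSoup

HOMEPAGE = "https://www.example.com/"  # replace with your site

html = requests.get(HOMEPAGE, timeout=10).text
soup = BeautifulSoup(html, "html.parser")

site_host = urlparse(HOMEPAGE).netloc
internal_links = {
    urljoin(HOMEPAGE, a["href"])
    for a in soup.find_all("a", href=True)
    if urlparse(urljoin(HOMEPAGE, a["href"])).netloc == site_host
}

print(f"Internal links found in raw HTML: {len(internal_links)}")
# If this prints 0 (or close to it) while your rendered page clearly shows a
# navigation menu, the links are most likely injected by JavaScript.
```

If the script finds no internal links even though your rendered page has a visible menu, a crawl without JS rendering will not be able to follow them.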
In both cases, there is a way to ensure that our bot will crawl your pages. To do this, you need to change the crawl source from “website” to “sitemap” or “URLs from file” in your campaign settings:
“Website” is the default source. It means we will crawl your website using a breadth-first search algorithm, navigating through the links we find in your pages’ code, starting from the homepage.
If you choose one of the other options, we will crawl links that are found in the sitemap or in the file you upload.
Our crawler could also have been blocked on some pages by the website’s robots.txt file or by noindex/nofollow tags. You can check whether this is the case in your Crawled Pages report.
You can inspect your robots.txt file for any Disallow directives that would prevent crawlers like ours from accessing your website.
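One way to check this yourself is with Python’s built-in robots.txt parser, as in the sketch below. The domain and the user-agent string are placeholders used for illustration; check Semrush’s documentation for the exact user agent the Site Audit bot uses.

```python
# Minimal sketch: test whether a site's robots.txt allows a given user agent
# to fetch a URL. Domain and user-agent values below are placeholders.
from urllib.robotparser import RobotFileParser

SITE = "https://www.example.com"   # replace with your site
USER_AGENT = "SiteAuditBot"        # assumed name; verify the real user agent
TEST_URL = f"{SITE}/some-page/"    # a page you expect to be crawled

parser = RobotFileParser()
parser.set_url(f"{SITE}/robots.txt")
parser.read()

if parser.can_fetch(USER_AGENT, TEST_URL):
    print("robots.txt allows crawling of", TEST_URL)
else:
    print("robots.txt blocks", USER_AGENT, "from", TEST_URL)
```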
If you see the code shown below on the main page of a website, it tells us that we’re not allowed to index the page or follow its links, so our access is blocked. Likewise, a page whose robots meta tag contains "nofollow" or "none" will lead to a crawling error.
<meta name="robots" content="noindex, nofollow">
You will find more information about these errors in our troubleshooting article.
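If you prefer to check a page for such tags programmatically, a quick script like the one below can help. It is a sketch only, using the `requests` and `beautifulsoup4` packages and a placeholder URL.

```python
# Minimal sketch: look for robots meta directives that block indexing or
# link following on a page. Placeholder URL below.
import requests
from bs4 import BeautifulSoup

PAGE = "https://www.example.com/"  # replace with the page you want to check
BLOCKING_VALUES = {"noindex", "nofollow", "none"}

soup = BeautifulSoup(requests.get(PAGE, timeout=10).text, "html.parser")

for tag in soup.find_all("meta", attrs={"name": "robots"}):
    directives = {d.strip().lower() for d in tag.get("content", "").split(",")}
    blocking = directives & BLOCKING_VALUES
    if blocking:
        print("Blocking robots directives found:", ", ".join(sorted(blocking)))
        break
else:
    print("No blocking robots meta tag found on", PAGE)
```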
Page size can also be the problem: the HTML size limit for pages other than the homepage is 2 MB. If a page’s HTML exceeds that limit, the audit will report an error for it.
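As a rough self-check, you can compare a page’s HTML size against the 2 MB limit mentioned above. The sketch below uses the `requests` package and a placeholder URL.

```python
# Minimal sketch: compare a page's HTML size against a 2 MB limit.
# Placeholder URL below.
import requests

PAGE = "https://www.example.com/very-long-page/"  # replace with your page
LIMIT_BYTES = 2 * 1024 * 1024                     # 2 MB

html_bytes = len(requests.get(PAGE, timeout=10).content)
print(f"HTML size: {html_bytes / 1024:.1f} KB")

if html_bytes > LIMIT_BYTES:
    print("This page exceeds the 2 MB HTML size limit and may trigger an error.")
else:
    print("This page is within the 2 MB HTML size limit.")
```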