Crawl Budget: A Practical Guide for Large Websites

An 800-page website rarely needs to “get more crawl budget.” An 80,000-URL site created by filter combinations can make Googlebot spend most of its time on pages nobody wants indexed. The symptoms may look similar in Search Console, but the remedies are completely different.

This guide is for ecommerce sites, publishers, marketplaces, and programmatic SEO projects that see many URLs stuck in “Discovered – currently not indexed,” delayed sitemap discovery, or server logs dominated by parameterized pages.

What is crawl budget?

Crawl budget is the set of URLs Google can and wants to crawl on a site over a period of time. It has two components:

  • Crawl capacity limit: how many requests Googlebot can make without overwhelming the server.
  • Crawl demand: how much Google wants to revisit URLs based on freshness, popularity, quality, and value.

A faster server improves capacity, but it does not guarantee more crawling. If thousands of URLs are thin, nearly identical, or have no search demand, crawl demand remains low.

Google says crawl budget is mainly relevant to very large sites (its documentation cites roughly a million or more unique pages), sites with roughly 10,000 or more URLs that change daily, or sites with a large share of “Discovered – currently not indexed” URLs. For a small blog, a clean sitemap and sensible internal linking are normally enough.

First confirm that crawl budget is the problem

Do not begin by editing robots.txt. Answer four questions:

  1. Are important URLs crawled soon after publication or meaningful updates?
  2. Do many valid URLs remain “Discovered – currently not indexed” for weeks?
  3. Does Crawl Stats show Googlebot spending requests on parameters, internal search, or duplicates?
  4. Does the server return frequent 5xx, timeouts, or slow responses during crawl spikes?

If important URLs are crawled promptly and the answer to questions 2 to 4 is no, the cause is more likely quality, internal linking, or indexability than crawl budget.

Start with server logs, not intuition

GA4 is not a crawl analysis tool. It depends on a JavaScript tag and filters known bot traffic, so Googlebot requests are largely missing from analytics reports. Access logs show which URLs were requested, their status, and (if the log format records it) response time.

A useful baseline report includes:

Data group         | Question it answers
Most crawled URLs  | Is Googlebot repeating low-value pages?
Status codes       | Are redirects, 404s, 429s, or 5xx responses common?
Crawler type       | Is it Smartphone, Image, or another crawler?
Response time      | Which templates slow the server?
Crawl timing       | Does crawling overlap with heavy batch jobs?
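
A first pass at the report can be sketched in a few lines of Python. This is a minimal example, assuming the server writes the Combined Log Format; the sample lines, paths, and the function name googlebot_baseline are made up for illustration. Response time is left out because the default combined format does not record it, and the user-agent match here is only a first filter, not verification.

```python
import re
from collections import Counter

# Combined Log Format (the layout assumed here):
# ip ident user [time] "METHOD path PROTOCOL" status bytes "referer" "user-agent"
LOG_PATTERN = re.compile(
    r'(?P<ip>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+) [^"]*" '
    r'(?P<status>\d{3}) \S+ "[^"]*" "(?P<agent>[^"]*)"'
)

def googlebot_baseline(lines):
    """Count paths and status codes for lines whose user agent claims Googlebot."""
    paths, statuses = Counter(), Counter()
    for line in lines:
        match = LOG_PATTERN.match(line)
        if match is None or "Googlebot" not in match["agent"]:
            continue  # unparseable line or a different client
        paths[match["path"]] += 1
        statuses[match["status"]] += 1
    return paths, statuses

# Made-up sample lines for illustration only.
BOT_UA = "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"
SAMPLE_LINES = [
    f'66.249.66.1 - - [10/Mar/2025:06:12:01 +0000] "GET /shoes?color=red HTTP/1.1" 200 5120 "-" "{BOT_UA}"',
    f'66.249.66.1 - - [10/Mar/2025:06:12:09 +0000] "GET /shoes?color=red HTTP/1.1" 200 5120 "-" "{BOT_UA}"',
    f'66.249.66.1 - - [10/Mar/2025:06:13:30 +0000] "GET /old-page HTTP/1.1" 301 0 "-" "{BOT_UA}"',
    '198.51.100.7 - - [10/Mar/2025:06:14:00 +0000] "GET /shoes HTTP/1.1" 200 5120 "-" "Mozilla/5.0"',
]

paths, statuses = googlebot_baseline(SAMPLE_LINES)
print(paths.most_common(5))
print(dict(statuses))
```

On a real site the same counters would be fed from the log file line by line, so the whole file never has to sit in memory.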

Do not trust a user agent blindly. Bad bots can claim to be Googlebot. Verify reverse DNS or official IP ranges before including requests in your crawl report.
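
The reverse DNS check can be sketched as follows. This is a hedged example, not an official implementation: the suffix list covers the googlebot.com and google.com hostnames and may need extending for other Google fetchers, the function name is hypothetical, and socket.gethostbyname_ex only resolves IPv4 addresses. The resolver functions are passed in as parameters so the logic can be exercised without network access.

```python
import socket

# Hostname suffixes accepted in this sketch (an assumption; extend as needed).
GOOGLE_SUFFIXES = (".googlebot.com", ".google.com")

def is_verified_googlebot(ip,
                          reverse_lookup=socket.gethostbyaddr,
                          forward_lookup=socket.gethostbyname_ex):
    """Reverse-resolve the IP, check the hostname, then forward-confirm it."""
    try:
        hostname = reverse_lookup(ip)[0]
        if not hostname.endswith(GOOGLE_SUFFIXES):
            return False
        forward_ips = forward_lookup(hostname)[2]
    except OSError:  # covers socket.herror and socket.gaierror
        return False
    # The hostname must resolve back to the address that made the request.
    return ip in forward_ips
```

DNS lookups are slow, so in practice the result is usually cached per IP address rather than recomputed for every log line.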

Seven ways to improve crawl efficiency

1. Reduce the URL inventory Google can discover

Faceted navigation is a common culprit. A category with 10 filters, multiple values per filter, and unrestricted combinations can generate tens of thousands of URLs.
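
The arithmetic is easy to check. The sketch below assumes single-select filters and a fixed parameter order, so each filter is either unset or set to exactly one value; multi-select filters or free parameter ordering would push the number far higher. The function name is hypothetical.

```python
from math import prod

def count_filtered_urls(values_per_filter):
    """Count filter combinations, excluding the unfiltered category page."""
    # Each filter contributes (number of values + 1) states: one per value, plus "unset".
    return prod(v + 1 for v in values_per_filter) - 1

# Ten filters with only two values each already give 3**10 - 1 URLs.
print(count_filtered_urls([2] * 10))  # 59048
```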

Separate them into three groups:

  • Search demand and unique value: index, self-canonicalize, and link internally.
  • Useful to visitors but not intended for search: control crawling and indexing deliberately.
  • No value: stop generating or linking to them.

Canonical tags do not eliminate crawling because Google must fetch the page to read the tag. The most effective fix is to stop distributing useless URLs.

2. Treat the sitemap as a priority list

A sitemap is not a database dump. Include only URLs that are:

  • canonical;
  • returning 200;
  • indexable;
  • substantial enough to be useful;
  • assigned a truthful <lastmod> date.

Changing every lastmod whenever a footer deploys makes the signal less useful. Update it when primary content, product data, or meaningful information changes.
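
One way to enforce these rules is to apply them in the sitemap generator itself. The following is a minimal sketch, assuming a hypothetical Page record whose content_updated field is set only when primary content changes; the field names and the build_sitemap function are illustrative, not part of any particular CMS. The "substantial enough" rule is editorial and is not modeled here.

```python
from dataclasses import dataclass
from datetime import date
import xml.etree.ElementTree as ET

@dataclass
class Page:  # hypothetical record; adapt the fields to your own data model
    url: str
    canonical_url: str
    status: int
    indexable: bool
    content_updated: date  # last meaningful content change, not last deploy

def build_sitemap(pages):
    """Emit only canonical, indexable, 200-status URLs with a content-based lastmod."""
    urlset = ET.Element("urlset", xmlns="http://www.sitemaps.org/schemas/sitemap/0.9")
    for page in pages:
        eligible = (page.status == 200 and page.indexable
                    and page.url == page.canonical_url)
        if not eligible:
            continue
        entry = ET.SubElement(urlset, "url")
        ET.SubElement(entry, "loc").text = page.url
        ET.SubElement(entry, "lastmod").text = page.content_updated.isoformat()
    return ET.tostring(urlset, encoding="unicode")

pages = [
    Page("https://example.com/a", "https://example.com/a", 200, True, date(2025, 3, 1)),
    Page("https://example.com/a?sort=price", "https://example.com/a", 200, True, date(2025, 3, 1)),
    Page("https://example.com/gone", "https://example.com/gone", 404, True, date(2024, 1, 5)),
]
print(build_sitemap(pages))
```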

Use the Sitemap Checker to inspect sitemap indexes, URL counts, and lastmod coverage.

3. Shorten redirect chains

Googlebot can follow redirects, but A → B → C → D delays arrival and increases maintenance. Internal links should go directly to D, and the sitemap should contain only D.

After a migration, do not remove old redirects too early. Collapse chains into one hop and monitor logs to see whether legacy URLs still receive requests. The Redirect Chain Checker exposes every hop quickly.
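
Collapsing chains is mechanical once the redirect rules are available as a mapping. This is a minimal sketch, assuming the rules can be exported as a dictionary of source path to target path; the function name and sample paths are hypothetical.

```python
def collapse_redirects(redirects):
    """Map every redirecting URL straight to its final destination."""
    collapsed = {}
    for start in redirects:
        seen = {start}
        target = redirects[start]
        # Follow the chain until the target no longer redirects anywhere.
        while target in redirects:
            if target in seen:
                raise ValueError(f"redirect loop involving {target}")
            seen.add(target)
            target = redirects[target]
        collapsed[start] = target
    return collapsed

chain = {"/a": "/b", "/b": "/c", "/c": "/d"}
print(collapse_redirects(chain))  # {'/a': '/d', '/b': '/d', '/c': '/d'}
```

Raising on loops is a deliberate choice here: a loop is a configuration defect that should be fixed, not silently skipped.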

4. Return meaningful status codes

A permanently removed product should return 404 or 410, not a “not found” page with status 200. Soft 404s keep crawlers processing URLs with no value.
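
Soft 404 candidates can be flagged with a rough heuristic before anyone reviews them by hand. The sketch below is an assumption-heavy example: the phrase list is hypothetical and must be adapted to the wording your own templates use, and a match only means the page deserves a look.

```python
import re

# Hypothetical phrases; replace with the wording your templates actually use.
NOT_FOUND_HINTS = re.compile(
    r"page not found|no longer available|product has been removed",
    re.IGNORECASE,
)

def looks_like_soft_404(status, html):
    """Flag responses that say 'not found' in the body but return status 200."""
    return status == 200 and bool(NOT_FOUND_HINTS.search(html))

print(looks_like_soft_404(200, "<h1>Page not found</h1>"))  # True
print(looks_like_soft_404(404, "<h1>Page not found</h1>"))  # False
```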

Repeated 5xx errors and timeouts are more serious. When the server fails, crawl capacity can fall to avoid added load. Group errors by template and time to distinguish application defects from traffic spikes.
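
Grouping by template and hour might look like the sketch below. It assumes log records have already been parsed into (timestamp, path, status) tuples and uses a hypothetical rule that the first path segment identifies the template; real sites usually need a routing table instead.

```python
from collections import Counter
from datetime import datetime

def template_of(path):
    """Hypothetical rule: the first path segment names the template."""
    segment = path.split("?", 1)[0].strip("/").split("/", 1)[0]
    return segment or "home"

def errors_by_template_and_hour(records):
    """Count 5xx responses per (template, hour) bucket."""
    buckets = Counter()
    for timestamp, path, status in records:
        if 500 <= status <= 599:
            hour = timestamp.replace(minute=0, second=0, microsecond=0)
            buckets[(template_of(path), hour)] += 1
    return buckets

# Made-up records for illustration.
records = [
    (datetime(2025, 3, 10, 2, 5), "/product/123", 503),
    (datetime(2025, 3, 10, 2, 40), "/product/456?ref=x", 500),
    (datetime(2025, 3, 10, 2, 41), "/category/shoes", 200),
    (datetime(2025, 3, 10, 9, 15), "/", 502),
]
for (template, hour), count in errors_by_template_and_hour(records).most_common():
    print(template, hour.isoformat(), count)
```

Errors concentrated in one template point toward an application defect; errors spread across templates within the same hour point toward load.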

5. Make pages cheaper to fetch and render

Fast responses, compact HTML, and stable resources let Google process more content under the same conditions. TTFB, caching, database queries, images, and JavaScript affect crawl efficiency as well as user experience.

JavaScript-heavy pages also need a rendering phase after HTML is crawled. Server-side rendering or pre-rendering critical content reduces dependence on the rendering queue.

6. Strengthen internal links to priority pages

An important URL that exists only in a sitemap can still look weak. Categories, hubs, breadcrumbs, and related content should create natural crawl paths to priority products and landing pages.

A practical rule: if the content team cannot explain how a visitor reaches the page, a crawler may have the same difficulty.

7. Plan large publishing and update batches

Publishers and marketplaces can create load spikes by changing hundreds of thousands of records at once. Sensible batching, sitemap indexes by content type, and accurate lastmod values help crawlers focus on what changed.
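
A sitemap index split by content type can carry that signal. The following is a minimal sketch, assuming each child sitemap is described by the content-change dates of the URLs it lists; the sitemap URLs and the function name are hypothetical.

```python
from datetime import date
import xml.etree.ElementTree as ET

def build_sitemap_index(child_sitemaps):
    """child_sitemaps maps a sitemap URL to the change dates of the URLs it lists."""
    index = ET.Element("sitemapindex", xmlns="http://www.sitemaps.org/schemas/sitemap/0.9")
    for sitemap_url, change_dates in sorted(child_sitemaps.items()):
        entry = ET.SubElement(index, "sitemap")
        ET.SubElement(entry, "loc").text = sitemap_url
        if change_dates:
            # The child's lastmod is the newest real content change inside it.
            ET.SubElement(entry, "lastmod").text = max(change_dates).isoformat()
    return ET.tostring(index, encoding="unicode")

children = {
    "https://example.com/sitemap-products.xml": [date(2025, 3, 9), date(2025, 2, 1)],
    "https://example.com/sitemap-articles.xml": [date(2024, 12, 20)],
}
print(build_sitemap_index(children))
```

With this layout, a batch that only touches products changes a single child entry, and the articles sitemap keeps its older date.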

Crawl-budget “hacks” that backfire

Adding noindex everywhere

Google must crawl a URL to see noindex, so the tag does not eliminate the request. Robots rules should not be treated as a casual budget dial either; blocked URLs can remain known and queued for a long time.

Blocking CSS and JavaScript

Google needs key resources to render and evaluate pages. Blocking them can make a page appear incomplete or unfriendly on mobile.

Replacing sitemaps to force priority

Sitemaps support discovery; they do not allocate a guaranteed quota. A clean sitemap clarifies inventory, but it cannot make thin content worth crawling.

Raising crawl rate at any cost

If the server is slow or failing, more crawl activity makes the problem worse. Restore crawl health before worrying about crawl volume.

A 14-day plan for a large site

Days 1–3: establish a baseline

Export Crawl Stats, indexing reports, and 7–14 days of access logs. Group URLs by template, status, parameter type, and response time.
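
Grouping by parameter type can be sketched with the standard library. The mapping of parameter names to types below is a hypothetical example and has to be replaced with the parameters your own site generates.

```python
from collections import Counter
from urllib.parse import urlsplit, parse_qsl

# Hypothetical mapping; list the parameters your own site actually uses.
PARAM_TYPES = {
    "color": "facet", "size": "facet", "brand": "facet",
    "sort": "sorting", "page": "pagination",
    "q": "internal_search", "utm_source": "tracking",
}

def classify(url):
    """Label a URL by the kinds of query parameters it carries."""
    query = urlsplit(url).query
    names = [name for name, _ in parse_qsl(query, keep_blank_values=True)]
    types = sorted({PARAM_TYPES.get(name, "unknown") for name in names})
    return "+".join(types) if types else "clean"

urls = [
    "https://example.com/shoes",
    "https://example.com/shoes?color=red&sort=price",
    "https://example.com/search?q=boots",
    "https://example.com/shoes?sessionid=abc",
]
print(Counter(classify(u) for u in urls))
```

The "unknown" bucket is worth reviewing first, since it tends to surface parameters nobody remembers adding.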

Days 4–6: select the three largest waste sources

They are often faceted URLs, redirect chains, and soft 404s. Fixing 20 issues at once makes attribution impossible.

Days 7–10: fix the generator

Remove parameter links from components, normalize URL creation, update sitemap generators, and return correct status codes. Test on staging, but remember that crawlers only evaluate the public environment.

Days 11–14: monitor the release

Compare requests to priority URLs, 5xx frequency, response time, and indexing movement. Crawl patterns do not change overnight, but logs provide earlier evidence than aggregate reports.

Search Console versus logs

Search Console is best for crawl trends, indexing states, and site-wide errors. Logs answer the detailed questions: which bot requested which URL, at what time, with what status and latency.

The sources complement each other. Search Console provides the map; logs show the footprints.

What success should look like

Do not judge the project by total Googlebot requests. A successful crawl-efficiency release may keep request volume flat while changing where those requests go.

Track a small scorecard for four to six weeks:

  • the share of Googlebot requests reaching canonical 200 pages;
  • requests wasted on parameters, redirects, soft 404s, and errors;
  • median response time for the most crawled templates;
  • time between publishing an important URL and its first crawl;
  • the number of valuable URLs remaining discovered but not indexed;
  • server errors during crawl peaks.
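
Two of these numbers can be computed directly from parsed logs. This is a minimal sketch, assuming records of (path, status, response time in milliseconds) for verified Googlebot requests and a set of canonical paths taken from the sitemap; the function name and sample values are made up.

```python
from statistics import median

def crawl_scorecard(records, canonical_paths):
    """Share of requests reaching canonical 200 pages, plus median response time."""
    total = len(records)
    if total == 0:
        return {"canonical_200_share": 0.0, "median_response_ms": None}
    useful = sum(
        1 for path, status, _ in records
        if status == 200 and path in canonical_paths
    )
    return {
        "canonical_200_share": useful / total,
        "median_response_ms": median(ms for _, _, ms in records),
    }

# Made-up records for illustration.
records = [
    ("/a", 200, 120),
    ("/a?sort=price", 200, 300),
    ("/old", 301, 40),
    ("/b", 200, 80),
]
print(crawl_scorecard(records, {"/a", "/b"}))
# {'canonical_200_share': 0.5, 'median_response_ms': 100.0}
```

Running the same function on each week's logs gives a trend line that is independent of total request volume.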

The best outcome is not “more crawling” in isolation. It is faster discovery of important changes, fewer requests to useless variants, and stable server performance.

Seasonality matters. A marketplace before a major sale, a publisher during breaking news, or a catalog after a migration can experience temporary demand changes. Compare equivalent periods and annotate releases so one busy week does not become a false conclusion.

Finally, separate crawl from indexing. A URL can be fetched frequently and still remain unindexed because its content is thin, duplicate, or not useful enough. Once logs show Google reaches the page reliably, stop tuning crawl rules and improve the page itself.

Starting point: run a free SEO audit to find sitemap, robots, redirect, and indexability issues, then use server logs to measure actual crawling on a large site.

Conclusion

Crawl-budget optimization is not about finding a trick that makes Googlebot visit more often. It is about directing existing crawl activity toward valuable URLs while the server responds quickly and reliably. Reduce wasteful inventory, keep sitemaps honest, fix status codes, shorten redirects, and link internally by priority. If the site is not large enough for crawl budget to matter, improving content quality is usually the better investment.

References: Google Search Central – Crawl Budget Management and Googlebot documentation.

Frequently asked questions

How large must a site be before crawl budget matters?
There is no hard threshold. Google mainly points to very large sites, sites with roughly 10,000 or more URLs that change daily, or sites where many URLs are stuck as “Discovered – currently not indexed.”
Does submitting a sitemap increase crawl budget?
A sitemap supports discovery and prioritization, but it does not guarantee crawl quota. Content value, freshness, popularity, and server health still shape demand.
Should I use noindex to save crawl budget?
Noindex does not prevent a request because Google must crawl the page to read it. Reduce waste at the URL generator and internal-link level instead.