Crawl Budget: A Practical Guide for Large Websites
An 800-page website rarely needs to “get more crawl budget.” An 80,000-URL site created by filter combinations can make Googlebot spend most of its time on pages nobody wants indexed. The symptoms may look similar in Search Console, but the remedies are completely different.
This guide is for ecommerce sites, publishers, marketplaces, and programmatic SEO projects that see many URLs stuck in “Discovered – currently not indexed,” delayed sitemap discovery, or server logs dominated by parameterized pages.
What is crawl budget?
Crawl budget is the set of URLs Google can and wants to crawl on a site over a period of time. It has two components:
- Crawl capacity limit: how many requests Googlebot can make without overwhelming the server.
- Crawl demand: how much Google wants to revisit URLs based on freshness, popularity, quality, and value.
A faster server improves capacity, but it does not guarantee more crawling. If thousands of URLs are thin, nearly identical, or have no search demand, crawl demand remains low.
Google says crawl budget mainly matters for three kinds of sites: very large sites (roughly 1 million or more unique pages) whose content changes about once a week, medium or larger sites (roughly 10,000 or more unique pages) whose content changes daily, and sites with a large share of URLs in “Discovered – currently not indexed.” For a small blog, a clean sitemap and sensible internal linking are normally enough.
First confirm that crawl budget is the problem
Do not begin by editing robots.txt. Answer four questions:
- Are important URLs crawled soon after publication or meaningful updates?
- Do many valid URLs remain “Discovered – currently not indexed” for weeks?
- Does Crawl Stats show Googlebot spending requests on parameters, internal search, or duplicates?
- Does the server return frequent 5xx errors, timeouts, or slow responses during crawl spikes?
If important URLs are crawled promptly and the answer to the other three questions is no, the cause is more likely quality, internal linking, or indexability than crawl budget.
Start with server logs, not intuition
GA4 is not a crawl analysis tool. Googlebot behaves differently from a user and may not appear completely in analytics. Access logs show which URLs were requested, their status, and response time.
A useful baseline report includes:
| Data group | Question it answers |
|---|---|
| Most crawled URLs | Is Googlebot repeating low-value pages? |
| Status codes | Are redirects, 404s, 429s, or 5xx responses common? |
| Crawler type | Is it Googlebot Smartphone, Googlebot Image, or another crawler? |
| Response time | Which templates slow the server? |
| Crawl timing | Does crawling overlap with heavy batch jobs? |
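The first two rows of this report can be sketched in a few lines of Python. This is a minimal example, assuming the server writes the common "combined" access log format; the `summarize` function and the sample lines are illustrative, not taken from a real site.

```python
import re
from collections import Counter

# Matches the request, status, and user-agent fields of a "combined" log line.
LOG_PATTERN = re.compile(
    r'"(?P<method>[A-Z]+) (?P<url>\S+) [^"]*" (?P<status>\d{3}) \S+ '
    r'"[^"]*" "(?P<agent>[^"]*)"'
)

def summarize(lines, agent_token="Googlebot"):
    """Count requested URLs and status codes for one crawler's requests."""
    urls, statuses = Counter(), Counter()
    for line in lines:
        match = LOG_PATTERN.search(line)
        if not match or agent_token not in match["agent"]:
            continue  # unparseable line or a different client
        urls[match["url"]] += 1
        statuses[match["status"]] += 1
    return urls, statuses

GOOGLEBOT_UA = "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"

# Made-up sample lines for illustration only.
SAMPLE = [
    f'192.0.2.1 - - [10/May/2025:10:00:00 +0000] "GET /shoes?color=red HTTP/1.1" 200 512 "-" "{GOOGLEBOT_UA}"',
    f'192.0.2.1 - - [10/May/2025:10:00:05 +0000] "GET /shoes?color=red HTTP/1.1" 200 512 "-" "{GOOGLEBOT_UA}"',
    f'192.0.2.1 - - [10/May/2025:10:00:09 +0000] "GET /old-page HTTP/1.1" 301 0 "-" "{GOOGLEBOT_UA}"',
    '192.0.2.7 - - [10/May/2025:10:00:12 +0000] "GET / HTTP/1.1" 200 900 "-" "Mozilla/5.0 (X11; Linux x86_64)"',
]

urls, statuses = summarize(SAMPLE)
print(urls.most_common(5))
print(dict(statuses))
```

Matching on the user-agent string alone only filters the log; it does not prove the request came from Google.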
Do not trust a user agent blindly. Bad bots can claim to be Googlebot. Verify reverse DNS or official IP ranges before including requests in your crawl report.
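A reverse DNS check can be sketched as follows. It assumes the verification flow Google documents: reverse-resolve the IP, confirm the hostname belongs to a Google domain, then forward-resolve the hostname and confirm it returns the same IP. The function name and the injectable `reverse` and `forward` parameters are illustrative choices, and the default forward lookup handles IPv4 only.

```python
import socket

# Hostname suffixes Google documents for its crawlers.
GOOGLE_SUFFIXES = (".googlebot.com", ".google.com", ".googleusercontent.com")

def _reverse(ip):
    """Return the primary hostname for an IP address."""
    return socket.gethostbyaddr(ip)[0]

def _forward(host):
    """Return the IPv4 addresses a hostname resolves to."""
    return socket.gethostbyname_ex(host)[2]

def is_verified_googlebot(ip, reverse=_reverse, forward=_forward):
    """Reverse-resolve, check the domain, then confirm the forward lookup."""
    try:
        host = reverse(ip)
        if not host.endswith(GOOGLE_SUFFIXES):
            return False
        return ip in forward(host)
    except OSError:  # covers failed DNS lookups
        return False
```

DNS lookups are slow, so in practice the result is usually cached per IP before a large log file is processed.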
Seven ways to improve crawl efficiency
1. Reduce the URL inventory Google can discover
Faceted navigation is a common culprit. A category with 10 filters, multiple values per filter, and unrestricted combinations can generate tens of thousands of URLs.
Separate them into three groups:
- Search demand and unique value: index, self-canonicalize, and link internally.
- Useful to visitors but not intended for search: control crawling and indexing deliberately.
- No value: stop generating or linking to them.
Canonical tags do not eliminate crawling because Google must fetch the page to read the tag. The most effective fix is to stop distributing useless URLs.
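The arithmetic behind faceted URL growth is easy to check. The sketch below assumes each filter is single-select (either unset or set to exactly one value) and that parameter order is normalized; multi-select filters or free parameter ordering would produce even more URLs.

```python
from math import prod

def facet_url_count(values_per_filter):
    """Count distinct filtered URLs when each filter is unset or set to one value."""
    # Each filter contributes (values + 1) states; subtract 1 for the
    # unfiltered category page itself.
    return prod(v + 1 for v in values_per_filter) - 1

# Ten filters with only two values each.
print(facet_url_count([2] * 10))  # 3**10 - 1 = 59048
```

Even with just two values per filter, one category can expose tens of thousands of crawlable variants.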
2. Treat the sitemap as a priority list
A sitemap is not a database dump. Include only URLs that are:
- canonical;
- returning 200;
- indexable;
- substantial enough to be useful;
- assigned a truthful <lastmod> date.
Changing every lastmod whenever a footer deploys makes the signal less useful. Update it when primary content, product data, or meaningful information changes.
Use the Sitemap Checker to inspect sitemap indexes, URL counts, and lastmod coverage.
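The inclusion rules above can be applied in the sitemap generator itself. The sketch below assumes each page is a dictionary with hypothetical fields `url`, `canonical`, `status`, `indexable`, and `content_modified` (the date the primary content last changed); a real CMS will expose these differently.

```python
import xml.etree.ElementTree as ET

SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"

def build_sitemap(pages):
    """Return sitemap XML containing only canonical, indexable, 200-status pages."""
    urlset = ET.Element("urlset", xmlns=SITEMAP_NS)
    for page in pages:
        eligible = (
            page["status"] == 200
            and page["indexable"]
            and page["canonical"] == page["url"]
        )
        if not eligible:
            continue
        entry = ET.SubElement(urlset, "url")
        ET.SubElement(entry, "loc").text = page["url"]
        # Use the content change date, not the last deploy date.
        ET.SubElement(entry, "lastmod").text = page["content_modified"]
    return ET.tostring(urlset, encoding="unicode")

# Illustrative records only.
PAGES = [
    {"url": "https://example.com/a", "canonical": "https://example.com/a",
     "status": 200, "indexable": True, "content_modified": "2025-05-01"},
    {"url": "https://example.com/a?sort=price", "canonical": "https://example.com/a",
     "status": 200, "indexable": True, "content_modified": "2025-05-01"},
    {"url": "https://example.com/gone", "canonical": "https://example.com/gone",
     "status": 404, "indexable": False, "content_modified": "2024-01-15"},
]

print(build_sitemap(PAGES))
```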
3. Remove redirect chains and stale links
Googlebot can follow redirects, but A → B → C → D delays arrival and increases maintenance. Internal links should go directly to D, and the sitemap should contain only D.
After a migration, do not remove old redirects too early. Collapse chains into one hop and monitor logs to see whether legacy URLs still receive requests. The Redirect Chain Checker exposes every hop quickly.
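Collapsing chains into one hop amounts to resolving every source URL to its final destination. The sketch below assumes the redirect rules are available as a simple source-to-target mapping; the function name is illustrative.

```python
def final_destinations(redirects):
    """Map every redirecting URL straight to its final target; skip loops."""
    resolved = {}
    for start in redirects:
        seen = {start}
        target = redirects[start]
        while target in redirects:
            if target in seen:  # redirect loop: leave for manual review
                target = None
                break
            seen.add(target)
            target = redirects[target]
        if target is not None:
            resolved[start] = target
    return resolved

chain = {"/a": "/b", "/b": "/c", "/c": "/d"}
print(final_destinations(chain))  # {'/a': '/d', '/b': '/d', '/c': '/d'}
```

The resulting mapping can replace the original rules, so every legacy URL reaches its destination in one hop.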
4. Return meaningful status codes
A permanently removed product should return 404 or 410, not a “not found” page with status 200. Soft 404s keep crawlers processing URLs with no value.
Repeated 5xx errors and timeouts are more serious. When the server fails, crawl capacity can fall to avoid added load. Group errors by template and time to distinguish application defects from traffic spikes.
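Grouping errors by template and time can be sketched as below. The example assumes requests have already been parsed into `(ISO timestamp, path, status)` tuples, and `template_of` is a hypothetical rule that treats the first path segment as the template name; most sites will need their own mapping.

```python
from collections import Counter
from datetime import datetime

def template_of(path):
    """Hypothetical rule: treat the first path segment as the template name."""
    segment = path.split("?")[0].strip("/").split("/")[0]
    return segment or "home"

def server_errors_by_template_and_hour(requests):
    """Count 5xx responses per (template, hour) bucket."""
    buckets = Counter()
    for timestamp, path, status in requests:
        if 500 <= status <= 599:
            hour = datetime.fromisoformat(timestamp).strftime("%Y-%m-%d %H:00")
            buckets[(template_of(path), hour)] += 1
    return buckets

# Illustrative records only.
REQUESTS = [
    ("2025-05-10T02:15:00", "/product/123", 503),
    ("2025-05-10T02:40:00", "/product/456", 500),
    ("2025-05-10T02:41:00", "/category/shoes", 200),
    ("2025-05-10T14:05:00", "/search?q=x", 502),
]

for bucket, count in server_errors_by_template_and_hour(REQUESTS).most_common():
    print(bucket, count)
```

Errors concentrated in one template suggest an application defect; errors spread across templates in the same hour suggest load.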
5. Make pages cheaper to fetch and render
Fast responses, compact HTML, and stable resources let Google fetch more pages within the same crawl capacity. TTFB, caching, database queries, images, and JavaScript affect crawl efficiency as well as user experience.
JavaScript-heavy pages also need a rendering phase after HTML is crawled. Server-side rendering or pre-rendering critical content reduces dependence on the rendering queue.
6. Build internal links around business value
An important URL that exists only in a sitemap can still look weak. Categories, hubs, breadcrumbs, and related content should create natural crawl paths to priority products and landing pages.
A practical rule: if the content team cannot explain how a visitor reaches the page, a crawler may have the same difficulty.
7. Plan large publishing and update batches
Publishers and marketplaces can create load spikes by changing hundreds of thousands of records at once. Sensible batching, sitemap indexes by content type, and accurate lastmod values help crawlers focus on what changed.
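Splitting sitemaps by content type is mostly a chunking problem. The sketch below assumes URLs are already grouped by type and relies on the sitemap protocol's limit of 50,000 URLs per file; the file naming scheme is a hypothetical convention.

```python
MAX_URLS_PER_SITEMAP = 50_000  # per-file limit in the sitemap protocol

def plan_sitemap_files(urls_by_type, limit=MAX_URLS_PER_SITEMAP):
    """Return {filename: [urls]} with one or more files per content type."""
    files = {}
    for content_type, urls in urls_by_type.items():
        for index in range(0, len(urls), limit):
            name = f"sitemap-{content_type}-{index // limit + 1}.xml"
            files[name] = urls[index:index + limit]
    return files

# Small limit used here only to make the chunking visible.
plan = plan_sitemap_files(
    {"products": [f"/p/{i}" for i in range(5)], "articles": ["/a/1"]},
    limit=2,
)
for name, urls in plan.items():
    print(name, len(urls))
```

The resulting file names would then be listed in a sitemap index, so a change in one content type touches only its own files.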
Crawl-budget “hacks” that backfire
Adding noindex everywhere
Google must crawl a URL to see noindex, so the tag does not save the request. Rules in robots.txt should not be treated as a casual budget dial either; blocked URLs can remain known and queued for a long time.
Blocking CSS and JavaScript
Google needs key resources to render and evaluate pages. Blocking them can make a page appear incomplete or unfriendly on mobile.
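A quick way to catch accidental blocks is to test resource URLs against the robots.txt rules. The sketch below uses Python's standard `urllib.robotparser`, which handles simple prefix rules but does not replicate Google's wildcard and longest-match behavior, so treat it as a first-pass check only. The rules and URLs are illustrative.

```python
from urllib.robotparser import RobotFileParser

def blocked_resources(robots_txt, resource_urls, agent="Googlebot"):
    """Return the URLs that the given robots.txt text disallows for the agent."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return [url for url in resource_urls if not parser.can_fetch(agent, url)]

# Illustrative rules and URLs only.
ROBOTS = "User-agent: *\nDisallow: /assets/\n"
RESOURCES = [
    "https://example.com/assets/app.css",
    "https://example.com/assets/app.js",
    "https://example.com/products/1",
]

print(blocked_resources(ROBOTS, RESOURCES))
```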
Replacing sitemaps to force priority
Sitemaps support discovery; they do not allocate a guaranteed quota. A clean sitemap clarifies inventory, but it cannot make thin content worth crawling.
Raising crawl rate at any cost
If the server is slow or failing, more crawl activity makes the problem worse. Restore crawl health before worrying about crawl volume.
A 14-day plan for a large site
Days 1–3: establish a baseline
Export Crawl Stats, indexing reports, and 7–14 days of access logs. Group URLs by template, status, parameter type, and response time.
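Grouping by parameter type can start with a simple count of which query parameter names appear in crawled URLs. This is a minimal sketch using the standard library; the function name and sample URLs are illustrative.

```python
from collections import Counter
from urllib.parse import parse_qsl, urlsplit

def parameter_profile(urls):
    """Count how many crawled URLs carry each query parameter name."""
    counts = Counter()
    for url in urls:
        query = urlsplit(url).query
        names = {name for name, _ in parse_qsl(query, keep_blank_values=True)}
        counts.update(names or {"(no parameters)"})
    return counts

# Illustrative URLs only.
CRAWLED = [
    "/shoes?color=red&size=9",
    "/shoes?color=blue",
    "/shoes",
    "/search?q=boots&page=2",
]

for name, count in parameter_profile(CRAWLED).most_common():
    print(name, count)
```

Parameters that dominate the count but have no search value are the first candidates for the waste list.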
Days 4–6: select the three largest waste sources
They are often faceted URLs, redirect chains, and soft 404s. Fixing 20 issues at once makes attribution impossible.
Days 7–10: fix the generator
Remove parameter links from components, normalize URL creation, update sitemap generators, and return correct status codes. Test on staging, but remember that crawlers only see the public environment, so verify the changes again after release.
Days 11–14: monitor the release
Compare requests to priority URLs, 5xx frequency, response time, and indexing movement. Crawl patterns do not change overnight, but logs provide earlier evidence than aggregate reports.
Search Console versus logs
Search Console is best for crawl trends, indexing states, and site-wide errors. Logs answer the detailed questions: which bot requested which URL, at what time, with what status and latency.
The sources complement each other. Search Console provides the map; logs show the footprints.
What success should look like
Do not judge the project by total Googlebot requests. A successful crawl-efficiency release may keep request volume flat while changing where those requests go.
Track a small scorecard for four to six weeks:
- the share of Googlebot requests reaching canonical 200 pages;
- requests wasted on parameters, redirects, soft 404s, and errors;
- median response time for the most crawled templates;
- time between publishing an important URL and its first crawl;
- the number of valuable URLs remaining discovered but not indexed;
- server errors during crawl peaks.
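Two of these metrics can be sketched as follows. The example assumes verified Googlebot requests have been parsed into dictionaries with hypothetical `status` and `is_canonical` fields, and that publication and crawl times are ISO timestamps in the same time zone.

```python
from datetime import datetime

def canonical_200_share(requests):
    """Share of requests that returned 200 on a canonical URL."""
    if not requests:
        return 0.0
    good = sum(1 for r in requests if r["status"] == 200 and r["is_canonical"])
    return good / len(requests)

def hours_to_first_crawl(published_at, crawl_times):
    """Hours from publication to the first crawl at or after it, or None."""
    published = datetime.fromisoformat(published_at)
    crawls = [datetime.fromisoformat(t) for t in crawl_times]
    later = [t for t in crawls if t >= published]
    if not later:
        return None
    return (min(later) - published).total_seconds() / 3600

# Illustrative records only.
REQUESTS = [
    {"status": 200, "is_canonical": True},
    {"status": 200, "is_canonical": True},
    {"status": 200, "is_canonical": False},
    {"status": 301, "is_canonical": False},
]

print(canonical_200_share(REQUESTS))  # 0.5
print(hours_to_first_crawl(
    "2025-05-10T08:00:00",
    ["2025-05-10T06:00:00", "2025-05-10T14:30:00", "2025-05-11T09:00:00"],
))  # 6.5
```

Recording both numbers weekly makes it clear whether requests are shifting toward priority URLs even when total volume stays flat.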
The best outcome is not “more crawling” in isolation. It is faster discovery of important changes, fewer requests to useless variants, and stable server performance.
Seasonality matters. A marketplace before a major sale, a publisher during breaking news, or a catalog after a migration can experience temporary demand changes. Compare equivalent periods and annotate releases so one busy week does not become a false conclusion.
Finally, separate crawl from indexing. A URL can be fetched frequently and still remain unindexed because its content is thin, duplicate, or not useful enough. Once logs show Google reaches the page reliably, stop tuning crawl rules and improve the page itself.
Starting point: run a free SEO audit to find sitemap, robots, redirect, and indexability issues, then use server logs to measure actual crawling on a large site.
Conclusion
Crawl-budget optimization is not about finding a trick that makes Googlebot visit more often. It is about directing existing crawl activity toward valuable URLs while the server responds quickly and reliably. Reduce wasteful inventory, keep sitemaps honest, fix status codes, shorten redirects, and link internally by priority. If the site is not large enough for crawl budget to matter, improving content quality is usually the better investment.
References: Google Search Central – Crawl Budget Management and Googlebot documentation.
Frequently asked questions
How large must a site be before crawl budget matters?
Does submitting a sitemap increase crawl budget?
Should I use noindex to save crawl budget?