Good Bots vs Bad Bots: Classify Before You Block

An IP requests 600 pages in 10 minutes. If it is a price scraper, you want it stopped. If it is Googlebot discovering a new sitemap, blocking it can delay the indexing of thousands of product pages. Both look equally “fast,” but the right operational decisions are opposite.

Good and bad bots cannot be separated by user agent, speed, or JavaScript support alone. Answer three questions: who is it, what is it doing, and what impact does it create?

Good, bad, and unknown bots

Good bots

A good bot serves a purpose the site accepts or benefits from:

  • search engine crawlers;
  • link, uptime, or security monitors you configured;
  • social sharing preview fetchers;
  • authorized accessibility or archiving services;
  • partner API clients under an agreement.

Good does not mean unrestricted. Googlebot follows robots.txt rules when it crawls automatically; your own monitor should still have endpoint and rate boundaries.

Bad bots

Bad automation creates damage or accesses the site against its purpose:

  • fraudulent ad clicks;
  • price, content, or customer-data scraping;
  • credential stuffing and password attacks;
  • account creation, form spam, or inventory holding;
  • vulnerability scanning;
  • fake traffic that corrupts analytics.

Unknown bots

This is often the largest group: SEO tools, mobile applications, headless browsers, testing proxies, or new AI services. Immediate blocking may break an integration; unrestricted access creates risk.

Rate limiting and observation are often better than a binary decision for this group.

A user agent is not identification

A request saying:

Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)

has not proven that it came from Google. Any script can copy that text.

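Google documents two ways to verify a crawler; either one is enough:

  1. run a reverse DNS lookup on the source IP and check that it returns a Google-owned hostname, then run a forward DNS lookup on that hostname and check that it resolves back to the same IP;
  2. compare the source IP with Google's published crawler IP ranges.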

Allowlisting only by user agent invites impersonation. Blocking by user agent alone misses bots pretending to be browsers.
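
The two verification methods can be sketched in Python. This is a minimal illustration, not Google's reference implementation: the function names are hypothetical, the hostname suffixes are the ones Google's crawler documentation lists, and the caller is assumed to supply the published IP ranges it has downloaded. The resolvers are injectable so the logic can be tested without network access.

```python
import ipaddress
import socket

# Hostname suffixes Google's documentation lists for its crawlers.
GOOGLE_SUFFIXES = (".googlebot.com", ".google.com", ".googleusercontent.com")


def _reverse(ip):
    """Return the PTR hostname for an IP, or None if the lookup fails."""
    try:
        return socket.gethostbyaddr(ip)[0]
    except OSError:
        return None


def _forward(hostname):
    """Return the set of addresses a hostname resolves to (empty on failure)."""
    try:
        return {info[4][0] for info in socket.getaddrinfo(hostname, None)}
    except OSError:
        return set()


def verify_by_dns(ip, reverse=_reverse, forward=_forward):
    """Forward-confirmed reverse DNS: the PTR name must be Google-owned
    and must resolve back to the same IP."""
    hostname = reverse(ip)
    if not hostname:
        return False
    if not hostname.lower().rstrip(".").endswith(GOOGLE_SUFFIXES):
        return False
    return ip in forward(hostname)


def verify_by_range(ip, cidr_ranges):
    """Check an IP against crawler ranges supplied by the caller."""
    address = ipaddress.ip_address(ip)
    return any(address in ipaddress.ip_network(cidr) for cidr in cidr_ranges)
```

DNS lookups are slow, so a production version would likely cache results per IP and run the check only for requests that claim a crawler identity.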

A signal matrix for bot classification

No single signal is perfect. Combine several:

Signal            Question
Network identity  Residential, datacenter, VPN, Tor, or verified crawler range?
User agent        Does it agree with TLS fingerprint and behavior?
Velocity          Requests per minute, five minutes, and hour?
Navigation        Does it load HTML, CSS, images, and normal paths?
Behavior          Natural scroll, click, focus, reading time, and transitions?
Target            Public content or login, checkout, and form endpoints?
Impact            Resource cost, analytics distortion, or ad loss?

A legitimate crawler can be fast while verified, robots-aware, and focused on public content. A scraper may claim Chrome while skipping assets, rotating IPs, and harvesting only products.
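
One way to combine signals is a simple penalty score. The sketch below is illustrative only: the `Signals` fields, the weights, and the 60 requests-per-minute threshold are assumptions chosen for the example, not values from any standard. It starts at 100 and subtracts points, so a higher score means lower risk.

```python
from dataclasses import dataclass


@dataclass
class Signals:
    verified_crawler: bool            # passed DNS or IP-range verification
    ua_matches_fingerprint: bool      # user agent agrees with TLS/header evidence
    loads_assets: bool                # fetched CSS, JavaScript, or images
    requests_per_minute: float
    targets_sensitive_endpoint: bool  # login, checkout, or form endpoints


def trust_score(s: Signals) -> int:
    """Return 0-100; hypothetical weights, to be tuned on labeled traffic."""
    penalty = 0
    if not s.ua_matches_fingerprint:
        penalty += 30
    if s.targets_sensitive_endpoint:
        penalty += 25
    if not s.verified_crawler:
        # Speed and skipped assets only count against unverified clients.
        if s.requests_per_minute > 60:
            penalty += 20
        if not s.loads_assets:
            penalty += 15
    return max(0, 100 - penalty)
```

A verified crawler that is fast but stays on public content keeps a full score here, while an unverified client that claims Chrome, skips assets, and runs at high velocity loses points on three signals at once.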

Six suspicious behavior patterns

1. High velocity without human rhythm

Hundreds of clicks spaced exactly 200ms apart are unlikely to be human. Legitimate crawlers are also fast, so combine velocity with identity and target URLs.
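
Fixed spacing can be measured as the coefficient of variation of the gaps between events. The following is a rough sketch; the function name is hypothetical and any cutoff you apply to the result is something to calibrate on your own traffic.

```python
from statistics import mean, pstdev


def interval_regularity(timestamps):
    """Coefficient of variation of the gaps between events.

    Values near 0 mean almost identical spacing (machine-like);
    human activity usually produces much larger values.
    Returns None when there are too few events to judge.
    """
    if len(timestamps) < 3:
        return None
    ordered = sorted(timestamps)
    gaps = [later - earlier for earlier, later in zip(ordered, ordered[1:])]
    average = mean(gaps)
    if average == 0:
        return 0.0
    return pstdev(gaps) / average
```

Timestamps can be in any consistent unit, for example milliseconds since the first click. A low value is a reason to look closer, not a verdict, because a verified crawler may also pace itself evenly.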

2. Requests focus only on business-value endpoints

Inventory bots call add-to-cart; credential bots focus on login; click-fraud bots repeatedly hit ad landing pages. Skipping normal navigation is a strong clue.

3. Fingerprint and user agent disagree

The user agent claims Safari on iPhone while TLS, header order, or viewport resembles desktop headless Chromium. The inconsistency is more useful than the user-agent text itself.

4. IPs rotate while device evidence remains

Residential proxies let automation switch addresses. If cookies, fingerprints, behavior, and targets repeat, blocking one IP at a time becomes endless.
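
Grouping by fingerprint instead of by IP makes the rotation visible. A minimal sketch, assuming each logged event is a dict with hypothetical `fingerprint` and `ip` keys and that five distinct addresses is a reasonable starting threshold:

```python
from collections import defaultdict


def fingerprints_rotating_ips(events, min_ips=5):
    """Return {fingerprint: distinct_ip_count} for fingerprints
    seen from at least min_ips different addresses."""
    ips_by_fingerprint = defaultdict(set)
    for event in events:
        ips_by_fingerprint[event["fingerprint"]].add(event["ip"])
    return {
        fingerprint: len(ips)
        for fingerprint, ips in ips_by_fingerprint.items()
        if len(ips) >= min_ips
    }
```

Shared fingerprints do occur among real users on identical devices, so the result works better as a lead for investigation than as an automatic block list.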

5. Supporting assets are never loaded

Browsers normally fetch required CSS, JavaScript, fonts, and images. A simple scraper may request only HTML or APIs. Cache and privacy tools make this a supporting signal, not proof.

6. Failures advance through a list

Sequential usernames, coupons, card data, or admin paths reveal automation. IP limits alone fail against distributed bots; include account, fingerprint, and action dimensions.
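
Counting failures on several dimensions at once might look like the sketch below. The dimension names and limits are assumptions for illustration; the input is a list of failure records from one observation window.

```python
from collections import Counter

# Hypothetical failure limits per dimension within one observation window.
LIMITS = {"ip": 20, "account": 5, "fingerprint": 10}


def exceeded_dimensions(failures, limits=LIMITS):
    """Return sorted (dimension, value) pairs whose failure count
    is above the limit for that dimension."""
    counts = Counter()
    for failure in failures:
        for dimension in limits:
            counts[(dimension, failure[dimension])] += 1
    return sorted(key for key, n in counts.items() if n > limits[key[0]])
```

Twelve failures spread over twelve IPs and twelve usernames pass every per-IP and per-account limit, yet a shared fingerprint still crosses its own threshold.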

Avoid blocking good bots with blunt rules

Blocking all datacenters

Many bad bots use datacenters, but monitors, webhooks, search crawlers, and business services do too. A blanket datacenter block may fit a sensitive action such as coupon redemption, but not an entire public site.

Blocking every fast client

Search crawlers and uptime monitors send valid bursts. Rate limits should reflect endpoint sensitivity and verified identity.

Requiring JavaScript challenges everywhere

Search crawlers can render some JavaScript, but anti-bot challenges, mandatory cookies, and CAPTCHAs may block discovery. Important public content needs a clean path for verified crawlers.

Trusting an old allowlist forever

IP ranges, services, and business needs change. Every allowlist entry needs a source, owner, review date, and usage log.
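
An allowlist entry with those fields could be modeled as below. The class and field names are hypothetical; the point is that a review date stored next to the entry makes stale trust easy to find.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class AllowlistEntry:
    cidr: str        # network being trusted
    source: str      # where the range came from (vendor page, ticket, contract)
    owner: str       # person or team accountable for the entry
    review_by: date  # date after which the entry must be revalidated


def overdue(entries, today):
    """Return entries whose review date has passed."""
    return [entry for entry in entries if entry.review_by < today]
```

The usage log mentioned above would live in request logs rather than in this record; joining the two shows which entries are still being used at all.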

Response policy by risk

Level 1: Allow and log

For verified crawlers, owned monitoring, and known partners. Private areas still require authentication; “good bot” is not authorization.

Level 2: Soft rate limit

For unknown automation reading public content. When a client exceeds the threshold, return 429 with a Retry-After header rather than permanently blocking it over one short spike.
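
A soft limit of this kind can be sketched as a sliding-window counter. This is a single-process illustration with hypothetical names; a real deployment would usually keep the counters in a shared store, and the limit and window are values to choose per endpoint.

```python
import math
from collections import defaultdict, deque


class SlidingWindowLimiter:
    """Allow at most `limit` requests per client in any `window_seconds` span."""

    def __init__(self, limit, window_seconds):
        self.limit = limit
        self.window = window_seconds
        self.hits = defaultdict(deque)  # client_id -> timestamps of allowed hits

    def check(self, client_id, now):
        """Return (status, headers): 200 when allowed, else 429 with Retry-After."""
        hits = self.hits[client_id]
        # Drop hits that have aged out of the window.
        while hits and now - hits[0] >= self.window:
            hits.popleft()
        if len(hits) < self.limit:
            hits.append(now)
            return 200, {}
        # Seconds until the oldest hit leaves the window, rounded up.
        retry_after = max(1, math.ceil(hits[0] + self.window - now))
        return 429, {"Retry-After": str(retry_after)}
```

Because the client is told when to come back, a well-behaved but unknown tool can slow down on its own, and a short spike never turns into a permanent block.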

Level 3: Challenge or reduce capability

For medium-risk behavior: rapid search, repeated carts, or form submissions. Introduce JavaScript checks, threshold-based CAPTCHA, or reduced data.

Level 4: Block

For credential stuffing, exploit attempts, confirmed spam, substantiated click fraud, or high-quality denylist matches.

Level 5: Investigate and preserve evidence

For sustained ad fraud or attack campaigns, retain request ID, timestamp, hashed IP, user agent, fingerprint, campaign, and action. Avoid storing personal data beyond need; the objective is reproducible decisions.
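
A record like this can store a keyed hash instead of the raw address. The sketch below uses HMAC-SHA256 from the Python standard library; the function names and record layout are assumptions, and the secret key is expected to come from your own secret storage.

```python
import hashlib
import hmac


def hash_ip(ip, secret_key):
    """Keyed hash of an IP: the same address always maps to the same digest,
    so events can be correlated without storing the raw address."""
    return hmac.new(secret_key, ip.encode("utf-8"), hashlib.sha256).hexdigest()


def evidence_record(request_id, timestamp, ip, user_agent,
                    fingerprint, campaign, action, secret_key):
    """Build the evidence fields for one request; the raw IP is not kept."""
    return {
        "request_id": request_id,
        "timestamp": timestamp,
        "ip_hash": hash_ip(ip, secret_key),
        "user_agent": user_agent,
        "fingerprint": fingerprint,
        "campaign": campaign,
        "action": action,
    }
```

A keyed hash is used rather than a plain one because the IPv4 space is small enough to enumerate; without the key, an unkeyed digest could be reversed by brute force.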

Protect each surface differently

Content and SEO pages

Allow verified crawlers, cache aggressively, rate-limit unknown automation, and monitor scraping. Do not place CAPTCHA in front of articles or categories that need indexing.

Login and password reset

Limit by account and IP, detect username lists, use MFA, and alert on failure spikes. Bots on residential IPs evade IP reputation checks but can still be caught by their action patterns.

Lead forms

Honeypots, minimum completion time, server validation, and rate limits reduce spam. CAPTCHA alone is insufficient against human-assisted farms.
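
The honeypot and minimum-time checks are simple to express on the server side. In this sketch the hidden field name `website` and the three-second floor are assumptions, and `rendered_at` is assumed to come from a value the server issued and can trust, such as a signed timestamp, not from the client.

```python
def form_spam_reasons(form, rendered_at, submitted_at,
                      min_seconds=3.0, honeypot_field="website"):
    """Return a list of reasons a submission looks automated (empty if none).

    form: dict of submitted fields.
    rendered_at / submitted_at: timestamps in seconds.
    """
    reasons = []
    # Humans never see the hidden field, so any value in it is suspicious.
    if form.get(honeypot_field, "").strip():
        reasons.append("honeypot_filled")
    # Forms completed faster than a person can type are suspicious.
    if submitted_at - rendered_at < min_seconds:
        reasons.append("too_fast")
    return reasons
```

Returning reasons rather than a boolean keeps the decision explainable and lets each check be tuned or rolled back separately.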

Checkout and coupons

Track velocity across account, device, card, and coupon. Datacenter blocks can help but may need exceptions for corporate customers.

Advertising landing pages

Record campaign, click ID, timestamp, and quality score. Compare ad-network clicks with sessions, logs, and conversions. Financial exclusions require stronger evidence than one suspicious request.
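
Comparing ad-network clicks with your own logs can start as a set difference on click IDs. A minimal sketch, assuming both sides expose a click identifier; the function name is hypothetical.

```python
def unmatched_clicks(network_click_ids, logged_click_ids):
    """Return (missing_ids, rate): clicks the ad network reported that never
    appeared in server logs, and their share of all reported clicks."""
    network = set(network_click_ids)
    missing = network - set(logged_click_ids)
    rate = len(missing) / len(network) if network else 0.0
    return sorted(missing), rate
```

Some gap is normal, since visitors close tabs and tracking can fail, so the rate is more useful as a trend per campaign than as a single number.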

Investigating an IP or fingerprint

  1. Classify the address with the IP Checker.
  2. Verify reverse DNS when it claims a major crawler identity.
  3. Pull the action sequence across a 5–30 minute window.
  4. Compare user agent, headers, fingerprint, and asset requests.
  5. Review campaign, landing page, and action value.
  6. Assign confidence rather than an absolute true/false label.
  7. Select allow, limit, challenge, or block.
  8. Give temporary rules an expiry date.

A 0–100 score preserves the gray area:

  • 80–100: strong human signals;
  • 50–79: monitor or apply light limits;
  • below 50: multiple automation or risk signals.

Thresholds depend on the product. Blogs, banks, and games have different risk tolerance.
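
Because the cutoffs vary by product, it helps to keep them as parameters. This sketch maps a score to the three bands listed above; the band labels and function name are illustrative.

```python
def score_band(score, human_min=80, monitor_min=50):
    """Map a 0-100 quality score to one of the three bands."""
    if not 0 <= score <= 100:
        raise ValueError("score must be between 0 and 100")
    if score >= human_min:
        return "strong_human"
    if score >= monitor_min:
        return "monitor_or_light_limit"
    return "automation_risk"
```

A bank might call `score_band(score, human_min=90, monitor_min=70)`, while a blog could leave the defaults or relax them.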

Treat false positives as a product metric

A system that stops 99% of bots but loses 5% of real customers may be worse than a less aggressive one. Monitor:

  • successful challenges;
  • blocked visitors contacting support;
  • conversion before and after a rule;
  • bots returning through new IPs;
  • endpoints with high false-positive rates.

Every rule needs a reason, owner, and rollback path. “This IP feels strange” is not an operating policy.

Govern bot rules like production code

Bot controls tend to accumulate. An emergency IP block added during an attack may remain for years, long after the address changes owner. Treat rules as versioned production logic.

Each rule should record:

  • the behavior or incident that justified it;
  • signals used and confidence level;
  • affected endpoints and user groups;
  • owner and approval date;
  • expiry or review date;
  • rollback instructions;
  • false-positive metrics to watch.

Test new controls in report-only mode whenever the risk allows. Log which requests would have been limited or blocked, then compare them with successful logins, purchases, verified crawlers, and support complaints. This shadow period reveals broad regexes and network rules before they affect customers.
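
A report-only evaluation can be as small as the sketch below. It assumes each logged request is a dict and that requests already confirmed as legitimate (successful purchases, verified crawlers) carry a hypothetical `known_good` flag; the rule is any function that returns True when it would block.

```python
def shadow_report(requests, rule):
    """Evaluate a rule against logged requests without enforcing it."""
    would_block = [request for request in requests if rule(request)]
    known_good_hit = [r for r in would_block if r.get("known_good")]
    return {
        "evaluated": len(requests),
        "would_block": len(would_block),
        "known_good_hit": len(known_good_hit),
    }
```

If `known_good_hit` is not close to zero, the rule is too broad for enforcement on that surface.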

Roll out by surface rather than globally. A strict rule may be appropriate for coupon redemption but harmful on documentation pages. Endpoint-specific policy also produces clearer metrics when something changes.

Keep an emergency bypass for verified operations and a fast rollback path. During a false-positive incident, waiting for a full application deployment can cost more than the automation the rule was designed to stop.

Finally, review allowlists as carefully as denylists. A compromised partner token or outdated crawler exception can become a permanent gap. Trust should be scoped, observable, and periodically revalidated.

Is robots.txt a bot-defense tool?

Robots.txt is a convention for cooperative crawlers. Malicious automation can ignore it. It remains essential for Googlebot and legitimate crawlers, but does not replace a WAF, authentication, rate limiting, or behavior analysis.

Use the Robots.txt Checker to ensure an anti-scraping change did not accidentally block search crawlers.
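
The same check can be scripted with Python's standard `urllib.robotparser`, for example in a deployment test. The wrapper function is hypothetical; note that this parser follows the original robots.txt conventions and may not match every extension a specific crawler supports.

```python
from urllib.robotparser import RobotFileParser


def crawler_can_fetch(robots_txt, user_agent, url):
    """Return True if the given robots.txt text lets user_agent fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)
```

Running it for Googlebot against a few important product and category URLs after each robots.txt change catches an accidental `Disallow` before crawlers do.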

See the full picture: read how traffic Quality Score works and run a free SEO audit before applying a site-wide blocking rule.

Conclusion

Good and bad bots differ by more than speed or user-agent labels. Verifiable identity, target, behavior sequence, and business impact produce better decisions. Allow with controls, rate-limit the gray area, and block aggressively only when evidence is clear. A strong bot-defense system does not merely catch automation; it also knows when a legitimate crawler should pass.

References: Google Search Central – Googlebot and Google crawler overview.

Frequently asked questions

Can a user agent alone verify Googlebot?
No. User agents are easy to spoof. Use forward-confirmed reverse DNS or compare the source IP with official Google crawler ranges.

Should every datacenter IP be blocked?
Not site-wide. Datacenters host bad bots but also crawlers, monitors, webhooks, and legitimate business services. Evaluate the endpoint and behavior.

Can robots.txt block malicious bots?
Not reliably. It is a convention for cooperative crawlers; malicious bots may ignore it and require rate limits, a WAF, authentication, or behavior controls.