About the GoodHelp-WebExtract crawler

If you found this page in your server logs, you most likely saw a request with this User-Agent header:

GoodHelp-WebExtract/1.0 (+https://goodhelp.ai/bots)

What it is

GoodHelp-WebExtract is a user-initiated content importer, not a broad-web crawler. It runs only when a GoodHelp customer explicitly imports a website—typically their own site, or a site they have permission to mirror—into their organization's content library. It does not crawl the web speculatively, build a search index, or train models on the pages it fetches.

What we use the data for

Pages fetched by GoodHelp-WebExtract are imported into the requesting organization's private content library inside GoodHelp. They are used to power that organization's own marketing, agent, and content-management workflows. Imported content is not shared across organizations.

How to opt out

We honour standard robots.txt directives. To stop GoodHelp-WebExtract from fetching any page on your site, add:

User-agent: GoodHelp-WebExtract
Disallow: /

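You can also limit our fetch rate with Crawl-delay: N (N seconds between requests), or scope the Disallow rule to specific path prefixes instead of the whole site. Note that Crawl-delay is a widely used extension rather than part of RFC 9309; we honour it regardless.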
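
If you want to confirm that a rule set behaves as intended before deploying it, one option is to test it locally. The sketch below uses Python's standard urllib.robotparser module. The /private/ prefix, the 5-second delay, the example.com URLs, and the helper names is_allowed and requested_delay are illustrative assumptions, and Python's parser may differ in edge cases from the matcher GoodHelp-WebExtract itself uses.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical rule set: block one path prefix and ask for 5 s between requests.
ROBOTS_TXT = """\
User-agent: GoodHelp-WebExtract
Crawl-delay: 5
Disallow: /private/
"""

AGENT = "GoodHelp-WebExtract"


def _parse(robots_txt: str) -> RobotFileParser:
    """Parse robots.txt text held in memory (no network access)."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


def is_allowed(robots_txt: str, url: str, agent: str = AGENT) -> bool:
    """Return True if `agent` may fetch `url` under the given rules."""
    return _parse(robots_txt).can_fetch(agent, url)


def requested_delay(robots_txt: str, agent: str = AGENT):
    """Return the Crawl-delay (in seconds) that applies to `agent`, or None."""
    return _parse(robots_txt).crawl_delay(agent)


if __name__ == "__main__":
    for url in ("https://example.com/private/page.html",
                "https://example.com/blog/"):
        print(url, "allowed" if is_allowed(ROBOTS_TXT, url) else "blocked")
    print("Crawl-delay:", requested_delay(ROBOTS_TXT))
```

Replacing the Disallow line with Disallow: / in the same check should report every URL on the host as blocked.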

If you would prefer to contact us directly (for example, to report a misbehaving import or to request manual removal), email crawler@goodhelp.ai. Please include the date and time of the request and a sample log line so we can identify the originating import.

Politeness defaults

User-Agent: GoodHelp-WebExtract/1.0 (+https://goodhelp.ai/bots)
Per-host concurrency cap: at most 4 in-flight requests to a single host at a time across our worker fleet.
Crawl delay: 0.5 seconds between requests during a full-site crawl by default. We honour your Crawl-delay directive when it is longer.
Per-import page cap: a single import is hard-capped at 500 pages.
Response-size cap: 10 MB per response. Larger responses are aborted mid-stream.
Redirect handling: at most 10 redirect hops. Each hop is independently re-validated.
HTTP 429 handling: we honour the Retry-After header (delta-seconds or HTTP-date), capped at 5 minutes per response so that a very large value cannot stall an import indefinitely.
Source IPs: requests originate from Google Cloud (us-west1) egress. We do not publish a static IP allowlist—please identify us by User-Agent.
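
To make the crawl-delay and Retry-After rules above concrete, here is a minimal Python sketch of how such timing logic could look. It is not GoodHelp's actual implementation: the function names effective_crawl_delay and retry_after_seconds are hypothetical, and clamping a Retry-After date that lies in the past to zero is an assumption. Only the 0.5-second default and the 5-minute cap come from the list above.

```python
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime

DEFAULT_DELAY_S = 0.5       # default crawl delay stated above
RETRY_AFTER_CAP_S = 300.0   # 5-minute cap stated above


def effective_crawl_delay(site_delay_s=None):
    """Use the site's Crawl-delay only when it is longer than the default."""
    if site_delay_s is None:
        return DEFAULT_DELAY_S
    return max(DEFAULT_DELAY_S, float(site_delay_s))


def retry_after_seconds(header_value, now=None):
    """Convert a Retry-After value into a capped wait in seconds.

    Accepts delta-seconds ("120") or an HTTP-date
    ("Wed, 21 Oct 2015 07:28:00 GMT"). Returns None if unparseable.
    """
    now = now or datetime.now(timezone.utc)
    value = header_value.strip()
    if value.isdigit():
        wait = float(value)
    else:
        try:
            when = parsedate_to_datetime(value)
        except (TypeError, ValueError):
            return None
        if when.tzinfo is None:
            # Assumption: treat a date without a zone as UTC.
            when = when.replace(tzinfo=timezone.utc)
        wait = (when - now).total_seconds()
    # Assumption: never wait a negative time; never exceed the cap.
    return min(max(wait, 0.0), RETRY_AFTER_CAP_S)
```

Under these assumptions, a site that sets Crawl-delay: 2 is fetched no faster than once every 2 seconds, a Crawl-delay below 0.5 leaves the default in place, and a Retry-After of 3600 results in a 300-second wait.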

See also RFC 9309 (Robots Exclusion Protocol) and RFC 9110 §10.2.3 (Retry-After), which obsoletes RFC 7231 §7.1.3.