About the GoodHelp-WebExtract crawler
If you found this page in your server logs, you most likely saw a request with this User-Agent header:
GoodHelp-WebExtract/1.0 (+https://goodhelp.ai/bots)
What it is
GoodHelp-WebExtract is a user-initiated content importer, not a broad-web crawler. It runs only when a GoodHelp customer explicitly imports a website—typically their own site, or a site they have permission to mirror—into their organization's content library. It does not crawl the web speculatively, build a search index, or train models on the pages it fetches.
What we use the data for
Pages fetched by GoodHelp-WebExtract are imported into the requesting organization's private content library inside GoodHelp. They are used to power that organization's own marketing, agent, and content-management workflows. Imported content is not shared across organizations.
How to opt out
We honour standard robots.txt directives. To stop GoodHelp-WebExtract from fetching any page on your site, add:
User-agent: GoodHelp-WebExtract
Disallow: /
You can also limit our fetch rate with Crawl-delay: N (seconds between requests), or scope the disallow to specific path prefixes.
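If you want to sanity-check a robots.txt before deploying it, the directives above can be tried out locally. The following is a minimal sketch using Python's standard-library `urllib.robotparser`; it is not the parser GoodHelp-WebExtract itself runs, and the rules, the `example.com` URLs, and the helper names `is_allowed` and `declared_crawl_delay` are hypothetical, chosen only for illustration.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: block one path prefix and ask for a 5-second delay.
ROBOTS_TXT = """\
User-agent: GoodHelp-WebExtract
Disallow: /private/
Crawl-delay: 5
"""

# The User-Agent string as it appears in server logs.
USER_AGENT = "GoodHelp-WebExtract/1.0 (+https://goodhelp.ai/bots)"


def _parse(robots_txt: str) -> RobotFileParser:
    """Parse robots.txt text without fetching anything over the network."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


def is_allowed(robots_txt: str, url: str, user_agent: str = USER_AGENT) -> bool:
    """Return True if the stdlib parser would let user_agent fetch url."""
    return _parse(robots_txt).can_fetch(user_agent, url)


def declared_crawl_delay(robots_txt: str, user_agent: str = USER_AGENT):
    """Return the Crawl-delay that applies to user_agent, or None if unset."""
    return _parse(robots_txt).crawl_delay(user_agent)


if __name__ == "__main__":
    for url in ("https://example.com/private/report.html",
                "https://example.com/blog/"):
        print(url, "allowed" if is_allowed(ROBOTS_TXT, url) else "blocked")
    print("crawl delay:", declared_crawl_delay(ROBOTS_TXT))
```

The standard-library parser only recognises whole-number Crawl-delay values, so treat its result as a rough check rather than a statement of how any particular crawler will read the file.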
If you would prefer to contact us directly—for example, to report a misbehaving import or to request manual removal—email crawler@goodhelp.ai. Please include the date/time and a sample log line so we can identify the originating import.
Politeness defaults
- User-Agent: GoodHelp-WebExtract/1.0 (+https://goodhelp.ai/bots)
- Per-host concurrency cap: at most 4 in-flight requests to a single host at a time across our worker fleet.
- Crawl delay: 0.5 seconds between requests during a full-site crawl by default. We honour your Crawl-delay directive when it is longer.
- Per-import page cap: a single import is hard-capped at 500 pages.
- Response-size cap: 10 MB per response. Larger responses are aborted mid-stream.
- Redirect handling: at most 10 redirect hops. Each hop is independently re-validated.
- HTTP 429 handling: we honour the Retry-After header (delta-seconds or HTTP-date), capped at 5 minutes per response so we do not park indefinitely.
- Source IPs: requests originate from Google Cloud (us-west1) egress. We do not publish a static IP allowlist; please identify us by User-Agent.
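The crawl-delay and Retry-After rules in the list above can be illustrated roughly as follows. This is a sketch under stated assumptions, not GoodHelp's actual implementation: the function names are hypothetical, and the choice to wait 0 extra seconds on an unparseable Retry-After value is an assumption made only to keep the example complete.

```python
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime
from typing import Optional

DEFAULT_CRAWL_DELAY_S = 0.5  # default delay between requests, per the list above
RETRY_AFTER_CAP_S = 300.0    # 5-minute cap on any single Retry-After wait


def effective_crawl_delay(site_crawl_delay: Optional[float]) -> float:
    """Use the site's Crawl-delay only when it is longer than the default."""
    if site_crawl_delay is None:
        return DEFAULT_CRAWL_DELAY_S
    return max(DEFAULT_CRAWL_DELAY_S, float(site_crawl_delay))


def retry_after_wait(header_value: str, now: Optional[datetime] = None) -> float:
    """Convert a Retry-After value (delta-seconds or HTTP-date) to a capped wait."""
    now = now or datetime.now(timezone.utc)
    value = header_value.strip()
    if value.isascii() and value.isdigit():
        # delta-seconds form, e.g. "120"
        seconds = float(value)
    else:
        # HTTP-date form, e.g. "Wed, 21 Oct 2015 07:28:00 GMT"
        try:
            when = parsedate_to_datetime(value)
        except (TypeError, ValueError):
            return 0.0  # assumption: ignore values that cannot be parsed
        if when.tzinfo is None:
            when = when.replace(tzinfo=timezone.utc)
        seconds = (when - now).total_seconds()
    # Never wait a negative time, and never longer than the cap.
    return min(max(seconds, 0.0), RETRY_AFTER_CAP_S)


if __name__ == "__main__":
    print(effective_crawl_delay(None))  # no directive: default applies
    print(effective_crawl_delay(2))     # longer directive: honoured
    print(retry_after_wait("3600"))     # long wait: clamped to the cap
```

Passing `now` explicitly keeps the HTTP-date branch deterministic, which makes the behaviour easy to check against a fixed timestamp.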
See also RFC 9309 (Robots Exclusion Protocol) and RFC 7231 §7.1.3 (Retry-After; now superseded by RFC 9110 §10.2.3).