Troubleshooting

The handful of things that trip people up, and how to fix each one.

Most of these come down to network reality or how a host serves its pages, not a bug.

Requests start failing or returning 429

A public host rate-limits like any other. kumo already paces requests and retries the transient failures, but a hard limit still means backing off. Raise the delay between requests with --rate (for example --rate 2s), lower the worker count with -j, and retry later. A burst of 429 or 5xx responses is the host asking you to slow down, not a defect.

A crawl stops earlier than you expected

A crawl is bound to one host and to your limits. If it ends with fewer pages than you thought, check that robots.txt does not disallow the paths (run with --no-robots to confirm), that --max-pages, --max-depth, and --scope-prefix are not cutting it short, and that the links you want are on-host. Subdomains need --include-subdomains, and site-search routes need --include-search.

A page comes back with no content

kumo extracts the main readable content and drops site chrome. A page that is mostly navigation, a redirect, a non-HTML response, or one that only renders with JavaScript can yield an empty body, and an empty body is not written to the tree. For documentation sites that publish a Markdown source, --raw-md prefers the site's own <page>.md.

The binary is not on your PATH

go install puts the binary in $(go env GOPATH)/bin (usually ~/go/bin), and a release archive leaves it wherever you unpacked it. If your shell cannot find kumo, add that directory to your PATH. See installation.

Seeing what kumo actually did

When something behaves unexpectedly, -v adds per-request detail so you can see the URLs it hit and the responses it got. That is usually enough to tell a rate limit apart from a genuinely empty result.