Introduction
What kumo is and how it is put together.
Crawl a whole host into structured data.
kumo is a single binary. Point it at a host and it fetches every page, converts each one to clean Markdown and JSON, and writes the result as a navigable URI tree under your data directory. There is nothing to sign up for and nothing to run alongside it.
How it is built
- An extract pipeline (
extract) turns each HTML response into a structured record: title, description, canonical, language, dates, OpenGraph, JSON-LD, outbound links, and the main content as Markdown. - A crawl engine (
crawl) walks a host: a self-closing frontier, a polite paced fetcher, robots.txt and sitemap handling, and the scope rules that keep a crawl on one host. - A library package (
kumo) is the glue: it builds a crawl, mints a URI for each page, and writes the URI tree store. - A command tree (
cli) wraps the library in subcommands with shared output formats and flags, and onecmd/kumoentry point ties it together.
Scope and manners
kumo reads only what a host already serves publicly. It honors robots.txt and its crawl-delay by default, paces its requests, and sends an honest User-Agent. A crawl is bound to one host and you bound it further with page, depth, and path limits.
Next: install it, then take the quick start.