Data format
The URI tree kumo writes: where files land, the front-matter schema, and how to consume it.
A crawl writes one file per page into a tree under the data directory ($HOME/data
by default, or wherever --out points). The tree is the output: it is meant to
be read, grepped, diffed, and committed, not just produced.
Where a page lands
Each page becomes a Markdown file at pages/<host>/<path>.md. The path mirrors
the URL, with two rules:
- A directory URL (one ending in /, including the host root) is written as index.md in that directory. https://example.com/ becomes pages/example.com/index.md; https://example.com/docs/ becomes pages/example.com/docs/index.md.
- A URL with a query string folds the query into the filename so two pages that differ only by query do not collide.
$HOME/data/
  pages/
    quotes.toscrape.com/
      index.md
      login.md
      author/
        Albert-Einstein.md
      tag/
        life/
          page/
            1/
              index.md
The same layout is addressable as a pages://<host>/<path> URI, which is how
on-host links are recorded inside each file (see below).
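The mapping from a pages:// URI to a file on disk can be sketched in a few lines of Python. This is an illustration of the directory rule above, not kumo's own code: the name uri_to_path and the data_dir parameter are hypothetical, and because this page does not describe exactly how a query string is folded into the filename, the sketch refuses URIs that carry one rather than guess.

```python
from pathlib import PurePosixPath
from urllib.parse import urlsplit


def uri_to_path(uri: str, data_dir: str = "data") -> PurePosixPath:
    """Map a pages://<host>/<path> URI to the Markdown file it names.

    Hypothetical helper: it applies only the directory rule (a trailing
    slash, or the bare host root, maps to index.md).
    """
    parts = urlsplit(uri)
    if parts.scheme != "pages" or not parts.netloc:
        raise ValueError(f"not a pages:// URI: {uri!r}")
    if parts.query:
        # The query-folding scheme is not specified here, so do not guess.
        raise ValueError("URIs with a query string are not handled here")

    base = PurePosixPath(data_dir) / "pages" / parts.netloc
    path = parts.path
    if path == "" or path.endswith("/"):
        # Directory URL: written as index.md inside that directory.
        return base / path.strip("/") / "index.md"
    # Any other URL: the path plus a .md suffix.
    return base / (path.lstrip("/") + ".md")
```

Pass the real data directory as data_dir (for example the expanded $HOME/data) to get a path that can be opened directly.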
The file format
Every file is a JSON front-matter block fenced by ---json and ---, followed
by the page content as Markdown:
---json
{
  "@id": "pages://quotes.toscrape.com/author/Albert-Einstein",
  "@type": "pages/page",
  "@fetched": "2026-06-14T02:34:03Z",
  "url": "https://quotes.toscrape.com/author/Albert-Einstein",
  "host": "quotes.toscrape.com",
  "status": 200,
  "title": "Quotes to Scrape",
  "description": "Born: March 14, 1879 in Ulm, Germany",
  "lang": "en",
  "author": "Albert Einstein",
  "links": [
    { "uri": "pages://quotes.toscrape.com/", "anchor": "Quotes to Scrape" }
  ],
  "hash": "…",
  "source": "extracted"
}
---
The main content of the page, as Markdown.
The front-matter carries the structured record; the converted Markdown is the body. The body is not repeated inside the front-matter, so the file reads cleanly and the content lives in exactly one place.
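Reading one of these files back needs nothing beyond a JSON parser. The following is a minimal sketch, assuming the opening fence is the first line of the file and the closing fence is the first later line that is exactly ---; the function name split_front_matter is hypothetical. Because a JSON string cannot contain a raw newline, a bare --- line cannot occur inside valid JSON, so the first one found closes the block.

```python
import json


def split_front_matter(text: str) -> tuple[dict, str]:
    """Split one page file into (front-matter record, Markdown body)."""
    lines = text.splitlines(keepends=True)
    if not lines or lines[0].strip() != "---json":
        raise ValueError("file does not start with a ---json fence")
    for i in range(1, len(lines)):
        if lines[i].strip() == "---":
            # Everything between the fences is the JSON record.
            record = json.loads("".join(lines[1:i]))
            # Everything after the closing fence is the body, untouched.
            body = "".join(lines[i + 1:])
            return record, body
    raise ValueError("no closing --- fence found")
```

A later --- in the body (a Markdown horizontal rule, say) is left alone, since the scan stops at the first closing fence.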
Front-matter fields
| Field | Meaning |
|---|---|
| @id | The page URI, pages://<host>/<path> |
| @type | pages/page, or pages/unchanged for a conditional-GET 304 |
| @fetched | When the page was fetched (RFC 3339) |
| url | The absolute source URL |
| host | The crawled host |
| status | HTTP status code |
| title, description | Page title and meta description |
| lang | Document language |
| canonical | The page's declared canonical URL, if any |
| author, site_name | From metadata and OpenGraph |
| published, modified | Article dates, if declared |
| og | OpenGraph properties, keyed by their full og: name |
| jsonld | Raw JSON-LD blocks found on the page |
| links | Outbound links: on-host as pages:// URIs, off-host as absolute URLs |
| hash | SHA-256 of the body, for change detection |
| source | How the body was produced: extracted or raw-md |
Pages with no body (an error, a redirect, a non-HTML response, or a 304) are not written, so re-crawling does not churn the tree with empty files.
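Since the links field mixes on-host pages:// URIs with off-host absolute URLs, a consumer usually wants to separate the two. A minimal sketch, assuming every link object carries its target under the uri key as in the example record (the example only shows an on-host link, so the off-host shape is an assumption); Link and split_links are hypothetical names:

```python
from typing import TypedDict


class Link(TypedDict):
    uri: str     # pages:// URI (on-host) or absolute URL (off-host, assumed)
    anchor: str  # link text, as in the example record


def split_links(links: list[Link]) -> tuple[list[str], list[str]]:
    """Return (on_host, off_host) link targets from a record's links."""
    on_host: list[str] = []
    off_host: list[str] = []
    for link in links:
        target = link["uri"]
        # On-host links are recorded with the pages:// scheme.
        if target.startswith("pages://"):
            on_host.append(target)
        else:
            off_host.append(target)
    return on_host, off_host
```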
Consuming the tree
Because the format is plain files, the usual tools work. Pull a field out of every page with a front-matter-aware query, or read the records back through kumo itself:
# Every page already crawled for a host, as JSON, offline:
kumo pages quotes.toscrape.com -o jsonl | jq -r '.url + "\t" + .title'
# Find pages that link somewhere specific:
kumo pages quotes.toscrape.com -o jsonl \
| jq -r 'select(.links[]?.uri == "pages://quotes.toscrape.com/login") | .url'
The pages command reads the tree without touching the network, so it stays
useful long after a crawl. Since the tree is just files, it also versions
cleanly: commit $HOME/data and each crawl becomes a diff you can review.
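The same JSONL stream can feed a script when jq runs out of road. As a sketch, the function below counts how many pages link to each on-host URI. It assumes each JSONL record exposes links in the same shape as the front-matter (the jq examples above rely on .url, .title, and .links[].uri); the name inbound_counts is hypothetical.

```python
import json
from collections import Counter
from typing import Iterable


def inbound_counts(lines: Iterable[str]) -> Counter:
    """Count how many pages link to each on-host pages:// URI."""
    counts: Counter = Counter()
    for line in lines:
        if not line.strip():
            continue  # tolerate blank lines in the stream
        record = json.loads(line)
        # A set, so each target is counted once per linking page.
        targets = {
            link["uri"]
            for link in record.get("links") or []
            if link.get("uri", "").startswith("pages://")
        }
        counts.update(targets)
    return counts
```

Piping kumo pages quotes.toscrape.com -o jsonl into a script that calls inbound_counts(sys.stdin) and prints most_common(10) gives a quick view of the most-linked pages on a host.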