kumo

Data format

The URI tree kumo writes: where files land, the front-matter schema, and how to consume it.

A crawl writes one file per page into a tree under the data directory ($HOME/data by default, or wherever --out points). The tree is the output: it is meant to be read, grepped, diffed, and committed, not just produced.

Where a page lands

Each page becomes a Markdown file at pages/<host>/<path>.md. The path mirrors the URL, with two rules:

  • A directory URL (one ending in /, including the host root) is written as index.md in that directory. https://example.com/ becomes pages/example.com/index.md; https://example.com/docs/ becomes pages/example.com/docs/index.md.
  • A URL with a query string folds the query into the filename so two pages that differ only by query do not collide.

$HOME/data/
  pages/
    quotes.toscrape.com/
      index.md
      login.md
      author/
        Albert-Einstein.md
      tag/
        life/
          page/
            1/
              index.md

The same layout is addressable as a pages://<host>/<path> URI, which is how on-host links are recorded inside each file (see below).
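The mapping described above can be sketched in a few lines of Python. This is an illustration and not kumo's own code: the function name page_location is hypothetical, a bare host with no path is assumed to count as the root directory, and directory URIs are assumed to keep their trailing slash (as the root does in the link example later on this page). The sketch rejects URLs with a query string, because the exact folding scheme is not spelled out here, and it ignores ports and percent-encoding.

```python
from urllib.parse import urlsplit


def page_location(url: str) -> tuple[str, str]:
    """Return (file path relative to the data directory, pages:// URI).

    Hypothetical helper: covers only the directory rule and plain paths.
    """
    parts = urlsplit(url)
    if parts.query:
        # The query-folding scheme is not specified, so it is not guessed here.
        raise ValueError("query strings are out of scope for this sketch")

    host = parts.netloc
    path = parts.path or "/"  # assumption: a bare host is the root directory

    if path.endswith("/"):
        # Directory URL: written as index.md inside that directory.
        file_path = f"pages/{host}{path}index.md"
    else:
        # Leaf URL: the last path segment becomes <name>.md.
        file_path = f"pages/{host}{path}.md"

    return file_path, f"pages://{host}{path}"
```

For example, page_location("https://example.com/docs/") gives the path pages/example.com/docs/index.md, matching the rule stated above.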

The file format

Every file is a JSON front-matter block fenced by ---json and ---, followed by the page content as Markdown:

---json
{
  "@id": "pages://quotes.toscrape.com/author/Albert-Einstein",
  "@type": "pages/page",
  "@fetched": "2026-06-14T02:34:03Z",
  "url": "https://quotes.toscrape.com/author/Albert-Einstein",
  "host": "quotes.toscrape.com",
  "status": 200,
  "title": "Quotes to Scrape",
  "description": "Born: March 14, 1879 in Ulm, Germany",
  "lang": "en",
  "author": "Albert Einstein",
  "links": [
    { "uri": "pages://quotes.toscrape.com/", "anchor": "Quotes to Scrape" }
  ],
  "hash": "…",
  "source": "extracted"
}
---

The main content of the page, as Markdown.

The front-matter carries the structured record; the converted Markdown is the body. The body is not repeated inside the front-matter, so the file reads cleanly and the content lives in exactly one place.
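Reading a file back is a matter of splitting on the two fences. The following is a minimal sketch, assuming the ---json fence is the first line of the file and that the closing --- sits on a line by itself; split_front_matter is a hypothetical name, not part of kumo.

```python
import json


def split_front_matter(text: str) -> tuple[dict, str]:
    """Split one page file into (front-matter record, Markdown body)."""
    lines = text.splitlines(keepends=True)
    if not lines or lines[0].strip() != "---json":
        raise ValueError("file does not start with a ---json fence")

    for i in range(1, len(lines)):
        if lines[i].strip() == "---":
            # Everything between the fences is one JSON object.
            record = json.loads("".join(lines[1:i]))
            # Drop the blank line(s) separating the fence from the body.
            body = "".join(lines[i + 1:]).lstrip("\n")
            return record, body

    raise ValueError("no closing --- fence found")
```

Because a JSON string cannot contain a raw newline, a line consisting only of --- cannot occur inside the front-matter, so the first such line after the opening fence is taken as the closing one.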

Front-matter fields

Field                 Meaning
@id                   The page URI, pages://<host>/<path>
@type                 pages/page, or pages/unchanged for a conditional-GET 304
@fetched              When the page was fetched (RFC 3339)
url                   The absolute source URL
host                  The crawled host
status                HTTP status code
title, description    Page title and meta description
lang                  Document language
canonical             The page's declared canonical URL, if any
author, site_name     From metadata and OpenGraph
published, modified   Article dates, if declared
og                    OpenGraph properties, keyed by their full og: name
jsonld                Raw JSON-LD blocks found on the page
links                 Outbound links: on-host as pages:// URIs, off-host as absolute URLs
hash                  SHA-256 of the body, for change detection
source                How the body was produced: extracted or raw-md

Pages with no body (an error, a redirect, a non-HTML response, or a 304) are not written, so re-crawling does not churn the tree with empty files.
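As an illustration of how the hash field supports change detection, the sketch below compares two sets of front-matter records (for example, parsed from two commits of the tree) by @id and hash. It never recomputes the digest, so it makes no assumption about exactly which bytes kumo hashes. The function name diff_by_hash and the returned dictionary layout are hypothetical.

```python
def diff_by_hash(old: list[dict], new: list[dict]) -> dict[str, list[str]]:
    """Classify page URIs as added, removed, or changed between two record sets."""
    old_hashes = {r["@id"]: r.get("hash") for r in old}
    new_hashes = {r["@id"]: r.get("hash") for r in new}

    return {
        # Present only in the newer set.
        "added": sorted(new_hashes.keys() - old_hashes.keys()),
        # Present only in the older set.
        "removed": sorted(old_hashes.keys() - new_hashes.keys()),
        # Present in both, but the recorded body hash differs.
        "changed": sorted(
            uri
            for uri in old_hashes.keys() & new_hashes.keys()
            if old_hashes[uri] != new_hashes[uri]
        ),
    }
```

Only the body hash is compared, so a page whose metadata changed while its body stayed the same is not reported as changed.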

Consuming the tree

Because the format is plain files, the usual tools work. Pull a field out of every page with a front-matter-aware query, or read the records back through kumo itself:

# Every page already crawled for a host, as JSON Lines, offline:
kumo pages quotes.toscrape.com -o jsonl | jq -r '.url + "\t" + .title'

# Find pages that link somewhere specific (any() prints each page once, even if it links there several times):
kumo pages quotes.toscrape.com -o jsonl \
  | jq -r 'select(any(.links[]?; .uri == "pages://quotes.toscrape.com/login")) | .url'

The pages command reads the tree without touching the network, so it stays useful long after a crawl. Since the tree is just files, it also versions cleanly: commit $HOME/data and each crawl becomes a diff you can review.
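When jq is not enough, the same JSON Lines stream can be processed in a script. The sketch below builds an inbound-link index (link target to the URLs of the pages that link to it). It assumes each line is one record with the url and links fields used in the jq examples above; inbound_links is a hypothetical name.

```python
import json
from collections import defaultdict
from typing import Iterable


def inbound_links(jsonl_lines: Iterable[str]) -> dict[str, list[str]]:
    """Map each link target to the URLs of the pages that link to it."""
    index: dict[str, list[str]] = defaultdict(list)
    for line in jsonl_lines:
        line = line.strip()
        if not line:
            continue  # tolerate blank lines in the stream
        record = json.loads(line)

        seen = set()  # count each target once per page
        for link in record.get("links") or []:
            target = link.get("uri")
            if target and target not in seen:
                seen.add(target)
                index[target].append(record["url"])
    return dict(index)
```

Any iterable of lines works as input, so a script can pass sys.stdin directly when the output of kumo pages <host> -o jsonl is piped into it.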