
Introducing null

By Thomas Sileo · 7 min read

A personal storage system built on content-addressed data.

null is a “personal storage system built on content-addressed data”. It’s the foundation of my self-hosted ecosystem: handling my backups, notes, calendar, and monitoring needs.

I’ve been writing/tinkering with backup tools for years using various programming languages. My previous attempt was blobstash, written in Go and inspired by perkeep, and while I used it for my secondary backups for a while, I never managed to also make it work as a “general document store I can build apps on”.

This time with null, I picked Rust. And it feels like I’ve hit the sweet spot.

I think the main reason is the experience gained from my previous attempts.

Being able to leverage LLMs also helped, letting me stay focused on architecture and design while moving faster on the implementation side.

Here were my original requirements:

  • I wanted my data to be local first (as I have a bunch of home servers)
  • Encrypted remote storage is for disaster recovery
  • Backing up to remote storage should be incremental
  • It should serve as a building block for my “personal infrastructure/ecosystem”
    • and building applications on top of it should be easy
  • My data should be safe and it should be very hard to make a mistake
  • It should still support some kind of garbage collection to get rid of old data

Content addressing

Content addressing is a big part of the design. The concept is simple: data is identified by its hash. Storing the same data always results in the same hash, which makes it a good basis for deduplication.

Any identical piece of data will be deduplicated and stored only once across any snapshots or applications.
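To make the idea concrete, here is a minimal in-memory sketch of a content-addressed store. It is an illustration only, not null's code: the `BlobStore` class is hypothetical, and SHA-256 is an assumption since the post does not name the hash function.

```python
import hashlib


class BlobStore:
    """In-memory content-addressed store: the key is the hash of the value."""

    def __init__(self) -> None:
        self._blobs: dict[str, bytes] = {}

    def put(self, data: bytes) -> str:
        # Assumption: SHA-256. The post does not say which hash null uses.
        digest = hashlib.sha256(data).hexdigest()
        # Storing identical bytes twice is a no-op: same hash, same slot.
        self._blobs.setdefault(digest, data)
        return digest

    def get(self, digest: str) -> bytes:
        return self._blobs[digest]

    def __len__(self) -> int:
        return len(self._blobs)


store = BlobStore()
first = store.put(b"hello world")
second = store.put(b"hello world")  # same content, e.g. from another snapshot
other = store.put(b"something else")
```

Here `first` and `second` are the same hash, and the store holds two blobs rather than three.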

Primitives

null supports only three data types: data blobs, trees, and records. All three are stored as content-addressed blobs, with their hash as the unique identifier.

Everything is ultimately a blob. Every operation appends data, and everything is immutable.

Data blobs are content-addressed chunks of bytes, think:

  • file chunks
  • document content

Trees hold ordered lists of hashes, which are references to other blobs (including other trees).

They are effectively Merkle trees and are used for:

  • storing chunks of a file (nested tree of blobs, so that tree does not grow too big)
  • file list in a directory (list of records)
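The nested-tree idea can be sketched as follows. This is a toy model under stated assumptions: the JSON serialization of a tree, the `fanout` parameter, and the helper names (`put`, `put_tree`, `build_tree`, `read_leaves`) are all hypothetical and not taken from null.

```python
import hashlib
import json

blobs: dict[str, tuple[str, bytes]] = {}  # hash -> (kind, payload)


def put(kind: str, payload: bytes) -> str:
    # The kind is mixed into the hash so a tree never collides with raw data.
    digest = hashlib.sha256(kind.encode() + b"\0" + payload).hexdigest()
    blobs[digest] = (kind, payload)
    return digest


def put_tree(children: list[str]) -> str:
    # A tree is an ordered list of child hashes, itself stored as a blob.
    return put("tree", json.dumps(children).encode())


def build_tree(leaves: list[str], fanout: int = 4) -> str:
    """Group hashes into nodes of at most `fanout` children until one root remains."""
    level = leaves
    while True:
        groups = [level[i:i + fanout] for i in range(0, len(level), fanout)]
        level = [put_tree(group) for group in groups] or [put_tree([])]
        if len(level) == 1:
            return level[0]


def read_leaves(root: str) -> list[str]:
    """Walk the tree depth-first and return the non-tree hashes in order."""
    kind, payload = blobs[root]
    if kind != "tree":
        return [root]
    leaves: list[str] = []
    for child in json.loads(payload):
        leaves.extend(read_leaves(child))
    return leaves


chunks = [put("data", bytes([i]) * 3) for i in range(10)]
root = build_tree(chunks)
```

With ten chunks and a fanout of four, the root references three intermediate trees, so no single tree blob grows with the size of the file. Building the tree again from the same chunks yields the same root hash.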

Records are pointers to blobs and they allow retrieving data. Records give identity to a piece of data. They also hold structured metadata as a JSON blob.

Records are the most important data type, for example:

  • a record can hold file metadata and a pointer to the tree holding chunks of data
  • a record can hold a note with a pointer to the note content

Creating a record generates a UUID.

Updating a record means creating a new one referencing the previous version. They will both share the same UUID, forming a version chain.

Deletion works the same way: a new record marks the UUID as deleted.

Since every version is preserved, it gives history tracking and “time travel” type of queries for free.
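A version chain along these lines could be modeled as below. This is a sketch, not null's record format: the field names (`uuid`, `content`, `previous`, `deleted`) and the helper functions are assumptions made for illustration, and timestamps are left out for brevity.

```python
import hashlib
import json
import uuid
from dataclasses import asdict, dataclass
from typing import Optional


@dataclass(frozen=True)
class Record:
    uuid: str
    content: Optional[str]   # hash of the blob or tree this version points to
    metadata: dict
    previous: Optional[str]  # hash of the previous version, None for the first
    deleted: bool = False

    def digest(self) -> str:
        body = json.dumps(asdict(self), sort_keys=True).encode()
        return hashlib.sha256(body).hexdigest()


log: dict[str, Record] = {}  # append-only: versions are never modified


def append(record: Record) -> str:
    digest = record.digest()
    log[digest] = record
    return digest


def create(content: str, metadata: dict) -> str:
    return append(Record(str(uuid.uuid4()), content, metadata, None))


def update(prev_hash: str, content: str, metadata: dict) -> str:
    # Same UUID as the previous version, plus a pointer back to it.
    return append(Record(log[prev_hash].uuid, content, metadata, prev_hash))


def delete(prev_hash: str) -> str:
    # Deletion is just another version that marks the UUID as deleted.
    return append(Record(log[prev_hash].uuid, None, {}, prev_hash, deleted=True))


def history(head: str) -> list[str]:
    """Follow `previous` pointers from the newest version back to the first."""
    chain = []
    cursor: Optional[str] = head
    while cursor is not None:
        chain.append(cursor)
        cursor = log[cursor].previous
    return chain


v1 = create("hash-of-draft", {"title": "draft"})
v2 = update(v1, "hash-of-final", {"title": "final"})
v3 = delete(v2)
```

Walking `history(v3)` returns all three versions, which is what makes "time travel" queries possible: any earlier state of the record is still reachable from the head.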

One “drawback” is that multiple records pointing to the same data and sharing the same metadata have different hashes, as they will get their own UUID.

This is why records can also be immutable, which matters for efficient deduplication. A snapshot needs a UUID to track its identity, but files/directories within a snapshot should always produce the same hash.

An immutable record is a stripped-down version of a record, with no UUID (its hash is the only way to reference it), and without any time-related metadata (like creation date).

This ensures that the same file will be deduplicated across multiple snapshots.
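The difference is easy to see by hashing both kinds of record. In this sketch the field names (`uuid`, `created_at`, `content`, `metadata`) and the canonical JSON encoding are assumptions, not null's actual serialization.

```python
import hashlib
import json
import time
import uuid


def canonical_hash(record: dict) -> str:
    # Sorted keys and fixed separators give a stable byte representation.
    body = json.dumps(record, sort_keys=True, separators=(",", ":")).encode()
    return hashlib.sha256(body).hexdigest()


def mutable_record(content: str, metadata: dict) -> dict:
    # Identity (UUID) and time are part of the record, so its hash is unique.
    return {
        "uuid": str(uuid.uuid4()),
        "created_at": time.time(),
        "content": content,
        "metadata": metadata,
    }


def immutable_record(content: str, metadata: dict) -> dict:
    # No UUID and no timestamps: the hash depends only on content and metadata.
    return {"content": content, "metadata": metadata}


meta = {"name": "report.pdf", "mode": "0644"}

mutable_a = canonical_hash(mutable_record("tree-hash", meta))
mutable_b = canonical_hash(mutable_record("tree-hash", meta))

immutable_a = canonical_hash(immutable_record("tree-hash", meta))
immutable_b = canonical_hash(immutable_record("tree-hash", meta))
```

The two mutable records hash differently even though they describe the same file, while the two immutable records collapse to a single hash and are therefore stored once.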

Indexing

Another big part of null is the indexing layer, built on top of SQLite.

Blobs are stored in a dedicated SQLite database, which is the main storage and single source of truth.

There’s also a second SQLite database that indexes blobs and records. It’s heavily used by the RESTful API for querying, aggregating, and sorting records.

The index can be fully rebuilt by scanning the blob DB. It also relies heavily on SQLite’s JSON1 extension for records’ metadata.
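The two-database layout can be sketched with Python's built-in `sqlite3` module. The table and column names below are hypothetical, and the sketch assumes the bundled SQLite was compiled with JSON support (which provides `json_extract`).

```python
import hashlib
import json
import sqlite3

# Source of truth: every blob, keyed by its hash.
blob_db = sqlite3.connect(":memory:")
blob_db.execute("CREATE TABLE blobs (hash TEXT PRIMARY KEY, kind TEXT, data BLOB)")


def put_record(kind: str, metadata: dict) -> str:
    data = json.dumps({"kind": kind, "metadata": metadata}, sort_keys=True).encode()
    digest = hashlib.sha256(data).hexdigest()
    blob_db.execute("INSERT OR IGNORE INTO blobs VALUES (?, 'record', ?)", (digest, data))
    return digest


def rebuild_index() -> sqlite3.Connection:
    """Recreate the index from scratch by scanning the blob database."""
    index_db = sqlite3.connect(":memory:")
    index_db.execute("CREATE TABLE records (hash TEXT PRIMARY KEY, kind TEXT, metadata TEXT)")
    for digest, data in blob_db.execute("SELECT hash, data FROM blobs WHERE kind = 'record'"):
        record = json.loads(data)
        index_db.execute(
            "INSERT INTO records VALUES (?, ?, ?)",
            (digest, record["kind"], json.dumps(record["metadata"])),
        )
    return index_db


put_record("note", {"title": "groceries", "priority": 3})
put_record("note", {"title": "ideas", "priority": 1})
put_record("event", {"title": "dentist", "priority": 5})

index_db = rebuild_index()
titles = [
    row[0]
    for row in index_db.execute(
        "SELECT json_extract(metadata, '$.title') FROM records "
        "WHERE kind = 'note' AND json_extract(metadata, '$.priority') >= 2 "
        "ORDER BY 1"
    )
]
```

Because the index is derived entirely from the blob database, it can be dropped and regenerated at any time without losing data.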

Disaster Recovery

This design makes it easy to back up the data for disaster recovery, which is handled by null-sync. It backs up blobs to any S3-compatible API, with encryption.

It maintains a small local index (that can also be rebuilt by scanning the remote storage), and it:

  • Packs and encrypts blobs incrementally (to prevent making too many costly requests to the remote storage)
  • Stores small index files, to allow restoring the local index without downloading the whole dataset

A complete restore from remote storage involves downloading and decrypting blobs locally, and letting null re-index everything.
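The incremental packing step could look roughly like this. It is a sketch under heavy assumptions: a dict stands in for the S3 bucket, `seal` is a placeholder where encryption would happen, and the pack naming, size limit, and index layout are invented for illustration rather than taken from null-sync.

```python
import hashlib

remote: dict[str, bytes] = {}                # stands in for an S3 bucket
index: dict[str, tuple[str, int, int]] = {}  # blob hash -> (pack key, offset, length)


def seal(pack: bytes) -> bytes:
    # Placeholder: a real tool would encrypt the pack here before upload.
    return pack


def sync(local_blobs: dict[str, bytes], max_pack_size: int = 1024) -> int:
    """Upload blobs missing from the index, grouped into packs. Returns packs uploaded."""
    packs = 0
    buf = bytearray()
    entries: list[tuple[str, int, int]] = []

    def flush() -> None:
        nonlocal buf, entries, packs
        if not entries:
            return
        key = "packs/" + hashlib.sha256(bytes(buf)).hexdigest()
        remote[key] = seal(bytes(buf))  # one request for many blobs
        for digest, offset, length in entries:
            index[digest] = (key, offset, length)
        buf, entries = bytearray(), []
        packs += 1

    for digest, data in local_blobs.items():
        if digest in index:
            continue  # already uploaded by a previous run
        if buf and len(buf) + len(data) > max_pack_size:
            flush()
        entries.append((digest, len(buf), len(data)))
        buf += data
    flush()
    return packs


def fetch(digest: str) -> bytes:
    key, offset, length = index[digest]
    return remote[key][offset:offset + length]


payloads = [b"a" * 600, b"b" * 600, b"c" * 100]
local = {hashlib.sha256(p).hexdigest(): p for p in payloads}

first_run = sync(local)   # packs everything
second_run = sync(local)  # nothing new to upload
```

Grouping blobs into packs keeps the number of remote requests low, and the index makes each run incremental: only blobs that were never uploaded are packed.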

Ecosystem

Backup tooling

I’ve built 0x00 (named after the null byte in hex), a set of CLI tools to manage backups. It’s inspired by Restic’s design, but uses null’s primitives.

The client uses content-defined chunking to enable deduplication at the chunk level: file content is split into chunks, each chunk is stored as a data blob, and the chunks are tracked in recursive trees. A file is then stored as an immutable record, with the content pointing to the tree holding the chunks.
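For readers unfamiliar with content-defined chunking, here is a small gear-hash style sketch. The post does not say which chunking algorithm 0x00 uses, so the rolling hash, the `GEAR` table, and the tiny size parameters are all assumptions chosen to keep the example short.

```python
import random

_rng = random.Random(0)
GEAR = [_rng.getrandbits(64) for _ in range(256)]  # fixed pseudo-random table
MASK64 = (1 << 64) - 1


def chunk(data: bytes, min_size: int = 16, avg_bits: int = 6, max_size: int = 256) -> list[bytes]:
    """Cut where the rolling hash has `avg_bits` low zero bits (about 2**avg_bits bytes apart)."""
    cut_mask = (1 << avg_bits) - 1
    chunks: list[bytes] = []
    start, h = 0, 0
    for i, byte in enumerate(data):
        # The low bits of h only depend on the last few bytes, so boundaries
        # follow the content rather than absolute offsets in the file.
        h = ((h << 1) + GEAR[byte]) & MASK64
        size = i - start + 1
        if (size >= min_size and (h & cut_mask) == 0) or size >= max_size:
            chunks.append(data[start:i + 1])
            start, h = i + 1, 0
    if start < len(data):
        chunks.append(data[start:])
    return chunks


source = random.Random(1)
original = bytes(source.getrandbits(8) for _ in range(4096))
edited = b"X" + original  # insert a single byte at the very beginning

chunks_before = chunk(original)
chunks_after = chunk(edited)
shared = set(chunks_before) & set(chunks_after)
```

With fixed-size chunks, inserting one byte at the start would shift every boundary and change every chunk. Here the boundaries realign after the edit, so the two versions still have chunks in common and those are stored only once.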

A directory is modeled as an immutable record, with the content pointing to a tree that holds a list of child file/directory records.

A snapshot is a “mutable” record, making it indexed and allowing the garbage collection to mark all the blobs pertaining to the snapshots (as GC walks records’ content and tree recursively).
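The mark phase described here can be sketched as a graph walk. The blob layout below is a toy model: short ids stand in for hashes, and the `content` field name and JSON encoding are assumptions rather than null's real format.

```python
import json


def record(content: str) -> tuple[str, bytes]:
    return ("record", json.dumps({"content": content}).encode())


def tree(children: list[str]) -> tuple[str, bytes]:
    return ("tree", json.dumps(children).encode())


blobs = {
    "d1": ("data", b"chunk-1"),
    "d2": ("data", b"chunk-2"),
    "d3": ("data", b"orphan"),  # not referenced by any snapshot
    "t1": tree(["d1", "d2"]),   # chunks of one file
    "f1": record("t1"),         # immutable file record
    "t2": tree(["f1"]),         # directory listing
    "dir1": record("t2"),       # immutable directory record
    "s1": record("dir1"),       # snapshot record (the GC root)
}


def mark(roots: list[str]) -> set[str]:
    """Return every blob reachable from the given records."""
    live: set[str] = set()
    stack = list(roots)
    while stack:
        digest = stack.pop()
        if digest in live or digest not in blobs:
            continue
        live.add(digest)
        kind, payload = blobs[digest]
        if kind == "tree":
            stack.extend(json.loads(payload))  # follow every child hash
        elif kind == "record":
            stack.append(json.loads(payload)["content"])  # follow the content pointer
    return live


def sweep(live: set[str]) -> list[str]:
    """Drop everything that was not marked and return the removed ids."""
    removed = sorted(digest for digest in blobs if digest not in live)
    for digest in removed:
        del blobs[digest]
    return removed


live = mark(["s1"])
removed = sweep(live)
```

Only indexed snapshot records are used as roots, so a blob survives exactly as long as some live snapshot can reach it.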

Unlike updates to notes and calendar events, taking a second snapshot creates a separate record instead of a new version that references the previous record. This makes retention policies (e.g. keep 3 dailies, 3 weeklies, and 3 monthlies) more straightforward to implement: the tool can query all the records representing snapshots for a given host/path, and mark the ones that are no longer needed as deleted.
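A retention policy of this kind can be sketched as a bucketing problem: for each rule, keep the newest snapshot in each of the N most recent days, weeks, or months that have one. The function below is an illustration with hypothetical names, not the logic used by `0x00 forget`.

```python
from datetime import datetime


def apply_retention(
    snapshots: dict[str, datetime], daily: int = 0, weekly: int = 0, monthly: int = 0
) -> set[str]:
    """Return the ids of the snapshots to keep."""
    rules = [
        (daily, lambda t: t.strftime("%Y-%m-%d")),
        (weekly, lambda t: "%d-W%02d" % t.isocalendar()[:2]),
        (monthly, lambda t: t.strftime("%Y-%m")),
    ]
    newest_first = sorted(snapshots, key=snapshots.get, reverse=True)
    keep: set[str] = set()
    for limit, bucket_of in rules:
        seen: set[str] = set()
        for snapshot_id in newest_first:
            bucket = bucket_of(snapshots[snapshot_id])
            if bucket in seen:
                continue  # a newer snapshot already covers this bucket
            if len(seen) >= limit:
                break     # enough buckets kept for this rule
            seen.add(bucket)
            keep.add(snapshot_id)
    return keep


snapshots = {
    "a": datetime(2024, 3, 10, 8, 0),
    "b": datetime(2024, 3, 10, 20, 0),
    "c": datetime(2024, 3, 9, 12, 0),
    "d": datetime(2024, 3, 8, 12, 0),
    "e": datetime(2024, 2, 15, 12, 0),
}

keep = apply_retention(snapshots, daily=2, monthly=2)
to_delete = set(snapshots) - keep
```

In this example `b` and `c` are kept as the two most recent dailies, `e` is kept as the second monthly, and `a` and `d` would be marked as deleted. Since each snapshot is its own record, forgetting one is a single delete marker with no version chain to untangle.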

Here’s a preview of the 0x00 interface:

# Backup and restore
0x00 backup <path>                          # create backup snapshot
0x00 restore <id> [--output <path>]         # restore snapshot or git repo (auto-detects kind)
0x00 snapshots [--host H] [--path P]        # list all snapshots and git repos
0x00 delete <snapshot-id>                   # mark snapshot as deleted
0x00 forget --daily N [--weekly N] [...]    # apply retention policy (--dry-run to preview)
0x00 ls <snapshot-id> [path]                # list snapshot contents
0x00 diff <snap-1> <snap-2> [path]          # compare two snapshots (+/-/M/T)

Notes, calendar and events

I’m dogfooding 3 applications using null as a backend.

null-cal is a small CalDAV server. It replaced my use of NextCloud’s CalDAV server.

Records hold event metadata, and the blob is the iCalendar serialization of the events. A small Python API server handles the CalDAV protocol and queries null on the fly.

jot is a note-taking/journaling app. It comes with a CLI and a small web UI. Notes and journal entries are modeled as records.

Records hold metadata (like journal date) and the content is stored as a data blob. The UI allows searching, sorting, and filtering journal entries and notes via the index.

nux is the biggest and most complex application: an event log and notification service. It has a ntfy compatibility layer to let me receive push notifications on my phone (no cloud involved). It also has a web UI inspired by GCP Logging (which I use at work).

  • Events are stored as records
  • The UI shows a histogram powered by null’s record aggregation, via a rhai script
  • Powerful filtering (event types, priority, structured fields)
    • It even has its own “query language” that gets converted to null’s query for records

These apps validate that records are a powerful enough primitive to build on!

All these apps are built using the Python client (code in the null repo) that offers some nice syntactic sugar for querying:

client.query_records(
    filter=(m("x") > 1 and m("x") < 3) or (m("y") == "value"),
    kind="note",
)

This example is converted to the JSON wire format for queries, which is then converted to SQL that queries records’ metadata via SQLite’s JSON1 extension.
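The two conversion steps can be illustrated with a tiny expression builder. Everything here is hypothetical: the wire format, the `Expr` class, and `to_sql` are invented for the example and are not the null client's API. The sketch combines conditions with `&` and `|`, the operators Python lets a class overload; how the real client builds its expression tree is not shown in this post.

```python
import sqlite3

ALLOWED_OPS = {">", "<", "="}


class Expr:
    """A node of the filter tree; `node` is the (hypothetical) JSON wire form."""

    def __init__(self, node: dict) -> None:
        self.node = node

    def __and__(self, other: "Expr") -> "Expr":
        return Expr({"and": [self.node, other.node]})

    def __or__(self, other: "Expr") -> "Expr":
        return Expr({"or": [self.node, other.node]})


class m:
    """Refers to a metadata field, e.g. m("x") > 1."""

    def __init__(self, field: str) -> None:
        self.field = field

    def _cmp(self, op: str, value: object) -> Expr:
        return Expr({"field": self.field, "op": op, "value": value})

    def __gt__(self, value: object) -> Expr:
        return self._cmp(">", value)

    def __lt__(self, value: object) -> Expr:
        return self._cmp("<", value)

    def __eq__(self, value: object) -> Expr:  # type: ignore[override]
        return self._cmp("=", value)


def to_sql(node: dict, params: list) -> str:
    """Turn the wire format into a parameterized WHERE clause."""
    for joiner in ("and", "or"):
        if joiner in node:
            parts = [to_sql(child, params) for child in node[joiner]]
            return "(" + f" {joiner.upper()} ".join(parts) + ")"
    if node["op"] not in ALLOWED_OPS:
        raise ValueError(f"unsupported operator: {node['op']}")
    # Field paths and values are bound as parameters, never interpolated.
    params.extend(["$." + node["field"], node["value"]])
    return f"json_extract(metadata, ?) {node['op']} ?"


expr = ((m("x") > 1) & (m("x") < 3)) | (m("y") == "value")
params: list = []
where = to_sql(expr.node, params)

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE records (id TEXT, kind TEXT, metadata TEXT)")
db.executemany(
    "INSERT INTO records VALUES (?, ?, ?)",
    [
        ("a", "note", '{"x": 2}'),
        ("b", "note", '{"x": 5, "y": "value"}'),
        ("c", "note", '{"x": 5, "y": "other"}'),
    ],
)
rows = db.execute(
    f"SELECT id FROM records WHERE kind = ? AND {where} ORDER BY id", ["note", *params]
)
matched = [row[0] for row in rows]
```

Records `a` and `b` match the filter. Keeping the wire format as plain nested JSON means a client in any language can produce it, and the server stays in control of how it becomes SQL.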

The Future

It’s still in early development.

I’m heavily dogfooding it and I keep finding edge cases here and there.

At this point, I feel like I’ve built a solid foundation, and I will focus on ironing out rough edges and polishing existing features.

My end goal is to eventually drop Restic, NextCloud, and Syncthing completely and rely only on null.

If you find the ideas interesting, I’d be happy to hear about it!