LAST REVIEWED · 2026-05-28 · SH

Container-image forensics is the new doc-dump.

A vendor's sales deck is a marketing document. A vendor's container image is the application.

The deck tells you what the vendor wants you to think the service does. The image tells you, line by line, what the service actually does — what binaries it ships, what models it loads, what endpoints it dials home to, what tables it expects to find in the database, what default credentials the developer forgot to remove before pushing to the public registry on a Friday afternoon. The deck takes a week to read between the lines. The image takes about an afternoon, if you know where to look.¹

I have spent more of the last twelve months pulling images apart than reading documents. I want to write down why, and how, because I do not think enough beat reporters know this is now an option.

the doc-dump era is ending, slowly

For most of my career, the canonical investigative artifact was a stack of paper or a directory of PDFs, scanned, OCR'd, half-redacted, tagged with a Bates number. The work was reading. The skill was patience. The accountability lever was: the documents say one thing, the vendor said another, here is the gap.

Documents are still the floor. They will not go away. But the proportion of what a system actually does that is recorded in document form — RFPs, contracts, training manuals, technical schematics — is shrinking. The application code is now the system. And the application code, increasingly, ships as a Docker image or an OCI bundle published to a public registry, often by mistake, often by laziness, often because the vendor's own deploy pipeline depends on it being public.

This is good news for journalism. And it is good news you can act on without a degree in distributed systems.

what an image actually is

A container image is a stack of filesystem layers, each one a tar archive (usually gzip-compressed) addressed by its sha256 digest, with a JSON manifest at the top describing how they're meant to be assembled and a JSON config recording the startup command and environment. You can pull one from a public registry with a single command. You can unpack the layers with skopeo copy or docker save and a tar extractor. You can grep the result like any other directory tree. The image is a complete, frozen snapshot of the runtime: every binary, every config file, every shell script, every dependency lockfile, every environment variable the maintainer baked in instead of supplying at runtime.
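To make the "stack of layers" concrete, here is a minimal Python sketch that lists the files in each layer of an archive written by docker save. It assumes the archive has a top-level manifest.json whose first entry names the layer tarballs under "Layers"; the function name list_layer_files is mine, not part of any tool.

```python
import io
import json
import tarfile


def list_layer_files(archive_path):
    """Return {layer_name: [file paths]} for a `docker save`-style tarball.

    Assumption: the archive carries a top-level manifest.json whose first
    entry lists the layer tarballs under the "Layers" key.
    """
    result = {}
    with tarfile.open(archive_path, mode="r:*") as outer:
        manifest = json.load(outer.extractfile("manifest.json"))
        for layer_name in manifest[0]["Layers"]:
            layer_bytes = outer.extractfile(layer_name).read()
            # Mode "r:*" lets tarfile detect gzip or plain tar by itself.
            with tarfile.open(fileobj=io.BytesIO(layer_bytes), mode="r:*") as layer:
                # Only names are read; nothing is written to disk.
                result[layer_name] = [m.name for m in layer.getmembers() if m.isfile()]
    return result
```

Listing each layer separately, rather than flattening them, matters for this kind of work: a file deleted in a later layer is still present in the earlier one. Reading names without extracting also sidesteps the risk of a hostile archive writing outside the target directory.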

Three things are usually in there that the sales deck will not tell you.

First, the entrypoint. The exact command the container runs on startup. Often a wrapper script that exposes — in plain text — the order in which services are dialed, the URLs of internal APIs, the names of feature flags, the path to the model weights, the username and password defaults if you forgot to override them.
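The entrypoint lives in the image's JSON config, not in the filesystem, so it can be read before unpacking anything. A small sketch, assuming you already have the config parsed into a dict (for example from the output of skopeo inspect --config); the function name and the sample values in the usage note are illustrative only.

```python
def startup_summary(config):
    """Pull the startup-relevant fields out of a parsed image config dict.

    Assumption: the dict follows the OCI image config layout, with a
    top-level "created" timestamp and a nested "config" object holding
    "Entrypoint", "Cmd" and "Env".
    """
    runtime = config.get("config") or {}
    env = {}
    for item in runtime.get("Env") or []:
        # Env entries are "KEY=value" strings; split on the first "=" only.
        key, _, value = item.partition("=")
        env[key] = value
    return {
        "created": config.get("created"),
        "entrypoint": runtime.get("Entrypoint") or [],
        "cmd": runtime.get("Cmd") or [],
        "env": env,
    }
```

Called on a config whose "Entrypoint" is ["/app/start.sh"], this returns that list under "entrypoint", and the wrapper script it names is the next file to read. The "env" mapping is where baked-in defaults tend to show up.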

Second, the dependency tree. The lockfile: requirements.txt, package-lock.json, Cargo.lock, go.sum. Every third-party library the application uses, pinned to a version. From the lockfile you can tell what kind of application this is in about ninety seconds. A corrections-tech "AI risk-scoring tool" whose lockfile pins scikit-learn==0.24 and xgboost, and nothing heavier, is not running a foundation model. It is running an ensemble of decision trees, most likely trained around 2021. The deck will not tell you that. The lockfile will.
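The ninety-second read can be roughed out in code. This sketch parses exact pins from a requirements.txt and sorts them into two hand-picked buckets. The bucket contents are my own illustrative grouping, not an authoritative taxonomy, and the parser deliberately ignores anything that is not a plain name==version line (extras, ranges, URLs).

```python
import re

# Matches simple "name==version" pins only; other requirement forms are skipped.
PIN = re.compile(r"^\s*([A-Za-z0-9][A-Za-z0-9._-]*)\s*==\s*([^\s;#]+)")

# Illustrative groupings chosen for this sketch.
CLASSICAL_ML = {"scikit-learn", "xgboost", "lightgbm", "statsmodels"}
DEEP_LEARNING = {"torch", "tensorflow", "jax", "transformers"}


def parse_pins(text):
    """Return {package_name: version} for every exact pin in the text."""
    pins = {}
    for line in text.splitlines():
        match = PIN.match(line)
        if match:
            name = match.group(1).lower().replace("_", "-")
            pins[name] = match.group(2)
    return pins


def ml_footprint(pins):
    """Report which of the hand-picked ML packages appear among the pins."""
    names = set(pins)
    return {
        "classical": sorted(CLASSICAL_ML & names),
        "deep_learning": sorted(DEEP_LEARNING & names),
    }
```

An empty "deep_learning" list is a lead, not a finding: the model could still be fetched at runtime or called over an API, which is why the entrypoint gets read as well.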

Third, the leftovers. Developers ship things they did not mean to ship. A .env.example that has real keys in it. A seed.sql that contains a small slice of production data. A debug script that hits the staging API with a bearer token still in the comments. Test fixtures that name actual contracting agencies. The image is a snapshot of the moment the maintainer ran docker build; whatever was in the working directory is in the image.

a worked example — the corrections-tech vendor

I will not name the vendor here. They are a mid-sized contractor selling a "behavioral analytics platform" to county jails. Their public-facing material describes a system that "uses advanced machine learning to identify at-risk individuals and route them to appropriate services." Three claims in that sentence. None of them are clearly defined.

I pulled the image from a registry that should not have been public. The vendor had set the main repository to private weeks earlier, but a CI mirror under a separate account was still pushing the same digests, and the mirror had been left open. Forty-eight layers. About 1.4 gigabytes after decompression.

Here is what was in there.

The "machine learning" is, in fact, a logistic regression trained on a flat CSV of historical jail bookings. The CSV is in the image. It has 28,000 rows and 14 columns, and three of the columns are demographic variables the deck explicitly claims the model does not use. The model is not retrained on customer data; the model is shipped frozen, and every customer hits the same logistic regression. The "advanced" part of the system is a templating engine that turns the model's output into a recommendation string. There is no foundation model. There is no neural network. There is one .pkl file, ninety kilobytes, and a Jinja template.
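A note on how to look at a .pkl safely: loading a pickle can execute arbitrary code, so a file from someone else's image should be read, not loaded. The standard library's pickletools can walk the opcodes without running them. The sketch below lists the classes a pickle would import. It is a heuristic of my own: for newer pickle protocols it pairs up the two most recent string arguments, which can miss names that were stored earlier and fetched back from the pickle's memo.

```python
import pickletools


def pickle_imports(data):
    """List the module.name globals a pickle refers to, without loading it."""
    found = []
    recent_strings = []
    for opcode, arg, _pos in pickletools.genops(data):
        if opcode.name in ("GLOBAL", "INST"):
            # The argument arrives as "module name", separated by one space.
            found.append(arg.replace(" ", ".", 1))
        elif opcode.name == "STACK_GLOBAL":
            # Protocol 4+: module and name were pushed as strings just before.
            if len(recent_strings) >= 2:
                found.append(".".join(recent_strings[-2:]))
        elif isinstance(arg, str):
            recent_strings.append(arg)
    return sorted(set(found))
```

For a scikit-learn model you would expect the output to include a class name under sklearn; a name under sklearn.linear_model and nothing from a deep-learning library is consistent with a linear model, though it is only one input to that conclusion.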

The deck would have me believe I was looking at adaptive, jurisdiction-specific decision support. The image tells me I am looking at a 2019-era classifier in a 2026 dress.

I did not get this from a whistleblower. I got it from skopeo.

the floor — what makes this responsible reporting

Reading an image is not the story. Reading an image is the start of the story. Everything I pulled out of the image went into the standard corroboration discipline: three independent sources, two channels, confidence scalar above 0.60 before anything gets published. The image is one source. The vendor's own public statements are another. A second image — same vendor, an older version pulled from a separate registry — is a third. Patent filings, court records where the system has been used in proceedings, FOIA returns of the actual contracts: each adds a channel.
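The gate itself is simple enough to write down. This is a sketch of the three thresholds as stated above, nothing more: the confidence value is taken as an input because how it is scored is not specified here, and whether two sources are truly independent is a judgment no function can make. The class and function names are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Source:
    name: str     # e.g. "current image", "vendor public statements"
    channel: str  # e.g. "registry", "public statements", "court records"


def ready_to_publish(sources, confidence,
                     min_sources=3, min_channels=2, min_confidence=0.60):
    """Apply the three thresholds: sources, channels, confidence above 0.60."""
    distinct_sources = {s.name for s in sources}
    distinct_channels = {s.channel for s in sources}
    return (
        len(distinct_sources) >= min_sources
        and len(distinct_channels) >= min_channels
        and confidence > min_confidence
    )
```

With the three sources named above (the current image, the vendor's public statements, an older image) spread over two channels and a confidence of 0.7, the check passes; at exactly 0.60 it does not, since the rule says above.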

The thing I am very careful about: an image is evidence of what the application was at the moment the image was built. It is not evidence of what the application is right now, on any given customer's hardware, with any given customer's configuration. The image is a strong prior. It is not a verdict.

The right-of-reply rule still applies. If the subject of the report is a person, I write them first. If the subject is a vendor, I write counsel and the vendor's listed press contact, and I show them, in the letter, exactly what I extracted and what I'm going to publish, with citations. Most of the time they answer. Sometimes they correct a real mistake. Once, they sent counsel a takedown demand that contradicted, in writing, what their own image said. That was a useful letter to receive.

the toolchain, deliberately boring

You do not need a Bittensor subnet to do this. You do not need a GPU. The tools are: skopeo (to pull without a Docker daemon), dive (to walk the layer tree visually), trivy or grype (to enumerate dependencies), ripgrep (to grep the unpacked tree), and a notebook to write down what you found and where. All free. All running on a laptop. The validator work I do on subnet 21 publishes its rubric in plain Rust for the same reason: methods are not interesting unless they are reproducible. If a beat reporter at a county weekly cannot follow the steps, the method is not a method, it is a parlor trick.

The day-to-day work is a directory of unpacked filesystems, a long markdown file of notes, a list of grep patterns I run against every new image, and a folder of small Python scripts that pull the dependency manifests out of a layer. It is closer to library science than to hacking. The "AI" part — using a local model to summarize a 4,000-line shell script so I can decide whether to read it in full — is a labor multiplier, not the analysis.
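One of those small scripts, roughly. This sketch reads a single layer tarball and returns the text of any dependency manifest it finds. The set of filenames is the four mentioned earlier in this piece; extend it for other ecosystems. It expects a seekable file object, such as a file opened in binary mode.

```python
import posixpath
import tarfile

# Filenames to look for; extend as needed.
MANIFEST_NAMES = {"requirements.txt", "package-lock.json", "Cargo.lock", "go.sum"}


def dependency_manifests(layer_fileobj):
    """Return {path_in_layer: text} for each dependency manifest in a layer."""
    found = {}
    with tarfile.open(fileobj=layer_fileobj, mode="r:*") as layer:
        for member in layer:
            if not member.isfile():
                continue
            # Tar member names use forward slashes regardless of host OS.
            if posixpath.basename(member.name) in MANIFEST_NAMES:
                raw = layer.extractfile(member).read()
                found[member.name] = raw.decode("utf-8", errors="replace")
    return found
```

Keying the result by the full path inside the layer keeps the citation intact: the note can say which layer and which path a given pin came from.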

what to do if you are a beat reporter

Pick one vendor on your beat. Search the public registries — Docker Hub, GHCR, Quay, GitLab Container Registry, the AWS public ECR gallery — for anything that matches the vendor's product name, their parent company, their CI org, their developers' personal handles. You will be surprised how often something is there. Pull it. Unpack it. Look at the entrypoint, the lockfile, and the leftovers. Then go find the contract.

Most of this work is paperwork, postage, and patience. It does not photograph well. But the gap between what a vendor sells and what a vendor ships is the story, every time, and the image is the most efficient way I have found to see the gap on the record. I am embarrassed by how much more is sitting in plain sight in public registries than I would have guessed before I started looking.

  1. Many CI pipelines push to registry repositories that are public by default. The observation is reproducible with skopeo + dive on any public image; no special access required. See methodology for exact commands and corroboration ledger.