
An untested backup is not a backup: the case for restore drills

The only way to know a backup works is to restore it. How automated restore-drills turn a hope into a verified, dated fact.

Sevak Girard · Founder, Girard Media · Jun 25, 2026 · 6 min read

There is a quiet lie that lives in almost every infrastructure setup. It sounds like this: "We have backups." The cron job runs, the dashboard is green, the bucket fills up with neat little timestamped objects. Everyone moves on. But a backup you have never restored isn't a backup — it's a hypothesis. The only thing that converts that hypothesis into a fact is putting it back.

The green checkmark is not the deliverable

Most backup tooling optimizes for the wrong moment. It tells you that a snapshot was written. That's the easy half. Writing bytes to R2 or S3 is reliable, well understood, and rarely where disasters come from.

Disasters come from the read half, and the read half is full of quiet ways to fail:

  • The dump completed, but the database was mid-migration and the schema is inconsistent.
  • The archive is fine, but the encryption key you'd need to open it lives only on the box that just died.
  • The volume snapshot captured the container's data directory, but not the named volume it actually mounts.
  • The image restores, but a hardcoded IP, a missing env var, or a stale TLS cert means nothing actually comes up.
  • The backup ran every night for a year — of an empty directory, because a path changed and nobody noticed.

Every one of these passes the "backup succeeded" check. Every one of them fails the only test that matters: can you stand the thing back up and serve traffic? The deliverable was never the snapshot. The deliverable is a running server.
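The last failure in that list, a year of backups of an empty directory, is cheap to catch before you even attempt a full restore. As a minimal sketch, assuming the backup has already been decrypted into a plain tar archive, a drill could refuse to continue when the archive holds fewer regular files than expected. The function name `backup_has_files` and the threshold are illustrative, not part of any tool mentioned here.

```shell
# Hypothetical pre-restore sanity check (name and threshold are assumptions).
# Succeeds only if the tar archive contains at least MIN_FILES non-directory entries.
backup_has_files() {
  local archive="$1" min_files="${2:-1}"
  local count
  # "tar -tf" lists entries; directory entries end in "/", so count the rest.
  # "|| true" keeps a zero count from aborting scripts that run under "set -e".
  count=$(tar -tf "$archive" 2>/dev/null | grep -cv '/$' || true)
  [ "$count" -ge "$min_files" ]
}

# Example: stop the drill early if the archive looks hollow.
# backup_has_files /tmp/server-x.tar 100 || echo "backup looks empty" >&2
```

A check like this does not replace the restore. It only turns one silent failure into a loud one, sooner.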

A restore drill, defined

A restore drill is the deliberate, scheduled act of taking a real backup and bringing it back to life somewhere it can be inspected — then recording, with a date, that it worked.

It is not a fire drill where everyone reads the runbook and nods. It's the actual restore: pull the encrypted image, decrypt it with the key you'd really use, rebuild the stack, bring up the services, and check that the application answers. If any step needs a human to "remember" something, that's not a drill, that's a future outage with extra steps.

The output of a drill is small but precious: a timestamp, the backup it restored, the time it took, and a pass/fail. That single dated line is worth more than a thousand green "backup complete" notifications, because it's the only artifact that describes what will happen when you're not ready for it to happen.
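That dated line can be as plain as one append-only log entry per drill. The sketch below is one possible shape, not a prescribed format: the field names, the `format_drill_record` helper, and the `drills.log` path are all assumptions made for illustration.

```shell
# Hypothetical helper: print one drill record as a single line.
# Arguments: UTC timestamp, backup identifier, restore duration in seconds, pass|fail.
format_drill_record() {
  local timestamp="$1" backup="$2" duration_s="$3" result="$4"
  printf '%s backup=%s duration_s=%s result=%s\n' \
    "$timestamp" "$backup" "$duration_s" "$result"
}

# Example: append today's result to a log you keep outside the server being drilled.
# format_drill_record "$(date -u +%Y-%m-%dT%H:%M:%SZ)" \
#   "r2://your-bucket/server-x/latest" 412 pass >> drills.log
```

One line per drill keeps the history easy to grep, and easy to hand to anyone who asks when a restore was last proven.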

Drill to a scratch target, not to prod

The point of a drill is to learn without risk, so restore into disposable space — a throwaway VM, an isolated network, a fresh box you tear down afterward. This is exactly the muscle a relocate is built on: if your tooling can take an encrypted server image and rehydrate it as a whole server somewhere else, then a restore drill and a real migration are the same motion performed for different reasons. You back up, you clone, you relocate. A drill is just a relocate you throw away on purpose.

What a real drill actually verifies

When you restore for real, you stop testing the backup and start testing the entire recovery path. That path includes a lot of things people forget are part of "the backup":

  • Custody and keys. Can you decrypt without the original host? With agent-local custody the key lives with your agent; with zero-knowledge, only you hold it; with escrow, recovery flows through a defined break-glass process. A drill is where you find out whether that process works before the day you need it. If a restore requires a secret nobody can produce, you don't have a backup, you have a locked safe.
  • Completeness. Databases, named volumes, uploaded media, certificates, environment configuration — the drill either brings the app fully back or exposes exactly what's missing.
  • The rebuild. A backup of data still needs an environment to run in. When the recovery path rebuilds the stack — HostPack into Docker behind Traefik — you're verifying that the image plus the build pipeline reconstitutes a working system, not just a folder of files.
  • Time. The recovery time you can actually count on is not the objective written in the runbook; it's the number the stopwatch shows during a drill. Those are usually different. Better to learn that on a calm Tuesday.
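The stopwatch in that last bullet can be literal. As a rough sketch, the wrapper below runs any restore command, measures wall-clock seconds, and compares the result with the number in the runbook. The name `time_against_rto` and the example script path are assumptions; the restore command itself is whatever your tooling provides.

```shell
# Hypothetical wrapper: time a command and compare it with a recovery-time objective.
# Usage: time_against_rto RTO_SECONDS command [args...]
# Succeeds only if the command exits 0 and finishes within RTO_SECONDS.
time_against_rto() {
  local rto_s="$1"; shift
  local start end elapsed status=0
  start=$(date +%s)
  "$@" || status=$?          # capture failure without aborting under "set -e"
  end=$(date +%s)
  elapsed=$(( end - start ))
  echo "elapsed_s=$elapsed rto_s=$rto_s exit=$status"
  [ "$status" -eq 0 ] && [ "$elapsed" -le "$rto_s" ]
}

# Example: a one-hour objective checked against a real restore script.
# time_against_rto 3600 ./restore-to-scratch.sh
```

Second-level resolution is coarse, but a drill measures minutes, and the gap between runbook and stopwatch is rarely a matter of seconds.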

Why "your data, your bucket" makes drills honest

Restore drills only tell the truth if you control the whole loop. If your backups live inside a provider's proprietary system, your "drill" can quietly become a test of their restore button — and their restore button only ever puts you back onto their platform. You've verified you can recover, as long as you never want to leave.

That's the trap. Bring-your-own storage flips it. When the encrypted images sit in your own R2 or S3 bucket, a drill becomes a genuine test of your exit: can you, with your credentials and your keys, reconstruct your server independent of where it ran? If the answer is yes, you've verified two things at once — that your backup is real, and that your lock-in is zero. Those turn out to be the same property. A backup you can restore anywhere is also a door nobody can lock from the outside.

Automate the drill, or it won't happen

Manual restore drills suffer the fate of all manual discipline: they happen twice, then never. The honest path is to make the drill a scheduled job, not a calendar reminder.

Conceptually the loop is small:

# nightly: prove the latest image actually comes back
hostssh restore drill --source r2://your-bucket/server-x/latest \
                      --target scratch --teardown-after \
                      --assert http://app.local/healthz=200

Pick the latest backup, restore it to a scratch target, wait for the app to answer a health check, record the result, tear the scratch environment down. If it fails, you hear about it now — while the original is still running and you have all the time in the world to fix the recovery path. If it passes, you get that dated, verified fact, automatically, without anyone having to be brave.
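Spelled out as plain shell, the same loop might look like the sketch below. It is an illustration of the shape, not the implementation behind any particular command: the three hook functions are hypothetical placeholders for whatever performs your restore and teardown, and `HEALTH_URL` and the two timing variables are assumptions as well. Only the `curl` flags are real.

```shell
# Assumed defaults; override them in the environment.
HEALTH_URL="${HEALTH_URL:-http://app.local/healthz}"
DRILL_TIMEOUT_S="${DRILL_TIMEOUT_S:-300}"
DRILL_INTERVAL_S="${DRILL_INTERVAL_S:-5}"

# Hypothetical hooks: replace the bodies with your own restore tooling.
restore_to_scratch() { echo "restore_to_scratch: not implemented" >&2; return 1; }
teardown_scratch()   { :; }

# Passes only when the scratch app answers the health URL with HTTP 200.
probe_health() {
  [ "$(curl -s -o /dev/null --max-time 5 -w '%{http_code}' "$HEALTH_URL")" = "200" ]
}

# Retry a probe command until it succeeds or TIMEOUT seconds have passed.
wait_until() {
  local timeout_s="$1" interval_s="$2"; shift 2
  local deadline=$(( $(date +%s) + timeout_s ))
  until "$@"; do
    [ "$(date +%s)" -ge "$deadline" ] && return 1
    sleep "$interval_s"
  done
}

# Restore, wait for health, then tear down whether or not the drill passed.
run_drill() {
  local result=fail
  if restore_to_scratch && wait_until "$DRILL_TIMEOUT_S" "$DRILL_INTERVAL_S" probe_health; then
    result=pass
  fi
  teardown_scratch
  echo "result=$result"
  [ "$result" = pass ]
}
```

Calling `run_drill` from a scheduler and alerting on a non-zero exit is enough to start. The teardown runs on both paths so a failed drill does not leave a scratch environment behind, though you may prefer to keep it around for debugging.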

The cadence matters less than the consistency. A weekly drill that always runs beats a "comprehensive quarterly DR test" that keeps slipping. What you're building is not a single heroic recovery rehearsal; it's a steady drip of evidence that the answer to "can we come back?" is still yes.
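Consistency is also something you can check mechanically. As a small sketch, a monitor could compare the time of the last passing drill with the current time and complain when the evidence goes stale. The `drill_is_fresh` name and the seven-day default are assumptions; pick whatever window matches your cadence.

```shell
# Hypothetical freshness check on drill evidence.
# Arguments: epoch seconds of the last passing drill, current epoch seconds,
# and the maximum acceptable age in days (defaults to 7).
drill_is_fresh() {
  local last_pass_epoch="$1" now_epoch="$2" max_age_days="${3:-7}"
  local age_s=$(( now_epoch - last_pass_epoch ))
  [ "$age_s" -le $(( max_age_days * 86400 )) ]
}

# Example: alert when no drill has passed in the last week.
# drill_is_fresh "$last_pass_epoch" "$(date +%s)" 7 || echo "restore evidence is stale" >&2
```

This closes the loop on the scheduled job itself: if the drill silently stops running, the missing evidence becomes the alert.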

The takeaway

Backups are a story you tell yourself about the future. A restore drill is the only thing that checks whether the story is true. You don't need certainty about every failure mode — you need one recent, dated line that says we restored this, end to end, and it served traffic. Make that line cheap to produce and produce it often. The goal isn't to feel safe. It's to already know, before anything breaks, that the door out works — and that it opens with your key, into your bucket, on your terms.