+1 WIS

On Restores (and Backups, and the cobbler's shoes)

"Only wimps use tape backup. REAL men just upload their important stuff on ftp and let the rest of the world mirror it." ― Linus Torvalds

It's holiday season, AKA the season of break-ins, fried electronics and exhausted angry humans making mistakes. But you have backups, right? Right?

People (IMO) fall into two categories: either YOLO with no backups and no restores, or something so costly and complicated that it's probably already broken by day 2 for lack of maintenance. (Professionals especially.)

If you're making a new backup scheme, here's my advice: consider how restores will go and let that lead the design from the start!

Below is my approach for my own data. It's a rant about keeping it simple, and engineering it from first principles.

Consider your data

Mine is this:

  1. bulk: a couple of TB of slow-changing photos, videos and other semi-irreplaceable files (yet I'd probably live if I lost it all)
  2. critical: a few MB of metadata, passwords and the like (I really don't want to lose this)
  3. cloud: some cloud providers such as Google have the primary copy of much of my data
  4. most of the non-cloud data lives on sometimes-on devices (i.e. laptops), but I only have a small handful of them
  5. system data is irrelevant: it's NixOS configuration in Git, already backed up both locally and remotely
  6. embedded things like my router get config dumps into a local file after every config change

Consider your threat model

YMMV, but mine is the following:

  1. Someone breaking in and stealing ALL my hardware, or a fire taking it all out
  2. Google locking me out of my account
  3. Me screwing up and fat-fingering something
  4. Silent data corruption / hardware failure

No state actors, no evil maids.

Consider your restores

The threat model means I will need both full-site restores (bulk data, hw failure), and point-in-time from a key set of files (critical data, fat-fingering things without noticing).

Retention is "as long as I can without running out of space" which is quite long considering the bulk data doesn't change often.

Write down your restore processes in advance, then test them. For example:

Restoring a full node or full site

  1. Reinstall NixOS based on my flakes
  2. rsync the latest copy of my home dir back; this includes dotfiles
  3. Rinse and repeat for all hosts as needed
  4. Optionally, restore embedded devices from their config dumps

Restoring select files, point-in-time

  1. Locate the nearest point-in-time (PIT) directory of duckup backups in the filesystem
  2. rsync whatever file I need, back to where I need it
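The two steps above can be sketched in shell. Everything here is an assumption for illustration: the backup root, the dated-directory naming and the file paths are made up, and cp -a stands in for rsync so the sketch runs anywhere:

```shell
set -eu
# Fake backup root standing in for the real one; duckup's actual
# directory naming may differ, so adjust the glob to match yours.
backups=$(mktemp -d)
mkdir -p "$backups/2024-12-01/home/me" "$backups/2024-12-24/home/me"
echo "old" > "$backups/2024-12-01/home/me/notes.txt"
echo "new" > "$backups/2024-12-24/home/me/notes.txt"

# 1. Locate the nearest point-in-time directory: ISO-dated names
#    sort lexicographically, so the last one is the newest.
latest=$(ls -d "$backups"/*/ | sort | tail -n 1)

# 2. Copy the file back to where it's needed. rsync -a does the
#    same job; cp -a keeps this sketch dependency-free.
dest=$(mktemp -d)
cp -a "${latest}home/me/notes.txt" "$dest/notes.txt"
cat "$dest/notes.txt"   # -> new
```

The whole restore is two commands against plain directories, which is exactly what you want when your tooling is down and you're in panic mode.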

Living on after Google

  1. Install whatever can read docx files and the like that come out of the Gdrive backup
  2. Make some angry sounds, move on

Add in a few guidelines

Mine are:

  1. It's gotta be cheap and simple. I can't be arsed to maintain something complicated at home, and I want restores to be easily doable when all my infra has failed.
  2. I need to test restores regularly, because nobody needs backups, we all need restores.
  3. As automated as it gets without going overboard (I run one script per device -- full control over timing but otherwise nothing to remember)
  4. Full disk encryption locally is good enough for bulk data. Anything offsite needs to be encrypted first -- no trusting some cloud provider's keys.
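Guideline 4 in practice can look like the sketch below. The post doesn't name a tool, so openssl is used here as a widely-available stand-in (age or gpg work just as well); the filenames and passphrase are made up:

```shell
set -eu
work=$(mktemp -d)
echo "a few MB of metadata, passwords and the like" > "$work/critical.txt"

# Encrypt locally with a symmetric passphrase BEFORE anything
# leaves the house -- no trusting the cloud provider's keys.
openssl enc -aes-256-cbc -pbkdf2 -salt \
  -pass pass:hunter2 \
  -in "$work/critical.txt" -out "$work/critical.txt.enc"
# "$work/critical.txt.enc" is what gets uploaded offsite.

# Restore path: decrypt with the same passphrase.
openssl enc -d -aes-256-cbc -pbkdf2 \
  -pass pass:hunter2 \
  -in "$work/critical.txt.enc" -out "$work/restored.txt"

cmp "$work/critical.txt" "$work/restored.txt" && echo "roundtrip ok"
```

The restore path only needs openssl and the passphrase, which keeps it doable even when all the usual infra is gone.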

Which then results in an approach

  1. rclone backs up Google Drive locally; the results then get backed up to a second local machine (but not offsite again, since that data already has an offsite copy)
  2. "bulk" local data gets a second copy on a second local machine
  3. "critical" data gets (locally-encrypted) cloud backups in addition
  4. I prefer raw files to archives, because restoring becomes trivial when in panic mode. Thus I wrote a tool called duckup that does a simple and cheap "historical backups via rsync" thing.
  5. A recurring Google Calendar task (!) is used as a soft scheduler to tell me to run the scripts when convenient.

What's work in progress

  1. Not all of my Google life is covered: I've found no good way to back up Google Photos, and I should probably set up something IMAP or API-based for Gmail and Calendar. There's Takeout but it's tedious and manual.
  2. Offsite backups in the cloud are expensive on the TB level, so I'm still working that one out.
  3. Restore testing is limited to the occasional manual test, and I don't have enough hardware to do an actual full-site restore.
  4. Some other cloud services I use lack backups. I managed to get a track list out of Spotify, but not the files. Maybe Anna will sort this for me.