# Backup & Restore
A backup that’s never restored is not a backup. This page covers what to back up, where to ship it, how often, and — critically — how to verify the restore actually works before you need it in production.
## Targets

| Metric | Target |
|---|---|
| RPO (data loss tolerance) | ≤ 1 hour — hourly Postgres dumps + WAL streaming |
| RTO (recovery time) | ≤ 30 minutes for DB restore + service restart from a clean VM, given the latest dump in hand |
| Backup retention | 30 daily + 12 monthly + 4 yearly snapshots (off-site) |
| Encryption at rest | All off-site dumps are AES-256 encrypted via the same ECOMMUS_DATA_KEY chain or a dedicated backup KEK |
| Off-site target | S3-compatible (Backblaze B2 recommended for €0 budget; AWS S3 or Cloudflare R2 also work) |
These are defaults for a single-tenant install. Multi-tenant SaaS deployments on the Enterprise tier need a stricter RPO (continuous WAL ship-out) — see ADR-031.
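A silently failing backup cron breaches the RPO long before anyone notices. Below is a minimal freshness check, as a sketch: it assumes the `ecommus-backup` remote and the timestamped directory naming used by the hourly cron later on this page, plus GNU `date`.

```bash
#!/bin/bash
# Alert if the newest off-site dump is older than the 1 h RPO target.
set -euo pipefail

# Keep only the timestamped dump dirs (skips wal/ and any other prefixes)
newest=$(rclone lsf ecommus-backup:ecommus-prod --dirs-only | grep '^20' | sort | tail -1)
ts=${newest%/}                        # e.g. 2026-05-02T14-00-00Z
d=${ts%%T*}                           # date part:  2026-05-02
t=${ts#*T}; t=${t%Z}; t=${t//-/:}     # time part:  14:00:00
age_min=$(( ($(date -u +%s) - $(date -u -d "$d $t" +%s)) / 60 ))

if [ "$age_min" -gt 60 ]; then
  echo "ALERT: newest dump is ${age_min} min old (RPO target: 60)" >&2
  exit 1
fi
echo "OK: newest dump is ${age_min} min old"
```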
## What gets backed up

| Source | Where | How |
|---|---|---|
| `ecommus_prod` Postgres database | Custom-format dump (`pg_dump -Fc`) | Hourly cron + WAL streaming |
| `ecommus_licenses` (self-hosted license-server) | Same | Hourly cron |
| Uploads (`STORAGE_LOCAL_DIR=/opt/ecommus/uploads`) | rsync diff + nightly tarball | Nightly cron |
| Keys (`/opt/ecommus/keys/*.pem`) | Encrypted tarball, separate target | Once on provision; never again unless rotated |
| `.env` | Encrypted, separate target | After every change |
| Caddyfile / Nginx vhosts | Same | After every change |
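The uploads row in script form, as a sketch: `rclone sync` stands in here for the rsync diff, and the remote name matches the DB-dump examples below.

```bash
# Nightly cron: incremental mirror of uploads, plus a dated tarball kept locally
rclone sync /opt/ecommus/uploads ecommus-backup:ecommus-prod/uploads/ --transfers 8
tar -C /opt/ecommus -czf "/var/backups/ecommus/uploads-$(date -u +%Y%m%d).tar.gz" uploads
```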
Don’t back up `dist/`, `node_modules/`, `apps/*/.next`, `apps/*/.astro`, `data/db/` (that’s the dev pglite path — production runs Postgres). They’re regenerable from git + `npm ci`.
Don’t back up the keys to the same target as the DB. Compromise of one bucket shouldn’t compromise the other.
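In script form, something like the sketch below, run on provision and after every key rotation or `.env` change. The `ecommus-secrets` remote and the dedicated passphrase file are assumptions; the point is that neither the target nor the passphrase is shared with the DB dumps.

```bash
# Encrypted secrets tarball, shipped to a separate remote/account
tar -C /opt/ecommus -czf - keys .env \
  | gpg --batch --yes --symmetric --cipher-algo AES256 \
      --passphrase-file /etc/ecommus/keys-passphrase \
      --output "/tmp/secrets-$(date -u +%Y%m%d).tar.gz.gpg"
rclone copy "/tmp/secrets-$(date -u +%Y%m%d).tar.gz.gpg" ecommus-secrets:ecommus-secrets/
rm /tmp/secrets-*.tar.gz.gpg
```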
## Hourly DB dump cron

```
0 * * * * ecommus /opt/ecommus/scripts/backup.sh >> /var/log/ecommus-backup.log 2>&1
```

```bash
#!/bin/bash
set -euo pipefail

ts=$(date -u +%Y-%m-%dT%H-%M-%SZ)
out=/var/backups/ecommus/$ts
mkdir -p "$out"

# Postgres dump (custom format = compressed + parallel-restore-friendly)
pg_dump -Fc -d "$DATABASE_URL" -f "$out/db.dump"

# License DB if self-hosting license-server
if [ -n "${LICENSE_DATABASE_URL:-}" ]; then
  pg_dump -Fc -d "$LICENSE_DATABASE_URL" -f "$out/licenses.dump"
fi

# Encrypt with the backup KEK (separate from ECOMMUS_DATA_KEY in production)
gpg --batch --yes --symmetric --cipher-algo AES256 \
  --passphrase-file /etc/ecommus/backup-passphrase \
  "$out/db.dump"
rm "$out/db.dump"
if [ -f "$out/licenses.dump" ]; then
  gpg --batch --yes --symmetric --cipher-algo AES256 \
    --passphrase-file /etc/ecommus/backup-passphrase \
    "$out/licenses.dump"
  rm "$out/licenses.dump"
fi

# Off-site ship via rclone (configure once: `rclone config`)
rclone copy "$out" "ecommus-backup:ecommus-prod/$ts/" \
  --transfers 4 --checkers 8

# Local retention: keep the last 24 hourly dumps locally
find /var/backups/ecommus -mindepth 1 -maxdepth 1 -type d -mtime +1 -exec rm -rf {} +
```

Off-site retention is enforced by the bucket lifecycle rule (rotate to cold storage after 30 days, delete after 1 year — adjust per your compliance requirements).
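Between monthly drills, a cheap integrity smoke test: decrypt the freshest local dump and list its table of contents with `pg_restore --list`, which parses the archive without touching any database. A minimal sketch, assuming the paths from `backup.sh`:

```bash
# Verify the newest local dump decrypts and parses (no DB required)
latest=$(find /var/backups/ecommus -mindepth 1 -maxdepth 1 -type d | sort | tail -1)
gpg --batch --quiet --passphrase-file /etc/ecommus/backup-passphrase \
  --decrypt "$latest/db.dump.gpg" > /tmp/check.dump
pg_restore --list /tmp/check.dump > /dev/null && echo "dump OK: $latest"
rm /tmp/check.dump
```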
## WAL streaming (RPO < 1 h)

Configure Postgres `archive_mode` + `archive_command` to ship WAL segments to the same off-site bucket:

```
wal_level = replica
archive_mode = on
archive_command = 'rclone copyto %p ecommus-backup:ecommus-prod/wal/%f'
archive_timeout = 300   # 5 min
```

Together with the hourly pg_dump, this gives you a worst-case 5-minute data loss on a cold recovery from the bucket.
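A broken `archive_command` doesn’t surface in the application, so check the archiver stats occasionally (or wire this into monitoring):

```bash
# failed_count should be 0 and last_archived_time within archive_timeout
psql -d "$DATABASE_URL" -c \
  "SELECT archived_count, failed_count, last_archived_time FROM pg_stat_archiver;"
```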
## Restore — the drill (run this once after provisioning, then monthly)

This is the procedure that validates the backup is actually restorable. Run it on a fresh VM, never the production one.
```bash
# 1. Provision a clean Postgres 16 instance
sudo -u postgres createdb ecommus_drill

# 2. Pull the most recent dump from off-site
#    (globs don't expand on remote paths, so list the timestamped dirs and take the newest)
latest=$(rclone lsf ecommus-backup:ecommus-prod --dirs-only | grep '^20' | sort | tail -1)
rclone copy "ecommus-backup:ecommus-prod/${latest}db.dump.gpg" /tmp/

# 3. Decrypt
gpg --batch --yes --passphrase-file /etc/ecommus/backup-passphrase \
  --decrypt /tmp/db.dump.gpg > /tmp/db.dump

# 4. Restore
pg_restore --clean --if-exists --no-owner \
  --dbname=ecommus_drill /tmp/db.dump

# 5. Sanity-check
psql -d ecommus_drill -c "SELECT count(*) FROM tenants;"
psql -d ecommus_drill -c "SELECT count(*) FROM products;"
psql -d ecommus_drill -c "SELECT count(*) FROM orders WHERE created_at >= now() - interval '1 day';"

# 6. Boot a throwaway API against this DB to confirm the schema is alive
DATABASE_MODE=postgres \
DATABASE_URL=postgres://postgres@localhost/ecommus_drill \
JWT_ACCESS_SECRET=drill-only-not-real-not-real-not-real \
JWT_REFRESH_SECRET=drill-only-not-real-not-real-not-real \
ECOMMUS_LICENSE_JWT=<dev-license-from-license-server> \
  node --experimental-strip-types apps/api/src/server.ts &
sleep 3
curl http://localhost:4000/health
kill %1
```

If step 5’s row counts are non-zero and step 6’s `/health` returns `ok:true`, the backup is restorable.
Do this monthly. A backup you’ve never restored is Schrödinger’s backup — it both works and doesn’t until you check.
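To let cron do the remembering, wrap steps 1–6 in a script and schedule it monthly; `restore-drill.sh` is hypothetical here, not something the repo ships:

```
0 6 1 * * ecommus /opt/ecommus/scripts/restore-drill.sh >> /var/log/ecommus-drill.log 2>&1
```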
## Restore — production emergency

The shape is the same, but in production:
```bash
# 1. Stop all four services (prevent in-flight writes during restore)
pm2 stop all

# 2. Pull + decrypt the dump (latest hourly + WAL replay if needed)
rclone copy ecommus-backup:ecommus-prod/<ts>/db.dump.gpg /tmp/
gpg --decrypt --batch --passphrase-file /etc/ecommus/backup-passphrase \
  /tmp/db.dump.gpg > /tmp/db.dump
```
```bash
# 3. Restore. --clean drops + recreates objects; existing connections must be gone.
pg_restore --clean --if-exists --no-owner \
  --dbname=$DATABASE_URL /tmp/db.dump
```
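If the restore fails at the drop stage because sessions are still attached, terminate them first (a sketch; assumes the production database is named `ecommus_prod`):

```bash
# Kick remaining sessions so --clean can drop objects
psql -d postgres -c "SELECT pg_terminate_backend(pid) FROM pg_stat_activity WHERE datname = 'ecommus_prod' AND pid <> pg_backend_pid();"
```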
```bash
# 4. If WAL replay is needed (recover further past the dump):
#    configure recovery.signal + restore_command to fetch from the rclone bucket.
#    See https://www.postgresql.org/docs/16/continuous-archiving.html
```
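A minimal shape for that replay, the inverse of the `archive_command` above; the data-directory path is an assumption (Debian/Ubuntu layout):

```bash
# postgresql.conf on the recovering instance:
#   restore_command = 'rclone copyto ecommus-backup:ecommus-prod/wal/%f %p'
#   recovery_target_time = '…'    # optional: stop at a point in time

# Arm recovery and restart
sudo -u postgres touch /var/lib/postgresql/16/main/recovery.signal
sudo systemctl restart postgresql
```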
```bash
# 5. Restart
pm2 start all

# 6. Health
curl https://api.mystore.ro/health
```

Per the RTO target above, this should land in ≤ 30 minutes assuming the dump is already in hand and the VM is provisioned. If you need to rebuild the VM from scratch first, budget ≥ 2 hours.
## Encryption-at-rest considerations

The Postgres rows include columns encrypted via `ECOMMUS_DATA_KEY` envelope encryption (Phase 0 §1.6) — `payment_methods.config`, `settings.value` (ANAF tokens). These columns are encrypted on disk and in the dump. Restoring on a VM with a different `ECOMMUS_DATA_KEY` will leave those columns unreadable.
If you’re rotating `ECOMMUS_DATA_KEY` (per ADR-028 — Phase 0):

- Restore the dump with the old `ECOMMUS_DATA_KEY` first.
- Run the re-key migration: `node --experimental-strip-types apps/api/src/cli/rotate-data-key.ts --new-key=<new-hex>`
- Update `.env` with the new key; restart the API.
Don’t lose `ECOMMUS_DATA_KEY`. There is no recovery from a lost master key — encrypted columns are unrecoverable, full stop. Back the key up to a separate target (e.g. password manager + sealed envelope in a safe).
## Off-site target choice

| Target | Cost (~10 GB/mo) | EU region available | Notes |
|---|---|---|---|
| Backblaze B2 | ~€0.05 + egress | ✓ | Recommended for €0-cap installs. Compatible with rclone S3 driver. |
| Cloudflare R2 | ~€0.15 + free egress to Cloudflare | ✓ | Good if already on Cloudflare. |
| AWS S3 | ~€0.23 + egress | ✓ | Most expensive, most ubiquitous tooling. |
| Hetzner Storage Box | €3.50/mo flat | ✓ | Approved for ecommus per ADR (€5 spending cap, 2026-05-02). Generous capacity for the flat price. |
The framework only needs an rclone-compatible target. The bucket lifecycle / retention policy is configured at the provider, not in ecommus.
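For S3 proper, the rule matching the defaults above looks like this; B2 and R2 expose equivalent lifecycle settings in their own dashboards/CLIs, and the bucket name is an assumption:

```bash
cat > lifecycle.json <<'EOF'
{
  "Rules": [{
    "ID": "ecommus-backup-retention",
    "Status": "Enabled",
    "Filter": { "Prefix": "" },
    "Transitions": [{ "Days": 30, "StorageClass": "GLACIER" }],
    "Expiration": { "Days": 365 }
  }]
}
EOF
aws s3api put-bucket-lifecycle-configuration \
  --bucket ecommus-backups --lifecycle-configuration file://lifecycle.json
```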
## Anti-patterns

- DB dump on the same disk as the running DB. Disk failure takes both. The hourly cron above writes locally first, then `rclone copy` ships off-site — local files are deleted after 24 h.
- Dumps without encryption. A leaked S3 bucket is a leaked customer database. Always GPG-encrypt.
- No restore drill. “Backups exist” is not the same as “backups work”. Drill monthly.
- Same passphrase as `ECOMMUS_DATA_KEY` for backup encryption. Domain isolation. Use a dedicated backup passphrase.
- Backing up `node_modules/` or `dist/`. Wastes bandwidth + storage. They’re rebuildable.
## See also

- Customer Onboarding — initial backup setup is part of provisioning
- Manual install (advanced) — where the cron + scripts get installed
- Upgrade procedure — pre-upgrade snapshot is a special case of this flow
- Environment Variables — `ECOMMUS_DATA_KEY`, storage env, etc.
- ADR-028 (Phase 0) — column-level envelope encryption
- ADR-031 — multi-tenant continuous WAL ship-out (Enterprise tier)