# Backup & Restore
A backup that’s never restored is not a backup. This page covers what to back up, where to ship it, how often, and — critically — how to verify the restore actually works before you need it in production.
## Targets

| Metric | Target |
|---|---|
| RPO (data loss tolerance) | ≤ 1 hour — hourly Postgres dumps + WAL streaming |
| RTO (recovery time) | ≤ 30 minutes for DB restore + service restart from a clean VM, given the latest dump in hand |
| Backup retention | 30 daily + 12 monthly + 4 yearly snapshots (off-site) |
| Encryption at rest | All off-site dumps are AES-256 encrypted via the same ECOMMUS_DATA_KEY chain or a dedicated backup KEK |
| Off-site target | S3-compatible (Backblaze B2 recommended for €0 budget; AWS S3 or Cloudflare R2 also work) |
These are defaults for a single-tenant install. Multi-tenant SaaS deployments on the Enterprise tier need a stricter RPO (continuous WAL ship-out) — see ADR-031.
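A silently failing backup cron breaches the RPO long before anyone notices. Below is a minimal freshness check, as a sketch: it assumes the `ecommus-backup` remote and the timestamped directory naming used by the hourly cron later on this page, plus GNU `date`.

```bash
#!/bin/bash
# Alert if the newest off-site dump is older than the 1 h RPO target.
set -euo pipefail

# Keep only the timestamped dump dirs (skips wal/ and any other prefixes)
newest=$(rclone lsf ecommus-backup:ecommus-prod --dirs-only | grep '^20' | sort | tail -1)
ts=${newest%/}                        # e.g. 2026-05-02T14-00-00Z
d=${ts%%T*}                           # date part:  2026-05-02
t=${ts#*T}; t=${t%Z}; t=${t//-/:}     # time part:  14:00:00
age_min=$(( ($(date -u +%s) - $(date -u -d "$d $t" +%s)) / 60 ))

if [ "$age_min" -gt 60 ]; then
  echo "ALERT: newest dump is ${age_min} min old (RPO target: 60)" >&2
  exit 1
fi
echo "OK: newest dump is ${age_min} min old"
```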
## What gets backed up

| Source | Where | How |
|---|---|---|
| `ecommus_prod` Postgres database | Custom-format dump (`pg_dump -Fc`) | Hourly cron + WAL streaming |
| `ecommus_licenses` (self-hosted license-server) | Same | Hourly cron |
| Uploads (`STORAGE_LOCAL_DIR=/opt/ecommus/uploads`) | rsync diff + nightly tarball | Nightly cron |
| Keys (`/opt/ecommus/keys/*.pem`) | Encrypted tarball, separate target | Once on provision; never again unless rotated |
| `.env` | Encrypted, separate target | After every change |
| Caddyfile / Nginx vhosts | Same | After every change |
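The uploads row in script form, as a sketch: `rclone sync` stands in here for the rsync diff, and the remote name matches the DB-dump examples below.

```bash
# Nightly cron: incremental mirror of uploads, plus a dated tarball kept locally
rclone sync /opt/ecommus/uploads ecommus-backup:ecommus-prod/uploads/ --transfers 8
tar -C /opt/ecommus -czf "/var/backups/ecommus/uploads-$(date -u +%Y%m%d).tar.gz" uploads
```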
Don’t back up `dist/`, `node_modules/`, `apps/*/.next`, `apps/*/.astro`, `data/db/` (that’s the dev pglite path — production runs Postgres). They’re regenerable from git + `npm ci`.
Don’t back up the keys to the same target as the DB. Compromise of one bucket shouldn’t compromise the other.
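In script form, something like the sketch below, run on provision and after every key rotation or `.env` change. The `ecommus-secrets` remote and the dedicated passphrase file are assumptions; the point is that neither the target nor the passphrase is shared with the DB dumps.

```bash
# Encrypted secrets tarball, shipped to a separate remote/account
tar -C /opt/ecommus -czf - keys .env \
  | gpg --batch --yes --symmetric --cipher-algo AES256 \
      --passphrase-file /etc/ecommus/keys-passphrase \
      --output "/tmp/secrets-$(date -u +%Y%m%d).tar.gz.gpg"
rclone copy "/tmp/secrets-$(date -u +%Y%m%d).tar.gz.gpg" ecommus-secrets:ecommus-secrets/
rm /tmp/secrets-*.tar.gz.gpg
```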
## Hourly DB dump cron

```
0 * * * * ecommus /opt/ecommus/scripts/backup.sh >> /var/log/ecommus-backup.log 2>&1
```

```bash
#!/bin/bash
set -euo pipefail

ts=$(date -u +%Y-%m-%dT%H-%M-%SZ)
out=/var/backups/ecommus/$ts
mkdir -p "$out"

# Postgres dump (custom format = compressed + parallel-restore-friendly)
pg_dump -Fc -d "$DATABASE_URL" -f "$out/db.dump"

# License DB if self-hosting license-server
if [ -n "${LICENSE_DATABASE_URL:-}" ]; then
  pg_dump -Fc -d "$LICENSE_DATABASE_URL" -f "$out/licenses.dump"
fi

# Encrypt with the backup KEK (separate from ECOMMUS_DATA_KEY in production)
gpg --batch --yes --symmetric --cipher-algo AES256 \
  --passphrase-file /etc/ecommus/backup-passphrase \
  "$out/db.dump"
rm "$out/db.dump"
if [ -f "$out/licenses.dump" ]; then
  gpg --batch --yes --symmetric --cipher-algo AES256 \
    --passphrase-file /etc/ecommus/backup-passphrase \
    "$out/licenses.dump"
  rm "$out/licenses.dump"
fi

# Off-site ship via rclone (configure once: `rclone config`)
rclone copy "$out" "ecommus-backup:ecommus-prod/$ts/" \
  --transfers 4 --checkers 8

# Local retention: keep the last 24 hourly dumps locally
find /var/backups/ecommus -mindepth 1 -maxdepth 1 -type d -mtime +1 -exec rm -rf {} +
```

Off-site retention is enforced by the bucket lifecycle rule (rotate to cold storage after 30 days, delete after 1 year — adjust per your compliance requirements).
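Between monthly drills, a cheap integrity smoke test: decrypt the freshest local dump and list its table of contents with `pg_restore --list`, which parses the archive without touching any database. A minimal sketch, assuming the paths from `backup.sh`:

```bash
# Verify the newest local dump decrypts and parses (no DB required)
latest=$(find /var/backups/ecommus -mindepth 1 -maxdepth 1 -type d | sort | tail -1)
gpg --batch --quiet --passphrase-file /etc/ecommus/backup-passphrase \
  --decrypt "$latest/db.dump.gpg" > /tmp/check.dump
pg_restore --list /tmp/check.dump > /dev/null && echo "dump OK: $latest"
rm /tmp/check.dump
```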
## WAL streaming (RPO < 1 h)

Configure Postgres `archive_mode` + `archive_command` to ship WAL segments to the same off-site bucket:

```
wal_level = replica
archive_mode = on
archive_command = 'rclone copyto %p ecommus-backup:ecommus-prod/wal/%f'
archive_timeout = 300   # 5 min
```

Together with the hourly pg_dump, this gives you a worst-case 5-minute data loss on a cold recovery from the bucket.
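A broken `archive_command` doesn’t surface in the application, so check the archiver stats occasionally (or wire this into monitoring):

```bash
# failed_count should be 0 and last_archived_time within archive_timeout
psql -d "$DATABASE_URL" -c \
  "SELECT archived_count, failed_count, last_archived_time FROM pg_stat_archiver;"
```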
## Restore — the drill (run this once after provisioning, then monthly)

This is the procedure that validates the backup is actually restorable. Run it on a fresh VM, never the production one.
```bash
# 1. Provision a clean Postgres 16 instance
sudo -u postgres createdb ecommus_drill

# 2. Pull the most recent dump from off-site
#    (globs don't expand on remote paths, so list the timestamped dirs and take the newest)
latest=$(rclone lsf ecommus-backup:ecommus-prod --dirs-only | grep '^20' | sort | tail -1)
rclone copy "ecommus-backup:ecommus-prod/${latest}db.dump.gpg" /tmp/

# 3. Decrypt
gpg --batch --yes --passphrase-file /etc/ecommus/backup-passphrase \
  --decrypt /tmp/db.dump.gpg > /tmp/db.dump

# 4. Restore
pg_restore --clean --if-exists --no-owner \
  --dbname=ecommus_drill /tmp/db.dump

# 5. Sanity-check
psql -d ecommus_drill -c "SELECT count(*) FROM tenants;"
psql -d ecommus_drill -c "SELECT count(*) FROM products;"
psql -d ecommus_drill -c "SELECT count(*) FROM orders WHERE created_at >= now() - interval '1 day';"

# 6. Boot a throwaway API against this DB to confirm the schema is alive
DATABASE_MODE=postgres \
DATABASE_URL=postgres://postgres@localhost/ecommus_drill \
JWT_ACCESS_SECRET=drill-only-not-real-not-real-not-real \
JWT_REFRESH_SECRET=drill-only-not-real-not-real-not-real \
ECOMMUS_LICENSE_JWT=<dev-license-from-license-server> \
  node --experimental-strip-types apps/api/src/server.ts &
sleep 3
curl http://localhost:4000/health
kill %1
```

If step 5’s row counts are non-zero and step 6’s `/health` returns `ok:true`, the backup is restorable.
Do this monthly. A backup you’ve never restored is Schrödinger’s backup — it both works and doesn’t until you check.
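To let cron do the remembering, wrap steps 1–6 in a script and schedule it monthly; `restore-drill.sh` is hypothetical here, not something the repo ships:

```
0 6 1 * * ecommus /opt/ecommus/scripts/restore-drill.sh >> /var/log/ecommus-drill.log 2>&1
```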
## Restore — production emergency

The shape is the same, but in production:
```bash
# 1. Stop all four services (prevent in-flight writes during restore)
pm2 stop all

# 2. Pull + decrypt the dump (latest hourly + WAL replay if needed)
rclone copy ecommus-backup:ecommus-prod/<ts>/db.dump.gpg /tmp/
gpg --decrypt --batch --passphrase-file /etc/ecommus/backup-passphrase \
  /tmp/db.dump.gpg > /tmp/db.dump
```
```bash
# 3. Restore. --clean drops + recreates objects; existing connections must be gone.
pg_restore --clean --if-exists --no-owner \
  --dbname=$DATABASE_URL /tmp/db.dump
```
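If the restore fails at the drop stage because sessions are still attached, terminate them first (a sketch; assumes the production database is named `ecommus_prod`):

```bash
# Kick remaining sessions so --clean can drop objects
psql -d postgres -c "SELECT pg_terminate_backend(pid) FROM pg_stat_activity WHERE datname = 'ecommus_prod' AND pid <> pg_backend_pid();"
```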
```bash
# 4. If WAL replay is needed (recover further past the dump):
#    configure recovery.signal + restore_command to fetch from the rclone bucket.
#    See https://www.postgresql.org/docs/16/continuous-archiving.html
```
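A minimal shape for that replay, the inverse of the `archive_command` above; the data-directory path is an assumption (Debian/Ubuntu layout):

```bash
# postgresql.conf on the recovering instance:
#   restore_command = 'rclone copyto ecommus-backup:ecommus-prod/wal/%f %p'
#   recovery_target_time = '…'    # optional: stop at a point in time

# Arm recovery and restart
sudo -u postgres touch /var/lib/postgresql/16/main/recovery.signal
sudo systemctl restart postgresql
```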
```bash
# 5. Restart
pm2 start all

# 6. Health
curl https://api.mystore.ro/health
```

Per the RTO target above, this should land in ≤ 30 minutes assuming the dump is already in hand and the VM is provisioned. If you need to rebuild the VM from scratch first, budget ≥ 2 hours.
## Encryption-at-rest considerations

The Postgres rows include columns encrypted via `ECOMMUS_DATA_KEY` envelope encryption (Phase 0 §1.6) — `payment_methods.config`, `settings.value` (ANAF tokens). These columns are encrypted on disk and in the dump. Restoring on a VM with a different `ECOMMUS_DATA_KEY` will leave those columns unreadable.
If you’re rotating `ECOMMUS_DATA_KEY` (per ADR-028 — Phase 0):

- Restore the dump with the old `ECOMMUS_DATA_KEY` first.
- Run the re-key migration: `node --experimental-strip-types apps/api/src/cli/rotate-data-key.ts --new-key=<new-hex>`
- Update `.env` with the new key; restart the API.
Don’t lose `ECOMMUS_DATA_KEY`. There is no recovery from a lost master key — encrypted columns are unrecoverable, full stop. Back the key up to a separate target (e.g. password manager + sealed envelope in a safe).
## Off-site target choice

| Target | Cost (~10 GB/mo) | EU region available | Notes |
|---|---|---|---|
| Backblaze B2 | ~€0.05 + egress | ✓ | Recommended for €0-cap installs. Compatible with rclone S3 driver. |
| Cloudflare R2 | ~€0.15 + free egress to Cloudflare | ✓ | Good if already on Cloudflare. |
| AWS S3 | ~€0.23 + egress | ✓ | Most expensive, most ubiquitous tooling. |
| Hetzner Storage Box | €3.50/mo flat | ✓ | Approved for ecommus per ADR (€5 spending cap, 2026-05-02). Generous capacity for the flat price. |
The framework only needs an rclone-compatible target. The bucket lifecycle / retention policy is configured at the provider, not in ecommus.
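For S3 proper, the rule matching the defaults above looks like this; B2 and R2 expose equivalent lifecycle settings in their own dashboards/CLIs, and the bucket name is an assumption:

```bash
cat > lifecycle.json <<'EOF'
{
  "Rules": [{
    "ID": "ecommus-backup-retention",
    "Status": "Enabled",
    "Filter": { "Prefix": "" },
    "Transitions": [{ "Days": 30, "StorageClass": "GLACIER" }],
    "Expiration": { "Days": 365 }
  }]
}
EOF
aws s3api put-bucket-lifecycle-configuration \
  --bucket ecommus-backups --lifecycle-configuration file://lifecycle.json
```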
## Anti-patterns

- DB dump on the same disk as the running DB. Disk failure takes both. The hourly cron above writes locally first, then `rclone copy` ships off-site — local files are deleted after 24 h.
- Dumps without encryption. A leaked S3 bucket is a leaked customer database. Always GPG-encrypt.
- No restore drill. “Backups exist” is not the same as “backups work”. Drill monthly.
- Same passphrase as `ECOMMUS_DATA_KEY` for backup encryption. Domain isolation. Use a dedicated backup passphrase.
- Backing up `node_modules/` or `dist/`. Wastes bandwidth + storage. They’re rebuildable.
## See also

- Customer Onboarding — initial backup setup is part of provisioning
- Manual install (advanced) — where the cron + scripts get installed
- Upgrade procedure — pre-upgrade snapshot is a special case of this flow
- Environment Variables — `ECOMMUS_DATA_KEY`, storage env, etc.
- ADR-028 (Phase 0) — column-level envelope encryption
- ADR-031 — multi-tenant continuous WAL ship-out (Enterprise tier)