BYOH operator runbook

Day-to-day operations of the bring-your-own-hardware (BYOH) flow. For the architecture / install-modes reference, see docs/otonomo_install_modes_detailed.md. This file is about what you do once it's live.

Daily — when you check Discord / inbox

If DISCORD_WEBHOOK_OPS is wired, ops events appear in your channel:

Embed	Meaning	Action
🗑️ Data-deletion request	Customer wants their data gone	Run the delete procedure below within 30 d (target 7 d)
(more events ship in v0.2)

If Discord is silent, you can still see everything in /admin/audit.

Weekly — the waitlist sweep

Open /admin/waitlist. You see two sections:

Open — entries that haven't been processed yet, newest first.
Dismissed — collapsed; expand to see the audit trail.

For each open entry:

Triage. Does their hardware match something we test against? The "Hardware" column shows their declared kit. Boxes we have stable drivers for (SolarEdge, Vaillant, Daikin, Shelly, Easee, WLBox2, Aurora) are easy wins. Beta-driver hardware (Aquarea Smart Cloud, Huawei, Marstek, etc.) means more support load — prioritise these only if you have bandwidth.
Decide. Invite, queue, or dismiss.
If inviting:
In a new tab, open /admin/boxes → "+ New box". Pre-create a box-id (e.g. hems-900100); leave serial blank for BYOH. Using the customer's email as the site name is fine (they can rename it in /account later).
Open the box detail page, click "Issue enrollment token", bind it to the customer's email. Copy the token URL.
Back on /admin/waitlist, click the customer's email — opens a mailto: with a prefilled body. Paste the token URL in place of PASTE_TOKEN_HERE. Send.
Click "Mark invited" to move the row to the dismissed section. Audit row records "waitlist.dismiss reason=invited".
If dismissing without inviting:
Click "Dismiss" — same as above but reason="dismissed".
No email is sent; the customer just doesn't hear back.

To automate the email send, write a small script that reads /data/waitlist.jsonl, calls /admin/boxes/<id>/issue_enroll_token, and sends mail via your SMTP. Out of scope for v0.1.
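A minimal sketch of the parsing and mail-building half of such a script. Everything beyond the PASTE_TOKEN_HERE placeholder is an assumption: the "email" field name in the JSONL rows, the subject line, and the body template are hypothetical, and the token URL is passed in because the request and response shape of issue_enroll_token is not documented here.

```python
import json
from email.message import EmailMessage

PLACEHOLDER = "PASTE_TOKEN_HERE"

# Hypothetical template; the real prefilled body lives in the admin UI.
BODY_TEMPLATE = (
    "Hi,\n\n"
    "Your enrollment link is ready:\n"
    f"{PLACEHOLDER}\n"
)


def parse_waitlist(lines):
    """Parse an iterable of JSONL lines, skipping blank ones."""
    entries = []
    for line in lines:
        line = line.strip()
        if line:
            entries.append(json.loads(line))
    return entries


def build_invite(entry, token_url, sender):
    """Return an EmailMessage with the token URL substituted for the placeholder."""
    message = EmailMessage()
    message["From"] = sender
    message["To"] = entry["email"]  # assumed field name
    message["Subject"] = "Your BYOH enrollment link"  # hypothetical subject
    message.set_content(BODY_TEMPLATE.replace(PLACEHOLDER, token_url))
    return message
```

To use it, open /data/waitlist.jsonl, pass the file object to parse_waitlist, and hand each resulting message to smtplib.SMTP.send_message once the token URL has been issued for that customer's box.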

Approving a data-deletion request

Triggered when a customer hits Request deletion at /account/data. You'll see:

Discord ping (if configured): "🗑️ Data-deletion request"
/admin/audit row: action=user.delete_request, target_type=user, detail includes their box-ids + optional reason

GDPR allows one month from receipt of the request (we treat that as 30 days); aim for 7. Procedure:

# On hems-vps. user_id from the audit detail.
ssh hems-vps
sudo docker exec -it fleet-db psql -U fleet_admin -d fleet

-- Confirm scope: count what'd be deleted.
SELECT box_id, count(*)
FROM metrics
WHERE box_id IN ('hems-900001', 'hems-900002')    -- from the audit row
GROUP BY box_id;

-- Then delete in a transaction.
BEGIN;
DELETE FROM metrics    WHERE box_id IN (...);
DELETE FROM daily_money WHERE box_id IN (...);
DELETE FROM controller_decisions WHERE box_id IN (...);
-- Spot-check other tables that key on box_id (alerts, command_replies):
DELETE FROM command_replies WHERE box_id IN (...);
DELETE FROM alerts_24h     WHERE box_id IN (...);
COMMIT;
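If the deletes are ever scripted rather than typed, the statements can be generated from the box-id list so that every table receives exactly the same scope. The following is a sketch only: the table names come from the procedure above, the box-id pattern is inferred from examples such as hems-900001, and the %s placeholder assumes a psycopg-style driver that adapts a Python list to a Postgres array.

```python
import re

# Tables named in the manual procedure; extend only after checking the schema.
TABLES = (
    "metrics",
    "daily_money",
    "controller_decisions",
    "command_replies",
    "alerts_24h",
)

# Assumed box-id shape, inferred from the examples in this runbook.
BOX_ID_RE = re.compile(r"hems-\d+")


def build_delete_plan(box_ids):
    """Return one (sql, params) pair per table for the given box-ids."""
    ids = sorted(set(box_ids))
    if not ids:
        raise ValueError("refusing to build a plan for an empty box-id list")
    bad = [box_id for box_id in ids if not BOX_ID_RE.fullmatch(box_id)]
    if bad:
        raise ValueError(f"unexpected box-id format: {bad}")
    # Table names are fixed constants; only the ids travel as parameters.
    return [
        (f"DELETE FROM {table} WHERE box_id = ANY(%s)", (ids,))
        for table in TABLES
    ]
```

Run the resulting pairs inside a single transaction, after the same count check as in the manual procedure.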

Then record the action:

sudo docker exec -it ops-db psql -U ops_admin -d ops -c "
INSERT INTO admin_audit (actor, action, target_type, target_id, detail, result)
VALUES (
  'your.email@otonomo.be',
  'user.delete_executed',
  'user',
  '42',                                       -- user_id
  '{\"boxes\": [\"hems-900001\"]}'::jsonb,
  'ok'
);"
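The hand-escaped JSON in the command above is easy to get wrong. As a hedged alternative, the detail value can be produced with json.dumps and passed as a parameter. The column list and action name are taken from the INSERT above; the %s placeholder style again assumes a psycopg-style driver, and audit_params is a hypothetical helper name.

```python
import json

AUDIT_SQL = (
    "INSERT INTO admin_audit "
    "(actor, action, target_type, target_id, detail, result) "
    "VALUES (%s, %s, %s, %s, %s::jsonb, %s)"
)


def audit_params(actor, user_id, box_ids):
    """Build the parameter tuple for AUDIT_SQL for an executed deletion."""
    detail = json.dumps({"boxes": sorted(box_ids)})
    return (actor, "user.delete_executed", "user", str(user_id), detail, "ok")
```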

Finally, email the customer "your data has been deleted." Their account + box rows stay in auth-db for 30 days (re-onboarding window) before auto-purge.

Note: schema migrations stay manual per CLAUDE.md guardrails. If a deletion request touches a table we haven't accounted for here, surface it before running anything.

Flipping a customer from observe to active (paid)

When a customer pays €99 + €3/mo (Stripe / Mollie integration is M2; v0.1 is manual):

Confirm the payment landed (whatever billing tool).
Open /admin/users/<id> (or /admin/boxes/<box_id>).
Click "Flip to active" (existing operator action).
The customer's /account/control page now shows the per-capability toggle UI. They opt in to what they want optimized.
Cloud orchestrators start writing commands within ~60 s.

Audit row: box.mode_flip, before/after in detail.

To flip back to observe (cancellation, dispute, etc.) — same path, "Flip to observe". Cloud stops writing immediately; box keeps running. Customer's local UI is untouched.

Install support — common customer tickets

"Installer says cannot reach cloud at https://app.otonomo.be"

Their box can't reach our HTTPS. Ask them to run:

curl -v https://app.otonomo.be/healthz

Possibilities:

Firewall blocking outbound 443 (most common in office networks)
DNS failure (no /etc/resolv.conf or it points at a captive portal)
No internet at all

They also need outbound port 8883 for MQTT/TLS telemetry. If they can curl healthz but can't enroll, walk them through testing 8883:

openssl s_client -connect app.otonomo.be:8883 -servername app.otonomo.be
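If the customer has Python on the box but not openssl, the same two checks can be approximated with the standard library. This is a sketch, not part of the installer: probe_tls is a hypothetical helper, and it assumes the server presents a certificate the system trust store accepts. If the broker uses a private CA, the result will read as a TLS failure even though the port itself is reachable.

```python
import socket
import ssl


def probe_tls(host, port, timeout=5.0):
    """Attempt a TCP connect plus TLS handshake; return (ok, detail)."""
    context = ssl.create_default_context()
    try:
        with socket.create_connection((host, port), timeout=timeout) as raw:
            with context.wrap_socket(raw, server_hostname=host) as tls:
                return True, tls.version() or "TLS"
    except socket.gaierror as exc:
        # Name resolution failed before any connection was attempted.
        return False, f"DNS failure: {exc}"
    except ssl.SSLError as exc:
        # TCP connected, but the handshake or certificate check failed.
        return False, f"TLS failure: {exc}"
    except OSError as exc:
        # Refused, timed out, or no route: typically a firewall or no internet.
        return False, f"TCP failure: {exc}"
```

Calling probe_tls("app.otonomo.be", 443) and probe_tls("app.otonomo.be", 8883) in turn separates "no HTTPS at all" from "HTTPS works but MQTT/TLS is blocked".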

"Installer hung at 'building Python venv'"

This is the slow step — first time, 60-90 s on a Pi Zero 2 W. Tell them to wait, then check journalctl -u otonomo-publisher after.

"Local UI shows 'waiting for first telemetry' forever"

99% of the time: no drivers configured yet. Walk them to http://<their-box-ip>:8080/drivers, pick the driver matching their inverter / boiler / charger, fill in IP + creds.

If they say "but the driver is configured" — check the publisher:

ssh customer-pi  # (their reverse tunnel — see /admin/boxes/<id>)
sudo systemctl status otonomo-publisher
sudo journalctl -u otonomo-publisher --no-pager -n 60

Look for "DeviceUnreachable" — usually a wrong IP or firewall on the inverter's LAN side.

"It worked, then after a power cut the device disappeared"

Most likely the router gave the hardware a new DHCP IP.

Fast path:

Open the customer's /pro/devices page.
Expand IP address for the affected device.
If the box is online, use /onboarding/configure or the add-device flow's Scan LAN from box to find likely candidates.
If the scan shows the same MAC at a new IP, save the new IP in /pro/devices.
Add/save the MAC too if it is visible, so future DHCP changes can recover automatically.

Fresh factory installs after commit a9cef5e have SDK-side DHCP recovery: after three consecutive DeviceUnreachable polls, then at most every five minutes, the publisher scans for the configured host_mac, rewrites the local manifest if the MAC moved, syncs the new host to cloud, and restarts itself.
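The trigger policy described above (three consecutive unreachable polls, then at most one scan every five minutes) can be pictured with the sketch below. It is an illustration under stated assumptions, not the SDK's code: RecoveryGate and record_poll are hypothetical names, and the scan, manifest rewrite, cloud sync, and restart are left out.

```python
class RecoveryGate:
    """Decide when a MAC scan should run, based on poll outcomes."""

    def __init__(self, threshold=3, min_interval_s=300.0):
        self.threshold = threshold            # consecutive failures required
        self.min_interval_s = min_interval_s  # minimum spacing between scans
        self.failures = 0
        self.last_scan = None

    def record_poll(self, reachable, now):
        """Record one poll result; return True if a scan should start now."""
        if reachable:
            self.failures = 0
            return False
        self.failures += 1
        if self.failures < self.threshold:
            return False
        if self.last_scan is not None and now - self.last_scan < self.min_interval_s:
            return False
        self.last_scan = now
        return True
```

Passing the clock in as now (for example time.monotonic()) keeps the gate easy to reason about: a device that stays unreachable triggers a scan on the third failed poll and then no more often than every 300 s.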

Caveats:

Boxes installed before that payload need a fresh image, an installer re-run, or a future OTA/SDK update before automatic recovery is available.
Manifests without host_mac cannot recover automatically. Use LAN scan and save the MAC once.
DHCP reservation on the router is still the cleanest long-term fix when the customer can do it.

"I want HA to see my data"

Direct them to installer/ha_integration/README.md. Three-step config-flow setup with their box IP + box-id. Works in all three install modes.

"How do I stop the cloud uplink without uninstalling?"

sudo systemctl stop otonomo-publisher
sudo systemctl disable otonomo-publisher

Their drivers + local UI + HA integration keep working. v0.2 ships a friendlier UI toggle.

Health probes — what to watch

Every few hours, glance at:

/admin/health — fleet-wide rolled-up status
/admin/audit — recent operator + user actions
scripts/validate_deploy.sh — run after any non-trivial change to cloud/app/

Red flags:

More than 5 ERROR lines in the last 5 min in hems-app logs: SSH in and tail with sudo docker logs hems-app -f
Controller hasn't ticked in >150 s: check fleet-db connectivity, the EMQX broker, and the controller's own log line
Telemetry stale across all boxes: likely an EMQX issue; check sudo docker compose -f /opt/hems/broker/docker-compose.yml ps
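For anyone wiring these thresholds into a script, the three red-flag conditions can be encoded as a small pure function. This is a sketch: red_flags is a hypothetical helper, and how the inputs are gathered (log counting, tick age, per-box staleness) is deliberately left open because this runbook does not specify it.

```python
def red_flags(error_count_5min, seconds_since_tick, stale_boxes, total_boxes):
    """Return a list of human-readable red flags; empty means all clear."""
    flags = []
    if error_count_5min > 5:
        flags.append("hems-app: more than 5 ERROR lines in the last 5 min")
    if seconds_since_tick > 150:
        flags.append("controller: no tick for more than 150 s")
    if total_boxes > 0 and stale_boxes == total_boxes:
        flags.append("telemetry: stale across all boxes (suspect EMQX)")
    return flags
```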

If something's actually on fire and you need to read the recovery runbook without AI: docs/runbook_disaster_recovery.md (also lives in the fireproof safe).

Multi-session protocol reminder

If multiple Claude sessions are working in parallel, charter rule 11 applies: at session start, the user names the other lanes; on ANY deploy / git / lock anomaly the session must STOP and surface, never silently merge another session's work. The deploy-cloud / -pi scripts have gates that enforce this at the push layer; the human discipline ensures the conflicts get resolved correctly.

If you (the operator) see a deploy fail with a non-FF push, that's the gate doing its job. Don't bypass with SKIP_LINT=1 unless you truly know what you're doing; re-running after a rebase is almost always the right answer.

Open future-work flags (track here, not in production)

Schema migration for waitlist — currently /data/waitlist.jsonl. When real DB needed: design + migrate, replace _load_waitlist and _append_waitlist in onboarding.py + admin.py.
Email + Stripe integration — manual flip-to-active will stop scaling around the 30th customer. v0.2 territory.
v0.2 per-destination uplink toggle — plan in docs/otonomo_privacy_toggle_v02_plan.md.
HA integration v0.2 — per-capability override switches. Needs a stable HTTP cmd contract on the local UI first.
Driver wheels to PyPI — blocked on BOIP close (~July 2026).