BYOH operator runbook
Day-to-day operations of the bring-your-own-hardware (BYOH) flow.
For the architecture / install-modes reference, see
docs/otonomo_install_modes_detailed.md. This file is about what
you do once it's live.
Daily: when you check Discord / inbox
If DISCORD_WEBHOOK_OPS is wired, ops events appear in your channel:
| Embed | Meaning | Action |
|---|---|---|
| 🗑️ Data-deletion request | Customer wants their data gone | Run the delete procedure below within 30 d (target 7 d) |
| (more events ship in v0.2) | | |
If Discord is silent, you can still see everything in /admin/audit.
Weekly: the waitlist sweep
Open /admin/waitlist. You see two sections:
- Open: entries that haven't been processed yet, newest first.
- Dismissed: collapsed; expand to see the audit trail.
For each open entry:
- Triage. Does their hardware match something we test against? The "Hardware" column shows their declared kit. Boxes we have stable drivers for (SolarEdge, Vaillant, Daikin, Shelly, Easee, WLBox2, Aurora) are easy wins. Beta-driver hardware (Aquarea Smart Cloud, Huawei, Marstek, etc.) means more support load; prioritise these only if you have bandwidth.
- Decide. Invite, queue, or dismiss.
- If inviting:
  - In a new tab, open /admin/boxes → "+ New box". Pre-create a box-id (e.g. hems-900100); leave serial blank for BYOH. Site name = the customer's email is fine (they'll rename in /account later).
  - Open the box detail page, click "Issue enrollment token", bind it to the customer's email. Copy the token URL.
  - Back on /admin/waitlist, click the customer's email, which opens a mailto: with a prefilled body. Paste the token URL in place of PASTE_TOKEN_HERE. Send.
  - Click "Mark invited" to move the row to the dismissed section. Audit row records "waitlist.dismiss reason=invited".
- If dismissing without inviting:
  - Click "Dismiss": same as above but reason="dismissed".
  - No email is sent; the customer just doesn't hear back.
If you want to automate the email send, write a small script that
reads /data/waitlist.jsonl, calls
/admin/boxes/<id>/issue_enroll_token, and sends mail via your
SMTP. Out of scope for v0.1.
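A minimal sketch of the first half of such a script, under stated assumptions: each line of /data/waitlist.jsonl is taken to be a JSON object with an "email" key and an optional boolean "dismissed" key (hypothetical; the real schema is not documented here). The function names, subject line, and body text are illustrative only. The sketch only filters entries and builds the message. Fetching the token URL from issue_enroll_token and handing the message to SMTP are left to the caller, because neither the endpoint's request format nor the mail setup is specified in this runbook.

```python
import json
from email.message import EmailMessage


def load_open_entries(jsonl_text):
    """Parse waitlist JSONL text and return the entries not yet dismissed.

    Assumption: one JSON object per line, with an "email" key and an
    optional boolean "dismissed" key. The real schema may differ.
    """
    entries = []
    for line in jsonl_text.splitlines():
        line = line.strip()
        if not line:
            continue  # tolerate blank lines between records
        entry = json.loads(line)
        if not entry.get("dismissed", False):
            entries.append(entry)
    return entries


def build_invite(entry, token_url, sender):
    """Build (but do not send) an invitation email for one waitlist entry."""
    msg = EmailMessage()
    msg["From"] = sender
    msg["To"] = entry["email"]
    msg["Subject"] = "Your BYOH enrollment link"  # illustrative wording
    msg.set_content(
        "Hi,\n\nYour enrollment link is ready:\n"
        f"{token_url}\n"
    )
    return msg
```

The returned EmailMessage can be passed to smtplib.SMTP.send_message once the token URL has been issued for the pre-created box.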
Approving a data-deletion request
Triggered when a customer hits Request deletion at
/account/data. You'll see:
- Discord ping (if configured): "🗑️ Data-deletion request"
- /admin/audit row: action=user.delete_request, target_type=user, detail includes their box-ids + optional reason
GDPR allows 30 days; aim for 7. Procedure:
# On hems-vps. user_id from the audit detail.
ssh hems-vps
sudo docker exec -it fleet-db psql -U fleet_admin -d fleet
-- Confirm scope: count what'd be deleted.
SELECT box_id, count(*)
FROM metrics
WHERE box_id IN ('hems-900001', 'hems-900002') -- from the audit row
GROUP BY box_id;
-- Then delete in a transaction.
BEGIN;
DELETE FROM metrics WHERE box_id IN (...);
DELETE FROM daily_money WHERE box_id IN (...);
DELETE FROM controller_decisions WHERE box_id IN (...);
-- Spot-check other tables that key on box_id (alerts, command_replies):
DELETE FROM command_replies WHERE box_id IN (...);
DELETE FROM alerts_24h WHERE box_id IN (...);
COMMIT;
Then record the action:
sudo docker exec -it ops-db psql -U ops_admin -d ops -c "
INSERT INTO admin_audit (actor, action, target_type, target_id, detail, result)
VALUES (
'your.email@otonomo.be',
'user.delete_executed',
'user',
'42', -- user_id
'{\"boxes\": [\"hems-900001\"]}'::jsonb,
'ok'
);"
Finally, email the customer "your data has been deleted." Their account + box rows stay in auth-db for 30 days (re-onboarding window) before auto-purge.
Note: schema migrations stay manual per CLAUDE.md guardrails. If a deletion request touches a table we haven't accounted for here, surface it before running anything.
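The delete step repeats the same box-id list across five tables, which is easy to get wrong when typing it by hand. Below is a small sketch that generates the statements and the two deadlines from a single list. The table names come from the procedure above. The function names are hypothetical, and the %s placeholder style assumes the statements would be run through a DB-API driver such as psycopg rather than pasted into psql. It builds statements only and does not connect to or modify any database.

```python
from datetime import date, timedelta

# Tables keyed on box_id, taken from the delete procedure above.
TABLES = [
    "metrics",
    "daily_money",
    "controller_decisions",
    "command_replies",
    "alerts_24h",
]


def build_delete_statements(box_ids):
    """Return (sql, params) pairs, one per table, for the given box-ids."""
    if not box_ids:
        # An empty IN () list is invalid SQL and almost certainly a mistake.
        raise ValueError("box_ids must not be empty")
    placeholders = ", ".join(["%s"] * len(box_ids))
    params = tuple(box_ids)
    return [
        (f"DELETE FROM {table} WHERE box_id IN ({placeholders})", params)
        for table in TABLES
    ]


def deletion_deadlines(requested_on):
    """Return (target, legal_limit): 7 and 30 days after the request date."""
    return (
        requested_on + timedelta(days=7),
        requested_on + timedelta(days=30),
    )
```

Running all pairs inside one transaction keeps the same all-or-nothing behaviour as the BEGIN / COMMIT block in the manual procedure.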
Flipping a customer from observe to active (paid)
When a customer pays €99 + €3/mo (Stripe / Mollie integration is M2; v0.1 is manual):
- Confirm the payment landed (whatever billing tool).
- Open /admin/users/<id> (or /admin/boxes/<box_id>).
- Click "Flip to active" (existing operator action).
- The customer's /account/control page now shows the per-capability toggle UI. They opt in to what they want optimized.
- Cloud orchestrators start writing commands within ~60 s.
Audit row: box.mode_flip, before/after in detail.
To flip back to observe (cancellation, dispute, etc.): same path, "Flip to observe". Cloud stops writing immediately; box keeps running. Customer's local UI is untouched.
Install support: common customer tickets
"Installer says cannot reach cloud at https://app.otonomo.be"
Their box can't reach our HTTPS. Ask them to run:
curl -v https://app.otonomo.be/healthz
Possibilities:
- Firewall blocking outbound 443 (most common in office networks)
- DNS failure (no /etc/resolv.conf or it points at a captive portal)
- No internet at all
They also need outbound port 8883 for MQTT/TLS telemetry. If they can curl healthz but can't enroll, walk them through testing 8883:
openssl s_client -connect app.otonomo.be:8883 -servername app.otonomo.be
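When a customer cannot run curl or openssl, the same distinction (DNS failure, blocked port, no route) can be drawn with a short standard-library probe. This is a sketch with a hypothetical function name and result labels. It checks TCP reachability only and does not validate TLS, so the openssl command above remains the real test for 8883.

```python
import socket


def probe_tcp(host, port, timeout=5.0):
    """Try a plain TCP connect and classify the outcome.

    Returns one of: "ok", "dns_failure", "timeout", "unreachable".
    This does not perform a TLS handshake.
    """
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return "ok"
    except socket.gaierror:
        # Name resolution failed: broken DNS or a captive portal.
        return "dns_failure"
    except socket.timeout:
        # Packets silently dropped: typical of a firewall on outbound ports.
        return "timeout"
    except OSError:
        # Connection refused, no route to host, network down, etc.
        return "unreachable"
```

Calling probe_tcp("app.otonomo.be", 443) and probe_tcp("app.otonomo.be", 8883) on the customer's box gives one label per port: "ok" on 443 with "timeout" on 8883 points at a firewall rule, while "dns_failure" on both points at the resolver.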
"Installer hung at 'building Python venv'"
This is the slow step: on first run it takes 60-90 s on a Pi Zero 2 W. Tell
them to wait, then check journalctl -u otonomo-publisher after.
"Local UI shows 'waiting for first telemetry' forever"
99% of the time: no drivers configured yet. Walk them to
http://<their-box-ip>:8080/drivers, pick the driver matching their
inverter / boiler / charger, fill in IP + creds.
If they say "but the driver is configured", check the publisher:
ssh customer-pi # (their reverse tunnel; see /admin/boxes/<id>)
sudo systemctl status otonomo-publisher
sudo journalctl -u otonomo-publisher --no-pager -n 60
Look for "DeviceUnreachable": usually a wrong IP or firewall on the inverter's LAN side.
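If the customer pastes the journal output into a ticket, a short helper can do the scan for you. This sketch assumes only that the literal marker "DeviceUnreachable" appears somewhere on the relevant lines. The function name and the sample log lines in the usage note are hypothetical, and the publisher's actual log format is not documented here.

```python
def summarize_unreachable(journal_text, marker="DeviceUnreachable"):
    """Count journal lines containing the marker and keep the latest one."""
    lines = journal_text.splitlines()
    hits = [line for line in lines if marker in line]
    return {
        "total_lines": len(lines),
        "unreachable": len(hits),
        # Journal output is chronological, so the last hit is the most recent.
        "last_hit": hits[-1] if hits else None,
    }
```

For example, summarize_unreachable("poll ok\nDeviceUnreachable host=192.168.1.10\n") reports one hit out of two lines. A high share of hits across the last 60 lines suggests the device address is wrong rather than intermittently flaky.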
"It worked, then after a power cut the device disappeared"
Most likely the router gave the hardware a new DHCP IP.
Fast path:
- Open the customer's /pro/devices page.
- Expand IP address for the affected device.
- If the box is online, use /onboarding/configure or the add-device flow's Scan LAN from box to find likely candidates.
- If the scan shows the same MAC at a new IP, save the new IP in /pro/devices.
- Add/save the MAC too if it is visible, so future DHCP changes can recover automatically.
Fresh factory installs after commit a9cef5e have SDK-side DHCP recovery:
after three consecutive DeviceUnreachable polls, then at most every five
minutes, the publisher scans for the configured host_mac, rewrites the local
manifest if the MAC moved, syncs the new host to cloud, and restarts itself.
Caveats:
- Existing boxes installed before that payload need a fresh image, a re-run of the installer, or a future OTA/SDK update before automatic recovery exists.
- Manifests without host_mac cannot recover automatically. Use LAN scan and save the MAC once.
- DHCP reservation on the router is still the cleanest long-term fix when the customer can do it.
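To make the recovery timing concrete when explaining it to a customer, here is a sketch of the trigger rule described above (three consecutive failed polls, then at most one scan every five minutes) and the MAC lookup. This is an illustration, not the SDK's code: the class and function names are hypothetical, timestamps are plain seconds, and it assumes the five-minute rate limit persists across a successful poll, which the description above does not specify.

```python
class RecoveryGate:
    """Decide when a MAC scan should run.

    A scan is allowed once `threshold` polls in a row have failed, and
    then no more often than once per `min_interval_s` seconds.
    """

    def __init__(self, threshold=3, min_interval_s=300.0):
        self.threshold = threshold
        self.min_interval_s = min_interval_s
        self.consecutive_failures = 0
        self.last_scan_at = None

    def record_poll(self, reachable, now):
        """Record one poll result; return True if a scan should run now."""
        if reachable:
            self.consecutive_failures = 0
            return False
        self.consecutive_failures += 1
        if self.consecutive_failures < self.threshold:
            return False
        if (
            self.last_scan_at is not None
            and now - self.last_scan_at < self.min_interval_s
        ):
            return False  # rate-limited: scanned too recently
        self.last_scan_at = now
        return True


def find_moved_host(host_mac, current_host, scan_results):
    """Return the new IP if host_mac shows up at a different address.

    scan_results maps MAC address -> IP, as a LAN scan might report it.
    Returns None if the MAC is absent or still at current_host.
    """
    wanted = host_mac.lower()
    for mac, ip in scan_results.items():
        if mac.lower() == wanted and ip != current_host:
            return ip
    return None
```

With polls every 10 s, the first scan fires on the third failure and the next no sooner than 300 s later, which matches the "at most every five minutes" wording.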
"I want HA to see my data"
Direct them to installer/ha_integration/README.md. Three-step
config-flow setup with their box IP + box-id. Works in all three
install modes.
"How do I stop the cloud uplink without uninstalling?"
sudo systemctl stop otonomo-publisher
sudo systemctl disable otonomo-publisher
Their drivers + local UI + HA integration keep working. v0.2 ships a friendlier UI toggle.
Health probes: what to watch
Every few hours, glance at:
- /admin/health: fleet-wide rolled-up status
- /admin/audit: recent operator + user actions
- scripts/validate_deploy.sh: run after any non-trivial change to cloud/app/
Red flags:
- >5 ERROR lines in last 5 min in hems-app logs → SSH and tail
sudo docker logs hems-app -f
- Controller hasn't ticked in >150 s → check fleet-db
connectivity, EMQX broker, the controller's own log line
- Telemetry stale across all boxes → likely EMQX issue;
sudo docker compose -f /opt/hems/broker/docker-compose.yml ps
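The first two red flags are simple threshold checks, sketched below in case you want to script the glance. The thresholds (more than 5 errors in 5 minutes, no controller tick for more than 150 s) come from the list above. The function name, flag labels, and inputs (timestamps in seconds) are hypothetical, and how you obtain them from the logs is left open.

```python
def red_flags(error_timestamps, last_controller_tick, now,
              error_limit=5, error_window_s=300, tick_limit_s=150):
    """Return the list of red flags raised by the given observations.

    error_timestamps: times (seconds) of ERROR lines in hems-app logs.
    last_controller_tick: time (seconds) of the controller's latest tick.
    """
    flags = []
    # Only errors inside the trailing window count towards the burst.
    recent = [t for t in error_timestamps if 0 <= now - t <= error_window_s]
    if len(recent) > error_limit:
        flags.append("error_burst")
    if now - last_controller_tick > tick_limit_s:
        flags.append("controller_stalled")
    return flags
```

An empty list means neither threshold is crossed. Stale telemetry across all boxes is not covered here, since it needs per-box data this sketch does not model.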
If something's actually on fire and you need to read the recovery
runbook without AI: docs/runbook_disaster_recovery.md (also lives
in the fireproof safe).
Multi-session protocol reminder
If multiple Claude sessions are working in parallel, charter rule
11 applies: at session start, the user names the other lanes; on
ANY deploy / git / lock anomaly the session must STOP and surface, never silently merge another session's work. The deploy-cloud / -pi scripts have gates that enforce this at the push layer; the human discipline ensures the conflicts get resolved correctly.
If you (the operator) see a deploy fail with a non-FF push, that's
the gate doing its job. Don't bypass with SKIP_LINT=1 unless you
truly know what you're doing; re-running after a rebase is almost
always the right answer.
Open future-work flags (track here, not in production)
- Schema migration for waitlist: currently /data/waitlist.jsonl. When a real DB is needed: design + migrate, replace _load_waitlist and _append_waitlist in onboarding.py + admin.py.
- Email + Stripe integration: manual flip-to-active will stop scaling around the 30th customer. v0.2 territory.
- v0.2 per-destination uplink toggle: plan in docs/otonomo_privacy_toggle_v02_plan.md.
- HA integration v0.2: per-capability override switches. Needs a stable HTTP cmd contract on the local UI first.
- Driver wheels to PyPI: blocked on BOIP close (~July 2026).