I built a shell for my ESP32 over MQTT

(so I could stop driving out to fix things)

The wood boiler firmware runs at my brother Theo’s place in rural Alberta. I am twenty minutes away. Not far. But “twenty minutes” stops being the unit when I want to nudge an alert threshold and check whether it actually fires, then nudge it again.

It also stops being the unit when I am not home. A few weekends ago I was in southern Alberta visiting friends, three hours away, and one of the boiler alerts failed to fire when it should have. I pulled the iPad out of the car, opened VS Code Web, tunneled into my code server at home, and walked through the device’s logs and config from a kitchen table. The fix took about ten minutes. The drive to “fix it properly” would have been six hours round trip.

The honest options before any of this existed were:

  1. SSH. No. It is a microcontroller. There is no SSH.
  2. Push a firmware update with the new value baked in. Wait for the device to fetch it. Hope it boots.
  3. Drive out and edit the config over USB.

None of those felt right for “change one number.” Option 2 is heavy. Option 3 burns at least an hour for the round trip plus the fix, more when I am not in town. So I did what felt obvious in hindsight: I gave the firmware a shell, and I plumbed it through the MQTT broker the device was already talking to.

That decision is the whole post. Once your device has a broker, your broker is already a bidirectional channel. If you only ever publish from the device, you are using half of it.

The pivot

Every ESP32 in the field is already opening a TLS connection to the broker. It publishes telemetry, listens for retained config blobs, shows up on the Home Assistant dashboard. Adding a CLI to that picture is not a new connection, not a new port, not a new firewall rule on Theo’s router. It is just two more topics on the broker the device is already attached to.

The shape ended up being:

<prefix>/cli/<command>   <- I publish here, device executes
<prefix>/cli/response    <- device publishes here, I read

<prefix> is the device’s MQTT root, like thesada/owb. The device subscribes once at boot to <prefix>/cli/+. The plus is the single-level wildcard: cli/fs.cat, cli/config.dump, cli/restart, cli/ota.check, all hit the same handler, and the handler just looks at the last topic segment to figure out which command was invoked. The payload is the argument, plain bytes, whatever the command needs.

The response goes back on a single fixed topic, cli/response, as JSON. One line, easy to parse, easy to print. The topic shape gives device-level isolation for free: another device’s broker credentials cannot publish into my device’s cli/ topics, because the broker ACL only lets each device read and write under its own prefix.
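
The routing described above (one cli/+ subscription, the command taken from the last topic segment, the result returned as a small JSON-style object) can be modelled in a few lines. This is a minimal sketch in Python rather than the firmware's C++, and it is not the actual handler. The handler functions, the HANDLERS table and the "unknown command" message are hypothetical. The ok and output field names are borrowed from the sample response shown later in this post.

```python
from typing import Callable, Dict, List, Optional

PREFIX = "thesada/owb"  # the example device root used in this post


# Hypothetical handlers: each receives the raw payload and returns output lines.
def cmd_restart(payload: bytes) -> List[str]:
    return ["restarting"]  # placeholder, a real device would reboot here


def cmd_fs_cat(payload: bytes) -> List[str]:
    return ["would read " + payload.decode("utf-8")]  # placeholder


HANDLERS: Dict[str, Callable[[bytes], List[str]]] = {
    "restart": cmd_restart,
    "fs.cat": cmd_fs_cat,
}


def command_from_topic(topic: str, prefix: str = PREFIX) -> Optional[str]:
    """Return <command> if topic is <prefix>/cli/<command>, else None."""
    head = prefix + "/cli/"
    if not topic.startswith(head):
        return None
    command = topic[len(head):]
    # '+' matches exactly one topic level, so a command never contains '/'.
    if not command or "/" in command:
        return None
    return command


def dispatch(topic: str, payload: bytes) -> Optional[dict]:
    """Run one command and return the response object, or None if ignored."""
    command = command_from_topic(topic)
    # <prefix>/cli/response also matches <prefix>/cli/+, so a subscriber on
    # the wildcard sees the device's own replies. This sketch ignores them.
    if command is None or command == "response":
        return None
    handler = HANDLERS.get(command)
    if handler is None:
        return {"ok": False, "output": ["unknown command: " + command]}
    return {"ok": True, "output": handler(payload)}
```

Calling dispatch("thesada/owb/cli/fs.cat", b"/scripts/rules.lua") returns an object with ok set to True and one output line. A topic under another device's prefix returns None and is never executed.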

QoS 0, retain false. Commands are not replayed on reconnect. If the device was offline when I sent a command, that command is gone, and that is the behavior I want. The last thing I need is a device that comes back from a power cycle and reboots itself again because a stale cli/restart was sitting in retained storage.

What it looks like

Here is a normal poke at the device from my laptop. mosquitto_pub and mosquitto_sub are the only tools, plus a second terminal pane to watch responses come back.

# Terminal 1: subscribe to the response topic
$ mosquitto_sub -h mqtt.example.lan -t 'thesada/owb/cli/response' -v

# Terminal 2: ask the device what config it is running
$ mosquitto_pub -h mqtt.example.lan -t 'thesada/owb/cli/config.dump' -m ''

# Terminal 1 prints (formatted for readability):
thesada/owb/cli/response  {
  "ok": true,
  "output": [
    "{",
    "  \"wifi\": { \"ssid\": \"...\" },",
    "  \"mqtt\": { \"broker\": \"...\" },",
    "  \"alerts\": { \"temp_high\": 92, \"temp_crit\": 98 }",
    "}"
  ]
}

# Read a Lua script off the device's flash
$ mosquitto_pub -h mqtt.example.lan -t 'thesada/owb/cli/fs.cat' \
    -m '/scripts/rules.lua'

# Push an updated script back (-f sends the contents of a local file as the payload)
$ mosquitto_pub -h mqtt.example.lan -t 'thesada/owb/cli/fs.write' \
    -f /scripts/rules.lua

# Trigger an OTA check now (instead of waiting for the next interval)
$ mosquitto_pub -h mqtt.example.lan -t 'thesada/owb/cli/ota.check' -m '--force'

# Reboot
$ mosquitto_pub -h mqtt.example.lan -t 'thesada/owb/cli/restart' -m ''
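
The responses in that transcript are easy to unpack on the host side. Here is a small sketch, assuming each line comes from mosquitto_sub -v (topic, a space, then the payload) and that the payload is the single-line JSON object with ok and output fields shown above. The function name parse_response_line is made up for illustration.

```python
import json
from typing import List, Tuple


def parse_response_line(line: str) -> Tuple[str, bool, List[str]]:
    """Split one `mosquitto_sub -v` line into (topic, ok, output lines)."""
    # -v prints the topic first, then the payload, separated by a space.
    topic, _, payload = line.partition(" ")
    doc = json.loads(payload)
    return topic, bool(doc.get("ok")), list(doc.get("output", []))
```

Joining the returned output lines with newlines gives back the text the command produced, for example the config dump or the contents of rules.lua.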

Now, the honest part. None of those commands are MQTT-specific. They are the same shell commands I already had wired up over serial, for the days I am sitting next to a board with a USB cable. fs.cat, fs.write, config.dump, ota.check, restart, plus a couple of dozen others I have not bored you with yet. Once I started thinking about a remote shell, the “build a new API for it” instinct was the wrong one. Why reinvent the wheel? The serial shell already had a parser, an output format, error handling. I just gave it a second input and output channel.

So there is one set of commands. Sitting at the bench with a USB cable, I type into the serial console. Sitting at a kitchen table in southern Alberta with an iPad, I publish to MQTT. Same words, same arguments, same response format. That is what I mean by “unified” - not a clever framework, just a deliberate choice not to fork the API.
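
One way to picture “one set of commands, two channels” is a command core that knows nothing about transports, with a thin adapter per channel. The sketch below is a Python model, not the firmware code. The Shell class, the adapter names and the serial syntax (command name, a space, then the argument) are all assumptions made for illustration.

```python
import json
from typing import Callable, Dict, List


class Shell:
    """Transport-agnostic core: command name and argument in, result out."""

    def __init__(self) -> None:
        self._commands: Dict[str, Callable[[str], List[str]]] = {}

    def register(self, name: str, fn: Callable[[str], List[str]]) -> None:
        self._commands[name] = fn

    def run(self, name: str, arg: str) -> dict:
        fn = self._commands.get(name)
        if fn is None:
            return {"ok": False, "output": ["unknown command: " + name]}
        try:
            return {"ok": True, "output": fn(arg)}
        except Exception as exc:  # report the failure instead of crashing
            return {"ok": False, "output": [str(exc)]}


def from_serial(shell: Shell, line: str) -> str:
    """Serial adapter: 'fs.cat /scripts/rules.lua' becomes one JSON line."""
    name, _, arg = line.strip().partition(" ")
    return json.dumps(shell.run(name, arg))


def from_mqtt(shell: Shell, topic: str, payload: bytes) -> str:
    """MQTT adapter: last topic segment is the command, payload the argument."""
    name = topic.rsplit("/", 1)[-1]
    return json.dumps(shell.run(name, payload.decode("utf-8")))
```

Both adapters end in the same shell.run call, so a command registered once is reachable from either channel and answers in the same format on both.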

The Lua piece deserves its own post. Alert logic lives in /scripts/rules.lua on the device’s filesystem, not in the C++ firmware. So “tweak an alert threshold” is fs.cat, edit, fs.write, done. No recompile, no OTA, no reboot.

Why this matters

The pitch sounds clean. The lived experience taught me what the shell is actually for. Three quick stories.

The alert that did not fire. This is the southern Alberta one I mentioned at the top. The boiler hit a temperature that should have triggered a Telegram alert. It did not. From the iPad I pulled the current rules.lua, scanned the comparison logic, found a unit mismatch I had introduced in the previous session (Celsius vs the raw sensor scale, my fault), wrote a corrected script back, and asked the device to restart so the Lua engine reloaded clean. Ten minutes from “that is wrong” to “that is fixed and live.”

The Telegram heap problem. Telegram is delightful as an alert channel and slightly less delightful when its TLS handshake decides to allocate during a heap-tight moment. I had a window where the device would publish telemetry fine, then drop MQTT during a Telegram send, because the Telegram client had eaten the headroom MQTT needed to keep its own TLS session alive. I caught the pattern by hand from the shell, dumping heap stats while triggering test alerts, until the shape of the failure was obvious. The fix was a heap floor in the Telegram module: if free heap is below the floor when an alert wants to fire, MQTT keeps priority and the alert waits a beat. That fix exists because of an evening of remote poking, not because I predicted it.
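
The heap floor idea fits in a few lines. This is a Python model under stated assumptions, not the Telegram module itself. The floor value, the AlertGate name and the one-alert-per-tick policy are hypothetical, and the post does not give the real numbers.

```python
from collections import deque
from typing import Callable, Deque

HEAP_FLOOR_BYTES = 40_000  # hypothetical value, not taken from the firmware


class AlertGate:
    """Hold alerts while free heap is below a floor, send once it recovers."""

    def __init__(self, send: Callable[[str], None],
                 floor: int = HEAP_FLOOR_BYTES) -> None:
        self._send = send
        self._floor = floor
        self._pending: Deque[str] = deque()

    def submit(self, text: str) -> None:
        """Queue an alert. Nothing is sent until tick() allows it."""
        self._pending.append(text)

    def tick(self, free_heap: int) -> bool:
        """Call periodically with the current free heap.

        Sends at most one queued alert per call, because each send costs
        heap and the reading should be refreshed before the next one.
        Returns True if an alert was sent.
        """
        if not self._pending or free_heap < self._floor:
            return False
        self._send(self._pending.popleft())
        return True
```

An alert submitted during a tight moment stays queued, so nothing is lost. It goes out on the first tick where free heap is back above the floor.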

The almost-brick. This is the Murphy story. I pushed a config change that quietly broke WiFi reconnect after a sleep cycle. The device rebooted, came up, and never made it back to the broker, so I could not talk to it anymore. Not from the shell, obviously, because the shell goes through the broker. Theo cycled the power on my instructions, the device booted from the previous good config, I fixed my mistake, and moved on. A trip avoided, but only narrowly.

That last one is where the watchdog story starts. Every time I have nearly bricked a device remotely, the next firmware build has gained a defensive feature. There is a preventive reboot if free heap stays below a threshold for sixty seconds, on the theory that whatever has the heap that low is probably not going to give it back, and a clean boot from flash is recoverable while a wedge over a shrinking heap is not. There is the Telegram heap floor I mentioned. And once it became clear the bigger boards had PSRAM sitting unused while the main heap fought TLS for room, I started moving the chunky-but-cold buffers (the CA certificate, OTA download buffers) into PSRAM so the working heap had its breathing room back. The shell makes me bolder. The watchdogs and the PSRAM moves catch me when bolder turns into sloppy.
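
The preventive reboot rule (free heap below a threshold for sixty seconds straight) is a small state machine. Here is a Python sketch of it. The class name and the threshold are assumptions, and the clock is passed in so the logic can be exercised without waiting a minute.

```python
from typing import Optional


class HeapWatchdog:
    """Ask for a reboot once free heap has stayed low for hold_seconds."""

    def __init__(self, threshold_bytes: int, hold_seconds: float = 60.0) -> None:
        self._threshold = threshold_bytes
        self._hold = hold_seconds
        self._low_since: Optional[float] = None  # when the low streak began

    def should_reboot(self, free_heap: int, now: float) -> bool:
        if free_heap >= self._threshold:
            # Heap recovered, so any low streak is over.
            self._low_since = None
            return False
        if self._low_since is None:
            self._low_since = now  # first low reading starts the streak
        return now - self._low_since >= self._hold
```

A single healthy reading resets the timer, so a brief dip during a TLS handshake does not trigger a reboot. Only a sustained low streak does.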

What this unlocks

The remote shell changed how I work on the firmware in a way I did not anticipate when I built it. The first version of any feature is a draft. The shell lets the second, third, fifth version happen at the speed of editing a Lua script and pressing publish, instead of the speed of a build, an OTA, a reboot, and a wait-and-see. That tightens the loop enough that the device feels almost as iterable as the app running on my laptop, even though it is twenty minutes away in a shed.

The other thing I did not anticipate is what happens when there are several of them. I am planning to build something on top of this MQTT API for fleet management - imagine the same cli/* API, but instead of me typing mosquitto_pub against a single device, a web dashboard shows me every device, lets me push a config change to all of them at once, watches drift between what is running and what should be running. The shell is the primitive. The dashboard is the next layer. More on that when it is ready.
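
Drift, in the sense used above, is just a diff between the config a device should be running and what config.dump reports. A rough sketch of that comparison follows, assuming both sides are nested dictionaries parsed from JSON. The function name and the dotted-key output format are made up for illustration, and nothing here reflects how the planned dashboard will actually work.

```python
from typing import Any, Dict, Tuple


def config_drift(desired: dict, running: dict,
                 path: str = "") -> Dict[str, Tuple[Any, Any]]:
    """Return {dotted.key: (desired, running)} for every value that differs.

    A key missing on one side shows up with None on that side.
    """
    drift: Dict[str, Tuple[Any, Any]] = {}
    for key in sorted(set(desired) | set(running)):
        where = f"{path}.{key}" if path else key
        want, have = desired.get(key), running.get(key)
        if isinstance(want, dict) and isinstance(have, dict):
            drift.update(config_drift(want, have, where))  # recurse into sections
        elif want != have:
            drift[where] = (want, have)
    return drift
```

With a desired temp_high of 90 and a device reporting 92, the result has a single entry for alerts.temp_high, which is the kind of thing a fleet view could flag per device.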

Takeaway

If your microcontroller talks to an MQTT broker, your broker is already a bidirectional channel. You can keep using half of it - device publishes, server reads, end of story. Or you can ask: what would my future self thank me for being able to do remotely, six months from now, when the device is in a shed I do not feel like driving to?

For me the answer was a shell. For your project it might be something else. But the shape - “use the existing connection, expose your existing tools over it, do not invent a new API” - probably travels.