The brief: a real-time monitoring dashboard for a national law enforcement agency. Hundreds of field devices — vehicles, sensors, body cameras — streaming location, status, and event data. Sub-second latency requirement. 24/7 uptime. No room for failure.
Here's the architecture we built, what broke, and what held.
The Problem With Polling
The naive approach — client polls the server every second — dies at scale. 500 devices x 1 request/second = 500 req/s just for status updates, before any user traffic. Polling also gives eventual consistency, not real-time. A device state change shows up 0-1000ms late depending on poll timing.
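To make that arithmetic concrete, here is a back-of-the-envelope sketch. The function name is ours, and the average-delay figure assumes state changes land uniformly at random within a poll interval, which is an illustrative assumption rather than a measurement.

```javascript
// Rough cost of polling: request load plus how stale a reading can be.
// Assumes changes arrive uniformly at random within each poll interval.
function pollingCost(pollers, intervalMs) {
  return {
    requestsPerSecond: pollers * (1000 / intervalMs),
    worstCaseDelayMs: intervalMs,
    averageDelayMs: intervalMs / 2,
  };
}

const cost = pollingCost(500, 1000);
console.log(cost);
```

Halving the interval halves the average delay but doubles the request rate, which is the trade-off a push model avoids.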
Real-time IoT needs a push architecture. The device pushes data when something changes; the server pushes to clients immediately. No polling, no lag.
The Stack: MQTT + WebSockets
We split the problem in two:
- Device to Server: MQTT. Lightweight, designed for IoT, handles unreliable connections gracefully, runs on constrained hardware.
- Server to Browser: WebSockets. Full-duplex, works everywhere, no polling overhead.
Field devices
-> MQTT (port 8883, TLS)
-> MQTT Broker (Mosquitto)
-> Node.js bridge service
-> Redis Pub/Sub
-> WebSocket servers (multiple instances)
-> Browser clients
Redis Pub/Sub is the critical middle layer. It decouples the MQTT bridge from the WebSocket servers, letting you scale each independently. A message published to Redis reaches every WebSocket server instance (and, through them, every connected browser client) in under 5ms.
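The bridge step in that pipeline can be sketched roughly as follows. This is not the production bridge: the helper name `toRedisMessage`, the envelope fields `kind`, `data` and `bridgedAt`, and the assumption that device payloads are JSON are all ours. The channel name matches the one the WebSocket servers subscribe to later in this post.

```javascript
// Hypothetical bridge helper: turn an MQTT topic + payload into the JSON
// envelope published on the Redis channel. Field names are illustrative.
const REDIS_CHANNEL = 'device-updates';

function toRedisMessage(topic, payload) {
  // Expect topics of the form devices/{device_id}/{kind}
  const [root, deviceId, kind, ...rest] = topic.split('/');
  if (root !== 'devices' || !deviceId || !kind || rest.length > 0) {
    return null; // not a device topic; ignore
  }
  let data;
  try {
    data = JSON.parse(payload.toString()); // payload may be a Buffer or a string
  } catch {
    return null; // malformed payload; drop it rather than crash the bridge
  }
  return JSON.stringify({ deviceId, kind, data, bridgedAt: Date.now() });
}

// Wiring, assuming an MQTT.js client and a connected node-redis v4 publisher:
//   mqttClient.on('message', (topic, payload) => {
//     const message = toRedisMessage(topic, payload);
//     if (message) redisPublisher.publish(REDIS_CHANNEL, message);
//   });
```

Keeping the translation in a pure function makes the bridge easy to test without a broker or a Redis instance running.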
MQTT: The Device Layer
MQTT runs on a publish/subscribe model. Devices publish to topics; subscribers receive messages. Our topic structure:
devices/{device_id}/location # GPS coordinates, heading, speed
devices/{device_id}/status # online/offline, battery, signal
devices/{device_id}/events # alerts, triggers, incidents
devices/+/location # wildcard: all device locations
QoS level matters. We use QoS 1 (at-least-once delivery) for events and QoS 0 (fire-and-forget) for location updates. Location data is high-frequency and stale the moment it arrives, so a dropped packet doesn't matter. An incident event must be delivered.
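The wildcard and QoS rules can be sketched as two small helpers. Both are illustrative: the matcher is a simplified version of MQTT topic-filter matching (it ignores the special handling of topics starting with `$`), and treating every non-location topic as QoS 1 is our assumption, since the post only states the policy for events and location.

```javascript
// Simplified MQTT topic-filter matching: '+' matches exactly one level,
// '#' matches all remaining levels (including none).
function topicMatches(filter, topic) {
  const f = filter.split('/');
  const t = topic.split('/');
  for (let i = 0; i < f.length; i++) {
    if (f[i] === '#') return true;
    if (i >= t.length) return false;
    if (f[i] !== '+' && f[i] !== t[i]) return false;
  }
  return f.length === t.length;
}

// QoS policy from the text: location is fire-and-forget, events must arrive.
// Defaulting everything else to QoS 1 is an assumption.
function qosForTopic(topic) {
  return topic.endsWith('/location') ? 0 : 1;
}
```

A subscriber using a filter such as `devices/+/location` receives every device's position updates without needing to know device IDs in advance.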
Handling Disconnections
Field devices go offline constantly: tunnels, dead zones, reboots. MQTT's Last Will and Testament (LWT) handles this gracefully. The broker publishes a "device offline" message automatically when a connection drops unexpectedly, which it detects through a socket error or a missed keep-alive window. No application-level heartbeat logic needed.
// Device connects with LWT configured
client.connect({
  will: {
    topic: `devices/${deviceId}/status`,
    // Built at connect time, so this timestamp marks session start, not the moment of disconnect
    payload: JSON.stringify({ online: false, timestamp: Date.now() }),
    qos: 1,
    retain: true // new subscribers see last known state immediately
  }
})
Retained messages are equally important: a new browser client connecting to the dashboard sees the current state of all devices instantly, without waiting for the next update from each device.
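One detail that follows from retained wills: once the broker has published the retained "offline" message, it stays on the status topic until something overwrites it. A common pattern, sketched here as an assumption rather than as the firmware's actual behavior, is for the device to publish a retained "online" status each time it connects. The helper name `announceOnline` is hypothetical; `client` is any object with an MQTT.js-style `publish(topic, message, options)` method.

```javascript
// Sketch: after a successful (re)connect, overwrite the retained "offline"
// will with a retained "online" status on the same topic.
function announceOnline(client, deviceId, now = Date.now()) {
  const topic = `devices/${deviceId}/status`;
  const payload = JSON.stringify({ online: true, timestamp: now });
  client.publish(topic, payload, { qos: 1, retain: true });
  return topic;
}

// With MQTT.js this would typically be wired as:
//   client.on('connect', () => announceOnline(client, deviceId));
```

Using the same topic, QoS and retain flag as the will keeps a single source of truth for each device's last known state.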
WebSockets: The Browser Layer
We run multiple Node.js WebSocket server instances behind a load balancer. The problem: WebSocket connections are stateful. A browser connected to Instance A can't receive messages published by Instance B — unless they share state.
Redis Pub/Sub solves this. Every WebSocket instance subscribes to the same Redis channels. Every message from a device reaches every instance, which forwards it to connected browser clients.
// WebSocket server (simplified; assumes node-redis v4+, where a client must connect before subscribing)
import { createClient } from 'redis'
const redisSubscriber = createClient()
await redisSubscriber.connect()
await redisSubscriber.subscribe('device-updates', (message) => {
  const update = JSON.parse(message)
  broadcastToRoom(update.deviceId, update) // forward only to clients watching this device
})
wss.on('connection', (ws, req) => {
  const { deviceIds } = parseSubscription(req)
  deviceIds.forEach(id => addClientToRoom(id, ws))
  ws.on('close', () => {
    deviceIds.forEach(id => removeClientFromRoom(id, ws))
  })
})
What Broke at Scale
The Memory Leak
At around 800 concurrent WebSocket connections, memory climbed and never came back down. Root cause: event listeners on the ws object weren't being cleaned up on disconnect. Every closed connection left a dangling listener. Fixed with explicit cleanup in the close handler and a WeakMap for client tracking.
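The post doesn't show the cleanup code, so the following is only a sketch of what "explicit cleanup plus a WeakMap" could look like. `addClientToRoom` and `broadcastToRoom` reuse the helper names from the server snippet; `removeClient`, the `rooms` map and the `subscriptions` WeakMap are our own illustrative names, and `removeClient` removes a socket from all of its rooms at once rather than one room at a time.

```javascript
// Room registry sketch: deviceId -> Set of sockets, plus a WeakMap from
// socket -> subscribed device ids, so per-client metadata does not keep a
// closed socket alive.
const rooms = new Map();
const subscriptions = new WeakMap();

function addClientToRoom(deviceId, ws) {
  if (!rooms.has(deviceId)) rooms.set(deviceId, new Set());
  rooms.get(deviceId).add(ws);
  if (!subscriptions.has(ws)) subscriptions.set(ws, new Set());
  subscriptions.get(ws).add(deviceId);
}

// Call from the socket's 'close' handler.
function removeClient(ws) {
  for (const deviceId of subscriptions.get(ws) ?? []) {
    const room = rooms.get(deviceId);
    if (!room) continue;
    room.delete(ws);
    if (room.size === 0) rooms.delete(deviceId); // do not keep empty rooms around
  }
  subscriptions.delete(ws);
  ws.removeAllListeners?.(); // drop listeners still attached (EventEmitter-style sockets)
}

function broadcastToRoom(deviceId, update) {
  const message = JSON.stringify(update);
  let sent = 0;
  for (const ws of rooms.get(deviceId) ?? []) {
    if (ws.readyState === 1) { // 1 === OPEN in the WebSocket API
      ws.send(message);
      sent++;
    }
  }
  return sent;
}
```

The strong references live only in `rooms`, so removing a socket there on close is what actually lets it be garbage-collected; the WeakMap just avoids creating a second place that could pin it.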
Message Storm on Reconnect
When the broker restarted, all 400+ devices reconnected simultaneously and published their retained state. The bridge service received 400 messages in ~200ms, overwhelmed the Redis pipeline, and backed up. Fixed with connection jitter (random 0-5s reconnect delay on device firmware) and a message queue with backpressure on the bridge.
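Both fixes are simple in outline. The sketch below is an illustration in JavaScript, not the device firmware or the production bridge: `reconnectDelayMs` and `BoundedQueue` are hypothetical names, and the policy for what to do when the queue is full (slow down, drop stale location updates, pause the subscription) is left to the caller.

```javascript
// Reconnect jitter: wait a random delay in [0, maxMs) before reconnecting.
// The 5000 ms default mirrors the 0-5s window; `random` is injectable for tests.
function reconnectDelayMs(maxMs = 5000, random = Math.random) {
  return Math.floor(random() * maxMs);
}

// Bounded queue: push() returns false when full, which is the backpressure
// signal the producer is expected to react to.
class BoundedQueue {
  constructor(capacity) {
    this.capacity = capacity;
    this.items = [];
  }

  push(item) {
    if (this.items.length >= this.capacity) return false;
    this.items.push(item);
    return true;
  }

  // Remove and return up to `max` items, oldest first.
  drain(max) {
    return this.items.splice(0, max);
  }

  get depth() {
    return this.items.length;
  }
}
```

The `depth` getter doubles as the "queue depth" metric a bridge service would want to export.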
The Database Write Problem
We were writing every location update to PostgreSQL in real-time. At 2 updates/second per device across 400 devices, that is 800 writes/second. Postgres handled it, but barely, and query latency spiked. Solution: write location to Redis (fast, ephemeral) for real-time display; batch-write to Postgres every 30 seconds for historical queries. Different data, different storage, different access patterns.
Real-time and persistent are different requirements. Don't force the same storage layer to serve both.
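A minimal sketch of that split, under stated assumptions: `hotStore` is a plain `Map` standing in for Redis, `writeBatch` stands in for a multi-row Postgres insert, and the class name `LocationBatcher` is ours. It buffers every update for the historical write; whether the real system keeps every point or downsamples is not something the post specifies.

```javascript
// Hot/cold split: latest value per device goes to the hot store immediately,
// while rows accumulate in memory until the next periodic flush.
class LocationBatcher {
  constructor(hotStore, writeBatch) {
    this.hotStore = hotStore;
    this.writeBatch = writeBatch;
    this.pending = [];
  }

  record(update) {
    this.hotStore.set(update.deviceId, update); // real-time path: overwrite latest
    this.pending.push(update); // historical path: buffer for the next flush
  }

  async flush() {
    if (this.pending.length === 0) return 0;
    const rows = this.pending;
    this.pending = []; // swap first so updates arriving mid-write are kept
    try {
      await this.writeBatch(rows);
    } catch (err) {
      this.pending = rows.concat(this.pending); // put rows back for the next attempt
      throw err;
    }
    return rows.length;
  }
}

// e.g. setInterval(() => batcher.flush().catch(console.error), 30_000);
```

With the figures above, one flush every 30 seconds replaces roughly 24,000 single-row writes with one batched write.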
Monitoring
A real-time system that breaks silently is the worst outcome. We instrument:
- MQTT broker: connected clients, message rate, dropped connections
- Bridge service: queue depth, processing latency, Redis publish errors
- WebSocket servers: connected clients per instance, message broadcast latency
- End-to-end: synthetic device-to-browser latency measured every 30 seconds
The end-to-end synthetic test is the most valuable. It's the only metric that catches cascading failures across multiple layers simultaneously.
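The measurement side of such a probe can be sketched as below. The function names are hypothetical, the sketch assumes both timestamps come from the same clock (for example, a probe runner that both publishes the synthetic message and listens on a WebSocket), and the percentile uses the nearest-rank method, which may differ from what a given metrics stack computes.

```javascript
// Latency for one synthetic probe: the message carries its send time and the
// listener records when it arrived. Both times must come from the same clock.
function probeLatencyMs(sentAtMs, receivedAtMs) {
  return receivedAtMs - sentAtMs;
}

// Nearest-rank percentile over collected samples, for p in (0, 100].
function percentile(samples, p) {
  if (samples.length === 0) return NaN;
  const sorted = [...samples].sort((a, b) => a - b);
  const rank = Math.ceil((p * sorted.length) / 100);
  return sorted[Math.max(rank, 1) - 1];
}

// Summary of the kind reported for a latency dashboard: average and p99.
function summarize(samples) {
  const sum = samples.reduce((acc, x) => acc + x, 0);
  return { average: sum / samples.length, p99: percentile(samples, 99) };
}
```

Alerting on the p99 rather than the average is what surfaces a slow layer that only affects a fraction of messages.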
Numbers
- Peak concurrent devices: ~600
- Peak concurrent browser clients: ~120
- Average device to browser latency: 180ms
- p99 device to browser latency: 420ms
- Uptime over 12 months: 99.94%
The full stack: Mosquitto as MQTT broker, Node.js for the bridge and WebSocket servers, Redis 7 for Pub/Sub and hot data, PostgreSQL for historical data, React on the frontend with a custom WebSocket hook, deployed on bare-metal VMs behind Nginx. No managed services — the client's security requirements mandated on-premise.