Real-Time IoT at Scale: WebSockets, MQTT, and Lessons Learned

The brief: a real-time monitoring dashboard for a national law enforcement agency. Hundreds of field devices, including vehicles, sensors, and body cameras, stream location, status, and event data. Sub-second latency requirement. 24/7 uptime. No room for failure.

Here's the architecture we built, what broke, and what held.

The Problem With Polling

The naive approach, polling the server every second, dies at scale: 500 devices x 1 request/second is 500 req/s just for status updates, before any user traffic. Polling is also not real-time. A device state change shows up anywhere from 0 to 1000ms late depending on poll timing, about 500ms on average.

Real-time IoT needs a push architecture. The device pushes data when something changes; the server pushes to clients immediately. No polling, no lag.

The Stack: MQTT + WebSockets

We split the problem in two:

Device to Server: MQTT. Lightweight, designed for IoT, handles unreliable connections gracefully, runs on constrained hardware.
Server to Browser: WebSockets. Full-duplex, works everywhere, no polling overhead.

Field devices
  -> MQTT (port 8883, TLS)
  -> MQTT Broker (Mosquitto)
  -> Node.js bridge service
  -> Redis Pub/Sub
  -> WebSocket servers (multiple instances)
  -> Browser clients

Redis Pub/Sub is the critical middle layer. It decouples the MQTT bridge from the WebSocket servers, letting you scale each independently. A message published to Redis reaches every WebSocket server instance and every connected browser client in under 5ms.
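The bridge's job can be sketched in a few lines. This is a minimal illustration, not the production code: `toUpdate`, `createBridge`, and the `{ deviceId, kind, data }` envelope are hypothetical names. The clients are injected, so anything exposing `on('message', (topic, payload) => ...)` (as MQTT.js does) and `publish(channel, message)` (as node-redis does) would fit.

```javascript
// Sketch of the MQTT -> Redis bridge. Names and envelope shape are illustrative.
const TOPIC_PATTERN = /^devices\/([^/]+)\/(location|status|events)$/

// Turn an MQTT topic + payload into the envelope published to Redis.
// Returns null for topics outside devices/{id}/{kind} or for invalid JSON.
function toUpdate(topic, payload) {
  const match = TOPIC_PATTERN.exec(topic)
  if (!match) return null
  try {
    const data = JSON.parse(payload.toString())
    return { deviceId: match[1], kind: match[2], data }
  } catch {
    return null
  }
}

// Forward every valid device message to a single Redis channel.
function createBridge({ mqttClient, redisPublisher, channel = 'device-updates' }) {
  mqttClient.on('message', (topic, payload) => {
    const update = toUpdate(topic, payload)
    if (!update) return
    // publish() may return a promise (node-redis) or nothing (a test double)
    Promise.resolve(redisPublisher.publish(channel, JSON.stringify(update)))
      .catch(err => console.error('Redis publish failed', err))
  })
}
```

Keeping the parsing in a pure function makes it easy to test without a broker, and the bridge itself stays stateless, which is what lets it scale independently of the WebSocket tier.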

MQTT: The Device Layer

MQTT runs on a publish/subscribe model. Devices publish to topics; subscribers receive messages. Our topic structure:

devices/{device_id}/location      # GPS coordinates, heading, speed
devices/{device_id}/status        # online/offline, battery, signal
devices/{device_id}/events        # alerts, triggers, incidents
devices/+/location                # subscription filter: all device locations
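To make the wildcard semantics concrete, here is a small matcher for MQTT topic filters. It is a sketch for illustration (the broker does this matching for you), and `topicMatches` is a hypothetical helper name. `+` matches exactly one topic level; `#` matches the rest of the topic and must be the last level in the filter.

```javascript
// Returns true if an MQTT topic filter matches a concrete topic name.
function topicMatches(filter, topic) {
  const f = filter.split('/')
  const t = topic.split('/')
  for (let i = 0; i < f.length; i++) {
    if (f[i] === '#') return i === f.length - 1   // multi-level wildcard, last level only
    if (i >= t.length) return false               // filter is longer than the topic
    if (f[i] !== '+' && f[i] !== t[i]) return false
  }
  return f.length === t.length
}

// topicMatches('devices/+/location', 'devices/unit-7/location') is true
// topicMatches('devices/+/location', 'devices/unit-7/status') is false
```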

QoS level matters. We use QoS 1 (at-least-once delivery) for events and QoS 0 (fire-and-forget) for location updates. Location data is high-frequency and stale the moment it arrives, so a dropped packet doesn't matter. An incident event must be delivered.
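One way to keep that QoS split in a single place is a lookup keyed on the last topic level. This is a sketch: `publishOptionsFor` is a hypothetical helper, and QoS 1 for status is an assumption carried over from the LWT configuration.

```javascript
// QoS per message kind: fire-and-forget for location, at-least-once otherwise.
const QOS_BY_KIND = new Map([
  ['location', 0],
  ['status', 1],
  ['events', 1],
])

// Derive publish options from a topic such as devices/{device_id}/events.
function publishOptionsFor(topic) {
  const kind = topic.split('/').pop()
  if (!QOS_BY_KIND.has(kind)) throw new Error(`Unknown message kind in topic: ${topic}`)
  return { qos: QOS_BY_KIND.get(kind) }
}

// Usage with an MQTT.js-style client (client and payload are assumed to exist):
// client.publish(topic, JSON.stringify(payload), publishOptionsFor(topic))
```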

Handling Disconnections

Field devices go offline constantly: in tunnels, in dead zones, and during reboots. MQTT's Last Will and Testament (LWT) handles this gracefully: the broker publishes a "device offline" message automatically when a connection drops unexpectedly, which it detects through a socket error or a missed keep-alive window. No application-level heartbeat logic needed.

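// Device connects with LWT configured (MQTT.js; the broker URL is a placeholder)
import mqtt from 'mqtt'

// deviceId is assumed to come from the device's own configuration
const client = mqtt.connect('mqtts://broker.example.internal:8883', {
  will: {
    topic: `devices/${deviceId}/status`,
    // Built at connect time, so timestamp is the connect time, not the drop time
    payload: JSON.stringify({ online: false, timestamp: Date.now() }),
    qos: 1,
    retain: true   // new subscribers see last known state immediately
  }
})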

Retained messages are equally important. Any new subscriber receives the last retained message on each matching topic as soon as it subscribes, so a freshly opened dashboard can show the current state of all devices instantly, without waiting for the next update from each device.
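Retained messages cover MQTT subscribers such as the bridge. For browsers, one way to get the same effect is to keep the latest status per device on the WebSocket server and send it as a snapshot when a client connects. The sketch below assumes that approach; `LastKnownState` and the `deviceId`/`kind` fields on each update are hypothetical names, not the production code.

```javascript
// Keeps the most recent status update per device for snapshot-on-connect.
class LastKnownState {
  constructor() {
    this.latest = new Map() // deviceId -> latest status update
  }

  // Record an update; only status messages are kept as "current state".
  record(update) {
    if (update.kind === 'status') this.latest.set(update.deviceId, update)
  }

  // Latest known status for the requested devices, skipping unknown ones.
  snapshotFor(deviceIds) {
    return deviceIds
      .filter(id => this.latest.has(id))
      .map(id => this.latest.get(id))
  }
}

// On connect (ws is assumed to be an open WebSocket):
// ws.send(JSON.stringify({ type: 'snapshot', devices: state.snapshotFor(deviceIds) }))
```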

WebSockets: The Browser Layer

We run multiple Node.js WebSocket server instances behind a load balancer. The problem: WebSocket connections are stateful. A browser connected to Instance A never sees a message that only Instance B received, so every instance needs access to the same message stream.

Redis Pub/Sub solves this. Every WebSocket instance subscribes to the same Redis channels. Every message from a device reaches every instance, which forwards it to connected browser clients.

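// WebSocket server (simplified; ES module, node-redis v4+ and ws)
import { createClient } from 'redis'
import { WebSocketServer } from 'ws'

const wss = new WebSocketServer({ port: 8080 })   // port is a placeholder

// A client in subscriber mode needs its own connection, opened explicitly
const redisSubscriber = createClient()
await redisSubscriber.connect()
await redisSubscriber.subscribe('device-updates', (message) => {
  const update = JSON.parse(message)
  broadcastToRoom(update.deviceId, update)
})

wss.on('connection', (ws, req) => {
  // parseSubscription and the room functions are application-level helpers
  const { deviceIds } = parseSubscription(req)
  deviceIds.forEach(id => addClientToRoom(id, ws))
  ws.on('close', () => {
    deviceIds.forEach(id => removeClientFromRoom(id, ws))
  })
})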
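The room helpers used in that snippet could look roughly like this. It is a sketch under the assumption that a "room" is simply the set of sockets subscribed to one device; the bodies are illustrative, not the production implementation.

```javascript
const OPEN = 1              // readyState value of an open socket in the WebSocket API
const rooms = new Map()     // deviceId -> Set of sockets subscribed to that device

function addClientToRoom(deviceId, ws) {
  if (!rooms.has(deviceId)) rooms.set(deviceId, new Set())
  rooms.get(deviceId).add(ws)
}

function removeClientFromRoom(deviceId, ws) {
  const room = rooms.get(deviceId)
  if (!room) return
  room.delete(ws)
  if (room.size === 0) rooms.delete(deviceId)   // do not accumulate empty rooms
}

// Send an update to every open socket in the device's room.
// Returns the number of sockets written to.
function broadcastToRoom(deviceId, update) {
  const room = rooms.get(deviceId)
  if (!room) return 0
  const message = JSON.stringify(update)        // serialize once per broadcast
  let sent = 0
  for (const ws of room) {
    if (ws.readyState === OPEN) {
      ws.send(message)
      sent++
    }
  }
  return sent
}
```

Serializing once per broadcast rather than once per client matters when a popular device has many watchers.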

What Broke at Scale

The Memory Leak

At around 800 concurrent WebSocket connections, memory climbed and never came back down. Root cause: event listeners on the ws object weren't being cleaned up on disconnect. Every closed connection left a dangling listener. Fixed with explicit cleanup in the close handler and a WeakMap for client tracking.
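The shape of that fix, roughly: remove every listener you registered when the socket closes, and key per-connection metadata on the socket in a WeakMap so the metadata never keeps a dead socket alive. `trackClient` and `clientInfo` are hypothetical names; the socket is assumed to be an EventEmitter, as sockets from the `ws` package are.

```javascript
// Per-connection metadata keyed by the socket itself. A WeakMap entry does not
// prevent the socket from being garbage-collected.
const clientInfo = new WeakMap()

function trackClient(ws, deviceIds, onMessage) {
  clientInfo.set(ws, { deviceIds, connectedAt: Date.now() })
  ws.on('message', onMessage)

  // once(): the close handler removes itself after it runs
  ws.once('close', () => {
    ws.off('message', onMessage)   // explicit cleanup of the listener we added
    clientInfo.delete(ws)          // not strictly required with a WeakMap, but immediate
  })
}
```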

Message Storm on Reconnect

When the broker restarted, all 400+ devices reconnected simultaneously and published their retained state. The bridge service received 400 messages in ~200ms, overwhelmed the Redis pipeline, and backed up. Fixed with connection jitter (random 0-5s reconnect delay on device firmware) and a message queue with backpressure on the bridge.
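Both fixes are small. The sketch below shows the idea in JavaScript for consistency with the rest of the post; the device firmware is not necessarily JavaScript, and `reconnectDelayMs` and `BoundedQueue` are hypothetical names. The queue signals backpressure by refusing new items when full, leaving the caller to decide whether to drop, retry, or pause.

```javascript
// Random reconnect delay in [0, maxMs) so devices do not reconnect in lockstep.
// The random source is injectable to keep the function testable.
function reconnectDelayMs(maxMs = 5000, random = Math.random) {
  return Math.floor(random() * maxMs)
}

// Bounded FIFO queue. offer() returns false when the queue is full.
class BoundedQueue {
  constructor(capacity) {
    this.capacity = capacity
    this.items = []
  }

  offer(item) {
    if (this.items.length >= this.capacity) return false
    this.items.push(item)
    return true
  }

  // Removes and returns the oldest item, or undefined when empty.
  poll() {
    return this.items.shift()
  }

  get depth() {
    return this.items.length
  }
}
```

For location updates, dropping on a full queue is usually acceptable, since the next update supersedes the lost one. Events deserve a retry.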

The Database Write Problem

We were writing every location update to PostgreSQL in real-time. At 2 updates/second per device x 400 devices, that is 800 writes/second. Postgres handled it, but barely, and query latency spiked. Solution: write location to Redis (fast, ephemeral) for real-time display; batch-write to Postgres every 30 seconds for historical queries. Different data, different storage, different access patterns.

Real-time and persistent are different requirements. Don't force the same storage layer to serve both.
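A minimal sketch of the batching side, assuming every update is kept for history. At 800 updates/second, a 30-second window is about 24,000 rows (800 x 30) written in one batch instead of 24,000 separate writes. `LocationBatcher` and the injected `writeBatch` function are hypothetical; in practice `writeBatch` would issue a multi-row insert.

```javascript
// Buffers location updates in memory and hands them off in one batch.
class LocationBatcher {
  constructor(writeBatch) {
    this.writeBatch = writeBatch   // hypothetical: (rows) => one multi-row INSERT
    this.pending = []
  }

  add(update) {
    this.pending.push(update)
  }

  // Writes everything buffered so far; returns the number of rows handed off.
  flush() {
    if (this.pending.length === 0) return 0
    const rows = this.pending
    this.pending = []              // swap first so new updates land in a fresh buffer
    this.writeBatch(rows)
    return rows.length
  }
}

// Wiring (insertLocations is a hypothetical database helper):
// const batcher = new LocationBatcher(rows => insertLocations(rows))
// setInterval(() => batcher.flush(), 30_000)
```

The sketch omits error handling: if the write fails, the batch is gone. A real version would requeue or spill failed batches, and would accept losing up to one window of history on a crash.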

Monitoring

A real-time system that breaks silently is the worst outcome. We instrument:

MQTT broker: connected clients, message rate, dropped connections
Bridge service: queue depth, processing latency, Redis publish errors
WebSocket servers: connected clients per instance, message broadcast latency
End-to-end: synthetic device-to-browser latency measured every 30 seconds

The end-to-end synthetic test is the most valuable. It's the only metric that catches cascading failures across multiple layers simultaneously.
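The measurement itself is simple: a synthetic device publishes a probe carrying its send time, a synthetic browser client records when it arrives, and the samples feed an average and a percentile. The sketch below assumes sender and receiver share a clock (for example, both run on the same monitoring host); `probeLatencyMs`, `percentile`, and the `sentAt` field are hypothetical names.

```javascript
// Latency of one synthetic probe, given the send time embedded in the message.
function probeLatencyMs(probe, receivedAt = Date.now()) {
  return receivedAt - probe.sentAt
}

// Nearest-rank percentile: the smallest sample such that at least
// p percent of all samples are less than or equal to it.
function percentile(samples, p) {
  if (samples.length === 0) throw new Error('no samples')
  const sorted = [...samples].sort((a, b) => a - b)
  const rank = Math.ceil((p / 100) * sorted.length)
  return sorted[Math.max(rank, 1) - 1]
}

// Usage (latencies is an assumed array of recent probe results):
// if (percentile(latencies, 99) > 1000) alert('sub-second latency budget exceeded')
```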

Numbers

Peak concurrent devices: ~600
Peak concurrent browser clients: ~120
Average device to browser latency: 180ms
p99 device to browser latency: 420ms
Uptime over 12 months: 99.94%

The full stack: Mosquitto as MQTT broker, Node.js for the bridge and WebSocket servers, Redis 7 for Pub/Sub and hot data, PostgreSQL for historical data, React on the frontend with a custom WebSocket hook, deployed on self-hosted VMs behind Nginx. No managed services: the client's security requirements mandated on-premise hosting.