July 20, 2025 · 7 min read

The Nginx Buffering Bug That Broke SSE for Two Days

Server-Sent Events in production: how a single missing response header silenced every progress stream, and what the connection registry, heartbeat, and cancellation architecture looks like once you've fixed the obvious thing.

SSE, Node.js, Real-time, TypeScript, Backend

The streaming worked perfectly in development. In production, behind Nginx, every SSE connection opened successfully, stayed open, and delivered exactly zero events until the job completed — at which point all the progress messages arrived simultaneously.

Day one, I confirmed the EventSource connection was opening — visible in the network tab, status 200, response streaming. No events arriving in real time but the connection was alive. I added server-side logging to verify events were actually being sent. They were. The server was writing to the stream; nothing was reaching the client. I suspected CORS, added headers, no change. Suspected a firewall rule on the VPS blocking streaming responses. Spent time there. Nothing.

Late on day one, I tried bypassing Nginx entirely — hitting the Node.js process directly on its port. The events arrived in real time immediately.

So it was Nginx. I read the Nginx docs more carefully than I had before. Nginx buffers proxied responses by default (proxy_buffering is on). For a JSON endpoint, this is desirable. For SSE, it means small events sit in the buffer, and Nginx forwards them only when the buffer fills or the upstream response ends. A handful of progress messages never fills it. The events are not lost, but they arrive in one batch at the end, which is the exact opposite of the point.

X-Accel-Buffering: no

That is a response header, set by the Node.js application on the SSE response. It tells Nginx to pass that response through without buffering. Two days of investigation, one line of fix.
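
A minimal sketch of where the header goes, assuming a Fetch-style handler that returns a `Response`. The helper name `sseHeaders` is hypothetical, and the other two headers are the usual SSE set rather than something the fix depends on:

```typescript
// Hypothetical helper: the headers an SSE response would typically carry.
function sseHeaders(): Record<string, string> {
  return {
    "Content-Type": "text/event-stream",
    "Cache-Control": "no-cache",
    // Tells Nginx not to buffer this particular response.
    "X-Accel-Buffering": "no",
  };
}

// Usage in a handler (stream creation omitted):
//   return new Response(readable, { headers: sseHeaders() });
```

Setting the header on the response keeps the change scoped to the SSE endpoint. `proxy_buffering off` in the relevant Nginx location block is the config-side equivalent.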

This is what the rest of the SSE infrastructure looks like — the parts that become the actual work once the proxy is out of the way.

The Connection Registry

The first structural problem: the SSE response lives in one request handler, but job progress comes from a worker that has no reference to that response.

The answer is a global registry — a singleton Map that stores active connections keyed by channel ID. The channel ID is generated when the job starts, returned to the client, and used by the client to open the SSE connection. The job worker looks up the connection and writes to it directly:

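class ConnectionRegistry {
  private static instance: ConnectionRegistry | undefined;
  private connections = new Map<string, SSEConnection>();

  private constructor() {}

  // One shared registry per process.
  static getInstance(): ConnectionRegistry {
    return (ConnectionRegistry.instance ??= new ConnectionRegistry());
  }

  register(channelId: string, connection: SSEConnection): void {
    this.connections.set(channelId, connection);
  }

  get(channelId: string): SSEConnection | undefined {
    return this.connections.get(channelId);
  }

  remove(channelId: string): void {
    this.connections.delete(channelId);
  }
}

const registry = ConnectionRegistry.getInstance();

// Job start endpoint
const channelId = crypto.randomUUID();
// Fire and forget: the job outlives this request, so it is not awaited.
// Catch rejections so a failed job does not become an unhandled rejection.
startJob(jobParams, channelId).catch((error) => console.error("job failed", error));
return Response.json({ channelId });

// Inside the job worker
const connection = registry.get(channelId);
// The client may not have connected yet, hence the optional call.
connection?.send({ type: "progress", percent: 42 });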

This only works when the client always hits the same server instance. Horizontal scaling means moving the registry to Redis pub/sub — the SSE endpoint subscribes to a channel, the job worker publishes to it. The interface stays the same.
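
To make "the interface stays the same" concrete, here is a sketch of the publish/subscribe shape with an in-memory stand-in. `ChannelBus` and `InMemoryBus` are hypothetical names, not part of the system described here. A Redis-backed class would implement the same two methods, and no Redis client calls are shown:

```typescript
type Listener = (message: string) => void;

// Hypothetical transport interface: the SSE endpoint subscribes, the worker publishes.
interface ChannelBus {
  subscribe(channelId: string, listener: Listener): () => void;
  publish(channelId: string, message: string): void;
}

// In-memory stand-in, valid for a single process only.
class InMemoryBus implements ChannelBus {
  private listeners = new Map<string, Set<Listener>>();

  subscribe(channelId: string, listener: Listener): () => void {
    const set = this.listeners.get(channelId) ?? new Set<Listener>();
    set.add(listener);
    this.listeners.set(channelId, set);
    // Return an unsubscribe function for cleanup on disconnect.
    return () => {
      set.delete(listener);
      // Only drop the map entry if it still points at this set.
      if (set.size === 0 && this.listeners.get(channelId) === set) {
        this.listeners.delete(channelId);
      }
    };
  }

  publish(channelId: string, message: string): void {
    this.listeners.get(channelId)?.forEach((listener) => listener(message));
  }
}

// SSE endpoint side: forward published messages to the open connection.
const bus: ChannelBus = new InMemoryBus();
const received: string[] = [];
const unsubscribe = bus.subscribe("job-1", (message) => received.push(message));

// Worker side: publish without holding a reference to the response.
bus.publish("job-1", JSON.stringify({ type: "progress", percent: 42 }));
unsubscribe();
// No subscriber is left, so this message goes nowhere.
bus.publish("job-1", JSON.stringify({ type: "progress", percent: 43 }));
```

The worker only ever sees `publish`, so swapping the transport does not touch job code.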

The SSEConnection Class

The connection wrapper handles writing, backpressure, and cleanup:

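class SSEConnection {
  private writer: WritableStreamDefaultWriter<Uint8Array>;
  private encoder = new TextEncoder();
  private closed = false;
  private paused = false;
  private queue: string[] = [];
  // Assigned where the heartbeat is started; cleared in close().
  private heartbeatTimer?: ReturnType<typeof setInterval>;

  constructor(writer: WritableStreamDefaultWriter<Uint8Array>) {
    this.writer = writer;
  }

  send(data: Record<string, unknown>, eventType?: string): void {
    if (this.closed) return;

    const lines = [`data: ${JSON.stringify(data)}`];
    if (eventType) lines.unshift(`event: ${eventType}`);
    // A blank line terminates the event; without it EventSource never dispatches.
    const payload = lines.join("\n") + "\n\n";

    if (this.paused) {
      this.queue.push(payload);
    } else {
      this.write(payload);
    }
  }

  pause(): void { this.paused = true; }

  resume(): void {
    this.paused = false;
    // Flush queued messages in arrival order.
    const pending = this.queue;
    this.queue = [];
    pending.forEach((msg) => this.write(msg));
  }

  close(): void {
    if (this.closed) return;
    this.closed = true;
    clearInterval(this.heartbeatTimer);
    // close() rejects if the stream has already errored (client gone); ignore that.
    this.writer.close().catch(() => {});
  }

  private write(payload: string): void {
    if (this.closed) return;
    this.writer.write(this.encoder.encode(payload)).catch(() => this.close());
  }
}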

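The queue handles backpressure at the application level: if the client signals pause via a control endpoint, messages accumulate rather than being dropped. For most jobs this code path never runs. It exists for the case where the client UI falls behind and needs to catch up. The queue is unbounded, so a client that pauses and never resumes keeps growing it for as long as the job keeps sending.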
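
If unbounded growth while paused is a concern, one option is to cap the queue. This is a sketch under an assumption the class above does not make: that dropping the oldest progress messages is acceptable. `BoundedQueue` is a hypothetical name:

```typescript
// Hypothetical bounded buffer for messages held while a connection is paused.
class BoundedQueue {
  private items: string[] = [];
  private dropped = 0;
  private readonly maxSize: number;

  constructor(maxSize: number) {
    this.maxSize = maxSize;
  }

  push(item: string): void {
    this.items.push(item);
    // Past the cap, discard the oldest entry: for progress events,
    // recent state usually matters more than full history.
    if (this.items.length > this.maxSize) {
      this.items.shift();
      this.dropped++;
    }
  }

  // Hand back everything queued, in order, and reset.
  drain(): { items: string[]; dropped: number } {
    const result = { items: this.items, dropped: this.dropped };
    this.items = [];
    this.dropped = 0;
    return result;
  }
}

const queue = new BoundedQueue(2);
["a", "b", "c"].forEach((msg) => queue.push(msg));
const drained = queue.drain(); // items: ["b", "c"], dropped: 1
```

The `dropped` count lets the server tell the client on resume that it missed updates, rather than skipping them silently.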

Heartbeat and Cancellation

SSE connections are HTTP responses held open. Idle connections get terminated by proxies, load balancers, and mobile networks, often after 30–60 seconds of silence (Nginx's proxy_read_timeout, for one, defaults to 60 seconds). A heartbeat prevents this:

this.heartbeatTimer = setInterval(() => {
  this.write(": heartbeat\n\n");
}, 20_000);

SSE comments — lines starting with : — are ignored by EventSource. They exist purely to keep the TCP connection alive. Twenty seconds is conservative; the goal is to stay well under the shortest timeout policy in the infrastructure chain.
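
To see why a comment frame reaches the client without producing an event, here is a deliberately simplified parser for the wire format. It is not how EventSource is implemented. `parseSSE` is a hypothetical name, and it handles only `event:`, `data:`, and comment lines:

```typescript
interface ParsedEvent {
  event: string;
  data: string;
}

// Simplified parser for a complete SSE text chunk. Covers `event:` and `data:`
// fields and comment lines only; `id:`, `retry:`, and CRLF endings are left out.
function parseSSE(chunk: string): ParsedEvent[] {
  const events: ParsedEvent[] = [];
  for (const block of chunk.split("\n\n")) {
    let event = "message"; // default event type
    const data: string[] = [];
    for (const line of block.split("\n")) {
      if (line.startsWith(":")) continue; // comment, e.g. a heartbeat
      if (line.startsWith("event:")) event = line.slice(6).trim();
      if (line.startsWith("data:")) data.push(line.slice(5).replace(/^ /, ""));
    }
    // A block with no data lines dispatches nothing.
    if (data.length > 0) events.push({ event, data: data.join("\n") });
  }
  return events;
}

// Two heartbeats around one real event.
const wire = ': heartbeat\n\ndata: {"percent":42}\n\n: heartbeat\n\n';
const parsed = parseSSE(wire);
```

Here `parsed` holds a single event. The heartbeats crossed the wire, which is all they need to do.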

Cancellation is more involved than pause. It needs to stop the underlying job, not just gate the stream. The job worker checks a cancellation flag at each iteration. For I/O-bound work like crawling, the await points are natural checkpoints with acceptable cancel latency. CPU-bound work needs more aggressive checking: a synchronous loop never yields to the event loop, so the handler that sets the flag cannot run until the loop deliberately pauses. The I/O-bound case:

for (const url of urls) {
  if (registry.isCancelled(channelId)) {
    connection?.send({ type: "cancelled" });
    connection?.close();
    registry.remove(channelId);
    return;
  }

  const result = await crawl(url);
  connection?.send({ type: "progress", url, result });
}
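
For reference, a minimal sketch of what could sit behind `cancel` and `isCancelled`, plus a CPU-bound loop that yields so the flag has a chance to be set. This assumes a per-process `Set` of cancelled channel IDs and a `setTimeout`-based yield. `CancellationFlags`, `yieldToEventLoop`, and `sumSquares` are hypothetical names:

```typescript
// Hypothetical flag store; the registry could hold one of these internally.
class CancellationFlags {
  private cancelled = new Set<string>();
  cancel(id: string): void { this.cancelled.add(id); }
  isCancelled(id: string): boolean { return this.cancelled.has(id); }
  clear(id: string): void { this.cancelled.delete(id); }
}

// Give pending callbacks (such as an abort handler) a turn on the event loop.
const yieldToEventLoop = (): Promise<void> =>
  new Promise((resolve) => setTimeout(resolve, 0));

// Stand-in for CPU-bound work: checks the flag every `checkEvery` items.
async function sumSquares(
  values: number[],
  channelId: string,
  flags: CancellationFlags,
  checkEvery = 1_000,
): Promise<number | "cancelled"> {
  let total = 0;
  let i = 0;
  for (const value of values) {
    if (i++ % checkEvery === 0) {
      await yieldToEventLoop();
      if (flags.isCancelled(channelId)) {
        flags.clear(channelId); // the worker owns flag cleanup once it has stopped
        return "cancelled";
      }
    }
    total += value * value;
  }
  return total;
}
```

`checkEvery` trades cancel latency against throughput: each yield costs a trip through the event loop, so checking on every item would slow the job noticeably.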

Cleanup on Disconnect

This was the last thing I wired up. It should have been the first.

If the client closes the tab, the TCP connection closes. Without cleanup, the registry leaks connections, the job worker keeps writing to a dead stream, and memory grows until the process restarts. With enough concurrent jobs, this happens faster than you'd expect:

request.signal.addEventListener("abort", () => {
  registry.cancel(channelId);
  registry.remove(channelId);
});

request.signal is the AbortSignal attached to the incoming request. It fires on client disconnect. Two lines that determine whether the system degrades gracefully or accumulates garbage. I added them last because disconnects were not something I was simulating in development. That was a mistake.
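
Disconnects can be simulated without a browser. The sketch below wraps the two cleanup calls in a hypothetical `cleanupOnAbort` helper and drives it with an `AbortController`. The `Cleanup` interface is an assumed minimal shape for the registry, not its actual definition:

```typescript
// Minimal shape of what the handler needs from the registry (assumed).
interface Cleanup {
  cancel(channelId: string): void;
  remove(channelId: string): void;
}

// Hypothetical helper: run cleanup exactly once when the request aborts.
function cleanupOnAbort(signal: AbortSignal, channelId: string, registry: Cleanup): void {
  const run = () => {
    registry.cancel(channelId);
    registry.remove(channelId);
  };
  // The client may already be gone by the time the listener is attached.
  if (signal.aborted) {
    run();
    return;
  }
  signal.addEventListener("abort", run, { once: true });
}

// Simulated disconnect: record which cleanup calls happen, in order.
const calls: string[] = [];
const fakeRegistry: Cleanup = {
  cancel: (id) => calls.push(`cancel:${id}`),
  remove: (id) => calls.push(`remove:${id}`),
};

const controller = new AbortController();
cleanupOnAbort(controller.signal, "job-1", fakeRegistry);
controller.abort(); // stands in for the client closing the tab
```

After `abort()`, `calls` contains the cancel followed by the remove. The `signal.aborted` check covers a client that disconnects before the handler finishes setting up.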

The Infrastructure Assumption

SSE does not fit the serverless model. Functions have execution time limits; long-running job streaming needs a persistent process. The pattern that works: a serverless function accepts the job request, generates a channel ID, hands the job off to a persistent worker, and returns the channel ID. The client opens the SSE connection directly to the persistent process.

The Nginx buffering issue is invisible in development and nearly invisible in debugging — the connection looks healthy, the server is writing events, everything appears correct until you check whether the events are actually arriving when they're supposed to. The fix is one header. Knowing to look for it takes the two days.