If you are moving from RabbitMQ or Kafka to NATS JetStream, you usually feel a sense of relief. The configuration is simpler, the binaries are smaller, and the performance is excellent.
But eventually, you hit the “Day 2” reality of event-driven architectures: Message Failures.
When a consumer fails to process a message, you can’t just drop it. You need a retry policy, and eventually, a Dead Letter Queue (DLQ). While NATS provides all the primitives to build this, it does not provide a “system” out of the box. You have to wire it yourself.
This guide walks through how to architect a robust DLQ flow using native JetStream features, and how to manually handle message replays when things go wrong.
nats-server -js)nats CLI installedIn JetStream, a DLQ isn’t a special checkbox; it’s just another Stream. To set this up correctly, you need a specific topology involving MaxDeliver, proper acknowledgment handling, and advisory message capture.
First, your consumer configuration determines how many times NATS attempts to deliver a message before giving up.
When you configure your consumer, pay attention to MaxDeliver and BackOff. A linear retry is rarely what you want. You want a geometric backoff to prevent thundering herd scenarios on your database.
nats consumer add ORDERS processing_worker \
--max-deliver=5 \
--backoff=1s,5s,1m,5m,10m
Important: The BackOff configuration only applies to acknowledgment timeouts, not to explicit NAK responses. When a consumer sends a Nak(), the message is redelivered immediately. To add delay on an explicit NAK, use NakWithDelay() in your client code:
// Immediate redelivery
msg.Nak()
// Delayed redelivery (recommended for transient failures)
msg.NakWithDelay(5 * time.Second)
Once MaxDeliver is reached, the consumer stops attempting to deliver that specific message. The message is not deleted—it remains in the stream but is simply skipped by that consumer. To capture these failures, we use JetStream’s advisory system.
When a message exhausts its delivery attempts, NATS publishes an advisory to:
$JS.EVENT.ADVISORY.CONSUMER.MAX_DELIVERIES.<STREAM>.<CONSUMER>
Similarly, when your application explicitly terminates a message using Term(), an advisory is published to:
$JS.EVENT.ADVISORY.CONSUMER.MSG_TERMINATED.<STREAM>.<CONSUMER>
Create a stream that captures these advisory messages:
nats stream add DLQ_ORDERS \
--subjects='$JS.EVENT.ADVISORY.CONSUMER.MAX_DELIVERIES.ORDERS.*,$JS.EVENT.ADVISORY.CONSUMER.MSG_TERMINATED.ORDERS.*' \
--storage=file \
--retention=limits \
--max-age=72h
The advisory payload looks like this:
{
"type": "io.nats.jetstream.advisory.v1.max_deliver",
"id": "EgGS1FReM5Yy1I7L9UXs4W",
"timestamp": "2023-12-24T14:38:13.107632319Z",
"stream": "ORDERS",
"consumer": "processing_worker",
"stream_seq": 6,
"deliveries": 5
}
The stream_seq field is critical—it tells you exactly which message failed, and you can retrieve it directly:
nats stream get ORDERS 6
Instead of waiting for MaxDeliver to be exhausted, your consumer can explicitly terminate a message using Term(). This immediately stops redelivery and publishes an advisory.
Use this when your application can determine a message is permanently unprocessable:
// Message is malformed or missing required data
if !isValidPayload(msg.Data()) {
msg.Term() // Immediately dead-letter, don't retry
return
}
Let’s assume you have a setup where failed messages have triggered advisories in DLQ_ORDERS.
It’s 3:00 AM. An invalid payload caused 500 messages to hit the DLQ. You’ve pushed a hotfix to the consumer code to handle this payload. Now, you need to replay those 500 messages.
NATS does not have a “Replay” button. You have to do this via the CLI or a custom script. Here is the manual workflow for replaying messages without data loss.
First, view the advisory messages to understand what failed:
nats stream view DLQ_ORDERS 10
Each advisory contains the stream_seq of the original message. To inspect the actual failed message:
nats stream get ORDERS <stream_seq>
Look at the headers and payload to understand why processing failed.
You need to read the failed messages and publish them back to the original subject (orders.created), not the DLQ subject.
Note: The nats stream backup and nats stream restore commands are designed for disaster recovery, not for selective message replay. They create binary snapshots and restore to the same stream, which isn’t what we want here.
For a single message, you can replay it directly:
seq=6 # The stream_seq from the advisory
SOURCE_STREAM="ORDERS"
msg=$(nats stream get $SOURCE_STREAM "$seq" --json)
subject=$(echo "$msg" | jq -r '.subject')
echo "$msg" | jq -r '.data' | base64 -d | nats pub "$subject"
For batch replay of all DLQ advisories:
#!/bin/bash
# Replay all failed messages from DLQ advisories
DLQ_STREAM="DLQ_ORDERS"
SOURCE_STREAM="ORDERS"
# Get both first and last sequence numbers
info=$(nats stream info "$DLQ_STREAM" --json)
first_seq=$(echo "$info" | jq -r '.state.first_seq')
last_seq=$(echo "$info" | jq -r '.state.last_seq')
for i in $(seq "$first_seq" "$last_seq"); do
# Get advisory and extract source stream sequence
advisory=$(nats stream get "$DLQ_STREAM" "$i" --json 2>/dev/null) || continue
stream_seq=$(echo "$advisory" | jq -r '.data' | base64 -d | jq -r '.stream_seq')
# Get original message
original=$(nats stream get "$SOURCE_STREAM" "$stream_seq" --json 2>/dev/null) || continue
subject=$(echo "$original" | jq -r '.subject')
data=$(echo "$original" | jq -r '.data' | base64 -d)
# Republish
echo "$data" | nats pub "$subject"
echo "Replayed seq=$stream_seq to $subject"
done
If you are doing this manually during an incident, you face several issues:
Deduplication edge case: If you use explicit Nats-Msg-Id headers for exactly-once publishing AND you’re replaying within the dedup window (default 2 minutes), the message will be rejected. In practice, DLQ messages are usually hours old, so this rarely matters.
Ordering: If strict ordering matters, replaying from a script breaks the global order relative to new incoming messages.
Visibility: While running your script, you have no visual confirmation of the drain rate other than repeatedly running nats stream info.
There’s a subtle but serious issue with the replay workflow above: the original message is still in the source stream.
When you replay a message by publishing it to the original subject, you’re creating a new message. If multiple consumers subscribe to the same stream—common in microservice architectures where different services react to the same events—this causes duplicate processing.
Consider this scenario:
ORDERS stream has two consumers: billing_worker and notification_workerbilling_worker fails to process message seq=42 (database timeout) and it hits MaxDelivernotification_worker successfully processed seq=42 and sent a confirmation emailorders.creatednotification_worker sends a duplicate emailThis is particularly dangerous for side-effecting operations: duplicate charges, duplicate notifications, or duplicate API calls to external systems.
Configure your consumers to subscribe to multiple subjects, including a replay-specific one:
# Original consumer subscribes to both subjects
nats consumer add ORDERS billing_worker \
--filter='orders.created,orders.created.replay.billing'
When replaying, publish to the consumer-specific replay subject:
echo "$data" | nats pub "orders.created.replay.billing"
This ensures only the failed consumer receives the replayed message.
Add metadata to replayed messages so consumers can detect and handle them:
# Add replay headers when republishing
echo "$data" | nats pub "$subject" \
--header="X-Replay: true" \
--header="X-Original-Seq: $stream_seq" \
--header="X-Target-Consumer: billing_worker"
Consumers check the header and skip if they’re not the target:
if msg.Header.Get("X-Replay") == "true" {
target := msg.Header.Get("X-Target-Consumer")
if target != "" && target != consumerName {
msg.Ack() // Not for us, acknowledge and skip
return
}
}
The most robust solution is designing consumers to be idempotent from the start. Store a processed message log keyed by stream_seq or a business-level identifier:
func processOrder(msg jetstream.Msg) error {
orderID := extractOrderID(msg.Data())
// Check if already processed
if wasProcessed(orderID) {
return msg.Ack() // Idempotent: safe to skip
}
// Process and mark as done atomically
if err := processAndMarkDone(orderID, msg.Data()); err != nil {
return msg.NakWithDelay(5 * time.Second)
}
return msg.Ack()
}
This protects against replays, network retries, and JetStream’s at-least-once delivery semantics.
The workflow above—detect, inspect, script, replay—is standard practice. It works. But it is slow, and writing one-off scripts during an outage is stress-inducing.
This is exactly why we built StreamTrace.
We realized that managing NATS JetStream shouldn’t require writing bash scripts at 3 AM just to move a message from one stream to another. StreamTrace is a lightweight UI that sits on top of your NATS clusters.
It turns the “Hard Way” into a UI workflow:
nats stream info.It is a self-hosted Docker container that connects directly to your NATS server. No cloud data leakage, no complex setup.
If you are tired of writing restoration scripts, check out Stream Trace.