Cluster Peer Authentication
When you run more than one Hafiz node, the peers talk to each other over HTTP on /cluster/join, /cluster/heartbeat, and /cluster/message. Those endpoints bypass SigV4 on purpose — SigV4 is a client-side protocol and it makes no sense for internal gossip. That leaves an authentication gap that Hafiz closes with a dedicated HMAC layer.
What's protected
Every peer request is wrapped in a SignedEnvelope:
{
"v": 1,
"ts": 1713000000,
"nonce": "<16-byte base64 random>",
"payload": "<serialized ClusterMessage>",
"sig": "<HMAC-SHA256 over v1\\n{ts}\\n{nonce}\\n{payload}, base64>"
}
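As a rough illustration of how a sender might build such an envelope, here is a minimal Python sketch. It follows the field layout and MAC input shown above, but several details are assumptions rather than Hafiz's actual implementation: the function name `sign_envelope` is hypothetical, the secret is used directly as raw key bytes, and standard (not URL-safe) base64 is assumed for both `nonce` and `sig`.

```python
import base64
import hashlib
import hmac
import json
import os
import time


def sign_envelope(secret: bytes, payload: str, ts=None, nonce=None) -> dict:
    """Wrap an already-serialized payload string in a v1 envelope (sketch)."""
    ts = int(time.time()) if ts is None else ts
    if nonce is None:
        # 16 random bytes, base64-encoded, as described for the nonce field.
        nonce = base64.b64encode(os.urandom(16)).decode("ascii")
    # MAC input as documented: "v1", ts, nonce, payload joined by newlines.
    mac_input = f"v1\n{ts}\n{nonce}\n{payload}".encode("utf-8")
    sig = hmac.new(secret, mac_input, hashlib.sha256).digest()
    return {
        "v": 1,
        "ts": ts,
        "nonce": nonce,
        "payload": payload,
        "sig": base64.b64encode(sig).decode("ascii"),
    }


# Example: sign a placeholder payload with a placeholder secret.
envelope = sign_envelope(b"example-secret", '{"type":"heartbeat"}')
print(json.dumps(envelope, indent=2))
```

Because `ts` and `nonce` are part of the MAC input, changing either one after signing invalidates the signature.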
Receivers verify:
- Signature. Re-computed with the shared secret; compared in constant time.
- Freshness. ±300 seconds against the receiver's clock. Replays past that window are rejected.
- Version. Only v=1 is accepted today. Bumping the version lets us change the MAC input format without silently re-accepting old senders.
A wrong secret, a tampered payload, a modified timestamp, or a stale envelope each produces HTTP 401.
Rollout
Set HAFIZ_CLUSTER_SHARED_SECRET to the same value on every node. With docker-compose.cluster.yml the variable is required: compose refuses to start without it, so you can't accidentally deploy an unauthenticated cluster. Generate a secret and start the cluster with:
echo "HAFIZ_CLUSTER_SHARED_SECRET=$(openssl rand -hex 32)" >> .env
docker compose -f docker-compose.cluster.yml up -d
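Where `openssl` is not available, a secret of the same shape (32 random bytes as 64 hex characters) can be produced with Python's standard library. This is only an equivalent sketch; it assumes Hafiz accepts any sufficiently long secret string and does not require one generated by a particular tool.

```python
import secrets

# 32 random bytes rendered as 64 hex characters, like `openssl rand -hex 32`.
secret = secrets.token_hex(32)

# Print a line suitable for appending to the .env file.
print(f"HAFIZ_CLUSTER_SHARED_SECRET={secret}")
```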
Behavior when unset (legacy mode)
If you leave HAFIZ_CLUSTER_SHARED_SECRET empty, Hafiz still boots, but the startup logs print a loud warning:
WARN cluster peer auth DISABLED — HAFIZ_CLUSTER_SHARED_SECRET is unset.
Any network peer knowing the cluster name can inject messages.
That mode is kept purely for staged migrations — set the secret on every node, restart in a rolling order, done. Don't ship production without it.
Observability
On boot with the secret set:
When a stale-version node or an attacker probes the cluster endpoints:
WARN cluster: unsigned payload rejected (secret configured) — body[0..120]=…
WARN cluster: envelope rejected: cluster peer auth: signature mismatch
WARN cluster: envelope rejected: cluster peer auth: envelope timestamp skew too large (600s)
The first 120 chars of the rejected body are logged so an operator can tell whether the source is a legit misconfigured peer or an attacker probe.
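To see at a glance which kind of rejection dominates, an operator could tally these warnings by reason. The sketch below is hypothetical tooling, not part of Hafiz: `tally_rejections` and the reason labels are made-up names, and the patterns are derived only from the sample lines above, so the real log format may differ.

```python
import re
from collections import Counter

# Patterns inferred from the sample log lines; adjust to the actual format.
REJECTION_PATTERNS = {
    "unsigned": re.compile(r"cluster: unsigned payload rejected"),
    "signature_mismatch": re.compile(
        r"cluster: envelope rejected: .*signature mismatch"
    ),
    "timestamp_skew": re.compile(
        r"cluster: envelope rejected: .*timestamp skew too large"
    ),
}


def tally_rejections(lines):
    """Count rejection warnings by reason; unrelated lines are ignored."""
    counts = Counter()
    for line in lines:
        for reason, pattern in REJECTION_PATTERNS.items():
            if pattern.search(line):
                counts[reason] += 1
                break
    return counts


sample = [
    "WARN cluster: unsigned payload rejected (secret configured)",
    "WARN cluster: envelope rejected: cluster peer auth: signature mismatch",
    "WARN cluster: envelope rejected: cluster peer auth: "
    "envelope timestamp skew too large (600s)",
    "INFO unrelated line",
]
print(dict(tally_rejections(sample)))
```

A burst of timestamp-skew rejections from one peer usually points at clock drift, while signature mismatches from a known peer suggest a secret that was not updated.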
Rotating the secret
- Update HAFIZ_CLUSTER_SHARED_SECRET on every node.
- Rolling-restart the nodes in the same order they were first started (seed last).
- During the rolling restart, half the cluster temporarily speaks the new secret and half the old. Heartbeats fail on the mismatched pair until the last node restarts — this is expected and converges automatically.
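The temporary split during a rotation can be illustrated with a small simulation. Everything here is hypothetical: the node names, the restart order, and the secrets are placeholders, and a bare HMAC tag stands in for the full envelope check.

```python
import hashlib
import hmac


def tag(secret: bytes, message: bytes) -> bytes:
    return hmac.new(secret, message, hashlib.sha256).digest()


def can_heartbeat(sender_secret: bytes, receiver_secret: bytes) -> bool:
    """True if the receiver would accept a message signed by the sender."""
    msg = b"heartbeat"
    return hmac.compare_digest(tag(sender_secret, msg), tag(receiver_secret, msg))


def failing_pairs(nodes: dict) -> list:
    """List node pairs whose secrets differ, so their heartbeats fail."""
    names = sorted(nodes)
    return [
        (a, b)
        for i, a in enumerate(names)
        for b in names[i + 1:]
        if not can_heartbeat(nodes[a], nodes[b])
    ]


OLD, NEW = b"old-secret", b"new-secret"
nodes = {"node-a": OLD, "node-b": OLD, "node-c": OLD}

# Restart one node at a time; each restart picks up the new secret.
for name in ["node-a", "node-b", "node-c"]:
    nodes[name] = NEW
    print(f"after restarting {name}: failing pairs = {failing_pairs(nodes)}")
```

Mismatched pairs exist only while the cluster is mixed; once the last node restarts, the list is empty again.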
Threat model
What this protects against:
- Rogue node on the same L2/L3 network injecting forged heartbeats or replication events.
- Someone who scraped HAFIZ_CLUSTER_NAME from logs or config files.
- Replay of a captured heartbeat past 5 minutes.
What this does NOT replace:
- Transport encryption: pair with cluster_tls_enabled = true for wire confidentiality.
- Admin-plane auth: the /api/v1/cluster/* admin endpoints still require SigV4.
- Node-identity auth: two nodes that both hold the secret are both trusted; use mTLS if you need per-node identity.
Related
- Security Architecture — where peer auth fits in the threat model.
- TLS Configuration — transport-level encryption for the peer channel.