Deploying CritterWatch as a cluster
CritterWatch is single-node by default — no extra configuration needed for small deployments. Switching to a horizontally-scaled cluster (2+ BFF nodes behind a load balancer) is opt-in and additive.
Default-on since #237.
AddCritterWatchServices(...) now turns on cluster partitioning by default; supply configureClusterShardedTopology matching your transport mix. Pass enableClusterPartitioning: false to opt out (mostly relevant for integration tests that call DisableAllExternalWolverineTransports()). The producer side is wired symmetrically via AddCritterWatchMonitoring(..., configureShardedTopology: ...). The legacy single critterwatch listener is still kept by the BFF for backwards compatibility with monitored services that haven't opted in to the producer-side hook. It's ListenOnlyAtLeader()-pinned, so a multi-node BFF routes legacy traffic through exactly one node.
What changes when you cluster
| Concern | Single node | Clustered |
|---|---|---|
| Per-service single-writer | local (in-process slots) | GlobalPartitioned distributes by service id; no two nodes process the same service's updates concurrently |
| Periodic alert evaluators / metrics scrapers | run in-process | publish a tick on every node; the matching LocalQueueFor<Tick>().ListenOnlyAtLeader() handler runs on the elected leader only — no duplicate alerts, scrapes, or Slack/email/webhook side effects |
| SignalR fan-out across browsers | in-process hub | Redis backplane fans out every server-pushed message to all connected clients across nodes |
| Marten async daemon | self-distributes via Wolverine-managed subscription distribution (unchanged) | same |
Enabling the Redis SignalR backplane
The backplane is config-driven: set a redis connection string and CritterWatchHostingExtensions.AddCritterWatch automatically chains AddStackExchangeRedis() onto the SignalR builder. Absent the connection string, single-node SignalR (the default) keeps working.
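The conditional wiring described above can be sketched as follows. This is a minimal illustration that assumes AddCritterWatch behaves roughly this way internally; the real implementation may differ. AddStackExchangeRedis comes from the Microsoft.AspNetCore.SignalR.StackExchangeRedis package, and "redis" is the connection-string name used on this page.

```csharp
using Microsoft.AspNetCore.Builder;
using Microsoft.Extensions.Configuration;
using Microsoft.Extensions.DependencyInjection;

var builder = WebApplication.CreateBuilder(args);

// Resolves ConnectionStrings:redis from appsettings.json or the
// ConnectionStrings__redis environment variable.
var redis = builder.Configuration.GetConnectionString("redis");

var signalR = builder.Services.AddSignalR();
if (!string.IsNullOrWhiteSpace(redis))
{
    // Connection string present: scale out through the Redis backplane.
    signalR.AddStackExchangeRedis(redis);
}
// Otherwise SignalR stays in single-node, in-process mode.

var app = builder.Build();
app.Run();
```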
Aspire (dev / test)
The Aspire BffHost declares a redis resource and references it from the BFF — Aspire injects ConnectionStrings__redis automatically, so a clustered dev run is dotnet run from src/BffHost with no extra knobs.
Docker Compose
docker-compose.yml ships a redis:7-alpine service on the default port. Hosts running outside Aspire set the connection string in appsettings.json (or via ConnectionStrings__redis):
{
  "ConnectionStrings": {
    "redis": "localhost:6379"
  }
}

Azure SignalR (opt-in, documented-and-supported)
Wolverine.SignalR uses the standard ASP.NET Core Hub / IHubContext, so any scale-out provider that hooks into the SignalR DI builder fans out cross-node with no broadcast-code changes. To use Azure SignalR Service instead of the Redis backplane, do not set the redis connection string and add Azure SignalR alongside AddCritterWatch:
builder.AddCritterWatch(connectionString);
builder.Services.AddSignalR().AddAzureSignalR();

Exactly one backplane per deployment. CritterWatch only ships the Redis integration out of the box; Azure SignalR is documented but not bundled or CI-exercised.
Enabling global partitioning (per-service single-writer)
The Redis backplane handles fan-out of outbound SignalR traffic. Global partitioning is the matching story on the inbound side: it guarantees that all updates for a given monitored service land on a single BFF node, cluster-wide, so two BFF nodes never race to project the same ServiceSummary aggregate. Since #237 it is on by default on the BFF side; single-node deployments don't need it, and enableClusterPartitioning: false opts out.
Pick N = your expected BFF node count
The integer N you pass to UseSharded…Queues(...) is the partition count: that many physical sharded queues are declared on the transport, and Wolverine hashes each message's group id (the monitored service's ServiceName / Id) mod N to decide which slot it lands on. Each slot is owned by exactly one BFF node.
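The hash-mod-N routing just described can be sketched in a few lines. FNV-1a is used here only because it is stable across processes; the hash Wolverine actually applies to group ids is not specified on this page, and Fnv1a and SlotFor are hypothetical helper names.

```csharp
using System;
using System.Text;

// Stable 32-bit FNV-1a hash. A stand-in chosen for illustration only:
// the hash Wolverine really applies to group ids is an assumption here.
static uint Fnv1a(string text)
{
    uint hash = 2166136261;
    foreach (byte b in Encoding.UTF8.GetBytes(text))
    {
        hash ^= b;
        hash *= 16777619; // wraps on overflow in the default unchecked context
    }
    return hash;
}

// Slot index in [0, partitionCount) for a monitored service's group id.
static int SlotFor(string groupId, int partitionCount)
{
    if (partitionCount <= 0)
        throw new ArgumentOutOfRangeException(nameof(partitionCount));
    return (int)(Fnv1a(groupId) % (uint)partitionCount);
}

// Every message for the same service lands on the same slot.
foreach (var service in new[] { "orders", "billing", "inventory" })
{
    Console.WriteLine($"{service} -> slot {SlotFor(service, 5)}");
}
```

Because the slot depends on both the group id and N, changing N generally moves services to different slots.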
Set N to the number of BFF nodes you expect to run. With N = 5 and 3 nodes, two nodes carry two slots each and one carries one — load skews slightly but every node has work. With N smaller than your node count, some BFF nodes sit idle (Wolverine assigns each slot to a single node). With N much larger than your node count, you pay overhead for queues you don't need.
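The slot arithmetic above can be checked with a small sketch. It assumes slots are handed out round-robin, which is an illustrative simplification and not necessarily how Wolverine assigns them; SlotsPerNode is a hypothetical helper.

```csharp
using System;

// Assigns slot i to node (i mod nodeCount) and returns how many slots each
// node owns. Round-robin is an assumption made for illustration; the only
// property taken from the text is that each slot has exactly one owner.
static int[] SlotsPerNode(int partitionCount, int nodeCount)
{
    var counts = new int[nodeCount];
    for (var slot = 0; slot < partitionCount; slot++)
    {
        counts[slot % nodeCount]++;
    }
    return counts;
}

Console.WriteLine(string.Join(",", SlotsPerNode(5, 3))); // 2,2,1
Console.WriteLine(string.Join(",", SlotsPerNode(2, 3))); // 1,1,0 (one node idle)
Console.WriteLine(string.Join(",", SlotsPerNode(3, 3))); // 1,1,1
```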
N must agree exactly between the producer side and the consumer side — mismatched hashes route to slots the consumer isn't listening on. Pick one value and centralise it (a shared constant, environment variable, or the service-handshake mechanism the BFF already uses for capability negotiation).
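One way to centralise N is an environment variable that both the BFF and every monitored service read at startup. The sketch below assumes a variable named CRITTERWATCH_PARTITION_COUNT, which is hypothetical and not something CritterWatch defines; ResolvePartitionCount is likewise an illustrative helper.

```csharp
using System;

// Hypothetical variable name; pick whatever fits your deployment tooling.
const string PartitionCountVariable = "CRITTERWATCH_PARTITION_COUNT";

// Reads the shared partition count and fails fast on a missing or invalid
// value, so a producer and the BFF cannot silently drift apart.
static int ResolvePartitionCount(string variableName)
{
    var raw = Environment.GetEnvironmentVariable(variableName);
    if (!int.TryParse(raw, out var count) || count < 1)
    {
        throw new InvalidOperationException(
            $"{variableName} must be a positive integer, but was '{raw}'.");
    }
    return count;
}

Environment.SetEnvironmentVariable(PartitionCountVariable, "5"); // demo only
var partitionCount = ResolvePartitionCount(PartitionCountVariable);
Console.WriteLine($"Partition count: {partitionCount}"); // Partition count: 5
```

The resolved value would then be passed as the second argument to the UseSharded…Queues(...) call on both sides instead of a literal.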
Wire the consumer side (BFF)
opts.AddCritterWatchServices(
    NpgsqlDataSource.Create(connectionString),
    // enableClusterPartitioning defaults to true (#237), so it is omitted
    // here. Pass false to opt out.
    configureClusterShardedTopology: topology =>
    {
        // Mix and match per the transports this BFF actually uses.
        topology.UseShardedRabbitQueues("critterwatch", 5);
        // topology.UseShardedAmazonSqsQueues("critterwatch", 5);
        // topology.UseShardedAzureServiceBusQueues("critterwatch", 5);
    });

Without the callback, AddCritterWatchServices throws ArgumentNullException("configureClusterShardedTopology", "… UseShardedRabbitQueues …"). Wolverine's GlobalPartitionedMessageTopology.AssertValidity() requires that a sharded external topology be registered at the same time as the message subscription, and a parameter-anchored error makes the missing argument obvious instead of bubbling up Wolverine's deeper "external transport topology must be configured" message.
Wire the producer side (every monitored service)
opts.AddCritterWatchMonitoring(
    critterWatchUri: new Uri("rabbitmq://queue/critterwatch"),
    systemControlUri: new Uri("rabbitmq://queue/my_service_control"),
    configureShardedTopology: topology =>
    {
        // Same value of N. Same transport-specific call.
        topology.UseShardedRabbitQueues("critterwatch", 5);
    });

Once both sides ship, ICritterWatchMessage traffic (ServiceUpdates, AgentHealthReport, ShardStatesChanged, …) flows over the sharded slots. Heartbeats (WolverineHeartbeat) and MessageHandlingMetrics keep flowing over the unsharded critterwatch URI you've always passed: the BFF deliberately doesn't shard those, and a sharded slot with no listener would dead-letter them.
What if I only ship one side?
The producer and consumer hooks are independent rollouts. Both have a default-off path so half-finished migrations are graceful:
| Producer | Consumer | What happens |
|---|---|---|
| sharded | sharded | Full per-service single-writer. Recommended for multi-BFF deployments. |
| sharded | single (default) | Producer's ICritterWatchMessage lands on the sharded slots but no BFF is listening on them — messages stall on the broker. Don't roll out the producer side until the BFF is on the matching N. |
| single (default) | sharded | BFF still listens on the legacy single critterwatch queue alongside the sharded slots. Older monitored services keep working untouched. Roll out the consumer side first. |
| single (default) | single (default) | Single-queue legacy path. The BFF's listener is ListenOnlyAtLeader()-pinned (see below) so multi-node BFFs don't split-brain on it. |
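The matrix above reduces to one rule: never run a sharded producer against a BFF that is not listening on the sharded slots. A minimal sketch of that rule as a pre-deployment check follows; RolloutOutcome and IsSafe are hypothetical helpers, not part of CritterWatch.

```csharp
using System;

// Encodes the rollout matrix: one outcome per producer/consumer combination.
static string RolloutOutcome(bool producerSharded, bool consumerSharded) =>
    (producerSharded, consumerSharded) switch
    {
        (true, true) => "full per-service single-writer",
        (true, false) => "messages stall on the broker",
        (false, true) => "legacy queue still served alongside sharded slots",
        (false, false) => "single-queue legacy path",
    };

// The only unsafe state is a sharded producer with an unsharded consumer.
static bool IsSafe(bool producerSharded, bool consumerSharded) =>
    !(producerSharded && !consumerSharded);

// Consumer-first rollout passes through safe states only.
Console.WriteLine(IsSafe(false, true)); // True
Console.WriteLine(IsSafe(true, true));  // True
// Producer-first rollout hits the stalled state.
Console.WriteLine(IsSafe(true, false)); // False
```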
Legacy single-queue listener is leader-pinned
Even without partitioning, the BFF's ListenToRabbitQueue("critterwatch") and ListenToSqsQueue("critterwatch") call .ListenOnlyAtLeader(). In a single-node deployment that's identical to the pre-leader-aware default (the sole node is the leader). In a multi-node deployment, only one node consumes the legacy queue at a time — preventing the optimistic-concurrency retry storms and split-brain ServiceSummary processing that competing consumers on a single queue would otherwise cause. The sharded slots stay leader-agnostic; only this back-compat queue is leader-pinned.
Load balancer requirements
Health endpoints are LB-appropriate (each node serves /health); the boot-smoke CI gate (#216) asserts the same endpoint reports Healthy.
No sticky sessions required. The Redis backplane fans every SignalR send to every node, so a client that connects to node B receives updates produced on node A. The same property holds for Azure SignalR. Configure the LB for plain round-robin (or least-connections) over WebSocket — sticky sessions add no value and can mask backplane misconfiguration.
Cluster correctness audit (#217)
The 7 BackgroundServices in CritterWatch.Services are classified as follows:
| Service | Classification | Why |
|---|---|---|
| MetricsAlertEvaluator | cluster-singleton | persists alert records + publishes lifecycle messages |
| ProjectionAlertEvaluator | cluster-singleton | same shape |
| PrometheusScrapingService | cluster-singleton | external HTTP fetch + persistence |
| MetricsIdleReEvaluator | cluster-singleton | re-publishes rollups |
| StateRefreshService | per-node OK | refreshes its own connected clients; backplane fans out |
| AlertBatchAccumulator | per-node OK | batches what this node received; backplane fans out |
| SignalRBatchAccumulator | per-node OK | same shape |
