Deploying CritterWatch as a cluster

CritterWatch is single-node by default, and small deployments need no extra configuration. Switching to a horizontally scaled cluster (2+ BFF nodes behind a load balancer) is opt-in and additive.

Default-on since #237. AddCritterWatchServices(...) now turns on cluster partitioning by default, so supply a configureClusterShardedTopology callback that matches your transport mix. Pass enableClusterPartitioning: false to opt out (mostly relevant for integration tests that call DisableAllExternalWolverineTransports()). The producer side is wired symmetrically via AddCritterWatchMonitoring(..., configureShardedTopology: ...). The BFF still keeps the legacy single critterwatch listener for backwards compatibility with monitored services that haven't opted in to the producer-side hook. That listener is pinned with ListenOnlyAtLeader(), so a multi-node BFF routes legacy traffic through exactly one node.

What changes when you cluster

| Concern | Single node | Clustered |
| --- | --- | --- |
| Per-service single-writer | local (in-process slots) | GlobalPartitioned distributes by service id; no two nodes process the same service's updates concurrently |
| Periodic alert evaluators / metrics scrapers | run in-process | publish a tick on every node; the matching LocalQueueFor<Tick>().ListenOnlyAtLeader() handler runs on the elected leader only, so there are no duplicate alerts, scrapes, or Slack/email/webhook side effects |
| SignalR fan-out across browsers | in-process hub | Redis backplane fans out every server-pushed message to all connected clients across nodes |
| Marten async daemon | self-distributes via Wolverine-managed subscription distribution (unchanged) | same |

Enabling the Redis SignalR backplane

The backplane is config-driven: set a redis connection string and CritterWatchHostingExtensions.AddCritterWatch automatically chains AddStackExchangeRedis() onto the SignalR builder. Absent the connection string, single-node SignalR (the default) keeps working.
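
The config-driven pattern described above can be sketched with plain ASP.NET Core calls. This is not CritterWatch's source and is not something to add next to AddCritterWatch (which already does the chaining for you). It only illustrates the "chain the backplane when the connection string is present" idea, assuming the Microsoft.AspNetCore.SignalR.StackExchangeRedis package is referenced:

```csharp
using Microsoft.AspNetCore.Builder;
using Microsoft.Extensions.Configuration;
using Microsoft.Extensions.DependencyInjection;

var builder = WebApplication.CreateBuilder(args);

// Reads ConnectionStrings:redis from appsettings.json, or the
// ConnectionStrings__redis environment variable.
var redis = builder.Configuration.GetConnectionString("redis");

var signalR = builder.Services.AddSignalR();

if (!string.IsNullOrWhiteSpace(redis))
{
    // Connection string present: attach the Redis backplane so sends
    // fan out to clients connected to other nodes.
    signalR.AddStackExchangeRedis(redis);
}
// Connection string absent: plain single-node SignalR, nothing else to do.

var app = builder.Build();
app.Run();
```

Keeping the decision on a single configuration key means the same binary runs single-node or clustered without a code change.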

Aspire (dev / test)

The Aspire BffHost declares a redis resource and references it from the BFF — Aspire injects ConnectionStrings__redis automatically, so a clustered dev run is dotnet run from src/BffHost with no extra knobs.

Docker Compose

docker-compose.yml ships a redis:7-alpine service on the default port. Hosts running outside Aspire set the connection string in appsettings.json (or via ConnectionStrings__redis):

jsonc
{
  "ConnectionStrings": {
    "redis": "localhost:6379"
  }
}

Azure SignalR (opt-in, documented-and-supported)

Wolverine.SignalR uses the standard ASP.NET Core Hub / IHubContext, so any scale-out provider that hooks into the SignalR DI builder fans out cross-node with no broadcast-code changes. To use Azure SignalR Service instead of the Redis backplane, do not set the redis connection string and add Azure SignalR alongside AddCritterWatch:

csharp
builder.AddCritterWatch(connectionString);
builder.Services.AddSignalR().AddAzureSignalR();

Exactly one backplane per deployment. CritterWatch only ships the Redis integration out of the box; Azure SignalR is documented but not bundled or CI-exercised.

Enabling global partitioning (per-service single-writer)

The Redis backplane handles fan-out of outbound SignalR traffic. Global partitioning is the matching story on the inbound side: it guarantees that all updates for a given monitored service land on a single BFF node, cluster-wide, so two BFF nodes never race to project the same ServiceSummary aggregate. On the BFF it is on by default since #237 and can be switched off with enableClusterPartitioning: false; on monitored services it is opt-in. Single-node deployments don't need it.

Pick N = your expected BFF node count

The integer N you pass to UseSharded…Queues(...) is the partition count: that many physical sharded queues are declared on the transport, and Wolverine hashes each message's group id (the monitored service's ServiceName / Id) mod N to decide which slot it lands on. Each slot is owned by exactly one BFF node.

Set N to the number of BFF nodes you expect to run. With N = 5 and 3 nodes, two nodes carry two slots each and one carries one — load skews slightly but every node has work. With N smaller than your node count, some BFF nodes sit idle (Wolverine assigns each slot to a single node). With N much larger than your node count, you pay overhead for queues you don't need.
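
To make the arithmetic concrete, here is a minimal sketch of hash-mod-N slot selection and of how slots spread across nodes. It is not Wolverine's implementation: the FNV-1a hash, the round-robin slot ownership, and the PartitionSketch name are all assumptions chosen for illustration.

```csharp
using System;
using System.Text;

// Illustrative only: Wolverine's real hashing and slot assignment may differ.
public static class PartitionSketch
{
    // Deterministic FNV-1a hash. string.GetHashCode() is randomized per
    // process, so it could not be used to agree on a slot across hosts.
    public static uint StableHash(string groupId)
    {
        unchecked
        {
            uint hash = 2166136261;
            foreach (byte b in Encoding.UTF8.GetBytes(groupId))
            {
                hash ^= b;
                hash *= 16777619;
            }
            return hash;
        }
    }

    // Slot index in [0, partitionCount) for a monitored service's group id.
    public static int SlotFor(string groupId, int partitionCount)
        => (int)(StableHash(groupId) % (uint)partitionCount);

    // Hypothetical round-robin ownership: slot i is owned by node i % nodeCount.
    // Returns how many slots each node ends up carrying.
    public static int[] SlotsPerNode(int partitionCount, int nodeCount)
    {
        var counts = new int[nodeCount];
        for (int slot = 0; slot < partitionCount; slot++)
        {
            counts[slot % nodeCount]++;
        }
        return counts;
    }
}
```

Under these assumptions, SlotsPerNode(5, 3) yields 2, 2, 1 (the slight skew described above), while SlotsPerNode(2, 3) yields 1, 1, 0, which leaves one node idle.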

N must agree exactly between the producer side and the consumer side. If the values differ, the producer hashes some messages to slots the consumer isn't listening on. Pick one value and centralise it (a shared constant, environment variable, or the service-handshake mechanism the BFF already uses for capability negotiation).
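
One way to centralise the value is a small shared helper that both the BFF and every monitored service reference. The sketch below is an assumption, not part of CritterWatch: the CritterWatchPartitioning class, the CRITTERWATCH_PARTITIONS variable name, and the fallback of 5 are all hypothetical.

```csharp
using System;

// Hypothetical shared helper; ship it in a package both sides reference.
public static class CritterWatchPartitioning
{
    // Assumed environment variable name, not defined by CritterWatch.
    public const string EnvVar = "CRITTERWATCH_PARTITIONS";

    // Fallback used when the variable is unset.
    public const int DefaultCount = 5;

    // Resolves N from the environment, failing fast on bad values so a
    // typo cannot silently produce a producer/consumer mismatch.
    // The lookup parameter exists only to make the helper testable.
    public static int ResolveCount(Func<string, string?>? lookup = null)
    {
        lookup ??= name => Environment.GetEnvironmentVariable(name);

        string? raw = lookup(EnvVar);
        if (string.IsNullOrWhiteSpace(raw))
        {
            return DefaultCount;
        }

        if (!int.TryParse(raw, out int count) || count < 1)
        {
            throw new InvalidOperationException(
                $"{EnvVar} must be a positive integer, but was '{raw}'.");
        }

        return count;
    }
}
```

Both sides would then pass CritterWatchPartitioning.ResolveCount() as the second argument to the same UseSharded…Queues(...) call instead of repeating a literal.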

Wire the consumer side (BFF)

csharp
opts.AddCritterWatchServices(
    NpgsqlDataSource.Create(connectionString),
    // enableClusterPartitioning defaults to true (#237) — listed here just
    // for completeness. Pass false to opt out.
    configureClusterShardedTopology: topology =>
    {
        // Mix and match per the transports this BFF actually uses.
        topology.UseShardedRabbitQueues("critterwatch", 5);
        // topology.UseShardedAmazonSqsQueues("critterwatch", 5);
        // topology.UseShardedAzureServiceBusQueues("critterwatch", 5);
    });

Without the callback (and with partitioning left enabled), AddCritterWatchServices throws ArgumentNullException("configureClusterShardedTopology", "… UseShardedRabbitQueues …"). Wolverine's GlobalPartitionedMessageTopology.AssertValidity() requires a sharded external topology to be registered at the same time as the message subscription, and a parameter-anchored error makes the missing argument obvious instead of surfacing Wolverine's deeper "external transport topology must be configured" message.

Wire the producer side (every monitored service)

csharp
opts.AddCritterWatchMonitoring(
    critterWatchUri: new Uri("rabbitmq://queue/critterwatch"),
    systemControlUri: new Uri("rabbitmq://queue/my_service_control"),
    configureShardedTopology: topology =>
    {
        // Same value of N. Same transport-specific call.
        topology.UseShardedRabbitQueues("critterwatch", 5);
    });

Once both sides ship, ICritterWatchMessage traffic (ServiceUpdates, AgentHealthReport, ShardStatesChanged, …) flows over the sharded slots. Heartbeats (WolverineHeartbeat) and MessageHandlingMetrics keep flowing over the unsharded critterwatch URI you've always passed — the BFF deliberately doesn't shard those, and a sharded slot with no listener would dead-letter them.

What if I only ship one side?

The producer and consumer hooks are independent rollouts. Each side can still run unsharded (the producer by omitting configureShardedTopology, the BFF by passing enableClusterPartitioning: false), so half-finished migrations are graceful:

| Producer | Consumer | What happens |
| --- | --- | --- |
| sharded | sharded (default since #237) | Full per-service single-writer. Recommended for multi-BFF deployments. |
| sharded | single (partitioning disabled) | Producer's ICritterWatchMessage traffic lands on the sharded slots but no BFF is listening on them, so messages stall on the broker. Don't roll out the producer side until the BFF is on the matching N. |
| single (default) | sharded (default since #237) | BFF still listens on the legacy single critterwatch queue alongside the sharded slots. Older monitored services keep working untouched. Roll out the consumer side first. |
| single (default) | single (partitioning disabled) | Single-queue legacy path. The BFF's listener is ListenOnlyAtLeader()-pinned (see below) so multi-node BFFs don't split-brain on it. |

Legacy single-queue listener is leader-pinned

Even without partitioning, the BFF's ListenToRabbitQueue("critterwatch") and ListenToSqsQueue("critterwatch") call .ListenOnlyAtLeader(). In a single-node deployment that's identical to the pre-leader-aware default (the sole node is the leader). In a multi-node deployment, only one node consumes the legacy queue at a time — preventing the optimistic-concurrency retry storms and split-brain ServiceSummary processing that competing consumers on a single queue would otherwise cause. The sharded slots stay leader-agnostic; only this back-compat queue is leader-pinned.

Load balancer requirements

Health endpoints are LB-appropriate (each node serves /health); the boot-smoke CI gate (#216) asserts the same endpoint reports Healthy.

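No sticky sessions required, provided browser clients connect over WebSockets only and skip negotiation; ASP.NET Core SignalR's fallback transports and the negotiate round trip need session affinity even behind a Redis backplane. The Redis backplane fans every SignalR send to every node, so a client that connects to node B receives updates produced on node A. The same property holds for Azure SignalR. Configure the LB for plain round-robin (or least-connections) over WebSocket. In that setup sticky sessions add no value and can mask backplane misconfiguration.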

Cluster correctness audit (#217)

The 7 BackgroundServices in CritterWatch.Services are classified as follows:

| Service | Classification | Why |
| --- | --- | --- |
| MetricsAlertEvaluator | cluster-singleton | persists alert records + publishes lifecycle messages |
| ProjectionAlertEvaluator | cluster-singleton | same shape |
| PrometheusScrapingService | cluster-singleton | external HTTP fetch + persistence |
| MetricsIdleReEvaluator | cluster-singleton | re-publishes rollups |
| StateRefreshService | per-node OK | refreshes its own connected clients; backplane fans out |
| AlertBatchAccumulator | per-node OK | batches what this node received; backplane fans out |
| SignalRBatchAccumulator | per-node OK | same shape |

Released under the MIT License.