
How CritterWatch Works

This page is the operator-level mental model. It covers what infrastructure CritterWatch needs, how data gets from your services into the console, and how fresh that data is — enough to plan a deployment, debug a misconfigured service, and read the dashboards without surprises.

If you're integrating Wolverine.CritterWatch into a service for the first time, start with the Quick Start. If you're deploying the console, see the Hosting Guide.

The shape of a deployment

CritterWatch sits next to the services it monitors. There are three pieces:

  • Each monitored service runs the Wolverine.CritterWatch package. It hooks into Wolverine's runtime, batches up state, and publishes a snapshot once per second.
  • The CritterWatch server is a single ASP.NET Core process. It receives telemetry, persists it to PostgreSQL, runs the alert evaluator, and serves the web console.
  • Operator commands (replay, pause listener, rebuild projection, etc.) flow back from the browser through the same transport to the target service.

Infrastructure you need

| Component | Purpose | Notes |
| --- | --- | --- |
| PostgreSQL 14+ | Backs the CritterWatch server's own bookkeeping (alerts, audit log, timeline). | Dedicated database; CritterWatch won't touch your application's databases. |
| Wolverine transport | Carries telemetry from services to the console and commands back. RabbitMQ is recommended. | Services and the console must use the same transport. |
| CritterWatch server | Single ASP.NET Core process; serves the browser UI and runs alert evaluation. | Not currently HA; run it behind a process manager that restarts it on failure. |

Your monitored services don't need any new database or new ports — they reuse the Wolverine transport they already have. CritterWatch never opens a connection to your service's databases.

How fresh is the data?

End-to-end latency from a state change in your service to a pixel changing in the browser is typically 1–2 seconds:

| Stage | Typical time |
| --- | --- |
| Wolverine event → observer batch | ~immediate (lock-free queue push) |
| Observer batch → telemetry publish | ≤ 1 second (publish timer) |
| Transport delivery + console handling + DB write | ~100ms |
| Browser update via SignalR | ~100ms |

The 1-second batching window is the dominant lag. It's deliberate: per-event publishing would multiply RabbitMQ traffic without helping the operator, who cannot act on sub-second updates anyway.
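The batching trade-off can be illustrated with a small model. This is a minimal sketch, not the Wolverine.CritterWatch implementation: `SnapshotBatcher`, `record`, `publish_tick`, and `worst_case_latency_ms` are hypothetical names, and the stage timings are the approximate figures from the table above.

```python
import queue


class SnapshotBatcher:
    """Hypothetical model: events are queued at once, drained once per tick."""

    def __init__(self):
        self._events = queue.SimpleQueue()  # cheap push on the hot path
        self.published = []                 # stand-in for the transport

    def record(self, event):
        # Called for every runtime event; never publishes by itself.
        self._events.put(event)

    def publish_tick(self):
        # Called by a timer (once per second in the text above).
        batch = []
        while True:
            try:
                batch.append(self._events.get_nowait())
            except queue.Empty:
                break
        if batch:
            self.published.append(batch)  # one message, however many events
        return len(batch)


def worst_case_latency_ms(batch_window_ms=1000, transport_ms=100, browser_ms=100):
    # Queue push is treated as free; the batch window dominates the total.
    return batch_window_ms + transport_ms + browser_ms
```

In this model, 500 events recorded within one window result in a single published message, and the worst-case budget adds up to about 1200 ms, which is consistent with the "typically 1–2 seconds" figure.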

A few things are reported on a slower cadence and lag accordingly:

  • Heartbeat dot — 30 second cadence; amber after 60s, red after 150s. See Services → Heartbeat dot.
  • Agent health probe — ~60 second cadence. The probe is active so silent agent failures eventually surface.
  • Broker health probe — ~60 second cadence per transport.
  • Pending EF Core migrations — on demand only; the operator clicks a button. Reading __EFMigrationsHistory opens a synchronous DB connection.
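The heartbeat thresholds can be read as a simple classification of the time since the last heartbeat. The sketch below is an illustration only: the function names, the "healthy" label, and the strict greater-than comparisons are assumptions, while the 30s, 60s, and 150s values come from the list above.

```python
HEARTBEAT_INTERVAL_S = 30   # cadence stated above
AMBER_AFTER_S = 60          # two intervals without a heartbeat
RED_AFTER_S = 150           # five intervals without a heartbeat


def heartbeat_state(seconds_since_last):
    """Classify a service by the age of its last heartbeat (hypothetical helper)."""
    if seconds_since_last > RED_AFTER_S:
        return "red"
    if seconds_since_last > AMBER_AFTER_S:
        return "amber"
    return "healthy"  # assumed label; the source names only amber and red


def intervals_elapsed(seconds_since_last):
    # How many full heartbeat intervals have passed without a new heartbeat.
    return int(seconds_since_last // HEARTBEAT_INTERVAL_S)
```

Under these assumptions a service turns amber after roughly two missed intervals and red after roughly five, so a single delayed heartbeat does not change the dot.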

What gets sent to the console

The monitored service publishes its own state — CritterWatch never reaches in to scrape data:

  • Identity & topology — service name, Wolverine version, registered handlers, endpoints, message stores, event stores, tenants.
  • Live counts — inbox / outbox / DLQ depth, per-tenant counters, persistence queue depths.
  • Lifecycle changes — node added/removed, leader elected, agent assigned/stopped, projection shard advanced, circuit breaker tripped, back pressure activated.
  • Health — agent health probe results, broker reachability, service heartbeat.

Connection strings, database credentials, and message bodies are not included in the regular telemetry stream. (Message bodies are only fetched when an operator opens a specific dead-letter or scheduled message for inspection.)

What about the data CritterWatch keeps?

The console persists its own bookkeeping in PostgreSQL — alerts, the audit log, timeline entries, and a rolling history of metric buckets. Retention is configurable from Settings → Data Retention; defaults are 30 days for the timeline and audit log, 1000 buckets per service per message-type for metrics.
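The two default retention rules behave differently: metrics are capped by count, while the timeline and audit log are capped by age. A minimal sketch of both rules, assuming an in-memory model (the real console stores this in PostgreSQL, and every name below is hypothetical):

```python
from collections import defaultdict, deque
from datetime import datetime, timedelta, timezone

MAX_BUCKETS = 1000                        # default per service per message type
TIMELINE_RETENTION = timedelta(days=30)   # default for timeline and audit log

# One bounded history per (service, message_type); the oldest bucket is
# dropped automatically once the cap is reached.
metric_buckets = defaultdict(lambda: deque(maxlen=MAX_BUCKETS))


def add_bucket(service, message_type, bucket):
    metric_buckets[(service, message_type)].append(bucket)


def prune_by_age(entries, now):
    # Keep only entries recorded within the retention window.
    cutoff = now - TIMELINE_RETENTION
    return [entry for entry in entries if entry["at"] >= cutoff]
```

A count-based cap means the time span covered by the metric history depends on how often buckets are written, whereas the age-based cap always covers the same 30-day window.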

Service state itself is reconstructable from telemetry: if you wipe the CritterWatch database, the next telemetry snapshot from each running service rebuilds it. The historical timeline and audit log are the parts that exist only in the database and cannot be rebuilt this way.

Failure isolation

Three properties to keep in mind:

  • Services run independently of CritterWatch. If the console is down, your services keep processing messages. Telemetry queues up in the transport and replays on reconnect.
  • Services don't trust each other through CritterWatch. Commands are routed to a specific service's queue; one service can't issue commands on behalf of another.
  • CritterWatch failures don't cascade. The transport queue, not direct calls, decouples services from the console.
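The decoupling described above can be modeled with named queues. This sketch is a toy in-memory stand-in for the transport, not RabbitMQ or Wolverine code; `ToyTransport`, the queue names, and the message shapes are all hypothetical.

```python
from collections import defaultdict, deque


class ToyTransport:
    """In-memory stand-in for a message broker with one queue per destination."""

    def __init__(self):
        self.queues = defaultdict(deque)

    def send(self, destination, message):
        # Sending never depends on the receiver being up.
        self.queues[destination].append(message)

    def drain(self, destination):
        # Called by the receiver once it is running (again).
        pending = list(self.queues[destination])
        self.queues[destination].clear()
        return pending


transport = ToyTransport()

# Console is "down": the service keeps publishing without errors.
for n in range(3):
    transport.send("critterwatch", {"service": "orders", "snapshot": n})

# A command is addressed to exactly one service's queue.
transport.send("orders", {"command": "pause-listener"})

# Console comes back and receives the backlog in order.
backlog = transport.drain("critterwatch")
```

Here the service never calls the console directly, so a console outage only delays telemetry, and a command placed on the `orders` queue is never visible to any other service's queue.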

What CritterWatch does require to function: PostgreSQL must be reachable for queries (alerts, audit log, etc.), and the SignalR WebSocket must reach the browser. Neither failure affects monitored services.

What kind of scale this targets

CritterWatch is designed for dozens to a few hundred monitored services. The primary bottlenecks are the RabbitMQ listener and PostgreSQL write throughput on the console side; the publish-side cost on each monitored service is negligible (a 1-second batched publish).
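A rough way to size the console side is to count telemetry messages. The estimate below is a back-of-the-envelope sketch that assumes one publisher per service sending one snapshot every second; the function name and the example figure of 300 services are illustrative assumptions, and real traffic also depends on node counts and other message types.

```python
SECONDS_PER_DAY = 24 * 60 * 60


def telemetry_messages_per_second(publishers, publish_interval_s=1):
    # Upper-bound estimate: every publisher sends on every tick.
    return publishers / publish_interval_s


def telemetry_messages_per_day(publishers, publish_interval_s=1):
    return int(telemetry_messages_per_second(publishers, publish_interval_s) * SECONDS_PER_DAY)
```

For 300 publishers this gives about 300 messages per second, or roughly 26 million per day, all of which the single console instance must receive and write.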

The CritterWatch server is currently a single instance — there is no built-in clustering. For very large deployments, contact JasperFx for guidance.

Topics

  • Message Flow — what triggers a console update and what triggers a service-side command, end to end.
  • Multi-Tenancy — how tenants are discovered and scoped in the UI; runtime tenant management.
  • EF Core Monitoring — how DbContexts get reported and what the badges on the Storage tab mean.
  • Source Generator — how CritterWatch.SourceGeneration discovers aggregates, handlers, and sagas at build time, and how Wolverine.CritterWatch merges the per-project manifests at runtime.

Released under the MIT License.