Alerts

CritterWatch implements a fully event-sourced alert lifecycle. Every alert transition — raised, elevated, reduced, resolved, cleared — is stored as an immutable event in Marten, providing a complete audit trail of system health over time.

Alert Lifecycle

States

State	Description
Raised	Threshold first exceeded. Initial alert created.
Elevated	Condition persists beyond the escalation period. Severity increased.
Reduced	Condition is improving but not yet resolved.
Resolved	Condition has cleared automatically (system-condition alerts only).
Cleared	Alert acknowledged and closed by an operator.

System-condition alerts (DLQ counts, projection lag, circuit breakers) auto-resolve when the underlying condition clears. Operational alerts (node ejection, manual actions) require explicit operator acknowledgment.
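Because every transition is stored as an event, the current state of an alert can be derived by replaying its events. The sketch below illustrates that idea only. The `AlertTransitioned` record and the `AlertTimeline` helper are hypothetical names chosen for this example, not CritterWatch types.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public enum AlertState { Raised, Elevated, Reduced, Resolved, Cleared }

// Hypothetical event shape for illustration; the real CritterWatch event types are not shown here.
public record AlertTransitioned(AlertState To, DateTimeOffset At, string? Operator = null);

public static class AlertTimeline
{
    // The current state is the target of the most recent transition.
    // Assumes the stream contains at least one event (the initial Raised).
    public static AlertState CurrentState(IEnumerable<AlertTransitioned> events) =>
        events.OrderBy(e => e.At).Last().To;

    // An alert counts as open until it is either Resolved or Cleared.
    public static bool IsOpen(IEnumerable<AlertTransitioned> events)
    {
        var state = CurrentState(events);
        return state != AlertState.Resolved && state != AlertState.Cleared;
    }
}
```

With this sketch, a stream of Raised, Elevated, Resolved evaluates to `Resolved` and is no longer open, while the same stream truncated after Elevated is still open.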

Alert Types

Dead Letter Queue Alerts

Triggered when the DLQ count for a service or message type exceeds a threshold:

Warning — DLQ count exceeds DeadLetterQueueWarningCount
Critical — DLQ count exceeds DeadLetterQueueCriticalCount

Auto-resolves when DLQ count drops below the threshold.
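As a rough illustration of the two-level threshold check, the sketch below maps a DLQ count to a severity. It assumes "exceeds" means strictly greater than. The `DlqSeverity` class is a hypothetical helper, and its parameters stand in for the DeadLetterQueueWarningCount and DeadLetterQueueCriticalCount settings.

```csharp
public enum Severity { None, Warning, Critical }

public static class DlqSeverity
{
    // warningCount / criticalCount mirror DeadLetterQueueWarningCount and
    // DeadLetterQueueCriticalCount; "exceeds" is read as strictly greater than.
    public static Severity Classify(int dlqCount, int warningCount, int criticalCount)
    {
        if (dlqCount > criticalCount) return Severity.Critical;
        if (dlqCount > warningCount) return Severity.Warning;
        return Severity.None; // at or below the warning threshold: nothing to raise
    }
}
```

With illustrative thresholds of 25 and 100, a count of 47 classifies as Warning and a count of 101 as Critical.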

Projection Stall Alerts

Triggered when a projection shard stops advancing:

Warning — lag exceeds ProjectionLagWarningSeconds
Critical — lag exceeds ProjectionLagCriticalSeconds, or shard appears fully stalled

Auto-resolves when the projection resumes advancing and lag returns below threshold.

Agent Health Alerts

Triggered when an agent reports unhealthy status:

Warning — agent unhealthy for AgentUnhealthyWarningCount consecutive checks
Critical — agent unhealthy for AgentUnhealthyCriticalCount consecutive checks

Auto-resolves when the agent reports healthy.
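The consecutive-check rule can be pictured as a streak counter that resets on the first healthy report. This is a minimal sketch under that reading. `AgentHealthTracker` is a hypothetical class, and its constructor arguments stand in for AgentUnhealthyWarningCount and AgentUnhealthyCriticalCount.

```csharp
public enum Severity { None, Warning, Critical }

public sealed class AgentHealthTracker
{
    private readonly int _warningCount;   // stands in for AgentUnhealthyWarningCount
    private readonly int _criticalCount;  // stands in for AgentUnhealthyCriticalCount
    private int _consecutiveUnhealthy;

    public AgentHealthTracker(int warningCount, int criticalCount)
    {
        _warningCount = warningCount;
        _criticalCount = criticalCount;
    }

    // Record one health check and return the severity implied by the current streak.
    public Severity Record(bool healthy)
    {
        // A healthy report resets the streak, which models the auto-resolve behaviour.
        _consecutiveUnhealthy = healthy ? 0 : _consecutiveUnhealthy + 1;

        if (_consecutiveUnhealthy >= _criticalCount) return Severity.Critical;
        if (_consecutiveUnhealthy >= _warningCount) return Severity.Warning;
        return Severity.None;
    }
}
```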

Circuit Breaker Alerts

Triggered immediately when a circuit breaker trips on any endpoint. Severity is always Critical. Auto-resolves when the circuit breaker resets.

Back Pressure Alerts

Triggered when back pressure activates on any endpoint. Auto-resolves when back pressure lifts.

Alerts Page

The Alerts page lists alerts across all services and can be filtered by:

Status — Raised, Elevated, Reduced, Resolved, Cleared, or All
Severity — Warning or Critical
Service — scope to a specific service
Type — DLQ, Projection, Agent, CircuitBreaker, BackPressure

The Active tab shows only open alerts requiring attention. The History tab shows all alerts including resolved and cleared.

Alert Detail

Click an alert to open the detail panel:

State Timeline

A chronological list of all state transitions for this alert, showing:

Timestamp of each transition
From state → To state
The metric value that triggered the transition (e.g., "DLQ count: 47")
For operator actions: who took the action and any notes

Remediation Actions

Each alert includes context-appropriate action buttons:

Alert Type	Available Actions
DLQ Alert	Replay All, Discard All, View DLQ
Projection Stall	Restart Projection, Rebuild Projection, View Projection
Agent Unhealthy	View Service, Eject Node
Circuit Breaker	View Endpoint

Acknowledge / Snooze / Clear

Acknowledge — mark the alert as acknowledged without clearing it. The alert remains visible but is no longer considered "unattended."

Snooze — suppress the alert for a specified duration (1 hour, 4 hours, 24 hours). The alert will resurface after the snooze expires.

Clear — close the alert and record the operator action in the audit trail. A note can be added explaining the resolution.

How Metrics-Based Alerts Are Determined

Some alerts (DLQ count, projection lag, agent health, circuit breaker) are driven by direct counts and trigger as soon as a configured threshold is crossed. Metrics-based alerts — throughput abnormal and execution time abnormal — are different. They compare a current reading against a baseline and only raise when the deviation is large enough and has persisted long enough.

The baseline cascade

For each (service, message type) pair the evaluator picks the effective baseline in this order:

1. Observed history: the average for this service / message type computed from MetricsSample documents in the CritterWatch event store. Only used when it is mature:
   - the oldest sample is at least BaselineMinimumDays old (default 10 days),
   - there are at least BaselineMinimumSamples buckets (default 100), and
   - for throughput, the observed value is above BaselineMinimumThroughputPerHour (default 1.0/hr), which stops a single fractional sample from triggering huge multipliers.
2. Declared baseline: a value supplied explicitly rather than observed. Comes from one of two sources:
   - configureBaselines on AddCritterWatchMonitoring in the monitored application (see Registration › Declared Baselines).
   - The CritterWatch UI under Settings → Alert Configuration → Baselines.
3. None: the alert is suppressed during the warmup window. This is the default behaviour for any newly monitored service that hasn't shipped a declared baseline. Set SuppressThroughputAlertsDuringWarmup / SuppressExecTimeAlertsDuringWarmup to false to opt out (rarely needed, because alerts will be noisy until the service settles in).
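The selection order above can be sketched as a small function. This is an illustration of the described rules for throughput, not the evaluator's actual code. The `ObservedHistory`, `BaselinePolicy`, and `BaselineCascade` types are hypothetical, and only the setting names and defaults come from this page.

```csharp
using System;

// Hypothetical summary of the MetricsSample history for one (service, message type) pair.
public record ObservedHistory(double AveragePerHour, DateTimeOffset OldestSample, int SampleCount);

// Defaults taken from the maturity rules described above.
public record BaselinePolicy(
    int BaselineMinimumDays = 10,
    int BaselineMinimumSamples = 100,
    double BaselineMinimumThroughputPerHour = 1.0);

public static class BaselineCascade
{
    // Observed history is only trusted once all three maturity conditions hold.
    public static bool IsMature(ObservedHistory history, BaselinePolicy policy, DateTimeOffset now) =>
        (now - history.OldestSample).TotalDays >= policy.BaselineMinimumDays
        && history.SampleCount >= policy.BaselineMinimumSamples
        && history.AveragePerHour > policy.BaselineMinimumThroughputPerHour;

    // Returns null when there is no effective baseline (the warmup case).
    public static double? EffectiveThroughputBaseline(
        ObservedHistory? observed, double? declared, BaselinePolicy policy, DateTimeOffset now)
    {
        if (observed is not null && IsMature(observed, policy, now))
            return observed.AveragePerHour; // 1. mature observed history
        return declared;                    // 2. declared baseline, or 3. none
    }
}
```

A null result corresponds to the third level: with warmup suppression enabled, the evaluator would skip the alert rather than compare against a missing baseline.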

Each baseline change is recorded as a ThroughputBaselineChanged / ExecTimeBaselineChanged event so the audit trail makes the provenance clear: Source = ServiceCapabilities for values advertised by the monitored app at startup, Source = Operator for values typed into the UI.

Threshold multipliers

Once an effective baseline exists, the evaluator compares current readings against multipliers / percentages:

Setting	Default	Meaning
`ThroughputWarningMultiplier`	`3`	Warn when current rate is ≥ 3× baseline
`ThroughputCriticalMultiplier`	`10`	Critical when current rate is ≥ 10× baseline
`ExecTimeWarningPercent`	`50`	Warn when avg exec time is ≥ baseline + 50%
`ExecTimeCriticalPercent`	`200`	Critical when avg exec time is ≥ baseline + 200%
`FailureRateWarningPercent`	`5`	Warn when failures > 5% of executions
`FailureRateCriticalPercent`	`20`	Critical when failures > 20% of executions
`DlqRateWarningPerHour`	`10`	Warn when DLQ rate ≥ 10/hr
`DlqRateCriticalPerHour`	`50`	Critical when DLQ rate ≥ 50/hr
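To make the baseline-relative rows concrete, the following sketch applies the throughput multipliers and execution-time percentages to a current reading. `MetricsThresholds` is a hypothetical helper. The default argument values mirror the table, and the comparisons follow the "≥" wording used there.

```csharp
public enum Severity { None, Warning, Critical }

public static class MetricsThresholds
{
    // Throughput: compare the current rate to a multiple of the baseline.
    public static Severity Throughput(
        double currentPerHour, double baselinePerHour,
        double warningMultiplier = 3, double criticalMultiplier = 10)
    {
        if (currentPerHour >= baselinePerHour * criticalMultiplier) return Severity.Critical;
        if (currentPerHour >= baselinePerHour * warningMultiplier) return Severity.Warning;
        return Severity.None;
    }

    // Execution time: compare the current average to baseline plus a percentage.
    public static Severity ExecTime(
        double currentMs, double baselineMs,
        double warningPercent = 50, double criticalPercent = 200)
    {
        if (currentMs >= baselineMs * (1 + criticalPercent / 100.0)) return Severity.Critical;
        if (currentMs >= baselineMs * (1 + warningPercent / 100.0)) return Severity.Warning;
        return Severity.None;
    }
}
```

For a baseline of 200 messages per hour, the defaults put the warning line at 600/hr and the critical line at 2,000/hr. For a 25 ms baseline, they put the lines at 37.5 ms and 75 ms.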

Hysteresis (K-of-N) — noise suppression

A single noisy evaluation never raises or resolves an alert by itself. Instead the evaluator tracks consecutive passes:

Setting	Default	Meaning
`HysteresisRaisePasses`	`2`	A breach must be observed in this many consecutive passes before the alert is raised (or its severity changed).
`HysteresisResolvePasses`	`2`	A below-threshold reading must be observed in this many consecutive passes before an active alert auto-resolves.

With the default 30-second evaluator cadence this means ~60 seconds of confirmation either way before any state change reaches the user. Increase the values to be more conservative; setting either to 1 disables that side of the hysteresis.
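The consecutive-pass rule can be sketched as a pair of streak counters. This is a simplified illustration that tracks only raised versus not raised and ignores severity changes. `HysteresisGate` is a hypothetical class, and its constructor arguments stand in for HysteresisRaisePasses and HysteresisResolvePasses.

```csharp
public sealed class HysteresisGate
{
    private readonly int _raisePasses;    // stands in for HysteresisRaisePasses
    private readonly int _resolvePasses;  // stands in for HysteresisResolvePasses
    private int _breachStreak;
    private int _clearStreak;

    public bool Active { get; private set; }

    public HysteresisGate(int raisePasses = 2, int resolvePasses = 2)
    {
        _raisePasses = raisePasses;
        _resolvePasses = resolvePasses;
    }

    // Feed one evaluator pass; returns whether the alert is active afterwards.
    public bool Observe(bool breached)
    {
        if (breached)
        {
            _breachStreak++;
            _clearStreak = 0; // any breach interrupts a run of clear readings
            if (!Active && _breachStreak >= _raisePasses) Active = true;
        }
        else
        {
            _clearStreak++;
            _breachStreak = 0; // any clear reading interrupts a run of breaches
            if (Active && _clearStreak >= _resolvePasses) Active = false;
        }
        return Active;
    }
}
```

With the defaults, a lone breach between clear readings never raises the alert, and a lone clear reading between breaches never resolves it.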

Cascade order

Every metrics-based threshold and every baseline-policy knob supports the same three-level cascade — most specific wins:

Per-message-type override (MessageTypeAlertThresholds)
Per-service override (ServiceMetricsAlertOverrides)
Global default (MetricsAlertDefaults)

That includes the multipliers, the failure-rate / DLQ thresholds, the hysteresis pass counts, the warmup-suppression flags, and the declared baselines themselves.
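One way to picture the three-level lookup is a chain of null-coalescing fallbacks, where an unset override is represented as null. The `ThresholdCascade` helper below is hypothetical and only illustrates the precedence. It does not reflect how MessageTypeAlertThresholds, ServiceMetricsAlertOverrides, or MetricsAlertDefaults are actually stored.

```csharp
public static class ThresholdCascade
{
    // Most specific wins: message type override, then service override, then global default.
    public static T Resolve<T>(T? messageTypeOverride, T? serviceOverride, T globalDefault)
        where T : struct
        => messageTypeOverride ?? serviceOverride ?? globalDefault;
}
```

For example, with a global HysteresisRaisePasses of 2, a service override of 3, and no message-type override, `ThresholdCascade.Resolve<int>(null, 3, 2)` yields 3.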

Editing Thresholds

There are two ways to change any of the settings above.

From the CritterWatch UI

Settings → Alert Configuration exposes three tabs matching the cascade:

Global Defaults — applied to every service unless overridden.
Service Overrides — pick a service and override any global value. Leave fields blank to inherit from the global default.
Message Type Overrides — pick a service + message type combination and override any value. Leave fields blank to inherit from the service or global default.

Edits are stored as documents in the CritterWatch event store and take effect on the next evaluator pass (≤30s). Every edit produces an AlertConfigChanged audit entry, and edits to declared baselines additionally emit ThroughputBaselineChanged / ExecTimeBaselineChanged events with Source = Operator.

Programmatically — `configureBaselines`

For declared baselines specifically, the cleanest way to seed values for a fresh service is in the monitored application itself:

```csharp
opts.AddCritterWatchMonitoring(
    critterWatchUri,
    systemControlUri,
    configureBaselines: baselines => baselines
        // Service-wide declared baseline
        .ForService(throughputPerHour: 200, avgExecTimeMs: 25)
        // Declared baseline for the TripBooked message type only
        .For<TripBooked>(throughputPerHour: 50, avgExecTimeMs: 40)
);
```

These values are sent to CritterWatch when the service first connects. An operator can later refine them through the UI, and the operator-supplied value then takes precedence in the cascade: configureBaselines only seeds the baseline, while the UI is authoritative.

Preset profiles

Two convenience presets are available from the Settings page:

Production Profile — strict thresholds, suitable for production environments
Development Profile — relaxed thresholds, reduces noise during development

See Configuration Reference for the full programmatic API.
