Alerts
CritterWatch implements a fully event-sourced alert lifecycle. Every alert transition — raised, elevated, reduced, resolved, cleared — is stored as an immutable event in Marten, providing a complete audit trail of system health over time.
Alert Lifecycle
States
| State | Description |
|---|---|
| Raised | Threshold first exceeded. Initial alert created. |
| Elevated | Condition persists beyond the escalation period. Severity increased. |
| Reduced | Condition is improving but not yet resolved. |
| Resolved | Condition has cleared automatically (system-condition alerts only). |
| Cleared | Alert acknowledged and closed by an operator. |
System-condition alerts (DLQ counts, projection lag, circuit breakers) auto-resolve when the underlying condition clears. Operational alerts (node ejection, manual actions) require explicit operator acknowledgment.
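
As a rough illustration of the event-sourced lifecycle described above, the current state of an alert can be derived by replaying its transitions. This is a minimal sketch only: `AlertTransition`, `AlertProjection`, and the sample details are hypothetical names invented for the example, not CritterWatch's actual event types, and Marten is left out.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical event stream for one alert; a real stream would live in Marten.
var stream = new List<AlertTransition>
{
    new(AlertState.Raised,   "DLQ count: 47"),
    new(AlertState.Elevated, "DLQ count: 120"),
    new(AlertState.Reduced,  "DLQ count: 60"),
    new(AlertState.Resolved, "DLQ count: 3"),
};

Console.WriteLine(AlertProjection.CurrentState(stream)); // Resolved
Console.WriteLine(AlertProjection.IsOpen(stream));       // False

public enum AlertState { Raised, Elevated, Reduced, Resolved, Cleared }

// One immutable transition; events are appended, never updated.
public sealed record AlertTransition(AlertState To, string Detail);

public static class AlertProjection
{
    // The current state is the target of the last recorded transition.
    public static AlertState CurrentState(IReadOnlyList<AlertTransition> events)
    {
        if (events.Count == 0)
            throw new InvalidOperationException("An alert stream needs at least one event.");
        return events[events.Count - 1].To;
    }

    // Resolved and Cleared alerts are closed; every other state is still open.
    public static bool IsOpen(IReadOnlyList<AlertTransition> events)
    {
        var state = CurrentState(events);
        return state != AlertState.Resolved && state != AlertState.Cleared;
    }
}
```

Because the stream is append-only, the same replay also yields the full audit trail: every intermediate state is still present in the list.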
Alert Types
Dead Letter Queue Alerts
Triggered when the DLQ count for a service or message type exceeds a threshold:
- Warning — DLQ count exceeds DeadLetterQueueWarningCount
- Critical — DLQ count exceeds DeadLetterQueueCriticalCount
Auto-resolves when DLQ count drops below the threshold.
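
The two DLQ thresholds can be read as a simple severity classification. The sketch below assumes "exceeds" means strictly greater than; the `DlqThresholds` record, the `DlqAlerts` class, and the threshold values 10 and 100 are hypothetical, since the defaults are not stated here.

```csharp
using System;

// Hypothetical threshold values chosen for illustration only.
var thresholds = new DlqThresholds(DeadLetterQueueWarningCount: 10, DeadLetterQueueCriticalCount: 100);

Console.WriteLine(DlqAlerts.Classify(5, thresholds));    // None
Console.WriteLine(DlqAlerts.Classify(47, thresholds));   // Warning
Console.WriteLine(DlqAlerts.Classify(250, thresholds));  // Critical

public enum Severity { None, Warning, Critical }

public sealed record DlqThresholds(int DeadLetterQueueWarningCount, int DeadLetterQueueCriticalCount);

public static class DlqAlerts
{
    // Check the critical threshold first so the higher severity wins.
    public static Severity Classify(int dlqCount, DlqThresholds t)
    {
        if (dlqCount > t.DeadLetterQueueCriticalCount) return Severity.Critical;
        if (dlqCount > t.DeadLetterQueueWarningCount) return Severity.Warning;
        return Severity.None;
    }
}
```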
Projection Stall Alerts
Triggered when a projection shard stops advancing:
- Warning — lag exceeds ProjectionLagWarningSeconds
- Critical — lag exceeds ProjectionLagCriticalSeconds, or shard appears fully stalled
Auto-resolves when the projection resumes advancing and lag returns below threshold.
Agent Health Alerts
Triggered when an agent reports unhealthy status:
- Warning — agent unhealthy for AgentUnhealthyWarningCount consecutive checks
- Critical — agent unhealthy for AgentUnhealthyCriticalCount consecutive checks
Auto-resolves when the agent reports healthy.
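
The consecutive-check rule can be sketched as a streak counter that resets on the first healthy report. `AgentHealthTracker` and the counts 3 and 5 are hypothetical; the sketch also assumes that reaching the configured count (not exceeding it) is what triggers each severity.

```csharp
using System;

// Hypothetical counts chosen for illustration only.
var tracker = new AgentHealthTracker(warningCount: 3, criticalCount: 5);

foreach (var healthy in new[] { false, false, false, false, false, true })
    Console.WriteLine(tracker.Record(healthy));
// Prints, one per line: None, None, Warning, Warning, Critical, None

public enum Severity { None, Warning, Critical }

public sealed class AgentHealthTracker
{
    private readonly int _warningCount;
    private readonly int _criticalCount;
    private int _consecutiveUnhealthy;

    public AgentHealthTracker(int warningCount, int criticalCount)
    {
        _warningCount = warningCount;
        _criticalCount = criticalCount;
    }

    // Record one health check and return the severity implied by the current streak.
    public Severity Record(bool healthy)
    {
        // A single healthy report resets the streak, which models the auto-resolve.
        _consecutiveUnhealthy = healthy ? 0 : _consecutiveUnhealthy + 1;

        if (_consecutiveUnhealthy >= _criticalCount) return Severity.Critical;
        if (_consecutiveUnhealthy >= _warningCount) return Severity.Warning;
        return Severity.None;
    }
}
```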
Circuit Breaker Alerts
Triggered immediately when a circuit breaker trips on any endpoint. Severity is always Critical. Auto-resolves when the circuit breaker resets.
Back Pressure Alerts
Triggered when back pressure activates on any endpoint. Auto-resolves when back pressure lifts.
Alerts Page
The Alerts page shows alerts across all services, with filters:
- Status — Raised, Elevated, Reduced, Resolved, Cleared, or All
- Severity — Warning or Critical
- Service — scope to a specific service
- Type — DLQ, Projection, Agent, CircuitBreaker, BackPressure
The Active tab shows only open alerts requiring attention. The History tab shows all alerts including resolved and cleared.
Alert Detail
Click an alert to open the detail panel:
State Timeline
A chronological list of all state transitions for this alert, showing:
- Timestamp of each transition
- From state → To state
- The metric value that triggered the transition (e.g., "DLQ count: 47")
- For operator actions: who took the action and any notes
Remediation Actions
Each alert includes context-appropriate action buttons:
| Alert Type | Available Actions |
|---|---|
| DLQ Alert | Replay All, Discard All, View DLQ |
| Projection Stall | Restart Projection, Rebuild Projection, View Projection |
| Agent Unhealthy | View Service, Eject Node |
| Circuit Breaker | View Endpoint |
Acknowledge / Snooze / Clear
Acknowledge — mark the alert as acknowledged without clearing it. The alert remains visible but is no longer considered "unattended."
Snooze — suppress the alert for a specified duration (1 hour, 4 hours, 24 hours). The alert will resurface after the snooze expires.
Clear — close the alert and record the operator action in the audit trail. A note can be added explaining the resolution.
How Metrics-Based Alerts Are Determined
Some alerts (DLQ count, projection lag, agent health, circuit breaker) are driven by direct counts and trigger as soon as a configured threshold is crossed. Metrics-based alerts — throughput abnormal and execution time abnormal — are different. They compare a current reading against a baseline and only raise when the deviation is large enough and has persisted long enough.
The baseline cascade
For each (service, message type) pair the evaluator picks the effective baseline in this order:
- Observed history — the average for this service / message type computed from MetricsSample documents in the CritterWatch event store. Only used when it is mature:
  - the oldest sample is at least BaselineMinimumDays old (default 10),
  - there are at least BaselineMinimumSamples buckets (default 100), and
  - for throughput, the observed value is above BaselineMinimumThroughputPerHour (default 1.0/hr); this stops a single fractional sample from triggering huge multipliers.
- Declared baseline — a value supplied by an operator. Comes from one of two sources:
  - configureBaselines on AddCritterWatchMonitoring in the monitored application; see Registration › Declared Baselines.
  - The CritterWatch UI under Settings → Alert Configuration → Baselines.
- None → the alert is suppressed during the warmup window. This is the default behaviour for any newly-monitored service that hasn't shipped a declared baseline. Set SuppressThroughputAlertsDuringWarmup / SuppressExecTimeAlertsDuringWarmup to false to opt out (rarely needed; it will alert noisily until the service settles in).
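
The selection order above can be sketched for the throughput case as follows. The setting names and defaults come from the list; `ObservedHistory`, `BaselineCascade`, and the sample figures are hypothetical, and a `null` result stands in for "no baseline, suppress during warmup".

```csharp
#nullable enable
using System;

var settings = new BaselineSettings();   // defaults taken from the list above

// Mature observed history wins over a declared value.
var mature = new ObservedHistory(OldestSampleAgeDays: 14, SampleCount: 400, AverageThroughputPerHour: 120);
Console.WriteLine(BaselineCascade.EffectiveThroughput(mature, declared: 200, settings)); // 120

// Immature history falls back to the declared value.
var young = new ObservedHistory(OldestSampleAgeDays: 2, SampleCount: 30, AverageThroughputPerHour: 120);
Console.WriteLine(BaselineCascade.EffectiveThroughput(young, declared: 200, settings));  // 200

// Neither source is usable: null means the alert is suppressed during warmup.
Console.WriteLine(BaselineCascade.EffectiveThroughput(young, declared: null, settings) is null); // True

public sealed record BaselineSettings(
    int BaselineMinimumDays = 10,
    int BaselineMinimumSamples = 100,
    double BaselineMinimumThroughputPerHour = 1.0);

public sealed record ObservedHistory(double OldestSampleAgeDays, int SampleCount, double AverageThroughputPerHour);

public static class BaselineCascade
{
    public static double? EffectiveThroughput(ObservedHistory? observed, double? declared, BaselineSettings s)
    {
        // All three maturity conditions must hold before observed history is trusted.
        bool isMature = observed is not null
            && observed.OldestSampleAgeDays >= s.BaselineMinimumDays
            && observed.SampleCount >= s.BaselineMinimumSamples
            && observed.AverageThroughputPerHour > s.BaselineMinimumThroughputPerHour;

        if (isMature) return observed!.AverageThroughputPerHour;

        // Otherwise use the declared baseline, which may itself be absent.
        return declared;
    }
}
```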
Each baseline change is recorded as a ThroughputBaselineChanged / ExecTimeBaselineChanged event so the audit trail makes the provenance clear: Source = ServiceCapabilities for values advertised by the monitored app at startup, Source = Operator for values typed into the UI.
Threshold multipliers
Once an effective baseline exists, the evaluator compares current readings against multipliers / percentages:
| Setting | Default | Meaning |
|---|---|---|
| ThroughputWarningMultiplier | 3 | Warn when current rate is ≥ 3× baseline |
| ThroughputCriticalMultiplier | 10 | Critical when current rate is ≥ 10× baseline |
| ExecTimeWarningPercent | 50 | Warn when avg exec time is ≥ baseline + 50% |
| ExecTimeCriticalPercent | 200 | Critical when avg exec time is ≥ baseline + 200% |
| FailureRateWarningPercent | 5 | Warn when failures > 5% of executions |
| FailureRateCriticalPercent | 20 | Critical when failures > 20% of executions |
| DlqRateWarningPerHour | 10 | Warn when DLQ rate ≥ 10/hr |
| DlqRateCriticalPerHour | 50 | Critical when DLQ rate ≥ 50/hr |
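
To make the baseline-relative rows concrete, here is a minimal sketch of the throughput and execution-time comparisons using the defaults from the table. `MetricThresholds` and `MetricAlerts` are hypothetical names, and the sketch reads "baseline + 50%" as 1.5 times the baseline. The failure-rate and DLQ-rate rows are plain comparisons against a fixed number and are not shown.

```csharp
using System;

var defaults = new MetricThresholds();   // defaults taken from the table above

Console.WriteLine(MetricAlerts.Throughput(currentPerHour: 250, baselinePerHour: 100, defaults)); // None
Console.WriteLine(MetricAlerts.Throughput(currentPerHour: 300, baselinePerHour: 100, defaults)); // Warning
Console.WriteLine(MetricAlerts.ExecTime(currentMs: 60, baselineMs: 40, defaults));               // Warning
Console.WriteLine(MetricAlerts.ExecTime(currentMs: 120, baselineMs: 40, defaults));              // Critical

public enum Severity { None, Warning, Critical }

public sealed record MetricThresholds(
    double ThroughputWarningMultiplier = 3,
    double ThroughputCriticalMultiplier = 10,
    double ExecTimeWarningPercent = 50,
    double ExecTimeCriticalPercent = 200);

public static class MetricAlerts
{
    // Throughput thresholds are multiples of the baseline rate.
    public static Severity Throughput(double currentPerHour, double baselinePerHour, MetricThresholds t)
    {
        if (currentPerHour >= baselinePerHour * t.ThroughputCriticalMultiplier) return Severity.Critical;
        if (currentPerHour >= baselinePerHour * t.ThroughputWarningMultiplier) return Severity.Warning;
        return Severity.None;
    }

    // Execution-time thresholds are percentage increases over the baseline.
    public static Severity ExecTime(double currentMs, double baselineMs, MetricThresholds t)
    {
        if (currentMs >= baselineMs * (1 + t.ExecTimeCriticalPercent / 100)) return Severity.Critical;
        if (currentMs >= baselineMs * (1 + t.ExecTimeWarningPercent / 100)) return Severity.Warning;
        return Severity.None;
    }
}
```

With a 40 ms baseline, the default percentages put the warning line at 60 ms and the critical line at 120 ms.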
Hysteresis (K-of-N) — noise suppression
A single noisy evaluation never raises or resolves an alert by itself. Instead the evaluator tracks consecutive passes:
| Setting | Default | Meaning |
|---|---|---|
| HysteresisRaisePasses | 2 | A breach must be observed in this many consecutive passes before the alert is raised (or its severity changed). |
| HysteresisResolvePasses | 2 | A below-threshold reading must be observed in this many consecutive passes before an active alert auto-resolves. |
With the default 30-second evaluator cadence this means ~60 seconds of confirmation either way before any state change reaches the user. Increase the values to be more conservative; setting either to 1 disables that side of the hysteresis.
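
The consecutive-pass rule can be sketched as a small gate that tracks two streaks. `HysteresisGate` is a hypothetical class written for this example, and it covers only raise and resolve; severity changes, which the table says follow the same raise count, are left out for brevity.

```csharp
using System;

var gate = new HysteresisGate(raisePasses: 2, resolvePasses: 2);

// One noisy breach is ignored; two in a row raise; two clean passes resolve.
foreach (var breached in new[] { true, false, true, true, false, false })
    Console.WriteLine(gate.Evaluate(breached));
// Prints, one per line: False, False, False, True, True, False

public sealed class HysteresisGate
{
    private readonly int _raisePasses;
    private readonly int _resolvePasses;
    private int _breachStreak;
    private int _clearStreak;

    public bool IsActive { get; private set; }

    public HysteresisGate(int raisePasses, int resolvePasses)
    {
        _raisePasses = raisePasses;
        _resolvePasses = resolvePasses;
    }

    // Feed one evaluator pass; returns whether the alert is active afterwards.
    public bool Evaluate(bool breached)
    {
        // Each reading extends one streak and resets the other.
        if (breached) { _breachStreak++; _clearStreak = 0; }
        else          { _clearStreak++; _breachStreak = 0; }

        if (!IsActive && _breachStreak >= _raisePasses) IsActive = true;
        else if (IsActive && _clearStreak >= _resolvePasses) IsActive = false;

        return IsActive;
    }
}
```

Passing 1 for either argument makes that side react to the very first reading, which matches the note above about disabling one side of the hysteresis.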
Cascade order
Every metrics-based threshold and every baseline-policy knob supports the same three-level cascade — most specific wins:
- Per-message-type override (MessageTypeAlertThresholds)
- Per-service override (ServiceMetricsAlertOverrides)
- Global default (MetricsAlertDefaults)
That includes the multipliers, the failure-rate / DLQ thresholds, the hysteresis pass counts, the warmup-suppression flags, and the declared baselines themselves.
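
One way to picture "most specific wins" is a lookup that walks the layers and takes the first value that is set. The `ThresholdLayer` record and `Cascade.Resolve` helper below are hypothetical and model only two settings; they do not reflect the real shape of MessageTypeAlertThresholds, ServiceMetricsAlertOverrides, or MetricsAlertDefaults.

```csharp
#nullable enable
using System;

// A null field means "inherit from the next, less specific layer".
var global      = new ThresholdLayer(ThroughputWarningMultiplier: 3,    HysteresisRaisePasses: 2);
var service     = new ThresholdLayer(ThroughputWarningMultiplier: 5,    HysteresisRaisePasses: null);
var messageType = new ThresholdLayer(ThroughputWarningMultiplier: null, HysteresisRaisePasses: 4);

Console.WriteLine(Cascade.Resolve<double>(l => l.ThroughputWarningMultiplier, messageType, service, global)); // 5
Console.WriteLine(Cascade.Resolve<int>(l => l.HysteresisRaisePasses, messageType, service, global));          // 4

public sealed record ThresholdLayer(double? ThroughputWarningMultiplier, int? HysteresisRaisePasses);

public static class Cascade
{
    // Layers are passed most specific first; the global layer must supply a value.
    public static T Resolve<T>(
        Func<ThresholdLayer, T?> pick,
        ThresholdLayer? messageType,
        ThresholdLayer? service,
        ThresholdLayer global) where T : struct
    {
        foreach (var layer in new[] { messageType, service })
        {
            if (layer is not null && pick(layer) is T value) return value;
        }

        return pick(global)
            ?? throw new InvalidOperationException("Global defaults must define every setting.");
    }
}
```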
Editing Thresholds
There are two ways to change any of the settings above.
From the CritterWatch UI
Settings → Alert Configuration exposes three tabs matching the cascade:
- Global Defaults — applied to every service unless overridden.
- Service Overrides — pick a service and override any global value. Leave fields blank to inherit from the global default.
- Message Type Overrides — pick a service + message type combination and override any value. Leave fields blank to inherit from the service or global default.
Edits are stored as documents in the CritterWatch event store and take effect on the next evaluator pass (≤30s). Every edit produces an AlertConfigChanged audit entry, and edits to declared baselines additionally emit ThroughputBaselineChanged / ExecTimeBaselineChanged events with Source = Operator.
Programmatically — configureBaselines
For declared baselines specifically, the cleanest way to seed values for a fresh service is from the monitored application itself:
opts.AddCritterWatchMonitoring(
critterWatchUri,
systemControlUri,
configureBaselines: baselines => baselines
.ForService(throughputPerHour: 200, avgExecTimeMs: 25)
.For<TripBooked>(throughputPerHour: 50, avgExecTimeMs: 40)
);
These flow over on first contact, and an operator can later refine them through the UI. The operator-supplied value then takes precedence in the cascade: configureBaselines only seeds, and the UI is authoritative.
Preset profiles
Two convenience presets are available from the Settings page:
- Production Profile — strict thresholds, suitable for production environments
- Development Profile — relaxed thresholds, reduces noise during development
See Configuration Reference for the full programmatic API.
