
Alerts

CritterWatch implements a fully event-sourced alert lifecycle. Every alert transition — raised, elevated, reduced, resolved, cleared — is stored as an immutable event in Marten, providing a complete audit trail of system health over time.

Alert Lifecycle

States

| State | Description |
| --- | --- |
| Raised | Threshold first exceeded. Initial alert created. |
| Elevated | Condition persists beyond the escalation period. Severity increased. |
| Reduced | Condition is improving but not yet resolved. |
| Resolved | Condition has cleared automatically (system-condition alerts only). |
| Cleared | Alert acknowledged and closed by an operator. |

System-condition alerts (DLQ counts, projection lag, circuit breakers) auto-resolve when the underlying condition clears. Operational alerts (node ejection, manual actions) require explicit operator acknowledgment.
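To make the event-sourced lifecycle concrete, here is a minimal sketch of how a current state could be derived by replaying an alert's events in order. The event records, the `AlertState` enum, and the `AlertLifecycle` helper are hypothetical names chosen for illustration; they are not CritterWatch's actual event schema or its Marten projection.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical state and event types for illustration only;
// CritterWatch's real event schema may differ.
public enum AlertState { Raised, Elevated, Reduced, Resolved, Cleared }

public abstract record AlertEvent(DateTimeOffset At);
public sealed record AlertRaised(DateTimeOffset At) : AlertEvent(At);
public sealed record AlertElevated(DateTimeOffset At) : AlertEvent(At);
public sealed record AlertReduced(DateTimeOffset At) : AlertEvent(At);
public sealed record AlertResolved(DateTimeOffset At) : AlertEvent(At);
public sealed record AlertCleared(DateTimeOffset At, string Operator) : AlertEvent(At);

public static class AlertLifecycle
{
    // Maps a single event to the state it leaves the alert in.
    public static AlertState Apply(AlertEvent e) => e switch
    {
        AlertRaised => AlertState.Raised,
        AlertElevated => AlertState.Elevated,
        AlertReduced => AlertState.Reduced,
        AlertResolved => AlertState.Resolved,
        AlertCleared => AlertState.Cleared,
        _ => throw new ArgumentOutOfRangeException(nameof(e), "Unknown alert event")
    };

    // Replays the stream in order; the current state is whatever the last event produced.
    public static AlertState CurrentState(IEnumerable<AlertEvent> stream)
    {
        AlertState? state = null;
        foreach (var e in stream) state = Apply(e);
        return state ?? throw new InvalidOperationException("Empty alert stream");
    }
}
```

Because the events are never mutated, replaying a prefix of the stream gives the alert's state at any earlier point in time, which is what makes the audit trail possible.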

Alert Types

Dead Letter Queue Alerts

Triggered when the DLQ count for a service or message type exceeds a threshold:

  • Warning — DLQ count exceeds DeadLetterQueueWarningCount
  • Critical — DLQ count exceeds DeadLetterQueueCriticalCount

Auto-resolves when DLQ count drops below the threshold.
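As a rough illustration of the two-tier rule above, the following sketch classifies a DLQ count against the warning and critical thresholds. The `DlqSeverity` enum and `DlqAlertRule` class are hypothetical; only the setting names `DeadLetterQueueWarningCount` and `DeadLetterQueueCriticalCount` come from this page, and "exceeds" is assumed to mean strictly greater than.

```csharp
// Hypothetical helper for illustration; not part of the CritterWatch API.
public enum DlqSeverity { None, Warning, Critical }

public static class DlqAlertRule
{
    // Checks the critical threshold first so the most severe match wins.
    public static DlqSeverity Classify(
        int dlqCount,
        int deadLetterQueueWarningCount,
        int deadLetterQueueCriticalCount)
    {
        if (dlqCount > deadLetterQueueCriticalCount) return DlqSeverity.Critical;
        if (dlqCount > deadLetterQueueWarningCount) return DlqSeverity.Warning;
        return DlqSeverity.None;
    }
}
```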

Projection Stall Alerts

Triggered when a projection shard stops advancing:

  • Warning — lag exceeds ProjectionLagWarningSeconds
  • Critical — lag exceeds ProjectionLagCriticalSeconds, or shard appears fully stalled

Auto-resolves when the projection resumes advancing and lag returns below threshold.

Agent Health Alerts

Triggered when an agent reports unhealthy status:

  • Warning — agent unhealthy for AgentUnhealthyWarningCount consecutive checks
  • Critical — agent unhealthy for AgentUnhealthyCriticalCount consecutive checks

Auto-resolves when the agent reports healthy.

Circuit Breaker Alerts

Triggered immediately when a circuit breaker trips on any endpoint. Severity is always Critical. Auto-resolves when the circuit breaker resets.

Back Pressure Alerts

Triggered when back pressure activates on any endpoint. Auto-resolves when back pressure lifts.

Alerts Page

The Alerts page lists alerts across all services and supports the following filters:

  • Status — Raised, Elevated, Reduced, Resolved, Cleared, or All
  • Severity — Warning or Critical
  • Service — scope to a specific service
  • Type — DLQ, Projection, Agent, CircuitBreaker, BackPressure

The Active tab shows only open alerts requiring attention. The History tab shows all alerts including resolved and cleared.

Alert Detail

Click an alert to open the detail panel:

State Timeline

A chronological list of all state transitions for this alert, showing:

  • Timestamp of each transition
  • From state → To state
  • The metric value that triggered the transition (e.g., "DLQ count: 47")
  • For operator actions: who took the action and any notes

Remediation Actions

Each alert includes context-appropriate action buttons:

| Alert Type | Available Actions |
| --- | --- |
| DLQ Alert | Replay All, Discard All, View DLQ |
| Projection Stall | Restart Projection, Rebuild Projection, View Projection |
| Agent Unhealthy | View Service, Eject Node |
| Circuit Breaker | View Endpoint |

Acknowledge / Snooze / Clear

Acknowledge — mark the alert as acknowledged without clearing it. The alert remains visible but is no longer considered "unattended."

Snooze — suppress the alert for a specified duration (1 hour, 4 hours, 24 hours). The alert will resurface after the snooze expires.

Clear — close the alert and record the operator action in the audit trail. A note can be added explaining the resolution.

How Metrics-Based Alerts Are Determined

Some alerts (DLQ count, projection lag, agent health, circuit breaker) are driven by direct counts and trigger as soon as a configured threshold is crossed. Metrics-based alerts (throughput abnormal and execution time abnormal) are different: they compare a current reading against a baseline and raise only when the deviation is large enough and has persisted long enough.

The baseline cascade

For each (service, message type) pair the evaluator picks the effective baseline in this order:

  1. Observed history — the average for this service / message type computed from MetricsSample documents in the CritterWatch event store. Only used when it is mature:
    • the oldest sample is at least BaselineMinimumDays old (default 10),
    • there are at least BaselineMinimumSamples buckets (default 100), and
    • for throughput, the observed value is above BaselineMinimumThroughputPerHour (default 1.0/hr) — this stops a single fractional sample from triggering huge multipliers.
  2. Declared baseline — a value supplied by an operator. Comes from one of two sources:
    • configureBaselines on AddCritterWatchMonitoring in the monitored application — see Registration › Declared Baselines.
    • The CritterWatch UI under Settings → Alert Configuration → Baselines.
  3. None → the alert is suppressed during the warmup window. This is the default behaviour for any newly-monitored service that hasn't shipped a declared baseline. Set SuppressThroughputAlertsDuringWarmup / SuppressExecTimeAlertsDuringWarmup to false to opt out (rarely needed — it will alert noisily until the service settles in).
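The selection order above can be sketched as a small function for the throughput case. The record shapes, the `BaselineSource` enum, and the `ThroughputBaseline.Pick` helper are assumptions made for illustration; only the three setting names and their defaults are taken from the list.

```csharp
#nullable enable
using System;

// Hypothetical shapes for illustration; setting names and defaults come from the text above.
public sealed record BaselinePolicy(
    int BaselineMinimumDays = 10,
    int BaselineMinimumSamples = 100,
    double BaselineMinimumThroughputPerHour = 1.0);

public sealed record ObservedThroughput(
    DateTimeOffset OldestSample, int SampleCount, double AveragePerHour);

public enum BaselineSource { Observed, Declared, None }

public static class ThroughputBaseline
{
    // Returns the effective baseline and where it came from.
    // A result of None means the alert would be suppressed during warmup.
    public static (BaselineSource Source, double? PerHour) Pick(
        ObservedThroughput? observed,
        double? declaredPerHour,
        BaselinePolicy policy,
        DateTimeOffset now)
    {
        // Observed history only counts once all three maturity checks pass.
        bool mature = observed is not null
            && now - observed.OldestSample >= TimeSpan.FromDays(policy.BaselineMinimumDays)
            && observed.SampleCount >= policy.BaselineMinimumSamples
            && observed.AveragePerHour > policy.BaselineMinimumThroughputPerHour;

        if (mature) return (BaselineSource.Observed, observed!.AveragePerHour);
        if (declaredPerHour is not null) return (BaselineSource.Declared, declaredPerHour);
        return (BaselineSource.None, null);
    }
}
```

Under these assumptions, a service with only three days of samples and a declared baseline of 200/hr would be evaluated against the declared value until its history matures.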

Each baseline change is recorded as a ThroughputBaselineChanged / ExecTimeBaselineChanged event so the audit trail makes the provenance clear: Source = ServiceCapabilities for values advertised by the monitored app at startup, Source = Operator for values typed into the UI.

Threshold multipliers

Once an effective baseline exists, the evaluator compares current readings against multipliers / percentages:

| Setting | Default | Meaning |
| --- | --- | --- |
| ThroughputWarningMultiplier | 3 | Warn when current rate is ≥ 3× baseline |
| ThroughputCriticalMultiplier | 10 | Critical when current rate is ≥ 10× baseline |
| ExecTimeWarningPercent | 50 | Warn when avg exec time is ≥ baseline + 50% |
| ExecTimeCriticalPercent | 200 | Critical when avg exec time is ≥ baseline + 200% |
| FailureRateWarningPercent | 5 | Warn when failures > 5% of executions |
| FailureRateCriticalPercent | 20 | Critical when failures > 20% of executions |
| DlqRateWarningPerHour | 10 | Warn when DLQ rate ≥ 10/hr |
| DlqRateCriticalPerHour | 50 | Critical when DLQ rate ≥ 50/hr |
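The arithmetic behind the throughput and exec-time rows can be sketched as follows. `MetricSeverity` and `MetricThresholds` are hypothetical names; the default values and the ≥ comparisons follow the table, and the sketch assumes "baseline + 50%" means baseline × 1.5.

```csharp
// Hypothetical helper for illustration; not part of the CritterWatch API.
public enum MetricSeverity { None, Warning, Critical }

public static class MetricThresholds
{
    // Throughput: the current rate is compared against baseline × multiplier.
    public static MetricSeverity Throughput(
        double currentPerHour, double baselinePerHour,
        double warningMultiplier = 3, double criticalMultiplier = 10)
    {
        if (currentPerHour >= baselinePerHour * criticalMultiplier) return MetricSeverity.Critical;
        if (currentPerHour >= baselinePerHour * warningMultiplier) return MetricSeverity.Warning;
        return MetricSeverity.None;
    }

    // Exec time: the current average is compared against baseline plus a percentage of baseline.
    public static MetricSeverity ExecTime(
        double currentMs, double baselineMs,
        double warningPercent = 50, double criticalPercent = 200)
    {
        if (currentMs >= baselineMs * (1 + criticalPercent / 100.0)) return MetricSeverity.Critical;
        if (currentMs >= baselineMs * (1 + warningPercent / 100.0)) return MetricSeverity.Warning;
        return MetricSeverity.None;
    }
}
```

With a baseline of 200/hr, these defaults warn at 600/hr and go critical at 2,000/hr. With a 25 ms baseline, exec time warns at 37.5 ms and goes critical at 75 ms.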

Hysteresis (K-of-N) — noise suppression

A single noisy evaluation never raises or resolves an alert by itself. Instead the evaluator tracks consecutive passes:

| Setting | Default | Meaning |
| --- | --- | --- |
| HysteresisRaisePasses | 2 | A breach must be observed in this many consecutive passes before the alert is raised (or its severity changed). |
| HysteresisResolvePasses | 2 | A below-threshold reading must be observed in this many consecutive passes before an active alert auto-resolves. |

With the default 30-second evaluator cadence this means ~60 seconds of confirmation either way before any state change reaches the user. Increase the values to be more conservative; setting either to 1 disables that side of the hysteresis.
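The consecutive-pass rule can be sketched as a small state machine. `HysteresisGate` is a hypothetical class written for illustration, and it models only the raise and resolve sides; severity changes, which the text says use the same raise count, are left out.

```csharp
// Hypothetical sketch of consecutive-pass hysteresis; not CritterWatch's evaluator.
public sealed class HysteresisGate
{
    private readonly int _raisePasses;
    private readonly int _resolvePasses;
    private int _breachStreak;
    private int _clearStreak;

    // True while the alert is considered active.
    public bool Active { get; private set; }

    public HysteresisGate(int raisePasses = 2, int resolvePasses = 2)
    {
        _raisePasses = raisePasses;
        _resolvePasses = resolvePasses;
    }

    // Feed one evaluator pass. Returns true only when Active flipped on this pass.
    public bool Observe(bool breached)
    {
        // Any opposite reading resets the other streak, so passes must be consecutive.
        if (breached) { _breachStreak++; _clearStreak = 0; }
        else { _clearStreak++; _breachStreak = 0; }

        if (!Active && _breachStreak >= _raisePasses) { Active = true; return true; }
        if (Active && _clearStreak >= _resolvePasses) { Active = false; return true; }
        return false;
    }
}
```

With the defaults, the sequence breach, clear, breach, breach raises the alert only on the fourth pass, because the single clear reading resets the breach streak.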

Cascade order

Every metrics-based threshold and every baseline-policy knob supports the same three-level cascade — most specific wins:

  1. Per-message-type override (MessageTypeAlertThresholds)
  2. Per-service override (ServiceMetricsAlertOverrides)
  3. Global default (MetricsAlertDefaults)

That includes the multipliers, the failure-rate / DLQ thresholds, the hysteresis pass counts, the warmup-suppression flags, and the declared baselines themselves.
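One way to picture "most specific wins" is a null-coalescing chain, where an unset override falls through to the next level. The `ThresholdCascade.Resolve` helper below is a hypothetical illustration and does not reflect how `MessageTypeAlertThresholds`, `ServiceMetricsAlertOverrides`, or `MetricsAlertDefaults` are actually shaped.

```csharp
// Hypothetical helper for illustration; not part of the CritterWatch API.
public static class ThresholdCascade
{
    // A null override means "inherit from the next, less specific level".
    public static T Resolve<T>(T? messageTypeOverride, T? serviceOverride, T globalDefault)
        where T : struct
        => messageTypeOverride ?? serviceOverride ?? globalDefault;
}
```

For example, `ThresholdCascade.Resolve<double>(null, 5.0, 3.0)` yields the service-level value 5.0, while leaving both overrides null yields the global default 3.0.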

Editing Thresholds

There are two ways to change any of the settings above.

From the CritterWatch UI

Settings → Alert Configuration exposes three tabs matching the cascade:

  • Global Defaults — applied to every service unless overridden.
  • Service Overrides — pick a service and override any global value. Leave fields blank to inherit from the global default.
  • Message Type Overrides — pick a service + message type combination and override any value. Leave fields blank to inherit from the service or global default.

Edits are stored as documents in the CritterWatch event store and take effect on the next evaluator pass (≤30s). Every edit produces an AlertConfigChanged audit entry, and edits to declared baselines additionally emit ThroughputBaselineChanged / ExecTimeBaselineChanged events with Source = Operator.

Programmatically — configureBaselines

For declared baselines specifically, the cleanest way to seed values for a fresh service is at the monitored side:

```csharp
opts.AddCritterWatchMonitoring(
    critterWatchUri,
    systemControlUri,
    configureBaselines: baselines => baselines
        // Declared baseline for the service as a whole
        .ForService(throughputPerHour: 200, avgExecTimeMs: 25)
        // Declared baseline for the TripBooked message type
        .For<TripBooked>(throughputPerHour: 50, avgExecTimeMs: 40)
);
```

These values flow over on first contact. An operator can later refine them through the UI, and the operator-supplied value then takes precedence in the cascade: configureBaselines only seeds the values, while the UI is authoritative.

Preset profiles

Two convenience presets are available from the Settings page:

  • Production Profile — strict thresholds, suitable for production environments
  • Development Profile — relaxed thresholds, reduces noise during development

See Configuration Reference for the full programmatic API.

Released under the MIT License.