Alert System

The alert engine runs inside the CritterWatch console. It evaluates incoming telemetry against thresholds, manages a Raised → Elevated → Reduced → Resolved → Cleared lifecycle, and pushes alerts to the browser and to any configured notification channels.

This page is the integrator's view: what's evaluated, where the configuration lives, and what gets persisted. If you're tuning thresholds for an actual deployment, use the operator-facing page at Alert Configuration instead.

What gets evaluated

Source	What triggers it
Dead-letter alerts	DLQ count or DLQ-rate-per-hour exceeds a threshold for a service / message type.
Projection lag alerts	A projection shard is too far behind the high-water-mark, or hasn't advanced for too long (stale).
Agent health alerts	Agent reports `Degraded` or `Offline` for `N` consecutive 60-second probes.
Circuit breaker alerts	A circuit breaker trips on any endpoint. Always Critical.
Back pressure alerts	Back pressure activates on any endpoint.
Throughput-abnormal alerts	Current throughput rate exceeds the baseline by a configured multiplier. The alert is raised only after enough confirming evaluations (hysteresis).
Execution-time-abnormal alerts	Average execution time exceeds the baseline by a configured percentage. The same hysteresis applies.

The baseline-relative alerts (throughput, execution time) compare against a baseline that is either observed from accumulated history or declared by the monitored service via `configureBaselines`. See Alerts → Threshold Configuration for the cascade.
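A minimal sketch of how a baseline comparison with hysteresis could behave. The class name, the `raise_after` and `resolve_after` knobs, and the choice to count consecutive passes are assumptions made for illustration; they are not CritterWatch's actual evaluator or setting names.

```python
from dataclasses import dataclass, field


@dataclass
class BaselineAlert:
    baseline: float       # observed or declared baseline (e.g. messages/sec)
    multiplier: float     # breach when reading > baseline * multiplier
    raise_after: int      # hypothetical knob: breaching passes needed to raise
    resolve_after: int    # hypothetical knob: healthy passes needed to resolve
    raised: bool = False
    _breaches: int = field(default=0, init=False, repr=False)
    _healthy: int = field(default=0, init=False, repr=False)

    def evaluate(self, reading: float) -> bool:
        """Feed one evaluator pass; return whether the alert is currently raised."""
        if reading > self.baseline * self.multiplier:
            self._breaches += 1
            self._healthy = 0
        else:
            self._healthy += 1
            self._breaches = 0

        # Assumption: only an unbroken run of confirming passes flips the state.
        if not self.raised and self._breaches >= self.raise_after:
            self.raised = True
        elif self.raised and self._healthy >= self.resolve_after:
            self.raised = False
        return self.raised


alert = BaselineAlert(baseline=100.0, multiplier=2.0, raise_after=3, resolve_after=2)
readings = [250, 250, 150, 250, 250, 250, 150, 150]
states = [alert.evaluate(r) for r in readings]
```

In this sketch a single healthy reading resets the breach streak, so a brief spike never raises an alert, and one healthy reading alone does not resolve it.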

Configuration cascade

For every threshold and policy knob:

Per-message-type override wins.
Else per-service override.
Else global default.
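The lookup order above can be sketched as a three-level fallback. The function name, the dictionary layout, and the `dlq_count_threshold` key are hypothetical and only illustrate the precedence; the real configuration store may be shaped differently.

```python
def resolve_setting(key, service, message_type, *,
                    global_defaults, service_overrides, message_type_overrides):
    """Return the most specific value configured for `key`."""
    # 1. Per-message-type override wins.
    by_type = message_type_overrides.get((service, message_type), {})
    if key in by_type:
        return by_type[key]
    # 2. Else the per-service override.
    by_service = service_overrides.get(service, {})
    if key in by_service:
        return by_service[key]
    # 3. Else the global default.
    return global_defaults[key]


# Illustrative values only.
global_defaults = {"dlq_count_threshold": 100}
service_overrides = {"orders": {"dlq_count_threshold": 50}}
message_type_overrides = {("orders", "PlaceOrder"): {"dlq_count_threshold": 10}}

stores = dict(global_defaults=global_defaults,
              service_overrides=service_overrides,
              message_type_overrides=message_type_overrides)

place_order = resolve_setting("dlq_count_threshold", "orders", "PlaceOrder", **stores)
cancel_order = resolve_setting("dlq_count_threshold", "orders", "CancelOrder", **stores)
billing = resolve_setting("dlq_count_threshold", "billing", "Invoice", **stores)
```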

Edits made through the Alert Configuration UI take effect on the next evaluator pass (≤ 30 seconds). Edits to declared baselines emit a `ThroughputBaselineChanged` or `ExecTimeBaselineChanged` audit event.

Alert lifecycle

Raised → Elevated → Reduced → Resolved → Cleared

Raised — first time the threshold is breached (after hysteresis confirms).
Elevated — the condition persists past an escalation duration; severity bumps from Warning to Critical.
Reduced — the condition is improving but hasn't returned below threshold yet.
Resolved — for system-condition alerts (DLQ, projection lag, etc.), this is automatic when the metric returns below threshold.
Cleared — operator explicitly closes the alert. Required for operational alerts (manual operations, node ejections); optional for system alerts.

Each transition records a fact in the alert's history — the metric value at the time, the timestamp, and (for operator transitions) who clicked it.
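A small sketch of the lifecycle as a state machine that appends a fact per transition. The type names are hypothetical, and the rule that an alert only moves forward along the documented order (possibly skipping states) is an assumption; the real engine may permit other transitions.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from enum import IntEnum
from typing import List, Optional


class AlertState(IntEnum):
    RAISED = 1
    ELEVATED = 2
    REDUCED = 3
    RESOLVED = 4
    CLEARED = 5


@dataclass
class Transition:
    state: AlertState
    metric_value: float
    at: datetime
    operator: Optional[str] = None  # set only for operator transitions


@dataclass
class Alert:
    history: List[Transition] = field(default_factory=list)

    @property
    def state(self) -> Optional[AlertState]:
        return self.history[-1].state if self.history else None

    def transition(self, new_state, metric_value, operator=None, at=None):
        # Assumption: states only advance in the documented order.
        if self.state is not None and new_state <= self.state:
            raise ValueError(f"cannot move from {self.state.name} to {new_state.name}")
        # Cleared is an explicit operator action, so record who did it.
        if new_state is AlertState.CLEARED and operator is None:
            raise ValueError("Cleared requires an operator")
        stamp = at or datetime.now(timezone.utc)
        self.history.append(Transition(new_state, metric_value, stamp, operator))


alert = Alert()
alert.transition(AlertState.RAISED, 120.0)
alert.transition(AlertState.ELEVATED, 180.0)
alert.transition(AlertState.RESOLVED, 40.0)
alert.transition(AlertState.CLEARED, 40.0, operator="alice")
```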

Auto-resolution vs manual clear

Alert kind	Auto-resolves?
DLQ count / rate	Yes — when the count drops below threshold.
Projection lag / stall	Yes — when the projection resumes advancing.
Agent health	Yes — when the agent reports `Healthy`.
Circuit breaker	Yes — when the breaker resets.
Back pressure	Yes — when back pressure lifts.
Throughput / exec-time abnormal	Yes — when readings return below threshold for the resolve-hysteresis window.
Operational (DLQ replay, manual operation)	No — operator clears explicitly.

What's persisted

Each alert is stored as a snapshot document with its full transition history embedded, so there's no separate per-transition table to query. The documents live in the `critterwatch` Marten schema.
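To illustrate what "history embedded in the snapshot" means for a consumer, here is a hypothetical document shape. Every field name and value below is invented for illustration and is not the actual CritterWatch schema; the point is only that the history is read straight from the one document.

```python
import json

# Hypothetical snapshot document; field names and values are illustrative only.
alert_doc = {
    "id": "alert-0001",
    "kind": "ProjectionLag",
    "service": "orders",
    "state": "Resolved",
    "history": [
        {"state": "Raised", "metricValue": 1200, "at": "2024-01-01T10:00:00Z", "operator": None},
        {"state": "Elevated", "metricValue": 4800, "at": "2024-01-01T10:15:00Z", "operator": None},
        {"state": "Resolved", "metricValue": 90, "at": "2024-01-01T10:40:00Z", "operator": None},
    ],
}


def states_visited(doc):
    """Read the lifecycle directly from the embedded history; no join needed."""
    return [entry["state"] for entry in doc["history"]]


# The whole alert, history included, survives a single JSON round trip.
round_tripped = json.loads(json.dumps(alert_doc))
visited = states_visited(round_tripped)
```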

The audit log records every operator action on alerts (Acknowledge, Snooze, Clear) with optional notes. Configuration edits are tracked separately on the Alert Configuration → History tab.

Notification channels

Alerts are pushed to channels configured in Settings → Notification Channels. Currently supported:

Slack (Incoming Webhook)
Discord (Webhook)
Microsoft Teams (Incoming Webhook)

Each channel has an alert-severity allowlist. For example, you can route Warning to Slack and, once PagerDuty is supported, Critical to PagerDuty. Webhook URLs are stored in the CritterWatch database (PostgreSQL) and never leave the host.
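A sketch of allowlist-based routing, assuming each channel carries a set of severities it accepts. The `Channel` type, the channel names, and the `channels_for` helper are hypothetical.

```python
from dataclasses import dataclass
from typing import FrozenSet, List


@dataclass(frozen=True)
class Channel:
    name: str
    severities: FrozenSet[str]  # the channel's alert-severity allowlist


def channels_for(severity: str, channels: List[Channel]) -> List[str]:
    """Return the names of the channels whose allowlist includes `severity`."""
    return [c.name for c in channels if severity in c.severities]


# Illustrative configuration only.
channels = [
    Channel("slack-ops", frozenset({"Warning", "Critical"})),
    Channel("teams-oncall", frozenset({"Critical"})),
]

warning_targets = channels_for("Warning", channels)
critical_targets = channels_for("Critical", channels)
```

With this configuration a Warning reaches only the Slack channel, while a Critical reaches both.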

Email, generic webhook, and PagerDuty channels are on the roadmap.

Suppression and snooze

Snooze — operator silences an active alert for a duration (1h / 4h / 24h). The alert resurfaces after expiry.
Per-shard suppress — projection-shard-level suppression for known-broken / known-rebuilding shards. Configured under Alert Configuration → Per-Shard Overrides.
Warmup suppression — by default, throughput and exec-time alerts are suppressed until the service has accumulated enough samples for an observed baseline. Toggleable per service.
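The three mechanisms above can be combined into a single check, sketched below. The function and parameter names are hypothetical, and `min_samples` stands in for whatever "enough samples" means in the real engine.

```python
from datetime import datetime, timedelta, timezone

# The snooze durations listed above.
SNOOZE_CHOICES = {
    "1h": timedelta(hours=1),
    "4h": timedelta(hours=4),
    "24h": timedelta(hours=24),
}


def snooze_until(now: datetime, choice: str) -> datetime:
    return now + SNOOZE_CHOICES[choice]


def is_suppressed(now, *, snoozed_until=None, shard_suppressed=False,
                  baseline_relative=False, warmup_enabled=True,
                  samples_seen=0, min_samples=0) -> bool:
    # Snooze: silent until expiry, then the alert resurfaces.
    if snoozed_until is not None and now < snoozed_until:
        return True
    # Per-shard suppress: known-broken or known-rebuilding shards.
    if shard_suppressed:
        return True
    # Warmup: applies only to throughput and exec-time alerts.
    if baseline_relative and warmup_enabled and samples_seen < min_samples:
        return True
    return False


now = datetime(2024, 1, 1, 12, 0, tzinfo=timezone.utc)
until = snooze_until(now, "4h")
```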
