Alert System
The alert engine runs inside the CritterWatch console. It evaluates incoming telemetry against thresholds, manages a Raised → Elevated → Reduced → Resolved → Cleared lifecycle, and pushes alerts to the browser and to any configured notification channels.
This page is the integrator's view: what's evaluated, where the configuration lives, and what gets persisted. If you're tuning thresholds for an actual deployment, the operator-facing surface is at Alert Configuration.
What gets evaluated
| Source | What triggers it |
|---|---|
| Dead-letter alerts | DLQ count or DLQ-rate-per-hour exceeds a threshold for a service / message type. |
| Projection lag alerts | A projection shard is too far behind the high-water-mark, or hasn't advanced for too long (stale). |
| Agent health alerts | Agent reports Degraded or Offline for N consecutive 60-second probes. |
| Circuit breaker alerts | A circuit breaker trips on any endpoint. Always Critical. |
| Back pressure alerts | Back pressure activates on any endpoint. |
| Throughput-abnormal alerts | Current throughput rate exceeds the baseline by a configured multiplier — only raises after enough confirming evaluations (hysteresis). |
| Execution-time-abnormal alerts | Average execution time exceeds the baseline by a configured percentage, with the same hysteresis as the throughput alerts. |
The baseline-relative alerts (throughput, execution time) compare the current reading against a baseline that is either observed from accumulated history or declared by the monitored service via configureBaselines. See Alerts → Threshold Configuration for the cascade.
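As an illustration of the hysteresis behaviour described for the throughput alert, here is a minimal Python sketch. It is not the CritterWatch implementation: the class name, the `confirmations` parameter, and the choice to require consecutive breaching evaluations are all assumptions made for the example. Resolution is deliberately left out.

```python
from dataclasses import dataclass


@dataclass
class MultiplierAlertEvaluator:
    """Hypothetical evaluator: raise only after enough confirming evaluations."""

    baseline: float          # observed or declared baseline rate
    multiplier: float        # configured multiplier (name is an assumption)
    confirmations: int = 3   # assumed hysteresis setting, not a documented default
    _streak: int = 0
    raised: bool = False

    def evaluate(self, current_rate: float) -> bool:
        """Feed one reading; return True once the alert has been raised."""
        if current_rate > self.baseline * self.multiplier:
            self._streak += 1
        else:
            # Assumption: a single non-breaching reading restarts the count.
            self._streak = 0
        if self._streak >= self.confirmations:
            self.raised = True
        return self.raised
```

With a baseline of 100, a multiplier of 2 and three confirmations, a single spike to 250 does not raise anything; three breaching readings in a row do.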
Configuration cascade
For every threshold and policy knob:
- Per-message-type override wins.
- Else per-service override.
- Else global default.
Edits made through the Alert Configuration UI take effect on the next evaluator pass (≤ 30 seconds). Edits to declared baselines emit a ThroughputBaselineChanged / ExecTimeBaselineChanged audit event.
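The lookup order above can be sketched in a few lines of Python. The function name, the dictionary layout, and the knob name `dlq_count` are hypothetical and chosen only to show the precedence; the real storage format is not described on this page.

```python
from typing import Any, Dict, Optional, Tuple


def resolve_threshold(
    knob: str,
    service: str,
    message_type: Optional[str],
    global_defaults: Dict[str, Any],
    service_overrides: Dict[str, Dict[str, Any]],
    message_type_overrides: Dict[Tuple[str, str], Dict[str, Any]],
) -> Any:
    """Return the effective value for one knob using the three-level cascade."""
    # 1. A per-message-type override wins when one exists.
    if message_type is not None:
        value = message_type_overrides.get((service, message_type), {}).get(knob)
        if value is not None:
            return value
    # 2. Otherwise fall back to the per-service override.
    value = service_overrides.get(service, {}).get(knob)
    if value is not None:
        return value
    # 3. Otherwise use the global default.
    return global_defaults[knob]
```

For example, if the global default is 100, the `orders` service overrides it to 50, and the `PlaceOrder` message type overrides it to 10, then `PlaceOrder` resolves to 10, any other `orders` message type resolves to 50, and every other service resolves to 100.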
Alert lifecycle
Raised → Elevated → Reduced → Resolved → Cleared
- Raised — first time the threshold is breached (after hysteresis confirms).
- Elevated — the condition persists past an escalation duration; severity bumps from Warning to Critical.
- Reduced — the condition is improving but hasn't returned below threshold yet.
- Resolved — for system-condition alerts (DLQ, projection lag, etc.), this is automatic when the metric returns below threshold.
- Cleared — operator explicitly closes the alert. Required for operational alerts (manual operations, node ejections); optional for system alerts.
Each transition records a fact in the alert's history — the metric value at the time, the timestamp, and (for operator transitions) who clicked it.
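To make the idea of a transition history concrete, the following Python sketch models an alert whose current status is simply the last recorded fact. All type and field names are assumptions for illustration, and the sketch does not enforce which transitions are legal, since this page does not spell that out.

```python
from dataclasses import dataclass, field
from datetime import datetime
from enum import Enum
from typing import List, Optional


class AlertStatus(Enum):
    RAISED = "Raised"
    ELEVATED = "Elevated"
    REDUCED = "Reduced"
    RESOLVED = "Resolved"
    CLEARED = "Cleared"


@dataclass(frozen=True)
class Transition:
    status: AlertStatus
    metric_value: float          # metric value at the time of the transition
    at: datetime                 # when the transition happened
    actor: Optional[str] = None  # set only for operator transitions


@dataclass
class Alert:
    history: List[Transition] = field(default_factory=list)

    @property
    def status(self) -> Optional[AlertStatus]:
        """Current status is the most recent transition, if any."""
        return self.history[-1].status if self.history else None

    def transition(self, status: AlertStatus, metric_value: float,
                   at: datetime, actor: Optional[str] = None) -> None:
        # Cleared is an explicit operator action, so an actor is required.
        if status is AlertStatus.CLEARED and actor is None:
            raise ValueError("Cleared requires the operator who closed the alert")
        self.history.append(Transition(status, metric_value, at, actor))
```

Keeping the history as an append-only list inside the alert means the alert and its transitions can be read back together as one record.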
Auto-resolution vs manual clear
| Alert kind | Auto-resolves? |
|---|---|
| DLQ count / rate | Yes — when the count drops below threshold. |
| Projection lag / stall | Yes — when the projection resumes advancing. |
| Agent health | Yes — when the agent reports Healthy. |
| Circuit breaker | Yes — when the breaker resets. |
| Back pressure | Yes — when back pressure lifts. |
| Throughput / exec-time abnormal | Yes — when readings return below threshold for the resolve-hysteresis window. |
| Operational (DLQ replay, manual operation) | No — operator clears explicitly. |
What's persisted
Each alert is stored as a snapshot document with its full transition history embedded, so there's no separate per-transition table to query. The documents live in the critterwatch Marten schema.
The audit log records every operator action on alerts (Acknowledge, Snooze, Clear) with optional notes. Configuration edits are tracked separately on the Alert Configuration → History tab.
Notification channels
Alerts are pushed to channels configured in Settings → Notification Channels. Currently supported:
- Slack (Incoming Webhook)
- Discord (Webhook)
- Microsoft Teams (Incoming Webhook)
Each channel has an alert-severity allowlist — you can route Critical to PagerDuty (when supported) and Warning to Slack, for example. Webhook URLs are stored in the CritterWatch database (PostgreSQL) and never leave the host.
Email, generic webhook, and PagerDuty channels are on the roadmap.
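A severity allowlist amounts to a filter over the configured channels. The Python sketch below shows that filter only; the `Channel` type, the channel names, and the severity strings are assumptions, and no webhook is actually called.

```python
from dataclasses import dataclass
from typing import FrozenSet, List


@dataclass(frozen=True)
class Channel:
    name: str                   # hypothetical display name of the channel
    severities: FrozenSet[str]  # the channel's alert-severity allowlist


def channels_for(severity: str, channels: List[Channel]) -> List[str]:
    """Return the names of channels whose allowlist includes this severity."""
    return [c.name for c in channels if severity in c.severities]
```

With one Slack channel that allows Warning and Critical and one Teams channel that allows only Critical, a Warning alert would go to Slack alone, while a Critical alert would go to both.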
Suppression and snooze
- Snooze — operator silences an active alert for a duration (1h / 4h / 24h). The alert resurfaces after expiry.
- Per-shard suppress — projection-shard-level suppression for known-broken / known-rebuilding shards. Configured under Alert Configuration → Per-Shard Overrides.
- Warmup suppression — by default, throughput and exec-time alerts are suppressed until the service has accumulated enough samples for an observed baseline. Toggleable per service.
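Snooze and warmup suppression can both be read as checks that run before an alert is surfaced. The Python sketch below is one way to express them under stated assumptions: the function names and the `min_samples` parameter are hypothetical, and the warmup check is meant only for throughput and exec-time alerts. Per-shard suppression is not modelled.

```python
from datetime import datetime, timedelta
from typing import Optional

# The three snooze durations offered to operators.
SNOOZE_DURATIONS = {
    "1h": timedelta(hours=1),
    "4h": timedelta(hours=4),
    "24h": timedelta(hours=24),
}


def snooze_until(now: datetime, choice: str) -> datetime:
    """Return the moment at which a snoozed alert resurfaces."""
    return now + SNOOZE_DURATIONS[choice]


def is_suppressed(
    now: datetime,
    snoozed_until: Optional[datetime],
    sample_count: int,
    min_samples: int,             # assumed size of "enough samples"
    warmup_enabled: bool = True,  # the per-service toggle
) -> bool:
    """True if the alert should stay hidden right now."""
    # An active snooze hides the alert until it expires.
    if snoozed_until is not None and now < snoozed_until:
        return True
    # Warmup: not enough samples yet for an observed baseline.
    if warmup_enabled and sample_count < min_samples:
        return True
    return False
```

An alert snoozed for 4h at noon is suppressed at 13:00 and visible again at 17:00, provided the service has already passed warmup.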
