Operational History
This page is about the history CritterWatch keeps — the timeline you scroll through after an incident, the audit log of every operator click, the alert lifecycle from raised to cleared. The mechanism behind it is event sourcing in Marten, but as an operator the parts you actually care about are: what's persisted, how long, and how to query it.
What's kept
| Surface | What it records | Retention |
|---|---|---|
| Activity Timeline | Every notable event from every monitored service — node added, leader elected, agent reassigned, projection rebuilt, DLQ replay, alert raised. Reverse-chronological feed. | Configurable; default 30 days. |
| Audit Log | Every operator action that mutated a service. Includes who, when, what, and the parameters. | Configurable; default 30 days. |
| Alert lifecycle | Each alert's full transition history (raised, elevated, reduced, resolved, cleared), with timestamps and the metric value at each transition. | Kept in full until the alert is cleared, then retained for the same window as the timeline. |
| Metrics buckets | Per-service per-message-type throughput / exec-time / DLQ rate samples. Drives the trend charts on the Metrics tab. | Configurable; default 1000 buckets per (service, message type). |
| Live service state | Current snapshot of every service — nodes, agents, endpoints, tenants, configuration. | Always current; reconstructable from telemetry on first contact after a wipe. |
Tunables live on the Settings → Data Retention card.
What you can do with it
Reconstruct an incident
The Timeline + Audit Log together let you answer "what happened, in what order, and who was involved" for any past incident inside the retention window. Filter by service, severity, and time range; export the audit log as CSV for a write-up.
A couple of patterns:
- "When did this start?" — open the Timeline, set the severity to Warning + Critical, look for the first transition near the reported start time.
- "Who changed the threshold?" — Alert Configuration → History tab; that's a separate, narrower history of just configuration edits.
- "What did the alert look like before someone cleared it?" — open the alert detail in the Alerts page. The state timeline shows every transition with the metric values that triggered them; the operator's note (if any) is on the Cleared transition.
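Once the audit log is exported, narrowing it to the incident window is a few lines of scripting. The sketch below is only an illustration: the column names (`timestamp`, `operator`, `service`, `action`, `parameters`), the action names, and the sample rows are assumptions, not CritterWatch's documented CSV schema, so adjust them to match the file you actually export.

```python
import csv
import io
from datetime import datetime

# Hypothetical export. Column and action names are assumptions for illustration.
SAMPLE_EXPORT = """timestamp,operator,service,action,parameters
2024-05-01T09:58:12+00:00,alice,trip-service,PauseListener,endpoint=rabbitmq://trips
2024-05-01T10:03:40+00:00,bob,trip-service,ReplayDeadLetters,count=25
2024-05-01T11:15:02+00:00,alice,billing-service,RebuildProjection,name=Invoices
"""


def entries_in_window(csv_text, service, start, end):
    """Return audit rows for one service inside [start, end], oldest first."""
    rows = []
    for row in csv.DictReader(io.StringIO(csv_text)):
        when = datetime.fromisoformat(row["timestamp"])
        if row["service"] == service and start <= when <= end:
            rows.append({**row, "timestamp": when})
    return sorted(rows, key=lambda r: r["timestamp"])


window_start = datetime.fromisoformat("2024-05-01T09:30:00+00:00")
window_end = datetime.fromisoformat("2024-05-01T10:30:00+00:00")
incident = entries_in_window(SAMPLE_EXPORT, "trip-service", window_start, window_end)

for entry in incident:
    print(entry["timestamp"].isoformat(), entry["operator"], entry["action"])
```

Sorting by timestamp gives the "in what order, and who was involved" view directly, which is usually the first table in a write-up.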
Query historical state
Because every change is stored as a timestamped event, the Event Store view (UI → Event Store) lets you browse what CritterWatch knew about each service at any point in time. This is mostly useful during post-mortems, for questions like "what did trip-service look like 2 hours before the outage?"
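The idea behind a point-in-time view can be sketched as replaying events up to a chosen moment and ignoring everything after it. This is a simplified illustration of the event-sourcing technique, not CritterWatch's or Marten's actual code; the event kinds (`NodeAdded`, `NodeRemoved`, `LeaderElected`) and the `ServiceState` shape are hypothetical.

```python
from dataclasses import dataclass, field
from datetime import datetime
from typing import Optional


@dataclass(frozen=True)
class Event:
    at: datetime
    kind: str  # hypothetical event names, for illustration only
    node: str


@dataclass
class ServiceState:
    nodes: set = field(default_factory=set)
    leader: Optional[str] = None


def apply(state, event):
    """Fold one event into the state."""
    if event.kind == "NodeAdded":
        state.nodes.add(event.node)
    elif event.kind == "NodeRemoved":
        state.nodes.discard(event.node)
        if state.leader == event.node:
            state.leader = None
    elif event.kind == "LeaderElected":
        state.leader = event.node
    return state


def state_as_of(events, moment):
    """Rebuild the state from scratch using only events at or before `moment`."""
    state = ServiceState()
    for event in sorted(events, key=lambda e: e.at):
        if event.at > moment:
            break
        apply(state, event)
    return state


history = [
    Event(datetime(2024, 5, 1, 8, 0), "NodeAdded", "node-1"),
    Event(datetime(2024, 5, 1, 8, 5), "NodeAdded", "node-2"),
    Event(datetime(2024, 5, 1, 8, 6), "LeaderElected", "node-1"),
    Event(datetime(2024, 5, 1, 9, 30), "NodeRemoved", "node-1"),
    Event(datetime(2024, 5, 1, 9, 31), "LeaderElected", "node-2"),
]

before = state_as_of(history, datetime(2024, 5, 1, 9, 0))
after = state_as_of(history, datetime(2024, 5, 1, 10, 0))
print(sorted(before.nodes), before.leader)
print(sorted(after.nodes), after.leader)
```

Nothing is overwritten in this model, so asking about an earlier moment is just a shorter replay of the same event list.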
Trust the audit log
Every state-changing operator action (see the commands list) records an entry with the operator's identity (when authentication is configured), the parameters that were sent, and a timestamp. The entry is recorded before the command leaves the console, so even if the target service rejects the command or the transport is down, you still see the click in the audit log.
Sensitive actions (hard-delete tenant) record extra parameters at the moment of confirmation — the typed tenant id, the database URI, the wall-clock confirmation time. See Audit Log → HardDeleteTenant Entries.
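The "record first, send second" ordering described above can be illustrated with a small sketch. It assumes an in-memory list as the audit log and a stand-in transport function; `send_with_audit`, `TransportDown`, and the `PauseListener` command name are hypothetical and exist only to show why the entry survives a failed send.

```python
from datetime import datetime, timezone


class TransportDown(Exception):
    """Raised by the stand-in transport when delivery fails."""


def send_with_audit(audit_log, send, operator, command, parameters):
    """Append the audit entry before attempting delivery."""
    entry = {
        "operator": operator,
        "command": command,
        "parameters": dict(parameters),
        "recorded_at": datetime.now(timezone.utc),
    }
    audit_log.append(entry)    # recorded first, so it exists even if send() raises
    send(command, parameters)  # may fail; the entry above is already in the log
    return entry


def broken_transport(command, parameters):
    """Stand-in for a transport that is currently unavailable."""
    raise TransportDown("transport unavailable")


audit_log = []
try:
    send_with_audit(
        audit_log,
        broken_transport,
        operator="alice",
        command="PauseListener",
        parameters={"endpoint": "rabbitmq://trips"},
    )
    delivered = True
except TransportDown:
    delivered = False

print(delivered, len(audit_log))
```

The trade-off of this ordering is that an entry in the log proves the operator issued the command, not that the target service carried it out.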
What CritterWatch does not keep
- Message bodies are not persisted by default. Bodies are fetched on demand when an operator opens a specific dead-letter or scheduled message. The DLQ data lives in your service's database, not CritterWatch's.
- Connection strings for tenant databases are not persisted. The console knows the database URI (host + database name) for identification only; the connection string travels through the transport at the moment a tenant is added and is never written to CritterWatch's own database.
- Application data (your domain events, aggregates, read models) lives in your services' databases. CritterWatch reports on the operational events that affect message processing — your business data is untouched.
When the database fills up
The two surfaces that grow with activity are the Timeline (service events) and the Audit Log (operator actions). If your CritterWatch database is ballooning, the most likely cause is a high-event-rate service generating many timeline entries. Two levers:
- Shorten retention under Settings → Data Retention. The 30-day default is conservative; many teams run with 7 or 14 days.
- Reduce metrics-bucket retention. Metrics charts are nice-to-have for trend analysis; if you don't reach for them, lower the bucket count. The default of 1000 buckets per (service, message type) is enough for most uses.
The audit log is intentionally cheap to retain — operator actions are infrequent.
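To make the two levers concrete, the sketch below models them as an age cutoff for timeline entries and a fixed-size ring per (service, message type) for metrics buckets. How CritterWatch actually prunes its tables is not described here, so treat `prune_by_age` and `BucketRing` as hypothetical helpers that only show the effect of each setting.

```python
from collections import deque
from datetime import datetime, timedelta


def prune_by_age(entries, now, retention_days):
    """Keep only entries newer than the retention cutoff."""
    cutoff = now - timedelta(days=retention_days)
    return [e for e in entries if e["at"] >= cutoff]


class BucketRing:
    """Keeps only the newest `max_buckets` samples per (service, message type)."""

    def __init__(self, max_buckets):
        self.max_buckets = max_buckets
        self._rings = {}

    def add(self, service, message_type, sample):
        key = (service, message_type)
        ring = self._rings.setdefault(key, deque(maxlen=self.max_buckets))
        ring.append(sample)  # the oldest sample is dropped once the ring is full

    def samples(self, service, message_type):
        return list(self._rings.get((service, message_type), ()))


now = datetime(2024, 6, 1)
timeline = [
    {"at": now - timedelta(days=d), "event": f"event-{d}"} for d in (1, 10, 20, 40)
]
kept_30 = prune_by_age(timeline, now, retention_days=30)
kept_7 = prune_by_age(timeline, now, retention_days=7)

ring = BucketRing(max_buckets=3)
for i in range(5):
    ring.add("trip-service", "TripRequested", i)

print(len(kept_30), len(kept_7), ring.samples("trip-service", "TripRequested"))
```

Timeline storage scales with both event rate and retention days, while metrics storage scales with the bucket count multiplied by the number of (service, message type) pairs, which is why both levers matter for a busy installation.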
Recovery
Because service state is reconstructable from live telemetry, you can drop and recreate the CritterWatch database without losing the current picture. What you'd lose is history — timeline, audit log, alert lifecycle, metrics buckets. There's no built-in backup/restore today; back up the PostgreSQL database the way you back up any other app database.
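As one way to fold CritterWatch into an existing backup routine, the sketch below builds a standard `pg_dump` command line for a scheduled job. The host, database name, and user are placeholders, not CritterWatch defaults, and the script only prints the command instead of running it.

```python
import shlex
from datetime import date


def pg_dump_command(host, database, user, on=None):
    """Build a pg_dump invocation that writes a custom-format archive."""
    stamp = (on or date.today()).isoformat()
    outfile = f"{database}-{stamp}.dump"
    return [
        "pg_dump",
        "--host", host,
        "--username", user,
        "--format=custom",  # custom-format archives are restored with pg_restore
        "--file", outfile,
        database,
    ]


# Placeholder connection details; substitute your own.
command = pg_dump_command("db.internal", "critterwatch", "backup", on=date(2024, 6, 1))
print(shlex.join(command))
```

A dump taken this way preserves the history surfaces (timeline, audit log, alert lifecycle, metrics buckets); live service state would repopulate from telemetry even without it.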
