Role-based access control
CritterWatch ships with an auth-agnostic, ClaimsPrincipal-based RBAC layer (#218). It governs who can view what and — more importantly — who can take state-changing actions (DLQ replay, projection rebuild, tenant ops, chaos toggles, listener pause / restart, etc.) across every monitored service.
Off-mode is the default. Single-tenant deployments and existing hosts that haven't configured RBAC keep working exactly as they did before: every caller (including anonymous callers in dev) can do everything the license permits. Enforced mode is a single DI call away.
How CritterWatch decides
CritterWatch never owns identity. The host's authentication layer (OIDC, reverse-proxy header trust, JWT bearer, Windows auth, whatever) builds a ClaimsPrincipal; CritterWatch reads it. A small interface decides what that principal is allowed to do:
public interface ICritterWatchAuthorizer
{
Task<bool> IsAllowedAsync(
ClaimsPrincipal principal,
string capability,
string? resource = null,
CancellationToken ct = default);
}
- capability: one of the Capabilities string constants. Identifies the thing the caller is trying to do.
- resource: optional scope, such as the target service name, projection shard, tenant id, or alert stream id. Lets you write rules like "this on-call rotation can clear alerts on TripService but not RepairShop."
You implement this interface against whatever rule store fits your environment — LDAP groups, OIDC role claims, a static config file, a database, a custom rule engine. CritterWatch never assumes a schema.
Turning enforcement on
Register your authorizer alongside the rest of the CritterWatch services. The DI extension replaces the off-mode DefaultAllowAuthorizer:
builder.Services.AddCritterWatchAuthorization<MyAuthorizer>();
That single line lights up enforcement everywhere CritterWatch ships gates today: HTTP endpoints, MCP action tools, and all paths annotated with [RequiresPermission]. No further wiring is needed; the off-mode shim is removed and your authorizer drives every decision.
For incremental rollout (e.g. you want to ship enforced mode but double-check denies against logs first), wrap your authorizer to log-and-allow during a soak window before flipping to log-and-deny.
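The log-and-allow wrapper described above could look roughly like the following. This is a minimal sketch, not something CritterWatch ships: SoakModeAuthorizer, its enforce flag, the Action<string> log sink, and DenyAllAuthorizer are hypothetical names, and the interface is redeclared locally only so the snippet compiles on its own.

```csharp
using System;
using System.Security.Claims;
using System.Threading;
using System.Threading.Tasks;

// Local copy of the interface shown earlier so this sketch compiles standalone.
// In a real host, reference the CritterWatch type instead.
public interface ICritterWatchAuthorizer
{
    Task<bool> IsAllowedAsync(
        ClaimsPrincipal principal,
        string capability,
        string? resource = null,
        CancellationToken ct = default);
}

// Hypothetical wrapper: logs every deny from the inner authorizer and,
// while enforce is false, lets the call through anyway.
public sealed class SoakModeAuthorizer : ICritterWatchAuthorizer
{
    private readonly ICritterWatchAuthorizer _inner;
    private readonly Action<string> _log; // stand-in for your logging abstraction
    private readonly bool _enforce;

    public SoakModeAuthorizer(ICritterWatchAuthorizer inner, Action<string> log, bool enforce)
    {
        _inner = inner;
        _log = log;
        _enforce = enforce;
    }

    public async Task<bool> IsAllowedAsync(
        ClaimsPrincipal principal,
        string capability,
        string? resource = null,
        CancellationToken ct = default)
    {
        var allowed = await _inner.IsAllowedAsync(principal, capability, resource, ct);
        if (allowed)
        {
            return true;
        }

        var mode = _enforce ? "deny" : "would-deny";
        var user = principal.Identity?.Name ?? "(unknown)";
        var scope = resource ?? "(none)";
        _log($"{mode}: user={user} capability={capability} resource={scope}");

        // Soak window: allow despite the deny. Enforced: honor the deny.
        return !_enforce;
    }
}

// Trivial inner authorizer, used here only to demonstrate the wrapper.
public sealed class DenyAllAuthorizer : ICritterWatchAuthorizer
{
    public Task<bool> IsAllowedAsync(
        ClaimsPrincipal principal,
        string capability,
        string? resource = null,
        CancellationToken ct = default) => Task.FromResult(false);
}
```

With enforce set to false the wrapper records what would have been denied and still returns true; flipping the flag to true ends the soak window without touching the inner rules.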
Off-mode is preserved when no custom authorizer is registered.
AddCritterWatchServices(...) auto-wires AddCritterWatchAuthorization() with TryAdd so the BFF's static codegen always resolves an authorizer; the resulting DefaultAllowAuthorizer returns true for every decision, which is the pre-#218 behaviour. Your explicit AddCritterWatchAuthorization<TAuthorizer>() replaces it.
Fail-closed on missing principal
If enforced mode is on and a request arrives without an authenticated principal (principal?.Identity?.IsAuthenticated != true), CritterWatch rejects the request before calling your authorizer. That short-circuit exists so a misconfigured authentication scheme can't accidentally let unauthenticated callers through against a permissive authorizer.
Off-mode (no custom authorizer registered) keeps allowing anonymous principals — that's the pre-#218 behaviour for single-tenant dev hosts.
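The expression quoted above hinges on ClaimsIdentity.IsAuthenticated, which in .NET is true only when the identity has a non-empty authentication type. The sketch below illustrates that behaviour; PrincipalChecks is a hypothetical helper that mirrors the expression and is not CritterWatch's own code.

```csharp
using System.Security.Claims;

public static class PrincipalChecks
{
    // Mirrors the expression quoted above (hypothetical helper, for illustration only).
    public static bool IsAuthenticated(ClaimsPrincipal? principal) =>
        principal?.Identity?.IsAuthenticated == true;

    // No authentication type: IsAuthenticated is false, even with claims attached.
    public static ClaimsPrincipal WithoutAuthenticationType() =>
        new ClaimsPrincipal(new ClaimsIdentity(
            new[] { new Claim(ClaimTypes.Role, "sre") }));

    // Non-empty authentication type: IsAuthenticated is true.
    public static ClaimsPrincipal WithAuthenticationType() =>
        new ClaimsPrincipal(new ClaimsIdentity(
            new[] { new Claim(ClaimTypes.Role, "sre") },
            authenticationType: "Test"));
}
```

A principal that carries role claims but no authentication type would therefore be rejected in enforced mode before any authorizer runs, which is worth keeping in mind when building principals by hand in tests.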
What's gated, and how
HTTP endpoints
State-changing operator HTTP endpoints are annotated with the [RequiresPermission] attribute. Wolverine's HTTP codegen weaves the RBAC check into the chain ahead of the endpoint body:
public static class AlertEndpoint
{
[RequiresPermission(Capabilities.AlertClear)]
[WolverinePost("/api/critterwatch/alerts/{alertStreamId}/clear")]
public static async Task<AlertClearedMessage?> ClearAlert(...)
{
// body runs only if the authorizer said yes
}
}
On deny, the request short-circuits with a 403 Forbidden ProblemDetails response. The denied capability is exposed in the capability extension field so clients can map the response back to a specific missing grant:
{
"status": 403,
"title": "Forbidden",
"detail": "Caller is not authorized for capability \"alert.clear\".",
"capability": "alert.clear"
}
No UseExceptionHandler ceremony is required on the host; the frame writes the response directly.
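On the client side, the capability field of the 403 body can be read with System.Text.Json. The following is a minimal sketch; ForbiddenResponses and ReadDeniedCapability are hypothetical names, and only the capability field name is taken from the response shown above.

```csharp
using System.Text.Json;

public static class ForbiddenResponses
{
    // Returns the denied capability from a 403 ProblemDetails body,
    // or null when the body has no string "capability" field.
    public static string? ReadDeniedCapability(string problemJson)
    {
        using var doc = JsonDocument.Parse(problemJson);
        var root = doc.RootElement;

        if (root.ValueKind != JsonValueKind.Object)
        {
            return null;
        }

        if (!root.TryGetProperty("capability", out var capability))
        {
            return null;
        }

        return capability.ValueKind == JsonValueKind.String
            ? capability.GetString()
            : null;
    }
}
```

A UI could use the returned string to tell the operator which grant to request instead of showing a generic "forbidden" message.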
MCP action tools
The cross-application MCP server (CritterWatch.Mcp) exposes 21 RBAC-gated state-changing tools across six families:
| Family | Tools | Capabilities |
|---|---|---|
| DLQ | Replay / Discard | dlq.replay, dlq.discard |
| Projection | Pause / Restart / Rebuild | projection.pause, projection.restart, projection.rebuild |
| Tenant | Add / Enable / Disable / Remove / HardDelete | tenant.add, tenant.enable, tenant.disable, tenant.remove, tenant.hard-delete |
| Alert | Acknowledge / Snooze / Clear | alert.acknowledge, alert.snooze, alert.clear |
| ChaosMonkey | Enable / Disable / SetFailureRate / SetSlowHandler / SetProjectionFailureRate | chaos-monkey.toggle (enable/disable), chaos-monkey.configure (rate / delay knobs) |
| Listener | Pause / Restart / Drain | listener.pause, listener.restart, listener.drain |
Every tool calls the same enforcement helper before publishing the underlying command — the deny envelope is stable JSON the MCP client returns verbatim:
{
"error": "Forbidden",
"message": "Caller is not authorized for capability 'dlq.replay' on resource 'TripService'.",
"capability": "dlq.replay",
"resource": "TripService"
}
The MCP transport is configured stateless so IHttpContextAccessor reflects the current tool invocation's principal. See the MCP integration page for details.
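Because the deny envelope is stable JSON, an MCP client can bind it to a small type and branch on it. This is a sketch under assumptions: McpDenyEnvelope and McpDenials are hypothetical client-side names, and the four fields are the ones shown in the envelope above.

```csharp
using System.Text.Json;

// Hypothetical client-side DTO matching the deny envelope fields shown above.
public sealed record McpDenyEnvelope(
    string? Error,
    string? Message,
    string? Capability,
    string? Resource);

public static class McpDenials
{
    // Web defaults give camelCase, case-insensitive property matching.
    private static readonly JsonSerializerOptions Options =
        new JsonSerializerOptions(JsonSerializerDefaults.Web);

    // Returns the envelope when the tool result is an RBAC deny, otherwise null.
    public static McpDenyEnvelope? TryRead(string toolResultJson)
    {
        try
        {
            var envelope = JsonSerializer.Deserialize<McpDenyEnvelope>(toolResultJson, Options);
            return envelope?.Error == "Forbidden" ? envelope : null;
        }
        catch (JsonException)
        {
            // Not JSON, or not shaped like the envelope: treat as a non-deny result.
            return null;
        }
    }
}
```

Unlike the HTTP response, the envelope also carries the resource, so a client can report both the missing capability and the service it was attempted against.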
The capability catalog
Capabilities are strings, not enums, so the wire format stays stable across audit-log replay. Add new capabilities at the bottom of Capabilities; never rename or repurpose existing ones.
Read surfaces (17)
dashboard.view, services.view, projections.view, event-store-explorer.view, event-modeling.view, projection-stepper.view, alerts.view, metrics.view, health.view, audit-log.view, scheduled-messages.view, dead-letters.view, tenants.view, listeners.view, timeline.view, topology.view, durability.view.
MCP tool family gates (4)
mcp.alerts.read, mcp.health.read, mcp.performance.read, mcp.traces.read.
Operator actions (30)
| Group | Capabilities |
|---|---|
| DLQ | dlq.replay, dlq.discard, dlq.edit |
| Listeners | listener.pause, listener.restart, listener.drain, endpoint.update |
| Projections | projection.pause, projection.restart, projection.rebuild, subscription.rewind |
| Scheduled messages | scheduled-message.cancel, scheduled-message.reschedule, scheduled-message.edit |
| Tenants | tenant.add, tenant.enable, tenant.disable, tenant.remove, tenant.hard-delete |
| Cluster / agents | node.eject, agent.pin, agent.unpin, election.trigger |
| Alerts | alert.acknowledge, alert.snooze, alert.clear, alert.config.edit |
| ChaosMonkey | chaos-monkey.toggle, chaos-monkey.configure |
| Configuration | config.edit |
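Since every read surface in the catalog ends in .view, a read-only role can be expressed as a suffix rule inside a custom authorizer. The sketch below assumes a hypothetical viewer role and a hypothetical ReadOnlyRule helper; it matches role claims the same way the skeleton later on this page does.

```csharp
using System;
using System.Linq;
using System.Security.Claims;

public static class ReadOnlyRule
{
    // True when the capability is a ".view" read surface and the principal
    // carries the (hypothetical) "viewer" role under a common role claim type.
    public static bool Allows(ClaimsPrincipal principal, string capability)
    {
        if (!capability.EndsWith(".view", StringComparison.Ordinal))
        {
            return false;
        }

        return principal.Claims.Any(c =>
            (c.Type == ClaimTypes.Role || c.Type == "role" || c.Type == "roles")
            && c.Value == "viewer");
    }
}
```

The trade-off is that a suffix rule also grants any .view capability added later, whereas listing capabilities one by one keeps the fail-closed default. Note that the MCP tool family gates end in .read, so this rule does not cover them.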
Design notes
- Toggle vs configure split. ChaosMonkey on/off is a separate gate from the rate / delay knobs so an operator trusted to stop chaos isn't automatically trusted to crank the dial higher first. The same pattern applies across the HTTP and MCP surfaces.
- HardDelete vs Remove. tenant.hard-delete is split from tenant.remove because hard-delete issues PostgreSQL DROP DATABASE … WITH (FORCE) on the tenant's database, while soft remove leaves the database intact for backup / forensic recovery.
Custom authorizer skeleton
A starting point that grants capabilities based on role claims (for example, roles mapped from OIDC):
public sealed class RoleClaimAuthorizer : ICritterWatchAuthorizer
{
private readonly IReadOnlyDictionary<string, IReadOnlySet<string>> _capabilityRoles = new Dictionary<string, IReadOnlySet<string>>
{
[Capabilities.DlqReplay] = new HashSet<string> { "sre", "platform" },
[Capabilities.DlqDiscard] = new HashSet<string> { "platform" },
[Capabilities.ProjectionRebuild] = new HashSet<string> { "platform" },
[Capabilities.TenantHardDelete] = new HashSet<string> { "platform-admin" },
// … one row per capability you want to gate
};
public Task<bool> IsAllowedAsync(
ClaimsPrincipal principal,
string capability,
string? resource = null,
CancellationToken ct = default)
{
if (!_capabilityRoles.TryGetValue(capability, out var requiredRoles))
{
// Anything not explicitly listed is denied — fail-closed default.
return Task.FromResult(false);
}
var has = principal.Claims
.Where(c => c.Type == ClaimTypes.Role || c.Type == "role" || c.Type == "roles")
.Any(c => requiredRoles.Contains(c.Value));
return Task.FromResult(has);
}
}
Hook it up:
builder.Services.AddCritterWatchAuthorization<RoleClaimAuthorizer>();
For resource-scoped rules (e.g. only allow alert.clear on services matching a per-rotation prefix), read the resource parameter:
public Task<bool> IsAllowedAsync(
ClaimsPrincipal principal,
string capability,
string? resource,
CancellationToken ct)
{
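if (capability == Capabilities.AlertClear && resource is not null)
{
// Takes the second ':'-separated segment as the service name;
// a resource with no ':' yields null and falls through to the baseline.
var serviceName = resource.Split(':').Skip(1).FirstOrDefault();
if (serviceName is not null && OnCallRotation.OwnsService(principal, serviceName))
return Task.FromResult(true);
}
// _baseline is assumed to be another ICritterWatchAuthorizer held by this class
// (for example, a RoleClaimAuthorizer) that decides everything else.
return _baseline.IsAllowedAsync(principal, capability, resource, ct);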
}
Testing your authorizer
RbacFoundationTests and RbacEnforcementHttpTests in the CritterWatch repo are the reference patterns. For your own host:
[Fact]
public async Task sre_role_can_replay_dlq()
{
var authorizer = new RoleClaimAuthorizer(/* … */);
var principal = new ClaimsPrincipal(
new ClaimsIdentity(
[new Claim(ClaimTypes.Role, "sre")],
authenticationType: "Test"));
var allowed = await authorizer.IsAllowedAsync(
principal, Capabilities.DlqReplay, resource: "TripService");
allowed.ShouldBeTrue();
}
For HTTP-level end-to-end coverage, Alba scenarios against a real Wolverine.Http pipeline exercise the full codegen path; see src/Tests/Services/RbacEnforcementHttpTests.cs for the shape.
Related
- Cross-application MCP server — RBAC enforcement on the MCP tool surface
- Clustering — RBAC works the same in single-node and clustered deployments
- Licensing — license gating is a separate, complementary layer (RBAC governs which authenticated principals can do something; license gating governs whether the operation is enabled at all)
