Role-based access control

CritterWatch ships with an auth-agnostic, ClaimsPrincipal-based RBAC layer (#218). It governs who can view what and — more importantly — who can take state-changing actions (DLQ replay, projection rebuild, tenant ops, chaos toggles, listener pause / restart, etc.) across every monitored service.

Off-mode is the default. Single-tenant deployments and existing hosts that haven't configured RBAC keep working exactly as they did before: every caller (including anonymous callers in dev) can do everything the license permits. Enforced mode is a single DI call away.

How CritterWatch decides

CritterWatch never owns identity. The host's authentication layer (OIDC, reverse-proxy header trust, JWT bearer, Windows auth, whatever) builds a ClaimsPrincipal; CritterWatch reads it. A small interface decides what that principal is allowed to do:

csharp
public interface ICritterWatchAuthorizer
{
    Task<bool> IsAllowedAsync(
        ClaimsPrincipal principal,
        string capability,
        string? resource = null,
        CancellationToken ct = default);
}

  • capability: one of the Capabilities string constants. Identifies the thing the caller is trying to do.
  • resource: optional scope, such as the target service name, projection shard, tenant id, or alert stream id. Lets you write rules like "this on-call rotation can clear alerts on TripService but not RepairShop."

You implement this interface against whatever rule store fits your environment — LDAP groups, OIDC role claims, a static config file, a database, a custom rule engine. CritterWatch never assumes a schema.

Turning enforcement on

Register your authorizer alongside the rest of the CritterWatch services. The DI extension replaces the off-mode DefaultAllowAuthorizer:

csharp
builder.Services.AddCritterWatchAuthorization<MyAuthorizer>();

That single line lights up enforcement everywhere CritterWatch ships gates today — HTTP endpoints, MCP action tools, all paths annotated with [RequiresPermission]. No further wiring is needed; the off-mode shim is removed and your authorizer drives every decision.

For incremental rollout (e.g. you want to ship enforced mode but double-check denies against logs first), wrap your authorizer to log-and-allow during a soak window before flipping to log-and-deny.
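
A soak-window wrapper along those lines could be sketched as below. LogAndAllowAuthorizer is a hypothetical name, not a CritterWatch type; the sketch assumes the ICritterWatchAuthorizer interface shown above is in scope, that MyAuthorizer is your real authorizer, and that Microsoft.Extensions.Logging is available in the host.

```csharp
using System.Security.Claims;
using System.Threading;
using System.Threading.Tasks;
using Microsoft.Extensions.Logging;

// Hypothetical wrapper: asks the real authorizer, logs would-be denies, always allows.
public sealed class LogAndAllowAuthorizer : ICritterWatchAuthorizer
{
    private readonly MyAuthorizer _inner;
    private readonly ILogger<LogAndAllowAuthorizer> _logger;

    public LogAndAllowAuthorizer(MyAuthorizer inner, ILogger<LogAndAllowAuthorizer> logger)
    {
        _inner = inner;
        _logger = logger;
    }

    public async Task<bool> IsAllowedAsync(
        ClaimsPrincipal principal,
        string capability,
        string? resource = null,
        CancellationToken ct = default)
    {
        var allowed = await _inner.IsAllowedAsync(principal, capability, resource, ct);
        if (!allowed)
        {
            // Record what enforced mode would have rejected.
            _logger.LogWarning(
                "RBAC soak: would deny {Capability} on {Resource} for {User}",
                capability,
                resource ?? "(none)",
                principal.Identity?.Name ?? "(unknown)");
        }

        // Soak window: never block, whatever the inner decision was.
        return true;
    }
}
```

Registering it would presumably look like AddCritterWatchAuthorization<LogAndAllowAuthorizer>() with MyAuthorizer also registered in the container so it can be injected; how the extension resolves constructor dependencies is an assumption here. Once the logs are clean, switch the registration back to MyAuthorizer to start denying.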

Off-mode is preserved when no custom authorizer is registered. AddCritterWatchServices(...) auto-wires AddCritterWatchAuthorization() with TryAdd so the BFF's static codegen always resolves an authorizer; the resulting DefaultAllowAuthorizer returns true for every decision, which is the pre-#218 behaviour. Your explicit AddCritterWatchAuthorization<TAuthorizer>() replaces it.

Fail-closed on missing principal

If enforced mode is on and a request arrives without an authenticated principal (principal?.Identity?.IsAuthenticated != true), CritterWatch rejects the request before calling your authorizer. That short-circuit exists so a misconfigured authentication scheme can't accidentally let unauthenticated callers through against a permissive authorizer.

Off-mode (no custom authorizer registered) keeps allowing anonymous principals — that's the pre-#218 behaviour for single-tenant dev hosts.
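
The authentication check quoted above depends on ClaimsIdentity.IsAuthenticated, which is true only when the identity was built with a non-empty authentication type. The sketch below illustrates that behaviour; PrincipalChecks and its members are hypothetical helpers for illustration, not CritterWatch APIs.

```csharp
using System.Security.Claims;

public static class PrincipalChecks
{
    // The predicate from the text, inverted: true only for an authenticated identity.
    public static bool IsAuthenticated(ClaimsPrincipal? principal) =>
        principal?.Identity?.IsAuthenticated == true;

    // No authenticationType supplied, so IsAuthenticated is false even though claims exist.
    public static ClaimsPrincipal WithoutAuthenticationType() =>
        new ClaimsPrincipal(
            new ClaimsIdentity(new[] { new Claim(ClaimTypes.Role, "sre") }));

    // A non-empty authenticationType makes the identity count as authenticated.
    public static ClaimsPrincipal WithAuthenticationType() =>
        new ClaimsPrincipal(
            new ClaimsIdentity(
                new[] { new Claim(ClaimTypes.Role, "sre") },
                authenticationType: "Test"));
}
```

In enforced mode, a principal like the first one would be rejected before the authorizer runs, regardless of the role claim it carries. This is also why test principals need an authenticationType.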

What's gated, and how

HTTP endpoints

State-changing operator HTTP endpoints are annotated with the [RequiresPermission] attribute. Wolverine's HTTP codegen weaves the RBAC check into the chain ahead of the endpoint body:

csharp
public static class AlertEndpoint
{
    [RequiresPermission(Capabilities.AlertClear)]
    [WolverinePost("/api/critterwatch/alerts/{alertStreamId}/clear")]
    public static async Task<AlertClearedMessage?> ClearAlert(...)
    {
        // body runs only if the authorizer said yes
    }
}

On deny, the request short-circuits with a 403 Forbidden ProblemDetails response. The denied capability is exposed in the capability extension field so clients can map the response back to a specific missing grant:

json
{
  "status": 403,
  "title": "Forbidden",
  "detail": "Caller is not authorized for capability \"alert.clear\".",
  "capability": "alert.clear"
}

No UseExceptionHandler ceremony is required on the host — the frame writes the response directly.
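
On the client side, the capability extension field can be read straight from the 403 body. A minimal sketch using System.Text.Json follows; ForbiddenResponse and DeniedCapability are hypothetical names, and the only assumption about the payload is the shape shown above.

```csharp
using System.Text.Json;

public static class ForbiddenResponse
{
    // Returns the denied capability from a 403 ProblemDetails body,
    // or null when the field is missing or not a string.
    public static string? DeniedCapability(string problemDetailsJson)
    {
        using var doc = JsonDocument.Parse(problemDetailsJson);
        var root = doc.RootElement;

        if (root.ValueKind != JsonValueKind.Object)
        {
            return null;
        }

        return root.TryGetProperty("capability", out var capability)
               && capability.ValueKind == JsonValueKind.String
            ? capability.GetString()
            : null;
    }
}
```

A UI could use the returned value to show which grant the caller is missing instead of a generic "forbidden" message.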

MCP action tools

The cross-application MCP server (CritterWatch.Mcp) exposes 21 RBAC-gated state-changing tools across six families:

| Family | Tools | Capabilities |
| --- | --- | --- |
| DLQ | Replay / Discard | dlq.replay, dlq.discard |
| Projection | Pause / Restart / Rebuild | projection.pause, projection.restart, projection.rebuild |
| Tenant | Add / Enable / Disable / Remove / HardDelete | tenant.add, tenant.enable, tenant.disable, tenant.remove, tenant.hard-delete |
| Alert | Acknowledge / Snooze / Clear | alert.acknowledge, alert.snooze, alert.clear |
| ChaosMonkey | Enable / Disable / SetFailureRate / SetSlowHandler / SetProjectionFailureRate | chaos-monkey.toggle (enable/disable), chaos-monkey.configure (rate / delay knobs) |
| Listener | Pause / Restart / Drain | listener.pause, listener.restart, listener.drain |

Every tool calls the same enforcement helper before publishing the underlying command — the deny envelope is stable JSON the MCP client returns verbatim:

json
{
  "error": "Forbidden",
  "message": "Caller is not authorized for capability 'dlq.replay' on resource 'TripService'.",
  "capability": "dlq.replay",
  "resource": "TripService"
}

The MCP transport is configured stateless so IHttpContextAccessor reflects the current tool invocation's principal — see the MCP integration page for details.
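
Because the MCP deny envelope is stable JSON, a client can deserialize it into a typed shape. The record and helper below are a hypothetical client-side sketch based only on the four fields shown in the envelope above; they are not types shipped by CritterWatch.Mcp.

```csharp
using System.Text.Json;

// Hypothetical client-side shape for the deny envelope.
public sealed record McpDenyEnvelope(
    string Error,
    string Message,
    string Capability,
    string? Resource);

public static class McpDeny
{
    // The envelope uses lower-case property names; match them case-insensitively.
    private static readonly JsonSerializerOptions Options =
        new() { PropertyNameCaseInsensitive = true };

    // Returns the envelope when the tool result is a deny, otherwise null.
    // Expects a JSON object; other JSON shapes throw JsonException.
    public static McpDenyEnvelope? TryParse(string toolResultJson)
    {
        var envelope = JsonSerializer.Deserialize<McpDenyEnvelope>(toolResultJson, Options);
        return envelope?.Error == "Forbidden" ? envelope : null;
    }
}
```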

The capability catalog

Capabilities are strings, not enums, so the wire format stays stable across audit-log replay. Add new capabilities at the bottom of Capabilities; never rename or repurpose existing ones.

Read surfaces (17)

dashboard.view, services.view, projections.view, event-store-explorer.view, event-modeling.view, projection-stepper.view, alerts.view, metrics.view, health.view, audit-log.view, scheduled-messages.view, dead-letters.view, tenants.view, listeners.view, timeline.view, topology.view, durability.view.

MCP tool family gates (4)

mcp.alerts.read, mcp.health.read, mcp.performance.read, mcp.traces.read.

Operator actions (30)

| Group | Capabilities |
| --- | --- |
| DLQ | dlq.replay, dlq.discard, dlq.edit |
| Listeners | listener.pause, listener.restart, listener.drain, endpoint.update |
| Projections | projection.pause, projection.restart, projection.rebuild, subscription.rewind |
| Scheduled messages | scheduled-message.cancel, scheduled-message.reschedule, scheduled-message.edit |
| Tenants | tenant.add, tenant.enable, tenant.disable, tenant.remove, tenant.hard-delete |
| Cluster / agents | node.eject, agent.pin, agent.unpin, election.trigger |
| Alerts | alert.acknowledge, alert.snooze, alert.clear, alert.config.edit |
| ChaosMonkey | chaos-monkey.toggle, chaos-monkey.configure |
| Configuration | config.edit |
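
The naming in the catalog above makes it possible to tell read gates from operator actions by string shape alone, which can be handy for a blanket "viewer" rule. The helper below is a hypothetical sketch that relies on the observed naming pattern (read surfaces end in ".view", MCP gates look like "mcp.<family>.read"). That pattern is not a documented contract, so a capability added later might not follow it.

```csharp
using System;

public static class CapabilityKinds
{
    // Read surfaces in the catalog all end in ".view".
    public static bool IsReadSurface(string capability) =>
        capability.EndsWith(".view", StringComparison.Ordinal);

    // MCP tool family gates follow the "mcp.<family>.read" shape.
    public static bool IsMcpReadGate(string capability) =>
        capability.StartsWith("mcp.", StringComparison.Ordinal)
        && capability.EndsWith(".read", StringComparison.Ordinal);

    // Anything else is treated as a state-changing operator action.
    public static bool IsOperatorAction(string capability) =>
        !IsReadSurface(capability) && !IsMcpReadGate(capability);
}
```

An authorizer could grant every read surface to a viewer role with one check and keep explicit per-capability rows only for operator actions.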

Design notes

  • Toggle vs configure split. ChaosMonkey on/off is a separate gate from the rate / delay knobs so an operator trusted to stop chaos isn't automatically trusted to crank the dial higher first. The same pattern applies across the HTTP and MCP surfaces.
  • HardDelete vs Remove. tenant.hard-delete is split from tenant.remove because hard-delete issues PostgreSQL DROP DATABASE … WITH (FORCE) on the tenant's database, while soft remove leaves the database intact for backup / forensic recovery.

Custom authorizer skeleton

A starting point that grants capabilities based on role claims (for example, roles mapped from OIDC claims):

csharp
public sealed class RoleClaimAuthorizer : ICritterWatchAuthorizer
{
    private readonly IReadOnlyDictionary<string, IReadOnlySet<string>> _capabilityRoles = new Dictionary<string, IReadOnlySet<string>>
    {
        [Capabilities.DlqReplay] = new HashSet<string> { "sre", "platform" },
        [Capabilities.DlqDiscard] = new HashSet<string> { "platform" },
        [Capabilities.ProjectionRebuild] = new HashSet<string> { "platform" },
        [Capabilities.TenantHardDelete] = new HashSet<string> { "platform-admin" },
        // … one row per capability you want to gate
    };

    public Task<bool> IsAllowedAsync(
        ClaimsPrincipal principal,
        string capability,
        string? resource = null,
        CancellationToken ct = default)
    {
        if (!_capabilityRoles.TryGetValue(capability, out var requiredRoles))
        {
            // Anything not explicitly listed is denied — fail-closed default.
            return Task.FromResult(false);
        }

        var has = principal.Claims
            .Where(c => c.Type == ClaimTypes.Role || c.Type == "role" || c.Type == "roles")
            .Any(c => requiredRoles.Contains(c.Value));

        return Task.FromResult(has);
    }
}

Hook it up:

csharp
builder.Services.AddCritterWatchAuthorization<RoleClaimAuthorizer>();

For resource-scoped rules (e.g. only allow alert.clear on services matching a per-rotation prefix), read the resource parameter:

csharp
public Task<bool> IsAllowedAsync(
    ClaimsPrincipal principal,
    string capability,
    string? resource,
    CancellationToken ct)
{
    if (capability == Capabilities.AlertClear && resource is not null)
    {
        var serviceName = resource.Split(':').Skip(1).FirstOrDefault();
        if (serviceName is not null && OnCallRotation.OwnsService(principal, serviceName))
            return Task.FromResult(true);
    }

    return _baseline.IsAllowedAsync(principal, capability, resource, ct);
}
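
OnCallRotation.OwnsService in the snippet above is left to the host. One possible shape, assuming each rotation member carries one claim per service-name prefix they own, is sketched below. The claim type "oncall-service-prefix" is invented for illustration; substitute whatever your identity provider emits.

```csharp
using System;
using System.Linq;
using System.Security.Claims;

// Hypothetical stand-in for the OnCallRotation helper referenced above.
public static class OnCallRotation
{
    // Assumed claim type, not defined by CritterWatch.
    public const string ServicePrefixClaim = "oncall-service-prefix";

    // True when any prefix claim on the principal matches the start of the service name.
    public static bool OwnsService(ClaimsPrincipal principal, string serviceName) =>
        principal.FindAll(ServicePrefixClaim)
            .Any(c => c.Value.Length > 0
                      && serviceName.StartsWith(c.Value, StringComparison.OrdinalIgnoreCase));
}
```

With a claim value of "Trip", this would allow alert.clear on TripService and fall through to the baseline authorizer for RepairShop.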

Testing your authorizer

RbacFoundationTests and RbacEnforcementHttpTests in the CritterWatch repo are the reference patterns. For your own host:

csharp
[Fact]
public async Task sre_role_can_replay_dlq()
{
    var authorizer = new RoleClaimAuthorizer(/* … */);
    var principal = new ClaimsPrincipal(
        new ClaimsIdentity(
            [new Claim(ClaimTypes.Role, "sre")],
            authenticationType: "Test"));

    var allowed = await authorizer.IsAllowedAsync(
        principal, Capabilities.DlqReplay, resource: "TripService");

    allowed.ShouldBeTrue();
}

For HTTP-level end-to-end coverage, Alba scenarios against a real Wolverine.Http pipeline exercise the full codegen path — see src/Tests/Services/RbacEnforcementHttpTests.cs for the shape.

  • Cross-application MCP server — RBAC enforcement on the MCP tool surface
  • Clustering — RBAC works the same in single-node and clustered deployments
  • Licensing — license gating is a separate, complementary layer (RBAC governs which authenticated principals can do something; license gating governs whether the operation is enabled at all)

Released under the MIT License.