SLA Monitoring

Kiket tracks workflow SLAs directly in the runtime so teams can spot issues before they miss a promise. SLA definitions live alongside workflows inside definition repositories and are evaluated continuously by the Risk Scanner job.

Defining SLAs

Create workflow_sla_definitions.yml (or embed the block inside a project manifest) to describe the states and guardrails that matter:

- id: ops-critical-in-progress
  status: in_progress
  issue_type: Incident
  priority: high
  max_duration_minutes: 120
  warning_buffer_minutes: 30
  project: platform-core
  • status – workflow state to watch. Optional filters (issue_type, priority, project) scope the definition.
  • max_duration_minutes – breach threshold. When an issue stays in the state longer than this value, the SLA is marked breached.
  • warning_buffer_minutes – buffer before the breach. When elapsed time exceeds max_duration_minutes - warning_buffer_minutes, the SLA enters the imminent state and notifications fire.
  • Definitions are tenant-scoped: customers can check them in with the rest of their .kiket repo and swap templates at any time. Validation runs during definition sync so invalid assets never activate.
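The threshold rules above can be sketched as a small function. This is an illustration of the documented comparisons only, not the platform's evaluator; the function name classify_sla and the None return value for "within SLA" are assumptions made for the example.

```python
from typing import Optional


def classify_sla(
    elapsed_minutes: float,
    max_duration_minutes: int,
    warning_buffer_minutes: int = 0,
) -> Optional[str]:
    """Return 'breached', 'imminent', or None when the issue is within its SLA.

    Hypothetical helper: it mirrors the documented rules, using strict
    comparisons because the text says "longer than" and "exceeds".
    """
    if elapsed_minutes > max_duration_minutes:
        return "breached"
    if elapsed_minutes > max_duration_minutes - warning_buffer_minutes:
        return "imminent"
    return None
```

With the sample definition (max_duration_minutes: 120, warning_buffer_minutes: 30), an issue that has spent 105 minutes in in_progress would be classified as imminent, and one at 121 minutes as breached.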

Detection Pipeline

  1. RiskScanner::Analyzer runs on a schedule for every organization (and per project when needed).
  2. Workflow::SlaEvaluator loads active definitions, scans the matching issues, and either opens or resolves workflow_sla_events.
  3. NotificationService.send_sla_event notifies the assignee, project lead, and admins with contextual details and links.
  4. Extensions::WorkflowExtensionDispatcher emits workflow.sla_status events so inline code and external webhooks can react.

Each workflow_sla_event records:

Field | Description
state | imminent, breached, or recovered.
triggered_at | When the SLA entered the current state.
duration_minutes | Minutes spent in the monitored state.
overdue_minutes | Minutes over the SLA (only for breaches).
definition | Snapshot of the definition (status, limits, filters).
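To make the record shape concrete, here is a minimal sketch of the fields as a Python dataclass. The class name SlaEvent, the build_event helper, and the overdue calculation (duration minus max_duration_minutes) are assumptions for illustration; the platform's actual model may differ.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import Optional


@dataclass
class SlaEvent:
    state: str                      # "imminent", "breached", or "recovered"
    triggered_at: datetime          # when the SLA entered this state
    duration_minutes: int           # minutes spent in the monitored state
    overdue_minutes: Optional[int]  # populated only for breaches
    definition: dict                # snapshot of the definition


def build_event(state: str, duration_minutes: int, definition: dict) -> SlaEvent:
    """Hypothetical constructor that fills overdue_minutes only for breaches."""
    overdue = None
    if state == "breached":
        # Assumed formula: time spent beyond the breach threshold.
        overdue = duration_minutes - definition["max_duration_minutes"]
    return SlaEvent(
        state=state,
        triggered_at=datetime.now(timezone.utc),
        duration_minutes=duration_minutes,
        overdue_minutes=overdue,
        definition=dict(definition),  # copy so later edits do not alter the snapshot
    )
```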

Surfaces & Alerts

  • In-app notifications – assignees, project leads, and org admins receive alerts with direct links to the offending issue.
  • Command palette – open the palette (Cmd/Ctrl + K) inside a project and run Review SLA alerts. The action summarises the five most recent imminent/breached events and links you to the issues.
  • CLI – kiket sla events --org acme --project 42 --state imminent fetches the stream directly from /api/v1/sla_events. Use --format json when piping into other tooling.
  • Extensions – call GET /api/v1/ext/sla/events?project_id=... with an extension API key or subscribe to the workflow.sla_status webhook. SDK helpers expose this as context.endpoints.sla_events(project_id) in Python, with equivalent helpers for Node.js, Ruby, Java, and .NET (see Extension Examples for the per-language naming).
  • Dashboards – SLA metrics join the analytics warehouse so cost dashboards and operations overviews can chart warning trends alongside throughput.

API Reference

Method | Path | Notes
GET | /api/v1/sla_events | Organization-authenticated endpoint (requires the organization param when scripting). Supports the project_id, issue_id, state, and limit query params.
GET | /api/v1/ext/sla/events | Extension-scoped endpoint. Requires project_id and the read:issues scope on the extension API key.
Webhook | workflow.sla_status | Fired for each state transition. Payload includes state, issue, sla.definition, and metrics (duration/overdue minutes).
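When scripting against the organization endpoint, the query string can be assembled from the parameters listed above. The sketch below only builds the URL; the base URL is a placeholder, the function name is hypothetical, and authentication is omitted because the source does not describe it.

```python
from typing import Optional
from urllib.parse import urlencode

BASE_URL = "https://kiket.example.com"  # placeholder host, not a real endpoint


def sla_events_url(
    organization: str,
    project_id: Optional[int] = None,
    issue_id: Optional[int] = None,
    state: Optional[str] = None,
    limit: Optional[int] = None,
    base_url: str = BASE_URL,
) -> str:
    """Build a GET /api/v1/sla_events URL, skipping filters that are not set."""
    params = {
        "organization": organization,
        "project_id": project_id,
        "issue_id": issue_id,
        "state": state,
        "limit": limit,
    }
    query = urlencode({k: v for k, v in params.items() if v is not None})
    return f"{base_url}/api/v1/sla_events?{query}"
```

For example, sla_events_url("acme", project_id=17, state="breached", limit=10) yields a URL whose query string is organization=acme&project_id=17&state=breached&limit=10.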

Webhook payload example:

{
  "event": "workflow.sla_status",
  "state": "imminent",
  "issue": {
    "id": 42,
    "status": "in_progress",
    "title": "Escalated onboarding incident",
    "project_id": 17
  },
  "sla": {
    "definition_id": 7,
    "status": "in_progress",
    "max_duration_minutes": 120,
    "warning_buffer_minutes": 30
  },
  "metrics": {
    "duration_minutes": 105,
    "overdue_minutes": null
  }
}
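A webhook consumer might turn this payload into a one-line alert. The following is a sketch that relies only on the fields shown in the example payload; the function name and message format are invented for illustration, and signature verification or HTTP handling is left out.

```python
import json


def summarize_sla_webhook(body: str) -> str:
    """Turn a workflow.sla_status payload (JSON text) into a short alert line."""
    payload = json.loads(body)
    if payload.get("event") != "workflow.sla_status":
        raise ValueError("not a workflow.sla_status event")

    state = payload["state"]
    issue = payload["issue"]
    sla = payload["sla"]
    metrics = payload["metrics"]

    if state == "breached":
        detail = f"{metrics['overdue_minutes']} min over the {sla['max_duration_minutes']} min limit"
    elif state == "imminent":
        # Time left before the breach threshold is crossed.
        remaining = sla["max_duration_minutes"] - metrics["duration_minutes"]
        detail = f"{remaining} min until breach"
    else:
        detail = f"after {metrics['duration_minutes']} min"

    return f"[{state}] issue #{issue['id']} ({issue['title']}): {detail}"
```

Applied to the example payload above, this returns "[imminent] issue #42 (Escalated onboarding incident): 15 min until breach".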

CLI Usage

kiket sla events --org acme --project 17 --state breached --limit 10 --format json
  • --project and --issue filters can be combined.
  • --state recovered is helpful for auditing recent recoveries without paging every notification.
  • --format human (default) prints a table with columns for duration and overdue minutes.

Extension Examples

All SDKs expose the SLA helper:

  • Python: context.endpoints.sla_events(project_id).list(limit=5)
  • Node.js: context.endpoints.slaEvents(projectId).list({ state: 'imminent' })
  • Ruby: context[:endpoints].sla_events(project_id).list
  • Java: context.getEndpoints().slaEvents(projectId).list(options)
  • .NET: await context.Endpoints.SlaEvents(projectId).ListAsync()

Use it to:

  • Post custom alerts in Slack/Teams.
  • Gate deployments when breaches exist.
  • Populate downstream analytics pipelines.
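As one illustration of the deployment-gating idea, the sketch below decides whether a deploy may proceed from a list of SLA events. It assumes each event is a dict with a "state" key, matching the event fields documented earlier; the actual return shape of the SDK's list call is not specified in the source, and deployment_allowed is a hypothetical name.

```python
from typing import Iterable, Mapping


def deployment_allowed(events: Iterable[Mapping], max_breaches: int = 0) -> bool:
    """Allow a deploy only if the number of breached events is within the limit."""
    breached = sum(1 for event in events if event.get("state") == "breached")
    return breached <= max_breaches
```

In a Python extension, the events could come from something like context.endpoints.sla_events(project_id).list(), converted to dicts if the SDK returns a different type.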

Operational Notes

  • SLA evaluation reuses column_changed_at timestamps from boards. If your workflow mutates state outside the standard board orchestrator, be sure to update that field.
  • Definitions support tenant overrides. When a repo fails validation (duplicate IDs, invalid values) the platform surfaces actionable errors instead of silently falling back.
  • Marketplace bundles can include SLA definition files so partners ship ready-to-use guardrails.