Key concepts

Concept

Description

Workload

Defines what data to query: which clusters and which filters to apply. See Workloads.

Rule

A Python script that receives job data and returns a list of alerts. Rules are versioned and follow a draft/publish workflow.

Healthcheck

The main configuration: links a workload to one or more rules, sets a schedule, and configures notifications.

Evaluation

A single execution of all rules in a healthcheck. Produces an overall alert level and individual alerts per rule.

Alert level

Severity assigned by a rule. See Alert levels.

Notification profile

A reusable delivery configuration (email, webhook, file). See Notification profiles.

Alert levels

Rules return alerts with one of five severity levels:

Level

Meaning

OK

No issues detected.

LOW

Minor issue; informational.

MEDIUM

Moderate issue requiring attention.

HIGH

Serious issue requiring prompt action.

CRITICAL

Emergency condition.

The overall alert level of an evaluation is the highest level returned by any rule. Notifications are triggered based on changes to this overall level.

Rules

A rule is a versioned Python script stored in OKA. It queries cluster data through a set of built-in data-access methods and returns a list of alerts.

Rules are managed from Healthcheck → Rules.

Rule list page

rule_list

Shows all rules with their latest version, draft status, the number of healthchecks using each rule, and the last update timestamp.

From this page you can create, edit, duplicate, or delete rules, and open the test sandbox.

Creating and editing rules

rule_editor

The rule editor provides a Python code editor with syntax highlighting, autocomplete, fullscreen mode, and file import. The editor follows the same draft/publish workflow as Data Enhancers:

  • Draft — the working version. Editable; cannot be linked to a healthcheck. Only one draft per rule at a time.

  • Published — an immutable snapshot. Can be assigned to healthchecks. Any further edit produces a new draft version.

To publish a draft, click Publish v{n} – Draft and enter an optional commit message. The Versions tab shows the full version history.

Rule code format

Every rule file must contain exactly one class that:

  • Inherits from HealthcheckRule (from applications.healthcheck.core.base import HealthcheckRule), see HealthcheckRule to find all the available methods that allow to get access to the metrics. All methods automatically apply the workload’s cluster and filter scope. resolution defaults to "1hour"; supported values include "1minute", "1hour", "1day", "1month".

  • Defines a params_model class attribute pointing to a Pydantic BaseModel subclass.

  • Implements an evaluate(self) -> list[Alert] method, seel Alert.

Here is an example of such a rule:

"""
Rule: High core utilization detection
Description: Alert when average core utilization exceeds thresholds.
"""

from pydantic import BaseModel, Field

from applications.healthcheck.core import Alert, HealthcheckRule
from applications.healthcheck.models import AlertLevel


class Params(BaseModel):
    high_threshold: float = Field(default=0.80, ge=0.0, le=1.0)
    critical_threshold: float = Field(default=0.95, ge=0.0, le=1.0)


class CoreLoadRule(HealthcheckRule):
    """Alert when average core utilization exceeds thresholds."""

    params_model = Params

    def evaluate(self) -> list[Alert]:
        load = self.cluster_load.get_core_load()

        if not load.timestamps:
            return []

        p = self.params
        utilization = load.utilization_rate

        if utilization >= p.critical_threshold:
            level = AlertLevel.CRITICAL
        elif utilization >= p.high_threshold:
            level = AlertLevel.HIGH
        else:
            return []

        threshold = p.critical_threshold if level == AlertLevel.CRITICAL else p.high_threshold

        return [
            Alert(
                level=level,
                title=f"Core utilization at {utilization:.1%}",
                description=(
                    f"Average {load.avg_allocated:.0f} / {load.avg_capacity:.0f} cores allocated "
                    f"over {len(load.timestamps)} buckets ({load.resolution})."
                ),
                reason=f"Utilization {utilization:.1%} exceeded threshold {threshold:.1%}.",
                metadata={
                    "utilization_rate": round(utilization, 4),
                    "avg_allocated": round(load.avg_allocated, 1),
                    "avg_capacity": round(load.avg_capacity, 1),
                    "max_allocated": round(load.max_allocated, 1),
                },
            )
        ]

Alert structure — each Alert must provide:

  • level — an AlertLevel enum value.

  • title — a short summary (max 512 characters).

  • description — a detailed explanation, or a recommendation.

  • reason — why the alert was triggered (the specific condition that fired).

  • metadata — a dict with any structured data useful for adding additional information and debugging.

Return an empty list ([]) when no issues are detected.

Testing rules with the sandbox

The sandbox lets you run a draft or published rule against a workload without creating an evaluation record.

Access from Healthcheck → Rules → Test sandbox, or from the rule list action menu.

  1. Select the rule and version to test.

  2. Select the workload to run it against.

  3. Optionally override rule parameters.

  4. Click Test.

rule_sandbox

Results show the alerts returned, execution time, and any error traceback. Nothing is persisted — the sandbox is a safe development environment.

Healthchecks

A healthcheck ties together a workload, a set of rules, a schedule, and notification settings.

Healthcheck dashboard

healthcheck_dashboard

The dashboard lists all healthchecks with their workload, current overall alert level, number of linked rules, and the date of the last evaluation.

Actions available per row:

  • Configure — open the configuration form.

  • View history — open the evaluation history for this healthcheck.

  • Enable / Disable — pause or resume the scheduled evaluation.

  • Run now — trigger an immediate evaluation outside the schedule.

Creating or editing a healthcheck

The configuration form has three tabs.

General tab

healthcheck_config_general

  • Name (required) — a unique name for this healthcheck.

  • Description — optional free-text description.

  • Workload (required) — select from existing workloads. The workload determines which clusters and filters are used for all rules.

    Note

    Changing the workload affects all rules in this healthcheck immediately. Workloads are shared and mutable — editing a workload impacts all healthchecks that reference it.

  • Schedule (cron) — when to run evaluations automatically. Uses the same cron fields as cluster ingestion (Minute, Hour, Day of Month, Day of Week, Month, Timezone).

  • Retention (days) — how long healthcheck evaluations are kept. Toggle on “Keep forever” if you do not want them to be automatically deleted.

Rules tab

healthcheck_config_rules

Lists the rules currently linked to this healthcheck.

  • Add rule — select a rule to link. Only published versions are available.

  • Version pinning — by default, the healthcheck always uses the latest published version of each rule (auto-upgrade). To lock to a specific version, pin it explicitly. Pinning is useful when you need reproducibility or want to control when upgrades happen.

  • Custom parameters — override the rule’s default parameter values for this healthcheck. Parameters not overridden inherit the rule’s published defaults.

  • Remove rule — unlinks the rule from this healthcheck (does not delete the rule).

Version strategy

Behavior

Latest (default, pin = null)

Always runs the most recently published version of the rule.

Pinned

Always runs the specific version selected, regardless of newer publications.

Notifications tab

healthcheck_config_notifications

  • Notification Enabled — master switch. When off, no notifications are sent regardless of other settings.

  • Minimum alert level — only send a notification if the overall alert level is at or above this threshold (OK, LOW, MEDIUM, HIGH, CRITICAL).

  • Notify on recovery — send a notification when the overall alert level decreases (i.e., the situation improves). Useful for confirming that an issue is resolved.

  • Notification profiles — select one or more notification profiles to deliver alerts to. All channels in all selected profiles receive the notification.

Notifications are triggered when:

  1. Notifications are enabled and

  2. the overall alert level is at or above the minimum threshold or the level decreased and “notify on recovery” is enabled.

Evaluation history

evaluation_history

The evaluation history page for a healthcheck shows:

  • Each evaluation’s timestamp, overall alert level, and execution time.

  • Whether the alert level changed compared to the previous evaluation.

  • Per-rule breakdowns: which version ran, the effective parameters, execution time, and any error.

  • Individual alerts: level, title, description, reason, and metadata.

Note

History is kept according to the healthcheck’s retention setting. Activate “Keep forever” to keep evaluations indefinitely. Otherwise, evaluations older than the configured number of days are deleted automatically.