HealthcheckRule

class applications.healthcheck.core.base.HealthcheckRule(workload: Workload, **params: Any)

Bases: ABC

Abstract base class for all healthcheck rules.

Rule authors must:

  1. Define the params_model class attribute (a Pydantic BaseModel subclass)

  2. Implement the evaluate() method, which returns a list of Alert objects

The base class provides:

  • Parameter validation via the params_model

  • Access to the workload context

  • Data access methods for querying cluster load metrics
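The parameter-validation step can be sketched as follows. This is a minimal illustration, not the actual base class: SketchRule and ThresholdParams are hypothetical stand-ins, and the sketch assumes that the keyword arguments are passed to params_model at construction time, so an invalid value raises a pydantic ValidationError before evaluate() ever runs.

```python
from typing import Any

from pydantic import BaseModel, Field, ValidationError


class ThresholdParams(BaseModel):
    # A fraction between 0 and 1, defaulting to 90%.
    threshold: float = Field(default=0.90, ge=0.0, le=1.0)


class SketchRule:
    """Hypothetical stand-in for HealthcheckRule (assumed wiring only)."""

    params_model: type[BaseModel] = ThresholdParams

    def __init__(self, workload: Any, **params: Any) -> None:
        self.workload = workload
        # Assumption: raw keyword arguments are validated by the model
        # and the validated instance is exposed as self.params.
        self.params = self.params_model(**params)


rule = SketchRule(workload=None, threshold=0.75)
default_rule = SketchRule(workload=None)  # falls back to the default

try:
    SketchRule(workload=None, threshold=1.5)  # violates le=1.0
    rejected = False
except ValidationError:
    rejected = True
```

Under these assumptions, rule.params.threshold holds the validated value, and an out-of-range threshold is rejected at construction rather than inside the rule logic.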

Example

>>> from pydantic import BaseModel, Field
>>> from applications.healthcheck.core import HealthcheckRule, Alert
>>> from applications.healthcheck.models import AlertLevel
>>>
>>> class MyRuleParams(BaseModel):
...     threshold: float = Field(default=0.90, ge=0.0, le=1.0)
...
>>> class MyRule(HealthcheckRule):
...     params_model = MyRuleParams
...
...     def evaluate(self) -> list[Alert]:
...         load = self.get_core_load()
...         if load.utilization_rate > self.params.threshold:
...             return [Alert(
...                 level=AlertLevel.HIGH,
...                 title="High core utilization",
...                 description=f"Utilization is {load.utilization_rate:.1%}",
...                 reason=f"Exceeds threshold of {self.params.threshold:.1%}",
...             )]
...         return []

params_model

Class attribute - Pydantic model for parameter validation.

Type:

type[pydantic.main.BaseModel]

params

Instance attribute - Validated parameters for this execution.

workload

Instance attribute - The workload this rule operates on.

property cluster_load: ClusterLoadDataProvider

Get or create the cluster load data provider (cores, cost, CO₂, power).

Type:

ClusterLoadDataProvider
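"Get or create" suggests that a provider is built on first access and reused afterwards. A minimal sketch of that pattern, assuming the provider is cached on the rule instance. FakeProvider and LazyRule are hypothetical stand-ins, not the real classes, and the real constructor arguments are not documented here.

```python
from typing import Optional


class FakeProvider:
    """Hypothetical stand-in for a provider such as ClusterLoadDataProvider."""

    instances_created = 0  # counts constructions, for demonstration only

    def __init__(self, workload: object) -> None:
        FakeProvider.instances_created += 1
        self.workload = workload


class LazyRule:
    """Hypothetical stand-in showing a get-or-create property."""

    def __init__(self, workload: object) -> None:
        self.workload = workload
        self._cluster_load: Optional[FakeProvider] = None  # nothing built yet

    @property
    def cluster_load(self) -> FakeProvider:
        # Create the provider on first access, then return the same object.
        if self._cluster_load is None:
            self._cluster_load = FakeProvider(self.workload)
        return self._cluster_load


rule = LazyRule(workload="demo-workload")
first = rule.cluster_load   # creates the provider
second = rule.cluster_load  # reuses it
```

With this pattern, a rule that never touches a given provider never pays for constructing it, and repeated accesses within one evaluate() call share the same object.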

property congestion: CongestionDataProvider

Get or create the congestion data provider (cluster congestion metrics).

Type:

CongestionDataProvider

property cores: CoresDataProvider

Get or create the cores data provider (cores distribution analysis).

Type:

CoresDataProvider

property cores_memory: CoresMemoryDataProvider

Get or create the cores-memory data provider (cores vs memory distribution).

Type:

CoresMemoryDataProvider

abstractmethod evaluate() → list[Alert]

Execute rule logic and return alerts.

This is the main method that rule authors must implement. It should:

  1. Access data using the data access methods (self.get_core_load, etc.)

  2. Apply business logic to detect issues

  3. Return a list of Alert objects for any issues found

Returns:

List of Alert objects. Empty list means no issues detected. Return AlertLevel.OK alerts explicitly if you want to record that the check passed.
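The three possible outcomes (issue found, check passed silently, check passed and recorded) can be illustrated with stand-in types. Alert and AlertLevel below are simplified, hypothetical versions that carry only the fields and levels named in this documentation; the enum values and the record_ok flag are assumptions made for the sketch.

```python
from dataclasses import dataclass
from enum import Enum


class AlertLevel(Enum):
    # Stand-in: only OK and HIGH are named in this documentation.
    OK = "ok"
    HIGH = "high"


@dataclass
class Alert:
    # Stand-in with the fields used in the class-level example.
    level: AlertLevel
    title: str
    description: str
    reason: str


def evaluate_utilization(
    utilization: float, threshold: float, record_ok: bool
) -> list[Alert]:
    """Mimic the return contract of evaluate() for a single check."""
    if utilization > threshold:
        return [Alert(
            level=AlertLevel.HIGH,
            title="High core utilization",
            description=f"Utilization is {utilization:.1%}",
            reason=f"Exceeds threshold of {threshold:.1%}",
        )]
    if record_ok:
        # Explicit OK alert: the passing check is recorded.
        return [Alert(
            level=AlertLevel.OK,
            title="Core utilization within limits",
            description=f"Utilization is {utilization:.1%}",
            reason=f"Does not exceed threshold of {threshold:.1%}",
        )]
    # Empty list: no issues detected and nothing recorded.
    return []
```

Returning an empty list keeps reports short; returning an OK alert is useful when a consumer needs evidence that the check actually ran.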

Notes

  • Do not catch exceptions; let them propagate for logging

  • Access parameters via self.params

  • Access workload via self.workload

  • Use data access methods provided by base class
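The first note implies that whatever invokes the rule is responsible for logging failures. A minimal sketch of that division of labour, assuming a caller that wraps evaluate(). BrokenRule and run_rule are hypothetical and do not describe the actual healthcheck runner.

```python
import logging

logger = logging.getLogger("healthcheck.sketch")


class BrokenRule:
    """Hypothetical rule whose data access fails."""

    def evaluate(self) -> list:
        # No try/except here: the error is left to propagate to the caller.
        raise RuntimeError("metrics backend unavailable")


def run_rule(rule) -> tuple[list, bool]:
    """Hypothetical runner: logs failures in one place and reports them."""
    try:
        return rule.evaluate(), True
    except Exception:
        # logger.exception records the message together with the traceback.
        logger.exception("Rule %s failed", type(rule).__name__)
        return [], False


alerts, succeeded = run_rule(BrokenRule())
```

Keeping the try/except out of the rule means every failure is logged with its full traceback in one consistent place, instead of being swallowed or reformatted differently by each rule.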

property execution_time: ExecTimeDataProvider

Get or create the execution time data provider.

Type:

ExecTimeDataProvider

property gpu: GpuDataProvider

Get or create the GPU data provider (GPU distribution analysis).

Type:

GpuDataProvider

property gpu_load: GpuLoadDataProvider

Get or create the GPU load data provider (allocated GPUs, utilization).

Type:

GpuLoadDataProvider

property job_frequency: JobFrequencyDataProvider

Get or create the job frequency data provider.

Type:

JobFrequencyDataProvider

property job_load: JobLoadDataProvider

Get or create the job load data provider (running/waiting job counts).

Type:

JobLoadDataProvider

property jobs: JobsDataProvider

Get or create the jobs data provider (job listing and counts from AIT).

Type:

JobsDataProvider

property kpi: KpiDataProvider

Get or create the KPI data provider.

Type:

KpiDataProvider

property memory: MemoryDataProvider

Get or create the memory data provider (memory distribution analysis).

Type:

MemoryDataProvider

property memory_ratio: MemoryRatioDataProvider

Get or create the memory ratio data provider (consumed vs requested memory).

Type:

MemoryRatioDataProvider

property nodes: NodesDataProvider

Get or create the nodes data provider (nodes distribution analysis).

Type:

NodesDataProvider

property nodes_load: NodesLoadDataProvider

Get or create the nodes load data provider (per-node job counts).

Type:

NodesLoadDataProvider

property occupancy: OccupancyDataProvider

Get or create the occupancy data provider (cores/nodes state from monitoring).

Type:

OccupancyDataProvider

property slowdown: SlowdownDataProvider

Get or create the slowdown data provider.

Type:

SlowdownDataProvider

property state: StateDataProvider

Get or create the state data provider (jobs status by state).

Type:

StateDataProvider

property submission_time: SubmitDateDataProvider

Get or create the submission date data provider.

Type:

SubmitDateDataProvider

property waiting_time: WaitTimeDataProvider

Get or create the wait time data provider.

Type:

WaitTimeDataProvider

property workload: Workload

Access the workload for filters or alert context.

Returns:

The Workload instance this rule is evaluating against.