HealthcheckRule
- class applications.healthcheck.core.base.HealthcheckRule(workload: Workload, **params: Any)
Bases:
ABC
Abstract base class for all healthcheck rules.
Rule authors must:
1. Define the params_model class attribute (a Pydantic BaseModel subclass)
2. Implement the evaluate() method, which returns a list of Alert objects
The base class provides:
- Parameter validation via the params_model
- Access to the workload context
- Data access methods for querying cluster load metrics
Example
>>> from pydantic import BaseModel, Field
>>> from applications.healthcheck.core import HealthcheckRule, Alert
>>> from applications.healthcheck.models import AlertLevel
>>>
>>> class MyRuleParams(BaseModel):
...     threshold: float = Field(default=0.90, ge=0.0, le=1.0)
...
>>> class MyRule(HealthcheckRule):
...     params_model = MyRuleParams
...
...     def evaluate(self) -> list[Alert]:
...         load = self.get_core_load()
...         if load.utilization_rate > self.params.threshold:
...             return [Alert(
...                 level=AlertLevel.HIGH,
...                 title="High core utilization",
...                 description=f"Utilization is {load.utilization_rate:.1%}",
...                 reason=f"Exceeds threshold of {self.params.threshold:.1%}",
...             )]
...         return []
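The way keyword parameters might flow from the constructor into params can be sketched without the real package. This is a minimal, standard-library-only illustration, not the actual HealthcheckRule implementation: SketchRule, ThresholdParams and ThresholdRule are hypothetical names, and a dataclass with a manual range check stands in for the Pydantic model.

```python
from abc import ABC, abstractmethod
from dataclasses import dataclass
from typing import Any


@dataclass
class ThresholdParams:
    """Hypothetical stand-in for a Pydantic params model."""

    threshold: float = 0.90

    def __post_init__(self) -> None:
        # Mimics the ge=0.0 / le=1.0 bounds from the example above.
        if not 0.0 <= self.threshold <= 1.0:
            raise ValueError("threshold must be between 0.0 and 1.0")


class SketchRule(ABC):
    """Simplified stand-in for HealthcheckRule (assumed behaviour)."""

    params_model: type

    def __init__(self, workload: Any, **params: Any) -> None:
        self.workload = workload
        # Assumption: parameters are validated once, at construction time.
        self.params = self.params_model(**params)

    @abstractmethod
    def evaluate(self) -> list:
        """Subclasses return a list of alerts."""


class ThresholdRule(SketchRule):
    params_model = ThresholdParams

    def evaluate(self) -> list:
        return []


rule = ThresholdRule(workload="demo-workload", threshold=0.75)
print(rule.params.threshold)

try:
    ThresholdRule(workload="demo-workload", threshold=1.5)
except ValueError as exc:
    print(f"rejected: {exc}")
```

With this shape, an invalid parameter fails when the rule is constructed rather than partway through evaluate().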
- params_model
Class attribute - Pydantic model for parameter validation.
- Type:
type[pydantic.main.BaseModel]
- params
Instance attribute - Validated parameters for this execution.
- workload
Instance attribute - The workload this rule operates on.
- property cluster_load: ClusterLoadDataProvider
Get or create the cluster load data provider (cores, cost, CO₂, power).
- Type:
ClusterLoadDataProvider
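"Get or create" suggests that each provider is built on first access and reused afterwards. The sketch below shows one common way to write such a property; it is an assumption about the pattern, not the library's code, and FakeClusterLoadProvider and LazyRule are hypothetical names.

```python
from typing import Any, Optional


class FakeClusterLoadProvider:
    """Hypothetical provider that counts how often it is constructed."""

    instances = 0

    def __init__(self, workload: Any) -> None:
        FakeClusterLoadProvider.instances += 1
        self.workload = workload


class LazyRule:
    """Hypothetical rule base showing a get-or-create property."""

    def __init__(self, workload: Any) -> None:
        self.workload = workload
        self._cluster_load: Optional[FakeClusterLoadProvider] = None

    @property
    def cluster_load(self) -> FakeClusterLoadProvider:
        # Build the provider on first access, then reuse the same object.
        if self._cluster_load is None:
            self._cluster_load = FakeClusterLoadProvider(self.workload)
        return self._cluster_load


rule = LazyRule(workload="demo-workload")
print(FakeClusterLoadProvider.instances)  # 0: nothing built yet
first = rule.cluster_load
second = rule.cluster_load
print(first is second, FakeClusterLoadProvider.instances)  # True 1
```

Under this pattern a rule pays only for the providers it actually touches, and repeated accesses inside evaluate() share one provider instance.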
- property congestion: CongestionDataProvider
Get or create the congestion data provider (cluster congestion metrics).
- Type:
CongestionDataProvider
- property cores: CoresDataProvider
Get or create the cores data provider (cores distribution analysis).
- Type:
CoresDataProvider
- property cores_memory: CoresMemoryDataProvider
Get or create the cores-memory data provider (cores vs memory distribution).
- Type:
CoresMemoryDataProvider
- abstractmethod evaluate() -> list[Alert]
Execute rule logic and return alerts.
This is the main method that rule authors must implement. It should:
1. Access data using the data access methods (self.get_core_load, etc.)
2. Apply business logic to detect issues
3. Return a list of Alert objects for any issues found
- Returns:
List of Alert objects. Empty list means no issues detected. Return AlertLevel.OK alerts explicitly if you want to record that the check passed.
Notes
Do not catch exceptions; let them propagate for logging.
Access parameters via self.params.
Access workload via self.workload.
Use the data access methods provided by the base class.
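The return contract and the exception note can be illustrated with stand-in types. In this sketch, Alert and AlertLevel are simplified local stand-ins carrying only the fields and levels that appear on this page, and evaluate_utilization and run_rule are hypothetical helpers; the real framework's runner may behave differently.

```python
import logging
from dataclasses import dataclass
from enum import Enum


class AlertLevel(Enum):
    """Stand-in enum; only the levels named on this page."""

    OK = "ok"
    HIGH = "high"


@dataclass
class Alert:
    """Stand-in with the fields used in the class example."""

    level: AlertLevel
    title: str
    description: str
    reason: str


def evaluate_utilization(utilization: float, threshold: float) -> list[Alert]:
    """Return a HIGH alert above the threshold, else an explicit OK alert."""
    if utilization > threshold:
        return [Alert(
            level=AlertLevel.HIGH,
            title="High core utilization",
            description=f"Utilization is {utilization:.1%}",
            reason=f"Exceeds threshold of {threshold:.1%}",
        )]
    # Recording the pass explicitly instead of returning an empty list.
    return [Alert(
        level=AlertLevel.OK,
        title="Core utilization within limits",
        description=f"Utilization is {utilization:.1%}",
        reason=f"At or below threshold of {threshold:.1%}",
    )]


def run_rule(evaluate, *args):
    """Hypothetical runner: log any failure, then re-raise it."""
    try:
        return evaluate(*args)
    except Exception:
        logging.getLogger("healthcheck").exception("rule failed")
        raise


alerts = run_rule(evaluate_utilization, 0.95, 0.90)
print(alerts[0].level.name, alerts[0].description)
```

Because the rule body contains no try/except, any data access failure reaches the runner untouched, which is where this sketch assumes logging takes place.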
- property execution_time: ExecTimeDataProvider
Get or create the execution time data provider.
- Type:
ExecTimeDataProvider
- property gpu: GpuDataProvider
Get or create the GPU data provider (GPU distribution analysis).
- Type:
GpuDataProvider
- property gpu_load: GpuLoadDataProvider
Get or create the GPU load data provider (allocated GPUs, utilization).
- Type:
GpuLoadDataProvider
- property job_frequency: JobFrequencyDataProvider
Get or create the job frequency data provider.
- Type:
JobFrequencyDataProvider
- property job_load: JobLoadDataProvider
Get or create the job load data provider (running/waiting job counts).
- Type:
JobLoadDataProvider
- property jobs: JobsDataProvider
Get or create the jobs data provider (job listing and counts from AIT).
- Type:
JobsDataProvider
- property kpi: KpiDataProvider
Get or create the KPI data provider.
- Type:
KpiDataProvider
- property memory: MemoryDataProvider
Get or create the memory data provider (memory distribution analysis).
- Type:
MemoryDataProvider
- property memory_ratio: MemoryRatioDataProvider
Get or create the memory ratio data provider (consumed vs requested memory).
- Type:
MemoryRatioDataProvider
- property nodes: NodesDataProvider
Get or create the nodes data provider (nodes distribution analysis).
- Type:
NodesDataProvider
- property nodes_load: NodesLoadDataProvider
Get or create the nodes load data provider (per-node job counts).
- Type:
NodesLoadDataProvider
- property occupancy: OccupancyDataProvider
Get or create the occupancy data provider (cores/nodes state from monitoring).
- Type:
OccupancyDataProvider
- property slowdown: SlowdownDataProvider
Get or create the slowdown data provider.
- Type:
SlowdownDataProvider
- property state: StateDataProvider
Get or create the state data provider (jobs status by state).
- Type:
StateDataProvider
- property submission_time: SubmitDateDataProvider
Get or create the submission date data provider.
- Type:
SubmitDateDataProvider
- property waiting_time: WaitTimeDataProvider
Get or create the wait time data provider.
- Type:
WaitTimeDataProvider
- property workload: Workload
Access the workload for filters or alert context.
- Returns:
The Workload instance this rule is evaluating against.