Meteo Cluster

Introduction

This plugin allows you to get predictions on multiple timeseries for the next week, month, quarter or year at the hour, day, week or month precision. Depending on the consumption switch, available target timeseries are:

When the consumption switch is set on Performance:
- The number of jobs in queue
- The cluster load for running jobs
- The cluster load for running jobs relatively to the total cores number of the cluster
When the consumption switch is set on Cost:
- The jobs-related cost
- The cost related to your Cloud consumption (Budget edition only)
When the consumption switch is set on Power:
- The jobs power consumption (Energy edition only)

How it works

The models training occurs in backend and you can then see the training metrics and ask for predictions using MeteoCluster plugin in OKA interface. During each training, a new model will be trained using the most recent logs. You can define the amount of data to keep as test in the administrator panel (see Meteo Cluster - Conf). Default is the period to predict. Upon successful training, the test dataset will be used to update the model following an online learning procedure. This way, the model will be updated on the most recent data available and ready to predict the future unseen period.

Before training, you must first configure/modify a MeteoCluster pipeline in the administrator panel. For each prediction pipeline, you have 2 configurations to make:

the pipeline configuration (see Conf pipelines). Using this configuration, you can filter the input jobs used to train the models.

the MeteoCluster configuration (see Meteo Cluster - Conf). Using this configuration, you can modify the training and algorithm parameters. You can chose if you want to optimize the model hyperparameters, which algorithm to use…

Meteo Cluster training is then triggered using OKA scheduler. You can see the status of the training pipelines in the administrator panel (see Periodic Tasks). If needed, you can also manually trigger a training using this panel.

The displayed graph

When the training is completed, you will be able to see the predictions for the past and next periods using OKA interface. The lines displayed on the main graph show the observed values in blue and the predictions in red. Predictions are computed n steps ahead depending on the Pred seq parameter in MeteoCluster configuration. Default is 1 step ahead (1)-ahead, i.e. the observed values until the n-1 step will be used to predict the nth step. Observed values used for training and testing are highlighted by the light blue and red areas, respectively. If new logs have been ingested since the last training, observed values will be observed after the testing period.

Use the date filter to select the period during which you would like to retrieve the observed values. You can add the confidence intervals on the predictions using the checkbox (available only for LSTM algorithm). You can choose to display the predictions for an older model by choosing another training date in the corresponding dropdown list.

Post-processors

A post-processor allows you to spot peaks and valleys on the future predictions based on a target threshold for a specific duration. Using this tool, you will be able to:

Spot in the future when your cluster will be less used to plan a maintenance period

Spot in the future when your cluster will be overloaded to anticipate the users needs

Metrics

You will have access to the training metrics using the See train & test metrics.

Test metrics

Predictions accuracies on the test dataset are evaluated using 3 metrics:

R2: Coefficient of determination in percent. It shows how well observed values are replicated by the model. The higher the R2, the better the predictions.

RMSE: Root mean squared error. It measures the differences between values predicted by the model and the values observed. It is in the same unit than the predicted target. The lower the RMSE, the better the model.

rRMSE: Relative Root mean squared error. It is the RMSE normalized by the mean observed values. It is expressed in percent. The lower the rRMSE, the better the model.

The metrics history shows the evolution of those accuracy metrics along time (each point is related to 1 training date).

Train metrics

The training details show metrics about the jobs used to train and test the model.

If available, the parameters used by MeteoCluster and by the algorithm can be displayed using the Parameters used for training button.

Depending on the algorithm used, the interface will display different information about the trained model:

If LSTM: The features used to train the model

If NeuralProphet: The forecast components used by the model

The next graph shows the hyperparameters used by the model. If optimization of the hyperparameters has been done, it will display the Mean Squared Error associated to each hyperparameters set.

The last graph is the evolution of Train and Validation losses along epochs.

Training procedure

MeteoCluster is pre-configured to train automatically, however it is also possible to manually train the models. During a manual training, you will train a model iteratively to find the best parameters combination to maximize your predictions accuracies.

To do a manual training, you must follow this procedure:

In the admin panel, modify the training parameters in METEO CLUSTER > Conf meteo clusters for the concerned pipeline, for example:

Algorithms if you want to change the algorithm used by default (LSTM).

Select Epochs auto if you want to optimize the number of epochs used to train the algorithm. During epochs optimization MeteoCluster will monitor the validation loss to stop model training upon its decrease. The minimum number of epochs is 10% of the number of epochs set in the configuration.

Modify N lags to configure the number of timesteps in the past to consider to make the next forecast. For example, using 14 as N lags, the model will use a sequence of 14 timesteps to predict the 15th one.

Modify Pred seq to set how much steps ahead must be considered to make the predictions. For example, using 2 as Pred seq, the model will try to predict 2 steps ahead. If we keep the example used for N lags, it will predict the 16th timestep instead of the 15th.

Once you have updated the training parameters, trigger the model training using the RUN NOW button in the PERIODIC TASKS > Periodic tasks panel (see Periodic Tasks). In this panel, you will be able to follow the training status. A successful task means that the training has ended.

Go to OKA interface > Meteo Cluster to see the past and future predictions, and the associated training metrics.

Use the training history to spot the best accuracies. The hover will show you the training date of the model for each point. You can then display the metrics and parameters used for this model using the dropdown list on training dates.

Repeat the previous steps to find the best parameters combination.

Some hyperparameters of the model can be automatically optimized when checking Optimize in the Advanced options of MeteoCluster configuration. By default 100 hyperparameters combinations will be tried. This will then take a long time to compute. The result of the optimization will be displayed in the Train metrics. The best combination, minimizing the Mean Squared Error, will be used to train the model. Check Previous hyperparameters and uncheck Optimize if you wish to use an optimized combination of hyperparameters for the following trainings. You may also modify MeteoCluster configuration to set those hyperparameters.