SLURM Plugin for OKA Predict
Warning
Currently only available for OKA versions >= 1.15.0 and <= 2.0.0.
How it works
The following schema represents the internal process of the plugin.
Is user eligible
: Check whether the current user ID is among those the plugin should apply to upon submission. See Configuration for how to use this option.
OKA - login
: OKA requires a user to be logged in to use any of its APIs. For the time being, login must be performed with a login/password pair (see Configuration). At this step, OKA Predict sends back a token that is used in subsequent steps to prove our identity.
Convert job to JSON
: Transform the Slurm internal struct into a JSON object that OKA Predict can understand.
Default
: Some parameters (Account, Partition, QOS, User, Timelimit, WCKey) can be set automatically when they are missing, based on the partition or the main Slurm configuration.
Deduced
: Some parameters (Requested_Nodes, Requested_Memory, Requested_Cores) must be in a specific format for OKA Predict; at this step their proper values are deduced and the fields are reformatted to fit our requirements.
Added
: Some parameters are added because they are not present in the job information by default. The submission time Submit is set to the time the job goes through the plugin. The JobID is set using the Slurm internal method get_next_job_id.
Warning
get_next_job_id only returns the value that should be given to the next job handled by the scheduler. We need to test in order to confirm there is no overlap and that this job will be the one considered submitted with this value if we wish to use this later.
OKA - request prediction
: Call the OKA Predict API to request a prediction for a specific target. The plugin sends the job as JSON, the current target, and a list of predictors to OKA Predict. OKA Predict automatically attempts to filter the list of predictors to keep only those fitting the given job. The remaining predictors are then tried in the order they were defined. If a result is found and the confidence of the prediction is above the specified confidence_level_threshold (see Configuration) for the predictor, OKA Predict returns the result. Otherwise, it tries the next predictor in the list, and so on until either a result that fits the requirements is found or no more predictors are available.
Update job
: If OKA Predict sent back a prediction, the plugin replaces the job's original values with the ones sent back by OKA Predict.
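The predictor fallback described in the "OKA - request prediction" step can be illustrated with a minimal Python sketch. It is not the plugin's or OKA Predict's actual code: the function name first_confident_prediction and the predict callback are hypothetical, and treating a confidence equal to the threshold as acceptable is an assumption.

```python
def first_confident_prediction(predictors, predict):
    """Try predictors in their configured order and return the first usable result.

    predictors: list of dicts shaped like the "predictors" entries of the
                configuration (pipeline_name, pipeline_id, confidence_level_threshold).
    predict:    hypothetical callback returning (value, confidence) for a
                predictor, or None when that predictor yields no result.
    """
    for predictor in predictors:
        result = predict(predictor)
        if result is None:
            continue  # no result from this predictor: try the next one
        value, confidence = result
        # Assumption: a confidence equal to the threshold is accepted.
        if confidence >= predictor["confidence_level_threshold"]:
            return value, predictor["pipeline_name"]
    return None  # no predictor met its threshold: the job is left unchanged
```

With two predictors where only the second reaches its threshold, the sketch returns the second predictor's value; when none qualifies it returns None, which corresponds to the case where the job keeps its original values.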
Targets
Currently supported targets are:
ExecutionTimeBins
: Predict how long the job will run. Replaces the Slurm time_limit value with the prediction result. Slurm stores this value with a precision of one minute, so even if Predict returns a value of 30 seconds, it will be rounded to 1 minute.
MaxRSSBins
: Predict the peak memory usage of the job. Replaces the Slurm pn_min_memory value with the prediction result.
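As a worked illustration of the minute rounding for ExecutionTimeBins, here is a small Python sketch. The helper name to_slurm_minutes is hypothetical, and several points are assumptions rather than documented behavior: that the prediction is expressed in seconds, that the safety_factor from the configuration is applied before rounding, and that halves are rounded up. Whether the plugin enforces a minimum value is not stated here, so the sketch does not enforce one.

```python
import math


def to_slurm_minutes(predicted_seconds, safety_factor=1.0):
    """Convert a predicted duration in seconds to whole minutes.

    Assumptions: safety_factor is applied first, then the value is rounded
    to the nearest minute with halves rounded up.
    """
    scaled_seconds = predicted_seconds * safety_factor
    return int(math.floor(scaled_seconds / 60 + 0.5))


print(to_slurm_minutes(30))         # 30 seconds, the example given above
print(to_slurm_minutes(3600, 1.5))  # one hour with a 1.5 safety factor
```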
Logs
The plugin generates logs at either DEBUG or INFO level depending on the configuration.
Logs are written using the Slurm log mechanism; they will therefore be available in one of these files:
/var/log/messages
/var/log/slurmctld.log
Note
DEBUG and INFO mentioned here are internal levels for the plugin; in both cases, logs are sent as Slurm INFO logs.
Configuration
The plugin configuration is in JSON format and must be stored in a file named ${PATH_TO_SLURM_CONF_DIR}/predictit.conf (with ${PATH_TO_SLURM_CONF_DIR} being the path where SLURM stores its configuration files). Here is a template for the configuration file:
{
    "user": "@OKA_USER_LOGIN@",
    "password": "@OKA_USER_PASSWORD@",
    "oka_url": "@OKA_URL@",
    "targets":
    {
        "ExecutionTimeBins":
        {
            "safety_factor": 1,
            "predictors":
            [
                {
                    "pipeline_name": "predictor_exec1",
                    "pipeline_id": "5",
                    "confidence_level_threshold": 0.7
                },
                {
                    "pipeline_name": "predictor_exec2",
                    "pipeline_id": "7",
                    "confidence_level_threshold": 0.8
                }
            ]
        },
        "MaxRSSBins":
        {
            "safety_factor": 1.1,
            "predictors":
            [
                {
                    "pipeline_name": "predictor_mem",
                    "pipeline_id": "6",
                    "confidence_level_threshold": 0.8
                }
            ]
        }
    },
    "restricted_users": [1501],
    "restricted_groups": [5443],
    "debug": true,
    "force_login": true
}
Configuration file format:
user
: OKA login.
password
: OKA user password.
oka_url
: [INIT]
URL to reach OKA. This parameter is read only ONCE, the first time Slurm uses the plugin. If you wish to change the OKA URL, you will need to restart Slurm.
targets
: {}
A dict of OKA Predict targets that we will try to predict per job:
safety_factor
: A float > 0. It represents the factor by which the prediction result is multiplied before updating the job with it. Use this to either reduce or increase the predicted value. The default value is 1 if the field is not present in the configuration.
predictors
: {}
A list of predictors with the following parameters:
pipeline_name
: A string representing the predictor name within the OKA database.
pipeline_id
: An integer representing the predictor ID within the OKA database.
confidence_level_threshold
: A float between 0 and 1. It represents the minimum level of confidence a prediction must reach for the plugin to take its result into account.
Note
pipeline_name and pipeline_id can be retrieved manually from the admin panel or using the Predict-IT Client list command.
restricted_users
: []
An array of numerical UIDs. Prediction will be attempted only for users whose UIDs are listed. If undefined or empty, the plugin is applied to all users.
Note
The user ID (UID) checked against this list by the plugin is either that of the user who submitted the job or the value of the --uid option used while submitting the job.
restricted_groups
: []
An array of numerical GIDs. Prediction will be attempted only for groups whose GIDs are listed. If undefined or empty, the plugin is applied to all groups.
Note
The group ID (GID) checked against this list by the plugin is either the one associated with the user who submitted the job or the value of the --gid option used while submitting the job.
Important
If only one of the --uid or --gid options is used while submitting a job, the value used within the plugin for the other option will be the one associated with the user actually submitting the job.
dry_run
: false
A Boolean specifying that the result of the predictions should not be applied to the submitted job. Use this if you wish to see how the plugin behaves without any impact on the jobs.
debug
: false
A Boolean specifying whether DEBUG level logs should be used.
force_login
: true
A Boolean specifying whether the plugin should attempt to connect to OKA before making prediction API calls.
Note
force_login is currently true by default because the plugin does not yet handle automatic reconnection.
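Since the configuration is plain JSON with a few documented constraints, a file can be sanity-checked before Slurm is restarted. The Python sketch below is a hypothetical helper and not part of the plugin; the names validate_config and check_file are invented for illustration. It only checks constraints stated above (safety_factor > 0, confidence_level_threshold between 0 and 1, numerical IDs, Boolean flags). Treating user, password and oka_url as mandatory non-empty strings is an assumption.

```python
import json


def validate_config(cfg):
    """Return a list of human-readable problems found in a parsed predictit.conf."""
    problems = []

    def is_number(value):
        # bool is a subclass of int in Python, so exclude it explicitly.
        return isinstance(value, (int, float)) and not isinstance(value, bool)

    # Assumption: the three connection fields are mandatory, non-empty strings.
    for key in ("user", "password", "oka_url"):
        if not isinstance(cfg.get(key), str) or not cfg[key]:
            problems.append(f"{key}: expected a non-empty string")

    for target, spec in cfg.get("targets", {}).items():
        factor = spec.get("safety_factor", 1)  # documented default is 1
        if not is_number(factor) or factor <= 0:
            problems.append(f"{target}.safety_factor: expected a float > 0")
        for index, predictor in enumerate(spec.get("predictors", [])):
            threshold = predictor.get("confidence_level_threshold")
            if not is_number(threshold) or not 0 <= threshold <= 1:
                problems.append(
                    f"{target}.predictors[{index}].confidence_level_threshold: "
                    "expected a float between 0 and 1"
                )

    for key in ("restricted_users", "restricted_groups"):
        ids = cfg.get(key, [])
        if not isinstance(ids, list) or not all(
            isinstance(i, int) and not isinstance(i, bool) for i in ids
        ):
            problems.append(f"{key}: expected an array of numerical IDs")

    for key in ("dry_run", "debug", "force_login"):
        if key in cfg and not isinstance(cfg[key], bool):
            problems.append(f"{key}: expected a Boolean")

    return problems


def check_file(path):
    """Load a configuration file and return the problems found in it."""
    with open(path, encoding="utf-8") as handle:
        return validate_config(json.load(handle))
```

An empty list means no problem was detected by these checks; it does not guarantee that the plugin will accept the file, since the plugin's own parsing rules are not reproduced here.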
Known Limitations
Login
Automatic re-login is not yet supported. Use force_login: true to avoid errors while using the plugin.
Memory
Currently, only jobs requesting per-node memory are handled for the MaxRSSBins target.