SLURM Plugin for OKA Predict

Note

Available for OKA >= 2.4.0

How it works

The following diagram represents the internal process of the plugin.

[Figure: internal process of the plugin]

  • Is user eligible: Check whether the ID of the user submitting the job is among those the plugin should be applied to. See Configuration to see how to use this option.

  • OKA - login: OKA requires a user to be authenticated in order to use any of its APIs. For the time being, login must be done with a login/password (See Configuration). At this step, OKA Predict sends back a token that is used in the following steps to prove the plugin's identity.

  • Convert job to JSON: Transform the Slurm internal struct into a JSON object that OKA Predict can understand.

    • Default: Some parameters (Account, Partition, QOS, User, Timelimit, WCKey) can be set automatically when they are missing, based on the partition or the main Slurm configuration.

    • Deduced: Some parameters (Requested_Nodes, Requested_Memory, Requested_Cores) must be in a specific format for OKA Predict. At this step, the plugin deduces their proper values and reformats those fields to fit OKA Predict's requirements.

    • Added: Some parameters are added because they are not present in the job information by default. The submission time Submit is set to the time the job goes through the plugin. The JobID is set using the Slurm internal method get_next_job_id.

      Warning

      get_next_job_id only returns the value that should be given to the next job handled by the scheduler. Testing is still needed to confirm that there is no overlap and that this job is the one actually submitted with this value before relying on it later.

  • OKA - request prediction: Call the OKA API to request a prediction for a specific target. The plugin sends the job as JSON, the current target, and a list of predictors to OKA Predict. OKA Predict first filters the list to keep only the predictors that fit the given job. The remaining predictors are then tried in the order in which they were defined. If a predictor finds a result and the confidence of the prediction is above the confidence_level_threshold specified for that predictor (See Configuration), OKA Predict returns the result. Otherwise, it tries the next predictor in the list, and so on until either a result that meets the requirements is found or no predictors remain.

  • Update job: If OKA Predict sent back a prediction, the plugin replaces the job's original values with the ones sent back by OKA Predict.
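The predictor fallback described in the "OKA - request prediction" step can be sketched as follows. This is only an illustration of the selection logic, not the actual OKA Predict implementation: the function select_prediction and the predict callback are hypothetical names, the filtering of predictors that do not fit the job is left out, and a confidence equal to the threshold is assumed to be accepted.

```python
from typing import Callable, Dict, List, Optional, Tuple

# Hypothetical callback standing in for OKA Predict: given a predictor entry
# and a job, it returns (value, confidence), or None if no result is found.
PredictFn = Callable[[Dict, Dict], Optional[Tuple[float, float]]]


def select_prediction(job: Dict, predictors: List[Dict], predict: PredictFn) -> Optional[float]:
    """Return the first prediction whose confidence reaches its threshold."""
    for predictor in predictors:  # predictors are tried in configured order
        result = predict(predictor, job)
        if result is None:
            continue  # no result: move on to the next predictor
        value, confidence = result
        # Assumption: a confidence equal to the threshold is accepted.
        if confidence >= predictor["confidence_level_threshold"]:
            return value
    return None  # no predictor qualified: the job is left unchanged
```

With two predictors whose thresholds are 0.7 and 0.8, a first prediction with confidence 0.5 is rejected and the second predictor is consulted. If it answers with confidence 0.9, its value is the one returned.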

Targets

Currently supported targets are:

  • ExecutionTimeBins: Predict how long the job will run. Replaces the Slurm time_limit value with the prediction result. Slurm rounds this value to the nearest minute, so even if OKA Predict returns a value of 30 seconds, it will be rounded to 1 minute.

  • MaxRSSBins: Predict the peak memory usage of the job. Replaces the Slurm pn_min_memory value with the prediction result.
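As a rough illustration of how an ExecutionTimeBins prediction could end up as a time limit in minutes, the sketch below multiplies the prediction by the safety_factor from the configuration and rounds to the nearest minute. It rests on several assumptions: the prediction is taken to be expressed in seconds, halves are rounded up (which matches the 30 seconds to 1 minute example), and time_limit_minutes is a hypothetical helper, not a function of the plugin.

```python
import math


def time_limit_minutes(predicted_seconds: float, safety_factor: float = 1.0) -> int:
    """Scale a predicted duration and round it to the nearest minute."""
    if safety_factor <= 0:
        raise ValueError("safety_factor must be > 0")
    scaled = predicted_seconds * safety_factor
    # Assumption: round half up, so that 30 seconds becomes 1 minute.
    return math.floor(scaled / 60 + 0.5)
```

For example, time_limit_minutes(30) gives 1, and time_limit_minutes(600, 1.5) gives 15.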

Logs

The plugin generates logs at either DEBUG or INFO level, depending on the configuration. Logs are written using the Slurm log mechanism, so they are available in one of these files:

  • /var/log/messages

  • /var/log/slurmctld.log

Note

The DEBUG and INFO levels mentioned here are internal to the plugin. In both cases, logs are sent as Slurm INFO logs.

Configuration

The plugin configuration is in JSON format, and must be stored in a file named ${PATH_TO_SLURM_CONF_DIR}/predictit.conf (with ${PATH_TO_SLURM_CONF_DIR} being the path where SLURM stores its configuration files). Here is a template for the configuration file:

{
        "user": "@OKA_USER_LOGIN@",
        "password": "@OKA_USER_PASSWORD@",
        "oka_url": "@OKA_URL@",
        "targets":
        {
                "ExecutionTimeBins":
                {
                        "safety_factor": 1,
                        "predictors":
                        [
                                {
                                        "predictor_name": "demo2_predictor_executiontimebins_1",
                                        "predictor_id": "10",
                                        "confidence_level_threshold": 0.7
                                },
                                {
                                        "predictor_name": "demo2_predictor_executiontimebins_2",
                                        "predictor_id": "7",
                                        "confidence_level_threshold": 0.8
                                }
                        ]
                },
                "MaxRSSBins":
                {
                        "safety_factor": 1.1,
                        "predictors":
                        [
                                {
                                        "predictor_name": "demo2_predictor_maxrssbins_1",
                                        "predictor_id": "11",
                                        "confidence_level_threshold": 0.8
                                }
                        ]
                }
        },
        "restricted_users": [1000],
        "restricted_groups": [1000],
        "debug": true,
        "force_login": true
}

Configuration file format:

  • user: OKA login.

  • password: OKA user password.

  • oka_url: [INIT] URL to reach OKA. This parameter is read only ONCE, the first time Slurm uses the plugin. If you wish to change the OKA URL, you will need to restart Slurm.

  • targets: {} A dict of OKA Predict targets that we will try to predict per job:

    • safety_factor: A float > 0. It is the factor by which the prediction result is multiplied before the job is updated with it. Use this to either reduce or increase the predicted value. The default value is 1 if the field is not present in the configuration.

    • predictors: [] A list of predictors with the following parameters:

      • predictor_name: A string representing the predictor name within OKA database.

      • predictor_id: An integer representing the predictor ID within OKA database.

      • confidence_level_threshold: A float between 0 and 1. It represents the minimum level of confidence a prediction should reach in order for the plugin to take into account its result.

    Note

    predictor_name and predictor_id can be found on the Predictor dashboard in OKA (see Predictors).

  • restricted_users: [] An array of numerical UIDs. Prediction will be attempted only for users whose UIDs are listed. If undefined or empty, the plugin is applied to all users.

    Note

    The user ID (UID) checked against this list is either that of the user who submitted the job or the value of the --uid option used when submitting the job.

  • restricted_groups: [] An array of numerical GIDs. Prediction will be attempted only for groups whose GIDs are listed. If undefined or empty, the plugin is applied to all groups.

    Note

    The group ID (GID) checked against this list is either the one associated with the user who submitted the job or the value of the --gid option used when submitting the job.

    Important

    If only one of the --uid or --gid options is used while submitting a job, the value used within the plugin for the other option will be the one associated with the user actually submitting the job.

  • dry_run: false A Boolean. When true, the results of the predictions are not applied to the submitted job. Use this if you wish to see how the plugin behaves without any impact on the jobs.

  • debug: false A Boolean to specify whether DEBUG level logs should be used.
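The restricted_users and restricted_groups options can be read as the following eligibility check. This is a minimal sketch under one notable assumption: when both lists are set, a job is considered eligible only if it passes both, which the documentation above does not state explicitly. The function is_eligible is a hypothetical name, and uid and gid stand for the effective values after any --uid or --gid option has been taken into account.

```python
from typing import Iterable, Optional


def is_eligible(uid: int, gid: int,
                restricted_users: Optional[Iterable[int]] = None,
                restricted_groups: Optional[Iterable[int]] = None) -> bool:
    """Tell whether the plugin should attempt a prediction for this job."""
    users = set(restricted_users or [])
    groups = set(restricted_groups or [])
    # An undefined or empty list means "no restriction" for that criterion.
    user_ok = not users or uid in users
    group_ok = not groups or gid in groups
    # Assumption: both criteria must pass when both lists are set.
    return user_ok and group_ok
```

With the template above (restricted_users and restricted_groups both set to [1000]), a job submitted with UID 1000 and GID 1000 passes, while a job with UID 1001 does not.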

Warning

force_login is currently true by default because the plugin does not yet handle automatic reconnection. If you set it to false, the plugin will work only until the first authentication token expires. After that, it will stop providing predictions because it will no longer be able to access OKA.
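Before deploying a configuration file, it can be handy to check it against the rules listed above. The sketch below is a standalone checker, not part of the plugin: load_config is a hypothetical helper, and treating user, password and oka_url as mandatory is an assumption. The defaults and value ranges are the ones given in this section.

```python
import json

# Defaults documented in this section; force_login defaults to true.
DEFAULTS = {
    "restricted_users": [],
    "restricted_groups": [],
    "dry_run": False,
    "debug": False,
    "force_login": True,
}


def load_config(text: str) -> dict:
    """Parse a predictit.conf-style JSON document and check documented ranges."""
    cfg = {**DEFAULTS, **json.loads(text)}
    for key in ("user", "password", "oka_url"):
        # Assumption: these three fields are mandatory.
        if not cfg.get(key):
            raise ValueError(f"missing field: {key}")
    for name, target in cfg.get("targets", {}).items():
        factor = target.setdefault("safety_factor", 1)  # documented default
        if not factor > 0:
            raise ValueError(f"{name}: safety_factor must be > 0")
        for predictor in target.get("predictors", []):
            threshold = predictor["confidence_level_threshold"]
            if not 0 <= threshold <= 1:
                raise ValueError(f"{name}: confidence_level_threshold must be in [0, 1]")
    return cfg
```

Calling load_config on the content of predictit.conf either returns the configuration with defaults filled in or raises a ValueError naming the offending field.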

Known Limitations

Login

  • Automatic re-login is not yet supported. Use force_login: true to avoid errors while using the plugin.

Memory

  • Currently, only jobs requesting per-node memory are handled for the MaxRSSBins target.