Integration with an AWS cluster
Efficiently running an HPC infrastructure is complex, and administrators often lack the proper tools to track and get insights into how users behave and how the cluster responds to the demand. This is even more complicated with Cloud clusters. Due to their transient and dynamic nature, information about instance types, location, and costs is an important asset to monitor, especially when you pay for what you use. The task of managing and presenting these metrics becomes increasingly difficult as the infrastructure grows or changes over time, which is a common situation in Cloud environments.
“Standard” metrics and information provided by the job scheduler might not be sufficient to efficiently manage Cloud clusters. For example, tracking the cost of running jobs becomes even more important in order to monitor and manage your budget, and to redistribute the costs to your users/departments. Due to the wide variety of compute instance types in the Cloud, it can also be useful to track which instance types the jobs have run on, in order to check their performance and associated costs, and further improve job placement and instance selection.
OKA offers many ways to monitor your jobs and deep-dive into how your clusters behave and how they are used by your end-users. Accessing information about the Cloud in OKA is straightforward once your environment is properly configured. In this article, we present a simple integration that can be made in a Slurm cluster on AWS to retrieve the type of instances jobs run on, their pricing information (on-demand/spot, per-hour price…), the AWS region, and virtually any information about the AWS environment you are using. The scripts provided below are given as examples, and can easily be adapted to retrieve more detailed information, or to work with other job schedulers (e.g., LSF, PBS…).
There are many ways to create a cluster on AWS; the details are out of scope for this article, but you can for example use AWS ParallelCluster or CCME (note: the solution presented here is extracted from CCME, where it is available out of the box).
The principle depicted here is very simple, and relies on two components:
A Slurm epilog script that gathers information about the AWS environment on which the job runs, and stores it as a list of semicolon-separated values in the Comment field of the job. The gathered information is:
instance type
instance id of the “main” job node
availability zone
region
instance price
cost type: ondemand or spot
tenancy: shared, reserved…
An OKA Data Enhancer that will parse the values of the Comment field, and store them as additional information with each job.

Note
The solution depicted here can also easily be adapted to other Cloud providers. For example, you could follow the indications given in the Azure integration with Slurm presented here, in the “Granular Cost Control” section.
Slurm epilog script
This Slurm epilog script retrieves information about the instance type and its pricing when the job ends, and stores it in the Comment field of the job in sacct. Any user-provided comment is kept, and the gathered information is appended at the end after a colon. The format of the Comment field is the following:
:PricingInfo=${instance type};${instance id};${availability zone};${region};${instance price};${cost type};${tenancy}
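For example, a job that ran on a spot c5n.18xlarge instance in eu-west-1 would end up with a Comment value similar to the following (the values are purely illustrative):
my user comment:PricingInfo=c5n.18xlarge;i-0cd3c13fa4599d4d5;eu-west-1b;eu-west-1;1.241400;spot;shared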
The following packages are required by the script and should be available on all the nodes of the cluster: curl, jq, and the AWS CLI.
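For example, on an Amazon Linux 2 based cluster, the missing tools could be installed as follows (package names may differ on other distributions):
sudo yum install -y curl jq awscli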
Also note that this solution requires Slurm to be configured to keep accounting information about the jobs. See the Slurm documentation to configure accounting manually, or, if you are using AWS ParallelCluster, you can follow this guide.
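For reference, accounting is typically stored through slurmdbd; a minimal, illustrative slurm.conf excerpt is shown below (the host name is an example, refer to the Slurm documentation for a complete setup):
# slurm.conf (excerpt) - accounting through slurmdbd
AccountingStorageType=accounting_storage/slurmdbd
AccountingStorageHost=head-node   # example host running slurmdbd
JobAcctGatherType=jobacct_gather/cgroup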
As the epilog script contacts AWS APIs to gather the information, it needs to run on instances whose IAM role has (at least) the following policy attached:
{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "VisualEditor0",
            "Effect": "Allow",
            "Action": [
                "ec2:DescribeSpotPriceHistory",
                "ec2:DescribeInstances",
                "ec2:DescribeRegions",
                "pricing:GetProducts"
            ],
            "Resource": "*"
        }
    ]
}
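As an illustration, if the policy above is saved as oka-pricing-policy.json, it can be attached as an inline policy to the instance role with a command such as the following (the role and policy names are examples):
aws iam put-role-policy \
  --role-name MyClusterInstanceRole \
  --policy-name oka-pricing-info \
  --policy-document file://oka-pricing-policy.json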
Script
#!/bin/bash
declare logFile="/var/log/ccme.slurmepilog.log"
touch "${logFile}"; chmod -v 600 "${logFile}"
exec > >(awk '{printf "[%s] %s\n", strftime("%FT%T%z"), $0; fflush()}' >>"${logFile}" || true)
exec 2>&1;
# Uncomment the following line for debug traces
#set -x
slurm_env="/opt/slurm/etc/slurm.sh"
if [[ -f "${slurm_env}" ]]; then
  # shellcheck source=/dev/null
  . "${slurm_env}"
fi
declare -A regions
regions["af-south-1"]="Africa (Cape Town)"
regions["ap-east-1"]="Asia Pacific (Hong Kong)"
regions["ap-east-1"]="Asia Pacific (Hong Kong)"
regions["ap-northeast-1"]="Asia Pacific (Tokyo)"
regions["ap-northeast-2"]="Asia Pacific (Seoul)"
regions["ap-south-1"]="Asia Pacific (Mumbai)"
regions["ap-south-2"]="Asia Pacific (Hyderabad)"
regions["ap-southeast-1"]="Asia Pacific (Singapore)"
regions["ap-southeast-2"]="Asia Pacific (Sydney)"
regions["ap-southeast-3"]="Asia Pacific (Jakarta)"
regions["ap-southeast-4"]="Asia Pacific (Melbourne)"
regions["ca-central-1"]="Canada (Central)"
regions["eu-central-1"]="EU (Frankfurt)"
regions["eu-central-2"]="Europe (Zurich)"
regions["eu-north-1"]="EU (Stockholm)"
regions["eu-south-1"]="Europe (Milan)"
regions["eu-south-2"]="Europe (Spain)"
regions["eu-west-1"]="EU (Ireland)"
regions["eu-west-2"]="EU (London)"
regions["eu-west-3"]="EU (Paris)"
regions["me-central-1"]="Middle East (UAE)"
regions["me-south-1"]="Middle East (Bahrain)"
regions["sa-east-1"]="South America (Sao Paulo)"
regions["us-east-1"]="US East (N. Virginia)"
regions["us-east-2"]="US East (Ohio)"
regions["us-west-1"]="US West (N. California)"
regions["us-west-2"]="US West (Oregon)"
# Gather Information about job environment on AWS
TOKEN=$(curl -X PUT "http://169.254.169.254/latest/api/token" -H "X-aws-ec2-metadata-token-ttl-seconds: 21600")
insttype=$(curl -s -H "X-aws-ec2-metadata-token: ${TOKEN}" -v http://169.254.169.254/latest/meta-data/instance-type)
instid=$(curl -s -H "X-aws-ec2-metadata-token: ${TOKEN}" -v http://169.254.169.254/latest/meta-data/instance-id)
az=$(curl -s -H "X-aws-ec2-metadata-token: ${TOKEN}" -v http://169.254.169.254/latest/meta-data/placement/availability-zone)
region=$(curl -s -H "X-aws-ec2-metadata-token: ${TOKEN}" -v http://169.254.169.254/latest/dynamic/instance-identity/document | jq -r .region || true)
# We try to detect if we use spot through AWS APIs
# https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/instance-purchasing-options.html#check-instance-lifecycle
# Use the following describe-instances command:
# aws ec2 describe-instances --instance-ids i-1234567890abcdef0
# - If the instance is running on a Dedicated Host, the output contains the following information: "Tenancy": "host"
# - If the instance is a Dedicated Instance, the output contains the following information: "Tenancy": "dedicated"
# - If the instance is a Spot Instance, the output contains the following information: "InstanceLifecycle": "spot"
# - If the instance is a Scheduled Instance, the output contains the following information: "InstanceLifecycle": "scheduled"
# - Otherwise, the output does not contain InstanceLifecycle.
lifecycle=$(aws --region="${region}" ec2 describe-instances --instance-ids "${instid}" | jq -r ".Reservations[0].Instances[0].InstanceLifecycle" || true)
tenancy=$(aws --region="${region}" ec2 describe-instances --instance-ids "${instid}" | jq -r ".Reservations[0].Instances[0].Placement.Tenancy" || true)
costtype="ondemand" # Default value
if [[ "${lifecycle}" == "spot" ]]; then
costtype="spot"
elif [[ "${lifecycle}" == "" ]]; then
costtype="ondemand"
fi
# Get instance price
if [[ "${costtype}" == "ondemand" ]]; then
if [[ "${tenancy}" == "default" ]]; then
tenancy="shared"
fi
filters=(
"Type=TERM_MATCH,Field=instanceType,Value=${insttype}"
"Type=TERM_MATCH,Field=location,Value=${regions[${region}]}"
"Type=TERM_MATCH,Field=operatingSystem,Value=Linux"
"Type=TERM_MATCH,Field=preInstalledSw,Value=NA"
"Type=TERM_MATCH,Field=capacitystatus,Value=Used"
"Type=TERM_MATCH,Field=tenancy,Value=${tenancy}"
)
# Warning: if tenancy==host (reserved), then the price will be $0.00
instprice=$(aws --region=us-east-1 pricing get-products --service-code AmazonEC2 --filter "${filters[@]}" | jq -rc '.PriceList[0]' | jq -r '[.terms.OnDemand[].priceDimensions[].pricePerUnit.USD][0]' || true)
elif [[ "${costtype}" == "spot" ]]; then
instprice=$(aws --region "${region}" ec2 describe-spot-price-history --availability-zone "${az}" --instance-types "${insttype}" --start-time "$(date '+%Y-%m-%dT%H:%M:%S')" --product-descriptions "Linux/UNIX" | jq -r '.SpotPriceHistory[0].SpotPrice' || true)
else
# Currently we do not manage other types of pricing
instprice=0
fi
comment=$(scontrol show job "${SLURM_JOBID}" | grep 'Comment=' | awk -F'Comment=' '{print $2}' || true)
comment+=":PricingInfo=${insttype};${instid};${az};${region};${instprice};${costtype};${tenancy}"
echo "Setting comment for job ${SLURM_JOBID}: ${comment}"
sacctmgr -i -Q modify job where JobID="${SLURM_JOBID}" set Comment="${comment}"
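If you need to troubleshoot the script, you can run it manually as root on a compute node against an already finished job and inspect its log file. The path below assumes the script has been copied as described in the Installation section, and the job id 123 is only an example:
sudo SLURM_JOBID=123 /shared_nfs/slurm/slurm-epilog.sh
sudo tail -n 20 /var/log/ccme.slurmepilog.log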
Installation
Copy the epilog script to a folder accessible on all nodes, e.g., /shared_nfs/slurm/slurm-epilog.sh, and give it execution rights: chmod +x /shared_nfs/slurm/slurm-epilog.sh
Edit /etc/slurm/slurm.conf (on all nodes), and set the Epilog option to /shared_nfs/slurm/slurm-epilog.sh
Reconfigure the Slurm daemons: scontrol reconfigure, or restart them: systemctl restart slurmd
Then submit a job. Once it has finished, check that the Comment field in the output of sacct contains the expected information: sacct --format "jobid,comment".
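For example, with a trivial test job (the job id and pricing values below are illustrative):
sbatch --wrap "sleep 60"
# once the job has completed:
sacct --format "jobid,comment%150"
#       JobID Comment
#         123 :PricingInfo=c5n.18xlarge;i-0cd3c13fa4599d4d5;eu-west-1b;eu-west-1;1.241400;spot;shared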
OKA Data Enhancer
A Data Enhancer needs to be created and configured in OKA in order to parse the additional data gathered by the Slurm epilog script. We propose here an example Data Enhancer that you can adapt to your needs (commented out in the code is the generation of “fake” data if you wish to test it first):
import logging

import numpy as np

logger = logging.getLogger("oka_main")


class EnhancerAWSSLURMFeature():
    VERSION = "1.0.0"

    def parse_comment(comment):
        # :pricinginfo=${instance type};${instance id};${availability zone};${region};${price};${spot|ondemand};{tenancy}
        # e.g., :pricinginfo=c5n.18xlarge;i-0cd3c13fa4599d4d5;eu-west-1b;eu-west-1;1.241400;spot;shared
        scsv = comment.split(':pricinginfo=')[1].split(';')
        insttype = scsv[0]
        instid = scsv[1]
        az = scsv[2]
        region = scsv[3]
        instprice = float(scsv[4]) if scsv[4].lower() not in ["na", "nan"] else 0.0
        costtype = scsv[5]
        tenancy = scsv[6]
        return insttype, instid, az, region, instprice, costtype, tenancy

    def run(self, data, **kwargs):
        try:
            # Uncomment the following lines for testing with fake data
            # from random import random, randint
            # azs = [["a", "b", "c"][randint(0, 2)] for y in range(len(data))]
            # prices = [random() for y in range(len(data))]
            # inst = [["c5n.18xlarge", "c5n.9xlarge", "c5n.4xlarge", "c5.4xlarge", "g4dn.8xlarge"][randint(0, 4)] for y in range(len(data))]
            # data.loc[:, "Comment"] = ["comment:pricinginfo={};id12;eu-west-1{};eu-west-1;{};spot;shared".format(inst[x], azs[x], prices[x]) for x in range(len(data))]

            # Fill missing values with a set of default values
            data["Comment"].fillna(":pricinginfo=NA;NA;NA;NA;0;ondemand;shared", inplace=True)
            (
                data.loc[:, "instance_type"],
                data.loc[:, "instance_id"],
                data.loc[:, "availability_zone"],
                data.loc[:, "region"],
                instprice,
                data.loc[:, "price_type"],
                data.loc[:, "tenancy"],
            ) = zip(*data["Comment"].apply(EnhancerAWSSLURMFeature.parse_comment))
            duration = (data["End"].astype(np.datetime64) - data["Start"].astype(np.datetime64)).astype('timedelta64[s]') / 3600.0
            data.loc[:, "Cost"] = instprice * duration * data["Allocated_Nodes"]
        except Exception as e:
            logger.error(f"Cost information not available: {e}")
Installation
Please refer to the Data enhancers section for explanations on how to install and configure this Data Enhancer in the ingestion pipeline.
Accessing AWS information in OKA
The information ingested through the Data Enhancer is then available in OKA in multiple plugins and through the filters. We present below a few examples of where this information can be accessed and used to analyze your workloads.


Filters allow you to select workloads based on the information gathered from AWS.

