Cluster management

The Cluster management page allows administrators to create, configure, upload data to, and delete clusters.

Accessing Cluster management:

Locate the sidebar menu on the left side of the screen
Click on the Management section (indicated by a gear icon at the bottom of the sidebar)
Select Clusters from the Management menu options

The page shows all clusters as a table. For each cluster, the table shows the last accounting ingestion date and the last monitoring ingestion date, along with per-source action buttons.

Available actions per row:

Edit — Open the cluster configuration form
Refresh — Trigger a manual data fetch from the job scheduler (or re-read from file)
Upload — Upload a data file directly from the browser
Delete data — Delete the ingested data for this source (accounting or monitoring)
Delete cluster — Permanently remove the cluster and all its data

At the top right of the page:

Refresh — Reload the cluster list
Create new cluster — Open the cluster creation form

Create or edit a cluster

The cluster configuration form is organized into five tabs: General, Hardware, Options, Accounting, and Monitoring.

Once all required fields are filled in, click Create Cluster (or Update Cluster when editing). The button is disabled until at least one field has been changed.

Note

Most fields can be modified after creation. The only exception is the Cluster unique identifier, which is set once and cannot be changed.

General tab

The General tab contains the cluster’s identity settings.

Cluster Name (required) — A human-readable name for the cluster. Stored in lowercase.
Cluster unique identifier (required) — A short string used to tag every data row in the database. Choose this carefully: it cannot be modified after the cluster is created.
Job Scheduler (required) — The workload manager running on the cluster. Supported values: slurm, Open Grid Engine, LSF, PBS, Torque.
Country where the cluster is located — Used for geographic metadata. Defaults to France.

Hardware tab

The Hardware tab records the cluster’s hardware capacity over time. Each entry represents a period during which the cluster had a given hardware configuration.

For each entry:

Date — When this configuration became active (format YYYY-MM-DD HH:MM:SS). The first entry is pre-filled with 1970-01-01 00:00:00 to cover all historical data.
Number of Cores — Total CPU cores in the cluster at that date.
Number of Nodes — Total compute nodes.
Number of GPUs — Total GPU accelerators (enter 0 if the cluster has no GPUs).

Use the Add row button to record a hardware change (e.g., after a cluster expansion). Any row except the first can be removed with the Delete row button.

Note

Hardware configuration values are used to compute occupancy and utilization metrics (for example in the Load module). Keeping the history accurate ensures correct percentages for past periods.
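To illustrate why the history matters, the sketch below picks the hardware entry that was active at a given date and derives a utilization percentage from it. The history values, the function names cores_at and utilization_pct, and the formula (used core-hours divided by available core-hours) are assumptions made for illustration; they are not OKA's internal implementation.

```bash
#!/bin/bash
# Hypothetical hardware history, one entry per line: "DATE TIME CORES", oldest first.
hardware_history='1970-01-01 00:00:00 1024
2023-06-01 00:00:00 2048'

# cores_at "YYYY-MM-DD HH:MM:SS": print the core count of the entry active at that date.
# Dates in this format sort lexicographically, so a string comparison is enough.
cores_at() {
  printf '%s\n' "${hardware_history}" | awk -v when="$1" '
    { entry = $1 " " $2; if (entry <= when) cores = $3 }
    END { print cores }
  '
}

# utilization_pct USED_CORE_HOURS CORES HOURS: used / available core-hours, in percent.
utilization_pct() {
  awk -v used="$1" -v cores="$2" -v hours="$3" \
    'BEGIN { printf "%.1f\n", 100 * used / (cores * hours) }'
}

# 12288 core-hours consumed over 24 hours, before and after the expansion.
utilization_pct 12288 "$(cores_at '2023-01-15 00:00:00')" 24   # 1024 cores, prints 50.0
utilization_pct 12288 "$(cores_at '2024-01-15 00:00:00')" 24   # 2048 cores, prints 25.0
```

The same workload yields 50% or 25% depending on which entry applies, which is why a missing or misdated row skews percentages for past periods.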

Options tab

The Options tab defines per-core-hour default values used as fallbacks when a job record does not carry its own Cost, Energy, or CO₂ measurement.

Cost / Core-hour — Default cost per core-hour.
Currency — Currency for cost values (e.g. EUR, USD, GBP…). Defaults to EUR.
Energy (Wh) / Core-hour — Default energy consumption per core-hour.
CO₂ (kgCO₂e) / Core-hour — Default carbon footprint per core-hour.

Note

These values are applied selectively: a fallback is used only when the job’s Cost, Energy, or CO₂ field is null and the corresponding option is set.

Setting any of these to 0 is valid and will be used as a fallback (returning 0), while leaving the field blank means no substitute will be computed.

Example — cluster options: Cost “not set”, Energy = 0, CO₂ = 0.05

Field	Job with value	Job without value
Cost	Existing value used	`null` returned (no fallback set); treated as no cost (equivalent to 0)
Energy	Existing value used	`0` Wh/core-hour computed as substitute
CO₂	Existing value used	`0.05` kgCO₂e/core-hour computed as substitute
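The fallback rule in the table can be sketched as a small shell function. This is only an illustration: the function name effective_value is hypothetical, empty strings stand in for null or unset values, and the sketch assumes the substitute is the per-core-hour default multiplied by the job's core-hours.

```bash
#!/bin/bash
# effective_value JOB_VALUE OPTION_RATE CORE_HOURS
# JOB_VALUE and OPTION_RATE are empty strings when null / not set.
effective_value() {
  local job_value="$1" option_rate="$2" core_hours="$3"
  if [ -n "${job_value}" ]; then
    # The job carries its own measurement: use it as is.
    echo "${job_value}"
  elif [ -n "${option_rate}" ]; then
    # Fallback (assumed): per-core-hour default multiplied by the job's core-hours.
    awk -v rate="${option_rate}" -v hours="${core_hours}" \
      'BEGIN { printf "%g\n", rate * hours }'
  else
    # No value on the job and no fallback configured.
    echo "null"
  fi
}

# Cluster options from the example: Cost not set, Energy = 0, CO2 = 0.05.
# A job without measurements that consumed 100 core-hours:
effective_value "" ""     100    # Cost   -> null
effective_value "" "0"    100    # Energy -> 0
effective_value "" "0.05" 100    # CO2    -> 5
# A job that recorded its own energy keeps it:
effective_value "42.5" "0" 100   # -> 42.5
```

Note how an option set to 0 still counts as set, while an empty option leaves the result null.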

Accounting tab

The Accounting tab configures how OKA ingests accounting logs (job history data) from the job scheduler.

Note

Accounting data drives most OKA modules: job counts, core-hours, cost, state distribution, throughput, and so on.

Job scheduler command type

Select the method OKA will use to retrieve accounting data:

Type	Description
`FILE`	OKA reads log files from a local directory or file path.
`LOCAL`	OKA executes the job scheduler command directly on the OKA host.
`FORWARDED_PWD`	OKA connects to a remote host via SSH (password authentication) and runs the scheduler command there.
`FORWARDED_KEYFILE`	OKA connects to a remote host via SSH (key-file authentication) and runs the scheduler command there.

Copy to Monitoring button

The Copy configuration to ‘Monitoring’ button copies the current accounting connection settings (command type, hostname, username, password / key path) to the Monitoring tab. Use this shortcut when both sources share the same connection.

—

Fields visible when command type is FILE

Local reading of file(s) path — Path to a local file or directory that OKA will parse. Supported glob patterns:
- /dir/path or /dir/path/* — All files at the root of the given directory.
- /dir/path/*.ext — All files with extension .ext at the root.
- /dir/path/**/* — All files recursively under the given directory.
- /dir/path/**/*.ext — All files with extension .ext recursively.

Note

When the path contains a *, OKA will not verify that the path exists at save time.
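Because a path containing * is not checked at save time, it can help to preview which files a pattern would select. The sketch below approximates the four supported patterns with find. It assumes GNU or BSD find (for -maxdepth) and uses a throwaway directory with made-up file names; OKA's own matching may differ in edge cases such as hidden files.

```bash
#!/bin/bash
# Build a throwaway directory tree with hypothetical log files.
demo_dir=$(mktemp -d)
mkdir -p "${demo_dir}/2024"
touch "${demo_dir}/jan.jobs" "${demo_dir}/notes.txt" "${demo_dir}/2024/feb.jobs"

# root_files DIR [NAME_PATTERN]: like DIR/* or DIR/*.ext (files at the root only).
root_files() {
  find "$1" -maxdepth 1 -type f -name "${2:-*}" | sort
}

# all_files DIR [NAME_PATTERN]: like DIR/**/* or DIR/**/*.ext (recursive).
all_files() {
  find "$1" -type f -name "${2:-*}" | sort
}

root_files "${demo_dir}"            # /dir/path/*        -> jan.jobs, notes.txt
root_files "${demo_dir}" '*.jobs'   # /dir/path/*.ext    -> jan.jobs
all_files  "${demo_dir}"            # /dir/path/**/*     -> all three files
all_files  "${demo_dir}" '*.jobs'   # /dir/path/**/*.ext -> jan.jobs, 2024/feb.jobs

rm -rf "${demo_dir}"
```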

—

Fields visible when command type is FORWARDED_PWD or FORWARDED_KEYFILE

Hostname — The hostname or IP address of the machine running the job scheduler.
Username — The account to use for the SSH connection.
Password (FORWARDED_PWD only) — SSH password for the above account.
Key path (FORWARDED_KEYFILE only) — Path to the SSH private key file.

Warning

When using an SSH key, make sure it does not require a passphrase, otherwise OKA will be unable to authenticate.
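One way to check a key before entering its path is to ask ssh-keygen to load it with an empty passphrase. The sketch below assumes OpenSSH's ssh-keygen is installed; the function name key_has_no_passphrase and the key path are hypothetical.

```bash
#!/bin/bash
# key_has_no_passphrase KEYFILE: succeed if the private key loads with an empty passphrase.
key_has_no_passphrase() {
  # -y derives the public key from the private key; -P "" supplies an empty passphrase,
  # so the command fails instead of prompting when the key is protected.
  ssh-keygen -y -P "" -f "$1" > /dev/null 2>&1
}

# Hypothetical key path; replace with the value entered in the Key path field.
key_file="${HOME}/.ssh/oka_key"
if key_has_no_passphrase "${key_file}"; then
  echo "${key_file}: usable without a passphrase"
else
  echo "${key_file}: missing, unreadable, or protected by a passphrase"
fi
```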

—

Field visible when command type is LOCAL, FORWARDED_PWD, or FORWARDED_KEYFILE

Job scheduler timezone — Timezone used by the job scheduler when recording timestamps. Set this if the scheduler host is not in UTC; OKA uses it to avoid timezone mismatches during ingestion.

—

Accounting start date

Most recent date in database — Displays the most recent job date already stored for this cluster. OKA will use this as the starting point for the next incremental retrieval.
Use custom date as log start date — Check this box to override the start date for the next retrieval. When checked, an additional date picker appears:
- Accounting available since — OKA will retrieve (or keep) accounting logs starting from this date. Useful for backfilling historical data or limiting the ingestion window.

—

Advanced configuration (job-scheduler-specific)

These fields appear automatically based on the selected job scheduler:

GPU complex (Open Grid Engine only) — Name of the Grid Engine complex that represents a GPU resource (default: gpu). OKA uses this to extract GPU allocation data from OGE logs.
License RegExp (PBS only) — A Python regular expression to extract license names from the Resource_List field of PBS accounting records. The expression must contain a named capturing group license_name.

Example: lic(?P<license_name>.+)

Important

The named group must be called exactly license_name. See the Python re documentation for syntax reference.
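Since the expression is a Python regular expression, a quick way to try it from a shell is to hand it to Python's re module. The sketch below assumes python3 is on the PATH; the helper name license_name_from and the sample resource strings are hypothetical, and how OKA applies the expression to the Resource_List field is not shown here.

```bash
#!/bin/bash
# license_name_from REGEX TEXT: print the license_name group captured by REGEX in TEXT,
# or nothing when the expression does not match. Requires python3 on the PATH.
license_name_from() {
  python3 -c '
import re, sys
match = re.search(sys.argv[1], sys.argv[2])
if match:
    print(match.group("license_name"))
' "$1" "$2"
}

# Hypothetical resource names; real Resource_List contents depend on your PBS setup.
license_name_from 'lic(?P<license_name>.+)' 'licabaqus'   # prints: abaqus
license_name_from 'lic(?P<license_name>.+)' 'ncpus'       # prints nothing
```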

—

Cron configuration

Defines when the scheduled ingestion task runs. All fields accept multiple values. Leaving a field at its default means “every” (* in cron notation).

Field	Description
Minute(s)	Minutes within the hour. Quick presets: every 5 / 10 / 15 / 30 minutes.
Hour(s)	Hours (0–23).
Day(s) of the Month	Days (1–31).
Day(s) of the Week	Days (Sunday = 0, Saturday = 6).
Month(s) of the Year	Months (January–December).
Cron Timezone	Timezone in which the cron schedule is interpreted. Defaults to `UTC`.

Enable — Check this box to activate the scheduled ingestion. Uncheck to pause it without losing the schedule configuration.

Note

Even with scheduled ingestion disabled, you can always trigger a manual ingestion using the Refresh button on the cluster list page.
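For readers used to crontab syntax, the sketch below shows how the schedule fields correspond to a standard five-field cron expression (minute, hour, day of month, month, day of week). The helper cron_expression is hypothetical, and how OKA stores the schedule internally is not specified here.

```bash
#!/bin/bash
# cron_expression MINUTE HOUR DAY_OF_MONTH MONTH DAY_OF_WEEK
# Empty or missing arguments fall back to "*" ("every"), like the form defaults.
cron_expression() {
  echo "${1:-*} ${2:-*} ${3:-*} ${4:-*} ${5:-*}"
}

cron_expression '*/15'                   # every 15 minutes        -> */15 * * * *
cron_expression 0 2                      # daily at 02:00          -> 0 2 * * *
cron_expression 30 6 '' '' '1,2,3,4,5'   # 06:30, Monday to Friday -> 30 6 * * 1,2,3,4,5
```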

—

Data enhancers

Select one or more published Data Enhancers to run after each ingestion. Enhancers are applied in the order they appear in the selection — reorder them accordingly.

The test sandbox (bug icon next to the header, only visible when editing an existing cluster) lets you run the selected enhancers against a sample of real data before saving.

—

Elasticsearch configuration

Number of shards — Number of Elasticsearch shards for the accounting index. Default: 4. Increasing this value can speed up searches on very large datasets, but requires more Elasticsearch resources.

Note

Changing this value after initial data ingestion requires re-indexing. After saving, reload the page to see the value actually applied to the Elasticsearch index.

Monitoring tab

The Monitoring tab configures ingestion of node-level monitoring data (real-time resource usage snapshots produced by the job scheduler). Its structure mirrors the Accounting tab, with the following differences:

No Accounting start date — Monitoring data is always ingested incrementally from the last known record; there is no custom start date option.
No Job scheduler timezone — Monitoring data timestamps are handled differently and do not require a separate timezone override.
No Advanced configuration — GPU complex and License RegExp do not apply to monitoring data.
Default Elasticsearch shards: 1 — Monitoring indices are typically smaller than accounting indices, so a single shard is the default.

All other fields (command type, connection settings, file path, cron schedule, data enhancers, Elasticsearch shards) work identically to the Accounting tab.

Note

Not all job schedulers produce monitoring data. If your scheduler does not support it, you can leave this tab unconfigured.

Upload data

Note

Parsing and ingestion are always handled by an asynchronous background task.

UI

The Upload button on the cluster list opens a file upload dialog. Select a file and click Upload Data.

Accounting data — Accepted formats:

Single log file: .txt, .jobs, .csv
Compressed archive: .zip, .tar.gz, .gz

Monitoring data — Accepted formats:

Single log file: .txt, .nodes
Compressed archive: .zip, .tar.gz, .gz

For monitoring files, the filename must follow the convention: hostname_<date>[.txt|.nodes] (e.g. hostname_2025-06-11_08:30:02.txt). To upload multiple files at once, compress them into a single .zip or .tar.gz archive.
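The naming convention can be checked before upload with a short script. This is a sketch only: the date layout (YYYY-MM-DD_HH:MM:SS) is inferred from the example above, and the function name valid_monitoring_name and the file names are hypothetical.

```bash
#!/bin/bash
# valid_monitoring_name FILE: succeed if the basename looks like
# hostname_YYYY-MM-DD_HH:MM:SS.txt or .nodes (date layout inferred from the example).
valid_monitoring_name() {
  basename "$1" | grep -Eq '^.+_[0-9]{4}-[0-9]{2}-[0-9]{2}_[0-9]{2}:[0-9]{2}:[0-9]{2}\.(txt|nodes)$'
}

valid_monitoring_name "node01_2025-06-11_08:30:02.txt" && echo "accepted"
valid_monitoring_name "node01-snapshot.txt" || echo "rejected"

# Bundle several snapshots into a single archive for upload.
work_dir=$(mktemp -d)
touch "${work_dir}/node01_2025-06-11_08:30:02.txt" "${work_dir}/node01_2025-06-11_08:35:02.txt"
tar -czf "${work_dir}/monitoring.tar.gz" -C "${work_dir}" \
  "node01_2025-06-11_08:30:02.txt" "node01_2025-06-11_08:35:02.txt"
tar -tzf "${work_dir}/monitoring.tar.gz"   # list the archive contents
rm -rf "${work_dir}"
```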

Warning

Files uploaded via the UI must not exceed 1 GB. For larger datasets, use the scheduled ingestion path or the file-based ingestion directory instead. See File upload limitation for details.

The Refresh button triggers a manual ingestion without uploading a new file:

For FILE ingestion — Re-reads the configured file or directory path.
For LOCAL / FORWARDED_* ingestion — Executes the scheduler command to fetch the latest records.

Scheduled ingestion

When the Enable checkbox in the cron configuration is active, OKA automatically runs ingestion according to the configured schedule.

FILE — OKA parses all files available at the configured path. Files are not deleted after ingestion; already-ingested files are skipped automatically.
LOCAL / FORWARDED — OKA executes the scheduler command to retrieve records from the last known submission date onward. If no data exists yet, it attempts to retrieve the full history.

Warning

If you have more than a year of history to retrieve, start with file-based ingestion (month by month) to avoid memory and time issues, then switch to scheduled command execution once the database is up to date. See Retrieve job scheduler data for data extraction scripts. Contact UCit Support if you need assistance.
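A month-by-month export can be driven by a small loop. The sketch below generates inclusive monthly ranges and prints, as a dry run, one extraction command per month for a Slurm cluster. It assumes GNU date; the helper month_ranges and the backfill_ output prefix are hypothetical, while the --data, --start, --end and --output options are those of extractSlurmData.sh.

```bash
#!/bin/bash
# month_ranges FIRST_MONTH LAST_MONTH (both YYYY-MM): print one "START END" pair per
# calendar month, with inclusive bounds in YYYY-MM-DDThh:mm:ss format. Requires GNU date.
month_ranges() {
  local current="$1-01" last_epoch next next_epoch end
  last_epoch=$(TZ=UTC date -d "$2-01" +%s)
  while [ "$(TZ=UTC date -d "${current}" +%s)" -le "${last_epoch}" ]; do
    next=$(TZ=UTC date -d "${current} + 1 month" +%Y-%m-%d)
    next_epoch=$(TZ=UTC date -d "${next}" +%s)
    # End one second before the next month starts, so consecutive ranges never overlap.
    end=$(TZ=UTC date -d "@$((next_epoch - 1))" +%Y-%m-%dT%H:%M:%S)
    echo "${current}T00:00:00 ${end}"
    current="${next}"
  done
}

# Dry run: print one extraction command per month instead of executing it.
month_ranges 2024-01 2024-03 | while read -r start end; do
  echo ./extractSlurmData.sh --data accounting --start "${start}" --end "${end}" \
    --output "backfill_${start%%T*}"
done
```

Removing the leading echo in the loop runs the extractions; the resulting .jobs files can then be copied to the configured ingestion directory one month at a time.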

Delete data or a cluster

You can delete the data associated with a cluster (accounting or monitoring independently) or permanently remove the entire cluster using the action buttons in the cluster list.

A confirmation dialog is shown before any destructive action.

Retrieve job scheduler data

If the cluster is configured with FILE ingestion, you need to export the logs from your job scheduler and copy them to the configured ingestion directory.

Accounting

GE/SGE/OGE — Logs are stored in ${SGE_ROOT}/${SGE_CELL}/common/accounting (e.g. /usr/share/gridengine/default/common/accounting). You can use this file directly or use the extraction script below:

extractSGEData.sh

#!/bin/bash
################################################################################
# Copyright (c) 2017-2026 UCit SAS
# All Rights Reserved
#
# This software is the confidential and proprietary information
# of UCit SAS ("Confidential Information").
# You shall not disclose such Confidential Information
# and shall use it only in accordance with the terms of
# the license agreement you entered into with UCit.
################################################################################
START="2013-01-01T00:00:00"
CLEAN="yes"
if [[ -z "${SGE_ROOT}" ]] || [[ -z "${SGE_CELL}" ]]; then
  GEACCTFILE="/usr/share/gridengine/default/common/accounting"
else
  GEACCTFILE="${SGE_ROOT}/${SGE_CELL}/common/accounting"
fi

CURDATE=$(date "+%Y-%m-%dT%H:%M:%S")
FILENAME="$(hostname)_${CURDATE}"
TARBALL=0

set -a

function unknownName() {
  local _prefix="$1"
  local _name="$2"

  # is name in dict?
  local _dictname
  _dictname=$(grep -e "^${_name}=${_prefix}.*" "${tmpdictfile}" | awk -F '=' '{print $2}')

  if [[ -z "${_dictname}" ]]; then
    namecount=$(grep -e "${_prefix}=.*" "${tmpcntfile}" | awk -F '=' '{print $2}')
    namecount=$((namecount+1))
    safeid="${_prefix}_${namecount}"

    echo "${_name}=${safeid}" >> "${tmpdictfile}"
    echo "${_prefix}=${namecount}" > "${tmpcntfile}"
  else
    safeid="${_dictname}"
  fi
}

function getuid() {
  local _begline="$1"
  local _name="$2"
  local _endline="$3"
  safeid=$(id -u "${_name}" 2>/dev/null)
  rv=$?
  if [[ ${rv} -ne 0 ]] || [[ -z "${safeid}" ]]; then
    unknownName "unknownuser" "${_name}"
  fi
  echo "${_begline}${safeid}${_endline}"
}

function getgid() {
  local _begline="$1"
  local _name="$2"
  local _endline="$3"
  safeid=$(getent group "${_name}"  2>/dev/null | cut -d: -f3)
  rv=$?
  if [[ ${rv} -ne 0 ]] || [[ -z "${safeid}" ]]; then
    unknownName "unknowngroup" "${_name}"
  fi
  echo "${_begline}${safeid}${_endline}"
}



while [[ $# -ge 1 ]]; do
    key="$1"

    case ${key} in
        -s|--start)
            START="$2"
            shift # past argument
            ;;
        -a|--all)
            CLEAN="no"
            ;;
        -c|--clean)
            CLEAN="yes"
            ;;
        -f|--file)
            GEACCTFILE="$2"
            shift
            ;;
        -o|--output)
            FILENAME="${2}"
            shift # past argument
            ;;
        -t|--tgz)
            TARBALL=1
            ;;
        -h|--help)
            echo "$0 [-s|--start DATE_TIME] [-a|--all] [-c|--clean] [-f|--file ACCOUNTINGFILE] [-o|--output FILENAME] [-t|--tgz] [-h|--help]"
            echo "Extract historical data from GE, and create the following files:"
            echo "- ${FILENAME}.jobs: historical data on jobs"
            echo "- ${FILENAME}.nodes: current nodes description"
            echo "- ${FILENAME}.partitions: current partitions description"
            echo
            echo "-s|--start: start date and time (format: YYYY-MM-DDThh:mm:ss, default is ${START})"
            echo "-a|--all: print all fields"
            echo "-c|--clean: do not print usernames and group names (this is the default)"
            echo "-f|--file: path to GE accounting file (default is ${GEACCTFILE})"
            echo "-o|--output: output filename (extensions will be added: .jobs, .nodes, .partitions)"
            echo "-t|--tgz: create a tarball and print its path to stdout"
            echo "-h|--help: print this help"
            echo
            echo "\${SGE_ROOT} and \${SGE_CELL} must be set"
            exit 0
            ;;
        *)
            echo "Unknown option: ${key}"
            exit 1
            ;;
    esac
    shift # past argument or value
done

# TODO handle start and end dates
# nstart=$(echo "${START}" | awk -F'T' '{print $1}')
# dstart=$(date --date="${nstart}" +%s)


if [[ ! -f "${GEACCTFILE}" ]]; then
    echo "${GEACCTFILE} accounting file does not exist."
    echo "Please use the -f parameter to specify the correct accounting file."
    exit 1
fi

cp "${GEACCTFILE}" "${FILENAME}.jobs"


# Replace usernames by uid, and groupnames by gid
if [[ "${CLEAN}" == "yes" ]]; then
  tmpcntfile=$(mktemp)
  tmpdictfile=$(mktemp)

  sed -i.bak -r -e 's#([^:]*:[^:]*:)([^:]+)(.*)# getuid "\1" "\2" "\3"#e' -e 's#([^:]*:[^:]*:[^:]*:)([^:]+)(.*)# getgid "\1" "\2" "\3"#e' "${FILENAME}.jobs"
  rm -f "${FILENAME}.jobs.bak"
  rm -f "${tmpcntfile}"
  rm -f "${tmpdictfile}"
fi
if [[ "${TARBALL}" -ne 1 ]]; then
  echo "${FILENAME}.jobs created"
fi


# Gather nodes
for i in $(qconf -sel); do
  qconf -se "${i}" >> "${FILENAME}.nodes"
done
if [[ "${TARBALL}" -ne 1 ]]; then
  echo "${FILENAME}.nodes created"
fi

# # Gather configuration
# qconf -sconf > "${FILENAME}.conf"

# Gather list of queues
for i in $(qconf -sql); do
  qconf -sq "${i}" >> "${FILENAME}.partitions"
done

# Gather list of parallel environments (not exactly queues, but necessary to run // jobs)
cat >> "${FILENAME}.partitions" <<EOF
## Parallel environments
EOF
for i in $(qconf -spl); do
  qconf -sp "${i}" >> "${FILENAME}.partitions"
done
if [[ "${TARBALL}" -ne 1 ]]; then
  echo "${FILENAME}.partitions created"
fi


if [[ "${TARBALL}" -eq 1 ]]; then
  tgzfiles=("${FILENAME}.jobs" "${FILENAME}.nodes" "${FILENAME}.partitions")
  tar --force-local -zcf "${FILENAME}.tgz" "${tgzfiles[@]}"
  echo "${FILENAME}.tgz"
  rm -f "${tgzfiles[@]}"
fi

LSF — Logs are stored in multiple files under ${LSB_SHAREDIR}/cluster_name/logdir.

PBS / OpenPBS / Torque — Logs are in an accounting directory, typically:

Torque: /var/spool/torque/server_priv/accounting/
(Open)PBS: /var/spool/pbs/server_priv/accounting/

You can use these files directly or use the extraction script below:

extractPBSTorqueData.sh

#!/bin/bash
################################################################################
# Copyright (c) 2017-2026 UCit SAS
# All Rights Reserved
#
# This software is the confidential and proprietary information
# of UCit SAS ("Confidential Information").
# You shall not disclose such Confidential Information
# and shall use it only in accordance with the terms of
# the license agreement you entered into with UCit.
################################################################################

START="2010-01-01T00:00:00"
CLEAN="yes"
TORQUEDIR="/var/spool/torque/server_priv/accounting/"
PBSDIR="/var/spool/pbs/server_priv/accounting/"
ACCOUNTINGDIR=""

CURDATE=$(date "+%Y-%m-%dT%H:%M:%S")
FILENAME="$(hostname)_${CURDATE}"
TARBALL=0

set -a

function unknownName() {
  local _prefix="$1"
  local _name="$2"

  # is name in dict?
  local _dictname
  _dictname=$(grep -e "^${_name}=${_prefix}.*" "${tmpdictfile}" | awk -F '=' '{print $2}')

  if [[ -z "${_dictname}" ]]; then
    namecount=$(grep -e "${_prefix}=.*" "${tmpcntfile}" | awk -F '=' '{print $2}')
    namecount=$((namecount+1))
    safeid="${_prefix}_${namecount}"

    echo "${_name}=${safeid}" >> "${tmpdictfile}"
    echo "${_prefix}=${namecount}" > "${tmpcntfile}"
  else
    safeid="${_dictname}"
  fi
}

function getuid() {
  local _begline="$1"
  local _name="$2"
  local _endline="$3"
  safeid=$(id -u "${_name}" 2>/dev/null)
  rv=$?
  if [[ ${rv} -ne 0 ]] || [[ -z "${safeid}" ]]; then
    unknownName "unknownuser" "${_name}"
  fi
  echo "${_begline}${safeid}${_endline}"
}

function getgid() {
  local _begline="$1"
  local _name="$2"
  local _endline="$3"
  safeid=$(getent group "${_name}"  2>/dev/null | cut -d: -f3)
  rv=$?
  if [[ ${rv} -ne 0 ]] || [[ -z "${safeid}" ]]; then
    unknownName "unknowngroup" "${_name}"
  fi
  echo "${_begline}${safeid}${_endline}"
}



while [[ $# -ge 1 ]]; do
    key="$1"

    case ${key} in
        -s|--start)
            START="$2"
            shift # past argument
            ;;
        -a|--all)
            CLEAN="no"
            ;;
        -c|--clean)
            CLEAN="yes"
            ;;
        -d|--directory)
            ACCOUNTINGDIR="$2"
            shift
            ;;
        -o|--output)
            FILENAME="${2}"
            shift # past argument
            ;;
        -t|--tgz)
            TARBALL=1
            ;;
        -h|--help)
            echo "$0 [-s|--start DATE_TIME] [-a|--all] [-c|--clean] [-d|--directory DIR] [-o|--output FILENAME] [-t|--tgz] [-h|--help]"
            echo "Extract historical data from Torque/PBS, and create the following files:"
            echo "- ${FILENAME}.jobs: historical data on jobs"
            echo "- ${FILENAME}.nodes: current nodes description"
            echo "- ${FILENAME}.partitions: current partitions description"
            echo
            echo "-s|--start: start date and time (format: YYYY-MM-DDThh:mm:ss, default is ${START})"
            echo "-a|--all: print all fields"
            echo "-c|--clean: do not print usernames and group names (this is the default)"
            echo "-d|--directory: path to Torque/PBS accounting directory (default is ${TORQUEDIR} for Torque, and ${PBSDIR} for PBS)"
            echo "-o|--output: output filename (extensions will be added: .jobs, .nodes, .partitions)"
            echo "-t|--tgz: create a tarball and print its path to stdout"
            echo "-h|--help: print this help"
            exit 0
            ;;
        *)
            echo "Unknown option: ${key}"
            exit 1
            ;;
    esac
    shift # past argument or value
done


nstart=$(echo "${START}" | awk -F'T' '{print $1}')
dstart=$(date --date="${nstart}" +%s)

version=$(pbsnodes --version 2>&1 |head -n 1 | grep pbs_version)
if [[ -n "${version// /}" ]]; then
  # PBS
  ISPBS="True"
else
  ISPBS="False"
fi

if [[ -z "${ACCOUNTINGDIR}" ]]; then
    if [[ "${ISPBS}" == "True" ]]; then
      ACCOUNTINGDIR="${PBSDIR}"
    else
      ACCOUNTINGDIR="${TORQUEDIR}"
    fi
fi

if [[ ! -d "${ACCOUNTINGDIR}" ]]; then
    echo "${ACCOUNTINGDIR} directory does not exist. Please use the -d parameter to specify the correct accounting directory."
    exit 1
fi

echo > "${FILENAME}.jobs"
for f in "${ACCOUNTINGDIR}"/[0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9];do
    f=$(basename "${f}")
    df=$(date --date="${f}" +%s)

    if [[ "${df}" -lt "${dstart}" ]]; then
        continue
    fi

    grep -E "^[0-9]{2}/[0-9]{2}/[0-9]{4}\s[0-9]{2}:[0-9]{2}:[0-9]{2};[EDA];[^;]+;.*" "${ACCOUNTINGDIR}/${f}" >> "${FILENAME}.jobs"
done


# Replace usernames by uid, and groupnames by gid
if [[ "${CLEAN}" == "yes" ]]; then
  tmpcntfile=$(mktemp)
  tmpdictfile=$(mktemp)

  sed -i.bak -r -e 's#(.*user=)([[:alnum:]_-]*)(.*)# getuid "\1" "\2" "\3"#e' -e 's#(.*group=)([[:alnum:]_-]*)(.*)# getgid "\1" "\2" "\3"#e' -e 's#(.*owner=)([[:alnum:]_-]*)(.*)# getuid "\1" "\2" "\3"#e' -e 's#(.*requestor=)([[:alnum:]_-]*)(@.+)# getuid "\1" "\2" "\3"#e' "${FILENAME}.jobs"
  rm -f "${FILENAME}.jobs.bak"
  rm -f "${tmpcntfile}"
  rm -f "${tmpdictfile}"
fi


if [[ "${TARBALL}" -ne 1 ]]; then
  echo "${FILENAME}.jobs created"
fi

if [[ "${ISPBS}" == "True" ]]; then
    # PBS
    pbsnodes -a -F json > "${FILENAME}.nodes"
else
    # Torque
    pbsnodes -ax > "${FILENAME}.nodes"
fi

if [[ "${TARBALL}" -ne 1 ]]; then
  echo "${FILENAME}.nodes created"
fi

qstat -f -Q > "${FILENAME}.partitions"
if [[ "${TARBALL}" -ne 1 ]]; then
  echo "${FILENAME}.partitions created"
fi

if [[ "${TARBALL}" -eq 1 ]]; then
  tgzfiles=("${FILENAME}.jobs" "${FILENAME}.nodes" "${FILENAME}.partitions")
  tar --force-local -zcf "${FILENAME}.tgz" "${tgzfiles[@]}"
  echo "${FILENAME}.tgz"
  rm -f "${tgzfiles[@]}"
fi

SLURM — Logs are accessible via sacct. Use the script below with --data accounting:

extractSlurmData.sh

#!/bin/bash
################################################################################
# Copyright (c) 2017-2026 UCit SAS
# All Rights Reserved
#
# This software is the confidential and proprietary information
# of UCit SAS ("Confidential Information").
# You shall not disclose such Confidential Information
# and shall use it only in accordance with the terms of
# the license agreement you entered into with UCit.
################################################################################

START="2010-01-01T00:00:00"

FORMAT=""

CURDATE=$(date "+%Y-%m-%dT%H:%M:%S")
FILENAME="$(hostname)_${CURDATE}"
TARBALL=0
EXTRACT_MODE="all"  # Options: all, acct (accounting: .jobs), mon (monitoring: .nodes); .partitions is written only in "all" mode

# Setup basic access to slurm bin sacct and scontrol
DEFAULT_SLURM_PATH="/opt/slurm/bin"

if ! SCONTROL=$(command -v scontrol 2>/dev/null); then
    SCONTROL="${DEFAULT_SLURM_PATH}/scontrol"
fi

if ! SACCT=$(command -v sacct 2>/dev/null); then
    SACCT="${DEFAULT_SLURM_PATH}/sacct"
fi

while [[ $# -ge 1 ]]; do
    key="$1"

    case ${key} in
        -s|--start)
            START="$2"
            shift # past argument
            ;;
        -e|--end)
            END="$2"
            shift # past argument
            ;;
        -b|--batch)
            BATCHDURATION="$2"
            shift # past argument
            ;;
        -a|--all)
            FORMAT="ALL"
            ;;
        -o|--output)
            FILENAME="${2}"
            shift # past argument
            ;;
        -p|--path)
            # Force specific path to look for slurm bin sacct and scontrol
            SCONTROL="${2}/scontrol"
            SACCT="${2}/sacct"
            shift # past argument
            ;;
        -t|--tgz)
            TARBALL=1
            ;;
        -d|--data)
            case "$2" in
                all)
                    EXTRACT_MODE="all"
                    ;;
                acct|accounting)
                    EXTRACT_MODE="acct"
                    ;;
                mon|monitoring)
                    EXTRACT_MODE="mon"
                    ;;
                *)
                    echo "Unknown data mode: $2. Valid options: all, acct (accounting), mon (monitoring)"
                    exit 1
                    ;;
            esac
            shift # past argument
            ;;
        -h|--help)
            echo "$0 [-s|--start DATE_TIME] [-e|--end DATE_TIME] [-b|--batch DAYS] [-a|--all] [-d|--data MODE] [-o|--output FILENAME] [-t|--tgz] [-p|--path PATH] [-h|--help]"
            echo "Extract historical data from SLURM, and create the following files:"
            echo "- ${FILENAME}.jobs: historical data on jobs (accounting)"
            echo "- ${FILENAME}.nodes: current nodes description (monitoring)"
            echo "- ${FILENAME}.partitions: current partitions description (monitoring)"
            echo
            echo "-s|--start: start date and time (format: YYYY-MM-DDThh:mm:ss, default is ${START})"
            echo "-e|--end: end date and time (format: YYYY-MM-DDThh:mm:ss)"
            echo "-b|--batch: batch size in days. Split the whole period into consecutive smaller"
            echo "  sacct requests instead of one large request."
            echo "-a|--all: print all fields"
            echo "-d|--data: data extraction mode (default: all)"
            echo "  all: extract all data (.jobs, .nodes, .partitions)"
            echo "  acct|accounting: extract accounting data only (.jobs)"
            echo "  mon|monitoring: extract monitoring data only (.nodes)"
            echo "-o|--output: output filename (extensions will be added: .jobs, .nodes, .partitions)"
            echo "-t|--tgz: create a tarball and print its path to stdout"
            echo "-p|--path: path to slurm bin folder where scontrol and sacct can be found"
            echo "-h|--help: print this help"
            exit 0
            ;;
        *)
            echo "Unknown option: ${key}"
            exit 1
            ;;
    esac
    shift # past argument or value
done

# Slurm version
SLURM_VERSION=$("${SCONTROL}" --version| awk '{print $2}')
SV_MAJOR=$(echo "${SLURM_VERSION}" | cut -d '.' -f 1)

# Check if format=ALL requested
if [[ "${FORMAT}" != "ALL" ]]; then
    FIELDS=$("${SACCT}" --helpformat)
    # Then check based on available entry on slurm what we can actually retrieve.
    declare -a FIELDS_ARRAY
    while IFS=' ' read -r -a array;
    do
        FIELDS_ARRAY+=("${array[@]}");
    done < <(echo "${FIELDS}")

    declare -a UNWANTED_FIELDS=("User" "Group")

    for element in "${FIELDS_ARRAY[@]}"
    do
        if [[ ! ${UNWANTED_FIELDS[*]} =~ (^|[[:space:]])"${element}"($|[[:space:]]) ]]; then
            FORMAT+="${element},"
        fi
    done
fi

# Set sacct options depending on the version of SLURM
declare -a SACCTOPT
SACCTOPT=("--duplicates" "--allusers" "--parsable2" "--format" "${FORMAT}")

if [[ ${SV_MAJOR} -gt 14 ]]; then
    # supported since version 15
    SACCTOPT+=("--delimiter=@|@")
fi

# Set END if not defined
if [[ -z "${END}" ]]; then
    END=$(date +"%Y-%m-%dT%H:%M:%S")
fi

# Array to track generated files for tarball
declare -a tgzfiles=()

# Extract accounting data (.jobs) if mode is 'all' or 'acct'
if [[ "${EXTRACT_MODE}" == "all" || "${EXTRACT_MODE}" == "acct" ]]; then
    #sacct by batch or just once
    if [[ -n "${BATCHDURATION}" && "${BATCHDURATION}" -gt 0 ]]; then
        ## Compute start & end date lists
        # convert to seconds:
        START_SEC=$(date +%s --date "${START}")
        END_SEC=$(date +%s --date "${END}")
        BATCHDURATION_SEC=$((BATCHDURATION*(3600*24)))

        # create date lists
        startlist=()
        endlist=()
        # Slurm sacct works with inclusive limits [start, end]
        # https://rc.byu.edu/wiki/?id=Using+sacct
        # The algorithm does: [start, (start2-1second)], [start2, (start3-1second)], [start3, end]
        # Tested on Slurm 20.
        CUR_DATE_SEC=${START_SEC}
        if (( BATCHDURATION_SEC < (END_SEC-START_SEC) )); then  # Batchduration is smaller than the whole duration
            while ((CUR_DATE_SEC < (END_SEC-BATCHDURATION_SEC) ));do
                CUR_DATE_SEC_FORM=$(date -d @"${CUR_DATE_SEC}" +"%Y-%m-%dT%H:%M:%S")
                startlist+=("${CUR_DATE_SEC_FORM}")
                # get end date:
                ((CUR_DATE_SEC+=BATCHDURATION_SEC))
                ((CUR_END_DATE_SEC=CUR_DATE_SEC-1))
                CUR_DATE_SEC_FORM_END=$(date -d @"${CUR_END_DATE_SEC}" +"%Y-%m-%dT%H:%M:%S")
                endlist+=("${CUR_DATE_SEC_FORM_END}")
            done
        fi
        # Add last date
        startlist+=("$(date -d @"${CUR_DATE_SEC}" +"%Y-%m-%dT%H:%M:%S")")
        endlist+=("${END}")


        ## Remove any .jobs file left by a previous run: the batches below append (>>),
        ## so re-running with -b and a different duration or date range would otherwise
        ## write the same jobs several times.
        rm -f "${FILENAME}.jobs"

        ## consecutive sacct
        compt=-1
        for i in "${startlist[@]}";do
            ((compt+=1))
            SACCTOPTDATES=()
            SACCTOPTDATES+=("--starttime" "${i}" "--endtime" "${endlist[${compt}]}")
            if [[ ${compt} -gt 0 ]]; then
                SACCTOPTDATES+=("--noheader")
            fi
            "${SACCT}" "${SACCTOPT[@]}" "${SACCTOPTDATES[@]}" >> "${FILENAME}.jobs"
        done

    else
        SACCTOPT+=("--starttime" "${START}" "--endtime" "${END}")
        # TZ=UTC => For now we'll remove the timezone
        "${SACCT}" "${SACCTOPT[@]}" > "${FILENAME}.jobs"
    fi

    tgzfiles+=("${FILENAME}.jobs")
    if [[ "${TARBALL}" -ne 1 ]]; then
      echo "${FILENAME}.jobs created"
    fi
fi

# Extract monitoring data (.nodes) if mode is 'all' or 'mon'
if [[ "${EXTRACT_MODE}" == "all" || "${EXTRACT_MODE}" == "mon" ]]; then
    "${SCONTROL}" -a -o -ddd show node > "${FILENAME}.nodes"
    tgzfiles+=("${FILENAME}.nodes")
    if [[ "${TARBALL}" -ne 1 ]]; then
      echo "${FILENAME}.nodes created"
    fi
fi

# Extract partitions data (.partitions) only if mode is 'all'
if [[ "${EXTRACT_MODE}" == "all" ]]; then
    "${SCONTROL}" -o -a -ddd show part > "${FILENAME}.partitions"
    tgzfiles+=("${FILENAME}.partitions")
    if [[ "${TARBALL}" -ne 1 ]]; then
      echo "${FILENAME}.partitions created"
    fi
fi

if [[ "${TARBALL}" -eq 1 ]]; then
  tar --force-local -zcf "${FILENAME}.tgz" "${tgzfiles[@]}"
  echo "${FILENAME}.tgz"
  rm -f "${tgzfiles[@]}"
fi

Note

The extraction scripts can retrieve both accounting and monitoring data. Use --data accounting when you only need accounting logs.
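
The batching behind -b|--batch can be tried in isolation. The sketch below reproduces the window computation from the script using epoch seconds only, so it never calls sacct. The function name batch_windows is an assumption made for illustration and is not part of the extraction script.

```shell
#!/bin/bash
# Hypothetical helper (not part of extractSlurmData.sh): print the
# "start end" windows, in epoch seconds, that the batch loop would query.
batch_windows() {
    local start_sec=$1 end_sec=$2 batch_sec=$3
    local cur=${start_sec}
    if (( batch_sec < end_sec - start_sec )); then
        while (( cur < end_sec - batch_sec )); do
            # Each window ends one second before the next one starts,
            # since the script treats sacct limits as inclusive.
            echo "${cur} $(( cur + batch_sec - 1 ))"
            (( cur += batch_sec ))
        done
    fi
    # The last window runs up to the requested end.
    echo "${cur} ${end_sec}"
}

# Example: a 250-second range split into 100-second batches.
batch_windows 0 250 100
```

The example call prints three windows: 0 99, 100 199, and 200 250. When the batch is longer than the whole range, a single window covering the full range is printed.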

Monitoring

LSF — Use one of the following scripts to retrieve LSF monitoring data:

extractLSFNode_json.sh

#!/bin/bash
################################################################################
# Copyright (c) 2017-2026 UCit SAS
# All Rights Reserved
#
# This software is the confidential and proprietary information
# of UCit SAS ("Confidential Information").
# You shall not disclose such Confidential Information
# and shall use it only in accordance with the terms of
# the license agreement you entered into with UCit.
################################################################################
# Run the commands and capture their output.
nodes_output=$(bhosts -X -o 'HOST_NAME status run max NJOBS' -alloc -json)
lsload_output=$(lsload -o 'HOST_NAME status r1m mem' -json)
lshosts_output=$(lshosts -o 'HOST_NAME type model cpuf ncpus maxmem maxswp server RESOURCES ndisks rexpri nprocs ncores nthreads RUN_WINDOWS' -json)
# Create a file name based on the current date and time.
date_time=$(date +"%Y-%m-%d %H:%M:%S")
file_name="lsf_${date_time}.json"
# Replace spaces in the date_time with underscores for a valid filename.
file_name="${file_name// /_}"

# Write the three outputs to the file as a single JSON array.
{
  echo "["
  echo "${nodes_output}"
  echo ","
  echo "${lsload_output}"
  echo ","
  echo "${lshosts_output}"
  echo "]"
} > "${file_name}"
# Final message
echo "Data has been written to ${file_name}"

extractLSFNode.sh

#!/bin/bash
################################################################################
# Copyright (c) 2017-2026 UCit SAS
# All Rights Reserved
#
# This software is the confidential and proprietary information
# of UCit SAS ("Confidential Information").
# You shall not disclose such Confidential Information
# and shall use it only in accordance with the terms of
# the license agreement you entered into with UCit.
################################################################################
# Run the commands and capture their output.
nodes_output=$(bhosts -X -o 'HOST_NAME status run max NJOBS' -alloc)
lsload_output=$(lsload -o 'HOST_NAME status r1m mem')
lshosts_output=$(lshosts -w)
# Create a file name based on the current date and time.
date_time=$(date +"%Y-%m-%d %H:%M:%S")
file_name="lsf_${date_time}.txt"
# Replace spaces in the date_time with underscores for a valid filename.
file_name="${file_name// /_}"

# Write the outputs to the file with a delimiter between them.
{
  echo "=== bhosts ==="
  echo "${nodes_output}"
  echo "=== lsload ==="
  echo "${lsload_output}"
  echo "=== lshosts ==="
  echo "${lshosts_output}"
} > "${file_name}"
# Final message
echo "Data has been written to ${file_name}"
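
To inspect one part of the text file before uploading it, the "=== name ===" delimiters written by extractLSFNode.sh can be used to split it. This is a minimal sketch: the helper name extract_section is hypothetical, and it assumes that no command output line itself has the form "=== ... ===".

```shell
#!/bin/bash
# Hypothetical helper: print the body of one "=== name ===" section
# from a file written by extractLSFNode.sh.
extract_section() {
    local file=$1 name=$2
    awk -v name="${name}" '
        # A delimiter line switches printing on or off for what follows.
        /^=== .* ===$/ { wanted = ($0 == ("=== " name " ===")); next }
        wanted { print }
    ' "${file}"
}

# Usage (the file name is an example):
#   extract_section "lsf_2026-01-15_10:30:00.txt" lsload
```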

PBS / OpenPBS / Torque — Use the following script:

extractPBSOccupancy.sh

#!/bin/bash
################################################################################
# Copyright (c) 2017-2026 UCit SAS
# All Rights Reserved
#
# This software is the confidential and proprietary information
# of UCit SAS ("Confidential Information").
# You shall not disclose such Confidential Information
# and shall use it only in accordance with the terms of
# the license agreement you entered into with UCit.
################################################################################

# Create a file name based on the current date and time
date_time=$(date +"%Y-%m-%d %H:%M:%S")
file_name="pbs_${date_time}.json"
# Replace spaces in the date_time with underscores for a valid filename.
file_name="${file_name// /_}"

# Run the command and redirect the output to the JSON file
pbsnodes -a -F json > "${file_name}"
status=$?

# Verify that the command succeeded and that the file is not empty
if [[ ${status} -eq 0 && -s "${file_name}" ]]; then
    echo "The output has been saved to ${file_name}."
else
    echo "Failed to create the output file."
    exit 1
fi

SLURM — Use the extraction script with --data monitoring:

extractSlurmData.sh

#!/bin/bash
################################################################################
# Copyright (c) 2017-2026 UCit SAS
# All Rights Reserved
#
# This software is the confidential and proprietary information
# of UCit SAS ("Confidential Information").
# You shall not disclose such Confidential Information
# and shall use it only in accordance with the terms of
# the license agreement you entered into with UCit.
################################################################################

START="2010-01-01T00:00:00"

FORMAT=""

CURDATE=$(date "+%Y-%m-%dT%H:%M:%S")
FILENAME="$(hostname)_${CURDATE}"
TARBALL=0
EXTRACT_MODE="all"  # Options: all (.jobs, .nodes, .partitions), acct (accounting: .jobs), mon (monitoring: .nodes)

# Setup basic access to slurm bin sacct and scontrol
DEFAULT_SLURM_PATH="/opt/slurm/bin"

if ! SCONTROL=$(command -v scontrol 2>/dev/null); then
    SCONTROL="${DEFAULT_SLURM_PATH}/scontrol"
fi

if ! SACCT=$(command -v sacct 2>/dev/null); then
    SACCT="${DEFAULT_SLURM_PATH}/sacct"
fi

while [[ $# -ge 1 ]]; do
    key="$1"

    case ${key} in
        -s|--start)
            START="$2"
            shift # past argument
            ;;
        -e|--end)
            END="$2"
            shift # past argument
            ;;
        -b|--batch)
            BATCHDURATION="$2"
            shift # past argument
            ;;
        -a|--all)
            FORMAT="ALL"
            ;;
        -o|--output)
            FILENAME="${2}"
            shift # past argument
            ;;
        -p|--path)
            # Force specific path to look for slurm bin sacct and scontrol
            SCONTROL="${2}/scontrol"
            SACCT="${2}/sacct"
            shift # past argument
            ;;
        -t|--tgz)
            TARBALL=1
            ;;
        -d|--data)
            case "$2" in
                all)
                    EXTRACT_MODE="all"
                    ;;
                acct|accounting)
                    EXTRACT_MODE="acct"
                    ;;
                mon|monitoring)
                    EXTRACT_MODE="mon"
                    ;;
                *)
                    echo "Unknown data mode: $2. Valid options: all, acct (accounting), mon (monitoring)"
                    exit 1
                    ;;
            esac
            shift # past argument
            ;;
        -h|--help)
            echo "$0 [-s|--start DATE_TIME] [-e|--end DATE_TIME] [-b|--batch DAYS] [-a|--all] [-d|--data MODE] [-o|--output FILENAME] [-t|--tgz] [-p|--path PATH] [-h|--help]"
            echo "Extract historical data from SLURM, and create the following files:"
            echo "- ${FILENAME}.jobs: historical data on jobs (accounting)"
            echo "- ${FILENAME}.nodes: current nodes description (monitoring)"
            echo "- ${FILENAME}.partitions: current partitions description (monitoring)"
            echo
            echo "-s|--start: start date and time (format: YYYY-MM-DDThh:mm:ss, default is ${START})"
            echo "-e|--end: end date and time (format: YYYY-MM-DDThh:mm:ss)"
            echo "-b|--batch: number of days per batch. Splits the whole period into consecutive batches so that"
            echo "  several small sacct requests run instead of one large one."
            echo "-a|--all: print all fields"
            echo "-d|--data: data extraction mode (default: all)"
            echo "  all: extract all data (.jobs, .nodes, .partitions)"
            echo "  acct|accounting: extract accounting data only (.jobs)"
            echo "  mon|monitoring: extract monitoring data only (.nodes)"
            echo "-o|--output: output filename (extensions will be added: .jobs, .nodes, .partitions)"
            echo "-t|--tgz: create a tarball and print its path to stdout"
            echo "-p|--path: path to slurm bin folder where scontrol and sacct can be found"
            echo "-h|--help: print this help"
            exit 0
            ;;
        *)
            echo "Unknown option: ${key}"
            exit 1
            ;;
    esac
    shift # past argument or value
done

# Slurm version
SLURM_VERSION=$("${SCONTROL}" --version| awk '{print $2}')
SV_MAJOR=$(echo "${SLURM_VERSION}" | cut -d '.' -f 1)

# Check if format=ALL requested
if [[ "${FORMAT}" != "ALL" ]]; then
    FIELDS=$("${SACCT}" --helpformat)
    # Build the field list from the fields that this version of sacct actually provides.
    declare -a FIELDS_ARRAY
    while IFS=' ' read -r -a array;
    do
        FIELDS_ARRAY+=("${array[@]}");
    done < <(echo "${FIELDS}")

    declare -a UNWANTED_FIELDS=("User" "Group")

    for element in "${FIELDS_ARRAY[@]}"
    do
        if [[ ! ${UNWANTED_FIELDS[*]} =~ (^|[[:space:]])"${element}"($|[[:space:]]) ]]; then
            FORMAT+="${element},"
        fi
    done
fi

# Set sacct options depending on the version of SLURM
declare -a SACCTOPT
SACCTOPT=("--duplicates" "--allusers" "--parsable2" "--format" "${FORMAT}")

if [[ ${SV_MAJOR} -gt 14 ]]; then
    # supported since version 15
    SACCTOPT+=("--delimiter=@|@")
fi

# Set END if not defined
if [[ -z "${END}" ]]; then
    END=$(date +"%Y-%m-%dT%H:%M:%S")
fi

# Array to track generated files for tarball
declare -a tgzfiles=()

# Extract accounting data (.jobs) if mode is 'all' or 'acct'
if [[ "${EXTRACT_MODE}" == "all" || "${EXTRACT_MODE}" == "acct" ]]; then
    #sacct by batch or just once
    if [[ -n "${BATCHDURATION}" && "${BATCHDURATION}" -gt 0 ]]; then
        ## Compute start & end date lists
        # convert to seconds:
        START_SEC=$(date +%s --date "${START}")
        END_SEC=$(date +%s --date "${END}")
        BATCHDURATION_SEC=$((BATCHDURATION*(3600*24)))

        # create date lists
        startlist=()
        endlist=()
        # Slurm sacct works with inclusive limits [start, end]
        # https://rc.byu.edu/wiki/?id=Using+sacct
        # The algorithm does: [start, (start2-1second)], [start2, (start3-1second)], [start3, end]
        # Tested on Slurm 20.
        CUR_DATE_SEC=${START_SEC}
        if (( BATCHDURATION_SEC < (END_SEC-START_SEC) )); then  # Batchduration is smaller than the whole duration
            while ((CUR_DATE_SEC < (END_SEC-BATCHDURATION_SEC) ));do
                CUR_DATE_SEC_FORM=$(date -d @"${CUR_DATE_SEC}" +"%Y-%m-%dT%H:%M:%S")
                startlist+=("${CUR_DATE_SEC_FORM}")
                # get end date:
                ((CUR_DATE_SEC+=BATCHDURATION_SEC))
                ((CUR_END_DATE_SEC=CUR_DATE_SEC-1))
                CUR_DATE_SEC_FORM_END=$(date -d @"${CUR_END_DATE_SEC}" +"%Y-%m-%dT%H:%M:%S")
                endlist+=("${CUR_DATE_SEC_FORM_END}")
            done
        fi
        # Add last date
        startlist+=("$(date -d @"${CUR_DATE_SEC}" +"%Y-%m-%dT%H:%M:%S")")
        endlist+=("${END}")


        ## Remove any .jobs file left by a previous call so that results are not concatenated;
        ## merging several runs must be handled manually. Otherwise, running the script again
        ## with -b but a different duration or date range could record the same jobs several times.
        rm -f "${FILENAME}.jobs"

        ## consecutive sacct
        compt=-1
        for i in "${startlist[@]}";do
            ((compt+=1))
            SACCTOPTDATES=()
            SACCTOPTDATES+=("--starttime" "${i}" "--endtime" "${endlist[${compt}]}")
            if [[ ${compt} -gt 0 ]]; then
                SACCTOPTDATES+=("--noheader")
            fi
            "${SACCT}" "${SACCTOPT[@]}" "${SACCTOPTDATES[@]}" >> "${FILENAME}.jobs"
        done

    else
        SACCTOPT+=("--starttime" "${START}" "--endtime" "${END}")
        # TZ=UTC => For now we'll remove the timezone
        "${SACCT}" "${SACCTOPT[@]}" > "${FILENAME}.jobs"
    fi

    tgzfiles+=("${FILENAME}.jobs")
    if [[ "${TARBALL}" -ne 1 ]]; then
      echo "${FILENAME}.jobs created"
    fi
fi

# Extract monitoring data (.nodes) if mode is 'all' or 'mon'
if [[ "${EXTRACT_MODE}" == "all" || "${EXTRACT_MODE}" == "mon" ]]; then
    "${SCONTROL}" -a -o -ddd show node > "${FILENAME}.nodes"
    tgzfiles+=("${FILENAME}.nodes")
    if [[ "${TARBALL}" -ne 1 ]]; then
      echo "${FILENAME}.nodes created"
    fi
fi

# Extract partitions data (.partitions) only if mode is 'all'
if [[ "${EXTRACT_MODE}" == "all" ]]; then
    "${SCONTROL}" -o -a -ddd show part > "${FILENAME}.partitions"
    tgzfiles+=("${FILENAME}.partitions")
    if [[ "${TARBALL}" -ne 1 ]]; then
      echo "${FILENAME}.partitions created"
    fi
fi

if [[ "${TARBALL}" -eq 1 ]]; then
  tar --force-local -zcf "${FILENAME}.tgz" "${tgzfiles[@]}"
  echo "${FILENAME}.tgz"
  rm -f "${tgzfiles[@]}"
fi

Note

Use --data monitoring when you only need monitoring logs.
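
A scheduled collection can wrap the script and check its result before the file is uploaded. The sketch below is one possible wrapper, not part of the product: the function name collect_monitoring, the script location, and the output prefix are all assumptions. It relies only on the --data and --output options documented in the script's help.

```shell
#!/bin/bash
# Hypothetical wrapper: run the extraction script in monitoring mode and
# succeed only if the resulting .nodes file exists and is non-empty.
collect_monitoring() {
    local script=$1 prefix=$2
    bash "${script}" --data monitoring --output "${prefix}" || return 1
    [[ -s "${prefix}.nodes" ]]
}

# Usage (paths are examples):
#   collect_monitoring ./extractSlurmData.sh "/tmp/$(hostname)_monitoring" \
#       && echo "monitoring data ready for upload"
```

Checking for a non-empty file catches the case where scontrol could not be found or returned nothing, which would otherwise leave an empty .nodes file behind.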