NGINX
Solution
There is an Nginx configuration file for OKA located in ${OKA_INSTALL_DIR}/conf. Copy this file into the Nginx configuration directory:
sudo cp ${OKA_INSTALL_DIR}/conf/oka.nginx.conf /etc/nginx/conf.d/
sudo systemctl restart nginx
If SELinux prevents Nginx from serving OKA, disable it either temporarily (sudo setenforce 0) or permanently (set SELINUX=disabled in /etc/selinux/config, then reboot).
Issue - The layout of the OKA interface is missing and no images are displayed
Nginx does not serve OKA files properly.
Solution
The user running Nginx must have read and execute access on the whole folder tree to ${OKA_INSTALL_DIR}/data. For example, if OKA is installed in /opt, the Nginx user must have read and execute rights on the /opt folder to be able to navigate to the ${OKA_INSTALL_DIR}/data directory.
Either change the Nginx user to one with the correct access rights, or change the directory permissions to give read and execute rights to the Nginx user.
The user running Nginx is system specific (it might be nginx, www-data, …).
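For example, the second option above can be done with chmod. The sketch below works on a throwaway directory tree standing in for ${OKA_INSTALL_DIR}; the paths, the 750 baseline, and the o+rX choice are assumptions to adapt to your setup:

```shell
set -e
# Throwaway tree standing in for /opt/<OKA_INSTALL_DIR>/data (illustrative paths).
root=$(mktemp -d)
mkdir -p "$root/oka/data"
chmod -R 750 "$root/oka"     # baseline: no access for "others" (the Nginx user)
chmod o+x "$root"            # let the Nginx user traverse the parent folder
chmod -R o+rX "$root/oka"    # read + traverse; capital X sets execute on dirs only
perms=$(stat -c '%A' "$root/oka/data")
echo "$perms"                # the data directory is now readable and traversable
rm -rf "$root"
```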
Reason
The Nginx configuration allows a maximum of 8k for a request, while the current request is probably bigger and is therefore rejected.
Solution
You can update your current Nginx configuration to allow larger requests by adding or updating the following option:
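The exact option is not preserved on this page. Since Nginx's default 8k limit applies to request header buffers, the directive in question is most likely large_client_header_buffers (an assumption; if the error concerns the request body instead, the relevant directive would be client_max_body_size):

```nginx
# Assumption: raise the header buffer limit from the 8k default (values illustrative).
large_client_header_buffers 4 16k;
```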
Issue - 403 CSRF verification failed
Connection to OKA is not possible because Django rejects the CSRF token validation.
Solution
Update your OKA configuration to trust your current instance domain, in either http or https depending on your Nginx configuration.
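In a Django-based application such as OKA, this is typically done through the CSRF_TRUSTED_ORIGINS setting. The file location and the domain below are assumptions to adapt to your install:

```python
# In the OKA/Django settings file (location depends on your install; domain illustrative).
# Use the scheme (http or https) that matches your Nginx configuration.
CSRF_TRUSTED_ORIGINS = ["https://oka.example.com"]
```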
Gunicorn
Reason
The TIMEOUT parameter is probably set too short in oka.nginx.conf and/or gunicorn.conf; increase it to raise Gunicorn's read timeout.
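Concretely, the timeout can be raised on both sides; the Nginx directive name and the values below are assumptions to adapt to your deployment:

```text
# oka.nginx.conf — give Nginx more time to wait for Gunicorn (assumed directive):
proxy_read_timeout 300s;

# gunicorn.conf — raise the Gunicorn worker timeout (seconds):
timeout = 300
```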
Elasticsearch
Issue - java.lang.IllegalStateException: failed to obtain node locks / java.io.IOException: No locks available
Problem while accessing Elasticsearch files.
Solution
If you must stick with NFS, use NFSv4 instead of NFSv3.
Warning
Elasticsearch has been known to work on NFSv4, but this is not an officially supported deployment.
Issue - Limit of total fields [1000] has been exceeded
Problem while training Predict-IT.
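The fix is not preserved on this page. A common remedy (an assumption here, not necessarily the official OKA guidance) is to raise Elasticsearch's per-index field limit:

```shell
# Assumption: raise index.mapping.total_fields.limit on the affected index
# (host, index name, and new limit are illustrative).
curl -X PUT "http://localhost:9200/my-index/_settings" \
  -H 'Content-Type: application/json' \
  -d '{"index.mapping.total_fields.limit": 2000}'
```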
Issue - Out of memory
Not enough memory.
Reason
When training many predictors at the same time in Predict-IT or doing heavy analysis, it can happen that Elasticsearch runs out of memory and crashes.
One way to prevent OKA downtime is to set up the Elasticsearch Systemd service to automatically restart Elasticsearch in case of failure.
Solution
Automatically restart Elasticsearch in case of failure.
You need to add the following configuration to the Elasticsearch Systemd service file (e.g., /lib/systemd/system/elasticsearch.service):
[Unit]
...
StartLimitIntervalSec=300
StartLimitBurst=5

[Service]
...
Restart=on-failure
RestartSec=5
Then reload Systemd: sudo systemctl daemon-reload.
You can test the auto-restart by killing the Elasticsearch process (kill -9 <ES_PID>) and checking that the service restarts: systemctl status elasticsearch.
Log Ingestion
Issue - Expected X fields in line Y, saw Z. Error could possibly be due to quotes being ignored when a multi-char delimiter is used.
Problem while ingesting Slurm logs.
Reason
Parsing logs extracted from Slurm accounting is done using a separator character to first identify all available fields in the data file. By default, the separator used by Slurm is set to |, so when calling sacct to extract the logs you will receive something like this:
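The sample output is not preserved on this page; pipe-separated sacct output typically looks like the following (field names illustrative):

```text
JobID|JobName|Partition|State|Elapsed
1234|my_job|compute|COMPLETED|00:10:32
```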
Because users can still use the | character in their submit command lines, job names, etc., you might find yourself in cases where an extra separator exists within the data. You can therefore end up with cases like these:
field0|fie|ld1|field2
or
field0|fie_|_ld1|field2
Solution
The Slurm parser will handle most cases automatically; however, we still cannot guarantee that all of them will be detected.
To avoid this type of problem, we recommend one of the following:
Add this option to your sacct call to specify a custom separator: --delimiter=@|@
Directly use our extraction script (see Retrieve job scheduler logs )
Let OKA handle the extraction by connecting it to the scheduler. For this, you need to change the Command type in the configuration for your cluster in DATA MANAGER > Conf job schedulers to LOCAL, FORWARDED_PWD, or FORWARDED_KEYFILE (see Other variables for more info regarding those parameters)
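The first recommendation above can be sketched as follows; the field list and time range are assumptions to adapt to your site (--parsable2 prints fields separated by the chosen delimiter, without a trailing one):

```shell
# Illustrative sacct extraction with a custom multi-character delimiter.
sacct --allusers --parsable2 --delimiter='@|@' \
      --format=JobID,JobName,User,State,Start,End
```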
Issue - Data are displayed in the wrong timezone.
Problem while visualizing Slurm logs ingested directly through an sacct call.
Reason
Due to the way sacct works, the datetime data retrieved when calling the scheduler are set in the timezone of the environment from which the call originated. When creating a cluster (see Clusters management), the default timezone is set to UTC in its configuration.
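The effect described above can be seen with any tool that formats datetimes, for example GNU date; the same environment timezone (TZ) applies to an sacct call:

```shell
# The same epoch instant, formatted in two different timezones (GNU date):
t=1700000000                                        # 2023-11-14 22:13:20 UTC
utc=$(TZ=UTC date -d @"$t" '+%Y-%m-%d %H:%M')
tokyo=$(TZ=Asia/Tokyo date -d @"$t" '+%Y-%m-%d %H:%M')
echo "$utc"     # 2023-11-14 22:13
echo "$tokyo"   # 2023-11-15 07:13
```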
Solution
To fix this, you can switch to the appropriate timezone in the cluster configuration (see Cluster configurations) accessible through the admin panel. You will then have to clean the existing data affected by the previous upload in the wrong timezone and let OKA call sacct again with the new configuration.