NGINX
Solution
There is an Nginx configuration file for OKA located in ${OKA_INSTALL_DIR}/conf. Copy this file into the Nginx configuration directory:
sudo cp ${OKA_INSTALL_DIR}/conf/oka.nginx.conf /etc/nginx/conf.d/
sudo systemctl restart nginx
Disable SELinux by either:
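Running sudo setenforce 0 to switch to permissive mode until the next reboot.
Setting SELINUX=permissive (or disabled) in /etc/selinux/config and rebooting to make the change permanent.
These are the standard SELinux commands; adapt them to your distribution if needed.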
Issue - The layout of OKA interface is missing and no images are displayed
Nginx does not serve OKA files properly.
Solution
The user running Nginx must have read and execute access on the whole folder tree down to ${OKA_INSTALL_DIR}/data.
For example, if OKA is installed in /opt, the Nginx user must have read and execute rights on the /opt folder to be able to navigate to the ${OKA_INSTALL_DIR}/data directory.
Either change the Nginx user to a user with the correct access rights.
Or change the directory permissions to give read and execute rights to the Nginx user.
The user running Nginx is system-specific (it might be nginx, www-data, …).
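For example, assuming OKA is installed under /opt and Nginx runs as the nginx user (both assumptions, check your own setup), you can verify and grant access with:
sudo -u nginx ls ${OKA_INSTALL_DIR}/data  # should list the directory without a permission error
sudo chmod o+rx /opt  # grant read and execute rights on the parent folder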
Reason
The Nginx configuration allows a maximum of 8k for a request; the current request is probably larger and is therefore rejected.
Solution
You can update your current Nginx configuration to allow larger requests by adding or updating the appropriate size option.
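For example, assuming the limit being hit is the request body size, the following directive in the http or server block of your Nginx configuration raises it (the 20m value is illustrative):
client_max_body_size 20m;
If the 8k limit concerns request headers instead, large_client_header_buffers 4 16k; is the directive to adjust.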
Issue - 403 CSRF verification failed
Connection to OKA is not possible due to Django rejecting CSRF token validation.
Solution
Update your OKA configuration to trust your current instance domain either in http or https depending on your Nginx configuration.
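For example, assuming OKA exposes Django's standard CSRF_TRUSTED_ORIGINS setting (the file where your installation reads it may vary), adding your domain would look like:
CSRF_TRUSTED_ORIGINS = ["https://oka.example.com"]  # use http:// if Nginx does not terminate TLS
Note that recent Django versions (4.0+) require the scheme to be part of each entry.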
Gunicorn
Reason
The TIMEOUT parameter is probably set too short in oka.nginx.conf and/or gunicorn.conf.
Solution
Increase the read timeout of Gunicorn and the matching proxy timeout in Nginx.
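For example, to allow requests to run for up to five minutes (illustrative value; the names assume Gunicorn's standard timeout setting and Nginx's proxy_read_timeout directive):
timeout = 300  # in the Gunicorn configuration
proxy_read_timeout 300;  # in the Nginx block proxying to Gunicorn
Restart both services after changing these values.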
Elasticsearch
Issue - java.lang.IllegalStateException: failed to obtain node locks / java.io.IOException: No locks available
Elasticsearch cannot obtain locks on its data files. This typically happens when the Elasticsearch data directory is located on an NFS mount, as file locking on NFSv3 is unreliable.
Solution
If possible, move the Elasticsearch data directory to local storage. If you must stick with NFS, use NFSv4 instead of NFSv3.
Warning
Elasticsearch has been known to work on NFSv4, but this is not an officially supported deployment.
Issue - Limit of total fields [1000] has been exceeded
Problem while training Predictor.
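Solution
By default, Elasticsearch caps the number of mapped fields per index at 1000. A common workaround (generic Elasticsearch guidance; the index name below is a placeholder) is to raise this limit on the affected index:
curl -X PUT "localhost:9200/<index_name>/_settings" -H 'Content-Type: application/json' -d '{"index.mapping.total_fields.limit": 2000}'
Replace <index_name> with the index written during Predictor training.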
Issue - Out of memory
Not enough memory.
Reason
When training many predictors at the same time in Predictor or doing heavy analysis, it can happen that Elasticsearch runs out of memory and crashes.
One way to prevent downtime of OKA is to set up the Elasticsearch Systemd service to automatically restart Elasticsearch in case of failure.
Solution
Automatically restart Elasticsearch in case of failure.
You need to add the following configuration to the Elasticsearch Systemd service file (e.g., /lib/systemd/system/elasticsearch.service):
[Unit]
...
StartLimitIntervalSec=300
StartLimitBurst=5

[Service]
...
Restart=on-failure
RestartSec=5
Then reload Systemd: sudo systemctl daemon-reload.
You can test the auto-restart by killing the Elasticsearch process (kill -9 <ES_PID>) and checking that the service restarts: systemctl status elasticsearch.
Log Ingestion
Issue - Expected X fields in line Y, saw Z. Error could possibly be due to quotes being ignored when a multi-char delimiter is used.
Problem while ingesting Slurm logs.
Reason
Parsing logs extracted from Slurm accounting is done using a separator character to first identify all available fields in the data file.
By default, the separator used by Slurm is set to |, so when calling sacct to extract the logs you will receive something like this:
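For example, a raw extract might look like this (illustrative excerpt; the exact fields depend on your sacct call):
JobID|JobName|Partition|State|Elapsed
1234|my_job|compute|COMPLETED|00:10:32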
Because users can still use the | character in their submit command lines, job names, etc., an extra separator can end up inside the data itself. You can therefore run into cases like these:
field0|fie|ld1|field2
or
field0|fie_|_ld1|field2
Solution
The Slurm parser will handle most cases automatically; however, we still cannot guarantee that all of them will be detected.
To avoid this type of problem, we recommend one of the following:
Add this option to your sacct call to specify a custom separator: --delimiter="@|@" (quote it so the shell does not interpret the | character)
Directly use our extraction script (see Retrieve job scheduler logs)
Let OKA handle the extraction by connecting it to the scheduler. For this you need to change the Command type in the configuration for your cluster in DATA MANAGER > Conf job schedulers to LOCAL, FORWARDED_PWD or FORWARDED_KEYFILE (see Other variables for more info regarding these parameters)
Issue - Data are displayed in the wrong timezone.
Problem while visualizing Slurm logs ingested directly through an sacct call.
Reason
Due to the way sacct works, the datetime values retrieved when calling the scheduler are expressed in the timezone of the environment from which the call originated.
When creating a cluster (see Clusters management), the default timezone in its configuration is set to UTC.
Solution
To fix this, you can switch to the appropriate timezone in the cluster configuration (see Cluster configurations) accessible through the admin panel.
You will then have to clean the existing data that was ingested with the wrong timezone and let OKA call sacct again with the new configuration.