Batch Exit Code 126 / "Permission denied" on Low-Priority Nodes

Problem Description

We are experiencing a rare but annoying problem in Azure Batch that manifests itself in the error message "Permission denied". We are using a pool consisting of up to 50 nodes (for cost reduction almost exclusively low-priority nodes) and Docker with 5 containers on each node. Inside these containers a shell script starts a Java program that processes data for roughly 5 to 10 minutes.

At some point all tasks start to fail immediately on one node. We assume that a low-priority node was preempted, came back, and then failed its node setup. This prevents our software from running successfully and causes the tasks to fail.

So far we have not been able to find a suitable workaround for this problem, or any documentation on the behavior of low-priority nodes that return after preemption.

Steps to Reproduce

1. Start a Batch pool (autopool) with a reasonable number of low-priority nodes (we suggest at least 20) of reasonable size (we use F32s_v2).
2. Run this pool for hours and constantly submit new tasks.

After a certain time (a few hours at most) one node will start to fail all tasks immediately.
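
To spot the affected node during a long run, something like the following might help. It is only a minimal sketch: it assumes the task results have already been collected as (node_id, end_time, exit_code) tuples, which is a hypothetical shape and not something Batch returns directly, and the function name and the threshold of three consecutive failures are arbitrary choices for illustration.

```python
from collections import defaultdict


def find_suspect_nodes(task_records, exit_code=126, min_failures=3):
    """Return the IDs of nodes whose latest `min_failures` tasks all ended with `exit_code`.

    task_records: iterable of (node_id, end_time, exit_code) tuples.
    This record shape is an assumption made for this sketch.
    """
    by_node = defaultdict(list)
    for node_id, end_time, code in task_records:
        by_node[node_id].append((end_time, code))

    suspects = []
    for node_id, results in by_node.items():
        # Order each node's tasks by end time, oldest first.
        results.sort(key=lambda item: item[0])
        recent = results[-min_failures:]
        # Flag the node only if there are enough tasks and every recent one failed.
        if len(recent) == min_failures and all(code == exit_code for _, code in recent):
            suspects.append(node_id)
    return sorted(suspects)
```

A node that fails a single task is not flagged; only a node whose most recent tasks all ended with exit code 126 is reported, which matches the "all tasks fail immediately" pattern described above.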

Expected Results

All tasks on all nodes finish successfully.

Actual Results

Over time one node will be affected by this problem. On the affected node our software is not started; instead, all tasks fail immediately with exit code 126. The log file fileuploaderr.txt contains the following:

Traceback (most recent call last):
  File "batchfileuploader.py", line 159, in <module>
  File "batchfileuploader.py", line 104, in main
  File "batchfileuploader.py", line 48, in load_specification_from_file
PermissionError: [Errno 13] Permission denied: '/mnt/batch/tasks/workitems/20230420_1429_MGTP_82184790148a4_JOB/job-1/20230420_1437_GP_01c20b9e9c4d4d1dbec2cc4f67c5d/uploadconfig-f417492d-d3a2-4893-976a-3936ac2dc7d7.json'
[11899] Failed to execute script 'batchfileuploader' due to unhandled exception!

The script batchfileuploader.py is not our own code but something built into Batch. Apart from this error message we could not find any differences from healthy nodes; files and permissions are the same.
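
For comparing an affected node with a healthy one, a small helper along these lines could be run against the path from the traceback. This is a sketch using only the Python standard library, and `describe_access` is a name made up here. It assumes the script is run as the same user as the failing process, since the access checks depend on who runs them.

```python
import os
import stat


def describe_access(path):
    """Report mode, owner and access for `path` and every parent directory.

    A "Permission denied" on a file can be caused by any directory above it,
    so the whole chain up to the filesystem root is inspected.
    """
    report = []
    current = os.path.abspath(path)
    while True:
        try:
            st = os.stat(current)
            entry = {
                "path": current,
                "mode": stat.filemode(st.st_mode),  # e.g. "-rw-r--r--"
                "uid": st.st_uid,
                "gid": st.st_gid,
                "readable": os.access(current, os.R_OK),
                "traversable": os.access(current, os.X_OK),
            }
        except OSError as exc:
            # Covers missing paths as well as permission errors from stat itself.
            entry = {"path": current, "error": f"{type(exc).__name__}: {exc}"}
        report.append(entry)
        parent = os.path.dirname(current)
        if parent == current:  # reached the filesystem root
            break
        current = parent
    return report
```

Printing the entries for the uploadconfig-*.json path on both nodes and comparing them side by side would show whether a parent directory, rather than the file itself, differs in mode or ownership.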

Additional Logs

The log files (agent-debug.log, agent-warn.log, controller-debug.log and controller-warn.log) from one affected node are uploaded here: https://storage.blob.core.windows.net/batch-exit-code-126/batch-22EFF6BA9ACA8B10.zip?sv=2021-10-04&spr=https%2Chttp&st=2023-05-01T00%3A00%3A00Z&se=2024-05-31T00%3A00%3A00Z&sr=b&sp=r&sig=XXX

Additional Comments

We strongly suspect that only low-priority nodes are affected, because multiple test runs of similarly configured Batch pools with exclusively dedicated nodes never had this problem.

In addition to the five configured Docker containers on the Batch nodes, we manually create another container (containing a PostgreSQL database) during node setup. This is probably neither well tested nor widely used, but it has worked without any problems so far.

May 22 '23 12:05 FridoDeluxe

@FridoDeluxe Thanks for the report! I've seen one other report similar to this come my way, but there were no logs for us to search. I'll take a look next time I have a chance; knowing it may be low-priority-related helps.

May 22 '23 15:05 staer

I've downloaded the logs and attached them to our internal tracking board, so feel free to remove them for privacy reasons.

May 22 '23 15:05 staer

@staer Just a quick question: Are there any insights or progress on this problem yet?

Jun 23 '23 09:06 FridoDeluxe