sisyphus
`FileExistsError` in `Job._sis_setup_directory`
...
[2023-12-05 04:29:29,576] WARNING: interrupted_resumable: Job<alias/exp2023_04_25_rf/conformer_import_moh_att_2023_06_30/base-24gb-v6/train work/i6_core/returnn/training/ReturnnTrainingJob.jyQaF3P8Ieol>
[2023-12-05 04:29:29,576] INFO: interrupted_resumable(1) retry_error(4) running(8) waiting(663)
[2023-12-05 04:31:04,825] ERROR: Exception in thread <_MainThread(MainThread, started 140708433694720)>:
EXCEPTION
Traceback (most recent call last):
File "/u/zeyer/setups/combined/2021-05-31/tools/sisyphus/sisyphus/tools.py", line 311, in default_handle_exception_interrupt_main_thread.<locals>.wrapped_func
line: return func(*args, **kwargs)
locals:
func = <local> <function Manager.run at 0x7ff93ae65940>
args = <local> (<Manager(Thread-2, initial)>,)
kwargs = <local> {}
File "/u/zeyer/setups/combined/2021-05-31/tools/sisyphus/sisyphus/manager.py", line 617, in Manager.run
line: self.resume_jobs()
locals:
self = <local> <Manager(Thread-2, initial)>
self.resume_jobs = <local> <bound method Manager.resume_jobs of <Manager(Thread-2, initial)>>
File "/u/zeyer/setups/combined/2021-05-31/tools/sisyphus/sisyphus/manager.py", line 405, in Manager.resume_jobs
line: self.thread_pool.map(f, self.jobs.get(gs.STATE_INTERRUPTED_RESUMABLE, []))
locals:
self = <local> <Manager(Thread-2, initial)>
self.thread_pool = <local> <multiprocessing.pool.ThreadPool state=RUN pool_size=10>
self.thread_pool.map = <local> <bound method Pool.map of <multiprocessing.pool.ThreadPool state=RUN pool_size=10>>
f = <local> <function Manager.resume_jobs.<locals>.f at 0x7ff92c3c0180>
self.jobs = <local> defaultdict(<class 'set'>, {'waiting': {Job<work/i6_core/returnn/search/SearchWordsToCTMJob.sHh83NBWaNtR>, Job<work/i6_core/recognition/scoring/ScliteJob.TThbUgE8qjSd>, Job<work/i6_core/returnn/forward/ReturnnForwardJobV2.uX3OywkERCr7>, Job<work/i6_core/returnn/search/SearchRemoveLabelJob.cakEqUo..., len = 5, _[0]: {len = 0}
self.jobs.get = <local> <built-in method get of collections.defaultdict object at 0x7ff8ac589c60>
gs = <global> <module 'sisyphus.global_settings' from '/u/zeyer/setups/combined/2021-05-31/tools/sisyphus/sisyphus/global_settings.py'>
gs.STATE_INTERRUPTED_RESUMABLE = <global> 'interrupted_resumable', len = 21
File "/work/tools/users/zeyer/linuxbrew/opt/python@3.11/lib/python3.11/multiprocessing/pool.py", line 367, in Pool.map
line: return self._map_async(func, iterable, mapstar, chunksize).get()
locals:
self = <local> <multiprocessing.pool.ThreadPool state=RUN pool_size=10>
self._map_async = <local> <bound method Pool._map_async of <multiprocessing.pool.ThreadPool state=RUN pool_size=10>>
func = <local> <function Manager.resume_jobs.<locals>.f at 0x7ff92c3c0180>
iterable = <local> {Job<alias/exp2023_04_25_rf/conformer_import_moh_att_2023_06_30/base-24gb-v6/train work/i6_core/returnn/training/ReturnnTrainingJob.jyQaF3P8Ieol>}, len = 1
mapstar = <global> <function mapstar at 0x7ff93b797d80>
chunksize = <local> None
get = <not found>
File "/work/tools/users/zeyer/linuxbrew/opt/python@3.11/lib/python3.11/multiprocessing/pool.py", line 774, in ApplyResult.get
line: raise self._value
locals:
self = <local> <multiprocessing.pool.MapResult object at 0x7ff8ac510a90>
self._value = <local> FileExistsError(17, 'File exists')
File "/work/tools/users/zeyer/linuxbrew/opt/python@3.11/lib/python3.11/multiprocessing/pool.py", line 125, in worker
line: result = (True, func(*args, **kwds))
locals:
result = <local> None
func = <local> None
args = <local> None
kwds = <local> None
File "/work/tools/users/zeyer/linuxbrew/opt/python@3.11/lib/python3.11/multiprocessing/pool.py", line 48, in mapstar
line: return list(map(*args))
locals:
list = <builtin> <class 'list'>
map = <builtin> <class 'map'>
args = <local> (<function Manager.resume_jobs.<locals>.f at 0x7ff92c3c0180>, (Job<alias/exp2023_04_25_rf/conformer_import_moh_att_2023_06_30/base-24gb-v6/train work/i6_core/returnn/training/ReturnnTrainingJob.jyQaF3P8Ieol>,))
File "/u/zeyer/setups/combined/2021-05-31/tools/sisyphus/sisyphus/manager.py", line 399, in Manager.resume_jobs.<locals>.f
line: job._sis_setup_directory(force=True)
locals:
job = <local> Job<alias/exp2023_04_25_rf/conformer_import_moh_att_2023_06_30/base-24gb-v6/train work/i6_core/returnn/training/ReturnnTrainingJob.jyQaF3P8Ieol>
job._sis_setup_directory = <local> <bound method Job._sis_setup_directory of Job<alias/exp2023_04_25_rf/conformer_import_moh_att_2023_06_30/base-24gb-v6/train work/i6_core/returnn/training/ReturnnTrainingJob.jyQaF3P8Ieol>>
force = <not found>
File "/u/zeyer/setups/combined/2021-05-31/tools/sisyphus/sisyphus/job.py", line 284, in Job._sis_setup_directory
line: os.symlink(src=os.path.abspath(str(creator._sis_path())), dst=link_name, target_is_directory=True)
locals:
os = <global> <module 'os' (frozen)>
os.symlink = <global> <built-in function symlink>
src = <not found>
os.path = <global> <module 'posixpath' (frozen)>
os.path.abspath = <global> <function abspath at 0x7ff93be74f40>
str = <builtin> <class 'str'>
creator = <local> Job<work/i6_core/text/label/subword_nmt/train/ReturnnTrainBpeJob.vTq56NZ8STWt>
creator._sis_path = <local> <bound method Job._sis_path of Job<work/i6_core/text/label/subword_nmt/train/ReturnnTrainBpeJob.vTq56NZ8STWt>>
dst = <not found>
link_name = <local> 'work/i6_core/returnn/training/ReturnnTrainingJob.jyQaF3P8Ieol/input/i6_core_text_label_subword_nmt_train_ReturnnTrainBpeJob.vTq56NZ8STWt', len = 136
target_is_directory = <not found>
FileExistsError: [Errno 17] File exists: '/u/zeyer/setups/combined/2021-05-31/work/i6_core/text/label/subword_nmt/train/ReturnnTrainBpeJob.vTq56NZ8STWt' -> 'work/i6_core/returnn/training/ReturnnTrainingJob.jyQaF3P8Ieol/input/i6_core_text_label_subword_nmt_train_ReturnnTrainBpeJob.vTq56NZ8STWt'
[2023-12-05 04:31:05,077] WARNING: Main thread exit. Still running non-daemon threads: {<LocalEngine(Thread-1, started 140708182750784)>}
This is the first time I have seen this, so it is probably a very rare issue.
After a restart of the manager, I don't see the problem anymore.
Mh.. I have never seen this before.
Do you have two managers running simultaneously that race to create the same work folder?
No.
I have seen it before, but I am no longer sure what caused it. It might have been during the period of filesystem problems on asr3, but I can't tell for sure.
My first guess would also have been multiple managers or some filesystem problems. The function should not be called in parallel inside sisyphus for the same job. It's called here: https://github.com/rwth-i6/sisyphus/blob/4c3b40f289110bef30e25662f50593898067c0e3/sisyphus/manager.py#L385
Let us know if this problem reappears.
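In case it does reappear, here is a minimal sketch of how the symlink creation could be made tolerant of an already existing link. This is not the actual code in `Job._sis_setup_directory`; the helper name `ensure_symlink` and the temporary-name scheme are hypothetical. It assumes a POSIX filesystem, where `os.replace` atomically swaps one symlink for another.

```python
import os


def ensure_symlink(target: str, link_name: str) -> None:
    """Make link_name a symlink to target, tolerating an existing link.

    Hypothetical helper for illustration, not part of sisyphus.
    """
    try:
        os.symlink(target, link_name, target_is_directory=True)
        return
    except FileExistsError:
        # Something is already at link_name, e.g. left over from an
        # earlier run or created concurrently by another process.
        pass

    # If the existing entry is already the link we want, there is nothing to do.
    if os.path.islink(link_name) and os.readlink(link_name) == target:
        return

    # Otherwise create the link under a temporary name and rename it over
    # the old one. Assumption: link_name is a symlink or a file, not a real
    # directory; os.replace raises in that case.
    tmp_name = f"{link_name}.tmp.{os.getpid()}"
    os.symlink(target, tmp_name, target_is_directory=True)
    try:
        os.replace(tmp_name, link_name)
    except OSError:
        os.unlink(tmp_name)  # do not leave the temporary link behind
        raise
```

With something like this, a second call for the same input would be a no-op instead of raising `FileExistsError`. It would not explain why the link already existed, though, so it could hide the underlying race or filesystem issue.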