Higashi
Stops with OSError when running "higashi_model.train_for_imputation_nbr_0()"
Hi Ruochiz,
Higashi ran very well without any errors when the resolution was 1 Mb (JSON file option "resolution") on my CentOS 7 system.
However, whenever I increased the resolution, no matter which value I chose, there was always an OSError like the following during the step "higashi_model.train_for_imputation_nbr_0()":
[ Epoch 42 of 45 ]
- (Train) bce: 0.3479, mse: 0.0000, acc: 96.450 %, pearson: 0.571, spearman: 0.634, elapse: 97.359 s
- (Valid) bce: 2.7560, acc: 97.025 %, pearson: 0.187, spearman: 0.635, elapse: 0.296 s
no improve: 1
[ Epoch 43 of 45 ]
- (Train) bce: 0.3542, mse: 0.0000, acc: 96.321 %, pearson: 0.557, spearman: 0.633, elapse: 96.495 s
- (Valid) bce: 3.0346, acc: 96.726 %, pearson: 0.123, spearman: 0.633, elapse: 0.355 s
no improve: 2
[ Epoch 44 of 45 ]
- (Train) bce: 0.3466, mse: 0.0000, acc: 96.488 %, pearson: 0.599, spearman: 0.634, elapse: 99.352 s
- (Valid) bce: 3.6074, acc: 97.016 %, pearson: 0.157, spearman: 0.636, elapse: 0.356 s
no improve: 3
- (Validation) :   0%| | 0/10 [00:00<?, ?it/s]
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/data/yufan/biotools/anaconda/anaconda2023/envs/higashi/lib/python3.9/site-packages/higashi/Higashi_wrapper.py", line 1367, in train_for_imputation_nbr_0
    self.train(
  File "/data/yufan/biotools/anaconda/anaconda2023/envs/higashi/lib/python3.9/site-packages/higashi/Higashi_wrapper.py", line 1141, in train
    valid_bce_loss, valid_accu, valid_auc1, valid_auc2, _, _ = self.eval_epoch(validation_data_generator)
  File "/data/yufan/biotools/anaconda/anaconda2023/envs/higashi/lib/python3.9/site-packages/higashi/Higashi_wrapper.py", line 994, in eval_epoch
    pool = ProcessPoolExecutor(max_workers=cpu_num)
  File "/data/yufan/biotools/anaconda/anaconda2023/envs/higashi/lib/python3.9/concurrent/futures/process.py", line 658, in __init__
    self._result_queue = mp_context.SimpleQueue()
  File "/data/yufan/biotools/anaconda/anaconda2023/envs/higashi/lib/python3.9/multiprocessing/context.py", line 113, in SimpleQueue
    return SimpleQueue(ctx=self.get_context())
  File "/data/yufan/biotools/anaconda/anaconda2023/envs/higashi/lib/python3.9/multiprocessing/queues.py", line 340, in __init__
    self._reader, self._writer = connection.Pipe(duplex=False)
  File "/data/yufan/biotools/anaconda/anaconda2023/envs/higashi/lib/python3.9/multiprocessing/connection.py", line 527, in Pipe
    fd1, fd2 = os.pipe()
OSError: [Errno 24] Too many open files
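For context on why this error occurs: a minimal sketch (Linux-only, independent of Higashi) showing that each multiprocessing SimpleQueue, like the one the traceback ends in, is backed by an os.pipe() and therefore consumes two file descriptors. Creating many queues or process pools without releasing them can exhaust the per-process open-file limit.

```python
import multiprocessing as mp
import os

def open_fds():
    # Count this process's open file descriptors (Linux-specific /proc).
    return len(os.listdir("/proc/self/fd"))

before = open_fds()
queues = [mp.SimpleQueue() for _ in range(10)]  # each queue wraps an os.pipe()
after = open_fds()
print(after - before)  # roughly 2 extra descriptors per queue
```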
Would you please let me know the reason for this issue?
Thanks a lot.
Yufan (Harry) Zhou
Hum. Could you try re-running it with fewer CPU workers?
Or you can try the following to increase the maximum number of open files:
# Check current limit
$ ulimit -n
256
# Raise limit to 2048
# Only affects processes started from this shell
$ ulimit -n 2048
$ ulimit -n
2048
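If the shell-level change does not stick (for example, when the job runs under a scheduler), an alternative sketch is to raise the soft limit from inside the Python session before training; the soft limit can be raised up to the hard limit without root privileges:

```python
import resource

soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
# Raise the soft limit, capped at the hard limit (changing the hard limit needs root).
resource.setrlimit(resource.RLIMIT_NOFILE, (min(4096, hard), hard))
print(resource.getrlimit(resource.RLIMIT_NOFILE)[0])
```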
Hi Ruochi, thank you so much for your reply. I have increased the limit with the command "ulimit -n 4096" and used only 8 CPUs of a 128-CPU server, but Higashi still fails with the same error as before. I also contacted Dr. Jian Ma for help, and he suggested that I continue the discussion with you on GitHub. Would you please help me solve this issue? Thanks.
Hum. I must say this error is really strange, but it looks like it is due to how Python multiprocessing is handled by the Linux system. Do you notice memory being used up when the error appears? It's possible that the system is writing to the swap partition after running out of memory.
Same issue.
$ ulimit -n
1048576
The code itself sets a low limit; you can change it to a larger value, or comment out that line to remove the limit. https://github.com/ma-compbio/Higashi/blob/1333de29ac1d808906d81409176c7dbd0cf2558f/higashi/Higashi_wrapper.py#L452
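One way to undo a limit set by library code without editing the installed sources (a sketch, assuming the linked line lowers the soft RLIMIT_NOFILE via resource.setrlimit) is to restore the soft limit to the hard limit right after the import:

```python
import resource

def restore_nofile_limit():
    # Raise the soft open-file limit back to the hard limit; no root needed.
    soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
    if soft < hard:
        resource.setrlimit(resource.RLIMIT_NOFILE, (hard, hard))
    return resource.getrlimit(resource.RLIMIT_NOFILE)

# Call this right after the import that lowers the limit, e.g. after
# `from higashi.Higashi_wrapper import Higashi` (module path from the traceback above).
print(restore_nofile_limit())
```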
Solved!
Oh, I see, thanks for spotting this. I'll increase that limit in the code as well.