Stop with OSError when run "higashi_model.train_for_imputation_nbr_0()"

Open yufanzhouonline opened this issue 1 year ago • 6 comments

Hi Ruochiz,

Higashi ran very well without any errors when the resolution was 1M (JSON file option "resolution") on my CentOS 7 system.

However, when I increased the resolution, no matter to which value, the following OSError always occurred after the step "higashi_model.train_for_imputation_nbr_0()":

[ Epoch 42 of 45 ]
  • (Train) bce: 0.3479, mse: 0.0000, acc: 96.450 %, pearson: 0.571, spearman: 0.634, elapse: 97.359 s
  • (Valid) bce: 2.7560, acc: 97.025 %, pearson: 0.187, spearman: 0.635, elapse: 0.296 s no improve: 1
[ Epoch 43 of 45 ]
  • (Train) bce: 0.3542, mse: 0.0000, acc: 96.321 %, pearson: 0.557, spearman: 0.633, elapse: 96.495 s
  • (Valid) bce: 3.0346, acc: 96.726 %, pearson: 0.123, spearman: 0.633, elapse: 0.355 s no improve: 2
[ Epoch 44 of 45 ]
  • (Train) bce: 0.3466, mse: 0.0000, acc: 96.488 %, pearson: 0.599, spearman: 0.634, elapse: 99.352 s
  • (Valid) bce: 3.6074, acc: 97.016 %, pearson: 0.157, spearman: 0.636, elapse: 0.356 s no improve: 3
    • (Validation) : 0%| | 0/10 [00:00<?, ?it/s]
Traceback (most recent call last):
  File "", line 1, in
  File "/data/yufan/biotools/anaconda/anaconda2023/envs/higashi/lib/python3.9/site-packages/higashi/Higashi_wrapper.py", line 1367, in train_for_imputation_nbr_0
    self.train(
  File "/data/yufan/biotools/anaconda/anaconda2023/envs/higashi/lib/python3.9/site-packages/higashi/Higashi_wrapper.py", line 1141, in train
    valid_bce_loss, valid_accu, valid_auc1, valid_auc2, _, _ = self.eval_epoch(validation_data_generator)
  File "/data/yufan/biotools/anaconda/anaconda2023/envs/higashi/lib/python3.9/site-packages/higashi/Higashi_wrapper.py", line 994, in eval_epoch
    pool = ProcessPoolExecutor(max_workers=cpu_num)
  File "/data/yufan/biotools/anaconda/anaconda2023/envs/higashi/lib/python3.9/concurrent/futures/process.py", line 658, in __init__
    self._result_queue = mp_context.SimpleQueue()
  File "/data/yufan/biotools/anaconda/anaconda2023/envs/higashi/lib/python3.9/multiprocessing/context.py", line 113, in SimpleQueue
    return SimpleQueue(ctx=self.get_context())
  File "/data/yufan/biotools/anaconda/anaconda2023/envs/higashi/lib/python3.9/multiprocessing/queues.py", line 340, in __init__
    self._reader, self._writer = connection.Pipe(duplex=False)
  File "/data/yufan/biotools/anaconda/anaconda2023/envs/higashi/lib/python3.9/multiprocessing/connection.py", line 527, in Pipe
    fd1, fd2 = os.pipe()
OSError: [Errno 24] Too many open files
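
For readers unfamiliar with Errno 24: the last frame of the traceback is a plain os.pipe() call, which fails once the process has reached its soft limit on open file descriptors. The sketch below is not Higashi code; it is a minimal, standalone reproduction under the assumption of a Unix-like system, and the function name pipes_until_emfile and the limit of 64 are made up for illustration.

```python
import errno
import os
import resource


def pipes_until_emfile(soft_limit=64):
    """Temporarily lower the soft open-file limit, then open pipes until os.pipe() fails.

    Returns (errno of the failure, number of descriptors opened before it).
    `soft_limit` is an arbitrary illustrative value, not one taken from Higashi.
    """
    old_soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
    if hard != resource.RLIM_INFINITY:
        soft_limit = min(soft_limit, hard)  # the soft limit may not exceed the hard limit
    resource.setrlimit(resource.RLIMIT_NOFILE, (soft_limit, hard))
    fds = []
    try:
        while True:
            fds.extend(os.pipe())  # each pipe consumes two descriptors
    except OSError as exc:
        err = exc.errno
    finally:
        for fd in fds:
            os.close(fd)
        # Restore the original soft limit so the rest of the process is unaffected.
        resource.setrlimit(resource.RLIMIT_NOFILE, (old_soft, hard))
    return err, len(fds)


if __name__ == "__main__":
    err, opened = pipes_until_emfile()
    print(errno.errorcode[err], "after opening", opened, "descriptors")
```

The point of the sketch is only that the error depends on the limit in effect inside the Python process at the moment of the call, which is not necessarily the value the shell reports.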

Could you please let me know the cause of this issue?

Thanks a lot.

Yufan (Harry) Zhou

yufanzhouonline avatar Aug 04 '23 23:08 yufanzhouonline

Hmm. Could you try re-running that with fewer CPU workers?

Or you can try the following to increase the maximum number of open files:

# Check current limit
$ ulimit -n
256

# Raise limit to 2048
# Only affects processes started from this shell
$ ulimit -n 2048

$ ulimit -n
2048
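
The same check and adjustment can also be made from inside Python with the standard-library resource module, which is useful when the job is not launched from an interactive shell. This is a minimal sketch, not part of Higashi; the helper name raise_nofile_limit and the target of 4096 are assumptions chosen for illustration.

```python
import resource


def raise_nofile_limit(target=4096):
    """Raise the soft open-file limit of the current process to `target` if it is lower.

    Returns the (soft, hard) pair in effect afterwards.
    """
    soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
    if hard != resource.RLIM_INFINITY:
        target = min(target, hard)  # an unprivileged process cannot go above the hard limit
    if soft != resource.RLIM_INFINITY and soft < target:
        resource.setrlimit(resource.RLIMIT_NOFILE, (target, hard))
    return resource.getrlimit(resource.RLIMIT_NOFILE)


if __name__ == "__main__":
    print("open-file limits (soft, hard):", raise_nofile_limit())
```

Like ulimit -n in the shell, this only affects the current process and the processes it starts afterwards.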

ruochiz avatar Aug 14 '23 17:08 ruochiz

Hi Ruochi, thank you so much for your reply. I have increased the limit with the command “ulimit -n 4096” and used only 8 CPUs on a 128-CPU server, but Higashi still fails with the same error as before. I also contacted Dr. Jian Ma for help, and he suggested that I continue the discussion with you on GitHub. Could you please help me solve this issue? Thanks.

yufanzhouonline avatar Sep 21 '23 22:09 yufanzhouonline

Hmm. I must say this error is really strange, but it looks like it is due to how Python multiprocessing is handled by the Linux system. Do you notice memory being used up when the error appears? It's possible that the system is writing to the swap partition when it runs out of memory.
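
One way to tell descriptor exhaustion apart from memory pressure is to log both just before the failing step. The sketch below is Linux-only because it reads /proc, and it is not part of Higashi; the helper names fd_usage and swap_used_kib are hypothetical.

```python
import os
import resource


def fd_usage():
    """Return (open descriptors in this process, soft RLIMIT_NOFILE).

    The count includes the descriptor that os.listdir() itself opens briefly
    to read /proc/self/fd, so it is one higher than the steady-state value.
    """
    open_fds = len(os.listdir("/proc/self/fd"))
    soft, _hard = resource.getrlimit(resource.RLIMIT_NOFILE)
    return open_fds, soft


def swap_used_kib():
    """Return swap currently in use, in KiB, parsed from /proc/meminfo."""
    fields = {}
    with open("/proc/meminfo") as fh:
        for line in fh:
            key, _, rest = line.partition(":")
            parts = rest.split()
            if parts:
                fields[key] = int(parts[0])
    return fields.get("SwapTotal", 0) - fields.get("SwapFree", 0)


if __name__ == "__main__":
    used, limit = fd_usage()
    print(f"open descriptors: {used} (soft limit {limit})")
    print(f"swap in use: {swap_used_kib()} KiB")
```

If the descriptor count is close to the soft limit while swap usage stays flat, the open-file limit rather than memory is the more likely culprit.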

ruochiz avatar Sep 24 '23 19:09 ruochiz

Same issue.

$ ulimit -n
1048576

seraphzl avatar Aug 29 '24 09:08 seraphzl

The code sets a low value for this limit at the line linked below. You can change it to a larger value, or comment out that line to remove the limit. https://github.com/ma-compbio/Higashi/blob/1333de29ac1d808906d81409176c7dbd0cf2558f/higashi/Higashi_wrapper.py#L452
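
This would explain why raising the limit in the shell had no effect: a limit set inside the Python process replaces whatever value the shell handed over, and worker processes started afterwards inherit it. The sketch below illustrates that mechanism under the assumption that the linked line lowers the soft limit through something like resource.setrlimit; the line itself is not reproduced here, and the value 256 and the function names are made up for illustration.

```python
import resource
import subprocess
import sys


def soft_limit_seen_by_child():
    """Start a fresh Python process and report the soft open-file limit it inherits."""
    code = "import resource; print(resource.getrlimit(resource.RLIMIT_NOFILE)[0])"
    out = subprocess.run(
        [sys.executable, "-c", code], capture_output=True, text=True, check=True
    )
    return int(out.stdout)


def demo_in_process_override(low=256):
    """Set the soft limit in-process and show that it, not the shell value, is in effect.

    Returns (soft limit before, soft limit after the call, soft limit seen by a child).
    """
    before, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
    if hard != resource.RLIM_INFINITY:
        low = min(low, hard)
    resource.setrlimit(resource.RLIMIT_NOFILE, (low, hard))
    try:
        inside = resource.getrlimit(resource.RLIMIT_NOFILE)[0]
        child = soft_limit_seen_by_child()
    finally:
        # Put the original soft limit back so the demo leaves no trace.
        resource.setrlimit(resource.RLIMIT_NOFILE, (before, hard))
    return before, inside, child


if __name__ == "__main__":
    before, inside, child = demo_in_process_override()
    print(f"from shell: {before}, after in-process call: {inside}, seen by child: {child}")
```

Whatever ulimit -n reports in the shell, the two values after the in-process call are the ones that matter to the pool workers.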

Solved!

seraphzl avatar Aug 29 '24 14:08 seraphzl

Oh, I see, thx for spotting this. I'll increase that in the code as well.

ruochiz avatar Aug 29 '24 15:08 ruochiz