Parallel processors mode is not working

Open jordana-olive opened this issue 1 year ago • 14 comments

Hi guys, I'm trying to run with parallel processors, but I realized that it is not working. It is running with only one processor, which is why it is taking forever. What should I do? Do I need to set something up during the installation?

My command line (the server has 40 processors and 500 GB RAM; the conda environment is activated):

    run_isoncorrect --t 20 --fastq_folder 01-isonclustering/02-clustered-fastq/ --outfolder 03-ONT-fastq-corrected

jordana-olive avatar Nov 23 '22 15:11 jordana-olive

I just saw the specifications again. I'll try running it from a .sh script.

jordana-olive avatar Nov 23 '22 15:11 jordana-olive

Ok, great. It should work with multiple cores using the run_isoncorrect --t 20 command. Let me know how it goes.

ksahlin avatar Nov 23 '22 17:11 ksahlin

Hi Sahlin, I realized that the last clusters are running with fewer processors than the first ones. Now it is taking forever (~8 hours per cluster). Are these clusters the longest ones?

[Image: Screenshot from 2022-11-28 09-22-51]

jordana-olive avatar Nov 28 '22 14:11 jordana-olive

If you have a few very large clusters, you can(/should) use --split_wrt_batches.

According to the documentation, this option

--split_wrt_batches   Process reads per batch (of max_seqs sequences) 
                       instead of per cluster. Significantly decrease runtime when few 
                       very large clusters are less than the number of cores used.

Here max_seqs is typically 1000 or 2000. This mode speeds things up a lot when a few very large clusters are present. We used it for the SIRV dataset (in the paper), where a single cluster contained half of the reads.
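To illustrate why this helps, here is a rough sketch (not isONcorrect's actual code). It assumes, as a simplification, that the cost of a job is proportional to its number of reads. The helper names `split_into_batches` and `makespan` and the cluster sizes are made up for the example.

```python
import heapq

def split_into_batches(cluster_sizes, max_seqs):
    """Split each cluster into chunks of at most max_seqs reads."""
    batches = []
    for size in cluster_sizes:
        full, rest = divmod(size, max_seqs)
        batches.extend([max_seqs] * full)
        if rest:
            batches.append(rest)
    return batches

def makespan(job_sizes, cores):
    """Finish time when jobs are handed, largest first, to the next free core."""
    finish = [0] * cores
    heapq.heapify(finish)
    for size in sorted(job_sizes, reverse=True):
        earliest = heapq.heappop(finish)
        heapq.heappush(finish, earliest + size)
    return max(finish)

# Hypothetical dataset: one huge cluster and a few small ones.
clusters = [50000, 3000, 2000, 1000]
cores = 20

per_cluster = makespan(clusters, cores)
per_batch = makespan(split_into_batches(clusters, 2000), cores)
print(per_cluster, per_batch)
```

With per-cluster jobs, the single huge cluster keeps one core busy while the others sit idle; with batches of max_seqs reads, the same work is spread over all cores.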

ksahlin avatar Nov 28 '22 14:11 ksahlin

Hi Sahlin, I canceled my last script (without --split_wrt_batches), but now I have another issue. It seems to have stopped at the last cluster, and the program cannot finish properly. I ran the test data (100 reads) and it worked well. I don't know what is going on:

My script:

    #!/bin/bash

    # Pipeline to get high-quality full-length reads from ONT cDNA sequencing

    # Set path to output and number of cores
    root_out="03-correction"
    cores=20
    mkdir -p $root_out
    run_isoncorrect --t $cores --fastq_folder 01-isonclustering/02-clustered-fastq/ --outfolder $root_out --split_wrt_batches

The error:

    Running isoncorrect batch_id:100000_0...
    multiprocessing.pool.RemoteTraceback:
    """
    Traceback (most recent call last):
      File "/home/eniac/miniconda3/envs/isoncorrect/lib/python3.11/multiprocessing/pool.py", line 125, in worker
        result = (True, func(*args, **kwds))
                        ^^^^^^^^^^^^^^^^^^^
      File "/home/eniac/miniconda3/envs/isoncorrect/bin/run_isoncorrect", line 94, in isoncorrect
        subprocess.check_call([ "/usr/bin/time", isoncorrect_exec, "--fastq", read_fastq_file, "--outfolder", outfolder,
      File "/home/eniac/miniconda3/envs/isoncorrect/lib/python3.11/subprocess.py", line 413, in check_call
        raise CalledProcessError(retcode, cmd)
    subprocess.CalledProcessError: Command '['/usr/bin/time', '/home/eniac/miniconda3/envs/isoncorrect/bin/isONcorrect', '--fastq', '/tmp/tmpl_shyp5i/split_in_batches/100000_0.fastq', '--outfolder', '03-correction/100000_0', '--exact_instance_limit', '50', '--max_seqs', '2000', '--k', '9', '--w', '20', '--xmin', '18', '--xmax', '80', '--T', '0.1']' returned non-zero exit status 1.
    """

    The above exception was the direct cause of the following exception:

    Traceback (most recent call last):
      File "/home/eniac/miniconda3/envs/isoncorrect/bin/run_isoncorrect", line 365, in <module>
        main(args)
      File "/home/eniac/miniconda3/envs/isoncorrect/bin/run_isoncorrect", line 281, in main
        for x in pool.imap_unordered(isoncorrect, instances):
      File "/home/eniac/miniconda3/envs/isoncorrect/lib/python3.11/multiprocessing/pool.py", line 873, in next
        raise value
    subprocess.CalledProcessError: Command '['/usr/bin/time', '/home/eniac/miniconda3/envs/isoncorrect/bin/isONcorrect', '--fastq', '/tmp/tmpl_shyp5i/split_in_batches/100000_0.fastq', '--outfolder', '03-correction/100000_0', '--exact_instance_limit', '50', '--max_seqs', '2000', '--k', '9', '--w', '20', '--xmin', '18', '--xmax', '80', '--T', '0.1']' returned non-zero exit status 1.

jordana-olive avatar Nov 30 '22 17:11 jordana-olive

Update: with the test data (100 reads), it did not work either. But since it is a small dataset, the program was still able to generate the final fastq for each cluster.

jordana-olive avatar Nov 30 '22 18:11 jordana-olive

If the file /tmp/tmpl_shyp5i/split_in_batches/100000_0.fastq is still there, could you try running:

    /usr/bin/time /home/eniac/miniconda3/envs/isoncorrect/bin/isONcorrect --fastq \
        /tmp/tmpl_shyp5i/split_in_batches/100000_0.fastq --outfolder 03-correction/100000_0 \
        --exact_instance_limit 50 --max_seqs 2000 --k 9 --w 20 --xmin 18 --xmax 80 --T 0.1

This is the instance that generates an error.
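The reason for running the instance by hand is that the traceback from `subprocess.check_call` only reports the exit status, not what the child process printed. A minimal sketch of capturing that output from Python instead is below; `run_and_capture` is a hypothetical helper, and the failing command is a stand-in child process, not the real isONcorrect call.

```python
import subprocess
import sys

def run_and_capture(cmd):
    """Run cmd, returning (exit_code, stderr_text) instead of raising."""
    proc = subprocess.run(cmd, capture_output=True, text=True)
    return proc.returncode, proc.stderr

# Stand-in for the failing isONcorrect call: a child process that
# writes a message to stderr and exits with status 1.
failing_cmd = [sys.executable, "-c",
               "import sys; sys.stderr.write('real error here\\n'); sys.exit(1)"]
code, err = run_and_capture(failing_cmd)
print(code, err.strip())
```

Passing the argument list printed in the `CalledProcessError` message to such a helper should surface the underlying error text, provided the input file still exists.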

ksahlin avatar Nov 30 '22 19:11 ksahlin

Perhaps isONcorrect also logs the error for this in a .stderr file somewhere in the output folder, 03-correction/100000_0. I forget whether I implemented that. If so, you could check the error in that file.
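If such files exist, a short script can gather them across all batch folders. This is only a sketch: the glob pattern `*stderr*` is a guess at the file naming, and `collect_stderr_tails` is a hypothetical helper, not part of isONcorrect. The demo builds a throwaway folder instead of reading real output.

```python
from pathlib import Path
import tempfile

def collect_stderr_tails(outfolder, pattern="*stderr*", n_lines=5):
    """Map each matching non-empty file under outfolder to its last n_lines lines."""
    tails = {}
    for path in sorted(Path(outfolder).rglob(pattern)):
        if path.is_file() and path.stat().st_size > 0:
            lines = path.read_text(errors="replace").splitlines()
            tails[str(path)] = lines[-n_lines:]
    return tails

# Demo on a temporary folder that mimics one batch directory.
with tempfile.TemporaryDirectory() as out:
    batch = Path(out) / "100000_0"
    batch.mkdir()
    (batch / "stderr.txt").write_text("line 1\nFileNotFoundError: example\n")
    demo_values = list(collect_stderr_tails(out, n_lines=1).values())

print(demo_values)
```

On a real run, `collect_stderr_tails("03-correction")` would be the call to try, adjusting the pattern to whatever file names are actually present.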

ksahlin avatar Nov 30 '22 19:11 ksahlin

Thank you for replying. I realized that the error changes slightly depending on the Python version, but it is still not working. I tried installing via GitHub (Python 3.8); via conda, Python 3.11 is installed automatically. Then I reinstalled via conda forcing python=3.10. It seems to run more clusters, but I still have the same issue as above.

I also realized that only a few clusters were done, so it is not viable to run cluster by cluster in the tmp folder (and since I had this issue, I'm not sure all my clusters were there).

It seems that some clusters ran to completion, others stopped in the middle of the process, and others just crashed. If you have any idea what is going on, I would really appreciate it. Thanks :)

Also, the stderr file in the failed clusters said:

    Traceback (most recent call last):
      File "/home/eniac/miniconda3/envs/isoncorrect/bin/isONcorrect", line 1551, in <module>
        main(args)
      File "/home/eniac/miniconda3/envs/isoncorrect/bin/isONcorrect", line 1213, in main
        all_reads = { i + 1 : (acc, seq, qual) for i, (acc, (seq, qual)) in enumerate(help_functions.readfq(open(args.fastq, 'r')))}
    FileNotFoundError: [Errno 2] No such file or directory: '/tmp/tmpiq04o1ed/split_in_batches/100000_0.fastq'
    Command exited with non-zero status 1
    0.62user 1.60system 0:00.27elapsed 805%CPU (0avgtext+0avgdata 36808maxresident)k
    0inputs+0outputs (0major+5432minor)pagefaults 0swaps

jordana-olive avatar Nov 30 '22 22:11 jordana-olive

The error you reported is just because the file is not there anymore (these files get flushed from the tmp folder regularly by the system). It is not the actual error you encounter when running.

Another guess: remove the output folder 03-correction. Perhaps old files left there by earlier attempts interfere with your output.

Otherwise, perhaps you could copy the offending file from the temp folder while it is present and run the command on that file as follows:

    /usr/bin/time /home/eniac/miniconda3/envs/isoncorrect/bin/isONcorrect --fastq \
        THE_FAILING_TMP_FILE.fastq --outfolder 03-correction/100000_0 \
        --exact_instance_limit 50 --max_seqs 2000 --k 9 --w 20 --xmin 18 --xmax 80 --T 0.1

isONcorrect will tell you at the start of the run which tmp folder it is working in by writing: Temporary workdirectory: [HERE IS THE PATH]

ksahlin avatar Dec 01 '22 07:12 ksahlin

Hi Sahlin. Yes, I checked the previous issues; my error is similar to others involving the tmp folder. (I am deleting the previous output folder before each run.) I checked the run_isoncorrect script and found the "/usr/bin/time" lines (now replaced with "time"), but the symbolic link in the tmp folder still can't be opened (my version is 0.0.8).

Since several clusters have the same problem, I'm not sure it is viable to copy them all and run each one with isONcorrect.

When I check the failed cluster, I get this:

    ls -lah /tmp/tmpzgp8a2tb/split_in_batches/100000_0.fastq
    lrwxrwxrwx 1 eniac eniac 49 Dec 1 11:04 /tmp/tmpzgp8a2tb/split_in_batches/100000_0.fastq -> 01-isonclustering/02-clustered-fastq/100000.fastq

Nonetheless, I can't open the file:

    head /tmp/tmpzgp8a2tb/split_in_batches/100000_0.fastq
    head: cannot open '/tmp/tmpzgp8a2tb/split_in_batches/100000_0.fastq' for reading: No such file or directory
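One possible explanation, offered as a guess rather than a confirmed diagnosis: the `ls` output shows a link target stored as a relative path, and a relative symlink target is resolved against the directory that contains the link, not against the directory the program was started from. The sketch below reproduces that behaviour with throwaway directories; `link_status` and the file names are made up for the example.

```python
import os
import tempfile

def link_status(link_dir, target, name):
    """Create link_dir/name -> target and report (is_symlink, target_reachable)."""
    link_path = os.path.join(link_dir, name)
    os.symlink(target, link_path)  # the target string is stored verbatim
    return os.path.islink(link_path), os.path.exists(link_path)

with tempfile.TemporaryDirectory() as project, tempfile.TemporaryDirectory() as tmp:
    rel_target = os.path.join("clustered", "100000.fastq")
    os.mkdir(os.path.join(project, "clustered"))
    with open(os.path.join(project, rel_target), "w") as fh:
        fh.write("@read1\nACGT\n+\nIIII\n")

    # Relative target: looked up inside tmp, where no such file exists.
    relative = link_status(tmp, rel_target, "rel.fastq")
    # Absolute target: reachable no matter where the link lives.
    absolute = link_status(tmp, os.path.join(project, rel_target), "abs.fastq")

print(relative, absolute)
```

If this is what happens here, passing an absolute path to --fastq_folder might be worth trying, though the thread does not confirm that.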

I'll keep trying to solve this. If you have any tips, I'd appreciate them. Thanks

jordana-olive avatar Dec 01 '22 16:12 jordana-olive

Not sure why you have a symbolic link to the file?

You need to copy it completely:

    cp /tmp/tmpzgp8a2tb/split_in_batches/100000_0.fastq THE_FAILING_TMP_FILE.fastq

At the moment you only seem to have a symbolic link, which means that if the tmp file disappears you no longer have access to it.
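A small sketch of the difference between keeping a link and keeping a full copy, using throwaway files. `rescue_copy` is a hypothetical helper; it relies on `shutil.copyfile`, which follows symlinks by default and writes a regular file. Note that it raises `FileNotFoundError` if the link is already dangling.

```python
import os
import shutil
import tempfile

def rescue_copy(src, dest):
    """Copy the contents behind src (following symlinks) into a regular file at dest."""
    shutil.copyfile(src, dest)  # follow_symlinks=True by default
    return dest

with tempfile.TemporaryDirectory() as work:
    original = os.path.join(work, "100000.fastq")
    link = os.path.join(work, "100000_0.fastq")
    saved = os.path.join(work, "THE_FAILING_TMP_FILE.fastq")
    with open(original, "w") as fh:
        fh.write("@read1\nACGT\n+\nIIII\n")
    os.symlink(original, link)

    rescue_copy(link, saved)
    os.remove(original)                 # the source disappears
    link_alive = os.path.exists(link)   # the symlink now dangles
    copy_is_link = os.path.islink(saved)
    with open(saved) as fh:
        copy_first_line = fh.readline().strip()

print(link_alive, copy_is_link, copy_first_line)
```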

ksahlin avatar Dec 01 '22 16:12 ksahlin

I think we are not on the same page... The program itself creates the symlinks to the files in order to run --split_wrt_batches...

jordana-olive avatar Dec 01 '22 17:12 jordana-olive

Okay, how about this.

On line 218 in run_isoncorrect, please change the line to tmp_work_dir = "XX-correction/" (or whatever path you want on your system). This way all the files will be kept in XX-correction/, and you will still have them should anything break along the way.

Then we can locate the file(s) where it is going wrong and run them individually to get the error message.
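The idea behind this change can be sketched as follows. How run_isoncorrect actually sets `tmp_work_dir` is not shown in this thread, so the use of `tempfile.mkdtemp()` as the default is an assumption, and `choose_work_dir` is a hypothetical helper.

```python
import os
import tempfile

def choose_work_dir(persistent_dir=None):
    """Return a work directory: persistent_dir if given, else a fresh temp dir."""
    if persistent_dir is None:
        return tempfile.mkdtemp()  # may be cleaned up by the system later
    os.makedirs(persistent_dir, exist_ok=True)
    return os.path.abspath(persistent_dir)

# Demo inside a throwaway base directory.
with tempfile.TemporaryDirectory() as base:
    work_dir = choose_work_dir(os.path.join(base, "XX-correction"))
    work_dir_exists = os.path.isdir(work_dir)
    work_dir_name = os.path.basename(work_dir)

print(work_dir_exists, work_dir_name)
```

A directory you choose yourself is not subject to automatic cleanup of /tmp, so the batch files stay available for rerunning a failed instance.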

ksahlin avatar Dec 02 '22 08:12 ksahlin