
OSError: [Errno 5] Input/output error

Open wzhang42 opened this issue 2 years ago • 4 comments

Hi yfukasawa, I am running LongQC on a batch of my PacBio unaligned BAMs. The small samples run smoothly, but one slightly larger BAM file takes a very long time and finally reports an OSError. My command is:

longqc sampleqc -x pb-sequel -p 8 -o HG002_90pM_read_LongQC ./m64304e_211014_201856.reads.bam

I have cut out the error-related messages below. Could you help me figure out the reason? Additionally, it took close to two weeks to run this sample and end with this OSError; can I specify more nodes (such as -p 32) to speed it up linearly? Thank you in advance. Wenchao
  File "/opt/LongQC/longQC.py", line 63, in main
    args.handler(args)
  File "/opt/LongQC/longQC.py", line 829, in command_sample
    tpl = env.get_template('web_summary.tpl.html')
  File "/opt/conda/lib/python3.9/site-packages/jinja2/environment.py", line 997, in get_template
  File "/opt/conda/lib/python3.9/site-packages/jinja2/environment.py", line 958, in _load_template
  File "/opt/conda/lib/python3.9/site-packages/jinja2/loaders.py", line 125, in load
  File "/opt/conda/lib/python3.9/site-packages/jinja2/loaders.py", line 201, in get_source
OSError: [Errno 5] Input/output error

wzhang42 avatar Apr 08 '22 15:04 wzhang42

BTW, I am running LongQC on an LSF cluster, and the version is LongQC 1.2.0c. Hoping to hear from you.

wzhang42 avatar Apr 08 '22 15:04 wzhang42

Hi @wzhang42,

Thank you for your interest in our tool. For a large file this is typical: a subreads.bam from Sequel II indeed takes much longer and needs more RAM. Because LongQC computes overlaps between the subsampled reads (5k by default) and all reads, the computation time is roughly proportional to the input size.

That said, the real cause seems to be different in your case. From the file name, I assume you have the new file format from Sequel II (most likely Sequel IIe), the so-called reads.bam. This format was introduced recently, and LongQC ver. 1.2.0c is not yet able to handle it (an item on my plate). reads.bam contains HiFi, CCS, and subreads, and you would need to extract just the HiFi reads from your file.

This doc might be of help: https://ccs.how/faq/reads-bam.html More specifically, you can follow that tutorial to get HiFi reads from reads.bam. Once you have the HiFi reads, please use the -x pb-hifi option for the new file; then, I believe this issue will disappear.
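To make the suggested workflow concrete, the extraction step might look like the following. This is a minimal sketch, assuming the PacBio extracthifi tool (installable via bioconda) described in the linked FAQ; the input file name is taken from the command in the original report, and the output prefix is illustrative:

```shell
# Sketch only: extract HiFi reads from a Sequel IIe reads.bam, then run
# LongQC in pb-hifi mode. Assumes `extracthifi` (PacBio, via bioconda) is
# installed; see https://ccs.how/faq/reads-bam.html for the recommended flow.
READS_BAM=m64304e_211014_201856.reads.bam
HIFI_BAM="${READS_BAM%.reads.bam}.hifi.bam"

# Guarded so the sketch is a no-op on machines without the tools installed.
if command -v extracthifi >/dev/null 2>&1; then
    extracthifi "$READS_BAM" "$HIFI_BAM"
    longqc sampleqc -x pb-hifi -p 8 -o HG002_90pM_hifi_LongQC "$HIFI_BAM"
fi
```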

I hope this helps.

Yoshinori

yfukasawa avatar Apr 10 '22 14:04 yfukasawa

Hi Yoshinori, many thanks for your reply. Yes, our data is reads.bam from PacBio Sequel. I can use bam2fastq to convert the reads.bam to .fastq. For some small files it works fine whether I use reads.bam or the converted .fastq, so I feel the main issue is the file size (the file size corresponds to the read number; the aligned .bam can be as large as 900G). I agree that converting the large reads.bam to ccs.bam/hifi_ccs.bam is one possible option, which would reduce the read number and file size, but we are currently interested in the LongQC metrics of the original reads.bam.

I am interested in the subsampled-reads parameter, which I believe should significantly reduce the required RAM and running time. Could you share the command-line option to configure it? I guess it is "-n NSAMPLE" (0-10000), with a default of 5000, right? If my guess is correct and I change this parameter from the default 5000 to 1000, will the running time be linearly reduced to 1/5 of the original? Additionally, can longqc sampleqc -p nNODE be parallelized for one large .bam so that the running time is linearly reduced?
Thank you again. Much appreciated!
Wenchao

wzhang42 avatar Apr 14 '22 21:04 wzhang42

Hi Wenchao,

If my guess is correct, then if I configure this parameter from the default 5000 to 1000

Generally speaking, yes. The actual computation time depends on the minimizers in your dataset, but in any case it will be reduced.
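As a rough sanity check on that scaling (a sketch, not a measurement; the -n option is the NSAMPLE parameter quoted above, and the command at the end is only illustrative):

```shell
# Back-of-the-envelope: if the overlap stage scales roughly linearly with the
# subsample size (-n NSAMPLE), then -n 1000 vs. the default 5000 cuts that
# stage to about one fifth of the default time.
DEFAULT_N=5000
NEW_N=1000
echo "overlap stage: ~1/$((DEFAULT_N / NEW_N)) of the default time"

# The corresponding command (guarded so the sketch runs nowhere by accident):
if command -v longqc >/dev/null 2>&1; then
    longqc sampleqc -x pb-sequel -p 8 -n "$NEW_N" \
        -o HG002_90pM_read_LongQC ./m64304e_211014_201856.reads.bam
fi
```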

can be parallelized and the running time

The bottleneck would be the minimap2 computation in the flow, and minimap2 is not designed to parallelize across multiple nodes. Technically, another layer could be added, but at least it's not available in the current version.

But we are now interested in the LongQC metrics of the original reads.bam.

Thank you for your interest in our tools in this regard. This is actually a new challenge, and the reason I recommended extracting HiFi reads is not just data size. Most mapping software assumes a single data profile, which is common for output files from sequencing machines; as far as I know, reads.bam is the very first exception. -x pb-hifi and -x pb-sequel use different parameters, and neither profile works well for the other. In reads.bam, multiple types of reads requiring different parameters co-exist in one file to reduce file size (personal communication), so neither -x pb-hifi nor -x pb-sequel would return proper statistics for a whole reads.bam.

You can also extract the non-HiFi reads from reads.bam and run our tool against them in pb-sequel mode. I expect this would carry some bias, as non-HiFi reads come from the worse-performing ZMWs of a HiFi run, so the stats would look worse than the real condition. I haven't tested this yet, but for now this is another technical reason I recommended extracting the HiFi reads.
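One untested way to sketch that non-HiFi extraction is to filter on the per-read quality tag. This assumes samtools 1.12 or later (for -e filter expressions) and that HiFi reads are defined as rq >= 0.99, per the FAQ linked earlier; it is a sketch, not a validated workflow:

```shell
# Sketch only (untested): keep reads whose read-quality tag rq is below the
# HiFi cutoff (0.99), i.e. the non-HiFi fraction of a reads.bam, then run
# LongQC in pb-sequel mode on the result. Requires samtools >= 1.12.
READS_BAM=m64304e_211014_201856.reads.bam
NONHIFI_BAM="${READS_BAM%.reads.bam}.nonhifi.bam"

if command -v samtools >/dev/null 2>&1; then
    samtools view -b -e '[rq] < 0.99' "$READS_BAM" -o "$NONHIFI_BAM"
    longqc sampleqc -x pb-sequel -p 8 -o HG002_90pM_nonhifi_LongQC "$NONHIFI_BAM"
fi
```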

I hope this helps.

Yoshinori

yfukasawa avatar Apr 17 '22 07:04 yfukasawa