LongQC
OSError: [Errno 5] Input/output error
Hi, yfukasawa,
I am running LongQC on a batch of my PacBio unaligned BAMs. The small samples run smoothly, but one somewhat larger BAM file takes a very long time and finally reports an OSError. My command is:
longqc sampleqc -x pb-sequel -p 8 -o HG002_90pM_read_LongQC ./m64304e_211014_201856.reads.bam
I cut out the error-related messages below. Could you help me figure out the reason? Additionally, it took close to two weeks to run this sample before hitting the OSError. Can I specify more nodes (such as -p 32) to linearly speed it up? Thank you in advance.
Wenchao
File "/opt/LongQC/longQC.py", line 63, in main
args.handler(args)
File "/opt/LongQC/longQC.py", line 829, in command_sample
tpl = env.get_template('web_summary.tpl.html')
File "/opt/conda/lib/python3.9/site-packages/jinja2/environment.py", line 997, in get_template
File "/opt/conda/lib/python3.9/site-packages/jinja2/environment.py", line 958, in _load_template
File "/opt/conda/lib/python3.9/site-packages/jinja2/loaders.py", line 125, in load
File "/opt/conda/lib/python3.9/site-packages/jinja2/loaders.py", line 201, in get_source
OSError: [Errno 5] Input/output error
BTW, I run LongQC on an LSF cluster, and the version is LongQC 1.2.0c. Hope to hear from you.
Hi @wzhang42,
Thank you for your interest in our tool. For a large file, this is typical for subreads.bam from Sequel II; indeed, it should take much longer and need more RAM. Because LongQC computes overlaps between subsampled reads (5k by default) and all reads, the computation time is roughly proportional to the input size.
Having said that, the real cause seems to be different in your case. From the file name, I assume you have the new file format from Sequel II (most likely Sequel IIe), the so-called reads.bam.
This is a new format introduced recently, and LongQC ver. 1.2.0c cannot yet handle it (an item on my plate). reads.bam contains HiFi, CCS, and subreads, so you would need to extract just the HiFi reads from your file.
This doc might be of your help: https://ccs.how/faq/reads-bam.html
More specifically, you can follow this tutorial to get HiFi reads from reads.bam.
Once you have the HiFi reads, please use the -x pb-hifi option for the new file; then, I believe this issue will disappear.
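A minimal sketch of that workflow, assuming the extracthifi tool from PacBio's pbbioconda channel is installed (the output names below are placeholders, not from the thread):

```shell
# Extract HiFi reads from a Sequel IIe reads.bam, then run LongQC in pb-hifi mode.
# Assumptions: extracthifi (pbbioconda) and longqc are on PATH; filenames are illustrative.
IN=m64304e_211014_201856.reads.bam
HIFI=${IN%.reads.bam}.hifi.bam    # -> m64304e_211014_201856.hifi.bam

if command -v extracthifi >/dev/null 2>&1 && command -v longqc >/dev/null 2>&1 && [ -e "$IN" ]; then
    extracthifi "$IN" "$HIFI"     # keeps only HiFi reads (read quality rq >= 0.99)
    longqc sampleqc -x pb-hifi -p 8 -o HG002_hifi_LongQC "$HIFI"
else
    echo "extracthifi/longqc not available; see https://ccs.how/faq/reads-bam.html"
fi
```

The guard simply skips the run when the tools or input are missing, so the snippet is safe to paste and adapt.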
I hope this helps.
Yoshinori
Hi, Yoshinori,
Many thanks for your reply. Yes, our data is reads.bam from PacBio Sequel. I can use bam2fastq to convert reads.bam to .fastq, and for some small files it works fine whether I use reads.bam or the converted .fastq. So I feel the main issue is the file size (which corresponds to the read number; the aligned .bam can be as large as 900 GB).
I agree that converting the large reads.bam to ccs.bam/hifi_ccs.bam is one option, since it would reduce the read number and file size, but we are currently interested in the LongQC metrics of the original reads.bam. I am interested in the subsampled-reads parameter, which I believe should significantly reduce the required RAM and running time. Could you share the command-line option to configure it? My guess is "-n NSAMPLE" (0-10000, default 5000); is that right? If so, and I change it from the default 5000 to 1000, will the running time be linearly reduced to 1/5 of the original? Additionally, can longqc sampleqc -p nNODE be parallelized for one large .bam so that the running time is linearly reduced?
Thank you again. Much appreciated!
Wenchao
Hi Wenchao,
"If my guess is correct, then if I configure this parameter from the default 5000 to 1000"
Generally speaking, yes. The actual computation time depends on the minimizers in your dataset, but in any case it will be reduced.
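For concreteness, here is Wenchao's original command with the subsampling reduced from the default 5000 to 1000; the output directory name is just illustrative, and the -n flag is the one Wenchao guessed:

```shell
# Same run as before, but subsampling 1000 reads instead of the default 5000,
# so the overlap step does roughly 1000/5000 = 1/5 of the work.
if command -v longqc >/dev/null 2>&1 && [ -e ./m64304e_211014_201856.reads.bam ]; then
    longqc sampleqc -x pb-sequel -p 8 -n 1000 \
        -o HG002_90pM_read_LongQC_n1000 \
        ./m64304e_211014_201856.reads.bam
else
    echo "longqc or input file not available; skipping"
fi
```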
"can be parallelized and the running time"
The bottleneck is the minimap2 computation in the flow, and minimap2 is not designed to be parallelized across multiple nodes. Technically, another layer could be added, but it is not available in the current version.
"But we are now interested in the LongQC metrics of the original reads.bam."
Thank you for your interest in our tools in this regard. This is actually a new challenge, and the reason I recommended extracting HiFi reads is not just data size. Most mapping software tools assume a single data profile, which is common for output files from sequencing machines, but as far as I know, reads.bam is the very first exception. -x pb-hifi and -x pb-sequel use different parameters, and neither profile works well for the other. In reads.bam, multiple types of reads requiring different parameters actually co-exist to reduce file size (personal communication). Neither -x pb-hifi nor -x pb-sequel would return proper statistics for a whole reads.bam.
You could also extract the non-HiFi reads from reads.bam and run our tool on them in pb-sequel mode. I expect it would have some bias, since non-HiFi reads come from the worse-performing ZMWs in a HiFi run, so the stats would look worse than the real condition. I haven't tested this yet, but for now this is another technical reason I recommended extracting HiFi reads.
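A hedged sketch of that non-HiFi route, assuming samtools >= 1.12 (for the -e filter expression on the rq tag); this mirrors the caveat above that the approach is untested, and the filenames are placeholders:

```shell
# Keep only non-HiFi reads (read quality rq < 0.99) from reads.bam, then run
# LongQC in pb-sequel mode on them. Assumptions: samtools >= 1.12 and longqc
# are on PATH; the rq tag is present on every record; filenames are illustrative.
IN=m64304e_211014_201856.reads.bam
NONHIFI=${IN%.reads.bam}.nonhifi.bam

if command -v samtools >/dev/null 2>&1 && command -v longqc >/dev/null 2>&1 && [ -e "$IN" ]; then
    samtools view -b -e '[rq] < 0.99' -o "$NONHIFI" "$IN"
    longqc sampleqc -x pb-sequel -p 8 -o HG002_nonhifi_LongQC "$NONHIFI"
else
    echo "samtools/longqc or input file not available; skipping"
fi
```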
I hope this helps.
Yoshinori