cutadapt
"OverflowError: FASTA/FASTQ record does not fit into buffer" when trimming ONT reads
Hi @marcelm
I'm using cutadapt 4.4 with Python 3.10.12 and I'm running into this error when trimming the ultra-long ULK114 adapters from a specific ONT PromethION flowcell. I'm wondering whether it is related to the run having a few megabase-sized reads.
This is a description of the content of the file:
[diego.terrones@clip-login-1 6890b2ec397f656fd26681dc2d5e9b]$ seqkit stat -a reads.filtered.fq.gz
file format type num_seqs sum_len min_len avg_len max_len Q1 Q2 Q3 sum_gap N50 Q20(%) Q30(%) GC(%)
reads.filtered.fq.gz FASTQ DNA 100,077 4,291,610,866 1,032 42,883.1 1,124,436 18,573 32,187 56,211 0 58,783 90.34 82.26 46.2
This is the command:
cutadapt --cores 4 -g GCTTGGGTGTTTAACCGTTTTCGCATTTATCGTGAAACGCTTTCGCGTTTTTCGTGCGCCGCTTCA --times 5 --error-rate 0.3 --overlap 30 -m 1000 -o trimmed.fq.gz reads.filtered.fq.gz
This is the output:
This is cutadapt 4.4 with Python 3.10.12
Command line parameters: --cores 4 -g GCTTGGGTGTTTAACCGTTTTCGCATTTATCGTGAAACGCTTTCGCGTTTTTCGTGCGCCGCTTCA --times 5 --error-rate 0.3 --overlap 30 -m 1000 -o trimmed.fq.gz reads.filtered.fq.gz
Processing single-end reads on 4 cores ...
ERROR: Traceback (most recent call last):
File "/usr/local/lib/python3.10/dist-packages/cutadapt/runners.py", line 87, in run
for index, chunks in enumerate(self._read_chunks(*files)):
File "/usr/local/lib/python3.10/dist-packages/cutadapt/runners.py", line 98, in _read_chunks
for chunk in dnaio.read_chunks(files[0], self.buffer_size):
File "/usr/local/lib/python3.10/dist-packages/dnaio/chunks.py", line 109, in read_chunks
raise OverflowError("FASTA/FASTQ record does not fit into buffer")
OverflowError: FASTA/FASTQ record does not fit into buffer
[The same traceback is printed by each of the other three worker processes.]
Traceback (most recent call last):
File "/usr/local/bin/cutadapt", line 8, in <module>
sys.exit(main_cli())
File "/usr/local/lib/python3.10/dist-packages/cutadapt/cli.py", line 1061, in main_cli
main(sys.argv[1:])
File "/usr/local/lib/python3.10/dist-packages/cutadapt/cli.py", line 1131, in main
stats = run_pipeline(
File "/usr/local/lib/python3.10/dist-packages/cutadapt/runners.py", line 469, in run_pipeline
statistics = runner.run()
File "/usr/local/lib/python3.10/dist-packages/cutadapt/runners.py", line 350, in run
chunk_index = self._try_receive(connection)
File "/usr/local/lib/python3.10/dist-packages/cutadapt/runners.py", line 386, in _try_receive
raise e
OverflowError: FASTA/FASTQ record does not fit into buffer
Many thanks!
Hi, that’s interesting. By default, the largest FASTQ record may be 4 million bytes. Since this includes the quality values, the maximum read length is about 2 Mbp. I thought this was enough ...
There is actually a hidden (and, I believe, undocumented) command-line option --buffer-size that you can use to increase the buffer size. Either find out the largest read length, multiply it by two and round it up a bit, or try increasingly larger sizes. For example, --buffer-size=16000000 would allow reads of at most approx. 8 Mbp.
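If you want to check what the largest record in your file actually is, here is a rough sketch using dnaio (the library cutadapt uses for parsing); the +6 is only an approximate allowance for the '@', '+' and newline characters:

import dnaio

# Estimate the size of the largest on-disk FASTQ record:
# header + sequence + '+' line + qualities.
largest = 0
with dnaio.open("reads.filtered.fq.gz") as reader:
    for record in reader:
        size = len(record.name) + 2 * len(record.sequence) + 6
        largest = max(largest, size)
print(f"largest record: ~{largest} bytes")

That number is a lower bound for --buffer-size; as mentioned above, simply trying increasingly larger values works too.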
Ah fantastic! I had found the corresponding line in your code and was about to edit it, but this is much more convenient.
I would say it is not rare to have reads of a few megabases with the ultra-long protocols, so it might be good to eventually increase the default for this buffer. I think a maximum read size of ~8 megabases should be pretty safe.
Thanks a lot!
I can confirm that --buffer-size=16000000 does the job.
Awesome! Let me re-open this until I’ve found a more permanent solution. Maybe I can make the buffer size dynamic or something along those lines.
You could try the following pattern:
while True:
    try:
        for chunk in dnaio.read_chunks(files[0], self.buffer_size):
            pass  # process the chunk here
    except OverflowError:
        # record did not fit: enlarge the buffer and retry
        self.buffer_size *= 2
        logging.warning("Keep some RAM sticks at the ready!")
        continue
    else:
        break  # or return to escape the loop
The strategy is good, but just ignoring the exception and retrying will lose the contents of the buffer. This would have to be done within read_chunks directly.
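Roughly, something like this inside the chunk reader (just a sketch, not dnaio's actual code; _last_complete_record_end is a simplified stand-in for dnaio's record-boundary detection and assumes plain 4-line FASTQ):

import io

def _last_complete_record_end(data: bytes) -> int:
    # Simplified boundary finder: return the offset just past the last
    # complete 4-line FASTQ record in `data`, or 0 if none fits yet.
    end = 0
    newlines = 0
    pos = data.find(b"\n")
    while pos != -1:
        newlines += 1
        if newlines % 4 == 0:
            end = pos + 1
        pos = data.find(b"\n", pos + 1)
    return end

def read_chunks_growing(raw: io.BufferedIOBase, buffer_size: int = 4_000_000):
    # Yield chunks that end on record boundaries; when a record does not
    # fit, double the buffer instead of raising OverflowError, keeping
    # the bytes that have already been read.
    buf = bytearray(buffer_size)
    filled = 0
    while True:
        n = raw.readinto(memoryview(buf)[filled:])
        if not n:
            if filled:
                yield bytes(buf[:filled])  # trailing data at end of file
            return
        filled += n
        end = _last_complete_record_end(bytes(buf[:filled]))
        if end == 0:
            # No complete record in the buffer yet: grow it and read more.
            buf.extend(bytearray(len(buf)))
            continue
        yield bytes(buf[:end])
        # Keep the partial record at the front for the next round.
        buf[: filled - end] = buf[end:filled]
        filled -= end

The important part is that the buffer is grown before anything is discarded, so nothing that was already read from the file gets lost.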
Whoops, you are right. I incorrectly assumed blocks were passed rather than files.