AGAT Line length limit on input FASTA file: 65,536 characters (limit imposed by bioperl)

Hello,

I'm trying to run the following command:

agat_sp_extract_sequences.pl -g JU2526_Y39G10AR.22.gff -f JU2526*_region.fa -p

And it throws the following error:

------------- EXCEPTION: Bio::Root::Exception -------------
MSG: Each line of the file must be less than 65,536 characters. Line 2 is 67824 chars.
STACK: Error::throw
STACK: Bio::Root::Root::throw /home/lgs6452/.conda/envs/exonerate_env/lib/site_perl/5.26.2/Bio/Root/Root.pm:447
STACK: Bio::DB::IndexedBase::_check_linelength /home/lgs6452/.conda/envs/exonerate_env/lib/site_perl/5.26.2/Bio/DB/IndexedBase.pm:757
STACK: Bio::DB::Fasta::_calculate_offsets /home/lgs6452/.conda/envs/exonerate_env/lib/site_perl/5.26.2/Bio/DB/Fasta.pm:227
STACK: Bio::DB::IndexedBase::_index_files /home/lgs6452/.conda/envs/exonerate_env/lib/site_perl/5.26.2/Bio/DB/IndexedBase.pm:659
STACK: Bio::DB::IndexedBase::index_file /home/lgs6452/.conda/envs/exonerate_env/lib/site_perl/5.26.2/Bio/DB/IndexedBase.pm:487
STACK: Bio::DB::IndexedBase::new /home/lgs6452/.conda/envs/exonerate_env/lib/site_perl/5.26.2/Bio/DB/IndexedBase.pm:364
STACK: /home/lgs6452/.conda/envs/exonerate_env/bin/agat_sp_extract_sequences.pl:125
-----------------------------------------------------------

It appears that, because they rely on BioPerl, your scripts won't accept single-line FASTA files with sequences longer than ~65 kb. Would it be possible to pre-process the FASTA files within your scripts (i.e. convert them from single-line to multi-line) so that they work regardless of the input format? It's straightforward enough to convert the FASTA file before running your scripts, but it would be far more convenient to have the script do it. It would probably save you a tonne of time with confused users, too.

Thanks,

Lewis

PS: I've only begun using AGAT but it seems like it will largely solve the constant pain of working with GFF3 files. Huge thanks for developing it!

Jul 01 '20 10:07 lstevens17

Incidentally, if anyone bumps into the same issue, you can use FASTX-Toolkit to reformat your FASTA (see http://hannonlab.cshl.edu/fastx_toolkit/commandline.html#fasta_formatter_usage). It can be installed using conda.

# install with conda
conda install -c bioconda fastx_toolkit

# convert (where 60 = desired line length)
fasta_formatter -i [original.fasta] -w 60 >[new.fasta]

Jul 01 '20 11:07 lstevens17

Yes, I could add a patch to reformat the FASTA file in such cases, but I would prefer that this type of fix be handled within BioPerl directly. If your headers are shorter than 80 characters, you could also use a shell command directly (fold wraps every line at 80 columns by default, so longer headers would be split too): fold input.fa > output.fa
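If your headers are longer than 80 characters, or you would rather not install anything, the same reformatting can be sketched in a few lines of Python. This is only a minimal illustration, not part of AGAT or BioPerl: the function names wrap_fasta and rewrap_file and the default width of 60 are assumptions made for the example. It re-wraps sequence lines to a fixed width and leaves header lines untouched.

```python
def wrap_fasta(lines, width=60):
    """Yield FASTA lines with sequences re-wrapped to `width` characters.

    Header lines (starting with '>') are passed through unchanged.
    Hypothetical helper for illustration; not an AGAT or BioPerl function.
    """
    if width < 1:
        raise ValueError("width must be at least 1")
    chunks = []  # sequence pieces belonging to the current record

    def flush():
        # Join the buffered sequence and emit it in fixed-width slices.
        seq = "".join(chunks)
        chunks.clear()
        for i in range(0, len(seq), width):
            yield seq[i:i + width]

    for raw in lines:
        line = raw.rstrip("\r\n")
        if line.startswith(">"):
            yield from flush()   # finish the previous record first
            yield line           # header is never wrapped
        elif line.strip():
            chunks.append(line.strip())
    yield from flush()           # emit the last record


def rewrap_file(src_path, dst_path, width=60):
    """Read the FASTA at src_path and write a re-wrapped copy to dst_path."""
    with open(src_path) as src, open(dst_path, "w") as dst:
        for out in wrap_fasta(src, width):
            dst.write(out + "\n")
```

Calling rewrap_file("input.fa", "output.fa") would write a multi-line copy that can then be passed to -f. Note that this sketch holds one full sequence in memory at a time, which is fine for most genomes but is a trade-off compared with a streaming tool.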

Jul 01 '20 11:07 Juke34

See here for the discussion with the BioPerl team: https://github.com/bioperl/bioperl-live/issues/345

Sep 07 '20 09:09 Juke34

I see BioPerl does not plan to fix this issue. Here is a Perl alternative to fastx_toolkit, written by Ning Jiang: https://github.com/oushujun/LTR_retriever/blob/master/bin/fasta-reformat.pl. It's slower but free of third-party dependencies.

May 02 '22 15:05 oushujun