chewBBACA
Blast Local IDs too long?
When I try to run chewBBACA's CreateSchema, I get the following error:
$ chewBBACA.py CreateSchema -i enterobacter-assembly-copies/ -o ./ecloacae-schema --n ecloacae-wgmlst-schema --ptf GCF_001875655.1.trn --cpu 24
chewBBACA version: 3.3.3
Authors: Rafael Mamede, Pedro Cerqueira, Mickael Silva, João Carriço, Mário Ramirez
Github: https://github.com/B-UMMI/chewBBACA
Documentation: https://chewbbaca.readthedocs.io/en/latest/index.html
Contacts: [email protected]
============================
chewBBACA - CreateSchema
============================
Started at: 2024-04-02T15:20:00
Prodigal training file: GCF_001875655.1.trn
Prodigal mode: single
CPU cores: 24
BLAST Score Ratio: 0.6
Translation table: 11
Minimum sequence length: 201
Size threshold: 0.2
Word size: 5
Window size: 5
Clustering similarity: 0.2
Representative filter: 0.9
Intra-cluster filter: 0.9
CDS prediction
================
Predicting CDSs for 481 inputs...
[====================] 100%
Extracted a total of 2278483 CDSs from 481 inputs.
CDS deduplication
===================
Identifying distinct CDSs...
Identified 539683 distinct CDSs.
CDS translation
=================
Translating 539683 CDS...
[====================] 100%
10041 CDSs could not be translated.
Protein deduplication
=======================
Identifying distinct proteins...
Identified 322764 distinct proteins.
Kept 322764 sequences after filtering the initial sequences.
Protein clustering
====================
Clustering proteins...
[====================] 100%
Clustered 322764 proteins into 27110 clusters.
Removing proteins highly similar to the cluster representative...
Removed 79907 sequences.
Identified 14106 singletons.
Remaining sequences after representative and singleton pruning: 242857
Removing sequences highly similar to other clustered sequences...
Removed 140800 sequences.
Clusters to BLAST: 13004
Performing all-vs-all BLASTp per cluster...
b'BLAST Database creation error: Near line 1, the local id is too long. Its length is 58 but the maximum allowed local id length is 50. Please find and correct all local ids that are too long.\n'
This was using chewBBACA v3.3.3 (blast v2.15). My input sequences are simply from Unicycler and their contig headers follow the format:
>1 ... ... ...
>2 ... ... ...
>3 ... ... ...
There is a space between the contig number and the rest of the info.
Any help is appreciated. Conrad
Greetings @cizydorczyk,
That contig format should not raise issues. I think the issue you are encountering is related to the names of the input files. chewBBACA determines the unique prefix for each input file and includes that prefix in the name of each coding sequence (CDS) identified in the genome. The unique prefix is everything before the first "." in the filename. In this case, you might have input files with lengthy prefixes that lead to issues when running BLASTp. Please check the filenames to make sure that there are no prefixes with close to 50 characters. You can read more about input file naming here.
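As a rough way to run that check, the sketch below lists prefixes that come close to the 50-character limit quoted in the BLAST error. It is not part of chewBBACA: the function names and the `margin` value are hypothetical, and the margin is only a guess at how many characters to leave free for whatever chewBBACA appends to the prefix when naming each CDS.

```python
from pathlib import Path

# Limit quoted in the BLAST error message ("maximum allowed local id length is 50").
BLAST_MAX_LOCAL_ID = 50


def unique_prefix(filename):
    """Return everything before the first '.' in the file name."""
    return Path(filename).name.split(".", 1)[0]


def long_prefixes(filenames, margin=10):
    """Return (prefix, length) pairs for prefixes longer than the limit minus `margin`.

    `margin` is an assumed allowance for the text added to the prefix
    in each CDS identifier; adjust it to taste.
    """
    threshold = BLAST_MAX_LOCAL_ID - margin
    flagged = []
    for name in filenames:
        prefix = unique_prefix(name)
        if len(prefix) > threshold:
            flagged.append((prefix, len(prefix)))
    return flagged
```

For a directory of assemblies, something like `long_prefixes(p.name for p in Path("enterobacter-assembly-copies").iterdir())` would list the prefixes worth shortening.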
Please let us know if that solves the issue.
Kind regards,
Rafael
Thank you for your response.
I was not able to modify my file prefixes but I modified contig names within each assembly to just a number and it resolved the issue. I must have missed that filename lengths could contribute to this issue -- good to know for the future.
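For anyone landing here later, a minimal sketch of that kind of header rewrite is below. It is not the exact procedure I used; the function name is made up for illustration, and it assumes plain, uncompressed FASTA where every header line starts with ">".

```python
def renumber_fasta_headers(lines):
    """Replace each FASTA header with a sequential number; sequence lines are unchanged."""
    renamed = []
    count = 0
    for line in lines:
        if line.startswith(">"):
            count += 1
            # Drop everything after the contig number, keeping only ">N".
            renamed.append(f">{count}\n")
        else:
            renamed.append(line)
    return renamed
```

Reading an assembly with `open(path).readlines()`, passing the lines through this function, and writing the result to a new file gives headers of the form ">1", ">2", and so on.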
Conrad