chewBBACA
chewBBACA copied to clipboard
problems with schema creating
Hi :)
I had such problem while creating a new scheme on 200 ref sequences from NCBI. I have no idea how to solve it. I'll be grateful for help.
Kasia
CPU cores: 22 BLAST Score Ratio: 0.6 Translation table: 11 Minimum sequence length: 201 Size threshold: 0.2 Word size: 5 Window size: 5 Clustering similarity: 0.2 Representative filter: 0.9 Intra-cluster filter: 0.9 Number of inputs: 201
Predicting gene sequences...
[====================] 100%
Extracting coding sequences...
[ ] 0% Error on cds_batch_extractor: Traceback (most recent call last): File "/home/msszwarc/miniconda3/envs/chewbbaca/lib/python3.9/site-packages/CHEWBBACA/utils/multiprocessing_operations.py", line 39, in function_helper results = input_args-1 File "/home/msszwarc/miniconda3/envs/chewbbaca/lib/python3.9/site-packages/CHEWBBACA/utils/gene_prediction.py", line 235, in cds_batch_extractor total = save_extracted_cds(g, identifier, orf_file_path, File "/home/msszwarc/miniconda3/envs/chewbbaca/lib/python3.9/site-packages/CHEWBBACA/utils/gene_prediction.py", line 182, in save_extracted_cds genome_info = extract_genome_cds(reading_frames, contigs, 1) File "/home/msszwarc/miniconda3/envs/chewbbaca/lib/python3.9/site-packages/CHEWBBACA/utils/gene_prediction.py", line 99, in extract_genome_cds sequence = contigs[contig_id] KeyError: 'NZ_CP031775'
[====================] 100%Traceback (most recent call last):
File "/home/msszwarc/miniconda3/envs/chewbbaca/bin/chewBBACA.py", line 10, in
Greetings @KasiaTluscik,
Thank you for your interest in chewBBACA. Can you tell us what version of chewBBACA you are using and what was the command? Also, you refer that you want to create a schema based on 200 genome assemblies from the NCBI, but chewBBACA is detecting 201 input files. If you only have 200 genome assemblies to pass as input, I ask you to make sure that there are not other files in the directory that might be causing this issue. Looking forward to hearing from you.
Rafael
Thanks for the quick reply. I meant ,,about 200" :) . There are actually 201 FASTA sequences. The program version is: 2.8.5. Command below: chewBBACA.py CreateSchema -i ~/e_paz/cgMLST/referencje_cgMLST/ -o ~/e_paz/cgMLST/shewanella_cgMLST_scheme --n shewanella_cgMLST_scheme --cpu 22 --ptf ~/e_paz/cgMLST/shewanellatrainedfile.trn
@KasiaTluscik thanks for the clarification and for posting the command. Can you make sure none of the files in the directory are corrupted nor have unusual characters? Also, how did you install chewBBACA? Looking forward to hearing from you. Mario
I've checked all files with seqkit stats. Seems to be ok. Maybe I should validate this data with something else? Can you give me an idea? I installed chewBBACA via conda.
@KasiaTluscik, we know that in some cases this type of error was solved by redownloading the genome assemblies or by running a script, such as sequence_cleaner.py from BioPython, to filter out low quality sequences and solve file format issues. We have tried to reproduce this type of error by downloading the same set of genome assemblies that we were told were leading to the issue, but the process always completed successfully on our side. That is why we think it is an issue related with one or several of the input files that may be corrupted. If it is not much to ask, can you ZIP the whole directory that contains the genome assemblies and send us the data to [email protected]? It might help us to identify the cause of the issue, after which we can share a solution and implement a verification to detect this issue during execution.
Hi,
Thanks for the hint, but it didn't solve the problem. I downloaded everything again and it doesn't work. I ran sequences through the suggested script (sequence_cleaner.py) and ... still the same error ....I enclose a link to download my dataset (the file is too large for e-mail). https://we.tl/t-0uEY0SEoJ4 This is new set with 211 fasta.
Can you try to run prodigal independently on your dataset? Just to check if it runs an error on any of the fastas
Hello @KasiaTluscik,
Thank you for sharing the dataset with us. We used the files to reproduce the error and found what is causing the issue. The dataset includes multiple versions for three RefSeq assemblies. The assemblies are the following:
GCF_002209245 (GCF_002209245.1_ASM220924v1_genomic.fna and GCF_002209245.2_ASM220924v2_genomic.fna) GCF_007004545 (GCF_007004545.2_ASM700454v2_genomic.fna and GCF_007004545.3_ASM700454v3_genomic.fna) GCF_007923045 (GCF_007923045.2_ASM792304v3_genomic.fna and GCF_007923045.3_ASM792304v4_genomic.fna)
chewBBACA does not expect to find multiple assemblies with the same unique prefix (it selects all characters before the first "." as the unique identifier). You can find more details in an issue that I have created to describe the problem and suggest a solution, #108. I suggest that you only keep the latest version for each assembly. The NCBI replaced/suppressed the previous versions. To do that you only need to remove the following files:
GCF_002209245.1_ASM220924v1_genomic.fna GCF_007004545.2_ASM700454v2_genomic.fna GCF_007923045.2_ASM792304v3_genomic.fna
The CreateSchema process should run without issues after removing those files. If you really need to include multiple versions, you will have to rename the files to ensure that the prefixes are different. Once again, thank you for sharing the dataset with us. It allowed us to identify the issue and we will include a fix in the next version. Please let us know if you can create the schema after removing or renaming the files.
Rafael
Hello :) I had an unexpected quarantine pause, but now I'm on the job. Your suggestion actually solved my problem. Now everything works fine. Thank you for your help. One more question. I've created a schema containing 46039 loci. Isn't that a lot? When I look at schemas for other bacterias, it's usually only a few thousand loci... greetings Kasia
Hello @KasiaTluscik,
It is great to know that we could solve this problem. If I recall correctly, your dataset includes several species for the Shewanella genus. The schemas that have a few thousand loci are usually created with datasets for a single species, although the number of loci in a schema can vary greatly based on the species and dataset composition. A schema with over 40k loci is certainly possible if you are working at genus-level and there is enough interspecific diversity. However, I would suggest that you check if the genome assemblies that you have selected are correctly classified, as the inclusion of misclassified strains can considerably increase the number of loci in a schema and affect the determination of the core-genome.
Rafael
Indeed Kasia, great to know that it worked out. In addition to what Rafael already pointed out I would like to remind you that the CreateSchema process always creates a wgMLST schema. To identify the cgMLST schema you should run the allele call on a set of suitable isolates and then identify which loci are present in the isolates at your desired frequency (100%, 95%, 90%,...). I suspect that if you do this the now with the schema you have the cgMLST schema will come down to a few hundred genes, given the diversity of species you are including. Do let us know how it goes.
Hello you could help me.
I'm trying to run the command but I get this chewBBACA.py CreateSchema -i ~/fasta_pseudomona/ --ptf PAO1.trn --cpu 6 If 'chewBBACA.py' is not a typo you can use command-not-found to lookup the package that contains it, like this: cnf chewBBACA.py cnf chewBBACA.py CreateSchema -i ~/fasta_pseudomona/ --ptf PAO1.trn --cpu 6
0, in molecule_type
as an annotation on a SeqRecord for your script to work correctly. Please see https://biopython.org/wiki/Alphabet for more information.
(base) ocarabali@bioinf-hpc15:~/tesis> chewBBACA.py CreateSchema -i ~/tesis/fasta_pseudomona
Traceback (most recent call last):
File "/vault2/homehpc/ocarabali/miniconda3/bin/chewBBACA.py", line 6, in molecule_type
as an annotation on a SeqRecord for your script to work correctly. Please see https://biopython.org/wiki/Alphabet for more information.
0, in molecule_type
as an annotation on a SeqRecord for your script to work correctly. Please see https://biopython.org/wiki/Alphabet for more information.
(base) ocarabali@bioinf-hpc15:~/tesis> chewBBACA.py CreateSchema -i ~/tesis/fasta_pseudomona
Traceback (most recent call last):
File "/vault2/homehpc/ocarabali/miniconda3/bin/chewBBACA.py", line 6, in molecule_type
as an annotation on a SeqRecord for your script to work correctly. Please see https://biopython.org/wiki/Alphabet for more information.
Hello @ocarabali,
The error you report seems to be related with the Bio.Alphabet module that was removed in Biopython 1.78. We updated chewBBACA's code based on the recommendations from the Biopython developers and this issue should not affect the latest version. Can you please check if the version you are using is 2.8.5? Please update chewBBACA if the version that is installed is not the latest. Let us know if this solves the issue.
Rafael
Dear Rafael Mamede
I already solved the previous error.
now i have the following error.
could you help me?
Error on translate_coding_sequences: Traceback (most recent call last): File "/vault2/soft/miniconda2/envs/chewbbaca/lib/python3.9/site-packages/CHEWBBACA/utils/multiprocessing_operations.py", line 39, in function_helper results = input_args-1 File "/vault2/soft/miniconda2/envs/chewbbaca/lib/python3.9/site-packages/CHEWBBACA/utils/sequence_manipulation.py", line 526, in translate_coding_sequences cds_index = SeqIO.index(sequences_file, 'fasta') File "/vault2/soft/miniconda2/envs/chewbbaca/lib/python3.9/site-packages/Bio/SeqIO/init.py", line 875, in index return _IndexedSeqFileDict( File "/vault2/soft/miniconda2/envs/chewbbaca/lib/python3.9/site-packages/Bio/File.py", line 199, in init raise ValueError("Duplicate key '%s'" % key) ValueError: Duplicate key '0000-0056-5600'
chewBBACA version: 2.8.5
It seems that your schema was created with chewBBACA 2.1.0 or lower. It is highly recommended that you run the PrepExternalSchema process to guarantee full compatibility with the new chewBBACA version. If you wish to continue, the AlleleCall process will convert the schema to v2.8.5, but will not determine if schema structure respects configuration values.
I've got the same problem with version 3.1.2 Provided path does not include all the necessary schema files. Please verify that you have passed the correct path to the schema. could you help me? I've checked for same assemblies names and I didn't got...
Did you use the PrepExternalSchema module to convert your schema to a format usable by chewBBACA 3.1.2? What is the origin of the schema you are trying to use?
Hi Ramirma, thanks for your quick reply, I used the command: chewBBACA.py CreateSchema -i Genomes -o schema --cpu 16 --ptf Salmonella_enterica.trn to generate the schema, it didn't return me any error message. thanks!
Thank you for the clarification @massiizsve . What command did you use to do allele calling? Did you do allele calling on the same genomes you used to create the schema? Mario
Hello @massiizsve,
Just to add another check to what @ramirma has already asked. Please verify that you have passed the path to the schema directory to perform allele calling. The schema directory contains the schema FASTA files and a folder named short
that includes the FASTA files with the representative alleles. Based on the command you used to create the schema, the schema directory should be schema/schema_seed
. chewBBACA prints the warning you shared if it cannot find the FASTA files and the short
folder in the path you provided.
Rafael
Closing this issue. We've added many changes that solve or help with the issues reported here. Please let us know if you need anything else.