SnpEff
Problems building a database for CriGri1.0
I am trying to build a database using the files that I downloaded from the NCBI page of CriGri1.0. No matter what I do, everything fails. I tried running the buildDbNcbi script with every accession number I could find:
./scripts/buildDbNcbi.sh AFTD00000000.1
./scripts/buildDbNcbi.sh GCF_000223135.1
./scripts/buildDbNcbi.sh DQ390542.2
./scripts/buildDbNcbi.sh GCA_000223135.1
but it doesn't work:
Downloading genome AFTD00000000.1
% Total % Received % Xferd Average Speed Time Time Time Current
Dload Upload Total Spent Left Speed
100 2621 0 2621 0 0 2857 0 --:--:-- --:--:-- --:--:-- 2858
00:00:00 SnpEff version SnpEff 5.2a (build 2023-10-24 14:24), by Pablo Cingolani
00:00:00 Command: 'build'
00:00:00 Building database for 'AFTD00000000.1'
00:00:00 Reading configuration file 'snpEff.config'. Genome: 'AFTD00000000.1'
00:00:00 Reading config file: /home/user/genomics/snpEff/snpEff.config
00:00:00 done
00:00:00 Chromosome: 'AFTD00000000.1' length: 265786
00:00:00 Create exons from CDS (if needed):
00:00:00 Exons created for 0 transcripts.
00:00:00 Deleting redundant exons (if needed):
00:00:00 Total transcripts with deleted exons: 0
00:00:00 Collapsing zero length introns (if needed):
00:00:00 Total collapsed transcripts: 0
00:00:00 No sequence found in feature file.
00:00:00 Trying fasta file '/home/user/genomics/snpEff/./data/genomes/AFTD00000000.1.fa'
00:00:00 Trying fasta file '/home/user/genomics/snpEff/./data/AFTD00000000.1/sequences.fa'
java.lang.RuntimeException: Cannot find sequence for 'AFTD00000000.1'
at org.snpeff.snpEffect.factory.SnpEffPredictorFactoryFeatures.sequence(SnpEffPredictorFactoryFeatures.java:533)
at org.snpeff.snpEffect.factory.SnpEffPredictorFactoryFeatures.create(SnpEffPredictorFactoryFeatures.java:357)
at org.snpeff.snpEffect.commandLine.SnpEffCmdBuild.run(SnpEffCmdBuild.java:414)
at org.snpeff.SnpEff.run(SnpEff.java:1173)
at org.snpeff.SnpEff.main(SnpEff.java:163)
java.lang.RuntimeException: Error reading file '/home/user/genomics/snpEff/./data/AFTD00000000.1/genes.gbk'
java.lang.RuntimeException: Cannot find sequence for 'AFTD00000000.1'
at org.snpeff.snpEffect.factory.SnpEffPredictorFactoryFeatures.create(SnpEffPredictorFactoryFeatures.java:365)
at org.snpeff.snpEffect.commandLine.SnpEffCmdBuild.run(SnpEffCmdBuild.java:414)
at org.snpeff.SnpEff.run(SnpEff.java:1173)
at org.snpeff.SnpEff.main(SnpEff.java:163)
00:00:00 Logging
00:00:01 Checking for updates...
00:00:02 Done.
I also tried downloading the proteins, CDS, genome, and annotation files from the same page. I set up the folder structure that snpEff expects:
data/crigri_homemade:
-rw-r--r-- 1 root root 82237037 Mar 11 14:30 cds.fa
-rw------- 1 root root 395139654 Mar 8 08:45 genes.gtf
-rw------- 1 root root 26083171 Mar 8 08:45 protein.fa
data/genomes:
-rw------- 1 root root 2442407538 Mar 8 08:44 crigri_homemade.fa
I added the following to the snpEff.config file:
# Hamster genome, version crigri_homemade
crigri_homemade.genome : Hamster
and then ran the build command:
java -Xmx20g -jar snpEff.jar build -v crigri_homemade
However, the build fails:
WARNING_RARE_AA_POSSITION_NOT_FOUND: Cannot find transcript 'NP_001243686.1'
WARNING_RARE_AA_POSSITION_NOT_FOUND: Cannot find transcript 'NP_001243688.1'
WARNING_RARE_AA_POSSITION_NOT_FOUND: Cannot find transcript 'NP_001243691.1'
WARNING_RARE_AA_POSSITION_NOT_FOUND: Cannot find transcript 'NP_001243713.1'
WARNING_RARE_AA_POSSITION_NOT_FOUND: Cannot find transcript 'NP_001243715.1'
WARNING_RARE_AA_POSSITION_NOT_FOUND: Cannot find transcript 'NP_001243717.1'
WARNING_RARE_AA_POSSITION_NOT_FOUND: Cannot find transcript 'NP_001243735.1'
WARNING_RARE_AA_POSSITION_NOT_FOUND: Cannot find transcript 'NP_001243744.1'
WARNING_RARE_AA_POSSITION_NOT_FOUND: Cannot find transcript 'NP_001243760.1'
WARNING_RARE_AA_POSSITION_NOT_FOUND: Cannot find transcript 'NP_001243772.1'
WARNING_RARE_AA_POSSITION_NOT_FOUND: Cannot find transcript 'NP_001243773.1'
WARNING_RARE_AA_POSSITION_NOT_FOUND: Cannot find transcript 'NP_001243777.1'
WARNING_RARE_AA_POSSITION_NOT_FOUND: Cannot find transcript 'NP_001243780.1'
WARNING_RARE_AA_POSSITION_NOT_FOUND: Cannot find transcript 'NP_001243795.1'
WARNING_RARE_AA_POSSITION_NOT_FOUND: Cannot find transcript 'NP_001243797.1'
WARNING_RARE_AA_POSSITION_NOT_FOUND: Cannot find transcript 'NP_001243807.1'
WARNING_RARE_AA_POSSITION_NOT_FOUND: Cannot find transcript 'NP_001243809.1'
WARNING_RARE_AA_POSSITION_NOT_FOUND: Cannot find transcript 'NP_001243811.1'
WARNING_RARE_AA_POSSITION_NOT_FOUND: Cannot find transcript 'XP_007648377.2'
WARNING_RARE_AA_POSSITION_NOT_FOUND: Cannot find transcript 'XP_016830633.1'
WARNING_RARE_AA_POSSITION_NOT_FOUND: Cannot find transcript 'XP_027298108.2'
... (many more warnings of the same kind)
CDS check: crigri_homemade OK: 0 Warnings: 0 Not found: 27418 Errors: 0 Error percentage: NaN%
FATAL ERROR: No CDS checked. This is might be caused by differences in FASTA file transcript IDs respect to database's transcript's IDs.
Transcript IDs from database (sample):
'NW_003613619.1" db_xref "GeneID:100759954'
'NW_003614478.1" db_xref "GeneID:100762652'
'NW_003613918.1" db_xref "GeneID:107978466'
'NW_003613918.1" db_xref "GeneID:107978467'
'NW_003617297.1" db_xref "GeneID:100768977'
'NW_003615411.1" db_xref "GeneID:100759186'
'NW_003613650.1" db_xref "GeneID:100759390'
'NW_003614457.1" db_xref "GeneID:100763890'
'NW_003613788.1" db_xref "GeneID:100766875'
'NW_003613849.1" db_xref "GeneID:100689475'
'NW_003615626.1" db_xref "GeneID:100759475'
'NW_003613838.1" db_xref "GeneID:100755459'
'NW_003614224.1" db_xref "GeneID:100682536'
'NW_003614410.1" db_xref "GeneID:100768549'
'NW_003614314.1" db_xref "GeneID:100766714'
'NW_003614224.1" db_xref "GeneID:100682535'
'NW_003613591.1" db_xref "GeneID:100755618'
'NW_003613780.1" db_xref "GeneID:107978463'
'NW_003613617.1" db_xref "GeneID:100751799'
'NW_003613842.1" db_xref "GeneID:100763667'
'NW_003613985.1" db_xref "GeneID:107978461'
'NW_003614852.1" db_xref "GeneID:100772877'
Transcript IDs from database (fasta file):
'lcl|NW_003613984.1_cds_XP_003503656.1_15848'
'lcl|NW_003613931.1_cds_XP_035312148.1_14739'
'2_24447'
'lcl|NW_003616314.1_cds_XP_016818869.1_34492'
'2_24448'
'lcl|NW_003613773.1_cds_XP_035310189.1_9427'
'lcl|NW_003613894.1_cds_XP_027292366.1_13615'
'lcl|NW_003613677.1_cds_XP_027261778.1_5695'
'lcl|NW_003613953.1_cds_XP_003503305.2_15188'
'1_cds_XP_016836244'
'1_cds_XP_016836245'
'lcl|NW_003616859.1_cds_XP_003514747.1_35474'
'lcl|NW_003614171.1_cds_XP_003505645.1_19382'
'lcl|NW_003613833.1_cds_XP_007640034.1_11333'
'lcl|NW_003614145.1_cds_XP_035313620.1_19012'
'lcl|NW_003613761.1_cds_XP_003499981.1_9158'
'1_cds_XP_016836264'
'lcl|NW_003613959.1_cds_XP_016831556.1_15340'
'lcl|NW_003613745.1_cds_XP_027290471.1_8486'
'lcl|NW_003616774.1_cds_XP_007652885.1_35345'
'lcl|NW_003619270.1_cds_XP_035294329.1_37044'
'1_cds_XP_007637008'
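To make the mismatch easier to see, here is a minimal sketch that pulls the contig and protein accession out of the CDS FASTA IDs printed above. It assumes the IDs follow the pattern `lcl|<contig>_cds_<protein accession>_<index>`, which is only inferred from the samples in this log. The function name `parse_cds_id` is hypothetical and is not part of snpEff.

```python
import re

# Assumption: pattern inferred from the sample IDs in the log above,
# e.g. 'lcl|NW_003613984.1_cds_XP_003503656.1_15848'.
CDS_ID = re.compile(
    r"^lcl\|(?P<contig>.+?)_cds_(?P<protein>[A-Z]{2}_\d+\.\d+)_(?P<index>\d+)$"
)


def parse_cds_id(fasta_id):
    """Return (contig, protein_accession), or None if the ID does not match."""
    match = CDS_ID.match(fasta_id)
    if match is None:
        return None
    return match.group("contig"), match.group("protein")


if __name__ == "__main__":
    # IDs copied from the snpEff output above.
    for sample in ("lcl|NW_003613984.1_cds_XP_003503656.1_15848", "2_24447"):
        print(sample, "->", parse_cds_id(sample))
```

Neither the full IDs nor the truncated ones (such as '2_24447') look like the transcript IDs that snpEff reports from the database, which would be consistent with the "Not found: 27418" count.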
I am not using any exotic or modified files. I downloaded the files from NCBI and tried to build the database, and I am stuck.
I also think that the CriGri_1.0.99.genome provided with snpEff may be the same one I am trying to build, but I can't confirm it. If it is, I still have a problem, because the annotation does not work for my VCF files (obtained with VarScan from sequences aligned to the reference genome I linked at the very beginning).
All the entries in the VCF files that I get by running:
java -Xmx8g -jar /home/user/genomics/snpEff/snpEff.jar CriGri_1.0.99 /home/user/data/varscan_results/somatic/varscan_output.indel.vcf > /home/user/data/varscan_results/somatic/varscan_indel_annotations.vcf
state that:
ERROR_CHROMOSOME_NOT_FOUND
I suppose this comes from the fact that, according to the NCBI page, there is no chromosome information for this assembly. But then what? Should I not use this genome at all?
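One way to check whether this is a naming problem is to compare the sequence names in the VCF with the names in a reference FASTA. The sketch below does only that. The helper names are hypothetical, and it does not query the snpEff database itself, whose sequence names may differ from both files.

```python
def vcf_contigs(lines):
    """Collect the distinct CHROM values (first column) from VCF data lines."""
    names = set()
    for line in lines:
        # Skip blank lines and header lines ('##...' and '#CHROM').
        if not line.strip() or line.startswith("#"):
            continue
        names.add(line.split("\t", 1)[0])
    return names


def fasta_names(lines):
    """Collect sequence names: the first word after '>' on each header line."""
    names = set()
    for line in lines:
        if line.startswith(">"):
            fields = line[1:].split()
            if fields:
                names.add(fields[0])
    return names


def missing_contigs(vcf_lines, fasta_lines):
    """Return the sorted VCF contig names that do not appear in the FASTA."""
    return sorted(vcf_contigs(vcf_lines) - fasta_names(fasta_lines))


if __name__ == "__main__":
    import sys

    # Usage (paths are placeholders): python check_contigs.py calls.vcf genome.fa
    with open(sys.argv[1]) as vcf, open(sys.argv[2]) as fasta:
        for name in missing_contigs(vcf, fasta):
            print(name)
```

If the VCF uses RefSeq-style scaffold names such as NW_003613619.1 and the database was built with a different naming scheme, every variant would be reported as ERROR_CHROMOSOME_NOT_FOUND even though the underlying sequences are the same.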
This is the snpEff version:
00:00:00 SnpEff version SnpEff 5.2a (build 2023-10-24 14:24), by Pablo Cingolani
Any idea may help!