
Problems building a database for CriGri1.0

Open gabrielet opened this issue 11 months ago • 0 comments

I am trying to build a database from the files that I downloaded from the NCBI page of CriGri1.0. No matter what I do, the build fails. I ran the buildDbNcbi.sh script with every accession number I could find:

./scripts/buildDbNcbi.sh AFTD00000000.1
./scripts/buildDbNcbi.sh GCF_000223135.1
./scripts/buildDbNcbi.sh DQ390542.2
./scripts/buildDbNcbi.sh GCA_000223135.1

but it doesn't work:

Downloading genome AFTD00000000.1
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100  2621    0  2621    0     0   2857      0 --:--:-- --:--:-- --:--:--  2858
00:00:00 SnpEff version SnpEff 5.2a (build 2023-10-24 14:24), by Pablo Cingolani
00:00:00 Command: 'build'
00:00:00 Building database for 'AFTD00000000.1'
00:00:00 Reading configuration file 'snpEff.config'. Genome: 'AFTD00000000.1'
00:00:00 Reading config file: /home/user/genomics/snpEff/snpEff.config
00:00:00 done
00:00:00 Chromosome: 'AFTD00000000.1'	length: 265786
00:00:00 Create exons from CDS (if needed): 
00:00:00 Exons created for 0 transcripts.
00:00:00 Deleting redundant exons (if needed): 
00:00:00 	Total transcripts with deleted exons: 0
00:00:00 Collapsing zero length introns (if needed): 
00:00:00 	Total collapsed transcripts: 0
00:00:00 No sequence found in feature file.
00:00:00 	Trying fasta file '/home/user/genomics/snpEff/./data/genomes/AFTD00000000.1.fa'
00:00:00 	Trying fasta file '/home/user/genomics/snpEff/./data/AFTD00000000.1/sequences.fa'
java.lang.RuntimeException: Cannot find sequence for 'AFTD00000000.1'
	at org.snpeff.snpEffect.factory.SnpEffPredictorFactoryFeatures.sequence(SnpEffPredictorFactoryFeatures.java:533)
	at org.snpeff.snpEffect.factory.SnpEffPredictorFactoryFeatures.create(SnpEffPredictorFactoryFeatures.java:357)
	at org.snpeff.snpEffect.commandLine.SnpEffCmdBuild.run(SnpEffCmdBuild.java:414)
	at org.snpeff.SnpEff.run(SnpEff.java:1173)
	at org.snpeff.SnpEff.main(SnpEff.java:163)
java.lang.RuntimeException: Error reading file '/home/user/genomics/snpEff/./data/AFTD00000000.1/genes.gbk'
java.lang.RuntimeException: Cannot find sequence for 'AFTD00000000.1'
	at org.snpeff.snpEffect.factory.SnpEffPredictorFactoryFeatures.create(SnpEffPredictorFactoryFeatures.java:365)
	at org.snpeff.snpEffect.commandLine.SnpEffCmdBuild.run(SnpEffCmdBuild.java:414)
	at org.snpeff.SnpEff.run(SnpEff.java:1173)
	at org.snpeff.SnpEff.main(SnpEff.java:163)
00:00:00 Logging
00:00:01 Checking for updates...
00:00:02 Done.

I also tried downloading the proteins, CDS, genome, and so on from the same page, and set up the folder structure that snpEff expects:

data/crigri_homemade:
-rw-r--r-- 1 root root  82237037 Mar 11 14:30 cds.fa
-rw------- 1 root root 395139654 Mar  8 08:45 genes.gtf
-rw------- 1 root root  26083171 Mar  8 08:45 protein.fa

data/genomes:
-rw------- 1 root root 2442407538 Mar  8 08:44 crigri_homemade.fa
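A quick way to double-check this layout is a minimal Python sketch like the one below. It only looks for the paths that appear in the listing above and in the build log earlier in this post (`data/genomes/<genome>.fa` or `data/<genome>/sequences.fa` for the sequence). The function name `check_layout` is hypothetical, and the list of paths is an assumption based on this post, not a complete description of what SnpEff accepts.

```python
from pathlib import Path


def check_layout(snpeff_dir, genome):
    """Return, per role, the candidate files that actually exist for `genome`.

    The candidate paths come from the directory listing and the build log in
    this post; they are not an exhaustive list of what SnpEff accepts.
    """
    data = Path(snpeff_dir) / "data"
    candidates = {
        "annotation": [data / genome / "genes.gtf"],
        "genome sequence": [
            data / "genomes" / f"{genome}.fa",
            data / genome / "sequences.fa",
        ],
        "cds": [data / genome / "cds.fa"],
        "protein": [data / genome / "protein.fa"],
    }
    # Keep only the candidates that exist as regular files.
    return {
        role: [str(p) for p in paths if p.is_file()]
        for role, paths in candidates.items()
    }


if __name__ == "__main__":
    # Run from the snpEff installation directory.
    for role, found in check_layout(".", "crigri_homemade").items():
        print(f"{role}: {', '.join(found) if found else 'MISSING'}")
```

An empty list for any role means none of the assumed locations for that file was found.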

I added this entry to the snpEff.config file:

# Hamster genome, version crigri_homemade
crigri_homemade.genome : Hamster

and then ran the build command:

java -Xmx20g -jar snpEff.jar build -v crigri_homemade

However, it does not work:

WARNING_RARE_AA_POSSITION_NOT_FOUND: Cannot find transcript 'NP_001243686.1'
WARNING_RARE_AA_POSSITION_NOT_FOUND: Cannot find transcript 'NP_001243688.1'
WARNING_RARE_AA_POSSITION_NOT_FOUND: Cannot find transcript 'NP_001243691.1'
WARNING_RARE_AA_POSSITION_NOT_FOUND: Cannot find transcript 'NP_001243713.1'
WARNING_RARE_AA_POSSITION_NOT_FOUND: Cannot find transcript 'NP_001243715.1'
WARNING_RARE_AA_POSSITION_NOT_FOUND: Cannot find transcript 'NP_001243717.1'
WARNING_RARE_AA_POSSITION_NOT_FOUND: Cannot find transcript 'NP_001243735.1'
WARNING_RARE_AA_POSSITION_NOT_FOUND: Cannot find transcript 'NP_001243744.1'
WARNING_RARE_AA_POSSITION_NOT_FOUND: Cannot find transcript 'NP_001243760.1'
WARNING_RARE_AA_POSSITION_NOT_FOUND: Cannot find transcript 'NP_001243772.1'
WARNING_RARE_AA_POSSITION_NOT_FOUND: Cannot find transcript 'NP_001243773.1'
WARNING_RARE_AA_POSSITION_NOT_FOUND: Cannot find transcript 'NP_001243777.1'
WARNING_RARE_AA_POSSITION_NOT_FOUND: Cannot find transcript 'NP_001243780.1'
WARNING_RARE_AA_POSSITION_NOT_FOUND: Cannot find transcript 'NP_001243795.1'
WARNING_RARE_AA_POSSITION_NOT_FOUND: Cannot find transcript 'NP_001243797.1'
WARNING_RARE_AA_POSSITION_NOT_FOUND: Cannot find transcript 'NP_001243807.1'
WARNING_RARE_AA_POSSITION_NOT_FOUND: Cannot find transcript 'NP_001243809.1'
WARNING_RARE_AA_POSSITION_NOT_FOUND: Cannot find transcript 'NP_001243811.1'
WARNING_RARE_AA_POSSITION_NOT_FOUND: Cannot find transcript 'XP_007648377.2'
WARNING_RARE_AA_POSSITION_NOT_FOUND: Cannot find transcript 'XP_016830633.1'
WARNING_RARE_AA_POSSITION_NOT_FOUND: Cannot find transcript 'XP_027298108.2'

(many more warnings like these)...

CDS check:	crigri_homemade	OK: 0	Warnings: 0	Not found: 27418	Errors: 0	Error percentage: NaN%
FATAL ERROR: No CDS checked. This is might be caused by differences in FASTA file transcript IDs respect to database's transcript's IDs.
Transcript IDs from database (sample):
	'NW_003613619.1" db_xref "GeneID:100759954'
	'NW_003614478.1" db_xref "GeneID:100762652'
	'NW_003613918.1" db_xref "GeneID:107978466'
	'NW_003613918.1" db_xref "GeneID:107978467'
	'NW_003617297.1" db_xref "GeneID:100768977'
	'NW_003615411.1" db_xref "GeneID:100759186'
	'NW_003613650.1" db_xref "GeneID:100759390'
	'NW_003614457.1" db_xref "GeneID:100763890'
	'NW_003613788.1" db_xref "GeneID:100766875'
	'NW_003613849.1" db_xref "GeneID:100689475'
	'NW_003615626.1" db_xref "GeneID:100759475'
	'NW_003613838.1" db_xref "GeneID:100755459'
	'NW_003614224.1" db_xref "GeneID:100682536'
	'NW_003614410.1" db_xref "GeneID:100768549'
	'NW_003614314.1" db_xref "GeneID:100766714'
	'NW_003614224.1" db_xref "GeneID:100682535'
	'NW_003613591.1" db_xref "GeneID:100755618'
	'NW_003613780.1" db_xref "GeneID:107978463'
	'NW_003613617.1" db_xref "GeneID:100751799'
	'NW_003613842.1" db_xref "GeneID:100763667'
	'NW_003613985.1" db_xref "GeneID:107978461'
	'NW_003614852.1" db_xref "GeneID:100772877'
Transcript IDs from database (fasta file):
	'lcl|NW_003613984.1_cds_XP_003503656.1_15848'
	'lcl|NW_003613931.1_cds_XP_035312148.1_14739'
	'2_24447'
	'lcl|NW_003616314.1_cds_XP_016818869.1_34492'
	'2_24448'
	'lcl|NW_003613773.1_cds_XP_035310189.1_9427'
	'lcl|NW_003613894.1_cds_XP_027292366.1_13615'
	'lcl|NW_003613677.1_cds_XP_027261778.1_5695'
	'lcl|NW_003613953.1_cds_XP_003503305.2_15188'
	'1_cds_XP_016836244'
	'1_cds_XP_016836245'
	'lcl|NW_003616859.1_cds_XP_003514747.1_35474'
	'lcl|NW_003614171.1_cds_XP_003505645.1_19382'
	'lcl|NW_003613833.1_cds_XP_007640034.1_11333'
	'lcl|NW_003614145.1_cds_XP_035313620.1_19012'
	'lcl|NW_003613761.1_cds_XP_003499981.1_9158'
	'1_cds_XP_016836264'
	'lcl|NW_003613959.1_cds_XP_016831556.1_15340'
	'lcl|NW_003613745.1_cds_XP_027290471.1_8486'
	'lcl|NW_003616774.1_cds_XP_007652885.1_35345'
	'lcl|NW_003619270.1_cds_XP_035294329.1_37044'
	'1_cds_XP_007637008'

I am not using any exotic or modified files, or anything weird. I downloaded the files from NCBI and tried to build the database, and now I am stuck.

I also think that the CriGri_1.0.99.genome provided with snpEff may be the same one I am trying to build, but I can't confirm it. If it is, I still have a problem, because the annotation is not working for my VCF files (obtained with VarScan from sequences aligned to the reference genome I linked at the very beginning).

All the entries in the VCF file that I get by running:

java -Xmx8g -jar /home/user/genomics/snpEff/snpEff.jar CriGri_1.0.99 /home/user/data/varscan_results/somatic/varscan_output.indel.vcf > /home/user/data/varscan_results/somatic/varscan_indel_annotations.vcf

state that:

ERROR_CHROMOSOME_NOT_FOUND

I suppose this comes from the fact that, according to the NCBI page, there is no chromosome information for this assembly. But then what? Should I not use this genome at all?
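One check that might narrow this down is whether the sequence names in the first column of the VCF match the names in the reference FASTA at all. The Python sketch below does only that comparison. The function names are hypothetical, and it says nothing about which names the prebuilt CriGri_1.0.99 database uses, so that would still need to be checked separately.

```python
def vcf_contigs(lines):
    """Collect the distinct CHROM values (first column) of a VCF."""
    names = set()
    for line in lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines, meta lines and the header line
        names.add(line.split("\t", 1)[0].strip())
    return names


def fasta_contigs(lines):
    """Collect sequence names: the first token of each FASTA header."""
    names = set()
    for line in lines:
        if line.startswith(">"):
            tokens = line[1:].split()
            if tokens:
                names.add(tokens[0])
    return names


def missing_contigs(vcf_lines, fasta_lines):
    """Return the VCF sequence names that do not appear in the FASTA."""
    return sorted(vcf_contigs(vcf_lines) - fasta_contigs(fasta_lines))
```

Calling `missing_contigs(open(vcf_path), open(fasta_path))` with uncompressed files returns an empty list when every name used in the VCF is also present in the FASTA.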

This is the snpEff version:

00:00:00 SnpEff version SnpEff 5.2a (build 2023-10-24 14:24), by Pablo Cingolani

Any ideas would help!

gabrielet, Mar 11 '24 13:03