SNPGenie icon indicating copy to clipboard operation
SNPGenie copied to clipboard

error in running the SNPGenie

Open shiyi-pan opened this issue 3 years ago • 16 comments

hi, i run the SNPGenie and got warning like that :

WARNING: The CDS coordinates for gene Glyma.01G115000.3.Wm82.a2.v1.CDS.2 in the gtf file do not yield a set of complete codons,

or are absent from the file. The number of nucleotides must be a multiple of 3.

SNPGenie terminated.

When I remove the wrong CDS manually , the SNPGenie works again .but my gft file is too huge to remove all wrong CDSs manually. could you tell me how to judge a CDS in a gft file could be translated into complete codons or not ,so that I can write a script and remove them from my gft file once at all. Thank you very.

shiyi-pan avatar Jul 16 '20 13:07 shiyi-pan

Greetings, Shiyi-Pan! Thanks very much for using SNPGenie.

Unfortunately, SNPGenie is explicitly designed so that the GTF file is a "user problem", in that the user must make sure (1) there is only one entry/version/transcript for each gene_id; and (2) each gene is a multiple of 3 bp (i.e., a complete codon set).

To write a filtering script, my first suggestion would be to calculate the nucleotide length of each line in the file, i.e., [end] - [start] + 1. Then you could group/summarize the lines according to unique gene_ids, and find the sum of the lengths for each gene_id. For example, in R, you might use group_by() and summarize() — or something similar in Python. Once you have a list of gene_ids that do not add to a multiple of 3, you can exclude lines containing them.

However, this all assumes that the GTF file is fixed. It might be better to go back a step and figure out why there are so many genes that DON'T sum to a multiple of 3, as this could be indicative of a larger problem.

Let me know if this helps!

Yours, Chase

singing-scientist avatar Jul 18 '20 05:07 singing-scientist

Chase: thanks vary much , I will inspect my gtf file according to your suggestion.

shiyi-pan avatar Jul 19 '20 12:07 shiyi-pan

No problem at all! Please let me know how it turns out — I will leave this thread open for now.

singing-scientist avatar Jul 21 '20 06:07 singing-scientist

hi, I run my SNPGenie with your suggestion and don't get right result.my population_summary.txt is like that:

file sites sites_coding sites_noncoding pi pi_coding pi_noncoding N_sites S_sites piN piS mean_dN_vs_ref mean_dS_vs_ref mean_ gdiv_polymorphic mean_N_gdiv mean_S_gdiv mean_gdiv sites_polymorphic mean_gdiv_coding_poly sites_coding_poly mean_gdiv_non coding_poly sites_noncoding_poly chr01.recode.vcf 56831624 2829974 54001650 0 0 0 1742252 494230 0 0 0 0 * * * 00 * 0 * 0

and my site_results.txt file like that:

file product site ref_nt maj_nt position_in_codon overlapping_ORFs codon_start_site codon pi gdiv class_vs_ref class coverage A C G T . Is this software can't be used in Homozygous locus ??

shiyi-pan avatar Jul 23 '20 13:07 shiyi-pan

I'm afraid it is impossible for me to evaluate what is happening without more details. What is your goal for using SNPGenie? SNPGenie estimates πN and πS (or mean dN and dS), which are population/species-level measures of diversity/divergence, as described in the publications. Certainly, it will not work for only one homozygous locus (one individual?). If you need help with a specific issue, please provide an exact example with input files and command line arguments.

singing-scientist avatar Jul 24 '20 06:07 singing-scientist

test.zip thank you ,here is my data.

shiyi-pan avatar Jul 24 '20 09:07 shiyi-pan

Thank you. This VCF file only contains genotypes, but SNPGenie uses nucleotide frequency data. Each sample or AF (allele frequency) estimate in the VCF file should refer to a collection of individuals or a pooled sample, not a single individual. Is your intention to estimate diversity within a sample of a population? Before using SNPGenie, I recommend reading the SNPGenie papers and documentation, especially on the acceptable VCF file format. Alternatively, if you could carefully describe your data, I could try to give advice on the best approach.

singing-scientist avatar Jul 24 '20 09:07 singing-scientist

I‘m sorry I used wrong vcf file,now I add the AN and AF information into my VCF file and it report some warning like that: could you give me some advise?

WARNING: In final_chr01.vcf|Glyma.01G001900.Wm82.a2.v1|284862,

the nucleotide total (which should be 100.00%) is instead: 187.16%.

This should occur only when conflicting coverages have been reported.

Illegal division by zero at /ds3512/home/panyp/750cexu/ka-ks/SNPGenie/snpgenie.pl line 6725.

here is my files with ritght VCF file.

test.zip

here is my script:

#!/bin/bash

#$ -cwd #$ -j y #$ -S /bin/bash

#$ -l mem_free=4G

PERL=/ds3512/home/panyp/ruanjian/perl/bin/perl SNPGenie=/ds3512/home/panyp/750cexu/ka-ks/SNPGenie

$PERL $SNPGenie/snpgenie.pl --vcfformat=1 --snpreport=new_chr01.vcf --fastafile=Chr01.fa --gtffile=chr01.cds.gtf

I try to calculate the dN and dS in different genome region. Thank you very much.

shiyi-pan avatar Jul 24 '20 14:07 shiyi-pan

Hello Shiyi! Sorry for the trouble. The first error I get is this:

## WARNING: In new_chr01.vcf, the variant at site 206445,
## the variant data imply a negative number of T nucleotides: -138.253611556982. This most often results from
## variants which are assigned to the wrong site in the SNP Report, or **multiple copies of the same genes/exons in the GTF file**. Results at this site
## are unreliable; T count set to 0; proceed with caution.

When I look at the GTF file I do see four entries for the same gene, Glyma.01G001100.Wm82.a2.v1, from 206377-206462. For example:

Chr01 phytozomev10 CDS 206377 206462 . + 0 transcript_id "Glyma.01G001100.3.Wm82.a2.v1"; gene_id "Glyma.01G001100.Wm82.a2.v1"; gene_name "Glyma.01G001100";

Thus, this error (and, I think, all others) is occurring because there is more than one entry in the GTF file for the same gene_id. It should work if you filter to just one transcript per gene, for example, the longest.

Let me know! Chase

singing-scientist avatar Aug 07 '20 05:08 singing-scientist

thank you for your reply. I have delete these genes and it still has this error. is't because the dS in gene is zeno ?

shiyi-pan avatar Aug 12 '20 14:08 shiyi-pan

I think that dS=0 is not likely to cause this error, because this error is about read depth of each allele. I would be happy to investigate if you can provide the exact input, as before.

singing-scientist avatar Aug 13 '20 07:08 singing-scientist

ok,thank you. here is the gtf file and you can use the VCF and fasta file i upload before.

files.zip

shiyi-pan avatar Aug 16 '20 02:08 shiyi-pan

This VCF format has changed and does not match --vcfformat=1. Thus, it is not appropriate to use the command from before:

$PERL $SNPGenie/snpgenie.pl --vcfformat=1 --snpreport=new_chr01.vcf --fastafile=Chr01.fa --gtffile=chr01.cds.gtf

Because I cannot guess what command you used, in order to help, I need you to provide the exact command and the exact input used.

singing-scientist avatar Aug 17 '20 08:08 singing-scientist

ok, I'm sorry for bother you a lot. here is the input files.thank you very. kaks.zip

shiyi-pan avatar Aug 17 '20 13:08 shiyi-pan

Dear @shiyi-pan: apologies for the delay; it took me some time to identify the problem. It appears your VCF file contains impossible allele frequencies. For example, at site 343991, the VCF file reports AN=1;AF=0.0016051364365971107. However, if only one allele is sequenced (AN=1), then it is not possible for there to be any variation, and there cannot be a minor allele with a frequency of 0.16% (AF). As a result, 2 alleles are being reported where SNPGenie expects only 1, which results in a negative number and/or division by 0.

I have updated SNPGenie to include a warning alerting users to this circumstance, in case it happens again in the future. Unfortunately, this is an error that occurs before SNPGenie is used, and there appears to be a problem with VCF generation or SNP calling. Please let me know if you find the source of the issue, or I have misunderstood.

Yours, Chase

singing-scientist avatar Aug 22 '20 19:08 singing-scientist

thank you for your help,it runs corrently now.thank your very much.Chase.

shiyi-pan avatar Aug 26 '20 01:08 shiyi-pan