
Fix mentioned in #933 isn't working for 23andMe: rows are skipped with the NCBI reference

Open stereotypy opened this issue 3 years ago • 1 comments

Hi there, I was having the same problem as in #933, and the fix below isn't working with the files from NCBI here: https://www.ncbi.nlm.nih.gov/assembly/GCF_000001405.25/

Both specifying the column orientation and making the CHROM identifiers the same b/t 23andme file and the ref.fa solved the issue

Make "chr" absent ref.fa:
    cat GRCh37.p13.genome.fa | sed 's/>chr/>/g' > GRCh37.23andme.fa

Specify column order for bcftools convert:
    bcftools convert -c ID,CHROM,POS,AA --tsv2vcf genome_p1.txt -f GRCh37.23andme.fa -s genomeP1 -Oz -o genomeP1.vcf.gz

Thanks!
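Since the quoted fix depends on the chromosome names being identical in both files, here is a minimal sketch of a check for that. The function name `chroms_missing_from_fasta` is made up for illustration, and it assumes the 23andMe file is tab-separated with `#` comment lines and the chromosome in column 2 (matching `-c ID,CHROM,POS,AA`).

```shell
# Hypothetical helper, not part of bcftools.
# Prints chromosome names used in a 23andMe-style TSV (column 2) that have no
# identically named sequence in the FASTA. Empty output means every name matches.
chroms_missing_from_fasta() {
    tsv=$1
    fasta=$2
    fa_names=$(mktemp)
    tsv_names=$(mktemp)
    # Sequence name = header text after ">" up to the first whitespace.
    grep '^>' "$fasta" | sed 's/^>//' | awk '{print $1}' | sort -u > "$fa_names"
    # Skip "#" comment lines; column 2 is assumed to hold the chromosome.
    grep -v '^#' "$tsv" | awk -F'\t' '{print $2}' | sort -u > "$tsv_names"
    # comm -23 keeps lines that appear only in the first (TSV) list.
    comm -23 "$tsv_names" "$fa_names"
    rm -f "$fa_names" "$tsv_names"
}

# Usage (file names taken from the commands above):
#   chroms_missing_from_fasta genome_p1.txt GRCh37.23andme.fa
```

If this prints every chromosome in the 23andMe file, the names in the reference probably use a different scheme, which would be consistent with all rows being skipped.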

I also tried indexing it with samtools faidx as suggested in #1076 but it's still skipping everything.

The fix for now for me was using Ensembl's data here: ftp://ftp.ensembl.org/pub/grch37/current/fasta/homo_sapiens/dna/Homo_sapiens.GRCh37.dna.primary_assembly.fa.gz

I am new to this and don't know enough about FASTA format issues to understand why this is happening, but from a quick peek the headers look a bit different between NCBI and Ensembl. It would be cool to have something built into bcftools, maybe a flag for the type of reference data you're working with, to avoid these problems.
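One possible explanation, offered as an assumption because no headers are shown here: NCBI RefSeq FASTA files usually name sequences by accession (for example `>NC_000001.10 ...` for chromosome 1 of GRCh37), while Ensembl uses plain names such as `>1`. A rough sketch of renaming such headers follows; `refseq_to_plain_names` is a hypothetical helper, and the 23/24/NC_012920 to X/Y/MT mapping assumes the human RefSeq numbering.

```shell
# Hypothetical helper, not part of bcftools or samtools.
# Rewrites RefSeq-style chromosome headers (>NC_000001.10 ...) to plain names (>1).
# Headers that do not match (patches, unplaced scaffolds) pass through unchanged.
refseq_to_plain_names() {
    awk '
        /^>NC_0000[0-9][0-9]\./ {
            n = substr($1, 9, 2) + 0      # last two accession digits: "01".."24" -> 1..24
            if (n == 23) n = "X"          # assumed human numbering: 23 = X, 24 = Y
            else if (n == 24) n = "Y"
            print ">" n
            next
        }
        /^>NC_012920\./ { print ">MT"; next }   # assumed mitochondrial accession
        { print }
    ' "$1"
}

# Usage (input file name is a placeholder):
#   refseq_to_plain_names ncbi_reference.fa > renamed.fa
#   samtools faidx renamed.fa
```

Running `grep '^>' ref.fa | head` on the reference first would show whether the headers actually look like this before renaming anything.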

stereotypy, May 05 '21 21:05

Since you are not showing any data, it is impossible to tell what went wrong. The issues you reference give some useful tips; I'm not sure how to help more here.

pd3, May 24 '21 14:05