grenedalf icon indicating copy to clipboard operation
grenedalf copied to clipboard

error regarding vcf file from PoolSNP

Open JennyCNS opened this issue 1 year ago • 3 comments

Hello,

I am trying to run grenedalf frequency on my vcf file generated by PoolSNP from an mpileup file using

GRENEDALF=/gxfs_home/geomar/smomw573/software/grenedalf/bin/grenedalf
VCF=/gxfs_home/geomar/smomw573/work/seasonal_adaptation/analysis/PoolSNP/finalfile.vcf.gz
GENOME=/gxfs_work1/geomar/smomw573/seasonal_adaptation/genome/Eaffinis.Baltic.PseudoRef.Mar22.fasta \
OUTPUT=/gxfs_home/geomar/smomw573/work/seasonal_adaptation/analysis/grenedalf-freq


$GRENEDALF frequency --vcf-path $VCF --allow-file-overwriting --reference-genome-fasta-file $GENOME --write-sample-alt-freq > $OUTPUT/grenedalftrial.log

and got the following error:

Cannot iterate over VCF file /gxfs_home/geomar/smomw573/work/seasonal_adaptation/analysis/PoolSNP/finalfile.vcf using the "AD" FORMAT field to count allelic depths, as that field is not part of the VCF file.

PoolSNP output vcf head

#CHROM  POS     ID      REF     ALT     QUAL    FILTER  INFO    FORMAT  EA_2009_T1      EA_2009_T2      EA_2009_T3      EA_2009_T4      EA_2011_T1      EA_2011_T2    EA_2015_T1      EA_2015_T2      EA_2015_T3      EA_2015_T4      EA_2022_T1      EA_2022_T2      EA_2022_T3      EA_2022_T4
TRINITY_DN17_c1_g1_i1   185     .       G       A       .       .       ADP=82.42857142857143;NC=0      GT:RD:AD:DP:FREQ        ./.:.:.:.:.     0/1:51:1:52:0.02      0/1:71:3:74:0.04        0/1:94:3:97:0.03        0/1:64:2:66:0.03        0/1:107:2:109:0.02      0/1:109:2:111:0.02      0/0:91:0:91:0.0 0/1:48:2:50:0.04      0/0:96:0:96:0.0 0/1:99:2:101:0.02       0/1:92:1:93:0.01        0/1:79:3:82:0.04        0/1:100:2:102:0.02
TRINITY_DN17_c1_g1_i1   190     .       C       G       .       .       ADP=83.92857142857143;NC=0      GT:RD:AD:DP:FREQ        ./.:.:.:.:.     0/1:48:3:51:0.06      0/1:73:4:77:0.05        0/1:96:4:100:0.04       0/1:67:1:68:0.01        0/1:105:5:110:0.05      0/1:108:2:110:0.02      0/1:92:3:95:0.03        0/1:48:2:50:0.04      0/1:94:4:98:0.04        0/1:104:2:106:0.02      0/1:87:4:91:0.04        0/1:82:3:85:0.04        0/0:102:0:102:0.0

The vcf file was then formated with the following python script:

# Open the VCF file in read mode
with gzip.open(vcf_file_path, 'rt', encoding='utf-8') as vcf_file:
    # Iterate through each line in the file
    with open(output_file_path, 'a', encoding='utf-8') as output_file:
        line_number = 0

        for line in vcf_file:
            if line.startswith('#'):
                continue

            # Increment the line number (Move this line outside the loop)
            # line_number += 1
            # Split the line into fields based on tabs
            fields = line.strip().split('\t')
            # 9th column forward
            fsplit = fields[8].split(':')
            modified_fields = ':'.join([fsplit[0], fsplit[2], fsplit[3], fsplit[4]])
            fields[8] = modified_fields

            idx = 9
            for fcolumn in fields[9:]:
                tmpsplit = fcolumn.split(':')
                tmpjoin = ':'.join([tmpsplit[0], ','.join([tmpsplit[1], tmpsplit[2]]), ':'.join([tmpsplit[3], tmpsplit[4]])])
                fields[idx] = tmpjoin
                idx += 1

            # Write the modified line to the output file
            output_file.write('\t'.join(fields) + '\n')

            line_number += 1
            #print(line_number)

            # Print line number every 50,000 lines
            if line_number % 50000 == 0:
                print(f"Processed {line_number} lines.")

and we corrected the AD field in the vcf header

##FORMAT=<ID=AD,Number=R,Type=Integer,Description="Allelic depths for the ref and alt alleles in the order listed">

and now grenedalf frequency in running.

Many thanks,

Jennifer

JennyCNS avatar Jul 31 '23 09:07 JennyCNS