grenedalf
grenedalf copied to clipboard
error regarding vcf file from PoolSNP
Hello,
I am trying to run grenedalf frequency on my vcf file generated by PoolSNP from an mpileup file using
GRENEDALF=/gxfs_home/geomar/smomw573/software/grenedalf/bin/grenedalf
VCF=/gxfs_home/geomar/smomw573/work/seasonal_adaptation/analysis/PoolSNP/finalfile.vcf.gz
GENOME=/gxfs_work1/geomar/smomw573/seasonal_adaptation/genome/Eaffinis.Baltic.PseudoRef.Mar22.fasta \
OUTPUT=/gxfs_home/geomar/smomw573/work/seasonal_adaptation/analysis/grenedalf-freq
$GRENEDALF frequency --vcf-path $VCF --allow-file-overwriting --reference-genome-fasta-file $GENOME --write-sample-alt-freq > $OUTPUT/grenedalftrial.log
and got the following error:
Cannot iterate over VCF file /gxfs_home/geomar/smomw573/work/seasonal_adaptation/analysis/PoolSNP/finalfile.vcf using the "AD" FORMAT field to count allelic depths, as that field is not part of the VCF file.
PoolSNP output vcf head
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT EA_2009_T1 EA_2009_T2 EA_2009_T3 EA_2009_T4 EA_2011_T1 EA_2011_T2 EA_2015_T1 EA_2015_T2 EA_2015_T3 EA_2015_T4 EA_2022_T1 EA_2022_T2 EA_2022_T3 EA_2022_T4
TRINITY_DN17_c1_g1_i1 185 . G A . . ADP=82.42857142857143;NC=0 GT:RD:AD:DP:FREQ ./.:.:.:.:. 0/1:51:1:52:0.02 0/1:71:3:74:0.04 0/1:94:3:97:0.03 0/1:64:2:66:0.03 0/1:107:2:109:0.02 0/1:109:2:111:0.02 0/0:91:0:91:0.0 0/1:48:2:50:0.04 0/0:96:0:96:0.0 0/1:99:2:101:0.02 0/1:92:1:93:0.01 0/1:79:3:82:0.04 0/1:100:2:102:0.02
TRINITY_DN17_c1_g1_i1 190 . C G . . ADP=83.92857142857143;NC=0 GT:RD:AD:DP:FREQ ./.:.:.:.:. 0/1:48:3:51:0.06 0/1:73:4:77:0.05 0/1:96:4:100:0.04 0/1:67:1:68:0.01 0/1:105:5:110:0.05 0/1:108:2:110:0.02 0/1:92:3:95:0.03 0/1:48:2:50:0.04 0/1:94:4:98:0.04 0/1:104:2:106:0.02 0/1:87:4:91:0.04 0/1:82:3:85:0.04 0/0:102:0:102:0.0
The vcf file was then formated with the following python script:
# Open the VCF file in read mode
with gzip.open(vcf_file_path, 'rt', encoding='utf-8') as vcf_file:
# Iterate through each line in the file
with open(output_file_path, 'a', encoding='utf-8') as output_file:
line_number = 0
for line in vcf_file:
if line.startswith('#'):
continue
# Increment the line number (Move this line outside the loop)
# line_number += 1
# Split the line into fields based on tabs
fields = line.strip().split('\t')
# 9th column forward
fsplit = fields[8].split(':')
modified_fields = ':'.join([fsplit[0], fsplit[2], fsplit[3], fsplit[4]])
fields[8] = modified_fields
idx = 9
for fcolumn in fields[9:]:
tmpsplit = fcolumn.split(':')
tmpjoin = ':'.join([tmpsplit[0], ','.join([tmpsplit[1], tmpsplit[2]]), ':'.join([tmpsplit[3], tmpsplit[4]])])
fields[idx] = tmpjoin
idx += 1
# Write the modified line to the output file
output_file.write('\t'.join(fields) + '\n')
line_number += 1
#print(line_number)
# Print line number every 50,000 lines
if line_number % 50000 == 0:
print(f"Processed {line_number} lines.")
and we corrected the AD field in the vcf header
##FORMAT=<ID=AD,Number=R,Type=Integer,Description="Allelic depths for the ref and alt alleles in the order listed">
and now grenedalf frequency in running.
Many thanks,
Jennifer