RagTag icon indicating copy to clipboard operation
RagTag copied to clipboard

`updategff` produces wrong output if parts of the scaffolds are used in the AGP file

Open taprs opened this issue 5 months ago • 0 comments

Hi!

Not sure if that was the indended use of the updategff command (or even the AGP format), but I was trying to update annotation based on an AGP file where I introduced a break into the chromosome and got corrupted GFF file. Minimal example follows:

echo $'## gff-version 3.1\nbibaboba\ttest\tgene\t5\t6\t.\t+\t.\tID=sas' > example.gff
echo $'## agp-version 2.1\nbiba\t1\t4\t1\tW\tbibaboba\t1\t4\t+\nboba\t1\t4\t1\tW\tbibaboba\t5\t8\t+' > example.agp

$cat  example.gff

## gff-version 3.1
bibaboba        test    gene    5       6       .       +       .       ID=sas


$cat example.agp

## agp-version 2.1
biba    1       4       1       W       bibaboba        1       4       +
boba    1       4       1       W       bibaboba        5       8       +

Obviously, positions 5 and 6 of bibaboba should then be converted into positions 1 and 2 of boba. But what I get is

$ ragtag.py updategff example.gff example.agp
Fri Sep 13 11:10:44 2024 --- VERSION: RagTag v2.1.0
Fri Sep 13 11:10:44 2024 --- CMD: ragtag.py updategff example.gff example.agp
## gff-version 3.1
boba    test    gene    5       6       .       +       .       ID=sas
Fri Sep 13 11:10:44 2024 --- INFO: Goodbye

The coordinates do not get transformed.

For completeness, I attach the solution that leaves me with the correct coordinates (for BED files though):

#!/usr/bin/gawk -f
 BEGIN{
 help="\
    This script updates a BED file using the transformation defined by an AGP file. \n\
    That is, BED in coordinates of contigs gets transformed to coordinates of scaffolds. \n\
    Pleaze bgzip and tabix the BED file before operation. \n\
    EXAMPLE: updategff.awk file.agp file.bed.gz > file_transform.bed"
   if (ARGC < 2) {print help; exit 1}
   OFS="\t"
 }

 # Exploit tabix to query bed files

 NR==FNR && !/^#/ && $5=="W" {
   cmd="tabix " ARGV[2] " " $6":"$7"-"$8
   while ( cmd | getline bedline > 0 ) {
    split(bedline, bed)
    if (bed[2] < $7 || bed[3] > $8) { next }
    if ($9=="+"){
     printf "%s\t%s\t%s", $1, bed[2]-$7+$2, bed[3]-$7+$2
    } else {
     printf "%s\t%s\t%s", $1, $8-bed[3]+$2, $8-bed[2]+$2                                                                                                                           }
   if (length(bed) > 3) {
     for (i=4;i<=length(bed);i++) {
      printf "\t%s", bed[i]
      print ""
     }
    }
   }
   close(cmd)
 }

taprs avatar Sep 13 '24 09:09 taprs