"Duplicate" kgID when doing annotation.

Open sxwcasd opened this issue 2 years ago • 2 comments

I am trying to run the annotation script (svaba-annotate.R) against the GENCODE database. However, the UCSC records pulled by these two lines contain duplicates: https://github.com/walaj/svaba/blob/0f60e366c300bbefbba762bcc6d2b661bd2ae74a/R/svaba-annotate.R#L59-L60

Example:

                   kgID         mRNA      geneSymbol   spID       refSeq chrom   txStart     txEnd strand
  1: ENST00000244174.11    NM_002186            IL9R Q01113    NM_002186  chrX 155997695 156010817      +
  2: ENST00000244174.11    NM_002186            IL9R Q01113    NM_002186  chrY  57184215  57197337      +

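This makes sense to me: the gene sits at different positions on the two sex chromosomes but shares the same mRNA. However, it triggers an error at the following line (https://github.com/walaj/svaba/blob/0f60e366c300bbefbba762bcc6d2b661bd2ae74a/R/svaba-annotate.R#L67): Error: !any(duplicated(genes$kgID)) is not TRUE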

Maybe we could have a better validation check here?
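
For illustration, here is a minimal base R sketch of what a relaxed check might look like. It assumes `genes` is a data frame with the columns shown above. Keying uniqueness on `kgID` plus `chrom` is an assumption made for this example, not the check svaba currently uses.

```r
# Toy copy of the two records above (only the columns needed here).
genes <- data.frame(
  kgID    = c("ENST00000244174.11", "ENST00000244174.11"),
  chrom   = c("chrX", "chrY"),
  txStart = c(155997695L, 57184215L),
  txEnd   = c(156010817L, 57197337L),
  stringsAsFactors = FALSE
)

# The existing check rejects this input because kgID repeats.
has_dup_kgid <- any(duplicated(genes$kgID))  # TRUE for this input

# Hypothetical relaxed check: a row counts as a duplicate only if the
# same kgID appears more than once on the same chromosome.
dup_by_locus <- duplicated(genes[, c("kgID", "chrom")])
stopifnot(!any(dup_by_locus))
```

With this key, the IL9R records on chrX and chrY pass, while a true repeat of the same transcript on one chromosome would still fail.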

sxwcasd avatar Jun 23 '22 20:06 sxwcasd

Hmm, that is an unusual situation with a gene listed on multiple chromosomes -- interesting catch. Happy to entertain a pull request if you have a better validation scheme.

walaj avatar Oct 11 '22 07:10 walaj

Because we need both GENCODE and exon frame information, we loaded the wgEncodeGencodeCompV36 table instead of knownGene or kgXref, and validated by start and end sites. I'm not sure whether that fits your original design (we don't need to consider RefSeq ID mapping), but it seems to accomplish our goals and gets around this issue.
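
As a rough sketch of what validating by start and end sites could look like in base R: the column names (`name`, `chrom`, `txStart`, `txEnd`) and the helper `check_unique_sites` are assumptions for illustration, and the actual code used for this workaround is not shown in the thread.

```r
# Toy rows shaped like a genePred-style table; column names are assumed.
tx <- data.frame(
  name    = c("ENST00000244174.11", "ENST00000244174.11"),
  chrom   = c("chrX", "chrY"),
  txStart = c(155997695L, 57184215L),
  txEnd   = c(156010817L, 57197337L),
  stringsAsFactors = FALSE
)

# Hypothetical helper: fail only when two rows share the transcript name,
# chromosome, start site and end site, and report the offending keys.
check_unique_sites <- function(tx) {
  key <- paste(tx$name, tx$chrom, tx$txStart, tx$txEnd, sep = ":")
  dup <- duplicated(key)
  if (any(dup)) {
    stop("duplicate transcript records: ",
         paste(unique(key[dup]), collapse = ", "))
  }
  invisible(TRUE)
}

check_unique_sites(tx)
```

Including the coordinates in the key lets the chrX and chrY copies coexist, and the error message names the rows that are exact repeats.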

sxwcasd avatar Oct 14 '22 17:10 sxwcasd