TALON icon indicating copy to clipboard operation
TALON copied to clipboard

Transcripts identified possibly without supporting reads.

Open JasperBoom opened this issue 3 years ago • 7 comments

Dear TALON team,

We are analysing a nanopore human dataset using TALON and came across some interesting transcripts with a high abundance. Looking further into these transcripts, we hoped to see which reads were the basis for the transcripts.

We came across a number of transcripts that don't seem to have reads supporting the genomic start and end positions and the full body of the transcript when looking at them in IGV. But there are a lot of reads assigned to the transcript in the abundance file.

I was hoping you could maybe help explain what could be happening here, and what we are seeing?

I am adding the bam and sam files from a region I extracted (chr11:2159000-2160000) and used that to rerun TALON on a small subset. I filtered out reads which had a specific insert that match the known INS-IGF2 gene (787N) to try and remove reads that showed a high expression of the known gene in that region. This example does concern a genomic transcript (not NIC or NNC) (TALONT000227914).

I've also added the abundance and gtf files.

The transcript does seem to match the start of one read: IGV_1

And the end of another: IGV_2

But no reads fully cover and only cover this transcript (a few are longer).

We initially found this transcript when running 6 samples together. The transcript was also detected when using one sample and also when I used this extracted region.

Hopefully you have time to help understand this. Thank you very much for your time!

Cheers, Jasper

I am using the most recent version of TALON by cloning the repo and using pip install .. Alignment using Minimap2 & correction using TranscriptClean (default settings).

Database:

talon_initialize_database \
    --f=gencode.v33.annotation.gtf \
    --g="hg38" \
    --a="gencode_v33" \
    --o=filter_based_on_787N

TALON:

talon \
    --f filter_based_on_787N_samplesheet.csv \
    --db filter_based_on_787N.db \
    --build "hg38" \
    --threads 5 \
    --o filter_based_on_787N

Abundance:

talon_abundance \
    --db=filter_based_on_787N.db \
    -a "gencode_v33" \
    -b "hg38" \
    --o=filter_based_on_787N

GTF:

talon_create_GTF \
    --db=filter_based_on_787N.db \
    -b "hg38" \
    -a "gencode_v33" \
    --observed \
    --o=filter_based_on_787N

example.zip

JasperBoom avatar Jan 07 '21 12:01 JasperBoom

Hi there,

TALON treats transcript starts and ends differently than splice junctions, that is, there is flexibility in terms of distance of the read's start/end to the annotated (in the case of Known transcripts) or created (in the case of novel transcripts) starts or ends of the transcript model. This flexibility is tunable for the talon_initialize_database step using the --3p and --5p options.

That being said, TALON outputs a file that easily shows you what transcript each read belongs to so you don't have to go through all the effort of filtering your sam files etc. Take a look at the read_annot.tsv file that was output from your TALON run. Its contents are detailed on the mail TALON README but in short there is a row for each read along with a column for what transcript that read was assigned to.

Let me know if you have any more questions!

fairliereese avatar Jan 18 '21 23:01 fairliereese

Hey!

Thank you for your input! I checked out the read annotation file you mention.

We exracted the reads assigned to this particular transcript and looked at them in IGV. I now see the one read that is responsible for the start of the transcript clearly (the first in IGV). And I spotted some short reads that match the end of the transcript.

However, we were wondering why some of these long reads shown in IGV, that have a short partial match to the end of the newly annotated transcript, are counted as supporting reads for this transcript? It seems that the largest part of those reads don't cover the transcript. Is this simply because the ends of the reads are within the default 500 bp range?

chr11:2159300-2159997 (TALONT000227914)
chr11:2159779-2161195 (3rd read from the top)

The abundance for TALONT000227914 is shown as 235.

Thank you again for your time!

igv

TALONT000227914.zip

JasperBoom avatar Jan 19 '21 15:01 JasperBoom

As you mention, this behavior is due to the --5 and --3 settings if all of the reads you see have starts and ends within the default windows given for these arguments.

This effect is certainly exaggerated as you can see in your example where you have a bunch of monoexonic reads that all map to roughly the same locus. Because TALON cannot perform intron chain matching (as there are no introns) between reads belonging to this locus, it instead has to rely on its 5' and 3' identification in order to determine which transcript each read is assigned to.

fairliereese avatar Feb 03 '21 18:02 fairliereese

Hey!

My supervisor pointed out that I was not entirely correct with regards to the distance between the 3' and 5' ends of the transcript and some of the reads being assigned to the transcript.

I have collected the genomic positions of the first seven reads shown in the IGV screenshot below. I have also added the genomic position of the transcript in question TALONT000227914 (red arrow).

## TALONT000227914
* chr11:2159300-2159997

## 5df4c53d-6e19-4a71-8d28-bf949a02379c (read 1)
* chr11:2159300-2160161

## af6c6120-fb71-4d6c-85c0-e634291ace2d (read 2)
* chr11:2159786-2161175

## dec48831-8971-4928-b597-794836b7cb60 (read 3)
* chr11:2159779-2161195

## 5befd74a-3f20-4d27-ac8c-9741b9e1afb0 (read 4)
* chr11:2159797-2161195

## c34eb839-f192-4056-bbc5-cd7c590f9fd6 (read 5)
* chr11:2159778-2160638

## cc074cc0-ad66-4d56-b22a-07361f09f88c (read 6)
* chr11:2159780-2160969

## 0ddcb8e9-5077-41f5-b2a9-1f3357040191 (read 7)
* chr11:2159780-2160972

The first read is the one responsible for the start location of the transcript and fits fine.

But it seems that the distance between, for example, the second read (af6c6120-fb71-4d6c-85c0-e634291ace2d) 5' end and the 5' end of the transcript is beyond the default 500 bp. The same goes for the 3' end (300 bp default).

Are the --3p and --5p settings used in relation to the distance between the novel transcript and the reads or is this threshold set in some other way?

Are we correctly seeing that the threshold is not being enforced here?

Thank you for your help!

GenomicPositionsTALONT000227914Marked

JasperBoom avatar Feb 10 '21 11:02 JasperBoom

Good catch, upon double checking of your read starts and ends compared to the transcript's start and end, it looks like all 5' coordinates are within 500 bp of the transcript start, but not all 3' coordinates are within the default 300 bp of the transcript end. I'm going to reach out to Dana to ask if she has any insight on this behavior. I will let you know what we find.

fairliereese avatar Feb 10 '21 18:02 fairliereese

Hi there, after following up with Dana, both of us are a little puzzled by this behavior and not sure what it stems from but it seems to boil down to the fact that working with monoexonic transcripts is tricky. I can investigate it further with the data you've provided but can't promise anything any time soon.

If you're interested in taking a look into it yourselves, some functions you might want to take a look at are in the main talon.py file, specifically the match_monoexon_vertices and identify_monoexon_transript functions.

If you decide to check it out please don't hesitate to ask me about details about the program you find confusing, if any!

fairliereese avatar Feb 17 '21 20:02 fairliereese

Thank you for looking into this! I will let you know if I have any questions!

JasperBoom avatar Feb 22 '21 08:02 JasperBoom