arriba icon indicating copy to clipboard operation
arriba copied to clipboard

How to recover a known fusion that is not in the known_fussions_list?

Open elmerfer opened this issue 2 years ago • 11 comments

Dear Sebasttian

First of all, thank you very much for this outstanding tool. I have the following problem.

In the fusion_discarded file I have the following row gene1 gene2 strand1(gene/fusion) strand2(gene/fusion) breakpoint1 breakpoint2 ..... BCOR NTRK2 -/- +/+ X:40063712 9:84814455 CDS 3'UTR translocation .....

So, I whant to recover it in the main Fusion Table, because I whant to get the fusion_transcript and peptide_sequence.

So, I have added the following line to the end of the uncompressed version of ARRIBA/database/known_fusions_hg38_GRCh38_v2.1.0.tsv.gz file, but it does not work. What I am doing wrong? Thank you very much in advance.

#ZMYND8 CEP250 -20:47209214-47356889 +20:35455164-35519280 Mitelman #BCOR NTRK2 -X:40049815-40177329 +9:84668458-85027070 MYTAG

this is the code system2(command = software$Software$ARRIBA$command, args = c(paste0("-x ",sbjBamFile), paste0("-o ",fusion.file), paste0("-O ",fusion.discarded.file), paste0("-a ",GenomeAssembly), paste0("-g ", GenomeGTF), paste0("-b ", file.path(software$Software$ARRIBA$database, paste0("blacklist_",genomeversion,"v2.1.0.tsv.gz"))), paste0("-k ", "/media/respaldo4t/Softwares/ARRIBA/known_fusions_hg38_GRCh38_v2.1.0.tsv"), paste0("-t ", "/media/respaldo4t/Softwares/ARRIBA/known_fusions_hg38_GRCh38_v2.1.0.tsv"), paste0("-p ", file.path(software$Software$ARRIBA$database, paste0("protein_domains",genomeversion,"_v2.1.0.gff3"))) ))

elmerfer avatar Nov 12 '21 19:11 elmerfer

Your added line looks good to me. Just make sure that it is tab-separated (Arriba should print a warning if that's not the case).

Fusions listed in the know fusions file will only be recovered if they are supported by at least two reads. Otherwise, the false positive rate would be too high. I can't see the supporting read count in your output, so I can only guess that this must be the reason.

In order to force a fusion to be rescued regardless of read support, you can simply put the exact breakpoints in the known fusions file (separated by a tab):

X:40063712	9:84814455	MYTAG

Note: This method is meant to be used for recurrent breakpoints (which are typically at splice sites). You probably don't want to list all possible breakpoint combinations for a pair of genes in the known fusions file. It does not scale well. But if the peptide/transcript sequence is all that you want, then it should do the trick for this one instance.

As a last resort, you can run arriba with -X, in which case it will report peptide/transcript sequences for discarded fusions, too. However, this can be quite slow and the output file will be big. So this is also something I cannot recommend for routine use.

suhrig avatar Nov 12 '21 20:11 suhrig

Dear Sebastian Thank you very much for your answer. In the following Table you can see the discarded fusion that we want to recover (Discarded Sheet) . So you can see what was the possible rejection couse.

And I have some other questions regarding this:

  1. You said: "Fusions listed in the know fusions file will only be recovered if they are supported by at least two reads." But where is this information ? For instance in the returned fusions (not the discarded ones) and present "Fusion sheet" of the following Table, is it possible to see that spleat reads were 0 for both genes. and also with "low" confidence as in the case of the "Discarded sheet" example. How should I interpret this?

  2. We are interested in recover all the potential fusions related to one particular genes (let's say BRAF), that's means that I should add to the known fusion file the following line (tab separated) #BRAF 7:140713328-140924929

  3. You mentioned that we can run Arriba with the -X option. However as you said it may be too slow and generate a too big file. But, do you think that it would be ok if we pick the fusion of interest with the exact braekkpoint (as you said for instance X:40063712 9:84814455 MYTAG) and run arriba in a second run replacing the known_fusions file for a file with such desired fusions? It will work?

Thank you very much in advance. Best regards

Elmer PD: We are using Arriba very frequently, so I'll be back sure with more questions!

elmerfer avatar Nov 16 '21 18:11 elmerfer

Dear Elmer,

You said: "Fusions listed in the know fusions file will only be recovered if they are supported by at least two reads." But where is this information ?

It is hard-coded in Arriba's code. By default, it only rescues known fusions with at least two supporting reads. If both breakpoints are at splice sites (plus some additional conditions are met), then the fusion may be rescued even when it has only one supporting read. If you explicitly specify a pair of breakpoints in the known fusions file, the fusion is always rescued - regardless of read support.

For instance in the returned fusions (not the discarded ones) and present "Fusion sheet" of the following Table

Sorry, I can't quite follow. The title of the table reads "Fusions Discarded", so it's not the returned fusions, it's the discarded ones, isn't it?

How should I interpret this?

Arriba counts split_reads1 + split_reads2 + discordant_mates. If the sum is >= 2 and the fusion is listed in the known fusion file, the event is rescued.

We are interested in recover all the potential fusions related to one particular genes (let's say BRAF), that's means that I should add to the known fusion file the following line (tab separated) #BRAF 7:140713328-140924929

No, this will not work. You have to specify a pair of breakpoints of regions or positions (separated by a tab). You cannot simply specify a single region. Moreover, BRAF should be in the second column, because it is the 3' end of the fusion. In case you have not seen it yet, the format of the known fusions file is described in the manual.

If you want to have a look at all potential fusion candidates, you are best advised to extract all lines related to this gene from the discarded fusions file. Beware that there are many false positives in the discarded file, however.

However as you said it may be too slow and generate a too big file.

If you need it only for one particular sample, it is fine to use the -X option. I just don't recommend using it in a production pipeline, because it can take an hour or two to run and most of the results are not of interest.

and run arriba in a second run replacing the known_fusions file for a file with such desired fusions? It will work?

If you use -X, Arriba will predict the peptide sequence for all discarded fusions - whether they are listed in the known fusions file or not. So it has no effect on the speed/file size if you list your fusions of interest in the known fusions file or not. If you want to speed things up, then you should give Arriba only the region of interest as input. For example, you can extract the reads mapping to the BRAF locus and pass only those reads to Arriba with -X enabled. Since there will be much fewer reads to process (and thus fewer returned fusions/discarded fusions), -X should be quick. Here are the commands for this:

# extract names of reads mapping to BRAF
samtools view "$BAM_FILE" 7:140713328-140924929 | cut -f1 > read_names.txt

# extract all reads from entire BAM file listed in previously created file and pass on to Arriba
samtools view -H "$BAM_FILE" > reads.sam
samtools view "$BAM_FILE" | grep -F -f read_names.txt >> reads.sam

# run arriba on extracted reads
arriba -x reads.sam -X ...

If you use the legacy output file Chimeric.out.sam, then you will have to apply this procedure to both the Aligned.sortedByCoord.out.bam file and the Chimeric.out.sam file, before passing them on to Arriba.

We are using Arriba very frequently, so I'll be back sure with more questions!

Sure, keep them coming. I'm glad to help.

Regards, Sebastian

suhrig avatar Nov 16 '21 20:11 suhrig

Dear Sebastian Here I go again. Up to now I am failing in adding new know fusions to recover. But the -X option works pretty well.

I have some questions regarding some STAR imputs and its impact on Arriba.

  1. What about the "per-sample 2-pass mapping" mode? is it ok to use it?
  2. if so, the use of --sjdbGTFfile will make any effect on detected fusions?

Thank you very much in advance.

elmerfer avatar Nov 25 '21 14:11 elmerfer

Up to now I am failing in adding new know fusions to recover.

Hm, if Arriba does not complain about your manually added lines, then they should be syntactically correct. If you pass the file to the -t parameter with a custom tag as you have done before (MYTAG), then the recognized fusions should at least get tagged in the tags column in the discarded fusions file (even if they are not rescued). If they are tagged, then this means that some other filter overwrites the known fusions filter, which is intended behavior, because there would be too many false positives if everything in the known fusions file was rescued without exception. If they are not even tagged, then Arriba does not recognize your newly added fusion for whatever reason. I will do some tests on my end to see if there are bugs in the known fusions filter or in parsing of the files.

What about the "per-sample 2-pass mapping" mode? is it ok to use it?

I use it frequently. I am not aware of any problems. You may get some more read-through fusion predictions than in single-pass mode, because two-pass mode mainly improves alignment of spliced reads with short overhang. But nothing of concern.

if so, the use of --sjdbGTFfile will make any effect on detected fusions?

This parameter is normally used during index generation. There is not necessarily a need to use it for the mapping stage. You can use it for mapping to inject annotation that was not available during index generation. This will probably have an effect on the alignments and thus fusion prediction. However, if the additional splice junctions added this way are of good quality, then it should only improve the detection rate, because splice junctions are very helpful in distinguishing true fusions from artifacts.

suhrig avatar Nov 25 '21 15:11 suhrig

I did some further tests with the known fusions filter. I found a minor bug: It did not complain when mandatory columns were empty and just silently ignored the affected lines. Other than that, the filter works fine as far as I can see. The most likely explanation why it seems to have no effect in your case is because a later filter overrules it. If you want, I can have a look at you specific case. But you will have to send me the FastQs (or at least the reads mapping to your genes of interest) and your edited known fusions file. Otherwise, I don't think I can debug this issue properly. Let me know if this is an option for you.

suhrig avatar Nov 27 '21 17:11 suhrig

Dear Sebastian. Yes! I was thinking to send you the fastq files. I will do it soon with some free available experiments that we are facing. In the meantime the -X works fine, spite of it is a little bit more expensive in computation time and the discarded file is quite large, it is still affordable and better to reprocess all the bam file again with the added desired fusions, which are subject specific.

I think that the problem is mine and not your code, but I should do more test to figure it out better.

I will keep you updated ;-)

Thank you very muich for your time!!! I really appreciatte.

Best Elmer

elmerfer avatar Nov 29 '21 17:11 elmerfer

On second thought, it may be best to send me the BAMs. This way, I will save the effort to do the alignment, and - more importantly - there is no chance that my results will differ from yours due to different alignment parameters.

suhrig avatar Nov 29 '21 17:11 suhrig

yap! you are right. Let's do this

elmerfer avatar Nov 29 '21 18:11 elmerfer

Hi @elmerfer,

Do you still plan on sending the BAM file(s)? Or is this issue no longer relevant?

Kind regards, Sebastian

suhrig avatar Apr 14 '22 21:04 suhrig

Dear Sebastian Thank you very much for ALL your advise and interest. Up to now, we can solve it as you told me, using the -X option and then filering out all those without any read in the three columns 10 , 11 and 12 of the filtered.tsv file. Then I look for the ones I want in both files.

This is working in the meantime. But, since we have very specific fuions and/or genes to look up, I have to try the other way around. But up to now i did not have the time to test it.

I will try to do it with your test file or i will look for an open access RNAseq file and I will let you know.-

Best regards-

elmerfer avatar Apr 19 '22 18:04 elmerfer