cacao
Option to include/switch reference data and a global preview table
Hi Sigve
Thanks for another awesome framework. We (@umccr) are very interested in incorporating this into our reporting.
I have looked at the github repo/code and tested it locally - it works great. We have a couple of questions/comments:
- Would it be possible to feed our own reference (bed) files?
We are interested in using some of the reference data from Hartwig. Looking at the codebase, this shouldn't be a problem, as the data directory is passed as an argument and contains the reference data used for the analysis. However, this might affect the annotations that the framework reads in from the .tsv files in `cacao_utils.R` for specific clinical genomic tracks?
There is an optional flag `--target`, which, as I understand it, refers to the targeted region in the input sample?
- Would it make sense to have one global table (that checks coverage for specific genes), stratified by callability - instead of having to go through multiple tracks?
This probably links back to point 1, i.e. feeding in one specific BED track, which in this case could be a joint set of loci from various sources such as CIViC, CGI and OncoKB, and then reading in the optional annotations for this data (as in the code base). Does this align with your original idea of the framework?
- It would be useful if we could include an option to limit the hereditary cancer pathogenic loci table to the cancer predisposition genes that are also used/referenced here: https://github.com/sigven/cpsr
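To make the joint-track idea in the second point a bit more concrete, here is a minimal sketch of how loci from several sources could be combined into one de-duplicated set. It is not taken from the cacao code base: the function name `merge_loci`, the tuple layout, and the locus names and coordinates are all hypothetical placeholders.

```python
from collections import defaultdict


def merge_loci(sources):
    """Merge loci from several sources into one de-duplicated, sorted list.

    sources: dict mapping a source name (e.g. "civic") to an iterable of
             (chrom, start, end, name) tuples in BED-style coordinates.
    Returns a list of (chrom, start, end, name, source_label) tuples, where
    source_label is a comma-separated list of the sources reporting the locus.
    """
    merged = defaultdict(set)
    for source, loci in sources.items():
        for chrom, start, end, name in loci:
            merged[(chrom, start, end, name)].add(source)
    return [
        (chrom, start, end, name, ",".join(sorted(srcs)))
        for (chrom, start, end, name), srcs in sorted(merged.items())
    ]


# Hypothetical input: placeholder loci, not real reference data.
example_sources = {
    "civic": [("7", 1000, 1001, "locusA")],
    "cgi": [("7", 1000, 1001, "locusA"), ("12", 2000, 2001, "locusB")],
}
joint_track = merge_loci(example_sources)
```

Keeping the source label per locus would let the report still show where each locus came from, even though only one track is queried for coverage. Note that chromosomes are sorted as strings here, which is enough for a sketch but not a proper karyotypic order.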
Sorry about the long commentary and thanks for your time.
Cheers, Sehrish
Dear Sehrish,
Thanks a lot for your input, highly valuable! Generally, I can say that what you suggest makes perfect sense as a further development of the workflow. And parts of your ideas have been mentioned by some other colleagues here. I will get back to you shortly with my ideas/comments on what is realistic short term etc., very busy here today.
PS. You are correct about `--target`: this should refer to the targeted region of the input sample. However, I have not actually implemented it yet, so it is currently only there as a placeholder. I will update that shortly.
regards, Sigve
Hi Sigve,
Thanks for the response and I look forward to hearing back from you. Happy to coordinate/contribute always.
Regards, Sehrish
Hi Sehrish,
Coming back to this:
- What is the reference data from Hartwig? Passing your own reference data may be quite challenging to process on the fly, so I need an overview of what it looks like in order to evaluate this.
- Regarding the global preview: it seems you are thinking of coverage per gene, which I understand is useful. The intention of CACAO was primarily to investigate coverage at the variant loci (pathogenic germline variants, somatic hotspots etc.), but I see your point. We should then also agree on what we consider the "gene": the coding sequence only, or all genic sequence (introns, UTRs etc.)?
Would appreciate your input on this.
regards, Sigve
Hi Sigve,
Thanks for getting back to this.
• Reference data from Hartwig is:
- Point mutations from CIViC - Cacao, I understand, uses CIViC to calculate callability for actionable somatic variants
- Somatic variants from CGI
- Oncogenic/likely oncogenic variants from OncoKB (I know you might be cringing at this, considering the apprehensions around licensing). I’ll discuss this with Oliver and team to find out whether we really want to use this.
Also, I do appreciate the point that we need to understand what data we are going to use for presentation as it’s hard processing reference input on the fly.
• Global preview: We are hoping to begin with focussing on coding regions.
It would definitely be very useful to be able to switch to the whole gene (including introns), but we can expand on this later.
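As a rough illustration of what such a global table could look like, the sketch below summarises, per gene, the fraction of loci in each callability level. It assumes the loci have already been restricted to coding regions upstream. The depth thresholds, level names, and the helpers `callability` and `gene_table` are assumptions made for this example and are not taken from cacao.

```python
from collections import Counter, defaultdict

# Hypothetical (minimum depth, level) pairs in ascending order.
# These thresholds are illustrative only, not cacao's actual settings.
LEVELS = [(0, "NO_COVERAGE"), (1, "LOW"), (20, "CALLABLE"), (100, "HIGH")]


def callability(depth):
    """Return the highest level whose minimum depth is reached."""
    label = LEVELS[0][1]
    for min_depth, name in LEVELS:
        if depth >= min_depth:
            label = name
    return label


def gene_table(loci):
    """Summarise callability per gene.

    loci: iterable of (gene, depth) pairs, one per locus.
    Returns {gene: {level: fraction of that gene's loci in the level}}.
    """
    counts = defaultdict(Counter)
    for gene, depth in loci:
        counts[gene][callability(depth)] += 1
    table = {}
    for gene, per_level in counts.items():
        total = sum(per_level.values())
        table[gene] = {name: per_level[name] / total for _, name in LEVELS}
    return table


# Hypothetical input: placeholder genes and depths.
example_loci = [
    ("GENE_A", 0), ("GENE_A", 30), ("GENE_A", 45), ("GENE_A", 250),
    ("GENE_B", 5),
]
summary = gene_table(example_loci)
```

Switching from coding regions to the whole gene would then only change which loci are fed into `gene_table`, not the table logic itself.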
Happy to have your feedback on this and start looking into implementation as well - if this sounds feasible/useful to you.
Regards, Sehrish