Simultaneously examine multiple runs
Interested whether people think this makes sense, and how much work it would entail.
For the pt189 analysis, I'd like to view pileups for an arbitrary set of BAMs (normal exome, plus multiple somatic exomes and RNA-seq runs). I also have a number of VCFs that I'd like to visualize in this context. The kind of question I'm trying to answer is: "our DNA exome somatic caller found a variant at this locus. In which RNA samples do we see any indication of that variant?"
I'm imagining that the example page for a "project" would provide a way for the user to select a subset of VCFs and BAMs to examine.
This makes sense. It's definitely something we've been thinking about, and is a primary impetus behind my spending time getting Impala working with CycleDash (since such analyses can grow quickly).
The functionality is almost there (it just needs to be implemented on the frontend) for you to choose an arbitrary VCF to compare against (getting precision/recall/F1 plus the true-positive checkboxes). I'd also like to extend it to work across N VCFs instead of just 1.
Is there other functionality you're envisioning? Extending CQL to allow queries answering that question seems important, for starters.
related to #430
Cool. I'll throw out another idea I've been thinking about surrounding this; maybe it also lines up with something you've been considering.
I think the "run" concept may be limiting. I really just have a bunch of VCFs and BAMs that are grouped into a "project". It doesn't really matter what BAMs were used to generate what VCFs.
For managing projects (i.e. adding and removing BAMs or VCFs), I'd ideally like to work with a text file that specifies paths to BAMs and VCFs, instead of a web interface. This file can be edited with a text editor and managed with git. Other tools we write like Guacamole could optionally take these text files instead of raw paths to BAMs and VCFs. I would ideally like to be able to name my BAMs and VCFs in these files and associate "tags" with them. Just thinking out loud, here's a quick sketch of what a file could look like:
project_name = "pt189"
resources = [
  {
    type = "reads",
    name = "normal_exome",
    path = "hdfs:///datasets/path/to/some/reads.bam",
    tags = ["exome", "normal", "illumina"],
    extra = {
      # Arbitrary additional metadata the user wishes to associate with
      # this resource, which CQL can use for filtering.
      depth = 30,
    },
  },
  {
    type = "reads",
    name = "rna_ribozero_left_ovary_primary",
    path = "hdfs:///...",
    tags = ["rna", "tumor", "illumina", "left_ovary"],
  },
  {
    type = "variants",
    name = "germline_variants_haplotypecaller",
    path = "hdfs:///somewhere.vcf",
    tags = ["exome", "germline"],
    extra = {
      haplotypecaller_commandline = "java -jar ..",
    },
  },
]
I might have a few dozen "resources" in one of these files for a project. Then, I would want to be able to load it up in CycleDash, and then write CQL for things like "show me all variants and reads coming from BAMs or VCFs with the 'rna' and 'tumor' tags."
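To make the tag-selection idea concrete, here's a minimal sketch of how a tool could filter a project's resources by tags, assuming the project file has been parsed into Python dicts. The resource entries mirror the sketch above; the `select` helper is hypothetical, not anything CycleDash or CQL implements today.

```python
# Hypothetical sketch: resources as parsed from a project file (names are
# illustrative, taken from the sketch above; paths omitted for brevity).
resources = [
    {"type": "reads", "name": "normal_exome",
     "tags": ["exome", "normal", "illumina"]},
    {"type": "reads", "name": "rna_ribozero_left_ovary_primary",
     "tags": ["rna", "tumor", "illumina", "left_ovary"]},
    {"type": "variants", "name": "germline_variants_haplotypecaller",
     "tags": ["exome", "germline"]},
]

def select(resources, *required_tags):
    """Return the resources carrying every tag in required_tags."""
    return [r for r in resources if set(required_tags) <= set(r["tags"])]

# e.g. "all resources with the 'rna' and 'tumor' tags"
matches = select(resources, "rna", "tumor")
print([r["name"] for r in matches])  # -> ['rna_ribozero_left_ovary_primary']
```

A CQL clause like "from BAMs with tags 'rna' and 'tumor'" would reduce to this kind of subset test over each resource's tag list.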
Where these files live and how CycleDash accesses them is one thing we'd have to figure out. HDFS or NFS on demeter could be one place; we could also keep them on GitHub and give CycleDash HTTP links to them.
Just to add one other idea here: if we expand CQL to let me say things like "select reads with average base quality > 30 from BAMs with tags 'rna' and 'exome'", then tools like Guacamole could perhaps take a "--project" argument with the path to the project file and a "--cql" argument specifying which reads to extract from the BAMs referenced in that file.
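As a rough sketch of what that command-line surface could look like, here's a minimal argument parser. The `--project` and `--cql` flag names come from the proposal above, but nothing here is an existing Guacamole option; the file name is illustrative.

```python
# Hypothetical CLI sketch for the proposed --project / --cql arguments.
# These flags do not exist in Guacamole today; this only illustrates the shape.
import argparse

def parse_args(argv):
    p = argparse.ArgumentParser(description="sketch of a project-aware tool")
    p.add_argument("--project", help="path to a project spec file")
    p.add_argument("--cql",
                   help="CQL expression selecting reads from the project's BAMs")
    return p.parse_args(argv)

args = parse_args([
    "--project", "pt189.json",  # illustrative file name
    "--cql", "select reads with average base quality > 30 "
             "from BAMs with tags 'rna' and 'exome'",
])
print(args.project)  # -> pt189.json
```

The tool would then resolve the CQL's tag clauses against the project file's resource list to find the concrete BAM paths, rather than taking raw paths on the command line.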
I realize there was a lot here. This is roughly akin to a system for managing simulations at a previous workplace, where it worked fairly well. Interested in what people think.
I'd like to see us move away from files as an entity for analysis. I'd rather see metadata attached to entities like contigs, alignments, genotypes, variants or collections thereof.
Similarly, I'd much rather see metadata managed by a database with a user interface rather than serialized to JSON in a text file. You can always export to JSON and edit it by hand if that's your thing. That should not be the primary representation though.
Gotcha. My main point is that we should consider making our "project specification" usable by other tools like biokepi or Guacamole in addition to CycleDash. For example, I'd like to be able to tell biokepi to "run bwa+varscan over all my illumina exome data for this patient", and then pull up the results in CycleDash. (Here, "illumina" and "exome" would be tags I manually associated with certain datasets in my project.)
I've had better luck in the past with a simple text file than with a centralized database for managing projects (DESRES tried both), but I can imagine the database working well if the UI is good. The database does have the nice advantage of not requiring a shared filesystem.
Dreaming, but, could we generate a project spec from the DAG Ketrew specifies + metadata? And would that solve the problem you're looking to solve?
Tracking the provenance of generated datasets using the Ketrew DAG could be useful, but is separate from what I'm thinking. I'm imagining a way to associate reads, variants, and other datasets with a project, label them with tags appropriate to that project, and then use those labels instead of filenames to invoke pipelines or see Cycledash visualizations.
Yeah, I was thinking the tagging + naming part would be "metadata", and the paths + transformations would be the DAG. In a way, we support a lot of this already (well, @smondet sends the data to us) with the "notes" field. We just don't do anything with that data, or enforce any structure on it. You can update it and get at it with the existing API, though.