unconf15
A package for higher-level R metadata extraction
Basically, there should be a higher-level way to extract metadata from directory names and filenames. For example:
150202_HS2A/Project_XXX/Sample_OEP02_N712_S508/OEP02_N712_S508_GTAGAGGA-CTAAGCCT_L002_R1_001.fastq.gz
There's a ton of metadata in this file path that we should be able to quickly extract and work with (e.g. in a dataframe). The (initial) goals of this package are:
- Load in and extract file metadata.
- Queries on this metadata: think of a list.files() that's way more powerful.
\o/
Should we wrap list.files() at all? Should we use something like Python's groupdict()?
Some example usage:
files <- list.files(recursive=TRUE, full.names=TRUE)
md <- extract_metadata(files, "data/(?P<samples>\\w+)/(?P<replicates>\\w+)/file_(?P<file_number>\\d+)\\.txt")
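To make the idea concrete, here is a minimal sketch of what extract_metadata() could look like in base R. This is an assumption about the design, not the actual pathfindr code: it relies on regexpr() with perl = TRUE, which understands Python-style (?P<name>...) groups and reports their positions in the capture.* attributes. The example paths are made up.

```r
# Hypothetical sketch of extract_metadata(); not the actual pathfindr code.
extract_metadata <- function(paths, pattern) {
  # perl = TRUE switches to PCRE, which supports (?P<name>...) named groups
  m <- regexpr(pattern, paths, perl = TRUE)
  starts <- attr(m, "capture.start")
  lens <- attr(m, "capture.length")
  group_names <- attr(m, "capture.names")
  unmatched <- as.integer(m) == -1L

  # One character column per named group; NA where the path did not match
  cols <- lapply(seq_along(group_names), function(i) {
    value <- substring(paths, starts[, i], starts[, i] + lens[, i] - 1)
    value[unmatched] <- NA_character_
    value
  })
  names(cols) <- group_names

  data.frame(path = paths, cols, stringsAsFactors = FALSE)
}

# Made-up paths for illustration
files <- c(
  "./data/control/rep1/file_1.txt",
  "./data/control/rep2/file_2.txt",
  "./data/treated/rep1/file_3.txt",
  "./README.md"
)
md <- extract_metadata(
  files,
  "data/(?P<samples>\\w+)/(?P<replicates>\\w+)/file_(?P<file_number>\\d+)\\.txt"
)
```

In this sketch every captured value comes back as a character column, and paths that don't match the pattern are kept with NA metadata rather than dropped. Whether that is the right default is an open question.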
Is md a dataframe? If so, we can use dplyr and purrr to process each entry. Or we could use base functions, e.g.:
lapply(split(md, md$samples), read_function)
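A small self-contained illustration of that split/lapply pattern, assuming md is a plain data.frame. The data and the body of read_function below are stand-ins made up for the example; a real read_function would presumably read the files listed in each group.

```r
# Toy metadata table; in practice this would come from the path parser
md <- data.frame(
  path = c("a_1.txt", "a_2.txt", "b_1.txt"),
  samples = c("a", "a", "b"),
  stringsAsFactors = FALSE
)

# Hypothetical stand-in: summarise each group instead of reading its files
read_function <- function(sample_md) {
  data.frame(
    samples = sample_md$samples[1],
    n_files = nrow(sample_md),
    stringsAsFactors = FALSE
  )
}

# split() yields one data.frame per sample; lapply() processes each in turn
per_sample <- lapply(split(md, md$samples), read_function)
summary_df <- do.call(rbind, per_sample)
```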
Just some ideas...
Should we have hooks that validate certain metadata?
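One possible shape for such hooks, purely as a sketch: a named list of predicate functions, one per metadata column, plus a helper that reports the rows failing each check. The names validate_metadata and hooks, the column names, and the rules themselves are all assumptions made up for illustration.

```r
# Hypothetical validation helper: runs one predicate per metadata column
# and returns a data.frame of failing rows (NULL if everything passes).
validate_metadata <- function(md, hooks) {
  problems <- lapply(names(hooks), function(col) {
    ok <- hooks[[col]](md[[col]])
    bad <- which(!ok | is.na(ok))
    if (length(bad) == 0) return(NULL)
    data.frame(
      row = bad,
      column = col,
      value = as.character(md[[col]][bad]),
      stringsAsFactors = FALSE
    )
  })
  problems <- Filter(Negate(is.null), problems)
  if (length(problems) == 0) return(NULL)
  do.call(rbind, problems)
}

# Made-up metadata with two deliberately bad entries
md <- data.frame(
  samples = c("control", "treated", "treated"),
  lane = c("L002", "L0X2", "L001"),
  read = c("R1", "R2", "R3"),
  stringsAsFactors = FALSE
)

# Assumed rules: lanes look like L001, reads are R1 or R2
hooks <- list(
  lane = function(x) grepl("^L[0-9]{3}$", x),
  read = function(x) x %in% c("R1", "R2")
)

problems <- validate_metadata(md, hooks)
```

Returning the failures as a data.frame, rather than stopping at the first error, keeps the result queryable like the metadata itself.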
A place to get started: https://github.com/vsbuffalo/pathfindr. Fork, branch, code, and be merry!
EDIT: the name is terrible, but it's a placeholder.
Rather topic specific, but in the Aroma Framework we utilize such "meta" data encoded in file and directory names, cf. http://aroma-project.org/docs/HowDataFilesAndDataSetsAreLocated/. It's been in place since 2006. It also supports on-the-fly reparsing, e.g. filtering out character sequences without information, reordering, etc. The essence of it is in the R.filesets package. It's been a long-term wish to extend this to a generic framework based on regular expressions, but there's never been an urgent need for it.
How about we use a different data format for genomic data that encodes the metadata within the file? :) I can't really see it happening, but I'm not really joking either. HDF5?
We were talking about this with @jarrodmillman but I personally don't see that happening unless Illumina (and/or another big player) pushes for it. It would be cool, though...
This reminds me of two other thoughts I've had in the past, related to what sort of larger package this functionality might fit in.
[1] A package that implements some standard unix commands but gives the result in the most sensible R native format and anticipates piping them together with %>%. In the current case, the idea would be to emulate ls and to offer some of the most expected arguments and to get the result back as a data.frame. If memory serves, the current task of parsing metadata in file names was exactly my own motivating case, as this comes up often for me as well.
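As a rough sketch of idea [1], an ls that returns a data.frame can be built from list.files() and file.info() in base R. The function name ls_df and its column names are made up here; the demo files are created in a temporary directory just so the example runs.

```r
# Hypothetical ls_df(): an `ls -l`-like listing returned as a data.frame.
ls_df <- function(path = ".", recursive = FALSE, all = FALSE) {
  files <- list.files(path, recursive = recursive, all.files = all,
                      full.names = TRUE, no.. = TRUE)
  info <- file.info(files)
  data.frame(
    path = files,
    name = basename(files),
    size = info$size,        # bytes
    is_dir = info$isdir,
    modified = info$mtime,
    stringsAsFactors = FALSE
  )
}

# Demo in a scratch directory
demo_dir <- file.path(tempdir(), "ls_df_demo")
dir.create(demo_dir, showWarnings = FALSE)
writeLines("ACGT", file.path(demo_dir, "sample_A.txt"))
writeLines(c("ACGT", "TTGA"), file.path(demo_dir, "sample_B.txt"))

listing <- ls_df(demo_dir)
large <- subset(listing, size > 7)  # keeps only the two-line file
```

Because the result is an ordinary data.frame, it could be piped with %>% into dplyr verbs just as well as into subset().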
[2] It also feels like R needs a package for operating on files and paths. Google and the hotel wifi are keeping me from pointing to a great example from another language, but I know such examples exist. Something that goes beyond file.path(), basename(), etc. I even feel like someone has taken a stab at this but can't remember who/when. Anyone else remember?
E.g. https://docs.python.org/2/library/os.path.html in Python.
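For a sense of what an os.path-style layer over base R might look like, here is a rough sketch. The function names path_join, path_split, and path_splitext are hypothetical, and the behaviour only approximates Python's: for example, file.path() does not restart at an absolute component the way os.path.join does, and tools::file_ext() treats a dotfile like ".bashrc" differently from os.path.splitext. These helpers handle one path at a time.

```r
# Hypothetical os.path-style helpers built on base R and the tools package.
path_join <- function(...) file.path(...)

# Roughly os.path.split(): directory part and final component
path_split <- function(path) {
  c(head = dirname(path), tail = basename(path))
}

# Roughly os.path.splitext(): strips only the last extension
path_splitext <- function(path) {
  ext <- tools::file_ext(path)
  c(root = tools::file_path_sans_ext(path),
    ext = if (nzchar(ext)) paste0(".", ext) else "")
}

p <- path_join("150202_HS2A", "Project_XXX", "reads_R1_001.fastq.gz")
parts <- path_split(p)
name_parts <- path_splitext(parts[["tail"]])
```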
node.js has a bunch of them; they are pretty basic, but I guess that is what you need most of the time. E.g. https://www.npmjs.com/package/fs-extra
Edit: also, http://cran.r-project.org/web/packages/pathological/index.html
I started a direct port of os.path and will push that up to gh - it's mostly just going to be grunt work to get all the bits filled in, so if other people have a need we'd get a naive R version done pretty quickly.
@richfitz That sounds great. Happy to help fill in bits.