unconf15

A package for higher-level R metadata extraction

Open vsbuffalo opened this issue 10 years ago • 11 comments

Basically, there should be a higher-level way to extract metadata from directory names and file names. For example:

150202_HS2A/Project_XXX/Sample_OEP02_N712_S508/OEP02_N712_S508_GTAGAGGA-CTAAGCCT_L002_R1_001.fastq.gz

There's a ton of metadata in this path that we should be able to quickly extract and work with (e.g. in a data frame). The (initial) goals of this package are:

  1. Load files and extract metadata from their paths.
  2. Query this metadata: think list.files(), but far more powerful.
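
As a rough illustration of goal 1, the example path above can be split into fields with a single regular expression using named groups. This is a minimal sketch in Python, not the proposed R package. The field names (`run_date`, `machine`, `project`, `sample`, `barcode`, `lane`, `read`, `chunk`) are assumptions about what each part of the path means, not something the path itself states.

```python
import re

# Field names are illustrative guesses at the meaning of each path component.
PATTERN = re.compile(
    r"(?P<run_date>\d{6})_(?P<machine>\w+)/"
    r"Project_(?P<project>\w+)/"
    r"Sample_(?P<sample>\w+)/"
    # (?P=sample) requires the file name to repeat the sample directory's name.
    r"(?P=sample)_(?P<barcode>[ACGT]+-[ACGT]+)_"
    r"L(?P<lane>\d{3})_R(?P<read>\d)_(?P<chunk>\d{3})\.fastq\.gz"
)

path = (
    "150202_HS2A/Project_XXX/Sample_OEP02_N712_S508/"
    "OEP02_N712_S508_GTAGAGGA-CTAAGCCT_L002_R1_001.fastq.gz"
)

match = PATTERN.fullmatch(path)
# groupdict() maps each named group to the text it captured.
meta = match.groupdict() if match else {}
print(meta)
```

Every value comes back as a string, so a real tool would probably also need a way to convert fields such as the lane or read number to integers.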

Mar 26 '15 20:03 vsbuffalo

\o/

Mar 26 '15 20:03 karthik

Should we wrap list.files() at all? Should we use something like Python's groupdict() for named capture groups?

Some example usage:

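files <- list.files(recursive = TRUE, full.names = TRUE)
# Each named group becomes a metadata field; the dot is escaped so it only matches a literal ".txt"
md <- extract_metadata(files, "data/(?P<samples>\\w+)/(?P<replicates>\\w+)/file_(?P<file_number>\\d+)\\.txt")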

Is md a data frame? If so, we can use dplyr and purrr to process each entry. Or we could use base functions, e.g.:

lapply(split(md, md$samples), read_function)
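
For comparison, here is a minimal Python sketch of the groupdict approach the question refers to. The `extract_metadata` function below is a hypothetical stand-in for the proposed R function, and the file paths are invented for illustration. It returns one record per matching path and then groups paths by sample, much like the split() call above.

```python
import re
from collections import defaultdict


def extract_metadata(paths, pattern):
    """Return one dict per matching path: the named groups plus the path itself."""
    regex = re.compile(pattern)
    rows = []
    for path in paths:
        match = regex.search(path)
        if match:  # paths that do not match the pattern are skipped
            rows.append({"path": path, **match.groupdict()})
    return rows


# Hypothetical file listing; in practice this would come from a directory walk.
files = [
    "./data/control/rep1/file_1.txt",
    "./data/control/rep2/file_2.txt",
    "./data/treated/rep1/file_3.txt",
    "./data/README.md",
]

md = extract_metadata(
    files,
    r"data/(?P<samples>\w+)/(?P<replicates>\w+)/file_(?P<file_number>\d+)\.txt",
)

# Group the matching paths by sample, analogous to split(md, md$samples).
by_sample = defaultdict(list)
for row in md:
    by_sample[row["samples"]].append(row["path"])
```

Silently skipping non-matching paths is one design choice among several; reporting them instead might be more useful when a naming convention is supposed to be strict.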

Just some ideas...

Mar 26 '15 20:03 vsbuffalo

Should we have hooks that validate certain metadata?
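
One possible shape for such hooks, sketched in Python under assumptions: each hook is a plain function from a field's value to True or False, and `validate` (a hypothetical name) collects every failure rather than stopping at the first. The records and rules below are invented for illustration.

```python
def validate(rows, hooks):
    """Apply each field's hook; return (row, field, value) for every failure."""
    failures = []
    for row in rows:
        for field, hook in hooks.items():
            value = row.get(field)
            # A missing field counts as a failure, as does a rejected value.
            if value is None or not hook(value):
                failures.append((row, field, value))
    return failures


# Invented metadata records and rules, purely for illustration.
rows = [
    {"sample": "OEP02", "lane": "002", "read": "1"},
    {"sample": "OEP03", "lane": "002", "read": "3"},
]
hooks = {
    "lane": lambda v: v.isdigit() and len(v) == 3,
    "read": lambda v: v in {"1", "2"},
}

failures = validate(rows, hooks)
for row, field, value in failures:
    print(f"invalid {field}={value!r} in {row}")
```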

Mar 26 '15 21:03 vsbuffalo

A place to get started: https://github.com/vsbuffalo/pathfindr (fork, branch, code, and be merry!)

EDIT: the name is terrible, but it's a placeholder.

Mar 26 '15 21:03 vsbuffalo

Rather topic-specific, but in the Aroma Framework we utilize such "meta" data encoded in file and directory names, cf. http://aroma-project.org/docs/HowDataFilesAndDataSetsAreLocated/. It's been in place since 2006. It also supports on-the-fly reparsing, e.g. filtering out character sequences that carry no information, reordering, etc. The essence of it is in the R.filesets package. It's been a long-term wish to extend this to a generic framework based on regular expressions, but there's never been an urgent need for it.

Mar 26 '15 23:03 HenrikBengtsson

How about we use a different data format for genomic data that encodes the metadata within the file? :) I can't really see it happening, but I'm not really joking either. HDF5?

Mar 27 '15 18:03 tracykteal

We were talking about this with @jarrodmillman but I personally don't see that happening unless Illumina (and/or another big player) pushes for it. It would be cool, though...

Mar 28 '15 01:03 drisso

This reminds me of two other thoughts I've had in the past, related to what sort of larger package this functionality might fit in.

[1] A package that implements some standard Unix commands but returns the result in the most sensible R-native format and anticipates piping them together with %>%. In the current case, the idea would be to emulate ls, offer some of the most expected arguments, and get the result back as a data.frame. If memory serves, the current task of parsing metadata in file names was exactly my own motivating case, as this comes up often for me as well.

[2] It also feels like R needs a package for operating on files and paths. Google and the hotel wifi are keeping me from pointing to a great example from another language, but I know such packages exist. Something that goes beyond file.path(), basename(), etc. I even feel like someone has taken a stab at this but can't remember who/when. Anyone else remember?
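
Idea [1] could look roughly like the following, sketched here in Python rather than R. The `ls` function and its column names are hypothetical. It returns one record per directory entry, which is the kind of tabular result that could then be filtered or piped onward. The temporary directory exists only to make the example self-contained.

```python
import os
import tempfile


def ls(directory):
    """List a directory's entries as dicts, sorted by name."""
    rows = []
    with os.scandir(directory) as entries:
        for entry in sorted(entries, key=lambda e: e.name):
            info = entry.stat()
            rows.append({
                "name": entry.name,
                "is_dir": entry.is_dir(),
                "size": info.st_size,    # size in bytes
                "mtime": info.st_mtime,  # last modification time, seconds since epoch
            })
    return rows


# Build a throwaway directory containing one file and one subdirectory.
with tempfile.TemporaryDirectory() as tmp:
    with open(os.path.join(tmp, "a.txt"), "w") as fh:
        fh.write("hello")
    os.mkdir(os.path.join(tmp, "sub"))
    listing = ls(tmp)

for row in listing:
    print(row["name"], row["is_dir"], row["size"])
```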

Mar 28 '15 02:03 jennybc

E.g. https://docs.python.org/2/library/os.path.html in Python.

Node.js has a bunch of them; they are pretty basic, but I guess that is what you need most of the time. E.g. https://www.npmjs.com/package/fs-extra

Edit: also, http://cran.r-project.org/web/packages/pathological/index.html

Mar 28 '15 03:03 gaborcsardi

I started a direct port of os.path and will push that up to GitHub. It's mostly just going to be grunt work to get all the bits filled in, so if other people have a need, we'd get a naive R version done pretty quickly.

Mar 30 '15 23:03 richfitz

@richfitz That sounds great. Happy to help fill in bits.

Mar 31 '15 01:03 karthik