unconf16
Standard(s) for inferring metadata from directory and file names
Outcome: A new package
UPDATE: This suggestion was worked on during rOpenSci Unconf 2016 and resulted in the dirdf package.
Call
@vsbuffalo, @karthik, @gaborcsardi, @richfitz, @jennybc et al., don't you think the Unconf15 topic "A package for higher-level R metadata extraction" deserves a bit more love?
Summary of packages / software
R Packages
- https://cran.r-project.org/package=pathological
- https://cran.r-project.org/package=pystr (Python-style string operations)
- https://cran.r-project.org/package=R.filesets (in production since 2006; directory and file names with comma-separated tags; designed for Aroma Project)
- https://github.com/vsbuffalo/pathfindr/ (prototype)
- https://github.com/HenrikBengtsson/pathinfo/ (prototype)
- https://github.com/richfitz/pathr (prototype)
Python packages
- https://docs.python.org/2/library/os.path.html
JavaScript / Node.js
- e.g. https://www.npmjs.com/package/fs-extra ("basic")
EDIT 2016-03-04: Added a summary of packages/software mentioned in last year's thread. Updated with those mentioned in this year's thread.
Yes! @nicolewhite might possibly be interested too with her port of some string handling things: https://github.com/nicolewhite/pystr
Though in terms of @vsbuffalo's original topic, I now endorse meaningless filenames backed by a lookup to a key-value store for the metadata.
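One way to read this suggestion, sketched in base R: files get opaque names, and the metadata lives in a separate key-value lookup keyed by that name. Everything here is an assumption for illustration: the `register()` helper, the in-memory environment used as the store, and the metadata fields are hypothetical and do not come from any package mentioned in this thread.

```r
# Hypothetical key-value store: an environment keyed by opaque file name.
store <- new.env()

# Hypothetical helper: generate a meaningless file name and record the
# metadata for it in the store. Returns the generated name.
register <- function(store, metadata, ext) {
  # 12 random hex characters; collisions are unlikely but not checked here.
  id <- paste(sample(c(0:9, letters[1:6]), 12, replace = TRUE), collapse = "")
  key <- paste0(id, ".", ext)
  assign(key, metadata, envir = store)
  key
}

key <- register(
  store,
  list(village = "v01", participant = "p123", date = "2016-03-04"),
  ext = "csv"
)

# The file name says nothing; the metadata comes from the lookup.
meta <- get(key, envir = store)
meta$participant
```

A real setup would presumably persist the store on disk rather than in an environment, which is outside the scope of this sketch.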
Yeah, encoding metadata into file names is a neat trick, but it has its limits....
> I now endorse meaningless filenames backed by a lookup to a key-value store for the metadata.
does that include version information in data identifiers? :-)
(Sorry if that sounded wrong; I meant to say that I was curious to understand a bit better when this does or doesn't work. I never felt like I had a good idea one way or the other when we were discussing this in terms of data versioning. A good topic for more exploration.)
"Meaningless filenames" fills me with existential dread.
@HenrikBengtsson don't you have something on your R wish list about a class for file paths? Yes, here: https://github.com/HenrikBengtsson/Wishlist-for-R/issues/9. That feels possibly related to this.
@jennybc, I'd say it's only somewhat related, but yes, I can imagine that we implement a file metadata API on top of classes like what's proposed in https://github.com/HenrikBengtsson/Wishlist-for-R/issues/9. Though, I don't think we need to figure out the latter in order to make progress on this one.
FYI, I've updated the top comment with a summary of packages/software mentioned here and in last year's thread.
@HenrikBengtsson I currently use my own package for the project I'm working on, where the filenames have a particular structure (villageID_participantID_date_etc + extension). It's not an optimal way of storing data but it's the current state of the project data.
I check that filenames have the right structure depending on the extension and then I check different things based on the logsheet (which files do I expect for each participantID, etc. -- the fact I use the logsheet is the reason I cannot make it public yet) once I've parsed all filenames. I now wonder how you would make this general? Or were you thinking of writing guidelines?
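A minimal sketch of that kind of check in base R, to make the workflow concrete. The example file names, the exact pattern (three underscore-separated fields plus an extension), and the column names are assumptions for illustration; the real project's structure and the logsheet comparison are not shown in this thread.

```r
# Hypothetical file names; the last one does not follow the structure.
files <- c("v01_p123_2016-03-04.csv", "v02_p456_2016-03-05.csv", "notes.txt")

# Assumed structure: villageID_participantID_date + extension.
ok <- grepl("^[^_]+_[^_]+_[^_]+\\.[A-Za-z0-9]+$", files)

# Split the well-formed names into their underscore-separated fields.
parts <- strsplit(tools::file_path_sans_ext(files[ok]), "_", fixed = TRUE)

meta <- data.frame(
  file          = files[ok],
  villageID     = vapply(parts, `[`, character(1), 1),
  participantID = vapply(parts, `[`, character(1), 2),
  date          = as.Date(vapply(parts, `[`, character(1), 3)),
  ext           = tools::file_ext(files[ok]),
  stringsAsFactors = FALSE
)

# Names that failed the structural check, to be reported or fixed.
malformed <- files[!ok]
```

With the metadata in a data frame, comparing against an expected list (for example, one row per participantID from a logsheet) becomes an ordinary merge or set difference.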
https://usecanvas.com/anonymous/pathmetadata/2VXxpEVm8W2UMIb86RbmoO
https://github.com/ropenscilabs/dirdf
EFFFFF ME
R's regex DOES support named capture groups! The syntax is just different than what I have used in the past, and perl = TRUE is required. I'll try to submit a pull request--this should let users pass regexes without requiring a separate colnames column.
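For reference, a minimal sketch of named capture groups in base R. With `perl = TRUE`, `regexpr()` returns `capture.start`, `capture.length`, and `capture.names` attributes, which is enough to turn file names into columns. The `parse_names()` helper, the pattern, and the file names are hypothetical; this is not the dirdf API.

```r
# Hypothetical helper: extract named capture groups into a data frame,
# one column per group. Requires perl = TRUE for the (?<name>...) syntax.
parse_names <- function(x, pattern) {
  m <- regexpr(pattern, x, perl = TRUE)
  starts <- attr(m, "capture.start")
  lens <- attr(m, "capture.length")
  out <- lapply(seq_len(ncol(starts)), function(j) {
    substring(x, starts[, j], starts[, j] + lens[, j] - 1L)
  })
  names(out) <- attr(m, "capture.names")
  out <- as.data.frame(out, stringsAsFactors = FALSE)
  # Rows that did not match the pattern get NA instead of empty strings.
  out[m == -1, ] <- NA
  out
}

files <- c("v01_p123_2016-03-04.csv", "v02_p456_2016-03-05.csv")
pattern <- "^(?<village>[^_]+)_(?<participant>[^_]+)_(?<date>[^.]+)\\.(?<ext>.+)$"
res <- parse_names(files, pattern)
```

Because the column names come from the group names in the pattern, no separate list of column names is needed.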