better version/build management

Open stephenturner opened this issue 8 years ago • 3 comments

With the changes in #6 it's much easier to recreate annotation tables. The files are named, e.g., galgal5, but which version/build is actually used depends on what's current in Ensembl. For example, when I first built this package, chicken was on galgal4. I had to manually update the filenames, and I probably did the wrong thing by just deleting (rather than deprecating) the old datasets. Maybe that's okay since they're still versioned in a release. Not sure how best to handle these issues.

stephenturner • Jan 19 '17 21:01

🤔...

One potential solution: name recipes and tables based on species, so hsapiens.yml would create a table called hsapiens that includes annotations for whatever the most recent build/version happens to be.

Previous versions could be specified by appending the version number. Most users will (probably) want the most up-to-date info and would only need to type hsapiens; users with more specific needs would have to type something like hsapiens_GRCh37.

What's your opinion on providing previous genome versions?

We could maintain recipes for older builds and provide a function that allows users to build them locally. That way they're still easily accessible for reproducibility purposes without causing the package size to explode.
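To make the naming proposal concrete, here is a minimal, language-agnostic sketch (written in Python, not as annotables code) of how a bare species name could resolve to the newest build while a suffixed name pins an older one. The `BUILDS` registry, the `ggallus` key, and the `resolve_table` function are all hypothetical names invented for illustration.

```python
# Hypothetical registry: species alias -> known builds, oldest first.
# The entries are illustrative only; a real package would generate
# this from its recipes.
BUILDS = {
    "hsapiens": ["GRCh37", "GRCh38"],
    "ggallus": ["galgal4", "galgal5"],
}


def resolve_table(name: str) -> str:
    """Map 'species' or 'species_build' to an explicit 'species_build' name."""
    # Split on the first underscore: 'hsapiens_GRCh37' -> ('hsapiens', 'GRCh37').
    species, _, build = name.partition("_")
    if species not in BUILDS:
        raise KeyError(f"unknown species: {species}")
    if not build:
        # No build given, so fall back to the newest registered build.
        build = BUILDS[species][-1]
    elif build not in BUILDS[species]:
        raise KeyError(f"unknown build for {species}: {build}")
    return f"{species}_{build}"
```

With this registry, `resolve_table("hsapiens")` resolves to the GRCh38 table, while `resolve_table("hsapiens_GRCh37")` stays pinned to the older build. Adding a newer build to the list would silently change what the bare alias returns.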

aaronwolen • Jan 20 '17 13:01

I do think there's a need to be able to maintain or recreate older versions. I operate a core facility, and I did analyses for folks years ago using, e.g., Galgal4, but if I created or recreated the data now, it'd be galgal5. Also, for human specifically, lots of folks (me included) are still using GRCh37.

There might be a few ways to manage this. I think you'd need to know which Ensembl archive version to go after to get the build you're interested in. Also, maybe there's some way to retrieve and record this information from the BioMart query.
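One way to picture the bookkeeping described above is a small record, stored alongside each table, that names the build and the Ensembl host it was pulled from. The sketch below is in Python and is not annotables code: `BuildSource`, `SOURCES`, and `provenance` are hypothetical names. The host values are assumptions to check against Ensembl's own archive list, and `www.ensembl.org` in particular always serves whatever release is current.

```python
from dataclasses import dataclass, asdict


@dataclass(frozen=True)
class BuildSource:
    """Where a given annotation table was (or should be) retrieved from."""
    table: str  # name of the table in the package
    build: str  # genome build the table describes
    host: str   # Ensembl host to query (assumed; verify before use)


# Illustrative entries only.
SOURCES = {
    "grch38": BuildSource("grch38", "GRCh38", "www.ensembl.org"),
    "grch37": BuildSource("grch37", "GRCh37", "grch37.ensembl.org"),
}


def provenance(table: str, retrieved: str) -> dict:
    """Return a metadata record to save next to the table for reproducibility."""
    record = asdict(SOURCES[table])
    record["retrieved"] = retrieved  # date the query was run, supplied by the caller
    return record
```

Saving such a record with every rebuild would let a later reader tell which build a table reflects even after the "current" release has moved on.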

I do like the idea of just typing hsapiens... I'm sure there's a way to "alias" different names to the same dataset. I'm not very experienced with R data package creation; this is my first and only one.

stephenturner • Jan 20 '17 18:01

This is a good point. Attaching GRCh38 data to an object called hsapiens would probably violate user assumptions. Perhaps it's better to be more explicit and stick to naming objects after the relevant genome version?

I'm also in a bioinformatics core and frequently switch between projects that require different genomes/builds, so I loved the idea of annotables. It can be a real time saver!

aaronwolen • Jan 20 '17 20:01