rnaseq
rnaseq copied to clipboard
More flexible samplesheet
I think it would be great if the samplesheet could handle additional columns (which would be ignored by the RNA-seq pipeline, but would be used by some downstream analysis). Keeping these in a separate file leaves potential for human errors.
It would also be cool if the columns could be in arbitrary order.
To this end, it could maybe help to either use the stdlib csv parser or pandas read_csv instead of manually splitting the lines.
Potentially, these libraries would also be immune to "hidden carriage returns" (what came up on slack recently) or other caveats like quoted fields etc.
The pipeline should be able to handle additional columns already @grst.
Not convinced arbitrary order is a good thing though mainly for standardisation but happy to be persuaded otherwise.
Yes, we definitely need a more mainstream solution to read in the CSV files. I don't use pandas and other tools very much but PRs welcome ;) I was scrimping on having additional dependencies which is why I also went for a native solution but that shouldn't be a valid reason!
The pipeline should be able to handle additional columns already @grst.
Cool! I could have sworn there was an issue the last time I tried, but probably I didn't add the additional columns at the end. I'll try it again next time!
Yes, we definitely need a more mainstream solution to read in the CSV files. I don't use pandas and other tools very much but PRs welcome ;)
I can definitely put something together when I find some time. I understand the concern with the dependencies. csv
is in the standard library, so it wouldn't need any dependencies either. I have never used it in favor of pandas, though. It's probably slower than pandas, but that shouldn't be a concern for samplesheets.
Some more thoughts:
- Could this be abstracted into a module that can be reused across pipelines, or is that overkill? Maybe even integrate the samplesheet definition in the
json
schema :thinking:? - Maybe it should be implemented directly in groovy instead of Python? (I hardly know any groovy, but this could be an excuse for learning it)
Anything fancy we can do here @grst to improve our current implementation? We have now added a stand-alone samplesheet schema to help with the validation in https://github.com/nf-core/rnaseq/pull/623 but there are still some fundamental issues that may require outside scripts for this - some discussion in https://github.com/nf-core/rnaseq/pull/633
I'll take a look at the issues you referenced... First step would be a reproducible example for the utf8-bug, though :thinking:
Yeah, I know. That 🐛 is really annoying and I suspect it will be quite an easy fix. Just haven't found a portable way to test it!
The json-scheme and groovy-based validation look neat! That would make the python script obsolete anyway?
You mentioned the stripping of quotes is done in python, but there should be some way of achieving this in Groovy?
header = [x.strip('"') for x in fin.readline().strip().split(",")]
Yup, most of the Python stuff would be obsolete!
I think the problem is that the validation needs to happen with standard libraries before it is passed to the pipeline in order to make it more portable but it doesn't have this option. Maybe @ewels can confirm.
Additional java/groovy jars
could be placed in lib
as part of the pipeline template...
@grst maybe you could take a look at https://github.com/nf-core/tools/pull/1282 and see if that would fit your needs? I agree that doing the job in Groovy is preferable and consider my PR only a temporary change if it is at all wanted.
Looks neat! I lost track of the discussion around the "samplesheet schema". Is your implemenation already part of that or would the schema validation be the next step?
@ewels showed my his work on the samplesheet schema on Friday but this is not it, I'm afraid.