rnaseq

More flexible samplesheet

Open grst opened this issue 4 years ago • 12 comments

I think it would be great if the samplesheet could handle additional columns (which would be ignored by the RNA-seq pipeline, but would be used by some downstream analysis). Keeping these in a separate file leaves potential for human errors.

It would also be cool if the columns could be in arbitrary order.

To this end, it could maybe help to either use the stdlib csv parser or pandas read_csv instead of manually splitting the lines.

Potentially, these libraries would also be immune to "hidden carriage returns" (which came up on Slack recently) and other pitfalls such as quoted fields.

grst avatar Nov 24 '20 07:11 grst
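A minimal sketch of the stdlib `csv` idea from the comment above, not the pipeline's actual code. The function name `read_samplesheet` and the column names (`sample`, `fastq_1`, `fastq_2`, `batch`) are assumptions chosen for illustration. It shows how `csv.DictReader` looks columns up by name, so their order stops mattering, extra columns are carried along untouched, and Windows line endings do not leak into the values.

```python
import csv
import io

# Hypothetical required columns, chosen only for illustration.
REQUIRED_COLUMNS = ("sample", "fastq_1", "fastq_2")


def read_samplesheet(handle):
    """Return one dict per row, keeping any extra columns untouched."""
    reader = csv.DictReader(handle)
    header = reader.fieldnames or []
    missing = [col for col in REQUIRED_COLUMNS if col not in header]
    if missing:
        raise ValueError(f"Missing required columns: {', '.join(missing)}")
    return list(reader)


# Columns in a non-standard order, an extra 'batch' column, and \r\n endings.
text = "batch,fastq_1,sample,fastq_2\r\nb1,a_R1.fastq.gz,A,a_R2.fastq.gz\r\n"
rows = read_samplesheet(io.StringIO(text, newline=""))
print(rows[0]["sample"], rows[0]["batch"])
```

When reading from disk, the `csv` documentation recommends opening the file with `newline=""` so the reader handles line endings itself.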

The pipeline should be able to handle additional columns already @grst.

Not convinced arbitrary order is a good thing though, mainly for standardisation, but happy to be persuaded otherwise.

Yes, we definitely need a more mainstream solution to read in the CSV files. I don't use pandas and other tools very much but PRs welcome ;) I was scrimping on having additional dependencies which is why I also went for a native solution but that shouldn't be a valid reason!

drpatelh avatar Nov 24 '20 07:11 drpatelh

> The pipeline should be able to handle additional columns already @grst.

Cool! I could have sworn there was an issue the last time I tried, but probably I didn't add the additional columns at the end. I'll try it again next time!

> Yes, we definitely need a more mainstream solution to read in the CSV files. I don't use pandas and other tools very much but PRs welcome ;)

I can definitely put something together when I find some time. I understand the concern with the dependencies. csv is in the standard library, so it wouldn't need any dependencies either. I have never used it myself, though, since I have always reached for pandas. It's probably slower than pandas, but that shouldn't be a concern for samplesheets.

grst avatar Nov 24 '20 08:11 grst

Some more thoughts:

  • Could this be abstracted into a module that can be reused across pipelines, or is that overkill? Maybe even integrate the samplesheet definition in the json schema :thinking:?
  • Maybe it should be implemented directly in groovy instead of Python? (I hardly know any groovy, but this could be an excuse for learning it)

grst avatar Nov 24 '20 08:11 grst
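To make the schema idea above concrete, here is a rough sketch of validating samplesheet rows against a declarative definition written in the style of JSON Schema. Everything in it is an assumption for illustration: the `SAMPLESHEET_SCHEMA` dict, the column names, and the patterns are hypothetical, and the hand-rolled `validate_row` helper covers only `required` and `pattern`, not the JSON Schema specification.

```python
import re

# Hypothetical schema in the style of JSON Schema; column names and patterns
# are illustrative assumptions, not the pipeline's actual definition.
SAMPLESHEET_SCHEMA = {
    "required": ["sample", "fastq_1"],
    "properties": {
        "sample": {"pattern": r"\S+"},
        "fastq_1": {"pattern": r"\S+\.f(ast)?q\.gz"},
        "fastq_2": {"pattern": r"\S+\.f(ast)?q\.gz"},
    },
}


def validate_row(row, schema=SAMPLESHEET_SCHEMA):
    """Return a list of error messages for one samplesheet row (a dict)."""
    errors = []
    # Required columns must be present and non-empty.
    for column in schema["required"]:
        if not row.get(column):
            errors.append(f"'{column}' is required")
    # Non-empty values must match their pattern; unknown columns are ignored.
    for column, rules in schema["properties"].items():
        value = row.get(column)
        if value and not re.fullmatch(rules["pattern"], value):
            errors.append(f"'{column}' has an invalid value: {value!r}")
    return errors


good = {"sample": "A", "fastq_1": "a_R1.fastq.gz", "fastq_2": "", "batch": "b1"}
bad = {"sample": "A B", "fastq_1": ""}
print(validate_row(good))
print(validate_row(bad))
```

Keeping the rules in data rather than code is what would let the same definition be shared across pipelines, whichever language ends up doing the validation.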

Anything fancy we can do here @grst to improve our current implementation? We have now added a stand-alone samplesheet schema to help with the validation in https://github.com/nf-core/rnaseq/pull/623 but there are still some fundamental issues that may require outside scripts for this - some discussion in https://github.com/nf-core/rnaseq/pull/633

drpatelh avatar Jun 15 '21 12:06 drpatelh

I'll take a look at the issues you referenced... First step would be a reproducible example for the utf8-bug, though :thinking:

grst avatar Jun 15 '21 12:06 grst

Yeah, I know. That 🐛 is really annoying and I suspect it will be quite an easy fix. Just haven't found a portable way to test it!

drpatelh avatar Jun 15 '21 12:06 drpatelh

The JSON schema and Groovy-based validation look neat! Wouldn't that make the Python script obsolete anyway?

You mentioned the stripping of quotes is done in python, but there should be some way of achieving this in Groovy?

header = [x.strip('"') for x in fin.readline().strip().split(",")] 

grst avatar Jun 16 '21 07:06 grst
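For comparison, a small sketch of how the manual split-and-strip approach quoted above behaves next to the stdlib `csv` reader when a quoted field contains a comma. The header line, including the `notes, free text` column, is a made-up example.

```python
import csv

# Hypothetical header line with a quoted field that contains a comma.
line = '"sample","fastq_1","notes, free text"\r\n'

# Manual approach from the quoted line: split on commas, then strip quotes.
manual = [x.strip('"') for x in line.strip().split(",")]

# csv.reader understands quoting, so the embedded comma stays in one field.
parsed = next(csv.reader([line]))

print(manual)  # ['sample', 'fastq_1', 'notes', ' free text']
print(parsed)  # ['sample', 'fastq_1', 'notes, free text']
```

The manual version yields four header fields instead of three, which is the kind of quoted-field caveat raised at the start of the thread.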

Yup, most of the Python stuff would be obsolete!

I think the problem is that the validation needs to happen with standard libraries before it is passed to the pipeline in order to make it more portable but it doesn't have this option. Maybe @ewels can confirm.

drpatelh avatar Jun 16 '21 16:06 drpatelh

Additional java/groovy jars could be placed in lib as part of the pipeline template...

grst avatar Jun 16 '21 16:06 grst

@grst maybe you could take a look at https://github.com/nf-core/tools/pull/1282 and see if that would fit your needs? I agree that doing the job in Groovy is preferable and consider my PR only a temporary change if it is at all wanted.

Midnighter avatar Oct 08 '21 19:10 Midnighter

Looks neat! I lost track of the discussion around the "samplesheet schema". Is your implementation already part of that or would the schema validation be the next step?

grst avatar Oct 11 '21 07:10 grst

@ewels showed me his work on the samplesheet schema on Friday but this is not it, I'm afraid.

Midnighter avatar Oct 11 '21 07:10 Midnighter