peppy Sample name uniqueness?

It's mentioned in the special columns section of the project definition docs that sample_name should be a unique identifier, but I don't think that we enforce the uniqueness constraint, and I seem to be able to tweak the microtest annotations to have duplicate sample names and get no error. I think this really isn't a major issue since I think that some protocol data is embedded in the sample name, and often there may only be one protocol/pipeline per "sample" row in the sheet anyway.

The scenario I'm envisioning is one in which someone creates a sheet with two entries with the same name, but protocols are different. Say, old_mouse_A with ATAC as protocol, and and old_mouse_A with WGBS as protocol. I think then that the results would just wind up in the same output folder, but otherwise this may not be problematic? Does anyone know top-of-mind whether output from these hypothetical samples would in fact be destined for the same folder? If so, is this the desired behavior, or do we want to differentiate on protocol?

Sep 19 '17 21:09 vreuter

I originally typed out this response, but now I'm rethinking:

The sample_name at one point was forced to be unique. If that has been lost, I think we should reinstate it. I think it's important to enforce that because each sample is typically put in a sample folder, and we use that as a unique handle for the sample in lots of downstream cases.

I think at least it should be a warning...but maybe we don't have to enforce it...

Sep 19 '17 23:09 nsheff

Ok, maybe that's a good sign that it's infrequent enough to have come up in a while? Seems like a pretty rare case but it just came up as I was tinkering earlier today. Either way seems OK to me (true exception or just warning) so I'm fine with whatever you think.

Sep 20 '17 00:09 vreuter

Thoughts @afrendeiro ?

Oct 11 '17 11:10 nsheff

This is currently implemented as a warning message, FYI. The check happens when the base collection of Sample instances for the Project is first created, which by default (defer_sample_construction=False) is at the end of the constructor.

Oct 18 '17 21:10 vreuter

Yeah, I think the current implementation is good. It actually happens to me often in big projects that I end up with duplicate sample names but this is always by mistake. Maybe the behaviour could be that an error is thrown when duplicate sample name have the same library? This would prevent submission of two pipelines overwriting each other. Otherwise simply a warning in the end of the submission loop has generally been sufficient for me to spot the mistake.

Oct 19 '17 14:10 afrendeiro

I think we shouldn't go down to library/protocol, since that really depends on how pipelines are implemented...

This reminds me of #169.

For now let's just leave it as a warning, I guess. Eventually, let's implement a 'health checker' that people can run (and maybe even looper will run by default to check for critical warnings).

But if this bites someone we could re-open this issue and elevate that to a failure.

Oct 19 '17 18:10 nsheff

Sounds good

Oct 19 '17 18:10 vreuter

New idea: if duplicate names are detected, move that column to sample_name_orig, and then generate a unique sample_name (based on the original one, maybe postpending _dupe1), and provide a warning.

Apr 19 '19 15:04 nsheff

peppy peppy copied to clipboard

Sample name uniqueness?

peppy
peppy copied to clipboard