peppy
peppy copied to clipboard
Sample name uniqueness?
It's mentioned in the special columns section of the project definition docs that sample_name
should be a unique identifier, but I don't think that we enforce the uniqueness constraint, and I seem to be able to tweak the microtest
annotations to have duplicate sample names and get no error. I think this really isn't a major issue since I think that some protocol data is embedded in the sample name, and often there may only be one protocol/pipeline per "sample" row in the sheet anyway.
The scenario I'm envisioning is one in which someone creates a sheet with two entries with the same name, but protocols are different. Say, old_mouse_A
with ATAC
as protocol, and and old_mouse_A
with WGBS
as protocol. I think then that the results would just wind up in the same output folder, but otherwise this may not be problematic? Does anyone know top-of-mind whether output from these hypothetical samples would in fact be destined for the same folder? If so, is this the desired behavior, or do we want to differentiate on protocol?
I originally typed out this response, but now I'm rethinking:
The
sample_name
at one point was forced to be unique. If that has been lost, I think we should reinstate it. I think it's important to enforce that because each sample is typically put in a sample folder, and we use that as a unique handle for the sample in lots of downstream cases.
I think at least it should be a warning...but maybe we don't have to enforce it...
Ok, maybe that's a good sign that it's infrequent enough to have come up in a while? Seems like a pretty rare case but it just came up as I was tinkering earlier today. Either way seems OK to me (true exception or just warning) so I'm fine with whatever you think.
Thoughts @afrendeiro ?
This is currently implemented as a warning message, FYI. The check happens when the base collection of Sample
instances for the Project
is first created, which by default (defer_sample_construction=False
) is at the end of the constructor.
Yeah, I think the current implementation is good. It actually happens to me often in big projects that I end up with duplicate sample names but this is always by mistake. Maybe the behaviour could be that an error is thrown when duplicate sample name have the same library? This would prevent submission of two pipelines overwriting each other. Otherwise simply a warning in the end of the submission loop has generally been sufficient for me to spot the mistake.
I think we shouldn't go down to library/protocol, since that really depends on how pipelines are implemented...
This reminds me of #169.
For now let's just leave it as a warning, I guess. Eventually, let's implement a 'health checker' that people can run (and maybe even looper will run by default to check for critical warnings).
But if this bites someone we could re-open this issue and elevate that to a failure.
Sounds good
New idea: if duplicate names are detected, move that column to sample_name_orig
, and then generate a unique sample_name
(based on the original one, maybe postpending _dupe1
), and provide a warning.