sarek

Make patient optional in the samplesheet

Open saulpierotti opened this issue 2 years ago • 10 comments

Description of feature

I often work with non-human germline data, and sarek requires both a patient ID and a sample ID in the samplesheet. The IDs that are then written to the CRAM and VCF files are in the form patient_sample. In my case the concept of patient does not really apply, so I just set patient and sample to the same value. However, this results in IDs in the VCF and CRAM files of the form ID_ID, while I would want them to be just ID.

I suggest making patient optional when running germline variant calling.

saulpierotti avatar Jul 11 '23 12:07 saulpierotti

The way I see it, if samples are unique (or at least the sample/lane combination is unique), we can indeed make patient optional. In that case, output could be meta.id based instead of meta.patient+meta.id based.
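
For illustration, the naming rule proposed here could be sketched as follows. This is a minimal Python sketch, not sarek's actual code (sarek is written in Nextflow); the function name `output_id` is hypothetical, and the only assumption taken from the thread is that IDs are currently written as patient_sample.

```python
from typing import Optional


def output_id(sample: str, patient: Optional[str] = None) -> str:
    """Hypothetical naming rule: use the sample ID alone when no patient is given."""
    if not patient:
        # Patient omitted (None or empty): keep the sample ID unchanged.
        return sample
    # Patient provided: keep the patient_sample form described in the issue.
    return f"{patient}_{sample}"


print(output_id("ID", "ID"))  # behaviour described in the issue: ID_ID
print(output_id("ID"))        # proposed behaviour with patient omitted: ID
```

With a rule like this, germline users could leave patient out and keep their sample IDs as-is, while tumor/normal setups that do provide a patient would see no change.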

maxulysse avatar Jul 11 '23 13:07 maxulysse

or we could rename patient to entity?

FriederikeHanssen avatar Jul 12 '23 10:07 FriederikeHanssen

> or we could rename patient to entity?

Yeah, "entity" might be better. I was a bit puzzled by the choice of the column name "patient", since the sample doesn't have to come from a patient. (Although I guess it is suitable for most users.)

asp8200 avatar Jul 12 '23 11:07 asp8200

Renaming patient to entity makes sense, I think, but it would not solve the sample naming issue in the outputs, right? If entity is not optional, the samples in the output will still be named entity_sample, even for germline calling where this does not really make sense. Basically, if I have a sample ID, I would like to be able to preserve that sample ID in the CRAMs and VCFs.

saulpierotti avatar Jul 12 '23 11:07 saulpierotti

I'm in the middle of exploring sarek for our (parasite) population genetics projects and I had exactly the same remark.

I'd be in favor of both options. I can imagine scenarios where multiple samples come from the same subject (whether this be a patient, collection site, etc.), as well as other settings where sample is a unique and top-level identifier that does not need to be nested further.

The former would make the pipeline more general for analyses that do not involve patient data, but do have another grouping factor. I'd argue that subject is more generally applicable than entity for this case, but I guess it could depend on your field.

The latter would fix the double naming issue for situations where sample is the sole identifier and is guaranteed to be unique. It would also make the sarek pipeline more similar to the rnaseq one (which only relies on sample,fastq_1,fastq_2,strandedness).
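
To make the "sample is the sole identifier and is guaranteed to be unique" case concrete, here is a rough Python sketch of reading a samplesheet in which the patient column is optional. It is an assumption-laden illustration rather than sarek's real validation: the function name `read_samplesheet` is hypothetical, and the column names `patient`, `sample`, `lane`, `fastq_1`, and `fastq_2` in the example are assumed for the sake of the sketch.

```python
import csv
import io
from collections import Counter


def read_samplesheet(text):
    """Parse a samplesheet where the `patient` column is optional (hypothetical)."""
    rows = list(csv.DictReader(io.StringIO(text)))
    for row in rows:
        # Treat a missing column and an empty cell the same way.
        row["patient"] = (row.get("patient") or "").strip()
        row["lane"] = (row.get("lane") or "").strip()
    # Without a patient, rows must still be unique on sample/lane.
    counts = Counter((r["patient"], r["sample"], r["lane"]) for r in rows)
    duplicates = sorted(key for key, n in counts.items() if n > 1)
    if duplicates:
        raise ValueError(f"duplicate patient/sample/lane rows: {duplicates}")
    return rows


sheet = (
    "sample,lane,fastq_1,fastq_2\n"
    "S1,L1,s1_l1_R1.fq.gz,s1_l1_R2.fq.gz\n"
    "S1,L2,s1_l2_R1.fq.gz,s1_l2_R2.fq.gz\n"
)
rows = read_samplesheet(sheet)
```

The uniqueness check is what would let the pipeline safely drop the patient prefix: if two rows share the same sample and lane and no patient distinguishes them, the sheet is rejected instead of silently merging data.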

Another example where patient ID does not always make sense is when (re)analyzing reads from SRA, especially when done in conjunction with https://github.com/nf-core/fetchngs. The fetchngs pipeline outputs an nf-core-styled samplesheet.csv, but it is currently only optimized for use with a few other nf-core pipelines (like nf-core/rnaseq). It seems to use a sample column to store the SRA experiment and a run column to store the SRA run, although in typing this I wonder why the BioSample identifier isn't being used as the sample column instead.

pmoris avatar Jul 19 '23 16:07 pmoris

Thanks for the great input @pmoris 🙏 On a side note: making fetchngs directly output a suitable samplesheet is pretty high on our list of features we are working on in the next few months.

FriederikeHanssen avatar Jul 19 '23 16:07 FriederikeHanssen

Did this ever end up going anywhere with regard to fetchngs?

Shaun-Regenbaum avatar Dec 21 '23 10:12 Shaun-Regenbaum

sorry, no time yet. We'll look into it in the new year

FriederikeHanssen avatar Dec 21 '23 11:12 FriederikeHanssen

I can work on it if you want. I am getting into nf-core workflows and contributing.

Shaun-Regenbaum avatar Dec 27 '23 11:12 Shaun-Regenbaum

Yes, sounds great. We have a dev channel called #sarek_dev on Slack as well if you need help with anything :)

FriederikeHanssen avatar Jan 10 '24 12:01 FriederikeHanssen