Addressing author-provided batch info
Right now the wiki says to add the 'block' EFC when submitters provide their own batch info, but if batch information is refreshed, the manually added EFC is deleted from the curated design and the batches from our own automatic batch detection are used instead.
For experiments like GSE243816, submitter-provided batch information exists and differs from our own batch detection (no batch info in this case). Refreshing batch info would overwrite the 'block' EFC a curator adds. This seems to be rare, but it has happened elsewhere too, for example in GSE189788, where the submitter-provided batch info did not align with our own batches.
As mentioned, it is very, very rare that GEO submitters provide batch information, but it may become more common.
I believe there are two related situations here. In one, we have two sources of batches and we want to keep both, but only one would be used for batch correction; typically that would be our own batch calls (in the case mentioned, the authors' batching seems to be in error, otherwise we'd use theirs).
The other is when the only usable source of batch information is coming from the authors, so if we use that, we must avoid clobbering it by accident. In the first case mentioned above, Gemma ends up with singleton batches, but since the provider declares batches too, we should just use them (I don't think they are in error; our batching goes by device in this case and their batches are similar but pool some devices, so they are probably going by processing date or something like that).
I'm thinking that we could distinguish submitter-provided from Gemma-provided batch information with an appropriate naming convention, so that we can at least tell them apart. That would get us partway there (it's not foolproof). We could then add some logic that requires a special override to replace submitter-provided batches with Gemma-provided ones, plus more safeguards against overwriting existing batch information in general.
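To make the naming-convention idea concrete, here is a minimal sketch. Everything in it is hypothetical: the class, the `submitter-batch` / `gemma-batch` prefixes and the `override` parameter are made up for illustration and are not taken from the Gemma codebase. It only shows the rule "Gemma-provided batches may be replaced freely, anything else needs an explicit override".

```java
import java.util.Optional;

/** Hypothetical sketch; none of these names exist in Gemma. */
final class BatchFactorNaming {

    enum Source { SUBMITTER, GEMMA }

    // Assumed convention: the factor name carries its origin as a prefix.
    static final String SUBMITTER_PREFIX = "submitter-batch";
    static final String GEMMA_PREFIX = "gemma-batch";

    /** Infer where a batch factor came from, based only on its name. */
    static Optional<Source> sourceOf(String factorName) {
        if (factorName == null) {
            return Optional.empty();
        }
        if (factorName.startsWith(SUBMITTER_PREFIX)) {
            return Optional.of(Source.SUBMITTER);
        }
        if (factorName.startsWith(GEMMA_PREFIX)) {
            return Optional.of(Source.GEMMA);
        }
        return Optional.empty();
    }

    /**
     * Decide whether a batch refresh may replace the existing factor.
     * Gemma-provided factors can always be regenerated; submitter-provided
     * factors and factors of unknown origin are only replaced when the
     * caller passes an explicit override.
     */
    static boolean mayReplace(String existingFactorName, boolean override) {
        boolean gemmaProvided = sourceOf(existingFactorName)
                .map(source -> source == Source.GEMMA)
                .orElse(false);
        return gemmaProvided || override;
    }
}
```

Treating factors of unknown origin as protected is a deliberate choice in this sketch: a name-based convention is not foolproof, so the conservative default is to refuse the overwrite unless asked explicitly.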
This is also a thing for cell type factors that may get regenerated upon importing a new cell type assignment.
The best approach would be to identify whether a factor was created manually (by a curator) or imported from a data file, and make sure that in those cases we don't override it. This could be done by adding a flag in ExperimentalFactor.
GSE261817 also has author-provided batch info.
The solution for this is in https://github.com/PavlidisLab/Gemma/issues/1498. I intend to simply not remove factors that are marked as "manually curated".
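Roughly, the refresh would behave like the sketch below. This is an illustration under assumptions, not the actual change: `Factor` is a stand-in for ExperimentalFactor, and the `manuallyCurated` field name and the `factorsToKeep` helper are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical sketch of a refresh that leaves curated factors alone. */
final class FactorRefreshSketch {

    /** Minimal stand-in for an experimental factor; not Gemma's class. */
    static final class Factor {
        final String name;
        // Assumed flag: true when a curator created or imported the factor.
        final boolean manuallyCurated;

        Factor(String name, boolean manuallyCurated) {
            this.name = name;
            this.manuallyCurated = manuallyCurated;
        }
    }

    /**
     * Return the factors that survive a refresh. Factors flagged as
     * manually curated are kept; everything else is dropped so that it
     * can be regenerated from automatic detection.
     */
    static List<Factor> factorsToKeep(List<Factor> existing) {
        List<Factor> kept = new ArrayList<>();
        for (Factor factor : existing) {
            if (factor.manuallyCurated) {
                kept.add(factor);
            }
        }
        return kept;
    }
}
```

With this rule, a curator-added 'block' factor would survive a batch refresh, while an automatically generated batch factor in the same design would still be regenerated.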