pycytominer feature idea: subset before normalization

Briefly: we've run into a situation where we're testing a matrix of conditions [cell type x treatment x dose], and we may want to normalize each level of one axis of the matrix (i.e., each cell type) separately within each plate (rather than normalizing to the whole plate or to one condition within it) if we discover that one axis of our matrix drives the majority of the difference. I can imagine other "matrix of conditions" cases where this might be useful as well, such as [cell type x density].

It seems technically straightforward to implement: essentially, take everything that happens in normalize and, if an optional column name is passed, break the dataframe into subsets based on that column, run the usual normalize steps on each subset, then rebuild the final dataframe at the end and return it. We just need to decide if it's a good idea. We are currently discussing that internally in Slack, but the discussion could also happen here.
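A minimal sketch of that split / normalize / rebuild flow, in plain pandas. Everything named here is hypothetical: `normalize_by_subset`, `subset_col`, and the `zscore` stand-in are illustrations, not part of pycytominer's API, and the sketch assumes the dataframe has a unique index.

```python
import pandas as pd


def zscore(df, features):
    """Hypothetical stand-in for a normalizer: z-score the feature columns."""
    out = df.copy()
    out[features] = (df[features] - df[features].mean()) / df[features].std()
    return out


def normalize_by_subset(df, features, subset_col=None, normalize_fn=zscore):
    """Normalize the whole frame, or each subset of `subset_col` separately."""
    if subset_col is None:
        # No column passed: behave like an ordinary whole-frame normalization
        return normalize_fn(df, features)
    # Split on the column, normalize each piece, then rebuild the frame
    pieces = [normalize_fn(piece, features) for _, piece in df.groupby(subset_col)]
    # Restore the original row order (assumes a unique index)
    return pd.concat(pieces).loc[df.index]


# Toy data with made-up values
profiles = pd.DataFrame({
    "Metadata_cell_type": ["A", "A", "A", "B", "B", "B"],
    "feature_1": [1.0, 2.0, 3.0, 10.0, 20.0, 30.0],
})
result = normalize_by_subset(profiles, ["feature_1"], subset_col="Metadata_cell_type")
```

Because `normalize_fn` is just a callable taking a dataframe and a feature list, a thin wrapper around a real normalizer could presumably be passed in its place.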

Can anyone remember any small completed profiling experiment that we've done before where we had both a) multiple cell types on a plate and b) multiple treatments of each cell type on a plate? I'm asking because we were looking at the GCTs of a project that had both a) and b) (3 cell types with ~30 treatments of each cell type on each plate), and basically the result we got was "cell type similarity swamps out everything else in the normalization", so we ended up with 3 big red blocks on our similarity matrix and nothing else. Once we rule out a couple of small technical plate-position things, our next idea is to split the profiles, after annotation, into per-cell-type "pseudoplates" and then normalize each pseudoplate separately. That isn't technically hard to do; I'm just trying to figure out the following. Was this expected, or should it have been? Should we plan to always split into "pseudoplates" in the future for experiments that mix both cell type and treatment on the same plate? In other words, is this functionality we should consider building into, say, pycytominer, to do normalization separately on subsets? Should we advise collaborators not to run experiments that mix cell type and treatment on the same plate?
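A toy illustration of the effect described above, using made-up numbers and a single feature (the real data involved full profiles and a similarity matrix). It assumes cell type shifts the feature far more than treatment does, then compares whole-plate z-scoring with per-cell-type "pseudoplate" z-scoring. Column names are hypothetical.

```python
import pandas as pd

# Made-up plate: cell type moves the feature by ~40 units, treatment by 1 unit
plate = pd.DataFrame({
    "cell_type": ["A", "A", "B", "B", "C", "C"],
    "treatment": ["ctrl", "drug", "ctrl", "drug", "ctrl", "drug"],
    "feature": [10.0, 11.0, 50.0, 51.0, 90.0, 91.0],
})


def zscore(values):
    """Z-score a pandas Series."""
    return (values - values.mean()) / values.std()


# Whole-plate normalization: one mean and standard deviation for every well
plate["whole_plate"] = zscore(plate["feature"])

# "Pseudoplate" normalization: separate statistics for each cell type
plate["per_cell_type"] = plate.groupby("cell_type")["feature"].transform(zscore)

# Average normalized value per treatment under each scheme
treatment_means = plate.groupby("treatment")[["whole_plate", "per_cell_type"]].mean()
print(treatment_means)
```

In this toy case the ctrl/drug gap is tiny after whole-plate normalization, because the cell type spread inflates the standard deviation, and much larger after per-cell-type normalization.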

Jul 07 '21 19:07 bethac07

If I'm understanding correctly, you can implement this outside of pycytominer currently.

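from pycytominer import normalize

# Normalize each condition group separately; groupby/apply recombines the results
(
    profile_df
    .groupby(["cell_type", "dose", "treatment"])
    .apply(
        lambda x: normalize(
            profiles=x,
            method="standardize",
            # ... remaining normalize arguments
        )
    )
)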

This would be the way to do it if we wanted to build this into the code as well. If we do, we might consider splitting this groupby out into a separate function so that other operations can use it too.

Jul 07 '21 19:07 gwaybio

Yes, I think you understand it just right, and absolutely, that would be another way to do it, which we could do at, say, the recipe level rather than within pycytominer. It's not always clear to me which functions we might want at each level!

Jul 07 '21 20:07 bethac07

> It's not always clear to me which functions we might want at each level!

Same... I think we'll gain clarity once we get more pycytominer / recipe use cases, and we answer questions like: "How often do people use pycytominer outside the recipe?", "Is the 'traditional' recipe more of an exception than the norm?", "How often do we need to experiment with pycytominer on pilot data before settling on a specific recipe configuration?"

Right now my answer is 🤷. In the long run, I agree it would be nice to have this in the recipe, but I think by design the recipe will always lag behind pycytominer. Or maybe it'll actually drive its development! Fun!

Jul 07 '21 21:07 gwaybio

Bumping this, because we have now had multiple people asking for it. I'm in the process of writing it into the recipe, but I do think it might be better to have it in pycytominer itself!

May 17 '23 11:05 bethac07

We're integrating pycytominer into our Cell Painting scripts, and I was wondering how normalize actually works. From the code, it seems that it normalizes the whole dataset using either a subset of it or the whole dataset as the reference. If I understand correctly, there is no implementation yet for normalizing each plate as a whole in multi-plate experiments?
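One way to read the per-plate case is sketched below in plain pandas: for each plate, compute the statistics from a reference subset (or from the whole plate) and apply them to every well on that plate. The helper names, column names, and the `reference_query` argument are all hypothetical and are not pycytominer's API; the sketch also assumes a unique index.

```python
import pandas as pd


def standardize_to_reference(df, features, reference_query=None):
    """Z-score `features` using statistics from the reference rows (all rows if None)."""
    reference = df if reference_query is None else df.query(reference_query)
    out = df.copy()
    out[features] = (df[features] - reference[features].mean()) / reference[features].std()
    return out


def normalize_per_plate(df, features, plate_col, reference_query=None):
    """Apply the normalization to each plate separately, then rebuild the frame."""
    pieces = [
        standardize_to_reference(plate_df, features, reference_query)
        for _, plate_df in df.groupby(plate_col)
    ]
    # Restore the original row order (assumes a unique index)
    return pd.concat(pieces).loc[df.index]


# Made-up data: plate P2 has a 10x intensity scale relative to plate P1
profiles = pd.DataFrame({
    "Metadata_Plate": ["P1"] * 4 + ["P2"] * 4,
    "Metadata_treatment": ["DMSO", "DMSO", "drug", "drug"] * 2,
    "feature_1": [1.0, 3.0, 5.0, 7.0, 10.0, 30.0, 50.0, 70.0],
})
normalized = normalize_per_plate(
    profiles, ["feature_1"], "Metadata_Plate", "Metadata_treatment == 'DMSO'"
)
```

In this toy case the plate-level scale difference disappears, since each plate is expressed relative to its own DMSO wells.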

It seems that this issue addresses this as well - am I right?

Looking at the other issues - #177 might provide some more clarification. If there's any help needed with implementation please let me know - happy to help.

Jun 27 '23 15:06 tilenkranjc