BPCells [r] add iterative lsi design docs

Pretty much as we discussed during call!

I think the biggest point of contention is the normalization structure. Normalizations like tf-idf, Z-score norms can have data that they will be fit to. However, the biggest problem is that we want something that can interoperate with BPCells operations, while also return the calculated information (mean, variance, idf). Should it follow the same styling as the S3 class for LSI that we are creating, with cell.embeddings/feature.loadings?

I propose just having a boolean param, with the default returning an IterableMatrix, and the other being an option to return a class that we can project with.

Dec 12 '24 00:12 immanuelazn

Thanks for this update, very clearly written

Discussion points:

New idea for the base name, maybe select_features_[method] instead of variable_features_[method]? I think this might side-step some confusion that some selection methods won't be based on variance, and has the benefit of actually using a verb. Curious what you think

Comments:

I think mean might be a better general name than _by_top_accessibility -- I think the calculation would be basically the same
for _by_cluster_variance, I thought we discussed that we might not make this helper method layer, and just do the pseudobulking in the iterative lsi function
- That said, we could probably use functions for _variance and _dispersion (variance divided by mean)
As discussed on slack, I think normalize_tfidf should have a feature_means or feature_sums argument. We won't return any extra data and rely on the clarity of the argument name for pepole to know
I think we can ditch feature.loadings for now from the Seurat interface, and maybe rename cell.embeddings to cell_embeddings for better stylistic consistency within BPCells. We can add feature.loadings back in later if we decide it would be useful

Dec 12 '24 01:12 bnprks

select_features is good!

For your second point comment point, I think my interpretation from that conversation was that the clustering would be separate, but we would still pseudobulk within funciton. By all means, we can take that out and put it into the wrapper iterative lsi function though! Do you think it would make sense to do variance and dispersion as just a parameter in the same function instead, rather than separate functions?

Overall I agree with your other points. Will reflect here soon!

Dec 12 '24 01:12 immanuelazn

I was originally thinking clustering + pseudobulk calculation could happen in iterative_lsi. Then we still have a parameter in iterative_lsi to let the user configure how feature selection happens (which could be e.g. select_features_variance(normalize=normalize_log) by default). Then the two functions I mentioned would just be general options for variable feature selection, implemented outside of iterative lsi. I think having at least a select_features_variance function separate is a good idea; still OK to have a select_features_by_pseudobulk method if you prefer to implement it that way -- not a big deal either way

Dec 12 '24 02:12 bnprks