[r] add iterative lsi design docs
Pretty much as we discussed during call!
I think the biggest point of contention is the normalization structure. Normalizations like tf-idf, Z-score norms can have data that they will be fit to. However, the biggest problem is that we want something that can interoperate with BPCells operations, while also return the calculated information (mean, variance, idf). Should it follow the same styling as the S3 class for LSI that we are creating, with cell.embeddings/feature.loadings?
I propose just having a boolean param, with the default returning an IterableMatrix, and the other being an option to return a class that we can project with.
Thanks for this update, very clearly written
Discussion points:
- New idea for the base name, maybe
select_features_[method]instead ofvariable_features_[method]? I think this might side-step some confusion that some selection methods won't be based on variance, and has the benefit of actually using a verb. Curious what you think
Comments:
- I think
meanmight be a better general name than_by_top_accessibility-- I think the calculation would be basically the same - for
_by_cluster_variance, I thought we discussed that we might not make this helper method layer, and just do the pseudobulking in the iterative lsi function- That said, we could probably use functions for
_varianceand_dispersion(variance divided by mean)
- That said, we could probably use functions for
- As discussed on slack, I think
normalize_tfidfshould have afeature_meansorfeature_sumsargument. We won't return any extra data and rely on the clarity of the argument name for pepole to know - I think we can ditch
feature.loadingsfor now from the Seurat interface, and maybe renamecell.embeddingstocell_embeddingsfor better stylistic consistency within BPCells. We can addfeature.loadingsback in later if we decide it would be useful
select_features is good!
For your second point comment point, I think my interpretation from that conversation was that the clustering would be separate, but we would still pseudobulk within funciton. By all means, we can take that out and put it into the wrapper iterative lsi function though! Do you think it would make sense to do variance and dispersion as just a parameter in the same function instead, rather than separate functions?
Overall I agree with your other points. Will reflect here soon!
I was originally thinking clustering + pseudobulk calculation could happen in iterative_lsi. Then we still have a parameter in iterative_lsi to let the user configure how feature selection happens (which could be e.g. select_features_variance(normalize=normalize_log) by default). Then the two functions I mentioned would just be general options for variable feature selection, implemented outside of iterative lsi. I think having at least a select_features_variance function separate is a good idea; still OK to have a select_features_by_pseudobulk method if you prefer to implement it that way -- not a big deal either way