single-cell-best-practices
Integration chapter improvements
Taken from @lazappi's great list of TODOs in his PR, with some additions of mine:
Todo:
- [ ] Decide on the "batch correction"/"data integration" terminology
- [ ] Update the methods figure
- [ ] Add a separate panel for graph methods
- [ ] Add a diagram for the global methods
- [ ] Add citation for the MNN diagram
- [ ] Consider example methods
- [ ] Add caption
- [ ] Resolve other comments on the text in the Google doc
- [ ] Check for consistency with other chapters
Waiting on something else:
- [x] Add instructions for getting the dataset and check loading is correct
Merged in PR #109:
- [x] Create a minimal Conda environment file and not the massive dump that we have right now.
- [x] Update environment to use latest R (pending fixes to anndata2ri (https://github.com/theislab/anndata2ri/issues/63, https://github.com/theislab/anndata2ri/pull/71) and scib (https://github.com/theislab/scib/issues/322))
- [x] Double-check the correct batch/label keys are used
- [x] Add a note on scalability
- [x] Check that the epochs heuristic is still recommended by scvi-tools authors (even better, get them to recommend best practices for early stopping)
- [x] See if densifying matrices can be avoided when passing data to Seurat (this worked previously but not with the current environment, might be fixed by updating rpy2)
- [x] Add references/links
- [x] Check if there is a way to avoid re-normalising (HVGs failed without doing this). If not, update to match what is done in pre-processing chapters (if feasible). (I got errors without re-running this. I think it's fine because we do batch-aware HVGs later anyway, as already discussed)
- [x] Check the scIB summary scores (make sure that only scores relevant to each output are being used)
- [x] Add the introduction from @LuckyMD (still some unresolved comments but most of it is there)
Already merged:
- [x] Three takeaways
Opened a draft PR #109 with the working version of the updates. I'm going to use the checklist here to keep track of what I have done there.
Notes on some of the points, mostly for reference but I think some of them need discussion.
Done
- I think the point about the batch label might have been relevant to the old dataset. Given there is a column called "batch" here, I think it makes sense to use that (unless there are other suggestions)
- Added a note on scalability, mostly to point out that while scVI is slow in the example it scales well
- I changed the samples that were used, mostly to make things faster. Output is still pretty similar.
- The epochs heuristic is built into scVI now, so that is used automatically. I still use the heuristic we had from scIB for scANVI
- The new environment seems to have fixed the issue with transferring sparse matrices to/from R
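
For reference, a minimal sketch of the two epoch heuristics mentioned above. The formulas are written from memory of the scvi-tools default and the scIB scANVI wrapper, so treat them as assumptions and check them against the current sources before relying on them. The function names `vae_epochs` and `scanvi_epochs` are hypothetical and do not exist in either package.

```python
def vae_epochs(n_cells: int) -> int:
    """Assumed scVI-style default: 400 epochs for up to 20,000 cells,
    then proportionally fewer epochs as the dataset grows."""
    return min(round((20000 / n_cells) * 400), 400)


def scanvi_epochs(n_cells: int) -> int:
    """Assumed scIB-style rule for scANVI: roughly a third of the
    VAE epochs, clamped to the range [2, 10]."""
    return min(10, max(2, round(vae_epochs(n_cells) / 3.0)))


if __name__ == "__main__":
    for n in (10_000, 40_000, 1_000_000):
        print(n, vae_epochs(n), scanvi_epochs(n))
```

With these assumed formulas, small datasets get the full 400 epochs for the VAE and 10 for scANVI, and both numbers shrink as the cell count grows.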
Todo
- @Zethson I don't think you have decided how to handle datasets yet? I'll leave that until we know what the solution is.
- I haven't made any changes to the normalisation yet because it wasn't quite clear to me what we were suggesting in the normalisation chapter
- For HVGs I think it makes sense to use scanpy because it has batch-aware functionality (added some text about that). We could change to what is suggested in the pre-processing section but it will be some messing around.
> @Zethson I don't think you have decided how to handle datasets yet? I'll leave that until we know what the solution is.

Not finally at least. That's fine with me.

> For HVGs I think it makes sense to use scanpy because it has batch-aware functionality (added some text about that). We could change to what is suggested in the pre-processing section but it will be some messing around.

That's fine! Keep it as you have it.
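
To make the "batch-aware" point concrete, here is a toy sketch of the idea in plain Python: pick the most variable genes within each batch separately, then prefer genes that are selected in more batches. This is a simplified illustration under stated assumptions, not scanpy's actual implementation. In scanpy the corresponding option is the `batch_key` argument of `sc.pp.highly_variable_genes`. The function name `batch_aware_hvgs` and the toy data are hypothetical.

```python
from collections import Counter
from statistics import pvariance


def batch_aware_hvgs(expr, batches, n_top):
    """expr: dict mapping gene name -> list of values (one per cell).
    batches: list of batch labels (one per cell).
    Returns (selected genes, per-gene count of batches where it was selected)."""
    hits = Counter()
    for batch in sorted(set(batches)):
        cells = [i for i, label in enumerate(batches) if label == batch]
        # Variance of each gene using only the cells of this batch
        var = {g: pvariance([vals[i] for i in cells]) for g, vals in expr.items()}
        top = sorted(var, key=lambda g: (-var[g], g))[:n_top]
        hits.update(top)
    # Rank genes by how many batches selected them (ties broken by name)
    ranked = sorted(expr, key=lambda g: (-hits[g], g))
    return ranked[:n_top], dict(hits)


# Toy data: 4 cells in batch "b1" followed by 4 cells in batch "b2"
batches = ["b1"] * 4 + ["b2"] * 4
expr = {
    "geneA": [0, 5, 0, 5, 0, 5, 0, 5],  # variable in both batches
    "geneB": [1, 1, 1, 1, 0, 9, 0, 9],  # variable only in b2
    "geneC": [0, 2, 0, 2, 0, 2, 0, 2],  # mildly variable in both
    "geneD": [3, 3, 3, 3, 3, 3, 3, 3],  # constant
}
selected, hits = batch_aware_hvgs(expr, batches, n_top=2)
```

In this toy example `geneA` is selected in both batches and ranks first, while `geneB` is only picked up in the one batch where it varies.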
@LuckyMD The main thing left is your intro, I'm pretty happy now with everything else. It would be great to have it for this PR so we can have a complete chapter.
I think my intro is pretty much done... I worked on this last week in another PR (also worked on references).
Ok, cool, I missed that on #87. @Zethson how do you want to handle merging these? Do you want to merge them both separately or do you want me to merge Malte's changes into #109 first (which I think is more up to date with the main branch) so that you only have to deal with one PR? I'm guessing there will be a few conflicts to sort out so I'm happy to do that if it makes things easier for you (also not sure if there is a good way to merge Jupyter notebooks or if it's a copy/paste situation).
Merging notebooks is always hell. I think it would be easiest for me if you merge Malte's text into your notebook and we merge yours.
Cool, let's do it that way then. Because it's just the intro text I'm thinking it will be easier to copy-paste rather than to do an actual git merge.
I did more than only text edits in #87 though... also documentation of code and references throughout.