single-cell-best-practices Integration chapter improvements

Taken from @lazappi's great list of TODOs in his PR, with some additions of mine:

Todo:

[ ] Decide on the "batch correction"/"data integration" terminology
[ ] Update the methods figure
- [ ] Add a separate panel for graph methods
- [ ] Add a diagram for the global methods
- [ ] Add citation for the MNN diagram
- [ ] Consider example methods
- [ ] Add caption
[ ] Resolve other comments on the text in the Google doc
[ ] Check for consistency with other chapters

Waiting on something else:

[x] Add instructions for getting the dataset and check loading is correct

Merged in PR #109:

[x] Create a minimal Conda environment file and not the massive dump that we have right now.
[x] Update environment to use latest R (pending fixes to anndata2ri (https://github.com/theislab/anndata2ri/issues/63, https://github.com/theislab/anndata2ri/pull/71) and scib (https://github.com/theislab/scib/issues/322))
[x] Double-check the correct batch/label keys are used
[x] Add a note on scalability
[x] Check that the epochs heuristic is still recommended by scvi-tools authors (even better, get them to recommend best practices for early stopping)
[x] See if densifying matrices can be avoided when passing data to Seurat (this worked previously but not with the current environment, might be fixed by updating rpy2)
[x] Add references/links
[x] Check if there is a way to avoid re-normalising (HVGs failed without doing this). If not, update to match what is done in the pre-processing chapters (if feasible). (I got errors without re-running this. I think it's fine because we do batch-aware HVGs later anyway, as already discussed.)
[x] Check the scIB summary scores (make sure that only scores relevant to each output are being used)
[x] Add the introduction from @LuckyMD (still some unresolved comments but most of it is there)

Already merged:

[x] Three takeaways

Aug 02 '22 12:08 Zethson

Opened a draft PR #109 with the working version of the updates. I'm going to use the checklist here to keep track of what I have done there.

Oct 27 '22 10:10 lazappi

Notes on some of the points, mostly for reference but I think some of them need discussion.

Done

I think the point about the batch label might have been relevant to the old dataset. Given there is a column called "batch" here, I think it makes sense to use that (unless there are other suggestions).
Added a note on scalability, mostly to point out that while scVI is slow in the example it scales well
I changed the samples that were used, mostly to make things faster. Output is still pretty similar.
The epochs heuristic is built into scVI now, so it is used automatically. I still use the heuristic we had from scIB for scANVI.
The new environment seems to have fixed the issue with transferring sparse matrices to/from R
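
For reference, a minimal sketch of the two epoch heuristics mentioned above. The formulas are my recollection of what scvi-tools and scIB use and should be treated as assumptions, not the libraries' actual API. The function names (`scvi_max_epochs`, `scanvi_max_epochs`) are hypothetical and exist only for this illustration.

```python
def scvi_max_epochs(n_cells: int, cap: int = 400) -> int:
    """Assumed scVI-style heuristic: scale epochs down as the dataset grows.

    Datasets with 20,000 cells or fewer get `cap` epochs; larger datasets
    get proportionally fewer.
    """
    return int(min(round((20000 / n_cells) * cap), cap))


def scanvi_max_epochs(scvi_epochs: int) -> int:
    """Assumed scIB-style heuristic for scANVI.

    Roughly one third of the scVI epochs, clamped to the range [2, 10].
    """
    return int(min(10, max(2, round(scvi_epochs / 3.0))))


if __name__ == "__main__":
    # Hypothetical dataset sizes, chosen only to show the scaling.
    for n_cells in (10_000, 40_000, 200_000):
        n_scvi = scvi_max_epochs(n_cells)
        print(n_cells, n_scvi, scanvi_max_epochs(n_scvi))
```

If these formulas are right, the scANVI fine-tuning stage stays short (at most 10 epochs) regardless of dataset size, which is presumably why a separate heuristic is still needed for it.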

Todo

@Zethson I don't think you have decided how to handle datasets yet? I'll leave that until we know what the solution is.
I haven't made any changes to the normalisation yet, it wasn't quite clear to me what we were suggesting in the normalisation chapter
For HVGs I think it makes sense to use scanpy because it has batch-aware functionality (added some text about that). We could change to what is suggested in the pre-processing section but it will be some messing around.

Oct 28 '22 15:10 lazappi

> @Zethson I don't think you have decided how to handle datasets yet? I'll leave that until we know what the solution is.

Not finally, at least. That's fine with me.

> For HVGs I think it makes sense to use scanpy because it has batch-aware functionality (added some text about that). We could change to what is suggested in the pre-processing section but it will be some messing around.

That's fine! Keep it as you have it.

Oct 28 '22 15:10 Zethson

@LuckyMD The main thing left is your intro, I'm pretty happy now with everything else. It would be great to have it for this PR so we can have a complete chapter.

Nov 08 '22 11:11 lazappi

I think my intro is pretty much done... I worked on this last week in another PR (also worked on references).

Nov 08 '22 13:11 LuckyMD

Ok, cool, I missed that on #87. @Zethson how do you want to handle merging these? Do you want to merge them both separately or do you want me to merge Malte's changes into #109 first (which I think is more up to date with the main branch) so that you only have to deal with one PR? I'm guessing there will be a few conflicts to sort out so I'm happy to do that if it makes things easier for you (also not sure if there is a good way to merge Jupyter notebooks or if it's a copy/paste situation).

Nov 09 '22 10:11 lazappi

Merging notebooks is always hell. I think it would be easiest for me if you merge Malte's text into your notebook and we merge yours.

Nov 09 '22 10:11 Zethson

Cool, let's do it that way then. Because it's just the intro text I'm thinking it will be easier to copy-paste rather than to do an actual git merge.

Nov 09 '22 10:11 lazappi

I did more than only text edits in #87 though... also documentation of code and references throughout.

Nov 09 '22 11:11 LuckyMD
