Integration chapter improvements

Open Zethson opened this issue 3 years ago • 9 comments

Taken from @lazappi's great list of TODOs in his PR, with some additions of mine:

Todo:

  • [ ] Decide on the "batch correction"/"data integration" terminology
  • [ ] Update the methods figure
    • [ ] Add a separate panel for graph methods
    • [ ] Add a diagram for the global methods
    • [ ] Add citation for the MNN diagram
    • [ ] Consider example methods
    • [ ] Add caption
  • [ ] Resolve other comments on the text in the Google doc
  • [ ] Check for consistency with other chapters

Waiting on something else:

  • [x] Add instructions for getting the dataset and check loading is correct

Merged in PR #109:

  • [x] Create a minimal Conda environment file and not the massive dump that we have right now.
  • [x] Update environment to use latest R (pending fixes to anndata2ri (https://github.com/theislab/anndata2ri/issues/63, https://github.com/theislab/anndata2ri/pull/71) and scib (https://github.com/theislab/scib/issues/322))
  • [x] Double-check the correct batch/label keys are used
  • [x] Add a note on scalability
  • [x] Check that the epochs heuristic is still recommended by scvi-tools authors (even better, get them to recommend best practices for early stopping)
  • [x] See if densifying matrices can be avoided when passing data to Seurat (this worked previously but not with the current environment, might be fixed by updating rpy2)
  • [x] Add references/links
  • [x] Check if there is a way to avoid re-normalising (HVGs failed without doing this). If not, update to match what is done in the pre-processing chapters (if feasible). (I got errors without re-running this. I think it's fine because we do batch-aware HVGs later anyway, as already discussed.)
  • [x] Check the scIB summary scores (make sure that only scores relevant to each output are being used)
  • [x] Add the introduction from @LuckyMD (still some unresolved comments but most of it is there)

Already merged:

  • [x] Three takeaways

Zethson, Aug 02 '22 12:08

Opened a draft PR #109 with the working version of the updates. I'm going to use the checklist here to keep track of what I have done there.

lazappi, Oct 27 '22 10:10

Notes on some of the points, mostly for reference but I think some of them need discussion.

Done

  • I think the point about the batch label might have been relevant to the old dataset. Given that there is a column called "batch" here, I think it makes sense to use that (unless there are other suggestions)
  • Added a note on scalability, mostly to point out that while scVI is slow in the example it scales well
  • I changed the samples that were used, mostly to make things faster. Output is still pretty similar.
  • The epochs heuristic is built into scVI now, so it is used automatically. I still use the heuristic we had from scIB for scANVI
  • The new environment seems to have fixed the issue with transferring sparse matrices to/from R
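
For reference, the epochs heuristic mentioned above can be sketched roughly as follows. This is a minimal sketch and not the scvi-tools or scIB source: it assumes the scVI heuristic is "400 epochs up to 20,000 cells, scaled down in proportion for larger datasets" and that the scIB heuristic for scANVI is "about a third of the scVI epochs, clamped between 2 and 10". The function names `scvi_max_epochs` and `scanvi_max_epochs` are hypothetical and exist only for this illustration.

```python
def scvi_max_epochs(n_cells: int, epochs_cap: int = 400) -> int:
    """Assumed scVI heuristic: epochs_cap epochs for up to 20,000 cells,
    scaled down proportionally as the number of cells grows."""
    return min(round((20000 / n_cells) * epochs_cap), epochs_cap)


def scanvi_max_epochs(n_cells: int) -> int:
    """Assumed scIB-style heuristic for scANVI: one third of the scVI
    epochs, clamped to the range [2, 10]."""
    return int(min(10, max(2, round(scvi_max_epochs(n_cells) / 3.0))))


if __name__ == "__main__":
    # Print the suggested epochs for a few hypothetical dataset sizes.
    for n_cells in (10_000, 100_000, 2_000_000):
        print(n_cells, scvi_max_epochs(n_cells), scanvi_max_epochs(n_cells))
```

Under these assumptions, small datasets get the full 400 scVI epochs, while very large ones get only a handful, which is why the explicit value only needs to be passed for scANVI.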

Todo

  • @Zethson I don't think you have decided how to handle datasets yet? I'll leave that until we know what the solution is.
  • I haven't made any changes to the normalisation yet, as it wasn't quite clear to me what we were suggesting in the normalisation chapter
  • For HVGs I think it makes sense to use scanpy because it has batch-aware functionality (added some text about that). We could change to what is suggested in the pre-processing section but it will be some messing around.

lazappi, Oct 28 '22 15:10

> @Zethson I don't think you have decided how to handle datasets yet? I'll leave that until we know what the solution is.

Not finally, at least. That's fine with me.

> For HVGs I think it makes sense to use scanpy because it has batch-aware functionality (added some text about that). We could change to what is suggested in the pre-processing section but it will be some messing around.

That's fine! Keep it as you have it.

Zethson, Oct 28 '22 15:10

@LuckyMD The main thing left is your intro, I'm pretty happy now with everything else. It would be great to have it for this PR so we can have a complete chapter.

lazappi, Nov 08 '22 11:11

I think my intro is pretty much done... I worked on this last week in another PR (and also worked on references).

LuckyMD, Nov 08 '22 13:11

Ok, cool, I missed that on #87. @Zethson how do you want to handle merging these? Do you want to merge them both separately or do you want me to merge Malte's changes into #109 first (which I think is more up to date with the main branch) so that you only have to deal with one PR? I'm guessing there will be a few conflicts to sort out so I'm happy to do that if it makes things easier for you (also not sure if there is a good way to merge Jupyter notebooks or if it's a copy/paste situation).

lazappi, Nov 09 '22 10:11

Merging notebooks is always hell. I think it would be easiest for me if you merge Malte's text into your notebook and we merge yours.

Zethson, Nov 09 '22 10:11

Cool, let's do it that way then. Because it's just the intro text I'm thinking it will be easier to copy-paste rather than to do an actual git merge.

lazappi, Nov 09 '22 10:11

I did more than just text edits in #87 though... I also documented code and added references throughout.

LuckyMD, Nov 09 '22 11:11