scanpy icon indicating copy to clipboard operation
scanpy copied to clipboard

CITE-seq improvements

Open ivirshup opened this issue 4 years ago • 13 comments

This is a bit of a catch all to improve cite-seq support. Currently a dependency of https://github.com/theislab/scanpy-tutorials/pull/14. I'll write a bit more about this after some sleep.

The main goals here are:

  • Implement analysis functions for cite-seq data (like joint clustering, geometric mean normalization, etc.)
  • Improve generality of existing APIs. Basically, if the antibody counts are in obsm, we should be able to apply most functions to that. This is not currently the case.

ivirshup avatar Mar 17 '20 14:03 ivirshup

I just issued (and soon closed) #1107 about the same idea. Wouldn’t be better to support more than two graphs in the multiplex? leidenalg functions work perfectly in case of 3+ graphs and it is extremely fast.

dawe avatar Mar 17 '20 22:03 dawe

thanks for getting this started!

since this new modality has different signal characteristics, I wanted to bring up for discussion:

normalization choice: for the incoming geometric normalization, any justification for choosing that one over others?

in your https://github.com/theislab/scanpy-tutorials/pull/14 (10x PBMC dataset of ~30 Totalseq antibodies), the antibody panel is similar to that used in mass cytometry datasets, but different papers seem to prefer different transforms -- which begs the question, now that similar panels are being used, which transform makes the most sense in terms of:

  • preserving visual interpretation of absent/low/med/high (corresponding to expectations of cell subsets)
  • handling a variety of marker distribution shapes (unimodal/bimodal/trimodal, skewed shapes)
  • making it easier to spot nonspecific antibody staining / off-target effects
  • not introducing more bias in downstream differential comparisons (fits with assumptions about variable distribution properties, based on the commonly used statistical testing methods)

absent a convincing answer, it may be worth implementing multiple as options, leaving the choice to the user, and just documenting these use-cases through citations; eventually, someone can make a notebook that compares the behaviors, biological expectations, and/or impacts on statistical comparisons to inform which method should be the default. While the CITEseq paper applied CLR, it's not obvious that one is better than the ones used in more time-tested fields like mass cytometry and flow cytometry.

def CLR_transform(df):
    '''
    implements the CLR transform used in CITEseq (need to confirm in Seurat's code)
    https://doi.org/10.1038/nmeth.4380
    '''
    logn1 = np.log(df + 1)
    T_clr = logn1.sub(logn1.mean(axis=1), axis=0)
    return T_clr

def asinh_transform(df, cofactor=5):
    '''
    implements the hyperbolic arcsin transform used in CyTOF/mass cytometry
    https://doi.org/10.1038/nmeth.4380
    '''
    T_cytof = np.arcsinh(df / cofactor)
    return T_cytof

def geometric_transform(df):
    '''
    implements the scanpy transform originating from ivirshup:multimodal
    '''
    from scipy.stats.mstats import gmean
    T_geometric = np.divide(df, gmean(df + 1, axis=0))
    return T_geometric

#optionally, for each of these, similar to some cytof workflows, 
#anchor 1-99% quantiles to 0-1, to rescale distribution within a standardized range
#use quantiles as a simple heuristic, due to extreme signal outliers that throw off the scale
#can also floor/ceil, depending on whether values beyond 0-1 are compatible or meaningful

jfx319 avatar May 29 '20 13:05 jfx319

Here's some initial distribution plots for comparison:

Legend:

  • red lines are 0.5% and 99.5% quantiles, a trick used in some cytof papers to deal with extreme outliers (believed to be technical artifacts); here, I simply use it to move the bulk of the data in visible range for the first two plots; while the values are merely a heuristic, the spirit of it follows nonparametric statistics so is pretty reliable in practice

raw (ADT counts):

image

geometric mean (as used in Issac's notebook)

image

seems to only changes the scale, not the shape, so unless I made an error in implementation... it's probably not useful

simple log(n+1) (as used in RNAseq)

image

can suffer from discretization at low values...

note: even though Seurat/Scanpy/Loupe all use different bases, the log base doesn't really matter; it just changes the scale, not the shape/distinguishing power

hyperbolic arcsin (as used in CyTOF)

image

not as noisy as log at low values, and doesn't assert that zeros have to be Laplace smoothed with a pseudocount of +1

biexponential family (as used in flow cytometry)

image

best smoothing so far in the low counts, because that's what it was designed to do

in this case, it is the newest of this family: vlog(alpha=0, beta=12, xmax=70000, zmax=1)

  • https://doi.org/10.1002/cyto.a.23017
  • https://doi.org/10.1002/cyto.a.22030
  • https://doi.org/10.1002/cyto.a.20258

centered log ratio (as used in CITEseq paper)

image

not only does this have good smoothing, but it differs in that it is injecting an additional aspect beyond just bringing high values into a linear range; specifically, the centering feature seems to impart an assumption about compositional data, giving higher preference to relative ratios, even if the absolute magnitude might be different -- this has the effect of counteracting cell size, but I've observed that it may introduce unexpected changes (not shown here) in the shape of the distribution that is different from all of the other transforms mentioned, so these need to be validated biologically against some "ground truth"

for the time being though, the last few mentioned are all good candidates to include as transform options

jfx319 avatar May 31 '20 06:05 jfx319

I've observed that it may introduce unexpected changes (not shown here) in the shape of the distribution that is different from all of the other transforms mentioned, so these need to be validated biologically against some "ground truth"

Here's some selected examples (skipping the raw and geometric mean for reasons stated earlier) of the additional aspect introduced by CLR, beyond linearization of the signal, which illustrate how one might want to decide on a case by case basis which is biologically true:

Some potential artifacts:

  • discreteness at low values (reflected in the histograms earlier), and a "kink" near there in the contour that doesn't match with a 2D-gaussian
  • skewing of the "absence" of a marker depending on presence of another marker
  • a weird double-positive tail that extends along the diagonal

These types of effects are reminiscent of flow cytometry artifacts. However, without proving which one is ground truth, we don't know for sure which one is true. At least initially, I would think that the CLR plots look more plausible.

image

image

image

I used a neutral word earlier: that CLR "injects" additional changes, but now it seems that may be a positive thing because many of these empirical cases seem believable from a biological standpoint -- a more systematic validation/comparison might conclude that it "corrects" some aspect of the signal acquisition (e.g. combats protein differences simply due to cell size). Again, this is because by design, CLR isn't just a rescaling: it performs cell-specific centering relative to all markers in a relative ratio way, so doesn't preserve a 1-to-1 monotonic mapping as a rescaling function like log, asinh, biexponential/logicle/vlog would. But without having tested it in all cases, it's not clear that it will always be better with this kind of assumption for other types of markers that may have different fundamental characteristics. I would recommend that people plot both ways and decide on a case-by-case basis for each marker.

EDIT: I looked around a bit more in the literature and do think that the absolute count based transforms (i.e. all the ones not the CLRatio based), do seem to represent physical reality more: cell size (as one explanation). For example, the CD4intermediate/CD3up_skewed include classical monocytes (and might be larger than the CD4negative/CD3negative); while the CD16high/CD3down_skewed include a nonclassical monocyte subset (and might be smaller cellsize, one example in mouse Fig2). While CLR makes it easier for biologists to intuitively grasp "negative/positive" cell types for a particular marker without having to worry about subtle shifts in intensity (and therefore would be a better first transformation to more easily interpret data), there will still be utility for a sophisticated algorithm to take advantage of those subtle intensity shifts in building more confidence about a celltype identity, especially since those subtleties seem plausible (not technical artifacts).

TLDR; these transforms are better for different use-cases, so having a variety will be helpful. Probably lean towards CLR as the default, but explictly document why/when it won't be useful for finding absolute signal subtleties.

jfx319 avatar May 31 '20 18:05 jfx319

Thoughts on Seurat's solution to multimodal datasets? Weighted nearest neighbors

I think it's a bit more sophisticated (though similar in spirit) than finding the union of the two graphs (as suggested in your tutorial).

I really like the idea of using leiden's multiplex partition function. It's cool that leiden clustering on a multiplex graph gives really similar results to clustering on the joint graph.

atarashansky avatar Feb 10 '21 20:02 atarashansky

@ivirshup Is the code for all the normalisations (including quantile rescaling) available somewhere?

Hrovatin avatar Feb 11 '21 08:02 Hrovatin

I don't have an opinion on normalisation methods, but it would be great if there are plans to merge this PR to provide initial CITE-seq support?

chris-rands avatar Mar 15 '21 10:03 chris-rands

Very interesting discussion. Can someone help me to interpret the axis of the centered log ratio (as used in CITEseq paper)? I see that each Antibody has a different scale. How to compare them? Thanks

llumdi avatar Jun 17 '21 09:06 llumdi

Has CITE-seq support in scanpy been abandoned? Would you recommend muon instead? https://muon-tutorials.readthedocs.io/en/latest/cite-seq/1-CITE-seq-PBMC-5k.html

chris-rands avatar Jan 21 '22 16:01 chris-rands

Has CITE-seq support in scanpy been abandoned? Would you recommend muon instead? https://muon-tutorials.readthedocs.io/en/latest/cite-seq/1-CITE-seq-PBMC-5k.html

You can still use the CLR transformation function provided in the previous posts to transform the data and perform the rest of the analysis steps using scanpy. Alternatively, you can use the clr transform function from scikit-bio (http://scikit-bio.org/docs/0.4.1/generated/generated/skbio.stats.composition.clr.html). Either way, you will need to convert the adata to pandas dataframe using the to_df() function.

ghar1821 avatar Feb 28 '22 22:02 ghar1821

Has CITE-seq support in scanpy been abandoned? Would you recommend muon instead? https://muon-tutorials.readthedocs.io/en/latest/cite-seq/1-CITE-seq-PBMC-5k.html

Yes. We are now officially supporting muon and recommend the usage of muon.

Zethson avatar Apr 22 '22 20:04 Zethson

Hi! Trying to follow the Muon tutorial, I get stuck on: pt.pp.dsb(mdata, raw=mdata_raw, empty_droplets=droplets)

NameError: name 'droplets' is not defined

Can anyone help?

livyring avatar Aug 03 '22 16:08 livyring

Hi! Trying to follow the Muon tutorial, I get stuck on: pt.pp.dsb(mdata, raw=mdata_raw, empty_droplets=droplets)

NameError: name 'droplets' is not defined

Can anyone help?

@livyring this doesn't belong here. Please ask usage questions here: https://discourse.scverse.org/

Zethson avatar Aug 03 '22 16:08 Zethson

@ivirshup @gtca should we close this PR since we have muon now?

Zethson avatar Jan 03 '23 12:01 Zethson