Create signature matrix with average_clusters() using bulkRNAseq data

Open saphir746 opened this issue 1 year ago • 1 comments

Hello,

I am trying to create a cell-type signature matrix from bulk RNA-seq of FACS-sorted samples, each containing a single cell type:

expr_matrix %>% head()
  Sample_1 Sample_2 Sample_3 Sample_4
TSPAN6 0.6621047 0.6621047 8.4720554 0.6621047
TNMD 0.6621047 0.6621047 2.771366 6.9039605
DPM1 9.2066392 8.8886292 0.6621047 10.17191
SCYL3 5.5968998 3.0201094 9.9043603 8.514964
C1orf112 3.6115171 7.806794 5.371021 4.5565736
design_matrix %>% head()
  Sample_name Sex Cell_type
Sample_1 F Neutrophils_BoneMarrow
Sample_2 F MCs_BoneMarrow
Sample_3 M Neutrophils_BoneMarrow
Sample_4 M MCs_BoneMarrow

Columns are samples (from different patients) and rows are GRCh38-annotated gene names. expr_matrix is derived from raw counts, after alignment to GRCh38 with the standard nf-core/rnaseq pipeline, and then normalised with varianceStabilizingTransformation() from DESeq2. I understand that in clustifyr the best approach would be to use average_clusters() (?)

new_ref_matrix <- clustifyr::average_clusters(
  mat = expr_matrix,
  metadata = design_matrix$Cell_type,
  cluster_col = "Sample_name",
  method = 'median',
  cut_n = TRUE
)

But I am unsure whether:

  1. My pre-processing of the bulk RNA-seq count data is right.
  2. I'm calling average_clusters() correctly.
  3. I need to perform some extra scaling / normalisation on the resulting reference matrix, new_ref_matrix.
  4. I can then integrate new_ref_matrix with other reference signature matrices derived from single-cell data.

Any advice, insights, or pointers to helpful material would be very much appreciated.

Thanks all

saphir746, Mar 13 '23 11:03

Thanks for your interest in the package:

  1. In general we recommend using similar normalization approaches for the reference data and the scRNA-seq dataset, so I would recommend just using log-transformed normalized counts from DESeq2. I haven't tried transformed counts (from varianceStabilizingTransformation() or rlog()), but I suspect you would see results similar to those from log-transformed normalized counts.

  2. I wouldn't recommend using the cut_n parameter: it is a crude method for excluding low-abundance genes and in most cases isn't necessary. cluster_col = "Sample_name" is also unnecessary if you pass a vector to metadata.

  3. No additional scaling should be necessary.

  4. Yes, you can combine multiple references into the same matrix (making sure that the genes are compatible), or run clustifyr independently for each different reference.
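
A minimal sketch of point 2, assuming metadata is passed as a plain character vector with one label per column of the matrix, in column order. The toy values, the shuffled design_matrix, and the helper names idx and cell_types are illustrative assumptions, not part of the original question. The call itself is guarded so the snippet still runs where clustifyr is not installed, and the defaults for the remaining arguments are left untouched (see ?average_clusters for how they treat log-scale input).

```r
# Toy stand-ins for the objects in the question (values are made up).
set.seed(1)
expr_matrix <- matrix(
  runif(20, min = 0, max = 10),
  nrow = 5,
  dimnames = list(
    c("TSPAN6", "TNMD", "DPM1", "SCYL3", "C1orf112"),
    paste0("Sample_", 1:4)
  )
)
design_matrix <- data.frame(
  Sample_name = paste0("Sample_", c(3, 1, 4, 2)),  # deliberately shuffled
  Cell_type = c("Neutrophils_BoneMarrow", "Neutrophils_BoneMarrow",
                "MCs_BoneMarrow", "MCs_BoneMarrow"),
  stringsAsFactors = FALSE
)

# Reorder the labels so that element i describes column i of expr_matrix.
idx <- match(colnames(expr_matrix), design_matrix$Sample_name)
stopifnot(!anyNA(idx))  # every sample column needs a label
cell_types <- as.character(design_matrix$Cell_type[idx])

# metadata is a vector here, so cluster_col is not needed; cut_n is left unset.
if (requireNamespace("clustifyr", quietly = TRUE)) {
  new_ref_matrix <- clustifyr::average_clusters(
    mat = expr_matrix,
    metadata = cell_types,
    method = "median"
  )
}
```

The match() step matters because a vector carries no sample names: if the rows of design_matrix are in a different order from the columns of expr_matrix, samples would silently be grouped under the wrong cell type.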

kriemo avatar Mar 15 '23 22:03 kriemo
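
For point 4 of the reply above, one way to combine references "making sure that the genes are compatible" is to keep only the genes shared by both matrices before binding the columns. The sketch below uses base R only; combine_refs, bulk_ref, sc_ref, and all values are hypothetical names and toy data, and it assumes both references use the same gene identifiers and comparable normalization.

```r
# Hypothetical toy references (genes x cell types); values are made up.
bulk_ref <- matrix(
  c(1, 2, 3, 4, 5, 6), nrow = 3,
  dimnames = list(c("TSPAN6", "DPM1", "SCYL3"),
                  c("Neutrophils_BoneMarrow", "MCs_BoneMarrow"))
)
sc_ref <- matrix(
  c(7, 8, 9, 10, 11, 12), nrow = 3,
  dimnames = list(c("DPM1", "SCYL3", "TNMD"),
                  c("T_cells", "B_cells"))
)

# Keep the genes present in both references, then bind the columns.
combine_refs <- function(ref_a, ref_b) {
  shared <- intersect(rownames(ref_a), rownames(ref_b))
  stopifnot(
    length(shared) > 0,                          # some overlap is required
    !any(colnames(ref_a) %in% colnames(ref_b))   # avoid duplicate labels
  )
  cbind(ref_a[shared, , drop = FALSE], ref_b[shared, , drop = FALSE])
}

combined_ref <- combine_refs(bulk_ref, sc_ref)
```

Intersecting drops genes that are missing from either reference, so it is worth checking how many genes survive before using the combined matrix.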