scHLApers
scHLApers copied to clipboard
Code to run the scHLApers pipeline for personalized single-cell HLA quantification
scHLApers
Code to run the scHLApers pipeline for quantifying single-cell HLA expression using personalized reference genomes (Kang et al., Nat Genetics 2023).

Requirements
R program requires (listed version or higher):
- R=4.0.5
- Biostrings=2.58.0
- purrr=0.3.4
- readr=2.1.2
- stringi=1.7.8
- stringr=1.4.0
- tidyverse=1.3.1
- rtracklayer=1.50.0
Other software:
- STAR=2.7.10a https://github.com/alexdobin/STAR
- samtools=1.4.1 http://www.htslib.org/download/
Data:
- Reference genome (e.g. GRCh38.primary_assembly.genome.fa): available here
- Gene annotation file (e.g. gencode.v38.annotation.gtf): available here
- Cell barcode whitelist: more info here
Pipeline and example data
Each step has its own directory with necessary scripts and a tutorial walking through the steps. The example_data and example_output directories contain example input and output files for 2 samples. The raw scRNA-seq data for the example was obtained from Yazar et al. Science 2022 study, publicly available on GEO (GSE196830).
Input
The inputs to scHLApers are:
- Raw scRNA-seq data (either FASTQ or BAM format)
- HLA allele calls (in CSV format, labeled as "SampleX_alleles.csv", see
example_data/inputs/allelesfor format)
See the HLA analyses tutorial from Sakaue et al. for protocol for imputing HLA alleles from genotype array data.
Step 1: Prepare HLA allelic sequence database
We provide a pre-prepared database generated from IPD-IMGT/HLA version 3.47 that can be directly used in Step 2. Alternatively, you can prepare your own database using the latest IPD-IMGT/HLA verison following the tutorial.
Step 2: Make personalized reference and annotation files
The tutorial demonstrates how to generate personalized contigs (FASTA) and annotations (GTF) files (that will be combined with the masked reference) and how to mask the reference.
Step 3: Quantify single-cell expression with STARsolo
Example scripts for how to run STARsolo for read alignment and expression quantification in single-cell data. Script will need to be modified based on the specifics of your dataset (e.g. UMI length, input format, barcode whitelist path, STAR executable). Please see the STAR manual for all options.
Outputs
The output of scHLApers is a genes by cells expression matrix, with improved classical HLA expression estimates. In the example output, we have filtered the raw STARsolo counts matrix (to remove empty droplets) using a provided list of cell barcodes (see example_data/cell_meta_example.csv).
The raw counts matrix output by the pipeline for example Sample_1006_1007 can be found here:
../example_outputs/STARsolo_results/Sample_1006_1007_scHLApers/Sample_1006_1007_scHLApers_Solo.out/GeneFull_Ex50pAS/raw/UniqueAndMult-EM.mtx
A filtered version is located here (read into R using readRDS):
../example_outputs/STARsolo_results/Sample_1006_1007_scHLApers/exp_EM.rds
Note: The classical HLA genes are named IMGT_A, IMGT_C, IMGT_B, IMGT_DRB1, IMGT_DQA1, IMGT_DQB1, IMGT_DPA1, IMGT_DPB1.
Support
For questions and assistance not answered in tutorials, you can contact Joyce Kang (joyce_kang AT hms.harvard DOT edu).
Reproducing results from the manuscript
Code to reproduce the figures and analyses from Kang et al. will become available at https://github.com/immunogenomics/hla2023.