chromap
chromap copied to clipboard
Output peak by cell count matrix or mtx file
Hello,
The fragment file returned by chromap is very useful for scATAC-seq, but sometimes, we need the peak by cell count matrix (or .mtx) for further analysis. Is there a way of getting this matrix output? In the paper, you compare the peaks generated by chromap, bowtie2 and bwa, so I assume there is a way of getting the peak and peak by cell matrix.
Currently, I'm doing this in a awkward way by converting the fragment file back to 50-bp reads:
awk 'BEGIN{OFS="\t"}{print $1, $2, $2+50, $4, ".", "+" "\n" $1, $3-50, $3, $4, ".", "-"}' fragments.tsv | \
bedClip stdin genome.sizes stdout | \
sort -k1,1 -k2,2n | \
gzip > reads.bed.gz
Then, I feed the reads.bed.gz file to MACS2 for peak calling with --shift -100 --extsize 200, and use coverageBed to count reads over peaks for single cells.
What's your thoughts on this? Doing in this way is still WAYYYY faster than cellranger-atac, but it would be great that chromap can directly output a peak by cell matrix or the like. I'm sure many people are interested in this feature.
Thank you!!!!
Regards, Xi
In the paper, for scATAC-seq, we used MAESTRO to analyze the data. It will call peaks and generate peak by cell matrix and etc.
For your case, I would like to suggest two possible ways.
- Use
-f BEDPEwhen you call peaks with MACS. The fragment file generated by either Chromap or CellRanger should work with this option. But I bet MACS might then not take--shiftor--extsizein this mode and I forgot how MACS would model the cut sites in this case. You may consult MACS people on this problem. - You may use
--TagAlignwith Chromap. It will generate mappings in TagAlign, which seems to be the "BED" file you are trying to get with the command. Basically, each read in the pair will have a row in the output with + or - strand. So you can specify the modeling parameters for MACS. However, barcodes are not in the output for this format, which means it only works for bulk data. And for single-cell data, you have to again generate the fragment file at some time point, which means this way might not be better than generating the fragment file first and then converting it using the command you specified.
Though TagAlign is designed for bulk data (actually for ChIP-seq mapping long time ago in ENCODE project), we are thinking of having a single cell version of TagAlign, i.e., adding a column for barcodes when barcodes are available. Let us know if you think this would be helpful and we may add it in the future release.
Ah, I see. I will have a look at MAESTRO to see how it is done.
Yes, I personally think adding an extra column with cell barcodes to the TagAlign file would be great, which makes getting the peak by cell matrix easier and more flexible if users want to have a specific way of calling peaks and generating the matrix.
Thank you !!
Hi @haowenz - I'm looking at using chromap in a similar fashion to the above. I have my chromap alignments for 4 samples, and I called peaks on each chromap fragments file with
macs2 callpeak -f BEDPE -g hs --nomodel --shift 100 --extsize 200 -t sample1_fragments.bed ...
macs2 callpeak -f BEDPE -g hs --nomodel --shift 100 --extsize 200 -t sample2_fragments.bed ...
...
What I'd like to do is create a union set of the 4 sets of peaks (ie cat -> bedtools sort -> bedtools merge) and then bring the data into Seurat/Signac.
Is it inappropriate to use Signac's FeatureMatrix supplying chromap's fragments files and this new union peak set to get a cells x peaks matrix like the above user was looking for? Or is there something I'm missing here?
There are some alternative methods/papers like scATAK that utilize an additional step to re-quantify the fragments under the new peaks from the raw FASTQs, and I'm wondering what the benefits are (if any) of including that step.
Thanks!
Is it inappropriate to use Signac's FeatureMatrix supplying chromap's fragments files and this new union peak set to get a cells x peaks matrix like the above user was looking for? Or is there something I'm missing here?
It seems that you are using paired-end mode (-f BEDPE) in MACS. So it should work with Chromap default mapping output for scATAC-seq reads. Since you are using paired-end mode, the fragment length is known in the fragment file and used by MACS, and parameters like --shift 100 --extsize 200 might not be used when running MACS. I am not sure about this, you should consult MACS people on this.
There are some alternative methods/papers like scATAK that utilize an additional step to re-quantify the fragments under the new peaks from the raw FASTQs, and I'm wondering what the benefits are (if any) of including that step.
scATAK was just recently posted on BioRxiv. I never used it and I am not familiar with it. You may look at their paper or ask them to get an answer for this problem.
Thanks @haowenz. For others landing here, the MACS group seems to recommend the following for peak-calling from the chromap fragments file, since they also are the authors of MAESTRO:
macs2 callpeak -f BEDPE -g hs --nomodel --extsize=50 --keep-dup all ...
MAESTRO then computes a cells x peaks matrix starting from a simple bedtools intersect call, but it seems one could use Signac::FeatureMatrix or plyranges or other means of counting chromap fragments that overlap peaks
Hi, just happened to see this question. In my current workflow, I have totally switched to Chromap instead of cellranger :) For the peak calling issue, I found that in most cases, at least for ATAC-seq data, the default setting of macs2 does a pretty decent job (better than the output from cellranger), import the fragment file , call peaks with macs2 and then generate the feature matrix. Please see the signac tutorials for reference: call peaks and feature matrix Hope it helps!
Hi, following up #7 (I know it was opened more than 1y ago, I had no time to go on with chromap…), I've noticed that a lot has been changed and feature matrix is still hidden. Uncommenting like this
diff --git a/src/chromap_driver.cc b/src/chromap_driver.cc
index 31f9774..85b9cee 100644
--- a/src/chromap_driver.cc
+++ b/src/chromap_driver.cc
@@ -74,16 +74,16 @@ void ChromapDriver::ParseArgsAndRun(int argc, char *argv[]) {
cxxopts::value<double>(),
"FLT")("t,num-threads", "# threads for mapping [1]",
cxxopts::value<int>(), "INT");
- // options.add_options("Peak")
- // ("cell-by-bin", "Generate cell-by-bin matrix")
- // ("bin-size", "Bin size to generate cell-by-bin matrix [5000]",
- // cxxopts::value<int>(), "INT")
- // ("depth-cutoff", "Depth cutoff for peak calling [3]",
- // cxxopts::value<int>(), "INT")
- // ("peak-min-length", "Min length of peaks to report [30]",
- // cxxopts::value<int>(), "INT")
- // ("peak-merge-max-length", "Peaks within this length will be merged [30]",
- // cxxopts::value<int>(), "INT");
+ options.add_options("Peak")
+ ("cell-by-bin", "Generate cell-by-bin matrix")
+ ("bin-size", "Bin size to generate cell-by-bin matrix [5000]",
+ cxxopts::value<int>(), "INT")
+ ("depth-cutoff", "Depth cutoff for peak calling [3]",
+ cxxopts::value<int>(), "INT")
+ ("peak-min-length", "Min length of peaks to report [30]",
+ cxxopts::value<int>(), "INT")
+ ("peak-merge-max-length", "Peaks within this length will be merged [30]",
+ cxxopts::value<int>(), "INT");
options.add_options("Input")("r,ref", "Reference file",
cxxopts::value<std::string>(), "FILE")(
"x,index", "Index file", cxxopts::value<std::string>(), "FILE")(
@@ -101,8 +101,8 @@ void ChromapDriver::ParseArgsAndRun(int argc, char *argv[]) {
cxxopts::value<std::string>(), "STR");
options.add_options("Output")("o,output", "Output file",
cxxopts::value<std::string>(), "FILE")
- //("p,matrix-output-prefix", "Prefix of matrix output files",
- // cxxopts::value<std::string>(), "FILE")
+ ("p,matrix-output-prefix", "Prefix of matrix output files",
+ cxxopts::value<std::string>(), "FILE")
("output-mappings-not-in-whitelist",
"Output mappings with barcode not in the whitelist")(
"chr-order",
is not sufficient. I've uncommented also the specific part in chromap.h, trying to expose the --cell-by-bin functionality. Unfortunately it doesn't compile, as many variables are now undeclared. I've modified all those I believe are now in mapping_parameters_
diff --git a/src/chromap.h b/src/chromap.h
index b80d006..1af0a12 100644
--- a/src/chromap.h
+++ b/src/chromap.h
@@ -1074,23 +1074,23 @@ void Chromap::MapPairedEndReads() {
// Temporarily disable feature matrix output. Do not delete the following
// commented code.
- // if (!is_bulk_data_ && !matrix_output_prefix_.empty()) {
- // if constexpr (std::is_same<MappingRecord,
- // PairedEndMappingWithBarcode>::value) {
- // FeatureBarcodeMatrix feature_barcode_matrix(
- // cell_by_bin_, bin_size_, multi_mapping_allocation_distance_,
- // depth_cutoff_to_call_peak_);
- // std::vector<std::vector<PairedEndMappingWithBarcode>> &mappings =
- // allocate_multi_mappings_
- // ? allocated_mappings_on_diff_ref_seqs
- // : (remove_pcr_duplicates_ ? deduped_mappings_on_diff_ref_seqs
- // : mappings_on_diff_ref_seqs);
-
- // feature_barcode_matrix.OutputFeatureMatrix(num_reference_sequences,
- // reference, mappings,
- // matrix_output_prefix_);
- // }
- //}
+ if (!mapping_parameters_.is_bulk_data && !mapping_parameters_.matrix_output_prefix.emp
+ if constexpr (std::is_same<MappingRecord,
+ PairedEndMappingWithBarcode>::value) {
+ FeatureBarcodeMatrix feature_barcode_matrix(
+ mapping_parameters_.cell_by_bin, mapping_parameters_.bin_size, mapping_paramete
+ mapping_parameters_.depth_cutoff_to_call_peak);
+ std::vector<std::vector<PairedEndMappingWithBarcode>> &mappings =
+ mapping_parameters_.allocate_multi_mappings
+ ? allocated_mappings_on_diff_ref_seqs
+ : (mapping_parameters_.remove_pcr_duplicates ? deduped_mappings_on_diff_ref
+ : mappings_on_diff_ref_seqs);
+
+ feature_barcode_matrix.OutputFeatureMatrix(num_reference_sequences,
+ reference, mappings,
+ mapping_parameters_.matrix_output_prefix
+ }
+ }
}
reference.FinalizeLoading();
Again, it does not compile as
g++ -std=c++17 -Wall -O3 -fopenmp -msse4.1 -c src/chromap.cc -o objs/chromap.o
In file included from src/chromap.cc:1:
src/chromap.h: In member function 'void chromap::Chromap::MapPairedEndReads()':
src/chromap.h:1085:19: error: 'allocated_mappings_on_diff_ref_seqs' was not declared in this scope; did you mean 'mappings_on_diff_ref_seqs'?
1085 | ? allocated_mappings_on_diff_ref_seqs
| ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
| mappings_on_diff_ref_seqs
src/chromap.h:1086:64: error: 'deduped_mappings_on_diff_ref_seqs' was not declared in this scope; did you mean 'mappings_on_diff_ref_seqs'?
1086 | : (mapping_parameters_.remove_pcr_duplicates ? deduped_mappings_on_diff_ref_seqs
| ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
| mappings_on_diff_ref_seqs
make: *** [Makefile.my:27: objs/chromap.o] Error 1
allocated_mappings_on_diff_ref_seqs and deduped_mappings_on_diff_ref_seqs do not exist anymore and I can't easily find how they are called (and treated) in recent version. Any hint?
We currently do not have a concrete plan on this. Because the users would just use their own way to filter the fragments and compute some feature matrix in most cases. And this is usually an interactive process, the users would just try a couple of thresholds and plot the data to see if the filtering makes sense. That's why we commented out this code and do not plan to add them in the near future.
That's too bad. I guess we may try to work on a fork.
Assuming feature_barcode_matrix.OutputFeatureMatrix worked (at least at the very beginning, when it was exposed) and returned a condensed matrix over bins, assuming most of the variables can be (correctly) moved to mapping_parameters_, what would be the equivalent of this single instruction
? allocated_mappings_on_diff_ref_seqs
+ : (mapping_parameters_.remove_pcr_duplicates ? deduped_mappings_on_diff_ref
+ : mappings_on_diff_ref_seqs);
in current code?
I think you can just use mappings_on_diff_ref_seqs as input for feature_barcode_matrix.OutputFeatureMatrix. But this is not supported for low memory mode.
That's too bad. I guess we may try to work on a fork. Assuming
feature_barcode_matrix.OutputFeatureMatrixworked (at least at the very beginning, when it was exposed) and returned a condensed matrix over bins, assuming most of the variables can be (correctly) moved tomapping_parameters_, what would be the equivalent of this single instruction? allocated_mappings_on_diff_ref_seqs + : (mapping_parameters_.remove_pcr_duplicates ? deduped_mappings_on_diff_ref + : mappings_on_diff_ref_seqs);in current code?
Thanks. On a second thought I agree with you that having results in fragments is much more flexible and it’s better to have those in place of a feature matrix. Thanks