chromBPNet training questions
Hi,
I am integrating ChromBPNet analysis to refine snATAC-seq data (10x Genomics) derived from skeletal muscle lysate. I have run into the following questions while training ChromBPNet:
- Is there a minimum number of cells or a minimum read depth necessary for training ChromBPNet? Our total number of cells is 6,305 before QC filtering and 4,705 after QC filtering. (A sketch of how I am estimating our pseudobulk read depth from the fragments file is included after this list.)
- Would you recommend training ChromBPNet on only our cell population of interest, or on all nuclei derived from the tissue lysate? My QC-filtered dataset has 3,580 cells of the population of interest. I am inclined to train on the population of interest only, but am unsure whether that sample is large enough for model training and subsequent analysis. (The barcode-filtering step I would use to build that pseudobulk BAM is sketched after this list.)
- How should I handle the peak calling files? I have treated (n = 2 biological replicates) and control (n = 1 biological replicate) samples available, each with an associated peak.bed file. According to Anusri in Issue #117 on GitHub, I should not be using the peak.bed files generated by 10x. Am I correct in understanding that the recommendation is to take the merged.bam file created in the previous step and call peaks manually with MACS2? If so, I would run the following command:
  ```
  macs2 callpeak -t data/downloads/merged.bam -f BAMPE -n "MACS2Peaks" -g "mm" \
      -p 0.01 --shift -75 --extsize 150 --nomodel -B --SPMR --keep-dup all \
      --call-summits --outdir data/downloads/MACS2PeakCallingPE
  ```
  This follows the ENCODE pipeline recommendation, except that I changed the -f input from "BAM" to "BAMPE" since we are working with paired-end data. However, Anshul advised against this change in issue #176. Does this mean I should keep -f set to "BAM" even though the data are paired-end? (The exact command I would run in that case is included after this list, in case it helps clarify my question.)
- Multiple folds: I would like to confirm my understanding of how multiple folds are used. As I understand it, I should create multiple folds (is there a recommended number?) in the splits folder, each containing a different combination of training and validation chromosomes. I would then train a bias model and a ChromBPNet model for each fold separately. Later on, when using the downstream tools, I would average the bigwig or h5 outputs across folds before passing them to a given tool. Please let me know if this sounds right. (A sketch of how I would do that averaging is included below.)
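
For the first question, this is roughly how I am estimating our pseudobulk read depth from the 10x fragments file. The paths and the barcodes.txt file (a plain-text list of QC-passing barcodes) are from my own setup, so this is just a sketch of what I am measuring, not anything ChromBPNet-specific:

```python
import gzip

# Paths from my own setup (not ChromBPNet outputs): fragments.tsv.gz is the
# standard 10x Cell Ranger ATAC fragments file, and barcodes.txt lists my
# QC-passing cell barcodes, one per line.
fragments_path = "data/downloads/fragments.tsv.gz"
barcodes_path = "data/downloads/barcodes.txt"

with open(barcodes_path) as f:
    keep = {line.strip() for line in f if line.strip()}

total_fragments = 0
kept_fragments = 0
with gzip.open(fragments_path, "rt") as f:
    for line in f:
        if line.startswith("#"):
            continue
        # Fragments file columns: chrom, start, end, cell barcode, read support.
        barcode = line.rstrip("\n").split("\t")[3]
        total_fragments += 1
        if barcode in keep:
            kept_fragments += 1

print(f"Total fragments: {total_fragments:,}")
print(f"Fragments in {len(keep):,} QC-passing cells: {kept_fragments:,}")
```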
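For the second question, this is how I would build the population-specific pseudobulk BAM if training on only the population of interest, i.e. filtering merged.bam to that population's barcodes with pysam. The CB tag and the file names are assumptions from my own workflow:

```python
import pysam

# Inputs from my own workflow: merged.bam is the pseudobulk BAM from the
# previous step, and population_barcodes.txt lists the cell barcodes of the
# population of interest (one per line, matching the values in the CB tag).
in_bam_path = "data/downloads/merged.bam"
out_bam_path = "data/downloads/population_of_interest.bam"
barcodes_path = "data/downloads/population_barcodes.txt"

with open(barcodes_path) as f:
    keep = {line.strip() for line in f if line.strip()}

with pysam.AlignmentFile(in_bam_path, "rb") as in_bam, \
     pysam.AlignmentFile(out_bam_path, "wb", template=in_bam) as out_bam:
    for read in in_bam:
        # 10x stores the corrected cell barcode in the CB tag.
        if read.has_tag("CB") and read.get_tag("CB") in keep:
            out_bam.write(read)

# Index the filtered BAM so it can be used downstream.
pysam.index(out_bam_path)
```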
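For the peak-calling question, this is the exact command I would run if the answer is to keep -f as "BAM": identical to the command above except for the -f value and the output directory name (my own choice). If I understand correctly, this treats reads as single-end cut sites and relies on --shift/--extsize for the smoothing, but please correct me if that is not what issue #176 intends:

```python
import subprocess

# Same MACS2 call as above, but with -f "BAM" instead of "BAMPE"; whether this
# is the right setting for paired-end data is exactly what I am asking.
cmd = [
    "macs2", "callpeak",
    "-t", "data/downloads/merged.bam",
    "-f", "BAM",
    "-n", "MACS2Peaks",
    "-g", "mm",
    "-p", "0.01",
    "--shift=-75",   # "=" form so the negative value is not parsed as a flag
    "--extsize=150",
    "--nomodel",
    "-B", "--SPMR",
    "--keep-dup", "all",
    "--call-summits",
    "--outdir", "data/downloads/MACS2PeakCallingSE",  # my own output directory name
]
subprocess.run(cmd, check=True)
```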
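For the folds question, this is how I was planning to average the per-fold bigwig tracks before passing them to downstream tools. The per-fold file names are placeholders for whatever each fold's output ends up being called in my setup, so this is only meant to show the averaging I have in mind:

```python
import numpy as np
import pyBigWig

# Placeholder paths for the per-fold bigwigs; the actual names depend on how
# each fold's outputs are written in my setup.
fold_bigwigs = [f"models/fold_{i}/predictions.bw" for i in range(5)]
out_path = "models/mean_across_folds.bw"
chunk_size = 1_000_000  # process each chromosome in 1 Mb windows

bws = [pyBigWig.open(p) for p in fold_bigwigs]
chroms = bws[0].chroms()  # assumes all folds share the same chromosome sizes

out = pyBigWig.open(out_path, "w")
out.addHeader(list(chroms.items()))

for chrom, length in chroms.items():
    for start in range(0, length, chunk_size):
        end = min(start + chunk_size, length)
        # Stack the per-fold signal for this window and average base-wise;
        # positions with no coverage come back as NaN and are set to zero.
        stack = np.array([
            np.nan_to_num(np.array(bw.values(chrom, start, end), dtype=float))
            for bw in bws
        ])
        mean = stack.mean(axis=0)
        out.addEntries(chrom, start, values=mean.tolist(), span=1, step=1)

out.close()
for bw in bws:
    bw.close()
```

Is averaging at the bigwig level like this the intended approach, or should the averaging happen on the h5 predictions instead?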
Thank you!