Integration Step Fails
Hi Wenbao,
I'm working with scATAC-pro to integrate 28 samples using the following script (based on a script that I believe you authored):
#!/bin/bash
#SBATCH --job-name=merge_allPeaks_scATAC # Job name
#SBATCH --nodes=1 # Run all processes on a single node
#SBATCH --ntasks=1 # Run a single task
#SBATCH --cpus-per-task=6 # Number of CPU cores per task
#SBATCH --mem=800Gb # Job memory request
#SBATCH --time=12:00:00 # Time limit hrs:min:sec
#SBATCH --output=logs/eo_%x-%A-%a.log # Standard output and error log
# scATAC-pro 1.5.1 which uses R 4.2.2 -- Doesn't work as a batch script but does interactively... weird...
start_time=$SECONDS
echo "Date = $(date)"
echo "Time = $(date +%H:%M:%S)"
echo "Hostname = $(hostname -s)"
echo "Working Directory = $(pwd)"
echo "SLURM_JOBID = ${SLURM_JOBID}"
echo "SLURM_JOB_NODELIST= ${SLURM_JOB_NODELIST}"
echo "SLURM_JOB_PARTITION= ${SLURM_JOB_PARTITION}"
echo "Number of Nodes Allocated = $SLURM_JOB_NUM_NODES"
echo "Number of Tasks Allocated = $SLURM_NTASKS"
echo "Number of Cores/Task Allocated = $SLURM_CPUS_PER_TASK"
echo "SLURM_ARRAY_TASK_ID = $SLURM_ARRAY_TASK_ID"
# Load required modules
# module load singularity
WD="/mnt/isilon/cscb/Projects/10x/dingy/ding_scRNA_ATACseq_PhALL/scATAC-pro_singularity"
cd "$WD" || exit 1
# remember to change cell_caller cutoff in configure files
echo "Integrating All scATAC-seq Samples:"
# Array of subdirectory paths
SUBDIRS=(
"day_3_scATACpro/ALL4364_Combo_day_3"
"day_3_scATACpro/ALL4364_Control_day_3"
"day_3_scATACpro/ALL4364_Rux_day_3"
"day_3_scATACpro/ALL4364_Ven_day_3"
"day_3_scATACpro/JH331_Combo_day_3"
"day_3_scATACpro/JH331_Control_day_3"
"day_3_scATACpro/JH331_Rux_day_3"
"day_3_scATACpro/JH331_Ven_day_3"
"day_3_scATACpro/NH011_Combo_day_3"
"day_3_scATACpro/NH011_Control_day_3"
"day_3_scATACpro/NH011_Das_day_3"
"day_3_scATACpro/NH011_Ven_day_3"
"day_28_scATACpro/ALL2128_Combo_day_28"
"day_28_scATACpro/ALL2128_Control_day_28"
"day_28_scATACpro/ALL2128_Rux_day_28"
"day_28_scATACpro/ALL2128_Ven_day_28"
"day_28_scATACpro/ALL4364_Combo_day_28"
"day_28_scATACpro/ALL4364_Control_day_28"
"day_28_scATACpro/ALL4364_Rux_day_28"
"day_28_scATACpro/ALL4364_Ven_day_28"
"day_28_scATACpro/JH331_Combo_day_28"
"day_28_scATACpro/JH331_Control_day_28"
"day_28_scATACpro/JH331_Rux_day_28"
"day_28_scATACpro/JH331_Ven_day_28"
"day_28_scATACpro/NH011_Combo_day_28"
"day_28_scATACpro/NH011_Control_day_28"
"day_28_scATACpro/NH011_Rux_day_28"
"day_28_scATACpro/NH011_Ven_day_28"
)
# Print all elements of SUBDIRS, each on a new line
for subdir in "${SUBDIRS[@]}"; do
echo "$subdir"
done
# base_name=$(basename "${SUBDIRS[0]}")
# sample_name=$(echo "$base_name" | cut -d'_' -f1)
# time_point=$(echo "$base_name" | rev | cut -d'_' -f1 | rev)
config_file_name="configure/integrated/configure_allPeaks_integrated.txt"
echo "$config_file_name"
# Get a comma-separated list of peak files. Note: ${SUBDIRS[@]/%/suffix} appends the
# suffix to *every* element; writing "${SUBDIRS[@]}/peaks/MACS2/" would append it to
# the last element only, leaving find to crawl the other sample directories unanchored.
peak_files=$(find "${SUBDIRS[@]/%//peaks/MACS2/}" -type f -name "*_features_BlacklistRemoved.bed" | tr '\n' ',' | sed 's/,$//')
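# Optional sanity check: the number of peak files found should equal the number of samples.
echo "Found $(echo "$peak_files" | tr ',' '\n' | wc -l) peak files for ${#SUBDIRS[@]} samples"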
# Use the peak_files variable in the singularity exec command
singularity exec --bind /mnt/isilon/ \
-H /mnt/isilon/cscb/Projects/10x/dingy/ding_scRNA_ATACseq_PhALL/ \
--cleanenv /mnt/isilon/cscb/Projects/10x/dingy/ding_scRNA_ATACseq_PhALL/scATAC-pro_singularity/scatac-pro_latest.sif \
scATAC-pro -s mergePeaks -i \
"${peak_files},200,0.01" \
-c "$config_file_name" \
-o integrated/all_integrated/ \
--verbose
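# Optional sanity check: report the merged peak count (mergePeaks is assumed to write
# merged_peaks.bed under peaks/, as the find below expects); an unusually large count
# can inflate memory use in the downstream steps.
find "integrated/all_integrated/peaks/" -type f -name "merged_peaks.bed" -exec wc -l {} \;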
# re-construct matrix using combined peak set
# The merged peak file is the same for every sample, so locate it once up front.
merged_peaks=$(find "integrated/all_integrated/peaks/" -type f -name "merged_peaks.bed")
for (( i=0; i<${#SUBDIRS[@]}; i++ )); do
fragments=$(find "${SUBDIRS[$i]}/summary/" -type f -name "*.fragments.tsv.gz")
barcodes=$(find "${SUBDIRS[$i]}/filtered_matrix/MACS2/FILTER/" -type f -name "barcodes_doubletsRemoved.txt")
singularity exec --bind /mnt/isilon \
-H /mnt/isilon/cscb/Projects/10x/dingy/ding_scRNA_ATACseq_PhALL/ \
--cleanenv /mnt/isilon/cscb/Projects/10x/dingy/ding_scRNA_ATACseq_PhALL/scATAC-pro_singularity/scatac-pro_latest.sif \
scATAC-pro -s reConstMtx -i \
"${merged_peaks},${fragments},${barcodes}" \
-c "$config_file_name" \
--verbose \
-o "${SUBDIRS[$i]}"
done
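# Optional check before integrating: make sure every sample actually produced a
# reconstructed matrix (path matches the reConstMtx output layout used in the next step).
for subdir in "${SUBDIRS[@]}"; do
    mtx="${subdir}/filtered_matrix/MACS2/FILTER/reConstruct_matrix/matrix.mtx"
    [[ -f "$mtx" ]] || echo "WARNING: missing $mtx"
done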
# Integrate the matrices (same per-element suffix expansion as above).
input_matrices=$(find "${SUBDIRS[@]/%//filtered_matrix/MACS2/FILTER/reConstruct_matrix/}" -type f -name "matrix.mtx" | tr '\n' ',' | sed 's/,$//')
singularity exec --bind /mnt/isilon \
-H /mnt/isilon/cscb/Projects/10x/dingy/ding_scRNA_ATACseq_PhALL/ \
--cleanenv /mnt/isilon/cscb/Projects/10x/dingy/ding_scRNA_ATACseq_PhALL/scATAC-pro_singularity/scatac-pro_latest.sif \
scATAC-pro -s integrate_mtx -i \
"${input_matrices}" \
-c "$config_file_name" \
--verbose \
-o integrated/all_integrated/
end_time=$SECONDS
# elapsed time with second resolution
elapsed=$((end_time - start_time))
echo "Job Complete"
echo "Max Memory Used = $(sstat -j ${SLURM_JOBID} --format=MaxVMSize --noheader --parsable2)"
echo "Date = $(date)"
echo "Time = $(date +%H:%M:%S)"
eval "echo Elapsed time: $(date -ud "@$elapsed" +'$((%s/3600/24)) days %H hr %M min %S sec')"
The first two steps (mergePeaks and reConstMtx) work, but when the integration step runs, I get the following error(s):
Error in gzfile(file, "rb") : cannot open the connection
Calls: readRDS -> gzfile
In addition: Warning message:
In gzfile(file, "rb") :
cannot open compressed file '/mnt/isilon/cscb/Projects/10x/dingy/ding_scRNA_ATACseq_PhALL/scATAC-pro_singularity/integrated/all_integrated/integrated/seurat_obj_VFACS.rds', probable reason 'No such file or directory'
Execution halted
/usr/local/bin/scATAC-pro_1.5.1/scripts/integrate_mtx.sh: line 34: /mnt/isilon/cscb/Projects/10x/dingy/ding_scRNA_ATACseq_PhALL/scATAC-pro_singularity/integrated/all_integrated/integrated/VisCello_obj/config.yml: No such file or directory
/usr/local/bin/scATAC-pro_1.5.1/scripts/integrate_mtx.sh: line 35: /mnt/isilon/cscb/Projects/10x/dingy/ding_scRNA_ATACseq_PhALL/scATAC-pro_singularity/integrated/all_integrated/integrated/VisCello_obj/config.yml: No such file or directory
/usr/local/bin/scATAC-pro_1.5.1/scripts/integrate_mtx.sh: line 36: /mnt/isilon/cscb/Projects/10x/dingy/ding_scRNA_ATACseq_PhALL/scATAC-pro_singularity/integrated/all_integrated/integrated/VisCello_obj/config.yml: No such file or directory
/usr/local/bin/scATAC-pro_1.5.1/scripts/integrate_mtx.sh: line 37: /mnt/isilon/cscb/Projects/10x/dingy/ding_scRNA_ATACseq_PhALL/scATAC-pro_singularity/integrated/all_integrated/integrated/VisCello_obj/config.yml: No such file or directory
/usr/local/bin/scATAC-pro_1.5.1/scripts/integrate_mtx.sh: line 38: /mnt/isilon/cscb/Projects/10x/dingy/ding_scRNA_ATACseq_PhALL/scATAC-pro_singularity/integrated/all_integrated/integrated/VisCello_obj/config.yml: No such file or directory
/usr/local/bin/scATAC-pro_1.5.1/scripts/integrate_mtx.sh: line 39: /mnt/isilon/cscb/Projects/10x/dingy/ding_scRNA_ATACseq_PhALL/scATAC-pro_singularity/integrated/all_integrated/integrated/VisCello_obj/config.yml: No such file or directory
make: *** [/usr/local/bin/scATAC-pro_1.5.1/scripts/Makefile:303: integrate_mtx] Error 1
Job Complete
Max Memory Used =
Date = Tue May 30 09:50:20 PM EDT 2023
Time = 21:50:20
Elapsed time: 0 days 05 hr 17 min 59 sec
I have no clue why this is happening. In the meantime, I am reading the reconstructed matrix .rds files into R and integrating the samples with the Signac workflow, but the resulting UMAP plot appears over-integrated and has a lot of noise. I would really like to get the scATAC-pro-generated integrated Seurat object for the 28 samples to compare and see whether VFACS offers better integration performance. Please let me know if you have an idea of where this problem stems from and how I could fix it, and let me know if you have any other questions.
Thank you, Arta
I'm not sure without more information. Have you tried running the integrate module instead of breaking it into three steps? Or you may need to increase the gap from 200 bp to 500 or 1000 in case too many peaks were used.
Have you tried running the integrate module instead of breaking it into three steps?
No, I haven't. I'll give it a shot; it might help.
Or you may need to increase the gap from 200 bp to 500 or 1000 in case too many peaks were used.
I think it's about 220k merged peaks or so; I'm not sure if that's a lot or a little. But I had no problem integrating in R with rLSI and only 400 GB of memory, probably less. It's just that the integration UMAP looked bad.
220k seems good to me! Thanks.
I tried using the integrate module instead of breaking it up into submodules as before:
# Array of subdirectory paths
SUBDIRS=(
"day_3_scATACpro/ALL4364_Combo_day_3"
"day_3_scATACpro/ALL4364_Control_day_3"
"day_3_scATACpro/ALL4364_Rux_day_3"
"day_3_scATACpro/ALL4364_Ven_day_3"
"day_3_scATACpro/JH331_Combo_day_3"
"day_3_scATACpro/JH331_Control_day_3"
"day_3_scATACpro/JH331_Rux_day_3"
"day_3_scATACpro/JH331_Ven_day_3"
"day_3_scATACpro/NH011_Combo_day_3"
"day_3_scATACpro/NH011_Control_day_3"
"day_3_scATACpro/NH011_Das_day_3"
"day_3_scATACpro/NH011_Ven_day_3"
"day_28_scATACpro/ALL2128_Combo_day_28"
"day_28_scATACpro/ALL2128_Control_day_28"
"day_28_scATACpro/ALL2128_Rux_day_28"
"day_28_scATACpro/ALL2128_Ven_day_28"
"day_28_scATACpro/ALL4364_Combo_day_28"
"day_28_scATACpro/ALL4364_Control_day_28"
"day_28_scATACpro/ALL4364_Rux_day_28"
"day_28_scATACpro/ALL4364_Ven_day_28"
"day_28_scATACpro/JH331_Combo_day_28"
"day_28_scATACpro/JH331_Control_day_28"
"day_28_scATACpro/JH331_Rux_day_28"
"day_28_scATACpro/JH331_Ven_day_28"
"day_28_scATACpro/NH011_Combo_day_28"
"day_28_scATACpro/NH011_Control_day_28"
"day_28_scATACpro/NH011_Rux_day_28"
"day_28_scATACpro/NH011_Ven_day_28"
)
# Print all elements of SUBDIRS, each on a new line
for subdir in "${SUBDIRS[@]}"; do
echo "$subdir"
done
config_file_name="configure/integrated/configure_allPeaks_integrated.txt"
echo "$config_file_name"
# Get a comma-separated list of peak files (per-element suffix expansion, as in the first script)
peak_files=$(find "${SUBDIRS[@]/%//peaks/MACS2/}" -type f -name "*_features_BlacklistRemoved.bed" | tr '\n' ',' | sed 's/,$//')
# the output dir "all_integrated2" is where I'm storing the results of the 'integrate' module instead of calling the submodules
# independently, in case that makes a difference.
singularity exec --bind /mnt/isilon/ \
-H /mnt/isilon/cscb/Projects/10x/dingy/ding_scRNA_ATACseq_PhALL/ \
--cleanenv /mnt/isilon/cscb/Projects/10x/dingy/ding_scRNA_ATACseq_PhALL/scATAC-pro_singularity/scatac-pro_latest.sif \
scATAC-pro -s integrate -i \
"${peak_files},200,0.01" \
-c "$config_file_name" \
-o integrated/all_integrated2/ \
--verbose
and it seemed to get farther this time, but unfortunately I'm running into another issue:
Integrate by VFACS
Performing TF_IDF normalization
Calculating gene variances
0% 10 20 30 40 50 60 70 80 90 100%
[----|----|----|----|----|----|----|----|----|----|
**************************************************|
Calculating feature variances of standardized and clipped values
0% 10 20 30 40 50 60 70 80 90 100%
[----|----|----|----|----|----|----|----|----|----|
**************************************************|
Performing TF_IDF normalization
Centering and scaling data matrix
|======================================================================| 100%
Performing TF_IDF normalization
Warning: The following arguments are not used: resl
Warning: The following arguments are not used: resl
Performing TF_IDF normalization
Error in PrepDR(object = object, features = features, verbose = verbose) :
Variable features haven't been set. Run FindVariableFeatures() or provide a vector of feature names.
Calls: run_integration ... RunPCA -> RunPCA.Seurat -> RunPCA -> RunPCA.Assay -> PrepDR
Execution halted
make: *** [/usr/local/bin/scATAC-pro_1.5.1/scripts/Makefile:283: integrate] Error 1
Job Complete
Max Memory Used =
Date = Thu Jun 1 02:42:54 PM EDT 2023
Time = 14:42:54
Elapsed time: 0 days 03 hr 37 min 11 sec
I guess you're probably using an old version of the configure_user.txt file. Do you define Top_Variable_Features in your configure file? If not, can you add a line like Top_Variable_Features = 5000 to it?
I have this line in the configure_user.txt file:
Top_Variable_Features = .1 ## number/fraction of variable features used for seurat
Can you increase it to a bigger value like 0.2? Because I internally filter some rare peaks (e.g., those open in less than 1% of cells) out of the variable features, I speculate that all of the variable features may have been filtered out for this particular sample. This should be rare, but it can happen. (Not sure if that's the case, but I will fix this bug soon anyway!)
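If you want to check whether that could be happening, here is a rough sketch (assuming peaks are rows in the reconstructed MatrixMarket matrix.mtx) that counts how many peaks are open in at least 1% of cells:

# MatrixMarket layout assumed: '%' comment lines, then "nrows ncols nnz", then one "row col value" entry per line.
awk '/^%/ {next}
     !dims {npeaks = $1; ncells = $2; dims = 1; next}
     {cnt[$1]++}
     END {for (p in cnt) if (cnt[p] >= 0.01 * ncells) keep++
          print keep+0 " of " npeaks " peaks are open in >=1% of " ncells " cells"}' matrix.mtx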
I'll give it a shot and let you know.
@aseyedia did that fix the issue? Thanks.
I actually haven't had a chance to try it out yet. I'll do it today; thanks for the reminder!
I submitted the batch job just now. One thing I was wondering: does the integrate module call mergePeaks and reConstMtx as well? I need those submodules for my samples.
Yes, integrate is just a wrapper around those submodules.
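Conceptually it chains the same steps you ran individually (a sketch of the flow, not the literal Makefile recipe):

# integrate ~ mergePeaks -> reConstMtx (per sample) -> integrate_mtx
scATAC-pro -s mergePeaks -i "peaks1.bed,peaks2.bed,...,200,0.01" -c configure.txt -o out/
scATAC-pro -s reConstMtx -i "merged_peaks.bed,fragments.tsv.gz,barcodes.txt" -c configure.txt -o sample_dir/
scATAC-pro -s integrate_mtx -i "mtx1,mtx2,..." -c configure.txt -o out/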
Here's the configure file:
(base) [seyediana@reslnvhpc071 scATAC-pro_singularity]$ grep "Top" configure/integrated/configure_allPeaks_integrated.txt
## Top_Variable_Features = 0.1 ## number/fraction of variable features used for seurat
Top_Variable_Features = .2 ## number/fraction of variable features used for seurat
and here is the log file:
Integrate by VFACS
Performing TF_IDF normalization
Calculating gene variances
0% 10 20 30 40 50 60 70 80 90 100%
[----|----|----|----|----|----|----|----|----|----|
**************************************************|
Calculating feature variances of standardized and clipped values
0% 10 20 30 40 50 60 70 80 90 100%
[----|----|----|----|----|----|----|----|----|----|
**************************************************|
Performing TF_IDF normalization
Centering and scaling data matrix
|======================================================================| 100%
Performing TF_IDF normalization
Warning: The following arguments are not used: resl
Warning: The following arguments are not used: resl
Performing TF_IDF normalization
Error in PrepDR(object = object, features = features, verbose = verbose) :
Variable features haven't been set. Run FindVariableFeatures() or provide a vector of feature names.
Calls: run_integration ... RunPCA -> RunPCA.Seurat -> RunPCA -> RunPCA.Assay -> PrepDR
Execution halted
What's strange is that I didn't get this error when I extracted run_integrateSeuObj and ran it interactively. I can do some additional debugging on my end when I have some time.