scATAC-pro icon indicating copy to clipboard operation
scATAC-pro copied to clipboard

Integration Step Fails

Open aseyedia opened this issue 2 years ago • 13 comments

Hi Wenbao,

I'm working with scATAC-pro to integrate 28 samples together using the following script (based on a script that I believe you authored):

#!/bin/bash
#SBATCH --job-name=merge_allPeaks_scATAC        # Job name
#SBATCH --nodes=1                    # Run all processes on a single node
#SBATCH --ntasks=1                   # Run a single task
#SBATCH --cpus-per-task=6            # Number of CPU cores per task
#SBATCH --mem=800Gb                    # Job memory request
#SBATCH --time=12:00:00              # Time limit hrs:min:sec
#SBATCH --output=logs/eo_%x-%A-%a.log    # Standard output and error log

# scATAC-pro 1.5.1 which uses R 4.2.2 -- Doesn't work as a batch script but does interactively... weird...

start_time=$SECONDS

echo "Date              = $(date)"
echo "Time              = $(date +%H:%M:%S)"
echo "Hostname          = $(hostname -s)"
echo "Working Directory = $(pwd)"
echo "SLURM_JOBID       = ${SLURM_JOBID}"
echo "SLURM_JOB_NODELIST= ${SLURM_JOB_NODELIST}"
echo "SLURM_JOB_PARTITION= ${SLURM_JOB_PARTITION}"
echo "Number of Nodes Allocated      = $SLURM_JOB_NUM_NODES"
echo "Number of Tasks Allocated      = $SLURM_NTASKS"
echo "Number of Cores/Task Allocated = $SLURM_CPUS_PER_TASK"
echo "SLURM_ARRAY_TASK_ID =  $SLURM_ARRAY_TASK_ID"

# Load required modules
# module load singularity

WD="/mnt/isilon/cscb/Projects/10x/dingy/ding_scRNA_ATACseq_PhALL/scATAC-pro_singularity"

cd $WD

# remember to change cell_caller cutoff in configure files
echo "Integrating All scATAC-seq Samples:"

# Array of subdirectory paths
SUBDIRS=(
    "day_3_scATACpro/ALL4364_Combo_day_3"
    "day_3_scATACpro/ALL4364_Control_day_3"
    "day_3_scATACpro/ALL4364_Rux_day_3"
    "day_3_scATACpro/ALL4364_Ven_day_3"
    "day_3_scATACpro/JH331_Combo_day_3"
    "day_3_scATACpro/JH331_Control_day_3"
    "day_3_scATACpro/JH331_Rux_day_3"
    "day_3_scATACpro/JH331_Ven_day_3"
    "day_3_scATACpro/NH011_Combo_day_3"
    "day_3_scATACpro/NH011_Control_day_3"
    "day_3_scATACpro/NH011_Das_day_3"
    "day_3_scATACpro/NH011_Ven_day_3"
    "day_28_scATACpro/ALL2128_Combo_day_28"
    "day_28_scATACpro/ALL2128_Control_day_28"
    "day_28_scATACpro/ALL2128_Rux_day_28"
    "day_28_scATACpro/ALL2128_Ven_day_28"
    "day_28_scATACpro/ALL4364_Combo_day_28"
    "day_28_scATACpro/ALL4364_Control_day_28"
    "day_28_scATACpro/ALL4364_Rux_day_28"
    "day_28_scATACpro/ALL4364_Ven_day_28"
    "day_28_scATACpro/JH331_Combo_day_28"
    "day_28_scATACpro/JH331_Control_day_28"
    "day_28_scATACpro/JH331_Rux_day_28"
    "day_28_scATACpro/JH331_Ven_day_28"
    "day_28_scATACpro/NH011_Combo_day_28"
    "day_28_scATACpro/NH011_Control_day_28"
    "day_28_scATACpro/NH011_Rux_day_28"
    "day_28_scATACpro/NH011_Ven_day_28"
)

# Print all elements of SUBDIRS, each on a new line
for subdir in "${SUBDIRS[@]}"; do
    echo "$subdir"
done

# base_name=$(basename "${SUBDIRS[0]}")
# sample_name=$(echo "$base_name" | cut -d'_' -f1)
# time_point=$(echo "$base_name" | rev | cut -d'_' -f1 | rev)
config_file_name="configure/integrated/configure_allPeaks_integrated.txt"
echo "$config_file_name"

# Get a comma-separated list of peak files
peak_files=$(find "${SUBDIRS[@]}/peaks/MACS2/" -type f -name "*_features_BlacklistRemoved.bed" | tr '\n' ',' | sed 's/,$//')

# Use the peak_files variable in the singularity exec command
singularity exec --bind /mnt/isilon/ \
    -H /mnt/isilon/cscb/Projects/10x/dingy/ding_scRNA_ATACseq_PhALL/ \
    --cleanenv /mnt/isilon/cscb/Projects/10x/dingy/ding_scRNA_ATACseq_PhALL/scATAC-pro_singularity/scatac-pro_latest.sif \
    scATAC-pro -s mergePeaks -i \
        "${peak_files},200,0.01" \
        -c $config_file_name \
        -o integrated/all_integrated/ \
        --verbose

# re-construct matrix using combined peak set
for (( i=0; i<${#SUBDIRS[@]}; i++ ));
do
    merged_peaks=$(find "integrated/all_integrated/peaks/" -type f -name "merged_peaks.bed")
    fragments=$(find "${SUBDIRS[$i]}/summary/" -type f -name "*.fragments.tsv.gz")
    barcodes=$(find "${SUBDIRS[$i]}/filtered_matrix/MACS2/FILTER/" -type f -name "barcodes_doubletsRemoved.txt")
    
    singularity exec --bind /mnt/isilon \
    -H /mnt/isilon/cscb/Projects/10x/dingy/ding_scRNA_ATACseq_PhALL/ \
    --cleanenv /mnt/isilon/cscb/Projects/10x/dingy/ding_scRNA_ATACseq_PhALL/scATAC-pro_singularity/scatac-pro_latest.sif \
    scATAC-pro -s reConstMtx -i \
        "${merged_peaks},${fragments},${barcodes}" \
        -c $config_file_name \
        --verbose \
        -o ${SUBDIRS[$i]}
done

#integrate mtx
input_matrices=$(find "${SUBDIRS[@]}/filtered_matrix/MACS2/FILTER/reConstruct_matrix/" -type f -name "matrix.mtx" | tr '\n' ',' | sed 's/,$//')

singularity exec --bind /mnt \
    -H /mnt/isilon/cscb/Projects/10x/dingy/ding_scRNA_ATACseq_PhALL/ \
    --cleanenv /mnt/isilon/cscb/Projects/10x/dingy/ding_scRNA_ATACseq_PhALL/scATAC-pro_singularity/scatac-pro_latest.sif \
    scATAC-pro -s integrate_mtx -i \
    "${input_matrices}" \
    -c $config_file_name \
    --verbose \
    -o integrated/all_integrated/

end_time=$SECONDS

# elapsed time with second resolution
elapsed=$((end_time - start_time))

echo "Job Complete"
echo "Max Memory Used  = $(sstat -j ${SLURM_JOBID} --format=MaxVMSize --noheader --parsable2)"
echo "Date              = $(date)"
echo "Time              = $(date +%H:%M:%S)"
eval "echo Elapsed time: $(date -ud "@$elapsed" +'$((%s/3600/24)) days %H hr %M min %S sec')"

The first two steps (merge peaks and reconstruct matrix) work, but then when the integration step happens, I get the following error(s):

Error in gzfile(file, "rb") : cannot open the connection
Calls: readRDS -> gzfile
In addition: Warning message:
In gzfile(file, "rb") :
  cannot open compressed file '/mnt/isilon/cscb/Projects/10x/dingy/ding_scRNA_ATACseq_PhALL/scATAC-pro_singularity/integrated/all_integrated/integrated/seurat_obj_VFACS.rds', probable reason 'No such file or directory'
Execution halted
/usr/local/bin/scATAC-pro_1.5.1/scripts/integrate_mtx.sh: line 34: /mnt/isilon/cscb/Projects/10x/dingy/ding_scRNA_ATACseq_PhALL/scATAC-pro_singularity/integrated/all_integrated/integrated/VisCello_obj/config.yml: No such file or directory
/usr/local/bin/scATAC-pro_1.5.1/scripts/integrate_mtx.sh: line 35: /mnt/isilon/cscb/Projects/10x/dingy/ding_scRNA_ATACseq_PhALL/scATAC-pro_singularity/integrated/all_integrated/integrated/VisCello_obj/config.yml: No such file or directory
/usr/local/bin/scATAC-pro_1.5.1/scripts/integrate_mtx.sh: line 36: /mnt/isilon/cscb/Projects/10x/dingy/ding_scRNA_ATACseq_PhALL/scATAC-pro_singularity/integrated/all_integrated/integrated/VisCello_obj/config.yml: No such file or directory
/usr/local/bin/scATAC-pro_1.5.1/scripts/integrate_mtx.sh: line 37: /mnt/isilon/cscb/Projects/10x/dingy/ding_scRNA_ATACseq_PhALL/scATAC-pro_singularity/integrated/all_integrated/integrated/VisCello_obj/config.yml: No such file or directory
/usr/local/bin/scATAC-pro_1.5.1/scripts/integrate_mtx.sh: line 38: /mnt/isilon/cscb/Projects/10x/dingy/ding_scRNA_ATACseq_PhALL/scATAC-pro_singularity/integrated/all_integrated/integrated/VisCello_obj/config.yml: No such file or directory
/usr/local/bin/scATAC-pro_1.5.1/scripts/integrate_mtx.sh: line 39: /mnt/isilon/cscb/Projects/10x/dingy/ding_scRNA_ATACseq_PhALL/scATAC-pro_singularity/integrated/all_integrated/integrated/VisCello_obj/config.yml: No such file or directory
make: *** [/usr/local/bin/scATAC-pro_1.5.1/scripts/Makefile:303: integrate_mtx] Error 1
Job Complete
Max Memory Used  = 
Date              = Tue May 30 09:50:20 PM EDT 2023
Time              = 21:50:20
Elapsed time: 0 days 05 hr 17 min 59 sec

No clue why this is happening. I am right now reading in the reconstructed matrix rds files and using those to integrate the samples in R using the Signac workflow, but the resulting UMAP plot appears over-integrated and has a lot of noise. I would really like to get the scATAC-pro-generated integrated Seurat object of the 28 samples to compare and see if VFACs offers better integration performance. Please let me know if you have an idea as to what this problem stems from and how I could fix it. Also please let me know if you have any extra questions.

Thank you, Arta

aseyedia avatar May 31 '23 15:05 aseyedia

I'm not sure if no further information. Have you tried run integrate module instead of breaking it into 3 steps? or may need to increase the gap from 200bp to 500 or 1000 in case of too many peaks were used.

wbaopaul avatar May 31 '23 17:05 wbaopaul

Have you tried run integrate module instead of breaking it into 3 steps?

No, I haven't. I'll give it a shot, might help.

or may need to increase the gap from 200bp to 500 or 1000 in case of too many peaks were used.

I think it's about 220k merged peaks or so, not sure if that's a lot or a little. But I had no problem integrating in R with rLSI and only 400Gb of memory, probably less. It's just that the integration UMAP looked bad.

aseyedia avatar May 31 '23 17:05 aseyedia

220k seems good to me! Thanks.

wbaopaul avatar May 31 '23 17:05 wbaopaul

I tried to use the integrate module instead of breaking it up by the submodules like before:

# Array of subdirectory paths
SUBDIRS=(
    "day_3_scATACpro/ALL4364_Combo_day_3"
    "day_3_scATACpro/ALL4364_Control_day_3"
    "day_3_scATACpro/ALL4364_Rux_day_3"
    "day_3_scATACpro/ALL4364_Ven_day_3"
    "day_3_scATACpro/JH331_Combo_day_3"
    "day_3_scATACpro/JH331_Control_day_3"
    "day_3_scATACpro/JH331_Rux_day_3"
    "day_3_scATACpro/JH331_Ven_day_3"
    "day_3_scATACpro/NH011_Combo_day_3"
    "day_3_scATACpro/NH011_Control_day_3"
    "day_3_scATACpro/NH011_Das_day_3"
    "day_3_scATACpro/NH011_Ven_day_3"
    "day_28_scATACpro/ALL2128_Combo_day_28"
    "day_28_scATACpro/ALL2128_Control_day_28"
    "day_28_scATACpro/ALL2128_Rux_day_28"
    "day_28_scATACpro/ALL2128_Ven_day_28"
    "day_28_scATACpro/ALL4364_Combo_day_28"
    "day_28_scATACpro/ALL4364_Control_day_28"
    "day_28_scATACpro/ALL4364_Rux_day_28"
    "day_28_scATACpro/ALL4364_Ven_day_28"
    "day_28_scATACpro/JH331_Combo_day_28"
    "day_28_scATACpro/JH331_Control_day_28"
    "day_28_scATACpro/JH331_Rux_day_28"
    "day_28_scATACpro/JH331_Ven_day_28"
    "day_28_scATACpro/NH011_Combo_day_28"
    "day_28_scATACpro/NH011_Control_day_28"
    "day_28_scATACpro/NH011_Rux_day_28"
    "day_28_scATACpro/NH011_Ven_day_28"
)

# Print all elements of SUBDIRS, each on a new line
for subdir in "${SUBDIRS[@]}"; do
    echo "$subdir"
done

config_file_name="configure/integrated/configure_allPeaks_integrated.txt"
echo "$config_file_name"

# Get a comma-separated list of peak files
peak_files=$(find "${SUBDIRS[@]}/peaks/MACS2/" -type f -name "*_features_BlacklistRemoved.bed" | tr '\n' ',' | sed 's/,$//')

# the output dir "all_integrated2" is where I'm storing the results of the 'integrate' module instead of calling the submodules
# independently, in case that makes a difference.
singularity exec --bind /mnt/isilon/ \
    -H /mnt/isilon/cscb/Projects/10x/dingy/ding_scRNA_ATACseq_PhALL/ \
    --cleanenv /mnt/isilon/cscb/Projects/10x/dingy/ding_scRNA_ATACseq_PhALL/scATAC-pro_singularity/scatac-pro_latest.sif \
    scATAC-pro -s integrate -i \
        "${peak_files},200,0.01" \
        -c $config_file_name \
        -o integrated/all_integrated2/ \
        --verbose

and it seemed to get farther this time but unfortunately I am running into another issue:

Integrate by VFACS
Performing TF_IDF normalization
Calculating gene variances
0%   10   20   30   40   50   60   70   80   90   100%
[----|----|----|----|----|----|----|----|----|----|
**************************************************|
Calculating feature variances of standardized and clipped values
0%   10   20   30   40   50   60   70   80   90   100%
[----|----|----|----|----|----|----|----|----|----|
**************************************************|
Performing TF_IDF normalization
Centering and scaling data matrix
  |======================================================================| 100%
Performing TF_IDF normalization
Warning: The following arguments are not used: resl
Warning: The following arguments are not used: resl
Performing TF_IDF normalization
Error in PrepDR(object = object, features = features, verbose = verbose) :
  Variable features haven't been set. Run FindVariableFeatures() or provide a vector of feature names.
Calls: run_integration ... RunPCA -> RunPCA.Seurat -> RunPCA -> RunPCA.Assay -> PrepDR
Execution halted
make: *** [/usr/local/bin/scATAC-pro_1.5.1/scripts/Makefile:283: integrate] Error 1
Job Complete
Max Memory Used  =
Date              = Thu Jun  1 02:42:54 PM EDT 2023
Time              = 14:42:54
Elapsed time: 0 days 03 hr 37 min 11 sec

aseyedia avatar Jun 01 '23 19:06 aseyedia

I guess probably you're using an old version of configure_user.txt file. Do you define Top_Variable_Features in you configure file? If not, can you add a line like Top_Variable_Features = 5000 into it?

wbaopaul avatar Jun 02 '23 18:06 wbaopaul

I have this line in the configure_user.txt: Top_Variable_Features = .1 ## number/fraction of variable features used for seurat

aseyedia avatar Jun 02 '23 18:06 aseyedia

Can you increase it to a bigger value like 0.2? Because I had internally filtered some rare peaks (like open in less than 1% of cells) from the variable features, I speculate that all the variable features may be filtered for this particular sample. This should be rare but it could happen. (No sure if it's the case, but I will fix this bug anyware soon!)

wbaopaul avatar Jun 02 '23 18:06 wbaopaul

I'll give it a shot and let you know.

aseyedia avatar Jun 02 '23 20:06 aseyedia

@aseyedia is that the issue? Thanks.

wbaopaul avatar Jun 07 '23 12:06 wbaopaul

I actually haven't had a chance to try it out yet, I'll do it today, thanks for the reminder!

aseyedia avatar Jun 07 '23 13:06 aseyedia

I submitted the batch job just now. One thing I was wondering is if the integrate module calls merge_peaks and reConstMtx as well? I need those submodules for my samples.

aseyedia avatar Jun 07 '23 14:06 aseyedia

yes, integrate is just a wrapper.

wbaopaul avatar Jun 07 '23 14:06 wbaopaul

Here's the configure file:

(base) [seyediana@reslnvhpc071 scATAC-pro_singularity]$ grep "Top" configure/integrated/configure_allPeaks_integrated.txt 
## Top_Variable_Features = 0.1 ## number/fraction of variable features used for seurat
Top_Variable_Features = .2 ## number/fraction of variable features used for seurat

and here is the log file:

Integrate by VFACS
Performing TF_IDF normalization
Calculating gene variances
0%   10   20   30   40   50   60   70   80   90   100%
[----|----|----|----|----|----|----|----|----|----|
**************************************************|
Calculating feature variances of standardized and clipped values
0%   10   20   30   40   50   60   70   80   90   100%
[----|----|----|----|----|----|----|----|----|----|
**************************************************|
Performing TF_IDF normalization
Centering and scaling data matrix
  |======================================================================| 100%
Performing TF_IDF normalization
Warning: The following arguments are not used: resl
Warning: The following arguments are not used: resl
Performing TF_IDF normalization
Error in PrepDR(object = object, features = features, verbose = verbose) : 
  Variable features haven't been set. Run FindVariableFeatures() or provide a vector of feature names.
Calls: run_integration ... RunPCA -> RunPCA.Seurat -> RunPCA -> RunPCA.Assay -> PrepDR
Execution halted

What's strange is that I didn't get this error when I extracted run_integrateSeuObj and ran it interactively. I can do some additional debugging on my end when I have some time.

aseyedia avatar Jun 07 '23 20:06 aseyedia