
C++ preprocess (m6A) creates a large number of temporary files

Open olliecheng opened this issue 5 months ago • 5 comments

Hi all,

Not sure if this is a duplicate of #38.

I've been trying to get CHEUI up and running, but have hit an issue with the preprocessing. Using the C++ preprocessing script (compiled with GCC v11.3 on RHEL 9.4, commit 7b422f7808a3c2ffff56a9ead33a199824753b4e), I ran it on an 858 GB nanopolish eventalign file, generated from a ~2 GB .fastq of reads.

In this instance, I noticed that the preprocessing script was producing an absurd number of temporary files: over 7 million before the HPC file quota ran out and the script was killed. (This was much to the chagrin of my university's HPC admin, and I promptly received a very strongly worded email advising me not to generate so many temporary files! 😅)

I've attached a small selection of ~1000 of these temporary files for debugging purposes, if it interests you. Each file seems to be very small (a few lines at most), though I only checked a sample of n = 10.
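For anyone wanting to reproduce the check on their own run, here is a minimal sketch of how the file count and per-file line counts could be gathered without listing millions of entries into memory. It assumes the temporary files all land in a single directory; the function names, the `scan_limit` cap, and the directory path in the usage note are hypothetical and not part of CHEUI.

```python
import os
import random
from itertools import islice


def count_entries(directory):
    """Count regular files by streaming the directory, not building a list."""
    with os.scandir(directory) as it:
        return sum(1 for entry in it if entry.is_file())


def sample_line_counts(directory, k=10, scan_limit=10000, seed=0):
    """Return line counts for up to k files drawn from the first scan_limit entries.

    Only a bounded prefix of the directory is scanned, so this stays cheap
    even when the directory holds millions of files. The sample is therefore
    not uniform over the whole directory.
    """
    with os.scandir(directory) as it:
        paths = [e.path for e in islice(it, scan_limit) if e.is_file()]
    rng = random.Random(seed)
    chosen = rng.sample(paths, min(k, len(paths)))
    counts = []
    for path in chosen:
        # Read in binary mode so odd encodings cannot raise decode errors.
        with open(path, "rb") as fh:
            counts.append(sum(1 for _ in fh))
    return counts
```

Calling `count_entries("/path/to/tmp_dir")` and `sample_line_counts("/path/to/tmp_dir")` (with the placeholder path swapped for wherever the temporary files are written) gives a total and a rough idea of how small each file is.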

The eventalign file was generated using:

nanopolish eventalign -t {threads} \
    --reads {input.reads} \
    --bam {input.bam} \
    --genome {REF} \
    --scale-events --signal-index --samples --print-read-names > {output}

and I was calling preprocess using:

# must first be in this path, or else the program crashes
cd $PATH_TO_CHEUI_PREPROCESS_DIR

./CHEUI -i $INPUT -m ../../kmer_models/model_kmer.csv -n {threads} --m6A -o $OUTPUT

See the attached sample of temporary files below: out_A_signals+IDs.zip

Let me know if there's anything else that you need.

Ollie

olliecheng, Sep 29 '24 03:09