
Using $TMPDIR for fast local sorting using ovStoreSorter

Open mmokrejs opened this issue 3 years ago • 5 comments

Hi, would it be possible to implement what you proposed in https://github.com/marbl/canu/issues/705#issuecomment-344597980 in point nr. 2?

My problem is that on a 2 TB RAM machine, canu-2.1.1/build/bin/ovStoreSorter was run once per CPU core, but the jobs were just competing for network bandwidth and the CPUs were mostly waiting for data (reported as state D by the htop utility). I lost about six days redoing the same task, and it never really finished.

Could canu respect, say, $TMPDIR or $SCRATCH or $SCRATCHDIR: copy the input file(s) to local storage, do the in-memory sorting, write out the result, and move the result files back to my NFSv4 drives? Looking through the many open issues at the moment, this step seems to be a common source of problems.
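The requested stage-in/compute/stage-out pattern could be sketched roughly as below. This is a toy illustration only, not Canu's actual behaviour; all paths are demo stand-ins created on the fly, and `sort -n` stands in for the real in-memory overlap sort.

```shell
# Hypothetical sketch of the requested staging: copy input to node-local
# scratch, do the work locally, move only the result back to "NFS".
set -eu
nfs=$(mktemp -d)                         # stands in for the NFSv4 store
scratch=$(mktemp -d)                     # stands in for $TMPDIR
printf '3\n1\n2\n' > "$nfs/overlaps.in"  # toy input file

cp "$nfs/overlaps.in" "$scratch/"                            # stage in
sort -n "$scratch/overlaps.in" > "$scratch/overlaps.sorted"  # local work
mv "$scratch/overlaps.sorted" "$nfs/"                        # stage out
```

The point of the pattern is that the slow network filesystem sees only two linear transfers (one read, one write) instead of random-access traffic during the sort itself.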

I thought I could make ovStoreSorter (and also ovStoreBucketizer) use all the local RAM and decrease the number of concurrently running jobs on the same host, but editing my_genome/correction/my_genome.ovlStore.BUILDING/scripts/2-sort.sh did not help. One can alter the memory but not the number of concurrently running jobs. It seems I would need to edit the my_genome.ovlStore.config binary file? I tried ovsMemory=1800 on the command line, but it does not seem to take effect if the config already exists. Maybe it would not work anyway, since the number of buckets would change and I would have to redo ovlStore.BUILDING? I use useGrid=false, btw. I think sorting should always use the maximum amount of local memory available, even at the cost of using fewer CPUs.

Thank you,

mmokrejs avatar Mar 03 '21 20:03 mmokrejs

There is infrastructure in canu to copy bits to scratch if specified and save results back, but it's not used for this step. I'm not sure copying to a scratch space would help: this step is a linear scan over a bunch of files which are fully loaded into memory, then sorted and written in one go. I'd expect that if the NFS can't handle a linear read, it would also struggle with the copy contention.

You're right that changing memory would change the partitions and thus can't be done in the middle of a run; you'd have to go back to the start of the ovlStore build. You can limit the concurrency with ovsConcurrency without changing the partition sizes; it will just run fewer concurrent jobs. You could also copy the entire ovlStore.BUILDING folder to your tmp space, symlink the files back, and run Canu as normal. It will then only output the results to the NFS space. If that makes a significant difference in time/concurrency (including the copy time), we can enable scratch staging for this step in the future too.
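The copy-and-symlink workaround could look like the following. This is a self-contained sketch with demo stand-in paths (created via mktemp), not Canu's real directory layout:

```shell
# Sketch of the symlink workaround: stage the partial store to local
# scratch, symlink it back into the assembly directory, re-run Canu.
set -eu
asm=$(mktemp -d)        # stands in for the NFS assembly directory
scratch=$(mktemp -d)    # stands in for node-local $TMPDIR
mkdir "$asm/my_genome.ovlStore.BUILDING"
echo bucket > "$asm/my_genome.ovlStore.BUILDING/0001"   # toy content

# 1. copy the partial store to fast local scratch
cp -a "$asm/my_genome.ovlStore.BUILDING" "$scratch/"
# 2. move the NFS copy aside, then symlink the scratch copy into place
mv "$asm/my_genome.ovlStore.BUILDING" "$asm/my_genome.ovlStore.BUILDING.nfs"
ln -s "$scratch/my_genome.ovlStore.BUILDING" "$asm/my_genome.ovlStore.BUILDING"
# Canu would then be re-run as usual; sorting I/O hits local scratch,
# and only the finished ovlStore lands back on NFS.
```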

skoren avatar Mar 03 '21 20:03 skoren

Hi Sergey, I tried copying the seqStore to $TMPDIR and symlinking to it, but 20 TB of data no longer fits on the local SSD drives (15 TB total), not to mention the files to be created there under ovlStore.BUILDING, which will be even a bit larger than the seqStore.

I am now trying a machine in a different supercomputer center, geographically closer to the NFSv4 storage. If canu could copy just its input file to $TMPDIR, do its work, and move the result back to the source location, it would be tremendously helpful.

Also, canu could override ovsConcurrency itself when it retries the ovlStore building:

-- Running jobs.  Second attempt out of 2.
----------------------------------------
-- Starting 'ovS' concurrent execution on Mon Mar  1 13:48:19 2021 with 213854.332 GB free disk space (446 processes; 62 concurrently)

Now I understand why it never finished: the network cannot feed data simultaneously to the 64 CPUs available locally. I would say that when a user sets useGrid=false and the job runs on a single node, this step should run serially, with the largest bucket size possible (to ease the sorting).

-- Creating overlap store correction/my_genome.ovlStore using:
--    147 buckets
--    616 slices
--        using at most 29 GB memory each

I think instead of using only 29 GB of memory it could have taken 2 TB on that machine in my case; on the newer machine even 6 TB would be available for a single thread.

Per htop, ovStoreBucketizer and ovStoreSorter are just waiting for input data; CPU usage is mostly between zero and 3% per core. Under some setups with fewer concurrent threads I managed to reach even 50% on the 18 CPU cores allocated through PBSPro. As I see it, I am now thrashing three other 2 TB RAM machines in the same way, at about the same canu step.

mmokrejs avatar Mar 04 '21 08:03 mmokrejs

There's really no good solution here.

Overlap store building was designed to run on multiple nodes with a high-bandwidth storage system. Running on just one node limits bandwidth to whatever that single node has. Reading 20 TB of data on a gigabit net should take 20 TB * 1024 GB/TB * 1024 MB/GB / 100 MB/sec / 3600 sec/hr = 58 hours, times two for a read pass and a write pass, and times two again for the bucketize stage and the sorting stage.
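The back-of-the-envelope estimate above can be reproduced directly (the 100 MB/s figure is the assumed effective gigabit throughput from the comment):

```shell
# Reproduce the transfer-time estimate: 20 TB at ~100 MB/s for one
# linear pass, then x2 for the write pass and x2 for the two stages.
tb=20
rate_mb_s=100
pass_h=$(awk -v tb="$tb" -v r="$rate_mb_s" \
  'BEGIN { printf "%d", tb * 1024 * 1024 / r / 3600 }')
echo "one linear pass:        ${pass_h} h"           # 58 h
echo "read+write, two stages: $((pass_h * 4)) h"     # 232 h, ~10 days
```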

There's no gain from making the jobs larger. It doesn't matter whether the 20 TB of data is sorted in chunks of 20 GB or 1 TB: the sort process needs to read 20 TB and write back 20 TB either way.
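That invariance can be spelled out numerically. In a toy sketch, whatever the chunk size, each stage still reads every byte once and writes every byte once, so total stage I/O is constant; only the job count changes:

```shell
# Toy illustration: total I/O per stage is independent of chunk size.
total_gb=$((20 * 1024))                  # 20 TB of overlap data
for chunk_gb in 20 1024; do
  jobs=$(( (total_gb + chunk_gb - 1) / chunk_gb ))  # ceil division
  io_gb=$(( 2 * total_gb ))              # one read + one write pass
  echo "chunk=${chunk_gb}GB jobs=${jobs} stage I/O=${io_gb}GB"
done
```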

When someone had a similar problem a few years ago - building a correction overlap store using a 'personal' NAS - it ended up taking 8 to 10 days if I remember correctly.

brianwalenz avatar Mar 04 '21 11:03 brianwalenz

Hmm, I thought that if the data is sorted in smaller groups, the groups have to be re-sorted (merged) afterwards, so it does matter whether one proceeds in larger chunks. But I take it as it is.

For me, it would be helpful if canu exited with a clear message stating that I should run this step on a single CPU in a two-week queue with, say, 30 GB of RAM; that I can do easily. It could then tell me to run some follow-up steps again on multiple CPUs. How do I determine the memory needed for this step?

mmokrejs avatar Mar 04 '21 13:03 mmokrejs

Interesting, I cannot submit a single-cpu job:

-- Detected gnuplot version '5.2 patchlevel 6   ' (from 'gnuplot') and image format 'png'.
-- Detected 32 CPUs and 472 gigabytes of memory.
-- Limited to 100 gigabytes from maxMemory option.
-- Limited to 1 CPUs from maxThreads option.
-- Detected PBSPro '19.0.0' with 'pbsnodes' binary in /usr/bin/pbsnodes.
-- Grid engine and staging disabled per useGrid=false option.
--
-- ERROR
-- ERROR  Limited to at most 100 GB memory via maxMemory option
-- ERROR  Limited to at most 1 threads via maxThreads option
-- ERROR
-- ERROR  Found 1 machine configuration:
-- ERROR    class0 - 1 machines with 1 cores with 100 GB memory each.
-- ERROR
-- ERROR  Task hap can't run on any available machines.
-- ERROR  It is requesting:
-- ERROR    hapMemory=8-16 memory (gigabytes)
-- ERROR    hapThreads=16-64 threads
-- ERROR
-- ERROR  No available machine configuration can run this task.
-- ERROR
-- ERROR  Possible solutions:
-- ERROR    Increase maxMemory
-- ERROR    Change hapMemory and/or hapThreads
-- ERROR

I don't have data for haplotype calling, so this is probably a false sanity check.

mmokrejs avatar Mar 04 '21 16:03 mmokrejs