
SSH transfer scales badly for large data

Open mschubert opened this issue 3 years ago • 7 comments

clustermq::Q(object.size, x=list(rnorm(1e8)), n_jobs=1)
Connecting USER@HOST via SSH ...
Sending common data ...
Running 1 calculations (0 objs/0 Mb common; 1 calls/chunk) ...
Master: [30.5s 9.3% CPU]; Worker: [avg 13.5% CPU, max 201693041.0 Mb]
Error in summarize_result(job_result, n_errors, n_warnings, cond_msgs,  : 
  1/1 jobs failed (0 warnings). Stopping.
(Error #1) object 'C_objectSize' not found

Originally posted by @mattwarkentin in https://github.com/wlandau/targets/issues/237#issuecomment-736016884

mschubert avatar Nov 30 '20 20:11 mschubert

@mattwarkentin That may actually be an issue with object.size itself (it calls an R internal), rather than with the data transfer. What if you use sum instead?

mschubert avatar Nov 30 '20 20:11 mschubert
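For reference, a hedged workaround sketch (mine, not from the thread, and untested against a real SSH setup; the name size_of is hypothetical): instead of passing utils::object.size itself, whose serialized body references the internal symbol C_objectSize, pass a wrapper that looks the function up by name in the loaded utils namespace on the worker.

```r
# Sketch: wrap object.size so the worker resolves it from the utils
# namespace at call time, instead of deserializing a closure whose body
# points at the internal C symbol C_objectSize.
size_of <- function(x) as.numeric(utils::object.size(x))

# clustermq::Q(size_of, x = list(rnorm(1e8)), n_jobs = 1)  # needs an SSH/cluster setup
size_of(rnorm(10))  # local check: returns the size in bytes
```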

clustermq::Q(function(x) sum(x), x = list(rnorm(1e8)), n_jobs = 1)
Connecting via SSH ...
Sending common data ...
Running 1 calculations (0 objs/0 Mb common; 1 calls/chunk) ...
Master: [23.4s 12.2% CPU]; Worker: [avg 14.0% CPU, max 201689286.0 Mb]
[[1]]
[1] 1940.743

mattwarkentin avatar Nov 30 '20 20:11 mattwarkentin

Ok. So the reported memory is wrong, but the return time looks good.

What if you use toy data the same size as your problematic file in memory? And what if you use the actual file contents you had?

(It's late here and I can't think anymore, I'll revisit this tomorrow)

mschubert avatar Nov 30 '20 20:11 mschubert

No worries!

I will try out a handful of tests using toy and real data and post the results here.

mattwarkentin avatar Nov 30 '20 20:11 mattwarkentin

Sending toy data nearly the same size as the actual data (~3.2 Gb):

lobstr::obj_size(rnorm(4e8))
3,200,000,048 B

For comparison, the size of the data in my previous comment (rnorm(1e8)) was 800,000,048 B, or 800 Mb.

clustermq::Q(function(x) sum(x), x = list(rnorm(4e8)), n_jobs = 1)

The above command timed out after 20 minutes (clustermq.worker.timeout = 1200). So for data that is 4x larger in memory, the transfer took at least 52x longer before timing out. If transfer time scales as size^k, that implies k >= log(52)/log(4) ≈ 2.85, i.e. nearly cubic or worse (we only know this lower bound, not the true exponent).

mattwarkentin avatar Nov 30 '20 21:11 mattwarkentin
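The exponent estimate above can be checked with one line (my arithmetic, not from the thread): assuming transfer time grows as size^k, a 4x size increase that takes at least 52x longer bounds k from below.

```r
# Lower bound on the scaling exponent k, assuming time ~ size^k:
# 4^k >= 52  =>  k >= log(52) / log(4)
log(52) / log(4)  # ~2.85, i.e. close to cubic
```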

Seems like that explains https://github.com/wlandau/targets/issues/237 (please correct me if I am wrong).

wlandau avatar Nov 30 '20 21:11 wlandau

Good timing. I just updated the targets issue to report these findings.

mattwarkentin avatar Nov 30 '20 21:11 mattwarkentin

Testing this on 0.9.2.9000 using SSH:

fx = function(NUM) clustermq::Q(function(x) sum(x), x=list(rnorm(NUM)), n_jobs = 1)
sapply(c(2.5e7, 5e7, 1e8, 2e8, 4e8), fx)
| Number of rnorm | Data size | Time to complete |
| --------------- | --------- | ---------------- |
| 2.5e7           | 200 Mb    | 24 seconds       |
| 5e7             | 400 Mb    | 47 seconds       |
| 1e8             | 800 Mb    | 1.4 minutes      |
| 2e8             | 1.6 Gb    | 2.7 minutes      |
| 4e8             | 3.2 Gb    | 5.4 minutes      |
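As a sanity check on these numbers (my own fit, not from the thread): regressing log(time) on log(size) across the five runs gives a slope close to 1, i.e. transfer time now scales roughly linearly with data size rather than cubically.

```r
# Benchmark figures from the table above (times converted to seconds:
# 1.4 min = 84 s, 2.7 min = 162 s, 5.4 min = 324 s)
n    <- c(2.5e7, 5e7, 1e8, 2e8, 4e8)
secs <- c(24, 47, 84, 162, 324)

fit <- lm(log(secs) ~ log(n))
unname(coef(fit)[2])  # slope ~0.93: approximately linear scaling
```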

For comparison, transferring 1e8 random numbers in a 733 Mb rds file via scp took 1.2 minutes.

So overall, this does not seem to be a problem (anymore?)

mschubert avatar Dec 16 '23 17:12 mschubert