
SSH transfer scales badly for large data

Open mschubert opened this issue 3 years ago • 7 comments

clustermq::Q(object.size, x=list(rnorm(1e8)), n_jobs=1)
Connecting USER@HOST via SSH ...
Sending common data ...
Running 1 calculations (0 objs/0 Mb common; 1 calls/chunk) ...
Master: [30.5s 9.3% CPU]; Worker: [avg 13.5% CPU, max 201693041.0 Mb]
Error in summarize_result(job_result, n_errors, n_warnings, cond_msgs,  : 
  1/1 jobs failed (0 warnings). Stopping.
(Error #1) object 'C_objectSize' not found

Originally posted by @mattwarkentin in https://github.com/wlandau/targets/issues/237#issuecomment-736016884

mschubert avatar Nov 30 '20 20:11 mschubert

@mattwarkentin That may actually be an issue with object.size itself (it calls an R internal), rather than with the data transfer. What if you use sum instead?

mschubert avatar Nov 30 '20 20:11 mschubert
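For reference, a hedged workaround sketch (mine, not from the thread, and untested against a real SSH setup; the name size_of is hypothetical): instead of passing utils::object.size itself, whose serialized body references the internal symbol C_objectSize, pass a wrapper that looks the function up by name in the loaded utils namespace on the worker.

```r
# Sketch: wrap object.size so the worker resolves it from the utils
# namespace at call time, instead of deserializing a closure whose body
# points at the internal C symbol C_objectSize.
size_of <- function(x) as.numeric(utils::object.size(x))

# clustermq::Q(size_of, x = list(rnorm(1e8)), n_jobs = 1)  # needs an SSH/cluster setup
size_of(rnorm(10))  # local check: returns the size in bytes
```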

clustermq::Q(function(x) sum(x), x = list(rnorm(1e8)), n_jobs = 1)
Connecting via SSH ...
Sending common data ...
Running 1 calculations (0 objs/0 Mb common; 1 calls/chunk) ...
Master: [23.4s 12.2% CPU]; Worker: [avg 14.0% CPU, max 201689286.0 Mb]
[[1]]
[1] 1940.743

mattwarkentin avatar Nov 30 '20 20:11 mattwarkentin

Ok. So the reported memory is wrong, but the return time looks good.

What if you use toy data the same size as your problematic file in memory? And what if you use the actual file contents you had?

(It's late here and I can't think anymore, I'll revisit this tomorrow)

mschubert avatar Nov 30 '20 20:11 mschubert

No worries!

I will try out a handful of tests using toy and real data and post the results here.

mattwarkentin avatar Nov 30 '20 20:11 mattwarkentin

Sending toy data nearly the same size as the actual data (~3.2 Gb):

lobstr::obj_size(rnorm(4e8))
3,200,000,048 B

For comparison, the size of the data in my previous comment (rnorm(1e8)) was 800,000,048 B, or 800 Mb.

clustermq::Q(function(x) sum(x), x = list(rnorm(4e8)), n_jobs = 1)

The above command timed out after 20 minutes (clustermq.worker.timeout = 1200). So for data that is 4x larger in memory, the transfer took at least 52x longer before timing out. If transfer time scales as size^k, that implies k >= log(52)/log(4) ≈ 2.85, i.e. nearly cubic or worse (we only know this lower bound, not the true exponent).

mattwarkentin avatar Nov 30 '20 21:11 mattwarkentin
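The exponent estimate above can be checked with one line (my arithmetic, not from the thread): assuming transfer time grows as size^k, a 4x size increase that takes at least 52x longer bounds k from below.

```r
# Lower bound on the scaling exponent k, assuming time ~ size^k:
# 4^k >= 52  =>  k >= log(52) / log(4)
log(52) / log(4)  # ~2.85, i.e. close to cubic
```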

Seems like that explains https://github.com/wlandau/targets/issues/237 (please correct me if I am wrong).

wlandau avatar Nov 30 '20 21:11 wlandau

Good timing. I just updated the targets issue to report these findings.

mattwarkentin avatar Nov 30 '20 21:11 mattwarkentin

Testing this on 0.9.2.9000 using SSH:

fx = function(NUM) clustermq::Q(function(x) sum(x), x=list(rnorm(NUM)), n_jobs = 1)
sapply(c(2.5e7, 5e7, 1e8, 2e8, 4e8), fx)
| Number of rnorm | Data size | Time to complete |
| --------------- | --------- | ---------------- |
| 2.5e7           | 200 Mb    | 24 seconds       |
| 5e7             | 400 Mb    | 47 seconds       |
| 1e8             | 800 Mb    | 1.4 minutes      |
| 2e8             | 1.6 Gb    | 2.7 minutes      |
| 4e8             | 3.2 Gb    | 5.4 minutes      |
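As a sanity check on these numbers (my own fit, not from the thread): regressing log(time) on log(size) across the five runs gives a slope close to 1, i.e. transfer time now scales roughly linearly with data size rather than cubically.

```r
# Benchmark figures from the table above (times converted to seconds:
# 1.4 min = 84 s, 2.7 min = 162 s, 5.4 min = 324 s)
n    <- c(2.5e7, 5e7, 1e8, 2e8, 4e8)
secs <- c(24, 47, 84, 162, 324)

fit <- lm(log(secs) ~ log(n))
unname(coef(fit)[2])  # slope ~0.93: approximately linear scaling
```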

For comparison, transferring 1e8 random numbers in a 733 Mb rds file via scp took 1.2 minutes.

So overall, this does not seem to be a problem (anymore?)

mschubert avatar Dec 16 '23 17:12 mschubert