clustermq
SSH transfer scales badly for large data
clustermq::Q(object.size, x=list(rnorm(1e8)), n_jobs=1)
Connecting USER@HOST via SSH ...
Sending common data ...
Running 1 calculations (0 objs/0 Mb common; 1 calls/chunk) ...
Master: [30.5s 9.3% CPU]; Worker: [avg 13.5% CPU, max 201693041.0 Mb]
Error in summarize_result(job_result, n_errors, n_warnings, cond_msgs, :
1/1 jobs failed (0 warnings). Stopping.
(Error #1) object 'C_objectSize' not found
Originally posted by @mattwarkentin in https://github.com/wlandau/targets/issues/237#issuecomment-736016884
@mattwarkentin That may actually be an issue with object.size itself (it accesses an R internal). What if you use sum instead?
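The failure is consistent with how `utils::object.size` is defined: its body is a `.Call` to the native symbol `C_objectSize` in the utils namespace, which plausibly fails to resolve when the bare builtin is serialized and shipped to a worker (an assumption based on the error above). A quick local check, plus an untested workaround sketch:

```r
# Inspect the body of object.size: it references the native symbol
# C_objectSize, which is only meaningful inside the loaded utils namespace
# (assumption: this is why the serialized function fails on the worker).
deparse(body(utils::object.size))

# Possible workaround (untested sketch): wrap the call in a fresh closure so
# the worker resolves object.size from its own installed utils package.
# clustermq::Q(function(x) utils::object.size(x), x = list(rnorm(1e8)), n_jobs = 1)
```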
clustermq::Q(function(x) sum(x), x = list(rnorm(1e8)), n_jobs = 1)
Connecting via SSH ...
Sending common data ...
Running 1 calculations (0 objs/0 Mb common; 1 calls/chunk) ...
Master: [23.4s 12.2% CPU]; Worker: [avg 14.0% CPU, max 201689286.0 Mb]
[[1]]
[1] 1940.743
Ok. So the reported memory is wrong, but the return time looks good.
How about if you use the same size as your problematic file in memory? How about if you use the actual file contents you had?
(It's late here and I can't think anymore, I'll revisit this tomorrow)
No worries!
I will try out a handful of tests using toy and real data and post the results here.
Sending toy data nearly the same size as the actual data (~3.2 Gb):
lobstr::obj_size(rnorm(4e8))
3,200,000,048 B
For comparison, the size of the data in my previous comment (rnorm(1e8)) was 800,000,048 B, or 800 Mb.
clustermq::Q(function(x) sum(x), x = list(rnorm(4e8)), n_jobs = 1)
The above command timed out after 20 minutes (clustermq.worker.timeout = 1200). So for data that is 4x larger in memory, the transfer took at least 52x longer before timing out. That suggests transfer time scales roughly cubically with in-memory size (4^2.85 ≈ 52), though since the run timed out, the true exponent could be anything cubic or larger.
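The implied scaling exponent can be sanity-checked with a one-liner: a ≥52x slowdown for a 4x size increase corresponds to an exponent of at least log(52)/log(4):

```r
# Lower bound on the scaling exponent implied by the timeout:
# time grew by >= 52x while data grew 4x, so exponent >= log(52)/log(4).
log(52) / log(4)
# ~2.85, i.e. close to cubic (and possibly worse, since the run timed out)
```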
Seems like that explains https://github.com/wlandau/targets/issues/237 (please correct me if I am wrong).
Good timing. I just updated the targets issue to report these findings.
Testing this on 0.9.2.9000 using SSH:
fx = function(NUM) clustermq::Q(function(x) sum(x), x=list(rnorm(NUM)), n_jobs = 1)
sapply(c(2.5e7, 5e7, 1e8, 2e8, 4e8), fx)
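The elapsed times reported here could be collected with a small system.time() wrapper around fx. This is a hypothetical harness (the thread does not show the actual measurement code) and the clustermq call itself needs a working SSH setup:

```r
# Hypothetical timing harness (an assumption; the exact measurement code is
# not shown in the thread): record elapsed wall-clock seconds per input size.
time_one <- function(n, f) unname(system.time(f(n))["elapsed"])
sizes <- c(2.5e7, 5e7, 1e8, 2e8, 4e8)
# elapsed <- sapply(sizes, time_one, f = fx)  # fx as defined above; needs SSH
```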
| Number of rnorm | Data size | Time to complete |
|---|---|---|
| 2.5e7 | 200 Mb | 24 seconds |
| 5e7 | 400 Mb | 47 seconds |
| 1e8 | 800 Mb | 1.4 minutes |
| 2e8 | 1.6 Gb | 2.7 minutes |
| 4e8 | 3.2 Gb | 5.4 minutes |
For comparison, transferring 1e8 random numbers in a 733 Mb rds file via scp took 1.2 minutes.
So overall, this does not seem to be a problem (anymore?)