
Parallel.js is slow due to unnecessary per-job overhead

benbucksch opened this issue on May 14 '20 · 2 comments

There seems to be a significant overhead per job, not just per worker.

My task is to run the Levenshtein algorithm, comparing a given string with about 200000 target strings and finding the best match. This is ideal for parallelism, because it runs the exact same algorithm thousands of times, and the algorithm is small, needs little data and no context. Without parallelism, it takes long enough to be noticeable for the end user.

So, I've used paralleljs to run it on multiple CPU cores at the same time. However, I find that using paralleljs actually makes it slower than the synchronous version.

Environment:

  • node 13.2
  • Ubuntu Linux 20.04
  • CPU Ryzen 7 1700X, 8 cores with SMT, i.e. 16 virtual cores

Terminology:

  • worker: Maps to an OS-level thread or process and runs on a separate CPU core, in parallel to the other workers.
  • job: Running the function once for one of the data values.

I understand the purpose of Parallel.js to be creating the workers and distributing the jobs to them in an efficient manner, with a comfortable and easy-to-use API.

Test results

Synchronous

No parallel.js used, all evaluations run in a single JS thread

664 ms with 88959 (longer) values
358 ms with 89548 (shorter) values
93 ms with 31552 values
10 ms with 2994 values
0 ms with 79 values
1 ms with 137 values
1 ms with 823 values
0 ms with 360 values

This is too slow, and it is what we want to improve using parallel.js.

Using paralleljs with default options

Same test, with maxWorkers = 16 (= number of SMT virtual CPU cores)

2067 ms with 88959 values
2036 ms with 89548 values
907 ms with 31552 values
332 ms with 2994 values
272 ms with 79 values
287 ms with 137 values
287 ms with 823 values
280 ms with 360 values

It's about 300 times slower for small amounts of data. This might be the overhead of creating the worker threads.

It's even 3-6 times as slow for large amounts of data. This is somewhat surprising: it cannot be the overhead of creating the worker threads, because they should be limited to 16 and re-used for additional data. So I would expect at worst the time of the synchronous execution, plus the ~280 ms overhead of creating the workers, plus a tiny overhead for the copy of the data. That would be around 700 ms for the second test, but it's actually about 2000 ms, almost 3 times as long.
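
Stated as a rough worst-case cost model (my assumption of how re-used workers should behave, using the numbers above):

expected_time ≈ worker_creation (≈280 ms) + data_copy (small) + sync_compute (358 ms) ≈ 700 ms
measured_time ≈ 2036 ms, i.e. almost 3x the expected worst case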

Interestingly, the first test (longer strings) and the second test (shorter strings) take about the same time with parallel.js, whereas string length makes a big difference in the synchronous execution. So the slowdown isn't due to pure data copy time. (See also the note on the last test below.) This shows that the amount of data has less influence than other factors.

This indicates that something might be wrong (very inefficient) in parallel.js: there is some significant overhead per job, e.g. workers are not re-used but re-created for each job, or some other huge overhead per job.

with maxWorkers = 8

(= number of physical CPU cores without SMT)

2002 ms with 88959 (longer) values
1782 ms with 89548 (shorter) values
778 ms with 31552 values
198 ms with 2994 values
156 ms with 79 values
166 ms with 137 values
173 ms with 823 values
153 ms with 360 values

Split up array

Starting only as many jobs as CPU cores, by manually splitting up the value array before running Parallel

593 ms with 88959 values
526 ms with 89548 values
357 ms with 31552 values
304 ms with 2994 values
285 ms with 79 values
285 ms with 137 values
290 ms with 823 values
284 ms with 360 values

Here, I first manually split the data array into one array per CPU core, leading to a nested array with 16 elements, each being an array with thousands of values. This feeds one large array to each worker, one per CPU core. Then, within (!) the worker, I make a synchronous loop over the second-level array and its thousands of values.

This is to test whether parallel.js is inefficiently distributing the work, e.g. by creating too many workers, or something with a similar effect. This appears to be the case.

I would expect that Parallel.js creates maxWorkers workers, then makes a synchronous loop over the data within each worker, running the function on each value. If it did, it should reach speeds very similar to these. Given that Parallel.js is about 4 times slower, there's something to improve.
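
For comparison, this expected model can be sketched with plain node worker_threads (an illustration only, not Parallel.js internals; levenWorker.js is a hypothetical worker script that maps leven over the slice it receives and posts the results back):

import { Worker } from 'worker_threads';

// Sketch of the expected scheduling: one worker per core, one message
// per worker carrying a whole slice, one message back with all results.
function runSlice(slice) {
  return new Promise((resolve, reject) => {
    const worker = new Worker('./levenWorker.js'); // hypothetical script
    worker.postMessage(slice);         // one message in: thousands of values
    worker.once('message', results => { // one message out: all results
      worker.terminate();
      resolve(results);
    });
    worker.once('error', reject);
  });
}

// Usage, with `slices` being the manually split array from above:
// let results = (await Promise.all(slices.map(runSlice))).flat();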

Split up array, with 8 workers

578 ms with 88959 values
457 ms with 89548 values
228 ms with 31552 values
167 ms with 2994 values
142 ms with 79 values
147 ms with 137 values
155 ms with 823 values
149 ms with 360 values

This comes closest to the synchronous version.

1 Worker, 1 Job

This is the same as the last 2 cases, just with cores = 1. This means there is no parallelism; the entire data array is passed into a single worker. It is effectively the same as the synchronous version without any parallel.js, except that it runs in a single worker created by Parallel.js.

1178 ms with 88959 values
838 ms with 89548 values
316 ms with 31552 values
83 ms with 2994 values
48 ms with 79 values
45 ms with 137 values
60 ms with 823 values
52 ms with 360 values

This should be slightly slower than the synchronous version, due to the overhead of creating the worker, and a lot slower than the parallel one, due to only 1 CPU core being used. However, it's a lot faster than the non-split parallel version, which should not be the case.

This also shows that the data copy is a lesser factor: it still has to happen here, yet this version is a lot faster than the non-split version, which indicates that the data copy is not the primary reason for the slowness.

Conclusion

Using Parallel.js in this case is dramatically slower than the synchronous version, by 1 or 2 orders of magnitude.

Inherent thread creation overhead

There is a certain overhead of using Parallel.js, which is very significant for small inputs: hundreds of times slower, i.e. 2 orders of magnitude slower. That is due to the fundamental cost of creating threads or processes.

This could be worked around by the caller, by checking the amount of input and avoiding Parallel.js in these cases. The factors to consider are: amount of input data, execution time per function call, and execution time per thread creation. A sketch of such a guard follows below.
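
As a hedged illustration, a caller-side guard might look like this (mapMaybeParallel and the threshold of 1000 are my own placeholders, not part of any API; the right cutoff must be measured on the target machine):

import Parallel from 'paralleljs';

// Sketch: stay synchronous when the input is too small to amortize
// the worker-creation cost. THRESHOLD is a placeholder value.
const THRESHOLD = 1000;

function mapMaybeParallel(values, fn, env) {
  if (values.length < THRESHOLD) {
    return Promise.resolve(values.map(fn)); // cheap path: single thread
  }
  // fn must be self-contained, because Parallel.js serializes it
  return new Parallel(values, { env }).map(fn);
}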

Data copy overhead

There is also an overhead for copying the data. However, the results show that this is not the primary reason for the slowdown.

Needless overhead

There is a huge overhead per job, not just per worker.

The results suggest that Parallel.js creates more workers than maxWorkers, or has other significant overheads per job with a similar effect.

There should be no overhead per job, only per worker and for the data transfer.

It would be good to look into this.

— benbucksch, May 14 '20 02:05

Test code

This is a reduced test case. It's not the exact code I ran, but similar to what I tested with.

Synchronous

import leven from 'didyoumean2';

let startTime = new Date();
let input = "What is going on";
let matches = ["I don't know", ... ]; // thousands of strings with lengths around 10 to 40 chars
// leven() is assumed here to return an object with a normalized .score
let results = matches.map(match => leven(input, match));
results = results.filter(result => result.score <= 0.1).sort((a, b) => a.score - b.score);
console.log((new Date() - startTime) + "ms with", matches.length, "values");

Using paralleljs

import Parallel from 'paralleljs';
import leven from 'didyoumean2';

let startTime = new Date();
let input = "What is going on";
let matches = ["I don't know", ... ]; // thousands of strings with lengths around 10 to 40 chars
let parallel = new Parallel(matches, { env: { input } });
let results = await parallel
  .require(leven) // make leven() available inside the workers
  .map(match => leven(global.env.input, match)); // env is exposed as global.env in the worker
results = results.filter(result => result.score <= 0.1).sort((a, b) => a.score - b.score);
console.log((new Date() - startTime) + "ms with", matches.length, "values");

Split-up the array

import Parallel from 'paralleljs';
import leven from 'didyoumean2';

let startTime = new Date();
let input = "What is going on";
let matches = ["I don't know", ... ]; // thousands of strings with lengths around 10 to 40 chars
const cores = 16;
let lengthPerCore = Math.ceil(matches.length / cores);
let worksplit = []; // one sub-array per core
for (let i = 0; i < cores; i++) {
  worksplit.push(matches.slice(i * lengthPerCore, (i + 1) * lengthPerCore));
}
let parallel = new Parallel(worksplit, { env: { input }, maxWorkers: cores });
let results = await parallel
  .require(leven)
  .map(valuesSplit => // runs once per worker, on that worker's whole sub-array
    valuesSplit.map(match => leven(global.env.input, match)));
results = results.flat(1); // flatten the per-worker result arrays
results = results.filter(result => result.score <= 0.1).sort((a, b) => a.score - b.score);
console.log((new Date() - startTime) + "ms with", matches.length, "values");

— benbucksch, May 14 '20 02:05

Cause

Looking at the Parallel.js code, it does seem to try to re-use workers in parallel.map(). It's possible that there are bugs which prevent the re-use of workers, but I'll presume that this is not the case.

The message passing between the controlling process and the spawned worker process might be slow. I can see in the code that the data is passed to the worker and returned one value at a time, one job per message. That's roughly 180000 messages for the first test case. It is quite likely this constant back and forth between the processes that is so slow; it would match and explain the numbers I measured.

Fix

The fix is relatively simple: pass all values at once into a worker, i.e. do what I did with the split-up code. That leads to a 3x to 4x speedup for workloads with 100000 calls.

Edge case

There might be special cases where there are few jobs, but the running time varies dramatically between jobs. For such cases, the current one-by-one design is better. So, you could switch strategies based on the number of jobs/values, e.g. at 100, where the back-and-forth overhead becomes significant and differences between job execution times likely level out. A sketch follows below.
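
A minimal sketch of that switch (chooseStrategy and the cutoff constant are illustrative names, not Parallel.js API; 100 is the guess from above):

// Sketch: pick a distribution strategy based on job count. Below the
// cutoff, per-job dispatch lets uneven job times balance out across
// workers; above it, chunking amortizes the message-passing overhead.
const CHUNKING_CUTOFF = 100; // guess from above; tune by measurement

function chooseStrategy(jobCount) {
  return jobCount < CHUNKING_CUTOFF
    ? 'one-job-per-message'    // current Parallel.js behavior
    : 'one-chunk-per-worker';  // the batched fix described above
}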

— benbucksch, May 14 '20 03:05