js-bson
perf(NODE-6246): Significantly improve memory usage and performance of ObjectId
Description
This MR refactors the ObjectId class to persist a hex string instead of a raw Buffer, which greatly reduces memory usage and improves the performance of ObjectIds.
Why?
- Buffer (or TypedArray) is inefficient at storing small amounts of data
- A Buffer of size 12 takes 96 bytes in V8! An ObjectId backed by a Buffer is 128 bytes each
- A string of 24 characters takes 40 bytes; an ObjectId backed by a string is 72 bytes (quick demo below)
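You can see this overhead with a quick experiment (a rough sketch; exact numbers vary by Node/V8 version and are not part of this MR's benchmarks):

const N = 1_000_000;

function measure(fn) {
  global.gc?.(); // run with --expose-gc for more stable numbers (optional)
  const before = process.memoryUsage().rss;
  const kept = fn(); // keep the array alive so nothing is collected before we measure
  const after = process.memoryUsage().rss;
  console.log(`${fn.name}: ${Math.round((after - before) / 1024 / 1024)}MB retained`);
  return kept;
}

// 1M twelve-byte Buffers vs 1M 24-character hex strings
measure(function buffers() {
  const arr = [];
  for (let i = 0; i < N; i++) arr.push(Buffer.allocUnsafe(12));
  return arr;
});
measure(function hexStrings() {
  const arr = [];
  for (let i = 0; i < N; i++) arr.push(i.toString(16).padStart(24, '0'));
  return arr;
});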
Performance Improvements
| Operation | ObjectId (ops/sec) | ObjectId STRING (ops/sec) | Relative Improvement (%) |
|---|---|---|---|
| new ObjectId() | 10,398,719 | 11,263,765 | 8.32% |
| new ObjectId(string) | 4,965,933 | 14,869,618 | 199.43% |
| deserialize | 6,955,399 | 6,431,123 | -7.54% |
| serialize | 3,881,206 | 3,247,690 | -16.32% |
| fromExtendedJSON | 4,821,665 | 50,886,088 | 955.36% |
| toExtendedJSON | 10,988,941 | 134,783,499 | 1126.54% |
| toHexString | 10,899,295 | 131,767,192 | 616.63% |
| .equals(objectId) | 39,653,912 | 44,413,312 | 12.00% |
| .equals(string) | 7,691,853 | 55,222,764 | 617.94% |
| .isValid(string) | 137,247 | 21,470,293 | 15543.49% |
Real world use cases
In the benchmarks above you can see there are some performance regressions in raw BSON serialize/deserialize. This is because we now need to convert a buffer to/from a string instead of just copying bytes, but this regression appears to be negligible in the "real world" benchmarks below.
I believe the 2 most common use cases are:
- Retrieving documents from the DB and returning them to the user.
- Creating a new document and persisting it to the DB.
Using the MFlix Sample Dataset "Comments" documents.
| Operation | ObjectId (ops/sec) | ObjectId STRING (ops/sec) | Relative Improvement (%) |
|---|---|---|---|
| GET 100 docs (deserialize + stringify) | 3,762 | 3,855 | 2.46% |
| Create new doc (create + serialize) | 673,069 | 710,472 | 5.56% |
Memory Improvements
With this change, retained memory usage is reduced by ~45% (ObjectId size decreases from 128 bytes to 72 bytes). I've also removed an extra buffer allocation during deserialization, which reduces strain on the internal buffer pool and makes it less likely the pool has to allocate another block of Buffer.poolSize bytes.
By reducing the amount of memory that needs to be allocated, we further improve performance, since garbage collection is quite expensive.
Example
Using the MFlix Sample Dataset "Comments" collection. Grabbing 1 million documents
const bef = process.memoryUsage().rss;
const docs = await comments.find({}, { limit: 1000000 }).toArray(); // let's grab 1 million documents
const aft = process.memoryUsage().rss;
console.log(`Memory used: ${Math.floor((aft - bef) / 1024 / 1024)}MB`);
BEFORE: 654 MB used
AFTER: 343 MB used
-47% reduction in memory usage
cacheBuffer Option
Similar to the existing cacheHexString static variable, this MR adds a cacheBuffer option that persists the Buffer on the ObjectId to speed up operations that require the buffer, such as .id.
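Assuming it's toggled the same way cacheHexString is today (a static flag), usage would look roughly like this; this is a sketch of the intended shape, not final API documentation:

const { ObjectId } = require('bson');

// Opt in globally, mirroring the existing ObjectId.cacheHexString static flag (assumed shape).
ObjectId.cacheBuffer = true;

const oid = new ObjectId();
oid.id;            // the 12-byte buffer is built once and kept on the instance for later .id / serialize calls
oid.toHexString(); // string-backed, so this stays cheap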
Previous ObjectId w/ cacheHexString vs. new ObjectId w/ cacheBuffer
| Operation | bson (ops/sec) | bsonSTR (ops/sec) | Relative Improvement (%) |
|---|---|---|---|
| new ObjectId() | 5,535,118 | 5,481,568 | -0.97% |
| new ObjectId() + serialize | 5,621,583 | 5,635,728 | 0.25% |
| new ObjectId(string) | 3,794,209 | 5,363,446 | 41.36% |
| deserialize | 3,788,455 | 6,131,230 | 61.85% |
| serialize | 67,014,370 | 80,864,967 | 20.65% |
| toHexString | 71,372,677 | 79,949,268 | 12.02% |
| equals | 40,714,896 | 45,167,881 | 10.95% |
| equals-string | 20,697,929 | 42,426,021 | 105.00% |
| getTimestamp | 10,554,392 | 10,591,691 | 0.35% |
| createFromTime | 6,479,506 | 5,801,031 | -10.47% |
| isValid | 128,998 | 20,637,794 | 15895.85% |
What is changing?
Is there new documentation needed for these changes?
This MR deprecates cacheHexString, which may need to be documented.
This MR adds cacheBuffer, which may need to be documented.
What is the motivation for this change?
We have been running into memory issues while pulling large Mongo result sets into memory, and after lots of memory profiling the issue seems to be related to ObjectId, specifically how memory-inefficient Buffer/ArrayBuffer is when storing lots of small Buffers.
We were expecting an ObjectId to consume ~12 bytes of memory (+ some overhead), but in reality it consumes 128 bytes per ObjectId (96 bytes for just the Buffer).
I opened an issue with the NodeJS Performance team but this appears to be working as designed: https://github.com/nodejs/performance/issues/173
Storing a string in Node/V8 is much more memory efficient since it's a primitive. A 24 character hex string only consumes 40 bytes of memory, AND it's much faster to serialize/deserialize.
Release Highlight
Double check the following
- [x] Ran npm run check:lint script
- [x] Self-review completed using the steps outlined here
- [x] PR title follows the correct format: type(NODE-xxxx)[!]: description
  - Example: feat(NODE-1234)!: rewriting everything in coffeescript
- [x] Changes are covered by tests
- [x] New TODOs have a related JIRA ticket
You can create a new generate method that outputs a string, like this:
/**
 * Generate a 12 byte id as a hex string, used in ObjectIds
 *
 * @param time - pass in a second based timestamp.
 */
static generateString(time?: number): string {
  if ('number' !== typeof time) {
    time = Math.floor(Date.now() / 1000);
  }
  const inc = ObjectId.getInc();
  // set PROCESS_UNIQUE if not yet initialized
  if (PROCESS_UNIQUE === null) {
    PROCESS_UNIQUE = ByteUtils.randomBytes(5);
  }
  if (PROCESS_UNIQUE_START_SLICE_HEX === null) {
    PROCESS_UNIQUE_START_SLICE_HEX = ByteUtils.toHex(PROCESS_UNIQUE);
  }
  // 4-byte timestamp (8 hex chars), padded so the result is always 24 chars
  let finalId = (time % (2 ** 32)).toString(16).padStart(8, '0');
  // 5-byte process-unique value (10 hex chars)
  finalId += PROCESS_UNIQUE_START_SLICE_HEX;
  // 3-byte incrementing counter (6 hex chars)
  finalId += (inc & 0xffffff).toString(16).padStart(6, '0');
  return finalId;
}
From my tests, it's slower than generating only the buffer, but faster than generating the buffer and converting it to hex:
<Buffer 7b 5b 0f 3e b7 ca ef 8a 6b 08 ca 28>
Time generate buffer: 348.585796 ms
7b5b0f3eb7caef8a6ba160a8
Time generate buffer.toString('hex'): 1175.228544 ms
7b5b0f3eb7caef8a6ba160a8
Time generateString: 780.711906 ms
Benchmark:
const { performance } = require('perf_hooks');
const jsBson = require('./lib/bson.cjs');
let start = performance.now();
let r;
const now = Date.now();
for (let i = 0; i < 1e7; i++) {
r = jsBson.ObjectId.generate(now);
}
console.log(r);
console.log(`Time generate buffer: ${performance.now() - start} ms`);
start = performance.now();
for (let i = 0; i < 1e7; i++) {
r = jsBson.ObjectId.generate(now).toString('hex');
}
console.log(r);
console.log(`Time generate buffer.toString('hex'): ${performance.now() - start} ms`);
start = performance.now();
for (let i = 0; i < 1e7; i++) {
r = jsBson.ObjectId.generateString(now);
}
console.log(r);
console.log(`Time generateString: ${performance.now() - start} ms`);
Hey folks, just wanted to drop a heads-up that you won't hear back properly for a bit due to the holiday 🎆 but thanks so much for reaching out about this issue! @SeanReece This looks really thoroughly researched; I am looking forward to getting familiar with the details!
@H4ad this is great! I was hesitant to touch generate but I like what you’ve done here. I’ll try to roll it into this change tomorrow and update the benchmarks. You also got me thinking we could use something like parseInt(__id.substring(0,8), 16) to get the time out of the hex string instead of using buffer. If that works and is more performant then we could drop the cacheBuffer option entirely. 🤔
You also got me thinking we could use something like parseInt(__id.substring(0,8), 16) to get the time out of the hex string instead of using buffer.
Yeah, this will work: new Date(parseInt(this.__id.substring(0, 8), 16) * 1000).
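A standalone sketch of that, for clarity (the function name is illustrative; __id is this PR's internal hex-string field):

// Recover the timestamp straight from a 24-character hex ObjectId string, no Buffer needed.
function timestampFromHexId(hexId) {
  // The first 8 hex chars encode the big-endian seconds-since-epoch timestamp.
  const seconds = parseInt(hexId.substring(0, 8), 16);
  return new Date(seconds * 1000);
}

// e.g. timestampFromHexId('65a8e26f1d41c8b5f0a3d2e1') -> a Date in January 2024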
If that works and is more performant then we could drop the cacheBuffer option entirely.
As a compatibility layer, I think keeping the buffer is still worth having, in case someone is relying on something that can only be done via the buffer.
@nbbeeken I've marked the PR ready for review again 👍 Looking forward to your input on this. Let me know if you have any questions.
@nbbeeken I really appreciate the thorough review 👍 Sorry about all the test failures. I think I got so caught up with benchmarking that it completely slipped my mind. I'll get the tests running!
Got any tips to get the BSON benchmarks running on my machine? Even on a clean main, check:granular-bench seems to chug for a bit then spits out "Error: Benchmarks did not run successfully", and check:spec-bench just exits without completing. I would love to be able to test with the official benchmarks.
I do expect there to be some regressions with raw serialization/deserialization since we're now converting/encoding a Buffer to a hex string, instead of just copying bytes, so running a benchmark that just serializes/deserializes is going to compound this. I still believe that through real-life use cases (combinations of deserialization+toJSON, etc), we shouldn't see any performance regressions. AND I'm sincerely hoping that we'll see much better hex encoding performance in V8 once this lands: https://github.com/tc39/proposal-arraybuffer-base64
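For reference, that proposal adds Uint8Array.prototype.toHex() and Uint8Array.fromHex(); availability depends on the engine/Node version, so this is just a feature-detection sketch, not something this PR relies on:

const bytes = new Uint8Array([0x65, 0x4c, 0x1a, 0xfb]);

if (typeof bytes.toHex === 'function' && typeof Uint8Array.fromHex === 'function') {
  const hex = bytes.toHex();                 // '654c1afb' -- engine-native hex encoding
  const roundTrip = Uint8Array.fromHex(hex); // and back again
  console.log(hex, roundTrip);
} else {
  console.log('Uint8Array hex helpers not available in this runtime');
}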
While I was originally brainstorming this change I also played with the idea of having a large buffer to store a pool of objectIds, since the memory inefficiency comes from storing many small buffers. This would mean we wouldn't have any performance regressions from encoding strings on serialization/deserialization, BUT handling garbage collection / trimming the pool is definitely not an easy problem to solve.
Another idea is to have the string representation be configurable as a flag (ie. enableLowMemoryOID) which would trigger the use of string based ObjectIds, but this means a lot more branches, tests, and code to manage.
Got any tips to get the BSON benchmarks running on my machine?
There's a small bug relating to path handling, I think, that makes the benchmarks only run on Linux.
I am able to get it to run on my Mac with the following command:
env LIBRARY=$(pwd) ITERATIONS=1000 WARMUP=1000 npm run check:granular-bench
Then I wrote a quick script to merge two test/bench/etc/resultsCollectedMeans.json files and get the percent diff. I have a clean main and this branch checked out in two separate folders, so I can just change the env variable to get results from each.
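The script is roughly this (a quick sketch; it assumes resultsCollectedMeans.json is a flat benchmark-name → mean map, which may not be the exact shape, and compare-bench.js is just my local helper, not part of the repo):

// node compare-bench.js main/resultsCollectedMeans.json branch/resultsCollectedMeans.json
const fs = require('fs');

const [mainPath, branchPath] = process.argv.slice(2);
const main = JSON.parse(fs.readFileSync(mainPath, 'utf8'));
const branch = JSON.parse(fs.readFileSync(branchPath, 'utf8'));

for (const name of Object.keys(main)) {
  if (!(name in branch)) continue;
  const a = main[name];
  const b = branch[name];
  // symmetric percent difference, same formula as the percDiff column in the updated perf table below
  const percDiff = ((b - a) / ((a + b) / 2)) * 100;
  console.log(`${name}\t${a}\t${b}\t${percDiff.toFixed(4)}`);
}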
I do expect there to be some regressions with raw serialization/deserialization since we're now converting/encoding a Buffer to a hex string, instead of just copying bytes, so running a benchmark that just serializes/deserializes is going to compound this.
That is understandable. Right now our benchmarks are geared towards throughput, hence the MB/s units; we're testing to make sure that we continue to be able to process the same amount of data as we have previously. This focuses solely on the DB interaction side of things and does not measure the throughput of a total application that may be serializing everything to JSON after it leaves the driver. Possibly that is an area we should look at to see if it is seriously impacted.
While I was originally brainstorming this change I also played with the idea of having a large buffer to store a pool of objectIds, since the memory inefficiency comes from storing many small buffers
Isn't this what allocUnsafe does (exposed through ByteUtils.allocateUnsafe)? If I understand it correctly, it has a large ArrayBuffer that it hands out slices of, and when it runs out it starts a new one (not extending the old one); once all the slices (views) of that old pool leave scope, the old pool can be collected as well. Maybe we can make sure ObjectId is using allocUnsafe wherever it can?
Another idea is to have the string representation be configurable as a flag (ie. enableLowMemoryOID)
Yes, this is a possibility, but I agree with the sentiment about branching code; I think we should, and can, find an acceptable middle ground between performance and memory rather than adding that flag 🚀
@nbbeeken I'm also on a Mac (M1). Still no luck running the benchmarks but I'm adding some debug logs to see if I can track it down 🤔
Isn't this what allocUnsafe does
Yes, NodeJS Buffer uses an internal pool to pre-allocate a Buffer of size Buffer.poolSize for performance reasons only, so that calls to allocateUnsafe are faster, but each call to allocateUnsafe still creates a new instance of Buffer/Uint8Array. The memory overhead comes directly from Uint8Array/TypedArray in V8, where each instance of a TypedArray carries a bunch of pointers, etc. Enabling pointer compression in Node would most likely reduce the impact of this, but that's not so easily done.
Here's some details I could find from a V8 developer: https://stackoverflow.com/questions/45803829/memory-overhead-of-typed-arrays-vs-strings/45808835#45808835
We can easily test this out by allocating some Buffers and checking the memoryUsage before/after
const SIZE = 1_000_000; // creating 1 million buffers
const tests = {
allocUnsafe: () => {
const arr = [];
for (let i = 0; i < SIZE; i++) {
arr.push(Buffer.allocUnsafe(12));
}
return arr;
},
emulatePoolUint8Array: () => {
const buffer = new ArrayBuffer(SIZE * 12);
const arr = [];
for (let i = 0; i < SIZE; i++) {
arr.push(new Uint8Array(buffer, i * 12, 12));
}
return arr;
},
createSingleBuffer: () => {
return Buffer.allocUnsafe(SIZE * 12);
},
};
Object.keys(tests).forEach((key) => {
const bef = process.memoryUsage();
const res = tests[key](); // keep a reference so the allocations stay alive while we measure
const aft = process.memoryUsage();
console.log(`${key} used: ${Math.round((aft.rss + aft.external - (bef.rss + bef.external)) / 1024 / 1024)}MB`);
});
1000000 buffers used: 167MB
1000000 Uint8Array used: 140MB
Single big buffer used: 11MB
but each call to allocateUnsafe still creates a new instance of Buffer/Uint8Array
Right, right, that is the center of the issue. Thanks for breaking that down.
While I was originally brainstorming this change I also played with the idea of having a large buffer to store a pool of objectIds, since the memory inefficiency comes from storing many small buffers.
Since we have isolated our interactions with buffers to the mini ByteUtils library, maybe it is possible to manage our own pool as a single Uint8Array and keep an offset into it inside the ObjectId. I think we can clone the Node.js approach where ObjectIds share references to this pool, but eventually they will go out of scope, leaving old pools unreachable and GC-able.
Updated perf
| bench | main | oid_639cc3c | percDiff |
|---|---|---|---|
| objectid_singleFieldDocument._serialize_bson | 9.0120 | 8.5721 | -5.0032 |
| objectid_singleElementArray._serialize_bson | 9.4766 | 10.5094 | 10.3357 |
| objectid_array_10._serialize_bson | 31.8439 | 30.9114 | -2.9718 |
| objectid_array_100._serialize_bson | 206.7586 | 81.2777 | -87.1285 |
| objectid_array_1000._serialize_bson | 222.2954 | 95.2045 | -80.0572 |
| objectid_singleFieldDocument._deserialize_bson | 11.6742 | 11.9383 | 2.2366 |
| objectid_singleElementArray._deserialize_bson | 10.4276 | 10.9879 | 5.2325 |
| objectid_array_10._deserialize_bson | 53.2898 | 32.9102 | -47.2845 |
| objectid_array_100._deserialize_bson | 108.2602 | 67.9801 | -45.7104 |
| objectid_array_1000._deserialize_bson | 131.2962 | 77.6481 | -51.3516 |
I've run an autocannon test against a server that calls toArray on 100,000 documents and JSON stringifies them. This is my attempt at "real worlding" the perf test.
autocannon http://localhost:8080 -d 10 -c 30 -w 3
MB/s throughput of a hyper-express server
main { avg: 24.2327552, median: 22.446079 }
oid { avg: 19.296665609999998, median: 13.467647 }
percDiff { avg: '22.679%', median: '50.000%' }
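For context, the endpoint under test looks roughly like this (sketched with Node's built-in http module rather than hyper-express; the mflix database/collection names and connection string are assumptions):

const http = require('http');
const { MongoClient } = require('mongodb');

async function main() {
  const client = await MongoClient.connect('mongodb://localhost:27017');
  const comments = client.db('sample_mflix').collection('comments');

  http.createServer(async (req, res) => {
    // toArray() deserializes 100,000 documents (ObjectIds included), then JSON.stringify re-encodes them
    const docs = await comments.find({}, { limit: 100000 }).toArray();
    res.setHeader('content-type', 'application/json');
    res.end(JSON.stringify(docs));
  }).listen(8080);
}

main().catch(console.error);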
they will go out of scope leaving old pools unreachable and GC-able.
@nbbeeken You're a genius 😄 Don't know why this never occurred to me.
Wow, I feel like a bit of an idiot here 😄. For some reason, I assumed Node was maintaining a single large pool and just compacting/trimming old unused space. But looking at the implementation, it's super simple: it just keeps a Uint8Array of size Buffer.poolSize and an offset. Once offset + size reaches the poolSize, a new pool is created and the old pool reference is dropped, so it's theoretically available for GC.
I think we can replicate this easily, reduce memory usage even more, and keep our serialization/deserialization performance. Even if we completely throw out these string changes, I've learned a lot here and there are lots of non-string related changes that would improve performance.
Let me throw together a test to see if this will work. Here's what I'm thinking (rough sketch after the list):
- byte-utils will handle the pool creation and offset.
- poolSize can be configurable (default to ~1000 ObjectIds?).
- Each ObjectId instance must hold a reference to its buffer pool and the OID offset.
- 16 bytes for a Number + pointer saves us ~24 bytes per ObjectId over the string implementation.
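Something like this is what I have in mind (a rough sketch only; ObjectIdPool and its names are illustrative, not the eventual bson API):

// Sketch: an ObjectId-scoped pool in the spirit of Node's internal Buffer pool.
const OID_BYTE_LENGTH = 12;

class ObjectIdPool {
  constructor(poolSize = 1000) {
    this.bytesPerPool = poolSize * OID_BYTE_LENGTH;
    this.pool = new Uint8Array(this.bytesPerPool);
    this.offset = 0;
  }

  // Hands out a { pool, offset } slot. Each ObjectId keeps the pool reference plus its
  // offset, so once every ObjectId pointing at an old pool is collected, the pool is too.
  allocate() {
    if (this.offset + OID_BYTE_LENGTH > this.bytesPerPool) {
      this.pool = new Uint8Array(this.bytesPerPool); // drop the old pool reference
      this.offset = 0;
    }
    const slot = { pool: this.pool, offset: this.offset };
    this.offset += OID_BYTE_LENGTH;
    return slot;
  }
}

// Usage: deserialization would copy the 12 OID bytes into slot.pool at slot.offset.
const pool = new ObjectIdPool();
const slot = pool.allocate();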
You're a genius 😄
tyty! But I must give credit to @addaleax for suggesting the idea of a Uint8Array pool. I have read Node's pool implementation before so I figured we could kindly steal that approach.
Even if we completely throw out these string changes, I've learned a lot here and there are lots of non-string related changes that would improve performance.
Yea! And I've learned a lot here too. I think it was not completely out of the question to begin with. I can imagine most use cases probably end up turning OIDs into hex strings anyway (JSON.stringify will do that), so the REST example could have shown a low impact on performance if it meant we just moved the hex-ify work "up" into BSON. Turns out it has too large an impact, but there was potential there!
Let me throw together a test to see if this will work. Here's what I'm thinking.
Those sound like good jumping-off points. It seems like we intend to make an OID-specific pool; I'm not sure it could be kept general easily 🤔 Making it specific to OIDs means we don't have to return a byteLength along with the byteOffset, since it will always be 12. I think that's fine; we should focus on optimizing the specific case of OIDs given how common they are.
If you'd like to keep this PR as is for comparison, feel free to start a new one but continuing to work here is fine as well. This is really great work, thanks again for donating your time and talents! 💚
Closing in favour of https://github.com/mongodb/js-bson/pull/707
Thanks again for all your help @nbbeeken @H4ad. Looking forward to your input on the buffer pool