512MB seems to be the max supported file size for disk persistence plugin
Describe the bug
Using Orama with the persistence plugin, I seem to have hit a wall. I was indexing some publications and everything was fine until the database grew. Now I keep getting the same error whenever the database is persisted to or read back from disk:
node:buffer:711
slice: (buf, start, end) => buf.hexSlice(start, end),
^
Error: Cannot create a string longer than 0x1fffffe8 characters
at Object.slice (node:buffer:711:37)
at Buffer.toString (node:buffer:863:14)
at persist (file:///home/node/node_modules/.pnpm/@[email protected]/node_modules/@orama/plugin-data-persistence/dist/index.js:60:45)
at async persistToFile (file:///home/node/node_modules/.pnpm/@[email protected]/node_modules/@orama/plugin-data-persistence/dist/server.js:16:24)
at async addPDF (file:///home/node/lib/utils.mjs:149:3)
at async indexPublications (file:///home/node/lib/utils.mjs:218:9)
at async Command.<anonymous> (file:///home/node/dbtool:59:5) {
code: 'ERR_STRING_TOO_LONG'
}
Node.js v22.11.0
node@81095f0d53a4:~$ ./dbtool stats
file:///home/node/lib/utils.mjs:173
size: filesize(Buffer.byteLength(JSON.stringify(db))),
^
RangeError: Invalid string length
at JSON.stringify (<anonymous>)
at getStats (file:///home/node/lib/utils.mjs:173:43)
at async Command.<anonymous> (file:///home/node/dbtool:32:19)
node@81095f0d53a4:~$ du -h db.orama
512M db.orama
To Reproduce
- Use Orama with the persistence plugin
- Ingest a lot of docs until you reach 512MB in size
- Watch your whole database go up in smoke
Expected behavior
Being able to reach more than 512MB in database size.
Environment Info
OS: Manjaro Linux 6.6.54
Node: v22.11.0
Orama: @orama/orama 3.0.2 @orama/plugin-data-persistence 3.0.2
Affected areas
Initialization, Data Insertion
Additional context
Only tried Linux so far, as it's my daily driver.
Hi @bennyzen, how are you serializing the database? Via JSON, DPACK, or MessagePack?
Ciao Michele,
first of all, thank you for this amazing project.
From my humble understanding, as I haven't yet studied the internals of Orama, I simply followed the instructions in the docs, calling the provided persistToFile() and restoreFromFile() methods, both with the "binary" argument.
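Roughly, what I'm doing boils down to this (schema, document and file name are placeholders, not my real code):
import { create, insert } from '@orama/orama'
import { persistToFile, restoreFromFile } from '@orama/plugin-data-persistence/server'

// placeholder schema and document, just to show the call shape
const db = create({ schema: { title: 'string', content: 'string' } })
await insert(db, { title: 'Some publication', content: '...' })

// persist and later restore in "binary" mode
const path = await persistToFile(db, 'binary', 'db.orama')
const restored = await restoreFromFile('binary', path)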
There's a real chance that I've simply been overambitious in ingesting so much data into the db, as it's maybe just not made for such volumes.
BTW: Has anyone successfully persisted and restored more than 512MB of data, or is it just me running into this kind of issue?
Can you try persisting this data in JSON format, using the json option instead of the binary one? Built-in JSON support in JavaScript is far superior to binary support via third-party libs like msgpack or dpack.
As far as I know, 512MB shouldn't really be a problem, especially in JSON!
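Concretely, that just means swapping the format argument in the calls you mentioned (file name is a placeholder):
const path = await persistToFile(db, 'json', 'db.json')
const restored = await restoreFromFile('json', path)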
Yes, I'll surely try to persist using JSON. But it will take some time to embed and ingest all those records again to reach that volume.
The only thing that still puzzles me is what I've come across here. If I understand it correctly, it means that the maximum string length has regressed back to 0.5GB. But as always, please correct me if I'm wrong.
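For what it's worth, the limit can be checked directly on a given Node build via the buffer constants (just a sanity check, nothing Orama-specific):
import { constants } from 'node:buffer'

// 0x1fffffe8 = 536,870,888 characters (roughly 0.5 GB), matching the
// ERR_STRING_TOO_LONG error above.
console.log(constants.MAX_STRING_LENGTH, '0x' + constants.MAX_STRING_LENGTH.toString(16))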
Here's a quick'n'dirty, bare-bones reproduction using either binary or json mode:
import { create, insert } from '@orama/orama'
import {
  persistToFile,
  restoreFromFile,
} from '@orama/plugin-data-persistence/server'

const inserts = 512 * 10
const blockSize = 1048576 / 10 // 1MB / 10, as a whole 1MB block would cause another error
const mode = 'json'

const payload = () => {
  let payload = ''
  for (let i = 0; i < blockSize; i++) {
    payload += 'a'
  }
  return payload
}

const db = create({
  schema: {
    payload: 'string',
  },
})

console.time('inserting')
for (let i = 0; i < inserts; i++) {
  await insert(db, {
    payload: payload(),
  })
}
console.timeEnd('inserting')

// persist the database to disk
console.time('persisting')
const path = await persistToFile(db, mode, 'db.dat')
console.timeEnd('persisting')

// restore the database from disk
console.time('restoring')
const restored = await restoreFromFile(mode, path)
console.timeEnd('restoring')
JSON mode yields this error:
inserting: 21.506s
file:///home/ben/tmp/orama-persist-limit/node_modules/.pnpm/@[email protected]/node_modules/@orama/plugin-data-persistence/dist/index.js:50
serialized = JSON.stringify(dbExport);
^
RangeError: Invalid string length
at JSON.stringify (<anonymous>)
at persist (file:///home/ben/tmp/orama-persist-limit/node_modules/.pnpm/@[email protected]/node_modules/@orama/plugin-data-persistence/dist/index.js:50:31)
at async persistToFile (file:///home/ben/tmp/orama-persist-limit/node_modules/.pnpm/@[email protected]/node_modules/@orama/plugin-data-persistence/dist/server.js:16:24)
at async file:///home/ben/tmp/orama-persist-limit/main.mjs:35:14
Node.js v22.11.0
BINARY mode yields this error:
inserting: 21.573s
node:buffer:711
slice: (buf, start, end) => buf.hexSlice(start, end),
^
Error: Cannot create a string longer than 0x1fffffe8 characters
at Object.slice (node:buffer:711:37)
at Buffer.toString (node:buffer:863:14)
at persist (file:///home/ben/tmp/orama-persist-limit/node_modules/.pnpm/@[email protected]/node_modules/@orama/plugin-data-persistence/dist/index.js:60:45)
at async persistToFile (file:///home/ben/tmp/orama-persist-limit/node_modules/.pnpm/@[email protected]/node_modules/@orama/plugin-data-persistence/dist/server.js:16:24)
at async file:///home/ben/tmp/orama-persist-limit/main.mjs:34:14 {
code: 'ERR_STRING_TOO_LONG'
}
Node.js v22.11.0
So yes, the limit seems to be 512MB. Correct?
It shouldn't be. We're investigating, we'll keep you posted (cc. @matijagaspar, @faustoq)
It's just an assumption and probably too vague to be useful, but couldn't this be mitigated by using e.g. a streaming NDJSON parser/serializer? It would surely involve some significant rework of the actual code base, but IMHO it would remove these constraining limitations and significantly reduce memory consumption on larger data volumes.
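Just to sketch the idea (purely illustrative and not tied to Orama's actual export format; writeNdjson/readNdjson are made-up helpers):
import fs from 'node:fs'
import readline from 'node:readline'

// Write an iterable of records as NDJSON: one JSON document per line,
// so no single JSON.stringify call ever has to hold the whole database.
async function writeNdjson(path, records) {
  const out = fs.createWriteStream(path)
  for (const record of records) {
    if (!out.write(JSON.stringify(record) + '\n')) {
      await new Promise((resolve) => out.once('drain', resolve)) // respect backpressure
    }
  }
  await new Promise((resolve) => out.end(resolve))
}

// Read the file back line by line and hand each record to a callback
// (e.g. to re-insert it), instead of parsing one giant string.
async function readNdjson(path, onRecord) {
  const rl = readline.createInterface({ input: fs.createReadStream(path), crlfDelay: Infinity })
  for await (const line of rl) {
    if (line.trim() !== '') await onRecord(JSON.parse(line))
  }
}
Re-inserting on load would rebuild the index rather than restore a snapshot, so this only covers the document side, but it keeps every string far below the limit.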
Another thing I've noticed during testing: field size seems to be limited to roughly 100KB (see the rudimentary code above, where a whole 1MB block causes a different error). Sure, which sane person puts 100 KB of data into a single field? But that's maybe material for a separate issue.
I have the same problem... JSON.stringify() is limited to 512MB. Based on the object (RawData) returned from save(db), I tried to NDJSON the data. While this is easy for rawData.doc, it's less trivial for rawData.index or even rawData.sorting.sorts. The challenge is getting reasonably and equally sized chunks that can then be read and deserialized again.
I wonder how you do that on Orama Cloud... I guess you're not rebuilding the index every time a search comes in?
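For the chunking part, the best I can come up with is something along these lines (rough sketch, splitBySize is a made-up helper, and string length is only a rough proxy for bytes):
// Group serialized entries into chunks of roughly `maxBytes` each, so no
// single chunk ever comes near the ~512MB string limit. Each yielded chunk
// could go to its own file or block of NDJSON lines.
function* splitBySize(entries, maxBytes = 64 * 1024 * 1024) {
  let current = []
  let size = 0
  for (const entry of entries) {
    const line = JSON.stringify(entry)
    if (current.length > 0 && size + line.length > maxBytes) {
      yield current.join('\n')
      current = []
      size = 0
    }
    current.push(line)
    size += line.length + 1 // +1 for the newline
  }
  if (current.length > 0) yield current.join('\n')
}
That still leaves open how to split rawData.index itself into sensible pieces, though.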
I have made a working prototype for streaming write/read of the database. In addition to using MsgPack for serialization, I also incorporated gzip and zstd to compress the database.
Tested on:
- NodeJS v23.11.0
- WSL 2 + Ubuntu 24.04
- Intel Core i5-10400
import fs from "node:fs";
import zlib from "node:zlib";
import { Readable } from 'node:stream';
import { pipeline } from 'node:stream/promises';
import type { AnyOrama } from '@orama/orama'
import { create, load, save } from '@orama/orama'
import { encode, decodeAsync } from "@msgpack/msgpack";

export type CompressionType = "none" | "gzip" | "zstd"

export async function persistToFile<T extends AnyOrama>(db: T, path: string, compression: CompressionType): Promise<string> {
  const dbExport = await save(db);
  // Encode the exported database with MsgPack and wrap the result in a Buffer
  // without copying the underlying memory.
  const msgpack = encode(dbExport);
  const bufferExport = Buffer.from(msgpack.buffer, msgpack.byteOffset, msgpack.byteLength);

  // Note: zstd streams in node:zlib are only available on recent Node releases
  // (this prototype was tested on v23.11.0).
  if (compression === "none") {
    await pipeline(Readable.from(bufferExport), fs.createWriteStream(path));
  } else if (compression === "gzip") {
    await pipeline(Readable.from(bufferExport), zlib.createGzip(), fs.createWriteStream(path));
  } else if (compression === "zstd") {
    await pipeline(Readable.from(bufferExport), zlib.createZstdCompress(), fs.createWriteStream(path));
  } else {
    throw new Error("Unknown compression!");
  }
  return path;
}

export async function restoreFromFile<T extends AnyOrama>(path: string, compression: CompressionType): Promise<T> {
  const db = create({
    schema: {
      __placeholder: 'string'
    }
  });

  // `pipeline` passes the whole readable side as an async iterable,
  // which decodeAsync accepts directly.
  const decodeCb = async (chunk: unknown) => {
    // @ts-expect-error
    const res = await decodeAsync(chunk)
    // @ts-expect-error
    load(db, res);
  }

  if (compression === "none") {
    await pipeline(fs.createReadStream(path), decodeCb);
  } else if (compression === "gzip") {
    await pipeline(fs.createReadStream(path), zlib.createGunzip(), decodeCb);
  } else if (compression === "zstd") {
    await pipeline(fs.createReadStream(path), zlib.createZstdDecompress(), decodeCb);
  } else {
    throw new Error("Unknown compression!");
  }
  return db as unknown as T;
}
Testing code (adapted from @bennyzen):
import crypto from "node:crypto";
import { create, insert } from '@orama/orama'
import { persistToFile, restoreFromFile } from './persist-stream.js'

const inserts = 512 * 20
const blockSize = Math.floor(1048576 / 100) // ~10 KB of random bytes per document (integer size for randomBytes)

const payload = () => {
  // hex encoding roughly doubles the payload size per field
  return crypto.randomBytes(blockSize).toString("hex");
}

const db = create({
  schema: {
    payload: 'string',
  },
})

console.time('inserting')
for (let i = 0; i < inserts; i++) {
  await insert(db, {
    payload: payload(),
  })
}
console.timeEnd('inserting')

// persist the database to disk
console.time('persisting: none')
await persistToFile(db, "data/db.msgpack", "none")
console.timeEnd('persisting: none')

console.time('persisting: gzip')
await persistToFile(db, "data/db.msgpack.gz", "gzip")
console.timeEnd('persisting: gzip')

console.time('persisting: zstd')
await persistToFile(db, "data/db.msgpack.zst", "zstd")
console.timeEnd('persisting: zstd')

// restore the database from disk
console.time('restoring: none')
const restoredNone = await restoreFromFile("data/db.msgpack", "none")
console.timeEnd('restoring: none')

console.time('restoring: gzip')
const restoredGzip = await restoreFromFile("data/db.msgpack.gz", "gzip")
console.timeEnd('restoring: gzip')

console.time('restoring: zstd')
const restoredZstd = await restoreFromFile("data/db.msgpack.zst", "zstd")
console.timeEnd('restoring: zstd')
Results:
inserting: 13.529s
persisting: none: 5.439s
persisting: gzip: 34.371s
persisting: zstd: 19.138s
restoring: none: 8.117s
restoring: gzip: 15.751s
restoring: zstd: 14.387s
File sizes:
- db.msgpack = 1.3 GB
- db.msgpack.gz = 559 MB
- db.msgpack.zst = 541 MB
I also observed that the Node process uses about 2 GB of memory. In conclusion, I think streaming is the way to go for saving/loading large Orama databases (on the server side, at least). Also, zstd offers a great compression ratio and speed compared to gzip.
Just don't use Brotli, because it is awfully slow.