512MB seems to be the max supported file size for disk persistence plugin
Describe the bug
Using Orama with the persistence plugin, I seem to have hit a wall. I was indexing some publications and everything was fine until the database grew. Now I keep getting the same error whenever the database is persisted to or read back from disk:
node:buffer:711
slice: (buf, start, end) => buf.hexSlice(start, end),
^
Error: Cannot create a string longer than 0x1fffffe8 characters
at Object.slice (node:buffer:711:37)
at Buffer.toString (node:buffer:863:14)
at persist (file:///home/node/node_modules/.pnpm/@[email protected]/node_modules/@orama/plugin-data-persistence/dist/index.js:60:45)
at async persistToFile (file:///home/node/node_modules/.pnpm/@[email protected]/node_modules/@orama/plugin-data-persistence/dist/server.js:16:24)
at async addPDF (file:///home/node/lib/utils.mjs:149:3)
at async indexPublications (file:///home/node/lib/utils.mjs:218:9)
at async Command.<anonymous> (file:///home/node/dbtool:59:5) {
code: 'ERR_STRING_TOO_LONG'
}
Node.js v22.11.0
node@81095f0d53a4:~$ ./dbtool stats
file:///home/node/lib/utils.mjs:173
size: filesize(Buffer.byteLength(JSON.stringify(db))),
^
RangeError: Invalid string length
at JSON.stringify (<anonymous>)
at getStats (file:///home/node/lib/utils.mjs:173:43)
at async Command.<anonymous> (file:///home/node/dbtool:32:19)
node@81095f0d53a4:~$ du -h db.orama
512M db.orama
To Reproduce
- Use Orama with the persistence plugin
- Ingest a lot of docs until you reach 512MB in size
- Watch your whole database go up in smoke
Expected behavior
Being able to reach more than 512MB in database size.
Environment Info
OS: Manjaro Linux 6.6.54
Node: v22.11.0
Orama: @orama/orama 3.0.2 @orama/plugin-data-persistence 3.0.2
Affected areas
Initialization, Data Insertion
Additional context
Only tried Linux so far, as it's my daily driver.
Hi @bennyzen, how are you serializing the database? Via JSON, DPACK, or MessagePack?
Ciao Michele,
first of all, thank you for this amazing project.
From my humble understanding, as I haven't yet studied the internals of Orama, I simply followed the instructions in the docs, calling the provided persistToFile() and restoreFromFile() methods, both with the "binary" argument.
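Roughly, what I'm doing boils down to this (schema, document and file name are placeholders, not my real code):
import { create, insert } from '@orama/orama'
import { persistToFile, restoreFromFile } from '@orama/plugin-data-persistence/server'

// placeholder schema and document, just to show the call shape
const db = create({ schema: { title: 'string', content: 'string' } })
await insert(db, { title: 'Some publication', content: '...' })

// persist and later restore in "binary" mode
const path = await persistToFile(db, 'binary', 'db.orama')
const restored = await restoreFromFile('binary', path)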
There's a real chance that I've simply been overambitious in ingesting so much data into the db, as it's maybe just not made for such volumes.
BTW: Has anyone successfully persisted and restored more than 512MB of data, or is it just me running into this kind of issue?
Can you try persisting this data in JSON format, using the json option instead of the binary one? Built-in JSON support in JavaScript is far superior to binary support via third-party libs like msgpack or dpack.
As far as I know, 512MB shouldn't really be a problem, especially in JSON!
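Concretely, that just means swapping the format argument in the calls you mentioned (file name is a placeholder):
const path = await persistToFile(db, 'json', 'db.json')
const restored = await restoreFromFile('json', path)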
Yes, I'll surely try to persist using JSON. But it will take some time to embed and ingest all those records again to reach that volume.
The only thing that still puzzles me is what I've come across here. If I understand it correctly, it means that the maximum string length has regressed back to 0.5GB. But as always, please correct me if I'm wrong.
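For what it's worth, the limit can be checked directly on a given Node build via the buffer constants (just a sanity check, nothing Orama-specific):
import { constants } from 'node:buffer'

// 0x1fffffe8 = 536,870,888 characters (roughly 0.5 GB), matching the
// ERR_STRING_TOO_LONG error above.
console.log(constants.MAX_STRING_LENGTH, '0x' + constants.MAX_STRING_LENGTH.toString(16))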
Here's a quick'n'dirty, bare-bones reproduction using either binary or json mode:
import { create, insert } from '@orama/orama'
import {
  persistToFile,
  restoreFromFile,
} from '@orama/plugin-data-persistence/server'

const inserts = 512 * 10
const blockSize = 1048576 / 10 // 1MB / 10, as a whole 1MB block would cause another error
const mode = 'json'

const payload = () => {
  let payload = ''
  for (let i = 0; i < blockSize; i++) {
    payload += 'a'
  }
  return payload
}

const db = create({
  schema: {
    payload: 'string',
  },
})

console.time('inserting')
for (let i = 0; i < inserts; i++) {
  await insert(db, {
    payload: payload(),
  })
}
console.timeEnd('inserting')

// persist the database to disk
console.time('persisting')
const path = await persistToFile(db, mode, 'db.dat')
console.timeEnd('persisting')

// restore the database from disk
console.time('restoring')
const restored = await restoreFromFile(mode, path)
console.timeEnd('restoring')
JSON mode yields this error:
inserting: 21.506s
file:///home/ben/tmp/orama-persist-limit/node_modules/.pnpm/@[email protected]/node_modules/@orama/plugin-data-persistence/dist/index.js:50
serialized = JSON.stringify(dbExport);
^
RangeError: Invalid string length
at JSON.stringify (<anonymous>)
at persist (file:///home/ben/tmp/orama-persist-limit/node_modules/.pnpm/@[email protected]/node_modules/@orama/plugin-data-persistence/dist/index.js:50:31)
at async persistToFile (file:///home/ben/tmp/orama-persist-limit/node_modules/.pnpm/@[email protected]/node_modules/@orama/plugin-data-persistence/dist/server.js:16:24)
at async file:///home/ben/tmp/orama-persist-limit/main.mjs:35:14
Node.js v22.11.0
BINARY mode yields this error:
inserting: 21.573s
node:buffer:711
slice: (buf, start, end) => buf.hexSlice(start, end),
^
Error: Cannot create a string longer than 0x1fffffe8 characters
at Object.slice (node:buffer:711:37)
at Buffer.toString (node:buffer:863:14)
at persist (file:///home/ben/tmp/orama-persist-limit/node_modules/.pnpm/@[email protected]/node_modules/@orama/plugin-data-persistence/dist/index.js:60:45)
at async persistToFile (file:///home/ben/tmp/orama-persist-limit/node_modules/.pnpm/@[email protected]/node_modules/@orama/plugin-data-persistence/dist/server.js:16:24)
at async file:///home/ben/tmp/orama-persist-limit/main.mjs:34:14 {
code: 'ERR_STRING_TOO_LONG'
}
Node.js v22.11.0
So yes, the limit seems to be 512MB. Correct?
It shouldn't be. We're investigating, we'll keep you posted (cc. @matijagaspar, @faustoq)
It's just an assumption and probably too vague to be useful, but couldn't this be mitigated by using e.g. a streaming NDJSON parser/serializer? It would surely involve some significant rework of the actual code base, but IMHO it would remove these constraining limitations and significantly reduce memory consumption on larger data volumes.
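Just to sketch the idea (purely illustrative and not tied to Orama's actual export format; writeNdjson/readNdjson are made-up helpers):
import fs from 'node:fs'
import readline from 'node:readline'

// Write an iterable of records as NDJSON: one JSON document per line,
// so no single JSON.stringify call ever has to hold the whole database.
async function writeNdjson(path, records) {
  const out = fs.createWriteStream(path)
  for (const record of records) {
    if (!out.write(JSON.stringify(record) + '\n')) {
      await new Promise((resolve) => out.once('drain', resolve)) // respect backpressure
    }
  }
  await new Promise((resolve) => out.end(resolve))
}

// Read the file back line by line and hand each record to a callback
// (e.g. to re-insert it), instead of parsing one giant string.
async function readNdjson(path, onRecord) {
  const rl = readline.createInterface({ input: fs.createReadStream(path), crlfDelay: Infinity })
  for await (const line of rl) {
    if (line.trim() !== '') await onRecord(JSON.parse(line))
  }
}
Re-inserting on load would rebuild the index rather than restore a snapshot, so this only covers the document side, but it keeps every string far below the limit.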
Another thing I've noticed during testing: field size seems to be limited to roughly 100KB (see the rudimentary code above, where a whole 1MB block causes a different error). Sure, which sane person puts 100 KB of data into a single field? But that's maybe material for a separate issue.
I have the same problem... JSON.stringify() is limited to 512MB. Based on the object (RawData) returned from save(db), I tried to NDJSON the data. While this is easy for rawData.doc, it's less trivial for rawData.index or even rawData.sorting.sorts. The challenge is getting reasonably and equally sized chunks that can then be read and deserialized again.
I wonder how you do that on Orama Cloud... I guess you're not rebuilding the index every time a search comes in?
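For the chunking part, the best I can come up with is something along these lines (rough sketch, splitBySize is a made-up helper, and string length is only a rough proxy for bytes):
// Group serialized entries into chunks of roughly `maxBytes` each, so no
// single chunk ever comes near the ~512MB string limit. Each yielded chunk
// could go to its own file or block of NDJSON lines.
function* splitBySize(entries, maxBytes = 64 * 1024 * 1024) {
  let current = []
  let size = 0
  for (const entry of entries) {
    const line = JSON.stringify(entry)
    if (current.length > 0 && size + line.length > maxBytes) {
      yield current.join('\n')
      current = []
      size = 0
    }
    current.push(line)
    size += line.length + 1 // +1 for the newline
  }
  if (current.length > 0) yield current.join('\n')
}
That still leaves open how to split rawData.index itself into sensible pieces, though.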
I have made a working prototype for streaming write/read of the database. In addition to using MsgPack for serialization, I also incorporated gzip and zstd to compress the database.
Tested on:
- NodeJS v23.11.0
- WSL 2 + Ubuntu 24.04
- Intel Core i5-10400
import fs from "node:fs";
import zlib from "node:zlib";
import { Readable } from 'node:stream';
import { pipeline } from 'node:stream/promises';
import type { AnyOrama } from '@orama/orama'
import { create, load, save } from '@orama/orama'
import { encode, decodeAsync } from "@msgpack/msgpack";

export type CompressionType = "none" | "gzip" | "zstd"

export async function persistToFile<T extends AnyOrama>(db: T, path: string, compression: CompressionType): Promise<string> {
  const dbExport = await save(db);
  // Encode the exported database with MsgPack and wrap the result in a Buffer
  // without copying the underlying memory.
  const msgpack = encode(dbExport);
  const bufferExport = Buffer.from(msgpack.buffer, msgpack.byteOffset, msgpack.byteLength);

  // Note: zstd streams in node:zlib are only available on recent Node releases
  // (this prototype was tested on v23.11.0).
  if (compression === "none") {
    await pipeline(Readable.from(bufferExport), fs.createWriteStream(path));
  } else if (compression === "gzip") {
    await pipeline(Readable.from(bufferExport), zlib.createGzip(), fs.createWriteStream(path));
  } else if (compression === "zstd") {
    await pipeline(Readable.from(bufferExport), zlib.createZstdCompress(), fs.createWriteStream(path));
  } else {
    throw new Error("Unknown compression!");
  }
  return path;
}

export async function restoreFromFile<T extends AnyOrama>(path: string, compression: CompressionType): Promise<T> {
  const db = create({
    schema: {
      __placeholder: 'string'
    }
  });

  // `pipeline` passes the whole readable side as an async iterable,
  // which decodeAsync accepts directly.
  const decodeCb = async (chunk: unknown) => {
    // @ts-expect-error
    const res = await decodeAsync(chunk)
    // @ts-expect-error
    load(db, res);
  }

  if (compression === "none") {
    await pipeline(fs.createReadStream(path), decodeCb);
  } else if (compression === "gzip") {
    await pipeline(fs.createReadStream(path), zlib.createGunzip(), decodeCb);
  } else if (compression === "zstd") {
    await pipeline(fs.createReadStream(path), zlib.createZstdDecompress(), decodeCb);
  } else {
    throw new Error("Unknown compression!");
  }
  return db as unknown as T;
}
Testing code (adapted from @bennyzen):
import crypto from "node:crypto";
import { create, insert } from '@orama/orama'
import { persistToFile, restoreFromFile } from './persist-stream.js'

const inserts = 512 * 20
const blockSize = Math.floor(1048576 / 100) // ~10 KB of random bytes per document (integer size for randomBytes)

const payload = () => {
  // hex encoding roughly doubles the payload size per field
  return crypto.randomBytes(blockSize).toString("hex");
}

const db = create({
  schema: {
    payload: 'string',
  },
})

console.time('inserting')
for (let i = 0; i < inserts; i++) {
  await insert(db, {
    payload: payload(),
  })
}
console.timeEnd('inserting')

// persist the database to disk
console.time('persisting: none')
await persistToFile(db, "data/db.msgpack", "none")
console.timeEnd('persisting: none')

console.time('persisting: gzip')
await persistToFile(db, "data/db.msgpack.gz", "gzip")
console.timeEnd('persisting: gzip')

console.time('persisting: zstd')
await persistToFile(db, "data/db.msgpack.zst", "zstd")
console.timeEnd('persisting: zstd')

// restore the database from disk
console.time('restoring: none')
const restoredNone = await restoreFromFile("data/db.msgpack", "none")
console.timeEnd('restoring: none')

console.time('restoring: gzip')
const restoredGzip = await restoreFromFile("data/db.msgpack.gz", "gzip")
console.timeEnd('restoring: gzip')

console.time('restoring: zstd')
const restoredZstd = await restoreFromFile("data/db.msgpack.zst", "zstd")
console.timeEnd('restoring: zstd')
Results:
inserting: 13.529s
persisting: none: 5.439s
persisting: gzip: 34.371s
persisting: zstd: 19.138s
restoring: none: 8.117s
restoring: gzip: 15.751s
restoring: zstd: 14.387s
File sizes:
- db.msgpack = 1.3 GB
- db.msgpack.gz = 559 MB
- db.msgpack.zst = 541 MB
I also observed that the Node process uses about 2 GB of memory. In conclusion, I think streaming is the way to go for saving/loading large Orama databases (on the server side, at least). Also, zstd offers a great compression ratio and speed compared to gzip.
Just don't use Brotli, because it is awfully slow.