Higher than expected memory usage when using yyjson_read_file
Reading a large (4.0GB) file with yyjson_read_file results in significantly higher memory usage than expected (~2.5x the file size). The exact version of yyjson.c/h used is here: https://github.com/TkTech/py_yyjson/blob/main/yyjson/yyjson.c
yyjson_read_fp allocates a buffer (with padding) for the file contents on line 6930:
/* read the entire file in one call */
buf_size = (usize)file_size + YYJSON_PADDING_SIZE;
buf = alc.malloc(alc.ctx, buf_size);
if (buf == NULL) {
    return_err(MEMORY_ALLOCATION, "fail to alloc memory");
}
if (fread_safe(buf, (usize)file_size, file) != (usize)file_size) {
    return_err(FILE_READ, "file reading failed");
}
and then adds YYJSON_READ_INSITU to the flags on line 6971 before handing the padded buffer to the parser, roughly:
flg |= YYJSON_READ_INSITU;
doc = yyjson_read_opts((char *)buf, (usize)file_size, flg, &alc, err);
Because the JSON file is minified, it ends up calling read_root_minify on line 6830:
/* read json document */
if (likely(char_is_container(*cur))) {
    if (char_is_space(cur[1]) && char_is_space(cur[2])) {
        doc = read_root_pretty(hdr, cur, end, alc, flg, err);
    } else {
        doc = read_root_minify(hdr, cur, end, alc, flg, err);
    }
} else {
    doc = read_root_single(hdr, cur, end, alc, flg, err);
}
and on line 5948, read_root_minify will allocate a 10.5GB buffer:
val_hdr = (yyjson_val *)alc.malloc(alc.ctx, alc_len * sizeof(yyjson_val));
read_root_pretty appears to do the same. Taking 10.5GB to parse a 4GB document that is almost exclusively repeated strings, with INSITU set, is definitely unexpected. If ~2.5x is the expected ratio of memory needed to parse a document, it should be part of the documentation.
Yes, the ~2.5x ratio is as expected.
Before parsing JSON, yyjson estimates the number of yyjson_val and pre-allocates memory.
For your example, 4GB / YYJSON_READER_ESTIMATED_MINIFY_RATIO * sizeof(yyjson_val) => 10.67GB.
You can find these ratios here:
https://github.com/ibireme/yyjson/blob/ad77257e5dd959f52237302cd2fa59b0590fffef/src/yyjson.c#L287-L306
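Plugging in the numbers, a quick sketch of that arithmetic (the minify ratio of 6 and the 16-byte yyjson_val are taken from the linked source, assuming a 64-bit build):

#include <stdio.h>

int main(void) {
    unsigned long long file_size = 4ULL << 30;  /* 4.0GB input */
    unsigned long long ratio = 6;               /* YYJSON_READER_ESTIMATED_MINIFY_RATIO */
    unsigned long long val_size = 16;           /* sizeof(yyjson_val) on 64-bit */
    unsigned long long estimate = file_size / ratio * val_size;
    printf("estimated val pool: %.2f GB\n",
           estimate / (double)(1ULL << 30));    /* prints ~10.67 GB */
    return 0;
}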
If needed, I can expose those ratios as compile-time params, or add a read_flag to disable the pre-allocation.
Looking through the code, it's possible I've very much misunderstood the intention of INSITU. From the documentation:
Read the input data in-situ. This option allows the reader to modify and use input data to store string values, which can increase reading speed slightly. The caller should hold the input data before free the document. The input data must be padded by at least YYJSON_PADDING_SIZE bytes. For example: [1,2] should be [1,2]\0\0\0\0, input length should be 5.
To me, this says that it's going to be referencing the strings in the input document instead of copying them into the string pool. Considering the document is almost entirely strings, I expect the final memory to be just `len(source_document) + (num_of_val * sizeof(yyjson_val))`.
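For illustration, a minimal in-situ read matching the quoted contract; this is a sketch using only the public yyjson.h API, and the four padding bytes assume the default YYJSON_PADDING_SIZE of 4:

#include <stdio.h>
#include "yyjson.h"

int main(void) {
    /* input padded with YYJSON_PADDING_SIZE (4) zero bytes, as the quoted
       documentation requires; buf must stay writable and alive while the
       document is in use */
    char buf[] = "[\"a\",\"b\"]\0\0\0\0";
    yyjson_doc *doc = yyjson_read_opts(buf, 9, YYJSON_READ_INSITU, NULL, NULL);
    /* with INSITU, string values point into buf instead of a copied pool */
    yyjson_val *first = yyjson_arr_get_first(yyjson_doc_get_root(doc));
    printf("first element: %s\n", yyjson_get_str(first));
    yyjson_doc_free(doc);
    return 0;
}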
Even with the pre-allocation disabled, the final memory usage ends up the same.
The yyjson_doc has a str pool and a val pool.
As parsing goes on, the val pool keeps growing; each time it grows by 1.5x.
So with pre-allocation disabled, the final memory will be between `len(source_document) + (num_of_val * sizeof(yyjson_val))` and `len(source_document) + 1.5 * (num_of_val * sizeof(yyjson_val))`.
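A small sketch of that growth pattern (the starting capacity and value count here are copied from the test further down in this thread, so treat them as illustrative):

#include <stdio.h>

int main(void) {
    /* grow a capacity by 1.5x until it covers num_of_val * sizeof(yyjson_val) */
    unsigned long long need = 272216242ULL * 16;  /* num_of_val * sizeof(yyjson_val) */
    unsigned long long cap = 6872064ULL;          /* initial val pool size */
    while (cap < need) cap += cap / 2;            /* 1.5x growth */
    printf("final pool: %llu bytes (%.3fx of the minimum)\n",
           cap, (double)cap / (double)need);      /* ~1.04x here */
    return 0;
}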
You can try setting the ratio to a large value, for example, #define YYJSON_READER_ESTIMATED_MINIFY_RATIO 10000, then see if the memory usage is as expected.
This should be similar to disabling pre-allocation.
#define YYJSON_READER_ESTIMATED_MINIFY_RATIO 10000 resulted in a tighter allocation curve, but the final size is unchanged. Either something is up here, or the size of a yyjson_val is much larger than I thought.
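A quick way to rule out the latter (on a typical 64-bit build this prints 16):

#include <stdio.h>
#include "yyjson.h"

int main(void) {
    /* yyjson_val is two 8-byte words (tag + value payload) on 64-bit builds */
    printf("sizeof(yyjson_val) = %zu\n", sizeof(yyjson_val));
    return 0;
}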
This produces a sample that has more tokens than the original document but still shows the issue and is suitable for tests:
import json
import random
import string


def generate_random_string(length: int) -> str:
    """Generate a random string of specified length."""
    return "".join(
        random.choices(string.ascii_letters + string.digits, k=length)
    )


def generate_test_object() -> dict[str, str]:
    """Generate a single test object with random key-value pairs."""
    return {
        "id": generate_random_string(10),
        "name": generate_random_string(15),
        "description": generate_random_string(50),
        "category": generate_random_string(8),
    }


def generate_large_json(
    target_size_gb: int = 4, output_file: str = "test_data.json"
) -> None:
    """Generate a JSON file of approximately the specified size in GB."""
    target_size_bytes = target_size_gb * 1024 * 1024 * 1024
    current_size = 0
    print(f"Generating approximately {target_size_gb}GB of JSON data...")
    with open(output_file, "w") as f:
        f.write("[\n")
        while current_size < target_size_bytes:
            obj = generate_test_object()
            json_str = json.dumps(obj)
            if current_size > 0:
                f.write(",\n")
            f.write(json_str)
            current_size = f.tell()
        f.write("\n]")
    final_size_gb = current_size / (1024 * 1024 * 1024)
    print(f"Done! Generated {final_size_gb:.2f}GB of JSON data")


if __name__ == "__main__":
    generate_large_json()
This will give a document that is ~4GB and uses ~10GB of memory, with the vast majority of it being strings in an initial, padded blob with INSITU.
I tested with this test_data.json:
// patch: #define YYJSON_READER_ESTIMATED_MINIFY_RATIO 10000
#include <stdio.h>
#include <stdlib.h>
#include "yyjson.h"

static void *my_malloc(void *ctx, size_t size) {
    printf("malloc: %zu\n", size);
    return malloc(size);
}

static void *my_realloc(void *ctx, void *ptr, size_t old_size, size_t size) {
    printf("realloc: %zu->%zu\n", old_size, size);
    return realloc(ptr, size);
}

static void my_free(void *ctx, void *ptr) {
    printf("free\n");
    free(ptr);
}

static const yyjson_alc MY_ALC = {
    my_malloc,
    my_realloc,
    my_free,
    NULL
};

int main(void) {
    yyjson_doc *doc = yyjson_read_file("test_data.json", 0, &MY_ALC, NULL);
    size_t val_count = yyjson_doc_get_val_count(doc);
    yyjson_doc_free(doc);
    printf("-----\n");
    printf("total val count: %zu\n", val_count);
    printf("expected val pool size: [%zu, %zu]\n",
           val_count * sizeof(yyjson_val),
           (size_t)(1.5 * val_count * sizeof(yyjson_val)));
    return 0;
}
And here's the result:
malloc: 4294967364 // input str size
malloc: 6872064 // initial val pool size
realloc: 6872064->10308096
realloc: 10308096->15462144
realloc: 15462144->23193216
realloc: 23193216->34789824
realloc: 34789824->52184736
realloc: 52184736->78277104
realloc: 78277104->117415648
realloc: 117415648->176123472
realloc: 176123472->264185200
realloc: 264185200->396277792
realloc: 396277792->594416688
realloc: 594416688->891625024
realloc: 891625024->1337437536
realloc: 1337437536->2006156304
realloc: 2006156304->3009234448
realloc: 3009234448->4513851664 // final val pool size
free
free
-----
total val count: 272216242
expected val pool size: [4355459872, 6533189808]
It shows the final memory usage is 4GB (str) + 4.2GB (vals). I noticed similar memory consumption in the System Monitor too.
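Summing the two live allocations confirms this (a quick check; the numbers are copied from the log above):

#include <stdio.h>

int main(void) {
    /* numbers copied from the allocator log above */
    unsigned long long str_buf = 4294967364ULL;   /* padded input buffer */
    unsigned long long val_pool = 4513851664ULL;  /* final val pool size */
    printf("total: %.2f GB = %.2f GB (str) + %.2f GB (vals)\n",
           (str_buf + val_pool) / (double)(1ULL << 30),
           str_buf / (double)(1ULL << 30),
           val_pool / (double)(1ULL << 30));      /* 8.20 = 4.00 + 4.20 */
    return 0;
}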
Maybe you could add some logging to your allocator and check whether the output matches what we've got here.