Optimization: Storage efficiency on a billion-edge-scale dataset
Description
I would like to be able to process a graph with a billion edges on a 32GB laptop.
Here is my input:
du -sh graph.parquet
3.7G graph.parquet
D select count(1) from read_parquet('graph.parquet');
┌──────────────────┐
│ count(1) │
│ int64 │
├──────────────────┤
│ 827497855 │
│ (827.50 million) │
└──────────────────┘
When it's imported into kuzu:
kuzu> match (a: node)-[b: rel]-(c: node) return count(1);
┌────────────┐
│ COUNT(1) │
│ INT64 │
├────────────┤
│ 1654995710 │
└────────────┘
(1 tuple)
(1 column)
Time: 4.40ms (compiling), 3588.53ms (executing)
kuzu> CALL storage_info('rel') RETURN sum(num_values);
┌─────────────────┐
│ SUM(num_values) │
│ INT128 │
├─────────────────┤
│ 4599560202 │
└─────────────────┘
(1 tuple)
(1 column)
kuzu> CALL storage_info('rel') RETURN sum(num_pages) * 4096 as num_bytes;
┌─────────────┐
│ num_bytes │
│ INT128 │
├─────────────┤
│ 86016495616 │
└─────────────┘
(1 tuple)
(1 column)
Time: 3.95ms (compiling), 36.56ms (executing)
In other words, the table takes 86 GB on disk for 827.5M edges (the undirected MATCH above reports 1.65B tuples because each edge is traversed in both directions). Queries that I want to run kill kuzu due to OOM.
In theory it should be possible to store this using:
827e6 * 4 bytes + compressed COL and ROW pointers.
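As a back-of-envelope check (assuming 4-byte node IDs and a standard CSR layout): 827,497,855 edges * 4 bytes ≈ 3.3 GB of COL indices per direction, so even storing both directions plus the row offsets should land well under 10 GB, roughly an order of magnitude below the 86 GB observed.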
This is further exacerbated by the cost of reading the graph into memory to run a graph algorithm. Most parallel libraries, such as networkit or ParClusterers, don't do zero-copy optimizations.
The ask here is twofold:
- More compact disk storage
- Example queries to read the graph into memory in Arrow format, using CSR or SuperCSR, so that a parallel algorithm library can consume it via zero-copy C++20 features.
Related: #5050, https://github.com/networkit/networkit/issues/1331
https://github.com/networkit/networkit/pull/1335 has a proposal to better integrate graph databases such as kuzu with parallel graph algorithm code. The idea is that kuzu would own the Arrow-formatted memory. The algo library would use the new immutable graph concept to iterate over it, but would not be responsible for managing the memory.
Requires more work, but basic tests are passing.
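For concreteness, here is a minimal sketch of what such an immutable graph concept could look like in C++20. The member names (node_count, neighbors) are my own placeholders, not the actual API proposed in that PR:

#include <concepts>
#include <cstdint>
#include <span>

// Sketch only: the database owns the adjacency memory; the algorithm
// library merely requires read access through a concept like this.
template <typename G>
concept ImmutableGraph = requires(const G& g, uint64_t u) {
    { g.node_count() } -> std::convertible_to<uint64_t>;
    // Zero-copy view of u's neighbors, backed by database-owned memory.
    { g.neighbors(u) } -> std::convertible_to<std::span<const uint32_t>>;
};

Any algorithm written against such a concept never allocates or frees the adjacency itself; it just iterates over spans the database hands out.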
hi @adsharma thanks! 👍
This branch has a mostly vibe-coded VACUUM command implementation (a no-op: it prints a message and exits).
I noticed that this was discussed in #4798 as well. Not sure what the resolution was.
Is this going in the right direction? Another way to do it is to allow attaching read-only parquet files as node/rel tables.
For implementation, I'm thinking something like COPY, but with some tunables set to pack storage more tightly and optimize for reads.
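Purely to illustrate the shape of it (hypothetical syntax, none of these tunables exist in kuzu today): something like VACUUM rel WITH (pack := true, read_optimized := true), implemented internally as a COPY-style rewrite of the table into freshly packed node groups.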
I looked at storage_info() in a bit more detail:
kuzu> CALL storage_info('knows') RETURN column_name, sum(num_pages) as pages order by pages desc;
┌───────────────────┬────────┐
│ column_name │ pages │
│ STRING │ INT128 │
├───────────────────┼────────┤
│ bwd__ID │ 7460 │
│ fwd__ID │ 7340 │
│ bwd_NBR_ID │ 6409 │
│ fwd_NBR_ID │ 6284 │
│ bwd_strength │ 2187 │
│ fwd_strength │ 2152 │
│ bwd_csr_offset │ 1288 │
│ fwd_csr_offset │ 1288 │
│ bwd_since │ 621 │
│ fwd_since │ 611 │
│ bwd_strength_null │ 31 │
│ bwd_since_null │ 31 │
│ bwd_csr_length │ 13 │
│ fwd_csr_length │ 13 │
│ fwd_since_null │ 6 │
│ fwd_strength_null │ 6 │
└───────────────────┴────────┘
kuzu> CALL storage_info('knows') where column_name = 'bwd__ID' RETURN *;
┌────────────┬───────────────┬───────────────┬───────────┬─────┬────────────┬─────────┬─────────┬──────────────────────┐
│ table_type │ node_group_id │ node_chunk_id │ residency │ ... │ num_values │ min │ max │ compression │
│ STRING │ INT64 │ INT64 │ STRING │ │ INT64 │ STRING │ STRING │ STRING │
├────────────┼───────────────┼───────────────┼───────────┼─────┼────────────┼─────────┼─────────┼──────────────────────┤
│ REL │ 0 │ -1 │ ON_DISK │ ... │ 655360 │ 2047 │ 8727038 │ INTEGER_BITPACKIN... │
│ REL │ 1 │ -1 │ ON_DISK │ ... │ 655360 │ 2048 │ 8983038 │ INTEGER_BITPACKIN... │
│ REL │ 2 │ -1 │ ON_DISK │ ... │ 655360 │ 53247 │ 8985086 │ INTEGER_BITPACKIN... │
│ REL │ 3 │ -1 │ ON_DISK │ ... │ 655360 │ 51199 │ 8987710 │ INTEGER_BITPACKIN... │
│ REL │ 4 │ -1 │ ON_DISK │ ... │ 655360 │ 4096 │ 8995327 │ INTEGER_BITPACKIN... │
│ REL │ 5 │ -1 │ ON_DISK │ ... │ 655360 │ 12287 │ 8993854 │ INTEGER_BITPACKIN... │
│ REL │ 6 │ -1 │ ON_DISK │ ... │ 819200 │ 0 │ 9987711 │ INTEGER_BITPACKIN... │
│ REL │ 7 │ -1 │ ON_DISK │ ... │ 655360 │ 0 │ 9128446 │ INTEGER_BITPACKIN... │
│ REL │ 8 │ -1 │ ON_DISK │ ... │ 655360 │ 1016384 │ 9980991 │ INTEGER_BITPACKIN... │
│ REL │ 9 │ -1 │ ON_DISK │ ... │ 655360 │ 1235519 │ 9985086 │ INTEGER_BITPACKIN... │
│ REL │ 10 │ -1 │ ON_DISK │ ... │ 655360 │ 1028671 │ 9987134 │ INTEGER_BITPACKIN... │
│ REL │ 11 │ -1 │ ON_DISK │ ... │ 655360 │ 1002048 │ 9993854 │ INTEGER_BITPACKIN... │
│ REL │ 12 │ -1 │ ON_DISK │ ... │ 655360 │ 1051200 │ 9998526 │ INTEGER_BITPACKIN... │
│ REL │ 13 │ -1 │ ON_DISK │ ... │ 655360 │ 1223231 │ 9995902 │ INTEGER_BITPACKIN... │
│ REL │ 14 │ -1 │ ON_DISK │ ... │ 655360 │ 1118784 │ 9989758 │ INTEGER_BITPACKIN... │
│ REL │ 15 │ -1 │ ON_DISK │ ... │ 169600 │ 1504703 │ 9999999 │ INTEGER_BITPACKIN... │
└────────────┴───────────────┴───────────────┴───────────┴─────┴────────────┴─────────┴─────────┴──────────────────────┘
The bulk of the storage is in the ID mapping columns (bwd__ID, fwd__ID, bwd_NBR_ID, fwd_NBR_ID above). This won't go away with VACUUM.
Potential optimizations:
- For undirected graphs, do not store both forward and backward directions, cutting storage in half.
- Allow exporting the graph with internal ids, so the user can see the mapping (see the sketch after this list).
- Allow re-importing the graph with no mapping cost.
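On the export side, kuzu already exposes internal IDs through id() and offset(), so something along these lines (treat it as a sketch; I haven't tested it at this scale) could dump the edge list keyed by internal offsets and make the mapping visible:
COPY (MATCH (a:node)-[:rel]->(c:node) RETURN offset(id(a)) AS src, offset(id(c)) AS dst) TO 'edges_by_offset.parquet';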
If a scheme like this has already been considered, I'd love to read up on past discussions of the topic.
For undirected graphs, this issue was opened and closed in the past: https://github.com/kuzudb/kuzu/issues/4320. There are some relevant discussions there.
The feature seems to be functional! I'm using this script to test:
https://gist.github.com/adsharma/3f0b204c55c7c48d9f94cde730cd1e40
python create_relationship_table.py --rows 10000000 --direction fwd
python create_relationship_table.py --rows 10000000 --direction both
304M kuzu-10m-both
206M kuzu-10m-fwdonly
118M kuzu-10m-nodes-only
struct KUZU_API internalID_t {
    offset_t offset;
    table_id_t tableID;
    // ...
};
Looks like this structure takes 16 bytes. It's needed for the most general case, but in many cases all entries in the rel table point to the same table, so tableID could be stored once in the schema as an optimization?
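A quick sanity check of the 16-byte figure, assuming the offset_t = uint64_t and table_id_t = uint64_t aliases I believe kuzu currently uses (worth double-checking against the source):

#include <cstdint>

using offset_t = uint64_t;    // assumed alias
using table_id_t = uint64_t;  // assumed alias

struct internalID_t {
    offset_t offset;    // 8 bytes
    table_id_t tableID; // 8 bytes
};

static_assert(sizeof(internalID_t) == 16); // no padding; 16 bytes per stored ID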
The next possible optimization is a table option to skip the internal ID scheme entirely: the user promises to provide the input in CSR-ready format, so it could be stored without these ID mapping columns. We'd then need to design a way to read the CSR format into an Arrow table for further processing by a graph algo library.
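As a strawman for that read path, here is a minimal zero-copy sketch using the Arrow C++ API. The function, its parameters, and the assumption that the database can hand out stable pointers to its offset/neighbor arrays are all hypothetical; only the Arrow calls are real:

#include <arrow/api.h>
#include <cstdint>
#include <memory>

// A CSR graph is two arrays of different lengths, so it maps more naturally
// to two Arrow arrays than to a single Arrow table (whose columns must all
// have the same length).
struct ArrowCsr {
    std::shared_ptr<arrow::UInt64Array> offsets;   // length: num_nodes + 1
    std::shared_ptr<arrow::UInt32Array> neighbors; // length: num_edges
};

// Wrap existing CSR memory as Arrow arrays without copying. The database
// must keep the underlying buffers alive while the arrays are in use.
ArrowCsr wrap_csr(const uint64_t* row_offsets, int64_t num_nodes,
                  const uint32_t* col_indices, int64_t num_edges) {
    auto off_buf = std::make_shared<arrow::Buffer>(
        reinterpret_cast<const uint8_t*>(row_offsets),
        (num_nodes + 1) * static_cast<int64_t>(sizeof(uint64_t)));
    auto col_buf = std::make_shared<arrow::Buffer>(
        reinterpret_cast<const uint8_t*>(col_indices),
        num_edges * static_cast<int64_t>(sizeof(uint32_t)));
    return {std::make_shared<arrow::UInt64Array>(num_nodes + 1, off_buf),
            std::make_shared<arrow::UInt32Array>(num_edges, col_buf)};
}

An algorithm library could then walk node u's neighbors via neighbors->raw_values() plus offsets->Value(u) without ever owning the memory.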
hi @adsharma
> Looks like this structure takes 16 bytes. It's needed for the most general case. But in many cases all entries in the rel table point to the same table. So you could store tableID once in the schema as an optimization?
For on-disk storage, we already store only the offsets; the tableID is stored just once as a common value.