project-m36
Use more optimal structure for `Relation`
Right now `Relation` seems to have a naive implementation that will certainly hog lots of memory:
- It stores tuples in a list
- It keeps the name and type of every attribute for every row
With a few lens tricks you could make the API just as nice (or nicer) but with a more optimal storage profile.
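Roughly, the concern is a layout along these lines (a simplified, hypothetical sketch, not the actual definitions):

```haskell
-- Hypothetical, simplified types illustrating the layout described above:
-- tuples live in a plain list and every tuple carries its own copy of each
-- attribute's name and type.
import qualified Data.Text as T

data AtomType = IntAtomType | TextAtomType
  deriving (Eq, Show)

data Atom = IntAtom Int | TextAtom T.Text
  deriving (Eq, Show)

data Attribute = Attribute T.Text AtomType
  deriving (Eq, Show)

-- Each tuple repeats the attribute metadata alongside every atom.
newtype RelationTuple = RelationTuple [(Attribute, Atom)]
  deriving (Eq, Show)

newtype Relation = Relation [RelationTuple]
  deriving (Eq, Show)
```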
I see now that the only thing kept in a list is the contents of RelationTupleSet, which is usually going to be short. So this may not be as big of a deal as I thought.
Believe it or not, the implementation used to be even more naïve! The original implementation (still available in the git history, by the way) implemented tuples using Data.Map and tuple sets using Data.Set. Some basic heap profiling revealed that I would be unlikely to ever be able to generate and store tens of millions of tuples, even at arity 1! The current layout has the least amount of overhead compared to the other implementations I tested (see bigrel), but that certainly doesn't mean there isn't room for improvement.
When a relation is created, the tuple attributes should be shared amongst all the tuples, so the only memory cost should be an additional pointer. Having the attributes within the tuple context is handy because otherwise I would need to pass the attributes around with the tuple anyway to validate and operate on tuples, for example, when processing them one-by-one (such as with a join). When constructing a new set of tuples, we do go through some gymnastics to ensure that the tuples' vectors and attributes are all in the same order (see reorderTuple); otherwise we would not be able to share the attributes vector.
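As a rough illustration of that sharing (simplified, hypothetical types rather than the real code), the idea is something like:

```haskell
-- Simplified, hypothetical sketch: every tuple points at one shared
-- attributes vector, and reorderTuple rearranges a tuple's atoms to match
-- that shared ordering so the vector really can be shared.
import qualified Data.Text as T
import qualified Data.Vector as V

type AttrName = T.Text
data Atom = IntAtom Int | TextAtom T.Text
  deriving (Eq, Show)

type Attributes = V.Vector AttrName            -- one value shared by all tuples
data Tuple = Tuple Attributes (V.Vector Atom)  -- the Attributes field is just a pointer
data Relation = Relation Attributes [Tuple]

-- Rearrange a tuple's atoms from its own attribute ordering into the target
-- ordering; afterwards the target Attributes value can replace the tuple's own.
reorderTuple :: Attributes -> Tuple -> Tuple
reorderTuple target (Tuple attrs atoms) = Tuple target (V.map pick target)
  where
    pick name = case V.elemIndex name attrs of
                  Just i  -> atoms V.! i
                  Nothing -> error ("missing attribute: " ++ T.unpack name)
```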
In summary, there is certainly room for improvement here, but heap profiling has shown the current layout to be pretty good. I think we could benefit from some better data structures and strictness analysis, but I don't see off-hand how lenses would help because, if I understand the suggestion correctly, we would need to pass around the Relation to almost every function, even when relations are not strictly relevant.
Do you know if switching to pipes or conduits for tuples would reduce or increase memory usage when compared to a standard list?
Very cool! I guess you did your homework on this one. pipes and conduit don't provide any containers so you'd still have to choose between a list, vector, map, etc.
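For example (a minimal sketch with toy data, using combinators that conduit actually exports), a stream still starts from and collects back into an ordinary container:

```haskell
-- Streaming doesn't remove the container choice: the source is fed from a
-- list (or vector, etc.) and the sink collects the results back into one.
import Conduit

doubleAll :: [Int] -> [Int]
doubleAll xs = runConduitPure $ yieldMany xs .| mapC (* 2) .| sinkList
```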
Are your benchmarks posted anywhere? They may be helpful in getting people on board with the project.
Hm, I hadn't thought of posting the results because they just compared specific internal structures. Do you think that would be of general interest? I think there is still room for internals improvements- I merely chose the best option from a pool of naïve options.
Of course, ideally, Project:M36 would be comparable to other RDBMSs. But even for now having some metrics would be helpful. For example, if someone is wondering, "Can I use this on my 500-user web app with a few thousand rows?" the metrics could help answer that question.
I see- great point. Right now, there are no documented performance characteristics. That would certainly be useful to see how Project:M36 holds up.
> Right now, there are no documented performance characteristics
I've got an app: 8 fields x 80,000 tuples, a 21 MB database on disk, and memory usage is about 200~240 MB. It runs fine locally but gets killed by the OOM killer on my 500 MB-memory droplet. It seems that readFileDeserialise uses the most memory.
Indeed, I think that this is one area where the streaming tuples should make a big difference. Currently, we read tuple expressions from disk all-at-once, so I can see how that could blow out the memory usage.
Maybe there is some low-hanging fruit we can address on the master branch. Previously, we used zlib compression for on-disk storage, but the on-disk footprint here is already small, so that probably won't help.
I'll create a memory-focused benchmark which we can use going forward.
Are you willing to share a heap profile here?
I only execute a restricted expr query after the app is initialized.

Ah, I found some low-hanging fruit. While Project-M36 creates tuples with shared attributes, when reading them back from disk it ends up with loads of duplicate attributes (one for each tuple), which wastes the majority of the memory. I'm working on a fix.
@YuMingLiao , could you take the 105 branch for a spin? In my example testing using "OnDiskClient", the branch reduces memory usage by an order-of-magnitude by sharing a single Attributes vector for all tuples instead of one per tuple. Still not amazing, but certainly an improvement. The memory profile shows that the actual data in the atoms now consumes the most memory.
To be clear, you will need to recreate the db from scratch since this changes the layout on-disk.

I don't understand how the on-disk representation would reduce memory usage, but it's working. The memory usage is down from ~240 MB to ~160 MB. Nice punch!
I couldn't find benchmark/OnDiskClient.hs to reproduce this. I guess it's in your .gitignore or something.
Oops, sorry. OnDiskClient is now pushed to the branch.
Before the patch, when we create tuple sets, each tuple references one shared Attributes blob. However, during deserialization, the attributes would be copied, one per tuple. After the patch, we deserialise the tupleset with one shared Attributes value.
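A simplified before/after sketch of that change (hypothetical types, not the actual serialisation code):

```haskell
import qualified Data.Text as T
import qualified Data.Vector as V

type Attributes = V.Vector T.Text
data Tuple = Tuple Attributes (V.Vector Int)

-- Before: the decoder produced one Attributes copy per tuple, so N tuples
-- kept N structurally identical attribute vectors alive in memory.
fromPerTupleAttrs :: [(Attributes, V.Vector Int)] -> [Tuple]
fromPerTupleAttrs = map (uncurry Tuple)

-- After: decode the Attributes once for the whole tupleset and let every
-- tuple reference that single shared value.
fromSharedAttrs :: Attributes -> [V.Vector Int] -> [Tuple]
fromSharedAttrs attrs = map (Tuple attrs)
```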
I'm doing some more profiling- there is likely more low-hanging fruit like this.
I have further reduced the memory and CPU cost of Merkle hash validation. Could you take this branch for another spin?
Using OnDiskClient, with 100,000 tuples, the on-disk storage is 3.8 MB and memory usage is 30 MB of RAM. Is this acceptable for your use-case?
With the first version of that branch, my server was already able to run! Thank you! I'll try the second one and report back.
Hopefully, the improvements to the hashing should reduce the memory ceiling by another order-of-magnitude.

8 fields x 80,000 tuples: now it's 13 MB on disk and 90 MB in memory.
It's pretty amazing. Thanks!
Great! I'll merge this in. There is more low-hanging fruit here, but these ratios are pretty good for now. Expect further improvements to performance and memory usage once the streamly work is complete.