Sharing & memory consumption after theory load
The SharingTables structure is used to compress the terms and types that appear in a theory's signature. The compression ensures that repeated terms and types are only stored once.
This compression is done within a theory's signature, but there's no attempt to handle cases where a term or type is repeated from an ancestor theory. I think that quite a bit of space and time could be saved here.
I came across this while investigating the high memory consumption of some CakeML tasks. I discovered that, after loading various theories, a call to PolyML.shareCommonData could compress Poly/ML's heap by 80%, sometimes close to 90%. The effect was substantial (> 70%) even if shareCommonData was limited to the conclusion terms of the theorems fetched by DB.listDB ().
This indicates that there's a fair amount of room to improve the sharing, at least for CakeML's use case. I suspect that other HOL theories would benefit from a more global sharing approach, and the .dat files might shrink quite a bit, although the export process would slow down as more of the world would have to be put in a canonisation table.
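For concreteness, the measurement was roughly along the following lines. PolyML.objSize and PolyML.shareCommonData are standard Poly/ML primitives; the projection out of DB.listDB's result is written from memory and may need adjusting.

```sml
(* Sketch of the measurement: collect the conclusions of all stored theorems,
   measure their heap footprint, share, and measure again.  The exact shape
   of DB.listDB's result is assumed here and may differ in detail. *)
val concls = List.map (fn (_, (th, _)) => Thm.concl th) (DB.listDB ())
val size_before = PolyML.objSize concls      (* heap words before sharing *)
val () = PolyML.shareCommonData concls
val size_after = PolyML.objSize concls       (* heap words after sharing *)
val saving = 1.0 - real size_after / real size_before
```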
For discussion, I've thought through how to implement this, and I'd be happy to have a go.
The interface of SharingTables would probably have to change a bit. I would also add somewhere a global table of the identifier/type/term vectors that were created on loading each existing theory.
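Purely as a sketch of the kind of state I mean (the element types and the use of Binarymap here are illustrative, not a proposed interface):

```sml
(* Illustrative only: one possible shape for a global table mapping each
   loaded theory's name to the vectors built when it was loaded.  The
   element types are simplified (ids as strings) and not the real interface. *)
type theory_vectors =
  { ids   : string Vector.vector,
    types : hol_type Vector.vector,
    terms : term Vector.vector }

val loaded_vectors : (string, theory_vectors) Binarymap.dict ref =
  ref (Binarymap.mkDict String.compare)

fun register_theory (thyname : string) (vs : theory_vectors) =
  loaded_vectors := Binarymap.insert (!loaded_vectors, thyname, vs)
```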
It would be good to get some feedback before I have a go at doing it.
I think it's a good idea. It may also be worth understanding where the blow-up occurs. For example, the pairing operator has the type 'a -> 'b -> 'a # 'b, which gets duplicated.
Konrad.
I think there are two failings:

- the sharing is done for each theory independently, which I guess is what you mean by "there's no attempt to handle cases where a term or type is repeated from an ancestor theory". (Within any given theory, terms and types that are from ancestors are certainly shared, for which see an IDS section.) The advantage of this is that the theories' sharing tables are transient.
- sharing only happens at the level of types and term-atoms (variables and constants). Repeated applications of the same term to the same argument (as happens in numerals, for example) are not shared at all.
Yes, I meant to address the first failing as @mn200 describes it: the sharing is done per-theory. And yes, the advantage of per-theory sharing is the clean interface, whereas across-theory sharing requires keeping more state around. I think the lookup table can stay transient, but the resulting type/term vector for each theory needs to be kept in memory.
I had not spotted the second issue. I had guessed from the interface (e.g. the existence of TMAp/TMAbs) that the system attempted to share compound terms. I'll investigate that also.
With regard to the second issue, it looks like SharingTables already knows how to share compound terms, but for some reason this isn't used, and .dat files only contain TMV and TMC.
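For reference, the encoding looks roughly like this (reconstructed from memory of SharingTables, so the exact constructor arguments may differ); the point is that TMAp and TMAbs refer back into the term vector by index, which is exactly what full compound-term sharing would need:

```sml
(* Roughly the shared_term encoding in SharingTables (from memory; the real
   constructor arguments may differ).  The int arguments are indices into
   the id/type/term vectors, so TMAp/TMAbs could in principle share repeated
   compound sub-terms, not just atoms. *)
datatype shared_term =
    TMV   of string * int    (* variable: name, index into the type vector *)
  | TMC   of int * int       (* constant: id index, type index *)
  | TMAp  of int * int       (* application: function term index, argument term index *)
  | TMAbs of int * int       (* abstraction: bound-variable term index, body term index *)
```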
Right - I did experiment with full term sharing but got quite poor results. It's possible I was doing something wrong, but my memory is that the tables were very large (though of course, on a per-theory basis, they shouldn't be any larger than the sum of the sizes of the theory's theorems), and it didn't seem to improve speed. (At the time the tables were SML values in the <foo>Theory.sml files, so I was forcing the compiler (which may have been just Moscow ML at the time) to read, process and compile those tables as SML, which is pretty unnecessary.)
OK, I've implemented a version of this (see https://github.com/talsewell/HOL/tree/theory-share-terms).
I've done some quick tests of effectiveness, and everything seems to be working. The *Theory.dat files inside HOL4 generally grow a bit, by 20-30%, because more terms are now stored in a more verbose format. A couple of theories did gain some sharing and shrank, though.
In the CakeML bootstrap translation, some *Theory.dat files shrink by over 20x, for instance explorerProgTheory.dat shrinks from 21M to 468K. There might be silly reasons for this, but it looks good.
I should probably run some more performance tests to look into the tradeoff. Is there a standard performance test anywhere?
Building the core theories is a somewhat reasonable performance test. Theory build times are logged to tools-poly/build-logs/, and you can compare times with the developers/comparelogs tool. You can also enable logging for other theories by passing a --logging option to the Holmake invocation.
I might just note where I'm up to here.
I compared core-theory build times and saw a fair amount of variation, with some theories up or down by 10-20% in time, and an overall saving of about 5%. I think some of the variation might come from parallel scheduling, and I'll devote some CPU time to running it again single-threaded.
I also wrote some code to time the phase that I'm most worried about, the merging of the term tables from all the ancestor theories to set up the sharing table before exporting. Indeed this can take 10-20s for some otherwise innocent theories in CakeML. That's quite a bit, and maybe I should investigate possible optimisations before going further.
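(The timing itself was nothing fancy; roughly the following, using only the Basis Timer structure:)

```sml
(* Minimal sketch of the timing wrapper used around the merge phase;
   only standard Basis Timer/Time functions, nothing HOL-specific. *)
fun time_phase name f x =
  let
    val timer = Timer.startRealTimer ()
    val result = f x
    val elapsed = Timer.checkRealTimer timer
  in
    print (name ^ " took " ^ Time.toString elapsed ^ "s\n");
    result
  end
```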
It occurs to me while typing the above that I should check for swapping. My test machine doesn't have a huge amount of RAM (8GB) and it would fit the evidence if parent vectors were sometimes being swapped out during the theory processing phase and swapped back in during the merge. Will investigate further.
Thanks for the update!
Is there something we could merge on this front?
I am only peripherally involved in this discussion. I think it is generally a good idea. Since a lot of the things to be shared are trees, I was intrigued by the general idea of using grammars for compression. Apparently this can work well; see the literature on compressing XML, for example, or https://pdfs.semanticscholar.org/4b50/32229b20983a30394ea44a35e7808304d97d.pdf as an example with a NICTA connection.
Anyway, it could be a neat little final-year project to study trade-offs between the various compression algorithms.
(Obviously this is not a helpful comment in the short term, I just wanted to get the grammar-based compression stuff out of my brain and now I can erase it from my whiteboard.)
OK, back to my desk this week. I have a version that might be ready to merge, but the last time I tried it, it ran into a problem with PolyML. Different PolyML versions might fix that, but HOL4 users might not want (more) constraints on which PolyML version they use. I'll get around to contacting David Matthews and seeing if we can find a workaround at this end.
@konrad-slind: I think you're right that a lot could be done to compress these structures better, but I'm not sure how helpful that is here. The main goal is to reduce CPU time (rather than disk space or network bandwidth), so "clever" encoding schemes may cost too much in encode-time to be worth it. The other concern is that this is part of the essential core of the system, so a complicated implementation might be a problem.
Any further word on this?
No, unfortunately. As @konrad-slind pointed out, it's a little tricky to write a version that's obviously an improvement, since it costs CPU time on each theory encoding, but saves disk/CPU/memory on startup of theories (like CakeML theories) that load a lot of things.
I had a version that was working OK, but for some reason I started hitting PolyML issues that I didn't succeed in understanding, or in minimising and running past the PolyML developers. Basically I gave up. I still have one of my versions on GitHub, but I haven't looked at it in a long time.