link-grammar
Reducing linkage memory working set: typedef s64 Count_bin
Is there a reason the linkage count is 64 bits, considering that it gets truncated to INT_MAX in count.c, and that values wider than 24 bits are treated as PARSE_NUM_OVERFLOW?
I ask this because it makes Table_connector_s larger than 64 bytes, whereas using a 32-bit int instead would shrink it to 32 bytes.
Currently, for long sentences, most of the program time is spent waiting for access to Table_connector_s, whose size causes CPU cache thrashing. Memory operations on 64-bit ints are also said to be more costly than on 32-bit ones. (I just found another means of easing the Table_connector cache thrashing, and intend to send a PR soon for it and for other significant do_count() speed-ups.)
So unless I have missed something, I propose making the count 32 bits (even though I don't have empirical proof that this will make linkage faster).
BTW, I already have run-time results from reducing the Connector struct from 64 to 32 bytes, showing that reducing the memory working set needed during linkage can make linkage faster.
On the test sentence "The problem is, ..." (52 words without walls, counting punctuation too, tested under the connector-enumeration version), just reducing the connector size to 32 bytes makes linkage faster by >~12% (the connector enumeration itself makes it faster by >~14%, for a total of ~26%). However, this holds only when the connectors are packed into an array just before linkage; packing 64-byte connectors actually increases the run time (~3%), maybe because it is just CPU work without any caching benefit. Packing the connectors into an array may also allow 8- and even 4-byte connectors (e.g. no need for the next field), something I would like to test soon.
To facilitate such checks, I will first need to implement Connector access methods for all of its attributes, including "next", i.e. use connector_next(le) in do_count.c instead of le->next. This will make the source code more verbose, and I hope you don't mind that... (For my current connector-enumeration version, I have already implemented Connector accessors like connector_uc_hash(), connector_desc(), connector_string(), and more.)
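A minimal sketch of what such accessors might look like. Note that the Connector field layout below is illustrative, not the actual link-grammar definition; only the accessor names come from the discussion above:

```c
#include <stddef.h>

/* Hypothetical, simplified Connector layout -- the real struct
 * in link-grammar has more fields. */
typedef struct Connector_struct Connector;
struct Connector_struct
{
    const char *string;   /* connector name, e.g. "Ss*b" */
    unsigned int uc_hash; /* hash of the upper-case part */
    Connector *next;      /* next connector in the disjunct */
};

/* Accessors hide the representation, so "next" can later become,
 * say, an array index instead of a pointer without touching
 * every use site in do_count.c. */
static inline Connector *connector_next(const Connector *c)
{
    return c->next;
}

static inline const char *connector_string(const Connector *c)
{
    return c->string;
}

static inline unsigned int connector_uc_hash(const Connector *c)
{
    return c->uc_hash;
}
```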
Somewhere in the counting code, there is a "total_count = left_count * right_count" and this multiplication occurs in a recursive call. The 32-bit version quickly overflows even for "small" values of left, right count -- e.g. 2^16
I changed this to 64-bit a very long time ago, and that eliminated "most" overflows. However, it still overflowed, and so it needed a range-check. But I left it as 64-bit, because that was easier.
Still, 2^32 and even 2^16 are crazy-large numbers for parsing language. So changing the count to 32 bits is OK, I guess.
Please keep in mind: in standard C/C++, it is legal for a compiler to make "int" 64 bits, so if you want 32 bits, you must say int32_t.
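For example, using the C99 fixed-width types from stdint.h (the type and function names below are illustrative, not the actual link-grammar typedefs):

```c
#include <stdint.h>

/* To guarantee exactly 32 bits on every platform, use a fixed-width
 * type rather than "int", whose width is implementation-defined. */
typedef int32_t Count_bin32;   /* illustrative name */

/* Clamp a 64-bit intermediate count into the 32-bit storage type,
 * mirroring the truncation to INT_MAX mentioned above. */
static inline Count_bin32 clamp_count(int64_t c)
{
    return (c > INT32_MAX) ? INT32_MAX : (Count_bin32)c;
}
```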
Also: there is another solution -- change the counting algorithm to be aware of and track the factor tree. Thus, for example:
"This sentence (ambiguous part A) is a long (ambiguous part B)"
If there are 7 different ways of parsing (ambiguous part A) and 13 ways of parsing (ambiguous part B), and if there are no long links joining these two (i.e. only "one way" of parsing the rest), then the count is 7*13=91. It would be nice to be able to tell the user that the sentence contains two ambiguous parts, one being part A and the other part B. That way, the user can look at 7+13=20 possibilities, instead of the 91 possibilities that the current API forces the user to examine.
Detecting this factorization during counting is not hard: if "total_count = left_count * right_count" then it factors, and if "total_count = total_count + more" then it does not factor. Inventing a new user API to tell the user about factored linkages .. this is harder. I never tried to think hard about that.
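The detection rule above can be illustrated with a toy sketch (the types and names here are hypothetical, not link-grammar code): a count formed by a single product of sub-counts factors; once counts from different splitting points are summed, it no longer does.

```c
#include <stdbool.h>
#include <stdint.h>

/* A count together with a flag saying whether it is still a pure
 * product of sub-counts (and hence factors). */
typedef struct
{
    int64_t count;
    bool factors;
} Count;

/* "total_count = left_count * right_count" -- this factors. */
static Count combine(int64_t left, int64_t right)
{
    Count c = { left * right, true };
    return c;
}

/* "total_count = total_count + more" -- summing counts from
 * different splits destroys the factorization. */
static void accumulate(Count *total, Count more)
{
    if (total->count == 0) { *total = more; return; }
    total->count += more.count;
    total->factors = false;
}
```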
My gut intuition is that, in the long term, factorization is very important -- especially if "he/she" in one sentence refers to "adam/eve" in another sentence -- the combinatoric explosion from a lack of factorization is .. explosive.
A factorization API would need to be something like this: get_sentence_factors() returns a list of triples: (start_word, end_word, number_of_alternatives) that are non-overlapping, i.e. previous_end_word < current_start_word. If there are k regions, then the list is exactly k items long.
Next, there would have to be some `get_linkage_factor_alternative()` that returns a k-tuple, if there are k factors, and the tuple tells you which alternative is being explored in that linkage.
There would also need to be some sort of get_linkage_alternative() where you give it a k-tuple, and it returns just that linkage.
The above API could be made to be of the same style as the current API, or it could break with tradition...
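A sketch of what such an API might look like as C declarations, following the description above. All names, signatures, and types here are hypothetical, not part of the current link-grammar API:

```c
/* One non-overlapping ambiguous region of the sentence. */
typedef struct
{
    int start_word;              /* first word of the region */
    int end_word;                /* last word; previous_end_word < current_start_word */
    int number_of_alternatives;  /* parse count local to this region */
} Sentence_factor;

typedef struct Sentence_s Sentence;  /* opaque, as in the current API */
typedef struct Linkage_s Linkage;

/* Returns the k non-overlapping factored regions; *nfactors receives k. */
const Sentence_factor *get_sentence_factors(Sentence *sent, int *nfactors);

/* Fills alt[0..k-1] with the alternative chosen in each region
 * for this particular linkage. */
void get_linkage_factor_alternative(Linkage *lkg, int *alt);

/* Given a k-tuple of alternative indices, builds just that linkage. */
Linkage *get_linkage_alternative(Sentence *sent, const int *alt);

/* The payoff: the user examines the sum of the alternatives
 * instead of their product. */
static long total_linkages(const Sentence_factor *f, int k)
{
    long n = 1;
    for (int i = 0; i < k; i++) n *= f[i].number_of_alternatives;
    return n;
}

static long choices_to_examine(const Sentence_factor *f, int k)
{
    long n = 0;
    for (int i = 0; i < k; i++) n += f[i].number_of_alternatives;
    return n;
}
```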
Somewhere in the counting code, there is a "total_count = left_count * right_count" and this multiplication occurs in a recursive call. The 32-bit version quickly overflows even for "small" values of left, right count -- e.g. 2^16
I changed this to 64-bit a very long time ago, and that eliminated "most" overflows. However, it still overflowed, and so it needed a range-check. But I left it as 64-bit, because that was easier.
Still, 2^32 and even 2^16 are crazy-large numbers for parsing language. So changing the count to 32 bits is OK, I guess.
Can't it just be solved, in its current form, by a temporary (on-stack) 65-bit variable? This is because before do_count() exits, the count is in any case truncated to 32 bits.
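A minimal sketch of this approach, assuming a 32-bit stored count: do the multiplication in a wide temporary and clamp before storing, so the stored value never wraps. The macro value below is illustrative, based on the 24-bit overflow threshold mentioned above; the real code also accumulates products over all splitting points.

```c
#include <stdint.h>

#define PARSE_NUM_OVERFLOW (1 << 24)  /* illustrative threshold value */

/* Multiply two 32-bit counts in a 64-bit temporary, then clamp.
 * The product of two int32_t values always fits in int64_t,
 * so the intermediate cannot overflow. */
static int32_t mult_clamped(int32_t left, int32_t right)
{
    int64_t total = (int64_t)left * (int64_t)right;
    if (total >= PARSE_NUM_OVERFLOW) return PARSE_NUM_OVERFLOW;
    return (int32_t)total;
}
```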
Unless I am missing something, I will send a PR (to be a basis for further changes in my current WIP branches).
Detecting this factorization during counting is not hard: if "total_count = left_count * right_count" then it factors, and if "total_count = total_count + more" then it does not factor. Inventing a new user API to tell the user about factored linkages .. this is harder. I never tried to think hard about that.
We will need to find out which data structure is needed to hold the factorization info, or whether it can be extracted from the current Parse_Set struct.
The above API could be made to be of the same style as the current API, or it could break with tradition...
Maybe we can return data in JSON format, like what I suggested in issue #577?
a temporary 65 bit variable?
Yes.
I will send a PR
Yes, OK.
data structure is needed to hold the factorization info - or can it be extracted from the current Parse_Set struct?
Don't know. I suspect some but not all of the info is there.
JSON
Yes! It's probably a good idea to present JSON variants of many of the current APIs! Yes, I like that!
Related work status:
- Profiling of (classic) parsing of long sentences indicates that the main overhead is waiting for memory reads, especially in the count_context hash table (very low processor cache hit rate; the CPU is mostly in a memory wait state). There is a potential five-fold speed increase here if most of this specific wait state can be eliminated.
- The storage used per connector pair is bigger than needed. Reducing it will reduce the needed working set.
One example has been discussed above: the storage for the linkage count can easily be reduced to 32 bits. I made this modification, but it saves only about 1% CPU time on very long sentences.
Other WIP indicates that the storage per memoized connector pair can be reduced much further. For example, there is actually no need to store lw and rw if we instead store le+lw and re+rw (about a 4% speedup).
However, data-dependent memoization can save much more space. The idea is to spend more CPU cycles on packing/unpacking in order to save memory wait cycles.
-- Normally there is no need to memoize the full linkage count: memoizing 6 bits of it is enough for the vast majority of connector pairs (0 and 1 are the most common values). Memoizing the full count is needed only in the relatively rare cases of bigger counts.
-- There is no need to memoize le and re as 64-bit pointers: using an array for disjuncts and connectors allows memoizing fewer than 32 bits per connector (24 bits allow about 16M connectors, or 32M, depending on whether connectors of each direction are enumerated separately, while the biggest examples in en/corpus-failures.batch have < 1M connectors).
-- In the common cases of le==NULL and re==NULL, much less memoizing space is needed.
- Just now, the count_context table is one big monolithic table, and the connector pairs are hashed into it randomly. This severely thrashes the processor cache. A speedup is possible if memory references when parsing a given sentence word range are as local as possible. Possible solutions:
-- Allocate Table_connector elements from a pool (done). This is a very small change, with a speedup of about 8% for long sentences (it speeds up even short sentences).
-- Implement a connector pair table per sentence word (WIP). This not only localizes memory access during parsing, ~~but eliminates the need to memoize one of the connector pointers~~ (EDIT: in the version with a pair table per connector, not per word). To that end, extendable hash tables have to be used. (An extendable main hash table, replacing the current table, has already been tried.) This will overcome a current problem of the count_context hash table: it is often much too small or too big for the parsed sentence.
-- Store the memoized values inside the table (instead of in hash chains), in the hope of more localized memory access (needs a check).
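As an illustration of the pool-allocation idea (not the actual link-grammar implementation), a minimal fixed-size element pool might look like the following: elements are carved out of large contiguous blocks, which improves locality, and teardown becomes a handful of free() calls instead of one per element. Alignment is kept simple here by assuming element sizes that are multiples of the pointer size.

```c
#include <stdlib.h>
#include <stddef.h>

/* One large block; elements are carved out of the area after the header. */
typedef struct Pool_block { struct Pool_block *next; } Pool_block;

typedef struct
{
    size_t elem_size, per_block, used;
    Pool_block *blocks;   /* list of allocated blocks */
    char *cursor;         /* element area of the current block */
} Pool;

static void pool_init(Pool *p, size_t elem_size, size_t per_block)
{
    p->elem_size = elem_size;
    p->per_block = per_block;
    p->used = per_block;      /* force a new block on first alloc */
    p->blocks = NULL;
    p->cursor = NULL;
}

static void *pool_alloc(Pool *p)
{
    if (p->used == p->per_block)
    {
        Pool_block *b = malloc(sizeof(Pool_block) + p->elem_size * p->per_block);
        b->next = p->blocks;
        p->blocks = b;
        p->cursor = (char *)(b + 1);
        p->used = 0;
    }
    return p->cursor + p->elem_size * p->used++;
}

static void pool_free_all(Pool *p)
{
    while (p->blocks)
    {
        Pool_block *b = p->blocks;
        p->blocks = b->next;
        free(b);
    }
}
```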
To sum up: I will not yet send PRs for the finished work (32-bit linkage counter, Table_connector allocation from pools, eliminating the memoizing of lw and rw), even though the total speedup is 10+%, because there seem to be ways to get a greater speedup that still need to be tested.
Seems reasonable ...
Update: WIP on (2) above (for possible comments). I wrote:
-- No need to memoize le and re as 64 bit pointers - using an array for disjuncts and connectors allows to memoize less than 32 bits per connector (24 bits will allow for about 16M connectors (or 32M - depending whether connectors of each direction are enumerated separately) while the biggest examples in en/corpus-failures.batch have < 1M connectors).
Actually 16 bits per connector is enough, since the ordinal number of the connector on the word can be used (this allows 64K(-)+64K(+) connectors per word).
-- In the common cases of le==NULL and re==NULL, much less memoizing space is needed.
In that case a different, special hash table can be used.
For example, no actual need to store lw and rw
If the hash table is implemented per word pair, there is naturally no need to store lw and rw at all.
-- Normally no need to memoize the full linkage count - memoizing 6 bits of it is enough
I am still not sure, but maybe storing 24 bits of the linkage count is enough even now, since values greater than that are considered an overflow when the counts are recomputed while building the parse set (this needs more checking).
The summary of all of the above is that 8 bytes per memoized value (instead of the current 44) may be enough:
struct Table_connector_s
{
    int le:16, re:16;
    int count:24;
    int null_count:8;
};
(Table_connector *next is not used in my new hashing scheme since it leads to memory thrashing.)
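Since bit-field layout is implementation-defined, it may be worth verifying at compile time that the struct above actually packs into 8 bytes. A sketch, assuming a typical x86-64 ABI with gcc/clang, where le/re share one 32-bit allocation unit and count/null_count share another:

```c
/* The packed memoization entry proposed above. C11's _Static_assert
 * fails the build if the compiler lays the bit-fields out differently. */
struct Table_connector_s
{
    int le:16, re:16;     /* per-word connector ordinals */
    int count:24;         /* truncated linkage count     */
    int null_count:8;
};

_Static_assert(sizeof(struct Table_connector_s) == 8,
               "memoization entry is expected to pack into 8 bytes");
```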
I implemented a 32-bit Count_bin, and there was some CPU saving, but it was negligible (less than 1%).
I also tried something much more radical: I implemented a struct format in which the average memory per memoized element is only 4 bytes, and also a variable size hash table.
Even though it showed a tremendous measured reduction in total runtime memory for long sentences, the run times were much longer... I think this is because it needs to touch more memory locations in order to fetch each hashed entry, which causes more CPU cache misses.