lttoolbox

New section type that doesn't minimise

Open ftyers opened this issue 4 years ago • 9 comments

At the moment we add regexes in sections, and minimising regexes takes a long time. Perhaps we could have a special type="regex" section that is not minimised; that would speed up compilation of regex-heavy dictionaries.

This will likely break binary compatibility.

ftyers avatar Jun 26 '20 09:06 ftyers

Or it could be unioned with some other section after that section has been minimized, to avoid having to create a new section in the binary.

mr-martian avatar Jun 26 '20 14:06 mr-martian

@mr-martian that sounds a bit more complicated. Also, it would be cool to be able to give weights to sections, but I'll open another issue for that.

ftyers avatar Jun 26 '20 14:06 ftyers

Upon poking around a bit, I've determined that this would not break the binary format, since section types are just encoded as strings and lt-proc already handles multiple sections of the same type. Having lt-comp relabel type="regex" to type="standard" would result in complete backwards compatibility, or lt-proc could just recognize section names ending in @regex and treat them like @standard.

Either way, this should probably be accompanied by a way to mark <pardef>s as non-minimizing for the same reason. regex="yes", perhaps.

mr-martian avatar Mar 20 '21 17:03 mr-martian
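
For illustration only, the relabelling idea in the comment above could be sketched roughly as follows. This is not lttoolbox code: `compile_sections`, `relabel_section`, and the `minimise` callable are hypothetical names, and transducers are treated as opaque objects. The only assumption taken from the thread is that section labels are strings ending in `@standard` or, under the proposal, `@regex`.

```python
# Hypothetical sketch; none of these names exist in lttoolbox.
REGEX_SUFFIX = "@regex"
STANDARD_SUFFIX = "@standard"


def relabel_section(name):
    """Turn a proposed 'name@regex' label into 'name@standard'."""
    if name.endswith(REGEX_SUFFIX):
        return name[: -len(REGEX_SUFFIX)] + STANDARD_SUFFIX
    return name


def compile_sections(sections, minimise):
    """Minimise every section except regex ones, then relabel them.

    sections: dict mapping a 'name@type' label to an opaque transducer.
    minimise: callable standing in for the real minimisation step.
    """
    out = {}
    for name, transducer in sections.items():
        if not name.endswith(REGEX_SUFFIX):
            transducer = minimise(transducer)  # regex sections skip this
        new_name = relabel_section(name)
        if new_name in out:
            # Relabelling must not silently overwrite an existing section.
            raise ValueError("section name collision: " + new_name)
        out[new_name] = transducer
    return out
```

In this sketch a reader of the output would only ever see `@standard` labels, which is the backwards-compatible variant; the alternative of teaching the reader about `@regex` would simply skip the relabelling step.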

This should be optional. For development it should be fast to compile and test, but for distribution it should heavily optimize to the smallest/fastest output binary.

TinoDidriksen avatar Mar 21 '21 10:03 TinoDidriksen
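
One way to read the request above is as a per-build switch. A minimal sketch, assuming a hypothetical `release` setting and the proposed `@regex` label suffix (neither is an existing lt-comp option):

```python
# Hypothetical sketch; 'release' is an assumed build setting, not a real flag.
def should_minimise(section_name, release):
    """Decide whether a section gets minimised.

    Release builds minimise everything for the smallest/fastest binary;
    development builds skip the proposed regex sections to compile faster.
    """
    if release:
        return True
    return not section_name.endswith("@regex")
```

With this rule, `should_minimise("re@regex", release=False)` is false while the same section in a release build is still minimised.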

Also, it occurs to me that this is tricky because lt-comp minimizes each pardef separately in addition to each section.

mr-martian avatar May 27 '21 23:05 mr-martian

But this is about speed – is minimising each pardef on its own slow? (Last time I checked, the section minimisation at the end was the slow step.)

unhammer avatar May 28 '21 07:05 unhammer

Another alternative: 0493630 added the ability to compile dictionaries in several pieces, which should alleviate the burden of frequently recompiling the regex sections.

mr-martian avatar Mar 26 '22 15:03 mr-martian

In fact, we could have globally shared regex sections, as proposed in https://github.com/apertium/apertium/pull/161

mr-martian avatar Mar 26 '22 15:03 mr-martian

Minimisation has gotten quite a bit faster lately, but there's a related PR at https://github.com/apertium/lttoolbox/pull/165

unhammer avatar Sep 29 '22 17:09 unhammer