dictutil
Possible problem with marisa-build
@pgaskin No rush, but when you have some time please will you have a look at this.
Back in Feb 2020 you were kind enough to create some Windows 64-bit .exe versions of various Marisa functions. I'm wondering whether my current copy of marisa-build.exe is out-of-date for the purpose of updating my custom dictionaries to include the new prefix_exceptions file?
Here are the details behind my question. It was all going so well for a while ...
It was straightforward to create a prex.txt file (LF line-endings) containing variant_word tab headword_prefix for the variant words which do not share the same prefix as their headword, and then to build prefix_exceptions from it using:
marisa-build.exe -o prefix_exceptions prex.txt
All appeared to be OK and I rebuilt the custom dictionary with no problems. It even seemed to work after installing on the Kobo.
The problem arose when I decided to double-check everything by converting prefix_exceptions back to TXT format using marisa-dump.exe:
marisa-dump.exe prefix_exceptions > prex_marisa_dump.txt
I already knew to expect that prex_marisa_dump.txt would be created with CRLF line-endings and that it wouldn't be in the same sort sequence as the original prex.txt. Having allowed for those differences the 2 files ought to match.
However, the 2 files don't match because every input entry for prefix '11' comes back out of marisa-dump with the tab and the '11' missing. For all other prefixes the dumped entry matches the input entry, e.g.
AOK\t11\n gets dumped as AOK\r\n (original headword A-OK)
AOK\tao\n would get dumped correctly as AOK\tao\r\n
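The round-trip check described above can be sketched in Go roughly as follows. This is only an illustration: the helper names normalize and missing are made up for this example, and the two sample strings stand in for the contents of prex.txt and prex_marisa_dump.txt.

```go
package main

import (
	"bufio"
	"fmt"
	"sort"
	"strings"
)

// normalize splits text into lines, skips empty ones, and sorts the result.
// bufio.Scanner's default line splitting already drops a trailing "\r",
// so LF and CRLF input end up identical.
func normalize(text string) []string {
	var lines []string
	sc := bufio.NewScanner(strings.NewReader(text))
	for sc.Scan() {
		if l := sc.Text(); l != "" {
			lines = append(lines, l)
		}
	}
	sort.Strings(lines)
	return lines
}

// missing returns the entries of want that do not appear in got.
func missing(want, got []string) []string {
	seen := make(map[string]bool, len(got))
	for _, l := range got {
		seen[l] = true
	}
	var out []string
	for _, l := range want {
		if !seen[l] {
			out = append(out, l)
		}
	}
	return out
}

func main() {
	input := "AOK\t11\nAOK\tao\n" // stands in for prex.txt (LF)
	dump := "AOK\tao\r\nAOK\r\n"  // stands in for the dumped file (CRLF)
	for _, l := range missing(normalize(input), normalize(dump)) {
		fmt.Printf("not in dump: %q\n", l)
	}
}
```

With real files, the two strings would come from reading prex.txt and prex_marisa_dump.txt instead.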
At first I thought marisa-dump.exe might be the problem but now I don't think so because when I used it to dump the copy of prefix_exceptions from the official new dicthtml.zip all the prefix '11' entries showed correctly in the dumped TXT. So it must be:
- something to do with marisa-build.exe, or at least my copy of it.
- maybe prefix '11' entries need to be something other than variant_word tab headword_prefix when headword_prefix=11
I also double-checked, and a word that fits this corner case (there are only 205 of them) cannot be looked up in the updated dictionary on the Kobo.
For completeness I did a similar marisa-build / marisa-dump 'round-trip' with the words file. I had no problem matching the dumped TXT back to the original input index TXT for (headwords + variants).
If I need to provide any extra info just ask.
This won't be an issue with the Marisa binaries themselves, but it's possible the tools don't support building tries with tabs. In any case, my in-progress version of dictutil works fine with those files without any updates to Marisa. I'll send you a build of it once I finish with the new numbers in the words trie. Alternatively, since you're doing everything manually, if you're fine writing a small amount of Go code, you could use the github.com/pgaskin/dictutil/marisa package directly to build the trie from a list of strings (either hard-coded or read with os.Open and bufio.Scanner), then use ioutil.WriteFile to write it (I might have time to do that myself later today).
I'm afraid I'm not OK writing Go code but I've found a grubby workaround that seems to work - still testing.
If I stick an extra tab at the end of the line for var_words which redirect to prefix '11', i.e. write the prefix exceptions to TXT as AOK\t11\t\n rather than AOK\t11\n, then I now get a correct lookup for word 'AOK'.
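A minimal sketch of that workaround, applied to the lines of prex.txt before they are passed to marisa-build: append a tab only when the redirect prefix consists solely of digits ('11' is the case seen here). The function names are made up for this example. One possible explanation for why the extra tab helps, not confirmed anywhere in this thread, is that marisa-build reads a numeric field after the last tab as a key weight rather than as part of the key.

```go
package main

import (
	"fmt"
	"strings"
)

// isAllDigits reports whether s is non-empty and made up only of ASCII digits.
func isAllDigits(s string) bool {
	if s == "" {
		return false
	}
	for _, r := range s {
		if r < '0' || r > '9' {
			return false
		}
	}
	return true
}

// padNumericPrefix appends a trailing tab to a "variant\tprefix" line when
// the field after the last tab is all digits. Other lines, including lines
// that are already padded, are returned unchanged.
func padNumericPrefix(line string) string {
	i := strings.LastIndexByte(line, '\t')
	if i >= 0 && isAllDigits(line[i+1:]) {
		return line + "\t"
	}
	return line
}

func main() {
	for _, l := range []string{"AOK\t11", "AOK\tao"} {
		fmt.Printf("%q\n", padNumericPrefix(l))
	}
}
```

Single-field lines such as those in the words file contain no tab and pass through untouched.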
I suppose it's Sod's Law that I didn't have this "bright idea" (possibly) before going to the trouble of writing it all down.
My version of marisa-build definitely has a problem if the input file has 2 fields per line, separated by a tab, when the 2nd field is all digits. The problem does not occur if the input file has only one field and it's all digits.
So in Kobo terms there is no problem creating the words file. The prefix_exceptions file will only be problematic if the 2nd field (redirect to prefix) is all digits, e.g. '11'.
I don't know how many existing MobileRead custom dictionaries use variants at all - probably very few.
Anyway, my workaround seems to work OK for me, so you can close this if you don't think marisa-build has an issue. I'm not sure I agree with you but maybe it's different on Linux.
> I don't know how many existing MobileRead custom dictionaries use variants at all - probably very few.
None of the Penelope ones do, as it just discards variants of any kind. All of my personal dictionaries do, and so do about half the dictfiles I've received in support emails.
> Anyway, my workaround seems to work OK for me, so you can close this if you don't think marisa-build has an issue. I'm not sure I agree with you but maybe it's different on Linux.
What I meant was that marisa-build is a frontend to the actual Marisa library. There wouldn't be an issue with the Marisa library, so the problem is probably in marisa-build itself. Thus, you can use my Go bindings to the Marisa library to create a custom frontend for it which doesn't have the parsing issues.