pybitcointools icon indicating copy to clipboard operation
pybitcointools copied to clipboard

mnemonic bisect_left will fail (ungracefully) on other bip39 wordlists

Open wizardofozzie opened this issue 10 years ago • 12 comments

Bisect only works for sorted wordlists; not an issue if only english.txt is used, however, forks using standard bip39 wordlists will fail most ungracefully.

eg (Japanese wordlist, forum post)

wizardofozzie avatar Nov 25 '15 22:11 wizardofozzie

Yes, this line is wrong

wordlist_english=list(open(os.path.join(os.path.dirname(os.path.realpath(file)),'english.txt'),'r'))

cryptid11 avatar Dec 07 '15 00:12 cryptid11

however, forks using standard bip39 wordlists will fail most ungracefully.

english.txt is the standard bip39 english wordlist, and ALL the wordlists being lexographically sorted is a part of the bip39 spec

  • the wordlist is sorted which allows for more efficient lookup of the code words (i.e. implementations can use binary search instead of linear search)
    • this also allows trie (a prefix tree) to be used, e.g. for better compression

However, you aren't wrong about the bisection algorithm being incorrect, but for several other completely different reasons. I'm going to submit a patch soon.

Furthermore, english.txt is only provided for convenience. If you want to use another wordlist, you can pass it as a sorted list to the 'wordlist' parameter in most of the mnemonic functions

Steve132 avatar Jan 20 '16 21:01 Steve132

Perhaps it might be best for convenience reasons to include an assert loop in the code that makes sure a list is sorted before using it? This seems to have come up many times now.

On Wed, Jan 20, 2016 at 4:58 PM, Steven Braeger [email protected] wrote:

however, forks using standard bip39 wordlists will fail most ungracefully.

english.txt is the standard bip39 english wordlist, and the wordlists being lexographically sorted are a part of the bip39 spec https://github.com/bitcoin/bips/blob/master/bip-0039.mediawiki#Wordlist

  • the wordlist is sorted which allows for more efficient lookup of the code words (i.e. implementations can use binary search instead of linear search)
    • this also allows trie (a prefix tree) to be used, e.g. for better compression

However, you aren't wrong about the bisection algorithm being incorrect, but for several other completely different reasons. I'm going to submit a patch soon.

Furthermore, english.txt is only provided for convenience. If you want to use another wordlist, you can pass it as a sorted list to the 'wordlist' parameter in most of the mnemonic functions

— Reply to this email directly or view it on GitHub https://github.com/vbuterin/pybitcointools/issues/132#issuecomment-173374575 .

vbuterin avatar Jan 21 '16 00:01 vbuterin

@vbuterin In my fork there's an assert for bip39_check. I believe the checksum should flag mis-sorted word lists. (fwiw my code diverges for bip39 because it was written prior to this branch's PR)

wizardofozzie avatar Jan 21 '16 11:01 wizardofozzie

Actually, I've been doing some research with this and what should really happen is that the search needs to degrade to a linear search in the event that sorted(wordlist) != wordlist...OR a hashtable should be used to compute the index based on the position in the file..

This is because in order for the wallet to match the spec, the search index needs to match the index of the word AS SEEN IN THE FILE...whereas implementing binary search or resorting the input list will use the CURRENT LOCALE for the comparisons, thus possibly permuting the file order and thus making the detected indices incorrect.

The correct behavior will be one of the following options:

  1. Always implement linear search (this is what my current version does in my branch) Upside: this is simple and always works. Downside, it's O(n) for every lookup
  2. Implement a hash table that maps words->indices. Upside: This is simple-ish. Downside: it's O(n) for every lookup unless we implement it as a pre-processing stage (like have an internal wordlist type that indicates a correct hash table, and if the wordlist argument is a list convert it and cache the result or something)
  3. Detect if sorted(list) == list when a non-standard list is discovered. If it does, use the binary search path. Otherwise, use a linear list. This result will also have to be cached.

On Thu, Jan 21, 2016 at 6:44 AM, WizardOfOzzie [email protected] wrote:

@vbuterin https://github.com/vbuterin In my fork https://github.com/wizardofozzie/pybitcointools/blob/master/bitcoin/mnemonic.py#L101 there's an assert for bip39_check. I believe the checksum should flag mis-sorted word lists. (fwiw my code diverges for bip39 because it was written prior to this branch's PR) NB.

— Reply to this email directly or view it on GitHub https://github.com/vbuterin/pybitcointools/issues/132#issuecomment-173546537 .

Steve132 avatar Jan 21 '16 18:01 Steve132

  1. Detect if sorted(list) == list when a non-standard list is discovered. If it does, use the binary search path. Otherwise, use a linear list. This result will also have to be cached.

If you're computing the sorted list anyway, would it not be simpler to just cache the sorted list for each list?

On Thu, Jan 21, 2016 at 1:47 PM, Steven Braeger [email protected] wrote:

Actually, I've been doing some research with this and what should really happen is that the search needs to degrade to a linear search in the event that sorted(wordlist) != wordlist...OR a hashtable should be used to compute the index based on the position in the file..

This is because in order for the wallet to match the spec, the search index needs to match the index of the word AS SEEN IN THE FILE...whereas implementing binary search or resorting the input list will use the CURRENT LOCALE for the comparisons, thus possibly permuting the file order and thus making the detected indices incorrect.

The correct behavior will be one of the following options:

  1. Always implement linear search (this is what my current version does in my branch) Upside: this is simple and always works. Downside, it's O(n) for every lookup
  2. Implement a hash table that maps words->indices. Upside: This is simple-ish. Downside: it's O(n) for every lookup unless we implement it as a pre-processing stage (like have an internal wordlist type that indicates a correct hash table, and if the wordlist argument is a list convert it and cache the result or something)
  3. Detect if sorted(list) == list when a non-standard list is discovered. If it does, use the binary search path. Otherwise, use a linear list. This result will also have to be cached.

On Thu, Jan 21, 2016 at 6:44 AM, WizardOfOzzie [email protected] wrote:

@vbuterin https://github.com/vbuterin In my fork < https://github.com/wizardofozzie/pybitcointools/blob/master/bitcoin/mnemonic.py#L101

there's an assert for bip39_check. I believe the checksum should flag mis-sorted word lists. (fwiw my code diverges for bip39 because it was written prior to this branch's PR) NB.

— Reply to this email directly or view it on GitHub < https://github.com/vbuterin/pybitcointools/issues/132#issuecomment-173546537

.

— Reply to this email directly or view it on GitHub https://github.com/vbuterin/pybitcointools/issues/132#issuecomment-173671152 .

vbuterin avatar Jan 21 '16 23:01 vbuterin

No, because then the result would be incorrect. The whole point is that the spec says that the 11 bits correspond to the index of the word in the given wordlist file, in the order given in the file. This is important, because the spec doesn't specify how the list is supposed to be searched, only that the result must be consistent with the order given in the standard file for compatibility.

They are supposed to be sorted in the file, but the bug appears because the sort order for comparing two strings is locale-dependant...that is, if we compute cached_lists[id(wordlist)]=sorted(wordlist), and the code runs in a locale different than the one that generated the original standard file, the sort order will be WRONG and produce the incorrect 11 bits on a lookup, because the words will appear at a different index than the one in the file...which is unacceptable for obvious reasons

You could compute cached_hashtables[id(wordlist)]=dict([(value,index) for index,value enumerate(wordlist)]) and that would be correct AND be O(1), but that's some significant complexity when linear search through 2048 strings is probably not THAT expensive relative to how often this is likely to be needed, ESPECIALLY in python where it's unlikely to be a high-performance embedded application anyway.

Steve132 avatar Jan 22 '16 00:01 Steve132

I fixed this with a pull request https://github.com/vbuterin/pybitcointools/pull/141

It uses a hash table for the index lookup and word check operations which is O(1) and doesn't require locale-dependant string comparisons to do either operation. All functions which expect a wordlist can use this type as the wordlist for improved performance, but they all degrade gracefully to O(n) cases if the user provides a list as the wordlist instead of an instance of the type.

It also adds support for ALL the languages in bip39, which are now selectable. English remains the default, but you can use any of the others without needing to load them yourself (e.g. bitcoin.words_verify(spanish_phrase,wordlist=bitcoin.wordlists['spanish'])

Steve132 avatar Jan 24 '16 02:01 Steve132

@Steve132 I've just taken a cursory glance, but it looks fantastic so far, great work! Take a look at this discussion. Unfortunately wordlists will always have ongoing issues, both because of different standards (electrum 2.0 / bip39) and locale

wizardofozzie avatar Jan 25 '16 14:01 wizardofozzie

Unfortunately wordlists will always have ongoing issues, both because of different standards (electrum 2.0 / bip39) and locale

This patch fixed the locale-based search issues because it uses only the original index of the standard wordlist, NOT the index calculated from a sorted wordlist or a binary search with string comparisons. It remains fast by using a hash table.

I think someone should patch the BIP39 specification where it says the lists must be sorted so as to enable binary search, as we've seen binary-search is not possible to do in a locale-independant way. Instead perhaps the specification should have a recommendation that a hash function be used, perhaps even including C code and look up tables for a minimum-perfect-hash implementation for all languages.

Steve132 avatar Jan 25 '16 18:01 Steve132

@wizardofozzie hi! i'm trying to use your fork, but i found the tests are failing. can you add issues to your proyect, so users can report this kind of errors? thanks!

reiven avatar Jan 29 '16 18:01 reiven

@reiven Absolutely, mate. I've been dabbling in iOS Pythonista 2.0, Pythonista 1.6 and another branch, so I'll be consolidating and testing the code ASAP (~1-2 days maximum)

wizardofozzie avatar Jan 31 '16 06:01 wizardofozzie