algorithm-archive
Edge case fail in all Huffman Encoding implementations.
Most (possibly all) of the Huffman encoding implementations fail when given a string consisting of a single character repeated (think 'aaaaaaaaaaaaa' or 'bbbb'): they either raise errors or produce strange empty output. Here are some examples from Python, Java, and Lua respectively. I assume this needs to be fixed, because nowhere does the chapter state that the algorithm shouldn't work on one-character strings.
These are the ones so far that we know are broken (I or another maintainer have personally tested them):
- [x] C
- [ ] C++
- [ ] C#
- [ ] Go
- [ ] Java
- [ ] Javascript
- [x] Julia
- [ ] Lua
- [ ] Python
- [ ] Rust
and those that are untested:
- [ ] x86-64 Assembly
- [ ] Clojure
- [ ] Haskell
- [ ] Scala
Here is a first stab at fixing the Python implementation:
```diff
diff --git a/contents/huffman_encoding/code/python/huffman.py b/contents/huffman_encoding/code/python/huffman.py
index 1e1cc54..75700cd 100644
--- a/contents/huffman_encoding/code/python/huffman.py
+++ b/contents/huffman_encoding/code/python/huffman.py
@@ -11,6 +11,9 @@ def build_huffman_tree(message):
     frequencies = Counter(message)
     trees = frequencies.most_common()
+    if len(trees) == 1:
+        return [trees[0][0]]
+
     # while there is more than one tree
     while len(trees) > 1:
@@ -37,6 +40,9 @@ def build_codebook(tree, code=''):
     codebook = []
+    if len(tree) == 1:
+        return [(tree[0], '0')]
+
     # split the tree
     left_tree, right_tree = tree
```
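To make the patched behavior concrete, here is a self-contained sketch of the idea (my own minimal tree and codebook representation, not the archive's exact file), where a single-character message yields a one-element tree and the single code `'0'`:

```python
from collections import Counter

def build_huffman_tree(message):
    frequencies = Counter(message)
    trees = frequencies.most_common()  # [(symbol, count), ...], most common first

    # Edge case: only one distinct symbol, so return a one-element "tree"
    if len(trees) == 1:
        return [trees[0][0]]

    # Repeatedly merge the two least common trees
    while len(trees) > 1:
        tree_right, weight_right = trees.pop()
        tree_left, weight_left = trees.pop()
        trees.append(((tree_left, tree_right), weight_left + weight_right))
        trees.sort(key=lambda t: t[1], reverse=True)
    return trees[0][0]

def build_codebook(tree, code=''):
    # Edge case: a one-element tree gets the single code '0'
    if len(tree) == 1:
        return [(tree[0], '0')]

    left_tree, right_tree = tree
    codebook = []
    for subtree, bit in ((left_tree, '0'), (right_tree, '1')):
        if isinstance(subtree, tuple):
            codebook += build_codebook(subtree, code + bit)
        else:
            codebook.append((subtree, code + bit))
    return codebook

def encode(message):
    tree = build_huffman_tree(message)
    codebook = dict(build_codebook(tree))
    return codebook, ''.join(codebook[c] for c in message)
```

With this sketch, `encode("bbbb")` returns the codebook `{'b': '0'}` and the bitstring `"0000"` instead of crashing.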
I figure it's ok to not handle the case where the input is empty. Or should that be handled?
Right, this should catch it; I was wondering about that. I'll fix the ones I can soon!
https://stackoverflow.com/questions/22429854/huffman-code-for-a-single-character
I don't like the solution of using 1 bit to encode the single character, mostly because it makes the Kotlin implementation more complex than I feel it needs to be.
I think that, for the edge case, the empty string isn't suitable, since it doesn't give us the full description of the text. I also think that representing the only character with a 1 or a 0 is wasteful, since we are not using the other character at our disposal. So I think the best way to handle this is: if there is only one character in the text, the Huffman code becomes the binary representation of the number of occurrences of that single character (for simplicity, I think the representation should be in little-endian order). Then, while decoding, if we see that the codebook has only one entry (a single root node containing the character), we simply interpret the encoded text as the number of occurrences of that character. Although this increases the complexity of the encoding and decoding functions, this method ensures:
- A complete description of the text
- The most efficient representation of the given text using only 1s and 0s
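A quick sketch of that proposal (little-endian bit order; function names are mine, purely for illustration):

```python
def encode_count_little_endian(n):
    # Encode the occurrence count in binary, least significant bit first
    bits = ''
    while n:
        bits += str(n & 1)
        n >>= 1
    return bits or '0'

def decode_count_little_endian(bits):
    # Read the bits back, least significant bit first
    return sum(int(b) << i for i, b in enumerate(bits))
```

For example, a text of 13 repeated characters would encode to `'1011'` (13 in little-endian binary), and decoding reads the count straight back.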
The way I see it, while making an encoding edge case where the codebook indicates "this string has only the letter A" and the bitstring instead encodes the amount of characters would be the optimal thing to do, the algorithmically correct implementation would be instead for the codebook to say "the character A corresponds to a 0" and then the bitstring has as many zeroes as the original string has A's, as that would be internally consistent with how this algorithm handles all other cases.
> The way I see it, while making an encoding edge case where the codebook indicates "this string has only the letter A" and the bitstring instead encodes the amount of characters would be the optimal thing to do, the algorithmically correct implementation would be instead for the codebook to say "the character A corresponds to a 0" and then the bitstring has as many zeroes as the original string has A's, as that would be internally consistent with how this algorithm handles all other cases.
That's actually not correct. Most implementations (e.g. DEFLATE) don't have a marker for the end of the bitstring; instead, they store the length of the uncompressed data. Because we are storing the bits in a string, we had an implicit end marker (the end of the string). IMHO the correct encoding result is the empty string, but the decode function should somehow be told what the target length is. This can be done, for example, by making the encoder return an `Encoded` object that stores the length and the bits.
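A sketch of that idea with hypothetical names (`Encoded` and its fields are mine, not from the archive): the decoder consults the stored length only when the bitstring is empty, which is exactly the single-symbol case.

```python
from dataclasses import dataclass

@dataclass
class Encoded:
    length: int  # number of symbols in the original message
    bits: str    # the bitstring; empty for the single-symbol case

def decode(codebook, encoded):
    # Single-symbol case: an empty bitstring plus the stored length
    # is enough to reconstruct the message
    if not encoded.bits and codebook:
        only_symbol = next(iter(codebook))
        return only_symbol * encoded.length

    # Invert the codebook (bit pattern -> symbol) and walk the bitstring,
    # emitting a symbol whenever the buffer matches a code
    inverse = {code: symbol for symbol, code in codebook.items()}
    message, buffer = [], ''
    for bit in encoded.bits:
        buffer += bit
        if buffer in inverse:
            message.append(inverse[buffer])
            buffer = ''
    return ''.join(message)
```

For instance, `decode({'a': ''}, Encoded(5, ''))` reconstructs `'aaaaa'` from an empty bitstring, while ordinary codebooks decode as usual.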
Yes, because those implementations operate down at the metal, where there needs to be some way to determine when to stop reading and interpreting bits, and as such either encode the length upfront or include an "end of string" token just like any other kind of string. For the version of the algorithm being discussed in this book, those extra details can and maybe should be mentioned in the chapter text, but I don't think this kind of handling should be done in the actual code just because it's often done elsewhere, since these implementations are meant to be illustrative of the main principle at work without any optimizations. Since the text does mention that the algorithm "guarantees the smallest possible bit length", that may be a good reason to implement this kind of optimization, though I am unconvinced, since any means to actually achieve that in our example implementations will involve either encoding the length or encoding the "end of string" token, both of which involve very low-level concepts that distract from the point of the chapter and may confuse a reader.
For the simplified version of the algorithm that is actually described in the chapter, the correct behavior would be:
- An empty string should be encoded with an empty codebook and an empty bytestring,
- `a` should be encoded with a codebook indicating that "a" is 0 (for example) and the bytestring should be `0`, and
- `aaaaa` should be encoded with the same codebook and the bytestring should be `00000`.
> For the simplified version of the algorithm that is actually described in the chapter, the correct behavior would be:
> - An empty string should be encoded with an empty codebook and an empty bytestring,
> - `a` should be encoded with a codebook indicating that "a" is 0 (for example) and the bytestring should be `0`, and
> - `aaaaa` should be encoded with the same codebook and the bytestring should be `00000`.
That's incorrect, the algorithm as described would encode "aaaaaa" as the empty string. Honestly, I would just explicitly disallow those strings in the implementation.
> That's incorrect, the algorithm as described would encode "aaaaaa" as the empty string. Honestly, I would just explicitly disallow those strings in the implementation.
Then the algorithm as described has a bug in it, and we should figure out whether to just disallow all strings that would cause that bug or to fix it and if so how.
In any case, I sent Leios the original paper the algorithm originated in to read, so we'll see if there are any answers to this there.
On the one hand, fixing for those kind of strings in Python/Coconut is pretty easy. You can see how I did it in the Coconut Huffman encoding PR (trying not to brag, but sorry if I seem like it), where I used the solution that @thecnoNSMB provides. The paper does not really provide a feel for the edge cases, though. Realistically, you already have some kind of Huffman tree or code book that is already lying around, or you don't really need to build them from single-character (repeated or not) inputs or empty string inputs... If the only kind of messages you can send are "no message" and "here are x characters", do you really need to go through the trouble of building a tree or a code book? Just use (possibly repeated) single digits!
So my point is that the fix is done on my part, but that it's not really realistic to have to do it.
This issue is also there in my Racket implementation (#823). The input "bbbbb" returns an empty string as the encoded message. I could fix this to return 00000.
The way I am doing it, the Huffman tree for "bbbbb" consists of just one leaf node corresponding to the symbol b. I see two ways to fix it:
- Make it so that the Huffman tree for such an input consists of a `branch` whose left child is the `leaf`.
- Add a simple `if` condition to deal with such a tree.
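A minimal sketch of the first option (my own tuple-based tree shape, with `None` marking the missing right child; not the Racket code): the ordinary codebook walk then assigns the lone leaf the code `'0'` with no special case.

```python
def build_codebook(tree, code=''):
    # Standard walk over a (left, right) tuple tree; leaves are symbols
    if not isinstance(tree, tuple):
        return [(tree, code or '0')]
    left, right = tree
    book = []
    if left is not None:
        book += build_codebook(left, code + '0')
    if right is not None:
        book += build_codebook(right, code + '1')
    return book

# The "branch" fix: a single-symbol message yields the tree ('b', None),
# so the usual traversal assigns 'b' the code '0' on its own
single_tree = ('b', None)
print(build_codebook(single_tree))  # [('b', '0')]
```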
I took a quick look and this issue is also present in the Haskell implementation: the `encode` procedure returns an empty string and the `decode` procedure returns an infinite list of b's.
My guess is that this is also due to the same reason as mine: leaves and non-leaf nodes are treated differently. So it might be possible to fix in one of the two ways I suggested above.
There is also a small typo in the Haskell implementation: `encoding` is spelled as `endoding`.
If you want to update the haskell implementation, I think that would be very much appreciated!
Thanks for bumping this thread, btw! I apparently didn't have the code fixed in Julia either, so I created a new PR (#828) to do this. I used the if statement to deal with a tree composed of only 1 leaf, but maybe the branch approach would have been better? I am happy with either.
I was eavesdropping on this conversation (I made the Haskell implementation). Thank you for spotting the bug; here is a possible fix:
```haskell
decode :: (Ord a) => Tree a -> [Bit] -> [a]
decode (Leaf i a) _ = replicate i a
decode t msg = path t msg
  where ...
```
In this version, encoding "aaa" gets you `(Leaf 3 'a', [])`, basically an empty encoded message, since in this case the whole message is encoded in the tree.
Should I make a PR? Or do you want to include it in yours @lazyprop?
@jiegillet go ahead! I don't really know Haskell anyway xD. Though I'm still concerned whether it should be an empty string or "0000". The discussion that happened on this thread last year didn't end in an agreement. What do you think @leios?
> The way I see it, while making an encoding edge case where the codebook indicates "this string has only the letter A" and the bitstring instead encodes the amount of characters would be the optimal thing to do, the algorithmically correct implementation would be instead for the codebook to say "the character A corresponds to a 0" and then the bitstring has as many zeroes as the original string has A's, as that would be internally consistent with how this algorithm handles all other cases.
> That's actually not correct. Most implementations (e.g. DEFLATE) don't have a marker for the end of the bitstring; instead, they store the length of the uncompressed data. Because we are storing the bits in a string, we had an implicit end marker (the end of the string). IMHO the correct encoding result is the empty string, but the decode function should somehow be told what the target length is. This can be done, for example, by making the encoder return an `Encoded` object that stores the length and the bits.
I mean, Kroppeb is right here. In actual implementations, it makes sense that we keep track of the length of the data... This should be mentioned in the chapter (and we should also explicitly link to deflate), but I feel that it is better to just return the 0's for now because it's easier to reason about for now (it's also what I was taught in data compression, so I think it is the more common approach for math folks).
I have been wanting to talk about the Vitter algorithm for a while as an extension to this one and think that is a good place to introduce the more technical aspects.
I think returning an empty sequence makes more sense, because the algorithm is all about compression, and `(Leaf 'a' 100, [])` is a heck of a lot more compressed than `(Leaf 'a' 100, "0000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000")`.
> I think returning an empty sequence makes more sense, because the algorithm is all about compression, and `(Leaf 'a' 100, [])` is a heck of a lot more compressed than `(Leaf 'a' 100, "0000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000")`.
The problem is that, following the Huffman tree building algorithm naively, I think you get `(Leaf 'a', '0')`, with which the 100 'a's are transformed into 100 '0's.
The other problem is that you still need to (consistently) signal the end of your string.
So either you say that if you get a one-element codebook, you get the binary representation of the number of elements in the encoded string, or you need to return the whole string of 0s no matter what.
Now that I think about it, the weights in the codebook building are arbitrary and technically decorrelated from the string to encode, except if you build the codebook with that string (which, although it is the case here, normally never happens). In all common cases, if in the codebook there was say the entry ['a', '10'], the string of 100 'a's would be "10"*100, which is actually longer in terms of characters than the input sequence. However the thing to remember is that while the encoded sequence of characters is just as long or longer, it should be thought of as a sequence of bits. And as a sequence of bits, it's a 7-fold or 8-fold compression (assuming ASCII or appropriate character encoding for most western languages).
I don't think sending an empty sequence makes sense, as it should probably be reserved for the encoding of the empty string.
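To make the trade-off above concrete, here is a quick size comparison of the two options being discussed (function names are mine), for 100 repeated characters:

```python
def zeros_encoding(n):
    # Option 1: one '0' per occurrence of the lone character
    return '0' * n

def count_encoding(n):
    # Option 2: the binary representation of the occurrence count
    # (big-endian via bin(); the earlier suggestion was little-endian)
    return bin(n)[2:]

print(len(zeros_encoding(100)))  # 100 bits
print(len(count_encoding(100)))  # 7 bits
```

Option 2 is logarithmic in the message length, but as noted above it changes the meaning of the bitstring, while option 1 stays consistent with how the algorithm treats every other input.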
Yeah, there are actually a lot of weird details to the code right now that should be fixed for efficiency. In many languages, we are not even using bits, but instead chars to act as bits. I think at the time, julia's bit types had just changed and I couldn't find documentation on how to use them properly anymore, so I just used strings. We really should be using actual bits. I'll see about fixing that in the Julia PR as well.
To make sure I am following, we are thinking about returning either:
- "000...000"
- ("c", n) where c is a character and n is the number of characters in the string
It seems to me that 2 is only more efficient in the case where we have a small number (less than 8, 16, 32, 64... whatever the precision of the Int is) of the same character repeated, which means we are paying attention to an edge-case that would be very difficult to actually see in practice, right?
> To make sure I am following, we are thinking about returning either:
> - "000...000"
> - `("c", n)` where `c` is a character and `n` is the number of characters in the string
That's not exactly it. I'm thinking of returning either:
1. `"000...000"` + the codebook `Leaf 'a' 100`
2. `""` + the codebook `Leaf 'a' 100`
The reasoning behind 2 is that if the codebook is always generated with the message to encode, Leaf 'a' 100 is enough to reconstruct the message since we know there is only one character and we know the frequency. However @Amaras pointed out that in a situation where the codebook is not sent with the message (maybe there is a predefined codebook for the English language or something?) or maybe the frequencies are renormalized, then you need to send the 0's. To be fair, that is also a bit of a stretch because there is no way that people would use a predefined codebook of one character :)
Still, @Amaras convinced me and I now think solution 1 is the best. But this is a pathological case, any decision is fine so let's pick one and run with it. @leios, I think if you make the call, people will follow.
I really don't think we need to worry over bits vs characters for the bitstring, the algorithm is the important part, we are not building libraries here.
And I forgot to remove the link again, sorry
I need to learn to stop saying "This fixes #..." when it only fixes 1 part of the full issue...