shoco
shoco copied to clipboard
fix generate_compressor_model.py for finding best encodings
Can you describe what the fix is for exactly? Was there something fundamentally broken with the model generation or is this PR an improvement?
Unrelated: I opened a PR to fix some type errors I had when running the script (#47). Did you have any of these type errors?
I tried running make
on your branch and I'm getting the following error:
python generate_compressor_model.py --split=whitespace --strip=punctuation training_data/dorian_gray.txt training_data/metamorphosis.txt training_data/pride_and_prejudice.txt -o models/words_en.h
finding bigrams ... Traceback (most recent call last):
File "/home/perro/code/test_shoco/shoco/generate_compressor_model.py", line 396, in <module>
main()
File "/home/perro/code/test_shoco/shoco/generate_compressor_model.py", line 292, in main
chunks = list(chunkinator(args.file, args.split, args.strip))
File "/home/perro/code/test_shoco/shoco/generate_compressor_model.py", line 250, in chunkinator
for chunk in chunks:
File "/home/perro/code/test_shoco/shoco/generate_compressor_model.py", line 247, in <genexpr>
chunks = itertools.chain.from_iterable(re.split(b"[" + WHITESPACE + "]", data) for data in all_in)
TypeError: can't concat str to bytes
make: *** [Makefile:33: models/words_en.h] Error 1
And also:
Traceback (most recent call last):
File "/home/perro/code/test_shoco/shoco/./generate_compressor_model.py", line 396, in <module>
main()
File "/home/perro/code/test_shoco/shoco/./generate_compressor_model.py", line 310, in main
max_chr = ord(max(successors.keys())) + 1
TypeError: ord() expected string of length 1, but int found
I tried running
make
on your branch and I'm getting the following error:python generate_compressor_model.py --split=whitespace --strip=punctuation training_data/dorian_gray.txt training_data/metamorphosis.txt training_data/pride_and_prejudice.txt -o models/words_en.h finding bigrams ... Traceback (most recent call last): File "/home/perro/code/test_shoco/shoco/generate_compressor_model.py", line 396, in <module> main() File "/home/perro/code/test_shoco/shoco/generate_compressor_model.py", line 292, in main chunks = list(chunkinator(args.file, args.split, args.strip)) File "/home/perro/code/test_shoco/shoco/generate_compressor_model.py", line 250, in chunkinator for chunk in chunks: File "/home/perro/code/test_shoco/shoco/generate_compressor_model.py", line 247, in <genexpr> chunks = itertools.chain.from_iterable(re.split(b"[" + WHITESPACE + "]", data) for data in all_in) TypeError: can't concat str to bytes make: *** [Makefile:33: models/words_en.h] Error 1
And also:
Traceback (most recent call last): File "/home/perro/code/test_shoco/shoco/./generate_compressor_model.py", line 396, in <module> main() File "/home/perro/code/test_shoco/shoco/./generate_compressor_model.py", line 310, in main max_chr = ord(max(successors.keys())) + 1 TypeError: ord() expected string of length 1, but int found
@perronet I have fix the previous errors for python3, the testing command is
python3 ./generate_compressor_model.py --optimize-encoding --split whitespace -o >(cat) generate_compressor_model.py
Can you describe what the fix is for exactly? Was there something fundamentally broken with the model generation or is this PR an improvement?
fix some logic issue when searching best encodings, such as line 159 last_char = part[0]
does not update last_char
. And also improve the performance of searching, move the encoding loop to outer, which can filter encodings once.
https://github.com/Ed-von-Schleck/shoco/pull/46/files#diff-9de89837200c6ecb97eea8c58103f21c6e5dc7f221d3c4d2fcedfd5209fbd182R119
Please change this line with
self.packed = sum(bitlist) // 8
You can see here that the generated bytes_packed
is accidentally a float
https://github.com/Ed-von-Schleck/shoco/pull/46/files#diff-9de89837200c6ecb97eea8c58103f21c6e5dc7f221d3c4d2fcedfd5209fbd182R119
Please change this line with
self.packed = sum(bitlist) // 8
You can see here that the generated
bytes_packed
is accidentally a float
this is already done in pr #18 , can we merge this first and this pr will rebase to the latest
Good point, however I suggest adding this fix regardless since that PR hasn't been merged since 2015.
In the meantime, let's see if we can get the attention of the maintainer @Ed-von-Schleck
Good point, however I suggest adding this fix regardless since that PR hasn't been merged since 2015.
In the meantime, let's see if we can get the attention of the maintainer @Ed-von-Schleck
the float packed
issue is fixed.
Nice! I'm just using your fork right now since it's unclear if this PR will ever get merged