shoco fix generate_compressor_model.py for finding best encodings

Aug 21 '23 09:08 gzm55

Can you describe what the fix is for exactly? Was there something fundamentally broken with the model generation or is this PR an improvement?

Unrelated: I opened a PR to fix some type errors I had when running the script (#47). Did you have any of these type errors?

Sep 08 '23 15:09 perronet

I tried running make on your branch and I'm getting the following error:

python generate_compressor_model.py --split=whitespace --strip=punctuation training_data/dorian_gray.txt training_data/metamorphosis.txt training_data/pride_and_prejudice.txt -o models/words_en.h
finding bigrams ... Traceback (most recent call last):
  File "/home/perro/code/test_shoco/shoco/generate_compressor_model.py", line 396, in <module>
    main()
  File "/home/perro/code/test_shoco/shoco/generate_compressor_model.py", line 292, in main
    chunks = list(chunkinator(args.file, args.split, args.strip))
  File "/home/perro/code/test_shoco/shoco/generate_compressor_model.py", line 250, in chunkinator
    for chunk in chunks:
  File "/home/perro/code/test_shoco/shoco/generate_compressor_model.py", line 247, in <genexpr>
    chunks = itertools.chain.from_iterable(re.split(b"[" + WHITESPACE + "]", data) for data in all_in)
TypeError: can't concat str to bytes
make: *** [Makefile:33: models/words_en.h] Error 1

And also:

Traceback (most recent call last):
  File "/home/perro/code/test_shoco/shoco/./generate_compressor_model.py", line 396, in <module>
    main()
  File "/home/perro/code/test_shoco/shoco/./generate_compressor_model.py", line 310, in main
    max_chr = ord(max(successors.keys())) + 1
TypeError: ord() expected string of length 1, but int found

Sep 08 '23 17:09 perronet

I tried running make on your branch and I'm getting the following error:

python generate_compressor_model.py --split=whitespace --strip=punctuation training_data/dorian_gray.txt training_data/metamorphosis.txt training_data/pride_and_prejudice.txt -o models/words_en.h
finding bigrams ... Traceback (most recent call last):
  File "/home/perro/code/test_shoco/shoco/generate_compressor_model.py", line 396, in <module>
    main()
  File "/home/perro/code/test_shoco/shoco/generate_compressor_model.py", line 292, in main
    chunks = list(chunkinator(args.file, args.split, args.strip))
  File "/home/perro/code/test_shoco/shoco/generate_compressor_model.py", line 250, in chunkinator
    for chunk in chunks:
  File "/home/perro/code/test_shoco/shoco/generate_compressor_model.py", line 247, in <genexpr>
    chunks = itertools.chain.from_iterable(re.split(b"[" + WHITESPACE + "]", data) for data in all_in)
TypeError: can't concat str to bytes
make: *** [Makefile:33: models/words_en.h] Error 1

And also:

Traceback (most recent call last):
  File "/home/perro/code/test_shoco/shoco/./generate_compressor_model.py", line 396, in <module>
    main()
  File "/home/perro/code/test_shoco/shoco/./generate_compressor_model.py", line 310, in main
    max_chr = ord(max(successors.keys())) + 1
TypeError: ord() expected string of length 1, but int found

@perronet I have fix the previous errors for python3, the testing command is

python3 ./generate_compressor_model.py --optimize-encoding --split whitespace -o >(cat) generate_compressor_model.py

Sep 09 '23 02:09 gzm55

Can you describe what the fix is for exactly? Was there something fundamentally broken with the model generation or is this PR an improvement?

fix some logic issue when searching best encodings, such as line 159 last_char = part[0] does not update last_char. And also improve the performance of searching, move the encoding loop to outer, which can filter encodings once.

Sep 09 '23 02:09 gzm55

https://github.com/Ed-von-Schleck/shoco/pull/46/files#diff-9de89837200c6ecb97eea8c58103f21c6e5dc7f221d3c4d2fcedfd5209fbd182R119

Please change this line with

self.packed = sum(bitlist) // 8

You can see here that the generated bytes_packed is accidentally a float

Sep 09 '23 10:09 perronet

https://github.com/Ed-von-Schleck/shoco/pull/46/files#diff-9de89837200c6ecb97eea8c58103f21c6e5dc7f221d3c4d2fcedfd5209fbd182R119

Please change this line with
self.packed = sum(bitlist) // 8
You can see here that the generated bytes_packed is accidentally a float

this is already done in pr #18 , can we merge this first and this pr will rebase to the latest

Sep 09 '23 10:09 gzm55

Good point, however I suggest adding this fix regardless since that PR hasn't been merged since 2015.

In the meantime, let's see if we can get the attention of the maintainer @Ed-von-Schleck

Sep 09 '23 11:09 perronet

Good point, however I suggest adding this fix regardless since that PR hasn't been merged since 2015.

In the meantime, let's see if we can get the attention of the maintainer @Ed-von-Schleck

the float packed issue is fixed.

Sep 09 '23 15:09 gzm55

Nice! I'm just using your fork right now since it's unclear if this PR will ever get merged

Sep 21 '23 18:09 perronet

shoco shoco copied to clipboard

fix generate_compressor_model.py for finding best encodings

shoco
shoco copied to clipboard