
UnicodeDecodeError: 'utf-8' codec can't decode byte 0x8b in position 1: invalid start byte

Open srolskyi opened this issue 11 months ago • 7 comments

Fresh installation in a newly created environment (Python 3.9.18 or 3.12):

serg: ~ : python3 -m venv new_env
serg: ~ : source new_env/bin/activate
(new_env) serg: ~ : pip install bpemb gensim
Collecting bpemb
  Downloading bpemb-0.3.4-py3-none-any.whl.metadata (19 kB)
Collecting gensim
  Using cached gensim-4.3.2-cp312-cp312-macosx_10_9_universal2.whl
Collecting numpy (from bpemb)
  Downloading numpy-1.26.4-cp312-cp312-macosx_11_0_arm64.whl.metadata (61 kB)
  ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 61.1/61.1 kB 949.1 kB/s eta 0:00:00
Collecting requests (from bpemb)
  Downloading requests-2.31.0-py3-none-any.whl.metadata (4.6 kB)
Collecting sentencepiece (from bpemb)
  Downloading sentencepiece-0.2.0-cp312-cp312-macosx_11_0_arm64.whl.metadata (7.7 kB)
Collecting tqdm (from bpemb)
  Downloading tqdm-4.66.2-py3-none-any.whl.metadata (57 kB)
  ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 57.6/57.6 kB 2.6 MB/s eta 0:00:00
Collecting scipy>=1.7.0 (from gensim)
  Downloading scipy-1.12.0-cp312-cp312-macosx_12_0_arm64.whl.metadata (217 kB)
  ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 217.9/217.9 kB 3.3 MB/s eta 0:00:00
Collecting smart-open>=1.8.1 (from gensim)
  Downloading smart_open-7.0.1-py3-none-any.whl.metadata (23 kB)
Collecting wrapt (from smart-open>=1.8.1->gensim)
  Downloading wrapt-1.16.0-cp312-cp312-macosx_11_0_arm64.whl.metadata (6.6 kB)
Collecting charset-normalizer<4,>=2 (from requests->bpemb)
  Downloading charset_normalizer-3.3.2-cp312-cp312-macosx_11_0_arm64.whl.metadata (33 kB)
Collecting idna<4,>=2.5 (from requests->bpemb)
  Downloading idna-3.6-py3-none-any.whl.metadata (9.9 kB)
Collecting urllib3<3,>=1.21.1 (from requests->bpemb)
  Downloading urllib3-2.2.1-py3-none-any.whl.metadata (6.4 kB)
Collecting certifi>=2017.4.17 (from requests->bpemb)
  Downloading certifi-2024.2.2-py3-none-any.whl.metadata (2.2 kB)
Downloading bpemb-0.3.4-py3-none-any.whl (19 kB)
Downloading numpy-1.26.4-cp312-cp312-macosx_11_0_arm64.whl (13.7 MB)
  ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 13.7/13.7 MB 67.8 MB/s eta 0:00:00
Downloading scipy-1.12.0-cp312-cp312-macosx_12_0_arm64.whl (31.4 MB)
  ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 31.4/31.4 MB 59.3 MB/s eta 0:00:00
Downloading smart_open-7.0.1-py3-none-any.whl (60 kB)
  ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 60.8/60.8 kB 3.6 MB/s eta 0:00:00
Downloading requests-2.31.0-py3-none-any.whl (62 kB)
  ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 62.6/62.6 kB 4.4 MB/s eta 0:00:00
Downloading sentencepiece-0.2.0-cp312-cp312-macosx_11_0_arm64.whl (1.2 MB)
  ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 1.2/1.2 MB 42.6 MB/s eta 0:00:00
Downloading tqdm-4.66.2-py3-none-any.whl (78 kB)
  ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 78.3/78.3 kB 7.3 MB/s eta 0:00:00
Downloading certifi-2024.2.2-py3-none-any.whl (163 kB)
  ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 163.8/163.8 kB 12.8 MB/s eta 0:00:00
Downloading charset_normalizer-3.3.2-cp312-cp312-macosx_11_0_arm64.whl (119 kB)
  ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 119.4/119.4 kB 10.6 MB/s eta 0:00:00
Downloading idna-3.6-py3-none-any.whl (61 kB)
  ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 61.6/61.6 kB 3.9 MB/s eta 0:00:00
Downloading urllib3-2.2.1-py3-none-any.whl (121 kB)
  ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 121.1/121.1 kB 10.1 MB/s eta 0:00:00
Downloading wrapt-1.16.0-cp312-cp312-macosx_11_0_arm64.whl (38 kB)
Installing collected packages: sentencepiece, wrapt, urllib3, tqdm, numpy, idna, charset-normalizer, certifi, smart-open, scipy, requests, gensim, bpemb
Successfully installed bpemb-0.3.4 certifi-2024.2.2 charset-normalizer-3.3.2 gensim-4.3.2 idna-3.6 numpy-1.26.4 requests-2.31.0 scipy-1.12.0 sentencepiece-0.2.0 smart-open-7.0.1 tqdm-4.66.2 urllib3-2.2.1 wrapt-1.16.0

(new_env) serg: ~ : python3 --version
Python 3.12.2

Then I ran python3 -c "from bpemb import BPEmb; bpemb_en = BPEmb(lang='en', dim=100)"

and got this error:

Traceback (most recent call last):
  File "<string>", line 1, in <module>
  File "/Users/serg/new_env/lib/python3.12/site-packages/bpemb/bpemb.py", line 191, in __init__
    self.emb = load_word2vec_file(self.emb_file, add_pad=add_pad_emb)
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/serg/new_env/lib/python3.12/site-packages/bpemb/util.py", line 78, in load_word2vec_file
    vecs = KeyedVectors.load_word2vec_format(word2vec_file, binary=binary)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/serg/new_env/lib/python3.12/site-packages/gensim/models/keyedvectors.py", line 1719, in load_word2vec_format
    return _load_word2vec_format(
           ^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/serg/new_env/lib/python3.12/site-packages/gensim/models/keyedvectors.py", line 2058, in _load_word2vec_format
    header = utils.to_unicode(fin.readline(), encoding=encoding)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/serg/new_env/lib/python3.12/site-packages/gensim/utils.py", line 365, in any2unicode
    return str(text, encoding, errors=errors)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
UnicodeDecodeError: 'utf-8' codec can't decode byte 0x8b in position 1: invalid start byte
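One observation that may help narrow this down: 0x8b at position 1 matches the second byte of the gzip magic number (0x1f 0x8b), so it might be that the file being read as text is still gzip-compressed. This is only a guess, not a confirmed cause. A minimal sketch for checking the first two bytes of a file; `looks_gzipped` is a hypothetical helper (not part of bpemb or gensim), and the demo uses throwaway temp files because the actual path stored in `emb_file` is not shown in the traceback:

```python
import gzip
import tempfile
from pathlib import Path

# The gzip format starts every file with these two bytes (RFC 1952).
GZIP_MAGIC = b"\x1f\x8b"


def looks_gzipped(path):
    """Return True if the file at `path` begins with the gzip magic bytes.

    Hypothetical helper for illustration; not part of bpemb or gensim.
    """
    with open(path, "rb") as f:
        return f.read(2) == GZIP_MAGIC


if __name__ == "__main__":
    # Demo on temporary files; substitute the real embedding file path
    # (whatever emb_file points to on your machine) to check it.
    with tempfile.TemporaryDirectory() as tmp:
        plain = Path(tmp) / "plain.txt"
        plain.write_text("2 3\nfoo 0.1 0.2 0.3\nbar 0.4 0.5 0.6\n", encoding="utf-8")

        packed = Path(tmp) / "packed.txt.gz"
        with gzip.open(packed, "wb") as f:
            f.write(plain.read_bytes())

        print("plain file looks gzipped: ", looks_gzipped(plain))
        print("packed file looks gzipped:", looks_gzipped(packed))
```

If the check returns True for the embedding file, that would suggest the text loader is being handed compressed data, which would be consistent with the decode error above.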

Any idea where I'm making a mistake?

srolskyi, Mar 15 '24 14:03