img2dataset
UnicodeEncodeError: 'charmap' codec can't encode characters in position 74-78: character maps to <undefined>
Using entry #3 of laion2b-en-aesthetics65.parquet, which has this caption:
"San Pedro: One Of Mother Nature's Most Powerful Psychedelics | Ayahuasca アヤワスカ | Scoop.it"
Error:
Traceback (most recent call last):
  File "%USERPROFILE%\miniconda3\envs\controlnet\Lib\site-packages\img2dataset\downloader.py", line 328, in download_shard
    sample_writer.write(
  File "%USERPROFILE%\miniconda3\envs\controlnet\Lib\site-packages\img2dataset\writer.py", line 280, in write
    f.write(str(caption))
  File "%USERPROFILE%\miniconda3\envs\controlnet\Lib\encodings\cp1252.py", line 19, in encode
    return codecs.charmap_encode(input,self.errors,encoding_table)[0]
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
UnicodeEncodeError: 'charmap' codec can't encode characters in position 74-78: character maps to <undefined>
Sample 3 failed to download: 'charmap' codec can't encode characters in position 74-78: character maps to <undefined>
Result: the jpg was downloaded, 000000003.txt is empty, and 000000003.json is missing.
Using Windows 10 with Miniconda
Similar issue: https://github.com/rom1504/img2dataset/issues/219
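A minimal sketch that reproduces the failure outside img2dataset, assuming the caption file is opened in text mode with the cp1252 codec named in the traceback. The helper try_write is hypothetical and uses the built-in open rather than the library's writer. It also suggests why the .txt ends up empty: opening with "w" creates the file before the write raises.

```python
import os
import tempfile

# Caption from entry #3; the katakana characters have no cp1252 mapping.
caption = "San Pedro: One Of Mother Nature's Most Powerful Psychedelics | Ayahuasca アヤワスカ | Scoop.it"


def try_write(path, text, encoding):
    """Hypothetical helper: write text, return the error message or None."""
    try:
        with open(path, "w", encoding=encoding) as f:
            f.write(text)
    except UnicodeEncodeError as exc:
        return str(exc)
    return None


with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "000000003.txt")
    # Force cp1252 so the behaviour matches Windows on any platform.
    error = try_write(path, caption, "cp1252")
    # The file exists because open() created it, but nothing was written.
    size_after_failure = os.path.getsize(path)

print(error)
print(size_after_failure)
```

On a machine whose default encoding is already UTF-8, the explicit "cp1252" argument is what triggers the error.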
Yeah, I'm afraid Windows still doesn't work very well. Use Linux instead? (There is a feature to do that under Windows now.)
(Email reply to the report posted by Gerold Meisinger on Tue, Sep 5, 2023, 10:33.)
In my case it worked after specifying the encoding in writer.py:
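def write(self, img_str, key, caption, meta):
    ...
    # Pass the encoding explicitly instead of relying on the locale default
    # (cp1252 in the traceback above).
    with self.fs.open(caption_filename, "w", encoding="utf-8") as f:
        ...
    with self.fs.open(meta_filename, "w", encoding="utf-8") as f:
        ...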
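To check the idea in isolation, here is a small self-contained sketch of the same change. It is not the img2dataset code: write_sidecars is a hypothetical stand-in, it uses the built-in open instead of self.fs.open, and the meta dictionary is made up for illustration. The only point it shows is that an explicit encoding="utf-8" lets the caption round-trip regardless of the platform default.

```python
import json
import os
import tempfile

caption = "San Pedro: One Of Mother Nature's Most Powerful Psychedelics | Ayahuasca アヤワスカ | Scoop.it"


def write_sidecars(directory, key, caption, meta):
    """Hypothetical stand-in for the writer: save caption and metadata as UTF-8."""
    caption_path = os.path.join(directory, f"{key}.txt")
    meta_path = os.path.join(directory, f"{key}.json")
    # Explicit encoding, so the result does not depend on the OS locale.
    with open(caption_path, "w", encoding="utf-8") as f:
        f.write(str(caption))
    with open(meta_path, "w", encoding="utf-8") as f:
        # ensure_ascii=False keeps non-ASCII text readable in the JSON file.
        json.dump(meta, f, ensure_ascii=False)
    return caption_path, meta_path


with tempfile.TemporaryDirectory() as tmp:
    meta = {"key": "000000003", "caption": caption}  # illustrative metadata only
    caption_path, meta_path = write_sidecars(tmp, "000000003", caption, meta)
    # Read both files back with the same encoding to confirm the round trip.
    with open(caption_path, encoding="utf-8") as f:
        caption_back = f.read()
    with open(meta_path, encoding="utf-8") as f:
        meta_back = json.load(f)

print(caption_back == caption)
print(meta_back["caption"] == caption)
```

Reading the files back also needs encoding="utf-8" on Windows, otherwise the same locale default applies on the read side.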