img2dataset

UnicodeEncodeError: 'charmap' codec can't encode characters in position 74-78: character maps to <undefined>

Open geroldmeisinger opened this issue 1 year ago • 2 comments

Using laion2b-en-aesthetics65.parquet, entry #3, which has this caption:

"San Pedro: One Of Mother Nature's Most Powerful Psychedelics | Ayahuasca アヤワスカ | Scoop.it"

Error:

Traceback (most recent call last):
  File "%USERPROFILE%\miniconda3\envs\controlnet\Lib\site-packages\img2dataset\downloader.py", line 328, in download_shard
    sample_writer.write(
  File "%USERPROFILE%\miniconda3\envs\controlnet\Lib\site-packages\img2dataset\writer.py", line 280, in write
    f.write(str(caption))
  File "%USERPROFILE%\miniconda3\envs\controlnet\Lib\encodings\cp1252.py", line 19, in encode
    return codecs.charmap_encode(input,self.errors,encoding_table)[0]
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
UnicodeEncodeError: 'charmap' codec can't encode characters in position 74-78: character maps to <undefined>
Sample 3 failed to download: 'charmap' codec can't encode characters in position 74-78: character maps to <undefined>

Result: jpg was downloaded; empty: 000000003.txt; missing: 000000003.json

Using Windows 10 with Miniconda

Similar issue: https://github.com/rom1504/img2dataset/issues/219

geroldmeisinger, Sep 05 '23 08:09

Yeah, I'm afraid Windows still doesn't work very well. Use Linux instead? (There is a feature to do that under Windows now.)

rom1504, Sep 05 '23 08:09

In my case it worked after specifying the encoding in writer.py:

def write(self, img_str, key, caption, meta):
  ...
  with self.fs.open(caption_filename, "w", encoding="utf-8") as f:
  ...
  with self.fs.open(meta_filename, "w", encoding="utf-8") as f:
  ...
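
For anyone who wants to see the effect outside img2dataset, here is a minimal standalone sketch. It assumes the failure comes from the file being opened with the platform default encoding (cp1252 in the traceback above). The helper `write_caption` is hypothetical and not part of img2dataset; it uses the built-in `open` rather than `self.fs.open`.

```python
import os
import tempfile

# Caption from the report; the katakana has no cp1252 representation.
caption = (
    "San Pedro: One Of Mother Nature's Most Powerful Psychedelics"
    " | Ayahuasca アヤワスカ | Scoop.it"
)


def write_caption(path, text, encoding=None):
    """Hypothetical helper: write text to path.

    encoding=None falls back to the platform default, which is what
    happens when no encoding is passed to open().
    """
    with open(path, "w", encoding=encoding) as f:
        f.write(text)


with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "000000003.txt")

    # Force cp1252 to imitate the Windows default from the traceback.
    try:
        write_caption(path, caption, encoding="cp1252")
        cp1252_failed = False
    except UnicodeEncodeError as err:
        cp1252_failed = True
        print("cp1252 failed:", err)

    # The file was created before the write raised, so it is left empty.
    empty_size = os.path.getsize(path)

    # With an explicit UTF-8 encoding the same caption is written fine.
    write_caption(path, caption, encoding="utf-8")
    with open(path, encoding="utf-8") as f:
        round_trip = f.read()
```

This would also account for the empty 000000003.txt in the report: the file is created when it is opened, and the encode error is raised before any text reaches it.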

sraimund, May 16 '24 09:05