
UnicodeDecodeError: 'charmap' codec can't decode byte - while running prepare_dataset.py on funsd dataset

Open • hchintada opened this issue 2 years ago • 1 comment

Prerequisite

Task

I'm using the official example scripts/configs for the officially supported tasks/models/datasets.

Branch

main branch https://github.com/open-mmlab/mmocr

Environment

sys.platform: win32
Python: 3.8.16 (default, Jan 17 2023, 22:25:28) [MSC v.1916 64 bit (AMD64)]
CUDA available: False
numpy_random_seed: 2147483648
MSVC: Microsoft (R) C/C++ Optimizing Compiler Version 19.34.31937 for x64
GCC: n/a
PyTorch: 1.13.1
PyTorch compiling details: PyTorch built with:

  • C++ Version: 199711
  • MSVC 192829337
  • Intel(R) Math Kernel Library Version 2020.0.2 Product Build 20200624 for Intel(R) 64 architecture applications
  • Intel(R) MKL-DNN v2.6.0 (Git Hash 52b5f107dd9cf10910aaa19cb47f3abf9b349815)
  • OpenMP 2019
  • LAPACK is enabled (usually provided by MKL)
  • CPU capability usage: AVX2
  • Build settings: BLAS_INFO=mkl, BUILD_TYPE=Release, CXX_COMPILER=C:/cb/pytorch_1000000000000/work/tmp_bin/sccache-cl.exe, CXX_FLAGS=/DWIN32 /D_WINDOWS /GR /EHsc /w /bigobj -DUSE_PTHREADPOOL -openmp:experimental -IC:/cb/pytorch_1000000000000/work/mkl/include -DNDEBUG -DUSE_KINETO -DLIBKINETO_NOCUPTI -DUSE_FBGEMM -DUSE_XNNPACK -DSYMBOLICATE_MOBILE_DEBUG_HANDLE -DEDGE_PROFILER_USE_KINETO, LAPACK_INFO=mkl, PERF_WITH_AVX=1, PERF_WITH_AVX2=1, PERF_WITH_AVX512=1, TORCH_VERSION=1.13.1, USE_CUDA=0, USE_CUDNN=OFF, USE_EXCEPTION_PTR=1, USE_GFLAGS=OFF, USE_GLOG=OFF, USE_MKL=ON, USE_MKLDNN=ON, USE_MPI=OFF, USE_NCCL=OFF, USE_NNPACK=OFF, USE_OPENMP=ON, USE_ROCM=OFF,

TorchVision: 0.14.1
OpenCV: 4.7.0
MMEngine: 0.1.0
MMOCR: 1.0.0rc5+27b6a68

Reproduces the problem - code sample

python mmocr/utils/collect_env.py

Reproduces the problem - command or script

python mmocr/utils/collect_env.py

Reproduces the problem - error message

Dataset Name: FUNSD
License Type: FUNSD License
License Link: https://guillaumejaume.github.io/FUNSD/work/
BibTeX: @inproceedings{jaume2019, title = {FUNSD: A Dataset for Form Understanding in Noisy Scanned Documents}, author = {Guillaume Jaume, Hazim Kemal Ekenel, Jean-Philippe Thiran}, booktitle = {Accepted to ICDAR-OST}, year = {2019}}
MMOCR does not own the dataset. Using this dataset you must accept the license provided by the owners, and cite the corresponding papers appropriately.
If you do not agree with the above license, please cancel the progress immediately by pressing ctrl+c. Otherwise, you are deemed to accept the terms and conditions.
5...
4...
3...
2...
1...
Obtaining Dataset...
Extracting: funsd.zip
Converting Dataset...
Parsing train split...
[>                                                 ] 5/149, 1.1 task/s, elapsed: 5s, ETA:   134s
multiprocessing.pool.RemoteTraceback: 
"""
Traceback (most recent call last):
  File "C:\Users\xx\miniconda3\envs\xxx\lib\multiprocessing\pool.py", line 125, in worker
    result = (True, func(*args, **kwds))
  File "c:\users\xx\mmocr\mmocr\datasets\preparers\parsers\funsd_parser.py", line 27, in parse_file
    for poly, text, ignore in self.loader(json_file):
  File "c:\users\xx\mmocr\mmocr\datasets\preparers\parsers\funsd_parser.py", line 34, in loader
    data = json.load(f)
  File "C:\Users\xx\miniconda3\envs\xxx\lib\json\__init__.py", line 293, in load
    return loads(fp.read(),
  File "C:\Users\xx\miniconda3\envs\traindet\lib\encodings\cp1252.py", line 23, in decode
    return codecs.charmap_decode(input,self.errors,decoding_table)[0]
UnicodeDecodeError: 'charmap' codec can't decode byte 0x90 in position 53977: character maps to <undefined>
"""

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "tools/dataset_converters/prepare_dataset.py", line 58, in <module>
    main()
  File "tools/dataset_converters/prepare_dataset.py", line 54, in main
    preparer()
  File "c:\users\xx\mmocr\mmocr\datasets\preparers\data_preparer.py", line 68, in __call__
    self.data_converter()
  File "c:\users\xx\mmocr\mmocr\datasets\preparers\data_converter.py", line 88, in __call__
    samples = self.parser.parse_files(files, self.current_split)
  File "c:\users\xx\mmocr\mmocr\datasets\preparers\parsers\base.py", line 47, in parse_files
    samples = track_parallel_progress(func, files, nproc=self.nproc)
  File "C:\Users\xx\miniconda3\envs\xxx\lib\site-packages\mmengine\utils\progressbar.py", line 164, in track_parallel_progress
    for result in gen:
  File "C:\Users\xx\miniconda3\envs\traindet\lib\multiprocessing\pool.py", line 868, in next
    raise value
UnicodeDecodeError: 'charmap' codec can't decode byte 0x90 in position 53977: character maps to <undefined>

Additional information

No response

hchintada, Feb 17 '23 10:02

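Thanks for your report! The traceback shows the JSON file being decoded with cp1252, the locale default encoding that open() uses on Windows when no encoding is given. Maybe try replacing the line at https://github.com/open-mmlab/mmocr/blob/0894178343af5589400ea145a861e709eef63071/mmocr/datasets/preparers/parsers/funsd_parser.py#L33 with: with open(file_path, 'r', encoding='utf-8') as f: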

Harold-lkk, Feb 23 '23 03:02
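
For anyone hitting the same error, here is a minimal sketch of the failure and of the suggested fix. It is not the actual MMOCR parser code: the helper name load_annotation, the sample annotation, and the temporary file are made up for illustration, and it assumes the FUNSD JSON files are UTF-8 encoded, as the suggested fix implies. Decoding is forced to cp1252 in the first call so the error can be reproduced on any platform, not only on Windows.

```python
import json
import os
import tempfile

# Hypothetical sample annotation (not taken from FUNSD). U+0110 is stored in
# UTF-8 as the bytes C4 90, and byte 0x90 has no mapping in Python's cp1252 codec.
sample = {"form": [{"text": "\u0110o\u00e0n", "box": [0, 0, 10, 10]}]}


def load_annotation(file_path, encoding="utf-8"):
    """Read a JSON annotation file with an explicit encoding.

    Hypothetical helper; passing encoding explicitly avoids depending on
    the platform's locale default.
    """
    with open(file_path, "r", encoding=encoding) as f:
        return json.load(f)


# Write the sample as UTF-8 to a temporary file.
tmp_dir = tempfile.mkdtemp()
json_path = os.path.join(tmp_dir, "sample.json")
with open(json_path, "w", encoding="utf-8") as f:
    json.dump(sample, f, ensure_ascii=False)

# Reading the UTF-8 file as cp1252 raises the same kind of error as in the report.
try:
    load_annotation(json_path, encoding="cp1252")
    cp1252_failed = False
except UnicodeDecodeError as err:
    cp1252_failed = True
    print(repr(err))

# Reading with an explicit UTF-8 encoding succeeds.
data = load_annotation(json_path)
print(ascii(data["form"][0]["text"]))
```

The same idea applies to any open() call in a text-mode parser: passing encoding='utf-8' makes the result independent of the machine's locale.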