mmocr
UnicodeDecodeError: 'charmap' codec can't decode byte - while running prepare_dataset.py on funsd dataset
Prerequisite
- [X] I have searched Issues and Discussions but cannot get the expected help.
- [X] The bug has not been fixed in the latest version (0.x) or latest version (1.x).
Task
I'm using the official example scripts/configs for the officially supported tasks/models/datasets.
Branch
main branch https://github.com/open-mmlab/mmocr
Environment
sys.platform: win32
Python: 3.8.16 (default, Jan 17 2023, 22:25:28) [MSC v.1916 64 bit (AMD64)]
CUDA available: False
numpy_random_seed: 2147483648
MSVC: Microsoft (R) C/C++ Optimizing Compiler Version 19.34.31937 for x64
GCC: n/a
PyTorch: 1.13.1
PyTorch compiling details: PyTorch built with:
- C++ Version: 199711
- MSVC 192829337
- Intel(R) Math Kernel Library Version 2020.0.2 Product Build 20200624 for Intel(R) 64 architecture applications
- Intel(R) MKL-DNN v2.6.0 (Git Hash 52b5f107dd9cf10910aaa19cb47f3abf9b349815)
- OpenMP 2019
- LAPACK is enabled (usually provided by MKL)
- CPU capability usage: AVX2
- Build settings: BLAS_INFO=mkl, BUILD_TYPE=Release, CXX_COMPILER=C:/cb/pytorch_1000000000000/work/tmp_bin/sccache-cl.exe, CXX_FLAGS=/DWIN32 /D_WINDOWS /GR /EHsc /w /bigobj -DUSE_PTHREADPOOL -openmp:experimental -IC:/cb/pytorch_1000000000000/work/mkl/include -DNDEBUG -DUSE_KINETO -DLIBKINETO_NOCUPTI -DUSE_FBGEMM -DUSE_XNNPACK -DSYMBOLICATE_MOBILE_DEBUG_HANDLE -DEDGE_PROFILER_USE_KINETO, LAPACK_INFO=mkl, PERF_WITH_AVX=1, PERF_WITH_AVX2=1, PERF_WITH_AVX512=1, TORCH_VERSION=1.13.1, USE_CUDA=0, USE_CUDNN=OFF, USE_EXCEPTION_PTR=1, USE_GFLAGS=OFF, USE_GLOG=OFF, USE_MKL=ON, USE_MKLDNN=ON, USE_MPI=OFF, USE_NCCL=OFF, USE_NNPACK=OFF, USE_OPENMP=ON, USE_ROCM=OFF,
TorchVision: 0.14.1
OpenCV: 4.7.0
MMEngine: 0.1.0
MMOCR: 1.0.0rc5+27b6a68
Reproduces the problem - code sample
python mmocr/utils/collect_env.py
Reproduces the problem - command or script
python mmocr/utils/collect_env.py
Reproduces the problem - error message
Dataset Name: FUNSD
License Type: FUNSD License
License Link: https://guillaumejaume.github.io/FUNSD/work/
BibTeX: @inproceedings{jaume2019, title = {FUNSD: A Dataset for Form Understanding in Noisy Scanned Documents}, author = {Guillaume Jaume, Hazim Kemal Ekenel, Jean-Philippe Thiran}, booktitle = {Accepted to ICDAR-OST}, year = {2019}}
MMOCR does not own the dataset. Using this dataset you must accept the license provided by the owners, and cite the corresponding papers appropriately.
If you do not agree with the above license, please cancel the progress immediately by pressing ctrl+c. Otherwise, you are deemed to accept the terms and conditions.
5...
4...
3...
2...
1...
Obtaining Dataset...
Extracting: funsd.zip
Converting Dataset...
Parsing train split...
[> ] 5/149, 1.1 task/s, elapsed: 5s, ETA: 134s
multiprocessing.pool.RemoteTraceback:
"""
Traceback (most recent call last):
  File "C:\Users\xx\miniconda3\envs\xxx\lib\multiprocessing\pool.py", line 125, in worker
    result = (True, func(*args, **kwds))
  File "c:\users\xx\mmocr\mmocr\datasets\preparers\parsers\funsd_parser.py", line 27, in parse_file
    for poly, text, ignore in self.loader(json_file):
  File "c:\users\xx\mmocr\mmocr\datasets\preparers\parsers\funsd_parser.py", line 34, in loader
    data = json.load(f)
  File "C:\Users\xx\miniconda3\envs\xxx\lib\json\__init__.py", line 293, in load
    return loads(fp.read(),
  File "C:\Users\xx\miniconda3\envs\traindet\lib\encodings\cp1252.py", line 23, in decode
    return codecs.charmap_decode(input,self.errors,decoding_table)[0]
UnicodeDecodeError: 'charmap' codec can't decode byte 0x90 in position 53977: character maps to <undefined>
"""
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
  File "tools/dataset_converters/prepare_dataset.py", line 58, in <module>
    main()
  File "tools/dataset_converters/prepare_dataset.py", line 54, in main
    preparer()
  File "c:\users\xx\mmocr\mmocr\datasets\preparers\data_preparer.py", line 68, in __call__
    self.data_converter()
  File "c:\users\xx\mmocr\mmocr\datasets\preparers\data_converter.py", line 88, in __call__
    samples = self.parser.parse_files(files, self.current_split)
  File "c:\users\xx\mmocr\mmocr\datasets\preparers\parsers\base.py", line 47, in parse_files
    samples = track_parallel_progress(func, files, nproc=self.nproc)
  File "C:\Users\xx\miniconda3\envs\xxx\lib\site-packages\mmengine\utils\progressbar.py", line 164, in track_parallel_progress
    for result in gen:
  File "C:\Users\xx\miniconda3\envs\traindet\lib\multiprocessing\pool.py", line 868, in next
    raise value
UnicodeDecodeError: 'charmap' codec can't decode byte 0x90 in position 53977: character maps to <undefined>
Additional information
No response
Thanks for your report! Could you try replacing the line at https://github.com/open-mmlab/mmocr/blob/0894178343af5589400ea145a861e709eef63071/mmocr/datasets/preparers/parsers/funsd_parser.py#L33 with the following?
with open(file_path, 'r', encoding='utf-8') as f:
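For context: when open() is called without an encoding argument on Windows with Python 3.8, it falls back to the locale's preferred encoding (cp1252 in the traceback above), and cp1252 has no mapping for byte 0x90. The snippet below is a minimal sketch of why passing encoding='utf-8' should help, assuming the FUNSD annotation files are UTF-8 encoded. The helper name load_json_utf8 and the sample file are hypothetical and are not part of MMOCR.

```python
import json
import os
import tempfile


def load_json_utf8(file_path):
    """Hypothetical helper: read a JSON file as UTF-8, ignoring the locale default."""
    with open(file_path, 'r', encoding='utf-8') as f:
        return json.load(f)


# Write a small sample file containing U+0410, whose UTF-8 encoding is the
# byte pair 0xD0 0x90, so the file contains the byte reported in the error.
tmp_dir = tempfile.mkdtemp()
sample_path = os.path.join(tmp_dir, 'sample.json')
with open(sample_path, 'w', encoding='utf-8') as f:
    json.dump({'text': '\u0410'}, f, ensure_ascii=False)

# Decoding the same bytes as cp1252 fails, because 0x90 is undefined there.
try:
    with open(sample_path, 'r', encoding='cp1252') as f:
        json.load(f)
    cp1252_failed = False
except UnicodeDecodeError:
    cp1252_failed = True

# Decoding explicitly as UTF-8 succeeds on any platform.
data = load_json_utf8(sample_path)
```

Stating the encoding explicitly makes the parser behave the same on Windows and Linux, rather than depending on each machine's locale settings.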