UTF-8 character in filename
Describe the bug:
I'm using Label Studio to annotate data for YOLOv5 training.
To Reproduce:
After annotation I export in YOLO format, which worked great until I started a new project. This time the number of labels does not match the number of images (252 images and 266 label text files after export).
Expected behavior:
The same number of images and labels in the YOLO export, with filenames like "Capture_décran" or "Capture_d%C3%A9cran", as long as they match.
Describe the bug:
I tracked down the problem: all the missing image files have names containing an "é" (which is very common in French).
The original image file names start with "Capture_décran" (which means "screenshot"), but the exported label text files are all named "Capture_d%C3%A9cran". It seems that the "é" character is causing the problem, because C3 A9 is its UTF-8 encoding in hex.
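The mismatch described above is what URL percent-encoding of the UTF-8 bytes produces. A minimal sketch using only the Python standard library shows the round trip; that Label Studio applies this exact call during export is an assumption, not something confirmed in this thread:

```python
from urllib.parse import quote, unquote

original = "Capture_décran"

# Percent-encode the name: "é" becomes its two UTF-8 bytes, C3 and A9.
encoded = quote(original)
print(encoded)  # Capture_d%C3%A9cran

# Decoding restores the original name, so the two forms are equivalent.
print(unquote(encoded) == original)  # True

# The raw UTF-8 bytes of "é", shown as hex.
print("é".encode("utf-8").hex())  # c3a9
```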
It looks like this in Label Studio:

But for the first one the exported .txt file is called "Capture_d%C3%A9cran_de_2021-06-07_12-05-52.txt"
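Until the export itself is fixed, one possible workaround is to rename the exported label files so that they match the image names again. The sketch below is not part of Label Studio; `decode_label_filenames` and the `labels_dir` argument are hypothetical names, and it assumes the only difference between the two names is percent-encoding:

```python
from pathlib import Path
from urllib.parse import unquote


def decode_label_filenames(labels_dir):
    """Rename percent-encoded .txt files in labels_dir to their decoded names.

    Returns a list of (old_name, new_name) pairs for the files that were renamed.
    """
    renamed = []
    for path in sorted(Path(labels_dir).glob("*.txt")):
        decoded = unquote(path.name)
        if decoded == path.name:
            continue  # nothing to decode in this name
        target = path.with_name(decoded)
        if target.exists():
            continue  # never overwrite an existing label file
        path.rename(target)
        renamed.append((path.name, decoded))
    return renamed


# Example call (the path is a placeholder for the exported labels folder):
# decode_label_filenames("export/labels")
```

Running it on a copy of the export first is safer, since the rename happens in place.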
Environment:
- OS: Ubuntu 20.04.2 LTS
- Label Studio version: 1.1
Hi, thanks for the report.
It looks like a bug.
We will figure out what is going wrong.
This issue will be fixed with the new export mechanics release.
I faced the same issue recently with 1.7.
Do you use 1.7.0?
I ran a pip install label-studio last week. Dunno if it makes any difference but I'm running it on WSL on Ubuntu 20.04.4 LTS (GNU/Linux 5.10.102.1-microsoft-standard-WSL2 x86_64). Version info:
{
  "release": "1.7.2",
  "label-studio-os-package": {
    "version": "1.7.2",
    "short_version": "1.7",
    "latest_version_from_pypi": "1.7.3",
    "latest_version_upload_time": "2023-04-19T12:05:18",
    "current_version_is_outdated": true
  },
  "label-studio-os-backend": {
    "message": "fix: LSDV-4740: Video Rectangles are displaying while drawing (#1258) ...",
    "commit": "3de2ace9b53cf3ab213054a9106079aaf61796a7",
    "date": "2023-03-20 18:10:51 +0400",
    "branch": "HEAD",
    "version": "1.7.2+0.g3de2ace"
  },
  "label-studio-frontend": {
    "message": "fix: LSDV-4740: Video Rectangles are displaying while drawing (#1258)",
    "commit": "9329687afa56d7491f96f4aa2df81b15c12adf7f",
    "branch": "ls-release/1.7.2",
    "date": "2023/03/20 11:03:14"
  },
  "dm2": {
    "message": "Trigger build",
    "commit": "2ee2ddaf9596b9e600bbaf231768acd16815e498",
    "branch": "ls-release/1.7.2",
    "date": "2023-03-09T16:36:40Z"
  },
  "label-studio-converter": {
    "version": "0.0.50"
  },
  "label-studio-ml": {
    "version": "1.0.8"
  }
}
I have a similar problem, already during upload:
[8e00cc4f-a304-47e8-a4c1-74b0bf84c090] 'ascii' codec can't encode character '\xd6' in position 47: ordinal not in range(128)
It's really crazy how many projects still use ASCII, 20 years after the standardization of Unicode.
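For reference, this kind of error appears whenever a string containing a non-ASCII character such as "Ö" (`'\xd6'`) is encoded with the `ascii` codec. The snippet below is only an illustration with a made-up filename; where exactly Label Studio performs such an encoding during upload is not known from this thread:

```python
# Hypothetical filename starting with "Ö" (U+00D6), used only for illustration.
name = "\xd6l.png"

error = None
try:
    name.encode("ascii")  # ASCII only covers code points 0-127
except UnicodeEncodeError as exc:
    error = exc
    print(exc)

# Encoding the same name as UTF-8 works: "Ö" becomes the bytes C3 96.
utf8_bytes = name.encode("utf-8")
print(utf8_bytes)
```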
@renepeinl , I have a similar issue. How did you resolve it?
@floschne I didn't. It was just during evaluation of Label Studio and I don't use it anymore. However, there is not really a better free of charge solution as far as I know.
@heidi-humansignal Hi, I faced the same issue today on v1.13.1. Does this problem still remain?
Error logs for localfiles storage 22 in project 42 and job null:
File "/usr/local/lib/python3.10/dist-packages/django/db/backends/utils.py", line 84, in _execute
return self.cursor.execute(sql, params)
UnicodeEncodeError: 'utf-8' codec can't encode characters in position 37-45: surrogates not allowed
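One common way to get "surrogates not allowed" in Python is a filename whose bytes are not valid UTF-8: Python decodes such names with the `surrogateescape` error handler, and the resulting string cannot be encoded as strict UTF-8 later, for example when it is written to a database. Whether that is the cause of the log above is an assumption; the bytes below are a made-up example:

```python
# Hypothetical filename bytes: "é" stored as Latin-1 (0xE9), which is not valid UTF-8.
raw = b"Capture_d\xe9cran.png"

# This mirrors how Python decodes undecodable filename bytes on POSIX systems.
name = raw.decode("utf-8", errors="surrogateescape")
print(ascii(name))  # the invalid byte shows up as the lone surrogate \udce9

reason = None
try:
    name.encode("utf-8")  # strict UTF-8 refuses lone surrogates
except UnicodeEncodeError as exc:
    reason = exc.reason
    print(exc)

# The original bytes can still be recovered with the same error handler.
print(name.encode("utf-8", errors="surrogateescape") == raw)
```

If this is the cause, renaming the affected files on disk to valid UTF-8 names before syncing the storage would avoid the error.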