UTF-8 character in filename

Open Theo-HENAFF opened this issue 4 years ago • 8 comments

Describe the bug I'm using label-studio to annotate data for yolov5 training.

To Reproduce After annotation I do an export in YOLO format which was working great until I started a new project. This time I don't get the same count of labels and images (252 images and 266 label text files after export).

Expected behavior Same number of images and labels in YOLO export, with filename like "Capture_décran" or "Capture_d%C3%A9cran" as long as they match.

Describe the bug I tracked down the problem: all the missing image files contain an "é" (which is very common in French).

The original image file names start with "Capture_décran" (which means "screenshot"), but the exported label text files are all named "Capture_d%C3%A9cran". The "é" character seems to be causing the problem, because C3 A9 is its UTF-8 encoding in hex.
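For reference, the mismatch described above is what URL percent-encoding of the UTF-8 bytes produces. This is a minimal sketch using Python's standard `urllib.parse`; it only illustrates the encoding and does not claim to show the code path Label Studio uses during export.

```python
from urllib.parse import quote, unquote

original = "Capture_décran"

# "é" is U+00E9; its UTF-8 encoding is the two bytes 0xC3 0xA9.
utf8_bytes = "é".encode("utf-8")

# Percent-encoding writes every non-ASCII byte as %XX.
encoded = quote(original)

# Decoding reverses the transformation.
decoded = unquote(encoded)

print(utf8_bytes.hex().upper())  # C3A9
print(encoded)                   # Capture_d%C3%A9cran
print(decoded == original)       # True
```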

It looks like this in Label Studio: [image: Capture d’écran du 2021-08-10 09-29-07]

But for the first one the exported .txt file is called "Capture_d%C3%A9cran_de_2021-06-07_12-05-52.txt"
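Until the export is fixed, one possible workaround is to decode the label file names after exporting so that they match the original image names. The sketch below is an assumption-laden helper, not part of Label Studio: the function name `decode_label_names` and the `labels` directory are hypothetical, and it only renames the `.txt` files. It does not restore images that are missing from the export.

```python
from pathlib import Path
from urllib.parse import unquote


def decode_label_names(labels_dir: Path, dry_run: bool = True) -> list[tuple[str, str]]:
    """Rename percent-encoded .txt files to their decoded names.

    Returns (old_name, new_name) pairs. With dry_run=True nothing is renamed.
    """
    changes = []
    for path in sorted(labels_dir.glob("*.txt")):
        decoded = unquote(path.name)
        if decoded == path.name:
            continue  # name contains nothing to decode
        if "/" in decoded or "\\" in decoded:
            continue  # decoded name would contain a path separator
        target = path.with_name(decoded)
        if target.exists():
            continue  # never overwrite an existing file
        changes.append((path.name, decoded))
        if not dry_run:
            path.rename(target)
    return changes


if __name__ == "__main__":
    # "labels" is a hypothetical path to the exported YOLO label folder.
    for old, new in decode_label_names(Path("labels"), dry_run=True):
        print(f"{old} -> {new}")
```

Running with `dry_run=True` first lists the planned renames without touching any file; pass `dry_run=False` once the list looks right.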

Environment:

  • OS: Ubuntu 20.04.2 LTS
  • Label Studio Version : 1.1

Theo-HENAFF avatar Aug 10 '21 07:08 Theo-HENAFF

Hi, thanks for the report.

It looks like a bug.

We will figure out what is going wrong.

chiganov avatar Sep 30 '21 14:09 chiganov

This issue will be fixed with the new export mechanics release.

makseq avatar Oct 12 '21 01:10 makseq

I have faced the same issue recently with 1.7

csanadpoda avatar Apr 25 '23 15:04 csanadpoda

Do you use 1.7.0?

makseq avatar May 03 '23 02:05 makseq

> Do you use 1.7.0?

I ran a pip install label-studio last week. Dunno if it makes any difference but I'm running it on WSL on Ubuntu 20.04.4 LTS (GNU/Linux 5.10.102.1-microsoft-standard-WSL2 x86_64). Version info:

{
    "release": "1.7.2",
    "label-studio-os-package": {
        "version": "1.7.2",
        "short_version": "1.7",
        "latest_version_from_pypi": "1.7.3",
        "latest_version_upload_time": "2023-04-19T12:05:18",
        "current_version_is_outdated": true
    },
    "label-studio-os-backend": {
        "message": "fix: LSDV-4740: Video Rectangles are displaying while drawing (#1258)  ...",
        "commit": "3de2ace9b53cf3ab213054a9106079aaf61796a7",
        "date": "2023-03-20 18:10:51 +0400",
        "branch": "HEAD",
        "version": "1.7.2+0.g3de2ace"
    },
    "label-studio-frontend": {
        "message": "fix: LSDV-4740: Video Rectangles are displaying while drawing (#1258)",
        "commit": "9329687afa56d7491f96f4aa2df81b15c12adf7f",
        "branch": "ls-release/1.7.2",
        "date": "2023/03/20 11:03:14"
    },
    "dm2": {
        "message": "Trigger build",
        "commit": "2ee2ddaf9596b9e600bbaf231768acd16815e498",
        "branch": "ls-release/1.7.2",
        "date": "2023-03-09T16:36:40Z"
    },
    "label-studio-converter": {
        "version": "0.0.50"
    },
    "label-studio-ml": {
        "version": "1.0.8"
    }
}

csanadpoda avatar May 04 '23 11:05 csanadpoda

I have a similar problem already during upload:

[8e00cc4f-a304-47e8-a4c1-74b0bf84c090] 'ascii' codec can't encode character '\xd6' in position 47: ordinal not in range(128)

It's really crazy how many projects do still use ASCII, 20 years after standardization of Unicode.
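The message above is the one Python raises when a `str` containing a non-ASCII character is encoded with the `ascii` codec ('\xd6' is "Ö", U+00D6). The thread does not show where in Label Studio this happens, so the following is only a minimal reproduction of the error class with a hypothetical file name, not a diagnosis.

```python
# Hypothetical file name; "\xd6" is "Ö" (U+00D6).
name = "\xd6l.png"

message = ""
try:
    name.encode("ascii")
except UnicodeEncodeError as exc:
    message = str(exc)

print(message)
# 'ascii' codec can't encode character '\xd6' in position 0: ordinal not in range(128)

# The same string encodes without error as UTF-8.
utf8 = name.encode("utf-8")
print(utf8)  # b'\xc3\x96l.png'
```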

renepeinl avatar Feb 15 '24 09:02 renepeinl

@renepeinl , I have a similar issue. How did you resolve it?

floschne avatar Sep 09 '24 15:09 floschne

@floschne I didn't. It was just during evaluation of Label Studio and I don't use it anymore. However, there is not really a better free of charge solution as far as I know.

renepeinl avatar Sep 10 '24 11:09 renepeinl

@heidi-humansignal Hi, I faced the same issue today on v1.13.1. Does this problem still remain?

Error logs for localfiles storage 22 in project 42 and job null:

  File "/usr/local/lib/python3.10/dist-packages/django/db/backends/utils.py", line 84, in _execute
    return self.cursor.execute(sql, params)
UnicodeEncodeError: 'utf-8' codec can't encode characters in position 37-45: surrogates not allowed
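A "surrogates not allowed" error like the one above typically appears when a string carries lone surrogate code points. One way such strings arise in Python on POSIX systems is a file name on disk whose bytes are not valid UTF-8: Python decodes it with the `surrogateescape` error handler, and the result cannot be encoded as strict UTF-8 later. Whether that is the cause here is an assumption; the thread does not confirm it. A minimal sketch with a hypothetical Latin-1 encoded file name:

```python
# Hypothetical file name bytes: "é" stored as Latin-1 (0xE9), which is invalid UTF-8.
raw = b"Capture_d\xe9cran.png"

# Mimic how Python decodes file names on POSIX (errors="surrogateescape").
name = raw.decode("utf-8", errors="surrogateescape")


def has_surrogates(text: str) -> bool:
    """Return True if text cannot be encoded as strict UTF-8."""
    try:
        text.encode("utf-8")
    except UnicodeEncodeError:
        return True
    return False


message = ""
try:
    name.encode("utf-8")
except UnicodeEncodeError as exc:
    message = str(exc)

print(has_surrogates(name))  # True
print(message)

# The original bytes are still recoverable with the same error handler.
restored = name.encode("utf-8", errors="surrogateescape")
print(restored == raw)  # True
```

A check such as `has_surrogates` could be run over the names in a local storage folder to find the files that need renaming before they are synced.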

yellowjs0304 avatar Jan 22 '25 07:01 yellowjs0304