cvat
Exporting project with duplicate image names incorrect
Actions before raising this issue
- [X] I searched the existing issues and did not find anything similar.
- [X] I read/searched the docs
Steps to Reproduce
- Create a job with image names as follows: image_1.jpg, image_2.jpg, image_3.jpg, image_4.jpg, etc.
- Create many jobs in the same project, each with an image named image.jpg. I used the Python SDK as follows to create them:
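from cvat_sdk.core.proxies.tasks import ResourceType
from cvat_sdk.models import TaskWriteRequest

# `client` is an authenticated cvat_sdk Client; `img_path` is a pathlib.Path to the image
task = client.tasks.create(
    TaskWriteRequest("task", project_id=1),
)
task.upload_data(
    resource_type=ResourceType.LOCAL,
    resources=[str(img_path.absolute())],
    params={
        "image_quality": 85,
    },
    wait_for_completion=True,
)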
- Attempt to export the project with "Save images" checked. Notice that the images in the first job (image_1.jpg, etc.) are overwritten by the images of the other jobs.
I believe CVAT attempts to rename the conflicting names in the other jobs by adding _1, _2, etc., but it doesn't account for those names already existing in other jobs or in the current export dataset.
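To make the suspected mechanism concrete, here is a toy model of it. This is not CVAT's code: `naive_mangle` and `export_names` are hypothetical names, and the sketch only assumes that a repeated name gets an `_N` suffix without checking whether the result is already taken.

```python
import os


def naive_mangle(name: str, seen: dict[str, int]) -> str:
    """Toy model: suffix a repeated name with _N, never checking the result."""
    count = seen.get(name, 0)
    seen[name] = count + 1
    if count == 0:
        return name
    stem, ext = os.path.splitext(name)
    return f"{stem}_{count}{ext}"


def export_names(tasks: list[list[str]]) -> list[str]:
    """Mangle every frame name of every task with one shared counter table."""
    seen: dict[str, int] = {}
    return [naive_mangle(name, seen) for task in tasks for name in task]


# First task has image_1.jpg and image_2.jpg; three more tasks each have image.jpg
tasks = [["image_1.jpg", "image_2.jpg"], ["image.jpg"], ["image.jpg"], ["image.jpg"]]
names = export_names(tasks)
collisions = sorted({n for n in names if names.count(n) > 1})
print(collisions)
```

Under this assumption, the second and third image.jpg are renamed to image_1.jpg and image_2.jpg, which are exactly the names already used by the first task.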
Expected Behavior
Images should not be overwritten by images from other jobs when exporting a project.
Possible Solution
No response
Context
No response
Environment
No response
I have an approach to solve this issue. We can change the renaming mechanism in one of the following ways:
- Use unique identifiers (e.g., UUIDs, timestamps) to ensure no two images end up with the same name.
- Include the job or task ID in the filename.
These are some feasible options, please check and confirm @zhiltsov-max @alexyao2015
The second option seems like it would work as a simple fix. Alternatively, there could be a check to see if the file exists already in the export and append something else to the filename until it no longer conflicts.
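A minimal sketch of the "check and append until it no longer conflicts" idea, assuming the exporter can keep a set of names already written to the export. The helper name `deduplicate` is hypothetical and does not correspond to any CVAT function.

```python
import os


def deduplicate(name: str, used: set[str]) -> str:
    """Return `name`, or `name` with the first free _N suffix, and record it."""
    stem, ext = os.path.splitext(name)
    candidate = name
    n = 0
    while candidate in used:
        n += 1
        candidate = f"{stem}_{n}{ext}"
    used.add(candidate)
    return candidate


used: set[str] = set()
source = ["image_1.jpg", "image_2.jpg", "image.jpg", "image.jpg", "image.jpg"]
exported = [deduplicate(name, used) for name in source]
print(exported)
```

Because every candidate is checked against the names already in the export, the repeated image.jpg skips the taken image_1.jpg and image_2.jpg instead of overwriting them.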
Yeah, that's right. Should I work on this issue, @alexyao2015?
That would be great. Please go ahead.
Please assign this issue to me (◔‿◔)
@BarryByte, consider adding endpoint parameters and some UI elements to control the behavior (e.g. the prefix or filename pattern). It would be nice if you wrote a detailed description of the suggested changes first.
An even simpler way is to just rename all images to image_1, image_2, etc., without preserving the original filename.
@alexyao2015, it's already being done. The problem is that there is no way to find out the real source of the image in the exported dataset.
Right, so as you are exporting images, you export and rename each image regardless of whether its name overlaps. What's happening now is that it sees a potentially duplicate name and renames only if it is a duplicate. I would just use a simple counter, incrementing with each image, and export the images with a fixed name pattern so it's impossible to have overlapping names.
@alexyao2015, yes, it will fix the problem with name collisions. But it doesn't solve the problem with determining the origin of the frame.
Have a map with the job id and original image name to the remapped image name in memory? Is there something I'm missing?
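For illustration, the counter plus in-memory map could look roughly like this. It is only a sketch: `renumber` is a hypothetical helper, and it assumes each (job id, original name) pair is unique, i.e. names do not repeat within one job.

```python
import os
from itertools import count


def renumber(frames: list[tuple[int, str]]) -> dict[tuple[int, str], str]:
    """Map (job_id, original_name) to a counter-based name image_N<ext>."""
    counter = count(1)
    mapping: dict[tuple[int, str], str] = {}
    for job_id, name in frames:
        ext = os.path.splitext(name)[1]
        mapping[(job_id, name)] = f"image_{next(counter)}{ext}"
    return mapping


frames = [(1, "image_1.jpg"), (2, "image.jpg"), (3, "image.jpg")]
mapping = renumber(frames)
for (job_id, original), exported in mapping.items():
    print(job_id, original, "->", exported)
```

The exported names cannot collide because the counter never repeats, but the map only exists while the export runs, so it does not help someone who only has the resulting dataset.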
@alexyao2015, it's needed for users, not for export to work. The problem is: there were some images with some names in the tasks in the project. Then the project is exported in some format, with image names mangled. Now, the resulting dataset contains some modified frame names, and the user can't get their origin to do some further analysis of the exported dataset. They need to match the output names with source task or job names, but there is no way to determine this for the user.
Simple potential ways of solving the problem: provide an output mapping, or change the added suffix from _N to _job_N.
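One possible shape for such an output mapping is a small JSON manifest shipped next to the exported images. The field names and the `build_manifest` helper below are assumptions made for illustration; no existing CVAT format defines them.

```python
import json


def build_manifest(renamed: list[tuple[int, str, str]]) -> str:
    """Serialize (job_id, source_name, exported_name) triples as JSON."""
    entries = [
        {"job_id": job_id, "source_name": source, "exported_name": exported}
        for job_id, source, exported in renamed
    ]
    return json.dumps(entries, indent=2)


renamed = [
    (1, "image_1.jpg", "image_1.jpg"),
    (2, "image.jpg", "image.jpg"),
    (3, "image.jpg", "image_3.jpg"),
]
manifest_json = build_manifest(renamed)
print(manifest_json)
```

With a file like this in the archive, a user can look up any exported name and recover the job and original filename it came from, regardless of which renaming scheme produced it.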
Hi all,
Is there any update on this issue? Happy to help if needed.
Thanks, Benjamin
Hello, I'm curious whether this issue was resolved; otherwise I would be happy to contribute. I'm currently looking for an open source project bug I can work on for a school assignment.
@noahpav, no progress so far.
@zhiltsov-max I would be happy to work on this issue if you wanted to assign it to me. I also emailed you with some other questions.
Problem with the Code: The issue arises because the uploaded images have generic names (like image.jpg), which can conflict with files in other tasks or projects during dataset export. CVAT attempts to resolve conflicts by appending suffixes, but this approach is not foolproof.
Code:
import uuid
from pathlib import Path
from cvat_sdk import make_client
from cvat_sdk.core.proxies.tasks import ResourceType
from cvat_sdk.models import TaskWriteRequest

# Initialize the CVAT client (make_client logs in with the given credentials)
client = make_client("http://your-cvat-instance.com", credentials=("username", "password"))

# Path to the image directory
image_dir = Path("/path/to/images")

# Preprocess images to ensure unique file names.
# Note: this renames the files on disk. The directory listing is materialized
# first so that renamed files are not visited again during iteration.
processed_images = []
for img_path in sorted(image_dir.iterdir()):
    if img_path.is_file() and img_path.suffix in [".jpg", ".png"]:
        unique_name = f"{uuid.uuid4()}_{img_path.name}"
        unique_path = img_path.parent / unique_name
        img_path.rename(unique_path)
        processed_images.append(str(unique_path))

# Create a new task
task = client.tasks.create(
    TaskWriteRequest(
        name="Unique Task",
        project_id=1,  # Replace with your project ID
    )
)

# Upload the processed images
task.upload_data(
    resource_type=ResourceType.LOCAL,
    resources=processed_images,
    params={"image_quality": 85},
    wait_for_completion=True,
)
print(f"Task {task.id} created and images uploaded successfully!")
@zhiltsov-max I have been working on implementing a fix similar to the one suggested above, where I would append the task and project ID to filenames in cases of conflicts during export. However, I’ve been unable to locate where the actual renaming occurs in the code. I initially focused on dump_media_files() and the @exporter(name='CVAT for images', ext='ZIP', version='1.1') decorator in cvat/apps/dataset_manager/formats/cvat.py, as they seem to handle the export process for the CVAT for images 1.1 format. Despite adding debug statements, I haven't seen any output during the export process, which suggests the function might not be executing at all. I identified these functions by tracing the API calls made during export, but I’m unsure if there’s an upstream process or a different exporter handling this format. If you have any suggestions on where the renaming might occur or where else I should investigate, I’d greatly appreciate your guidance.
If locating the exact renaming logic proves difficult or impractical, consider the following alternatives:
a. Prevent Naming Conflicts Upfront
As demonstrated in the code snippet, preprocess filenames to include a unique identifier before upload. Enhance this by adding task and project IDs:
unique_name = f"{task.id}_{project_id}_{uuid.uuid4()}_{img_path.name}"
This makes filenames globally unique across tasks and projects.
b. Customize the Export Logic
Modify the @exporter decorator logic: ensure exported filenames include additional identifiers (e.g., task_id, project_id). This may involve extending dump_media_files() or related exporter functions.
c. Patch Dataset Exporter
If CVAT for images is the exporter in question, add a patch to append a unique suffix to filenames during the zip-writing process.
d. Testing and Validation
Once you've added your fix, export a dataset with conflicting filenames to verify the renaming logic works as intended. Use both the UI and API to ensure consistent behavior.
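The validation step can be partly automated by inspecting the exported archive. The sketch below uses only the standard library; the `images/` layout, the extension list, and the helper names are assumptions, and the in-memory archive merely stands in for a real export in which two of five frames were overwritten.

```python
import io
import zipfile
from collections import Counter


def exported_images(archive, extensions=(".jpg", ".jpeg", ".png")) -> list[str]:
    """List image entries in a ZIP export (path or file-like object)."""
    with zipfile.ZipFile(archive) as zf:
        return [n for n in zf.namelist() if n.lower().endswith(extensions)]


def check_export(archive, expected_frames: int) -> list[str]:
    """Return human-readable problems; an empty list means the check passed."""
    names = exported_images(archive)
    problems = []
    duplicates = sorted(n for n, c in Counter(names).items() if c > 1)
    if duplicates:
        problems.append(f"duplicate entries: {duplicates}")
    if len(names) != expected_frames:
        problems.append(f"expected {expected_frames} images, found {len(names)}")
    return problems


# Stand-in for an export of 5 frames where overwriting left only 3 files
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as zf:
    for name in ["images/image.jpg", "images/image_1.jpg",
                 "images/image_2.jpg", "annotations.xml"]:
        zf.writestr(name, b"")
buffer.seek(0)
problems = check_export(buffer, expected_frames=5)
print(problems)
```

Comparing the image count against the total number of frames in the project catches silent overwrites, which a duplicate-name check alone would miss.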
I don't think we need chatgpt generated comments sending people on a wild goose chase to solve an issue...
Hi, @noahpav. Sorry for the delay. I guess it will be better to actually add a suffix like -task_<task_id>, as proposed above. This will solve the problem and also make it possible to identify the source task of the image. Please make sure to test it with the case from the first message. The mangle_image_name() function in bindings.py is responsible for renaming.
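A rough sketch of the proposed `-task_<task_id>` suffix follows. It does not reproduce `mangle_image_name()` or its signature; `add_task_suffix` is a hypothetical standalone helper, and it assumes the suffix is applied to every frame and that names are unique within a task.

```python
import os


def add_task_suffix(name: str, task_id: int) -> str:
    """Insert -task_<task_id> before the file extension."""
    stem, ext = os.path.splitext(name)
    return f"{stem}-task_{task_id}{ext}"


# (task_id, frame name) pairs mirroring the case from the first message
frames = [(10, "image_1.jpg"), (11, "image.jpg"), (12, "image.jpg")]
exported = [add_task_suffix(name, task_id) for task_id, name in frames]
print(exported)
```

Under those assumptions, the original name and the source task can both be recovered by stripping the trailing suffix. If the suffix were added only on conflict, a further check against names already in the export would still be needed.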
Hi everyone,
I wanted to add my thoughts regarding the issue of image name conflicts during dataset export, as discussed in this thread.
Actions Before Raising This Issue
- I searched existing issues and did not find anything similar.
- I reviewed the documentation for relevant information.
Steps to Reproduce
- Create a job with images named image_1.jpg, image_2.jpg, etc.
- Create multiple jobs in the same project with images named image.jpg.
- Use the Python SDK to create tasks and upload images.
- Attempt to export the project with the "save images" option checked.
As noted, the images from the first job are overwritten by those from the other jobs due to naming conflicts.
Expected Behavior
Images should retain their original names without being overwritten by images from other jobs during export.
Possible Solution
I agree with the suggestions made in this thread regarding renaming strategies. Here are a few approaches that could work:
- Use Unique Identifiers: Appending unique identifiers (like UUIDs or timestamps) to filenames can ensure that no two images have the same name. This would prevent conflicts during export.
- Include Task or Job IDs: Modifying the renaming logic to include the task or job ID in the filename (e.g., image_task_<task_id>.jpg) would help maintain the association between the exported images and their source tasks.
- Mapping for Original Names: As mentioned, maintaining a mapping of original names to the new names could help users trace the exported images back to their source tasks, which is crucial for further analysis.
Context
I appreciate the ongoing discussions and the proposed solutions. I believe that implementing a suffix like -task_<task_id> during the export process would be a straightforward fix that addresses both the naming conflict and the need for traceability.
If there are any updates or if further assistance is needed in implementing these changes, please let me know. I’m happy to help!
Sorry for the late response.