fiftyone
[BUG] fo.Dataset.from_dir() does not work for dataset_type=fo.types.VOCDetectionDataset if images, labels are in the same directory
System information
- OS Platform and Distribution (e.g., Linux Ubuntu 16.04): MacOS 10.15.7
- FiftyOne installed from (pip or source): pip
- FiftyOne version (run fiftyone --version): FiftyOne v0.15.1, Voxel51, Inc.
- Python version: Python 3.6.8
Commands to reproduce
As thoroughly as possible, please provide the Python and/or shell commands used to encounter the issue. Application steps can be described in the next section.
import fiftyone as fo
# I put my .jpg and VOC .xml files in the same directory. foo.jpg has a foo.xml label file for example.
dataset_images_path = '/data/scraped_jan_2022'
dataset_labels_path = '/data/scraped_jan_2022'
dataset_name = 'scraped_jan_2022'
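# Pass the path variables defined above (the original snippet referenced an undefined dataset_path).
dataset = fo.Dataset.from_dir(
    dataset_type=fo.types.VOCDetectionDataset,
    data_path=dataset_images_path,
    labels_path=dataset_labels_path,
    name=dataset_name,
)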
The resulting Dataset has no images in it.
Describe the problem
When importing fo.types.VOCDetectionDataset images and labels via fo.Dataset.from_dir(), the resulting Dataset will be empty if the VOC .xml files are in the same directory as the image files.
It is common to have foo.xml VOC label files alongside foo.jpg in the same directory, but from_dir() doesn't handle this.
The problem appears to be that ImportPathsMixin._load_data_map() returns a dict whose values are .xml file paths instead of .jpg file paths.
Note that eta.core.utils.list_files(), which gets called by ImportPathsMixin._load_data_map(), sorts the files. So foo.jpg will be found first, and foo.xml will be found second. Thus the .xml file path overwrites the .jpg file path in the dict created by ImportPathsMixin._load_data_map().
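To illustrate the overwrite, here is a minimal sketch of what I believe is happening. It is not the actual FiftyOne code: build_data_map is a hypothetical stand-in that assumes the map is keyed by the filename without its extension and that files are visited in sorted order.

```python
import os


def build_data_map(filenames):
    """Hypothetical stand-in: map each filename stem (uuid) to its filename."""
    data_map = {}
    for name in sorted(filenames):  # sorted order, as described above
        uuid = os.path.splitext(name)[0]
        # A later file with the same stem silently replaces the earlier one
        data_map[uuid] = name
    return data_map


files = ["foo.xml", "foo.jpg", "bar.jpg", "bar.xml"]
data_map = build_data_map(files)
print(data_map)  # every value is an .xml filename; the .jpg entries are gone
```

Because "foo.jpg" sorts before "foo.xml", the label file is always written last, so no image path survives in the map.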
Code to reproduce issue
Provide a reproducible test case that is the bare minimum necessary to generate the problem.
Other info / logs
Include any logs or source code that would be helpful to diagnose the problem. If including tracebacks, please include the full traceback. Large logs and files should be attached. Please do not use screenshots for sharing text. Code snippets should be used instead when providing tracebacks, logs, etc.
What areas of FiftyOne does this bug affect?
- [ ] App: FiftyOne application issue
- [X] Core: Core fiftyone Python library issue
- [ ] Server: Fiftyone server issue
Willingness to contribute
The FiftyOne Community encourages bug fix contributions. Would you or another member of your organization be willing to contribute a fix for this bug to the FiftyOne codebase?
- [ ] Yes. I can contribute a fix for this bug independently.
- [ ] Yes. I would be willing to contribute a fix for this bug with guidance from the FiftyOne community.
- [ ] No. I cannot contribute a bug fix at this time.
Not really a bug because the code supports what is explicitly documented here, but I agree it would be nice to make the importer "smarter" so that everything can be in the same directory if desired.
I'm seeing a mismatch between the docs and the code. The docs that @brimoor points to only refer to how the dataset_dir argument to fo.Dataset.from_dir() works. But fo.Dataset.from_dir() accepts labels_path and data_path arguments too. So I was expecting things to work if one specifies those arguments instead of using dataset_dir.
@brimoor I narrowed down the problem to this function call, which maps uuid keys to .xml paths:
https://github.com/voxel51/fiftyone/blob/7133bbda4459e8fc9cfad876828ff5e95131156e/fiftyone/utils/voc.py#L184-L186
To be more precise, the uuid keys are updated twice (once for the images and a second time for the .xml files):
https://github.com/voxel51/fiftyone/blob/7133bbda4459e8fc9cfad876828ff5e95131156e/fiftyone/utils/data/importers.py#L715-L718
I am currently thinking about how to solve this issue. One option would be to avoid updating a uuid key when the extension of the value is not an image extension.
The uuid -> labels mapping will also require some attention. Any suggestions?
This issue should also extend to the CVAT and KITTI formats. I would be able to work on it next weekend.
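A rough sketch of the extension-filtering idea, for discussion only. The helper names and the IMAGE_EXTENSIONS set are assumptions of mine, not existing FiftyOne APIs, and a real fix would need to live in the importer code linked above.

```python
import os

# Assumed set of image extensions; the real list would come from the library
IMAGE_EXTENSIONS = {".jpg", ".jpeg", ".png", ".bmp"}


def build_image_map(filenames):
    """Hypothetical helper: map uuid -> image filename, ignoring non-images."""
    image_map = {}
    for name in sorted(filenames):
        uuid, ext = os.path.splitext(name)
        if ext.lower() in IMAGE_EXTENSIONS:
            image_map[uuid] = name
    return image_map


def build_labels_map(filenames, labels_ext=".xml"):
    """Hypothetical helper: map uuid -> label filename for one label extension."""
    labels_map = {}
    for name in sorted(filenames):
        uuid, ext = os.path.splitext(name)
        if ext.lower() == labels_ext:
            labels_map[uuid] = name
    return labels_map


files = ["foo.jpg", "foo.xml", "bar.jpg", "bar.xml"]
image_map = build_image_map(files)
labels_map = build_labels_map(files)
```

Keeping the two maps separate means images and labels can share a directory without one overwriting the other, and the uuid keys still line up between them.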