datumaro
Invalid workflow
Hi guys, thanks for this wonderful tool.
Lately I have been trying to use datumaro for data management in an object detection job. Say I have a coco_instances file named "1.json". I followed this workflow:
datum create
# import only coco_instances.json
datum import -f coco_instances -n source1 <path/to/1.json>
# commit
datum commit -m "dataset added"
# First, I would like to split the data, using lazy execution
datum transform -t split --apply false -- -t detection --subset train:.9 --subset val:.1
# commit
datum commit -m "dataset split"
# Then I would like to filter out some images I don't like, again using lazy execution
datum filter -e '/item/annotation[label!="A" and label!="B"]' -m a+i source1 --apply false
# commit
datum commit -m "dataset filtered"
# export
datum export -f yolo -- --save-images
Unfortunately, I am not able to get a transformed or filtered yolo dataset.
The resulting yolo dataset looks as if I had NOT done anything. That is weird. Did I misunderstand the workflow?
When I run datum project info and datum log, I get: [command output not preserved]
So how can I apply transform and filter together? I just cannot figure out where I went wrong.
Thanks, any help would be great!
Hi! Glad you've found the tool useful. The commands you're using are correct, but when --apply=false is used, the behavior is more complicated. I'll try to explain.
With --apply=false, the working copy of the dataset is not modified immediately. When the following commit is called, Datumaro computes and records the dataset hash for the new stage, but the dataset in the working tree is still unmodified, so the recorded hash is that of the original data. You can see that there are 2 equal data hashes in the stage info output. Then, when you export, Datumaro restores the latest available stage by its hash, and it is the same as the original dataset.
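To make the mechanism above easier to follow, here is a small toy model in plain Python. It is only a sketch of the behavior described in this reply, not Datumaro's actual implementation: the `ToyProject` class, its methods, and the `drop_odd` stage are all hypothetical names made up for illustration. It shows how committing after a lazy stage records the hash of the unchanged working copy, and how export then restores that unchanged data.

```python
import hashlib
import json


def data_hash(items):
    """Content hash of a dataset; stands in for a recorded stage hash."""
    payload = json.dumps(items, sort_keys=True).encode()
    return hashlib.sha1(payload).hexdigest()[:8]


class ToyProject:
    """Hypothetical model: a source, a list of stages, and a hash -> data cache."""

    def __init__(self, source_items):
        self.source = list(source_items)
        self.working_copy = list(source_items)
        self.stages = [{"name": "source", "fn": None, "hash": None}]
        self.cache = {}

    def add_stage(self, name, fn, apply=True):
        self.stages.append({"name": name, "fn": fn, "hash": None})
        if apply:
            # Eager mode: the working copy is modified right away
            self.working_copy = fn(self.working_copy)

    def commit(self):
        # The hash comes from the working copy as it currently is,
        # even if the latest stage has not been applied to it
        h = data_hash(self.working_copy)
        self.stages[-1]["hash"] = h
        self.cache[h] = list(self.working_copy)

    def export(self):
        # Restore the latest stage that has a cached hash,
        # then re-apply only the stages that come after it
        data, start = list(self.source), 1
        for i in range(len(self.stages) - 1, -1, -1):
            h = self.stages[i]["hash"]
            if h is not None and h in self.cache:
                data, start = list(self.cache[h]), i + 1
                break
        for stage in self.stages[start:]:
            data = stage["fn"](data)
        return data


def drop_odd(items):
    """Example stage: keep only even numbers."""
    return [x for x in items if x % 2 == 0]


project = ToyProject([1, 2, 3, 4])
project.commit()                                    # "dataset added"
project.add_stage("filter", drop_odd, apply=False)  # lazy stage
project.commit()                                    # hashes the unchanged data
equal_hashes = project.stages[0]["hash"] == project.stages[1]["hash"]
lazy_export = project.export()                      # restores the original data

# Without a recorded hash for the lazy stage, export has to rebuild it
project.stages[-1]["hash"] = None
rebuilt_export = project.export()
```

In this model `lazy_export` is the untouched input, while `rebuilt_export` has the stage applied, because the export step found no stored result for the last stage and had to recompute it from the previous one.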
There are a few possible ways to obtain the required result:
- (cli way) Manually edit proj/.datumaro/tree/config.yml and remove all the stage hashes after the first one before running the export command. This way, Datumaro will restore the original source data and re-apply the added stages during exporting.
- (cli way) Remove (or rename) the working copy dir before committing the changes, so that the stage hash will not be computed and recorded.
- (api way) Do all the operations from a simple python script like this:
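import datumaro as dm

# Import the COCO instances annotations
dataset = dm.Dataset.import_from('path/to/1.json', 'coco_instances')
# Split 90/10; task='detection' mirrors the "-t detection" CLI argument (assumed to be required by the split transform)
dataset.transform('split', task='detection', splits=[('train', 0.9), ('val', 0.1)])
# Keep only annotations labeled neither "A" nor "B", then drop items left without annotations
dataset.filter(expr='/item/annotation[label!="A" and label!="B"]', filter_annotations=True, remove_empty=True)
dataset.export('output_dir/', 'yolo', save_images=True)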
I agree that this particular situation and CLI experience in general needs to be improved.
😄 Frankly, I was not expecting such a fast reply!
Actually I prefer the api way. I will test it and then close this issue.
Thanks again!