
Invalid workflow

Open miknyko opened this issue 2 years ago • 2 comments

Hi guys, thanks for this wonderful tool.

Lately I have been trying to use Datumaro for some data management on an object detection job. Say I have a COCO instances file named "1.json"; I ran this workflow:

datum create

# import only coco_instances.json
datum import -f coco_instances -n source1 <path/to/1.json>

# commit
datum commit -m "dataset added"


# Firstly I would like a data split, and put it in a lazy execution
datum transform -t split --apply false -- -t detection --subset train:.9 --subset val:.1 

# commit
datum commit -m "dataset split"

# Then I would like to filter some images i don't like, and put it in a lazy execution
datum filter -e '/item/annotation[label!="A" and label!="B"]' -m a+i source1 --apply false

# commit
datum commit -m "dataset filtered"

# export
datum export -f yolo -- --save-images

But unfortunately, I am not able to get a transformed or filtered YOLO dataset.

The resulting YOLO dataset looks as if I had NOT done anything. That is weird; did I misunderstand the workflow?

When I run datum project info, I get: [screenshot]

and with datum log, I get: [screenshot]

So how can I do the transform and the filter at the same time? I just cannot figure out where I went wrong.

Thanks, any help would be great!

miknyko avatar Sep 14 '22 10:09 miknyko

Hi! Glad you've found the tool useful. The commands you're using are correct, but when --apply=false is used, the behavior is more complicated. I'll try to explain.

With --apply=false, the working copy of the dataset is not modified immediately. When the following commit is called, Datumaro computes and records the dataset hash, but the dataset itself is not modified in the working tree. You can see that there are two equal data hashes in the stage info output. Then, when you export, Datumaro restores the latest available stage, which is the same as the original dataset.
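For intuition, here is a minimal pure-Python sketch (not Datumaro's actual hashing code) of why two identical stage hashes appear: since --apply=false leaves the working copy untouched, hashing it at the second commit produces the same digest as at the first.

```python
import hashlib

# The working copy bytes never change between the two commits when
# --apply=false is used, so both recorded stage hashes are identical.
working_copy = b"original dataset contents"

h1 = hashlib.sha1(working_copy).hexdigest()  # hash recorded at the first commit
h2 = hashlib.sha1(working_copy).hexdigest()  # hash recorded at the second commit

print(h1 == h2)  # True: the second stage looks identical to the first
```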

There are a few possible ways to obtain the required result:

  • (cli way) Manually edit proj/.datumaro/tree/config.yml and remove all the stage hashes after the first one before running the export command. This way, Datumaro will restore the original source data and re-apply the recorded stages during export.
  • (cli way) Remove (or rename) the working copy directory before committing the changes, so that the stage hash will not be computed and recorded.
  • (api way) Do all the operations from a simple Python script like this:
import datumaro as dm

# Load the COCO instances annotations
dataset = dm.Dataset.import_from('path/to/1.json', 'coco_instances')
# Split into train (90%) and val (10%) subsets
dataset.transform('split', splits=[('train', 0.9), ('val', 0.1)])
# Keep only annotations whose label is neither "A" nor "B"; drop items left empty
dataset.filter(expr='/item/annotation[label!="A" and label!="B"]', filter_annotations=True, remove_empty=True)
# Write the result in YOLO format together with the images
dataset.export('output_dir/', 'yolo', save_images=True)
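For intuition, the 90/10 split above corresponds to something like this pure-Python sketch. This is only an illustration of the ratio-based partitioning, not Datumaro's implementation (which, for the detection task, also takes the annotation label distribution into account):

```python
import random

# Hypothetical item IDs standing in for dataset items.
items = [f"img_{i}" for i in range(100)]

# Shuffle deterministically, then cut at the 90% mark.
rng = random.Random(42)
rng.shuffle(items)
cut = int(len(items) * 0.9)

subsets = {"train": items[:cut], "val": items[cut:]}
print(len(subsets["train"]), len(subsets["val"]))  # 90 10
```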

I agree that this particular situation, and the CLI experience in general, needs to be improved.

zhiltsov-max avatar Sep 14 '22 11:09 zhiltsov-max

😄 Frankly, I was not expecting such a fast reply!

Actually, I prefer the API way. I will test it and then close this issue.

Thanks again!

miknyko avatar Sep 15 '22 03:09 miknyko