
Invalid workflow

Open miknyko opened this issue 2 years ago • 2 comments

Hi guys, thanks for this wonderful tool.

Lately I have been trying to use Datumaro for some data management on an object detection job. Say I have a COCO instances file named "1.json"; I ran this workflow:

datum create

# import only coco_instances.json
datum import -f coco_instances -n source1 <path/to/1.json>

# commit
datum commit -m "dataset added"


# Firstly I would like a data split, and put it in a lazy execution
datum transform -t split --apply false -- -t detection --subset train:.9 --subset val:.1 

# commit
datum commit -m "dataset split"

# Then I would like to filter some images i don't like, and put it in a lazy execution
datum filter -e '/item/annotation[label!="A" and label!="B"]' -m a+i source1 --apply false

# commit
datum commit -m "dataset filtered"

# export
datum export -f yolo -- --save-images

But unfortunately, I am not able to get a transformed or filtered YOLO dataset.

The resulting YOLO dataset looks as if I had NOT done anything. That is weird; did I misunderstand the workflow?

When I run datum project info, I get: [screenshot]

and with datum log, I get: [screenshot]

So how can I do the transform and the filter at the same time? I just cannot figure out where I went wrong.

Thanks, any help would be great!

miknyko avatar Sep 14 '22 10:09 miknyko

Hi! Glad you've found the tool useful. The commands you're using are correct, but when --apply=false is used, the behavior is more complicated. I'll try to explain.

With --apply=false, the working copy of the dataset is not modified immediately. When the following commit is called, Datumaro computes and records the dataset hash, but the dataset itself is not modified in the working tree. You can see that there are two equal data hashes in the stage info output. Then, when you export, Datumaro restores the latest available stage, which is the same as the original dataset.
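For intuition, here is a minimal pure-Python sketch (not Datumaro's actual hashing code) of why two identical stage hashes appear: since --apply=false leaves the working copy untouched, hashing it at the second commit produces the same digest as at the first.

```python
import hashlib

# The working copy bytes never change between the two commits when
# --apply=false is used, so both recorded stage hashes are identical.
working_copy = b"original dataset contents"

h1 = hashlib.sha1(working_copy).hexdigest()  # hash recorded at the first commit
h2 = hashlib.sha1(working_copy).hexdigest()  # hash recorded at the second commit

print(h1 == h2)  # True: the second stage looks identical to the first
```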

There are a few possible ways to obtain the required result:

  • (cli way) Manually edit proj/.datumaro/tree/config.yml and remove all the stage hashes after the first one before running the export command. This way, Datumaro will restore the original source data and re-apply the recorded stages during export.
  • (cli way) Remove (or rename) the working copy directory before committing the changes, so that the stage hash will not be computed and recorded.
  • (api way) Do all the operations from a simple Python script like this:
import datumaro as dm

# Load the COCO instances annotations
dataset = dm.Dataset.import_from('path/to/1.json', 'coco_instances')
# Split into train (90%) and val (10%) subsets
dataset.transform('split', splits=[('train', 0.9), ('val', 0.1)])
# Keep only annotations whose label is neither "A" nor "B"; drop items left empty
dataset.filter(expr='/item/annotation[label!="A" and label!="B"]', filter_annotations=True, remove_empty=True)
# Write the result in YOLO format together with the images
dataset.export('output_dir/', 'yolo', save_images=True)
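For intuition, the 90/10 split above corresponds to something like this pure-Python sketch. This is only an illustration of the ratio-based partitioning, not Datumaro's implementation (which, for the detection task, also takes the annotation label distribution into account):

```python
import random

# Hypothetical item IDs standing in for dataset items.
items = [f"img_{i}" for i in range(100)]

# Shuffle deterministically, then cut at the 90% mark.
rng = random.Random(42)
rng.shuffle(items)
cut = int(len(items) * 0.9)

subsets = {"train": items[:cut], "val": items[cut:]}
print(len(subsets["train"]), len(subsets["val"]))  # 90 10
```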

I agree that this particular situation, and the CLI experience in general, needs to be improved.

zhiltsov-max avatar Sep 14 '22 11:09 zhiltsov-max

😄 Frankly, I was not expecting such a fast reply!

Actually, I prefer the API way. I will test it and then close this issue.

Thanks again!

miknyko avatar Sep 15 '22 03:09 miknyko