cocosplit

--multi-class not splitting images with multiple boxes correctly

Open fama0 opened this issue 2 years ago • 3 comments

I would like to use the --multi-class option, but I noticed that it only moves the image and its first bounding box declaration into the validation set. If the image has more boxes, those are left orphaned in the training set file. Removing the flag allows that same image to have all its box definitions in the validation set. I checked that the same bug is present in the ahmad-ra repository, so it's not something that came up due to the merge.

I don't know if this is the best way of fixing it, but I took the working (non-multi-class) code as an example and modified the multi-class-specific branch to call filter_annotations, in hopes of getting all the annotations for the image. That did work: 2022-08-22-cocosplit_mod

fama0 avatar Aug 22 '22 20:08 fama0

Looks like that was an incomplete fix. With it, any image that has multiple boxes and lands in the validation set is created correctly there, but it is also created in the training set. The problem seems to be that the annotations are split properly by class, but the split doesn't take into account that some of those annotations belong to the same image file.
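To make the overlap concrete, here is a minimal sketch on toy data. The `images_for` helper is a hypothetical stand-in that is assumed to behave like cocosplit's `filter_images` (keep every image referenced by at least one of the given annotations), and the split itself is hard-coded rather than produced by `iterative_train_test_split`:

```python
# Toy data: image 1 carries two boxes of different classes.
images = [{"id": 1}, {"id": 2}]
annotations = [
    {"id": 10, "image_id": 1, "category_id": 1},
    {"id": 11, "image_id": 1, "category_id": 2},
    {"id": 12, "image_id": 2, "category_id": 1},
]


def images_for(images, anns):
    """Hypothetical stand-in for filter_images: images referenced by anns."""
    wanted = {int(a["image_id"]) for a in anns}
    return [img for img in images if int(img["id"]) in wanted]


# Assumed outcome of an annotation-level split: box 11 goes to the test set,
# boxes 10 and 12 stay in the training set.
anns_train = [annotations[0], annotations[2]]
anns_test = [annotations[1]]

img_train = images_for(images, anns_train)
img_test = images_for(images, anns_test)

# Image 1 is selected on both sides because its boxes were split apart.
shared = {img["id"] for img in img_train} & {img["id"] for img in img_test}
print(shared)  # {1}
```

Because the split works per annotation, nothing keeps the two boxes of image 1 together, so the image is pulled into both sets.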

fama0 avatar Aug 22 '22 20:08 fama0

I'm sure there's a better way that actually uses iterative_train_test_split, but here is my quick and dirty fix. It collects all the images that were assigned to validation and removes them from the training set before grabbing all their annotations. This does mean the split won't match the requested ratio and will actually be very different between the two modes. It gives priority to the test set.

if args.multi_class:
    # Category id of every annotation, in annotation order
    annotation_categories = funcy.lmap(lambda a: int(a['category_id']), annotations)

    # Bottleneck 1:
    # remove classes that have only one sample, because they can't be split into the training and testing sets
    annotation_categories = funcy.lremove(lambda i: annotation_categories.count(i) <= 1, annotation_categories)

    filtered_annotations = funcy.lremove(lambda i: i['category_id'] not in annotation_categories, annotations)

    # Stratified split at the annotation level (one row per annotation)
    X_train, y_train, X_test, y_test = iterative_train_test_split(np.array([filtered_annotations]).T, np.array([annotation_categories]).T, test_size=1 - args.split)

    img_train = filter_images(images, X_train.reshape(-1))
    img_test = filter_images(images, X_test.reshape(-1))

    # Give priority to the test set: drop from training every image that was also picked for testing
    image_test_ids = funcy.lmap(lambda i: int(i['id']), img_test)
    img_train = funcy.lremove(lambda a: int(a['id']) in image_test_ids, img_train)

    # Collect all annotations of each image, not only the ones that were split
    anns_train = filter_annotations(annotations, img_train)
    anns_test = filter_annotations(annotations, img_test)

    save_coco(args.train, info, licenses, img_train, anns_train, categories)
    save_coco(args.test, info, licenses, img_test, anns_test, categories)

    print("Saved {} entries in {} and {} in {}".format(len(anns_train), args.train, len(anns_test), args.test))
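A quick way to confirm the result is a sanity check along these lines. This is only a sketch: `check_split` is a hypothetical helper, not part of cocosplit, and it assumes the usual COCO fields `id` on images and `image_id` on annotations:

```python
def check_split(img_train, anns_train, img_test, anns_test):
    """Return (shared image ids, orphan train ann ids, orphan test ann ids)."""
    train_ids = {int(i["id"]) for i in img_train}
    test_ids = {int(i["id"]) for i in img_test}

    # Images that ended up in both sets
    overlap = train_ids & test_ids

    # Annotations whose image is missing from their own set
    orphans_train = [a["id"] for a in anns_train if int(a["image_id"]) not in train_ids]
    orphans_test = [a["id"] for a in anns_test if int(a["image_id"]) not in test_ids]
    return overlap, orphans_train, orphans_test


# Toy example: image 1 is only in test, yet one of its boxes stayed in train.
img_train = [{"id": 2}]
img_test = [{"id": 1}]
anns_train = [{"id": 10, "image_id": 1}, {"id": 12, "image_id": 2}]
anns_test = [{"id": 11, "image_id": 1}]

result = check_split(img_train, anns_train, img_test, anns_test)
print(result)  # (set(), [10], [])
```

An empty overlap and two empty orphan lists would indicate that every image is in exactly one set together with all of its boxes.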

fama0 avatar Aug 22 '22 22:08 fama0

@fama0 This seems to work well, thanks!

Arunavameister avatar Sep 20 '22 11:09 Arunavameister