Utility for syncing training, validation, and evaluation data.

Open markmester opened this issue 6 years ago • 1 comments

In most of the datasets I'm putting together, the masks and tiles do not always match one-to-one. At the very least, the documentation should clarify that the trainer needs a directory where all files are in sync. Even better would be to provide a simple pre-processing script for syncing the masks and tiles, or to add an option in rs_trainer to ignore or remove un-synced masks/tiles.

Currently I just use a simple Python script to sync the two directories:

import os
import argparse

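def dir_dict(root: str) -> dict:
    """Map each file's last three path components (e.g. z/x/y.png) to its full path."""
    dd = {}

    for subdir, _dirs, files in os.walk(root):
        for name in files:
            path = os.path.join(subdir, name)
            # Split on the platform separator so this also works on Windows.
            key = "/".join(path.split(os.sep)[-3:])
            dd[key] = path

    return dd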

def main():
    parser = argparse.ArgumentParser()
    parser.add_argument('dir1', type=str)
    parser.add_argument('dir2', type=str)
    args = parser.parse_args()

    removed = []

    dir1_dict = dir_dict(args.dir1)
    dir2_dict = dir_dict(args.dir2)

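    # Collect files that exist in only one of the two directories.
    for k, v in dir1_dict.items():
        if k not in dir2_dict:
            removed.append(v)

    for k, v in dir2_dict.items():
        if k not in dir1_dict:
            removed.append(v)

    # Note: this deletes the un-synced files permanently.
    for file in removed:
        os.remove(file)

    return len(removed)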

    
if __name__ == "__main__":
    print(f"removed {main()} un-synced files")
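
Since the script above deletes files right away, a dry run that only reports the un-synced files may be a safer first step. The following is a minimal sketch and not part of robosat: the names `tile_keys` and `unsynced` are hypothetical, and it assumes both directories follow a `z/x/y.<ext>` layout directly below the directory passed in. Unlike the script above, it keys on the path relative to each root and drops the extension, so an image `z/x/y.webp` would still pair with a mask `z/x/y.png`.

```python
import os


def tile_keys(root: str) -> dict:
    """Map 'z/x/y' (extension dropped) to the full path of each file under root."""
    keys = {}
    for subdir, _dirs, files in os.walk(root):
        for name in files:
            path = os.path.join(subdir, name)
            rel = os.path.relpath(path, root)
            stem, _ext = os.path.splitext(rel)
            # Normalise separators so keys compare equal across platforms.
            keys[stem.replace(os.sep, "/")] = path
    return keys


def unsynced(dir1: str, dir2: str) -> list:
    """Return paths present in only one of the two directories; deletes nothing."""
    a, b = tile_keys(dir1), tile_keys(dir2)
    only_a = [a[k] for k in a.keys() - b.keys()]
    only_b = [b[k] for k in b.keys() - a.keys()]
    return sorted(only_a + only_b)
```

Calling `unsynced("images", "labels")` (directory names are placeholders) returns the list of unpaired files, which can be reviewed before anything is removed.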

Oct 16 '19 23:10 markmester

See https://github.com/mapbox/robosat/issues/93 and https://github.com/mapbox/robosat/issues/93#issuecomment-408142081

We should keep the user responsible for preparing the dataset and making sure it's in sync. What we could do in the context of #91 is to go through our assertions and make them easier to understand (and show ways to solve the problem) for our users.

The pre-condition for rs train is a dataset directory containing matching pairs of images and labels.
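
To illustrate what a more helpful check of this pre-condition could look like, here is a minimal sketch. It is not robosat's actual code: `check_pairs` is a hypothetical helper, and it assumes tiles are identified by plain strings such as `"z/x/y"`. The idea is that the error names the unpaired tiles and suggests a fix instead of failing with a bare assertion.

```python
def check_pairs(image_tiles, label_tiles):
    """Raise a descriptive error unless every image tile has a label tile and vice versa."""
    images, labels = set(image_tiles), set(label_tiles)
    missing_labels = sorted(images - labels)
    missing_images = sorted(labels - images)
    if missing_labels or missing_images:
        # Show at most three examples per side to keep the message readable.
        raise ValueError(
            "dataset is not in sync: "
            f"{len(missing_labels)} image(s) without a label (e.g. {missing_labels[:3]}), "
            f"{len(missing_images)} label(s) without an image (e.g. {missing_images[:3]}). "
            "Remove the unpaired tiles or add the missing counterparts before training."
        )
```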

I agree that we could make this clearer in the README, though.

Would you be so kind as to open a pull request explaining this? Thanks!

Oct 24 '19 19:10 daniel-j-h