Utility for syncing training, validation, and evaluation data.
In most of the datasets I'm putting together, masks and tiles do not always match one-to-one. At the very least, the documentation should clarify that the trainer needs a directory where all files are in sync. Even better would be to provide a simple pre-processing script for syncing the masks/tiles, or to give rs_trainer an option to ignore or remove un-synced masks/tiles.
Currently I just use a simple Python script to sync the directory:
import argparse
import os


def dir_dict(root: str) -> dict:
    """Map each file's last three path components to its full path."""
    dd = {}
    for subdir, _dirs, files in os.walk(root):
        for file in files:
            path = os.path.join(subdir, file)
            # Split on the platform separator so the key is built the same way everywhere.
            key = '/'.join(os.path.normpath(path).split(os.sep)[-3:])
            dd[key] = path
    return dd


def main() -> int:
    parser = argparse.ArgumentParser()
    parser.add_argument('dir1', type=str)
    parser.add_argument('dir2', type=str)
    args = parser.parse_args()

    dir1_dict = dir_dict(args.dir1)
    dir2_dict = dir_dict(args.dir2)

    # Collect files that have no counterpart in the other directory.
    removed = [v for k, v in dir1_dict.items() if k not in dir2_dict]
    removed += [v for k, v in dir2_dict.items() if k not in dir1_dict]

    # Delete the un-synced files from disk.
    for file in removed:
        os.remove(file)
    return len(removed)


if __name__ == "__main__":
    print(f"removed {main()} un-synced files")
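Because the script above deletes files right away, a dry run that only reports the mismatches may be a safer first step. The following is a minimal sketch, not part of robosat: the helper names `relative_files` and `find_unsynced` are made up for illustration, and it assumes a tile and its mask share the same path relative to their root directory, file extension included.

```python
import os


def relative_files(root: str) -> set:
    """Return every file path under root, expressed relative to root."""
    found = set()
    for subdir, _dirs, files in os.walk(root):
        for name in files:
            found.add(os.path.relpath(os.path.join(subdir, name), root))
    return found


def find_unsynced(dir1: str, dir2: str) -> tuple:
    """Return (only_in_dir1, only_in_dir2) as sorted lists of relative paths.

    Nothing is deleted; the caller decides what to do with the result.
    """
    first, second = relative_files(dir1), relative_files(dir2)
    return sorted(first - second), sorted(second - first)
```

Calling `find_unsynced("images", "labels")` (hypothetical directory names) returns the two lists, which can be printed and reviewed before anything is removed.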
See https://github.com/mapbox/robosat/issues/93 and https://github.com/mapbox/robosat/issues/93#issuecomment-408142081
We should keep the user responsible for preparing the dataset and making sure it's in sync. What we could do in the context of #91 is to go through our assertions and make them easier to understand (and show ways to solve the problem) for our users.
rs train's pre-conditions are a dataset directory with pairs of images and labels.
I agree that we could make this clear in the readme, though.
Would you be so kind as to open a pull request explaining this? Thanks!
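To illustrate what an easier-to-understand assertion for this pre-condition might look like, here is a rough sketch. It is not robosat's actual code: `tile_keys`, `check_in_sync`, and the wording of the message are hypothetical, and it assumes that an image and its label share the same relative path apart from the file extension.

```python
import os


def tile_keys(root: str) -> set:
    """Collect the relative path of every file under root, extension dropped."""
    keys = set()
    for subdir, _dirs, files in os.walk(root):
        for name in files:
            rel = os.path.relpath(os.path.join(subdir, name), root)
            keys.add(os.path.splitext(rel)[0])
    return keys


def check_in_sync(images_dir: str, labels_dir: str) -> None:
    """Raise ValueError with an actionable message if the two trees differ."""
    images, labels = tile_keys(images_dir), tile_keys(labels_dir)
    missing_labels = sorted(images - labels)
    missing_images = sorted(labels - images)
    if missing_labels or missing_images:
        raise ValueError(
            f"dataset out of sync: {len(missing_labels)} image(s) without a label, "
            f"{len(missing_images)} label(s) without an image; "
            "remove the unmatched files or regenerate the missing ones"
        )
```

A check like this tells the user how many files are unmatched and what to do about it, while still leaving the dataset preparation itself to the user.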