gateplugin-LearningFramework

Add a way to randomly shuffle the corpus / data file

Open johann-petrak opened this issue 6 years ago • 1 comments

It can happen that a corpus contains training instances grouped by class, which is very bad for training. In such cases there should be a way either to shuffle the corpus before running the pipeline with the training PR on it, or to shuffle the generated data file before using it (and before splitting off the validation instances).

Doing it inside GATE by providing a menu entry for shuffling a corpus:

  • this should be easy to implement using Collections.shuffle(list,random) given that a corpus is a list
  • more involved to support "unshuffling"
  • cannot be used in other scenarios (e.g. using runPipeline, GCP) where we may need to shuffle the data file that was created
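The "unshuffling" point above is language-neutral, so here is a minimal Python sketch of one way it could work: shuffle a list of indices instead of the items themselves, and keep that permutation so the original order can be restored. The function names `shuffle_with_permutation` and `unshuffle` are hypothetical and not part of GATE or the plugin; inside GATE the same idea would have to be expressed in Java around `Collections.shuffle`.

```python
import random


def shuffle_with_permutation(items, seed):
    """Return (shuffled, permutation), where permutation[i] is the
    original index of the item now at position i."""
    permutation = list(range(len(items)))
    # A dedicated Random instance makes the order repeatable for a given seed.
    random.Random(seed).shuffle(permutation)
    shuffled = [items[i] for i in permutation]
    return shuffled, permutation


def unshuffle(shuffled, permutation):
    """Restore the original order from a shuffled list and its permutation."""
    restored = [None] * len(shuffled)
    for new_pos, old_pos in enumerate(permutation):
        restored[old_pos] = shuffled[new_pos]
    return restored


# Hypothetical stand-in for the documents of a corpus.
docs = ["doc0", "doc1", "doc2", "doc3", "doc4"]
shuffled_docs, perm = shuffle_with_permutation(docs, seed=42)
original_again = unshuffle(shuffled_docs, perm)
```

Storing the permutation (or just the seed, if the corpus is guaranteed not to change in between) is what makes the operation reversible; this is the extra bookkeeping that makes "unshuffling" more involved than a plain shuffle.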

Shuffling the data file:

  • shuf on Linux works well, but there is no easy way to get repeatable randomness from a seed, and it is unknown how well it scales beyond available memory
  • not sure what other scalable, portable ways to shuffle exist, maybe:
    • https://github.com/trufanov-nok/shuf-t
    • probably better to implement our own Python-based approach with two passes: first index the starting offsets and lengths of all lines in memory, shuffle that list, then seek and write the lines in the shuffled order. If the file is not too big, just shuffle the lines in memory directly.
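For the simple case where the file fits in memory, a seeded shuffle could look roughly like the sketch below. It assumes one training instance per line; the function name `shuffle_lines_in_memory` and its parameters are hypothetical. Passing the same seed gives the same output order, which addresses the repeatability concern raised for shuf.

```python
import random


def shuffle_lines_in_memory(inpath, outpath, seed=None):
    """Shuffle the lines of inpath into outpath, holding all lines in memory."""
    # Binary mode leaves the line contents untouched, whatever the encoding.
    with open(inpath, "rb") as infile:
        lines = infile.readlines()
    # If the last line has no trailing newline, add one so it does not get
    # glued to the following line once it moves to a new position.
    if lines and not lines[-1].endswith(b"\n"):
        lines[-1] += b"\n"
    # A dedicated Random instance keeps the result repeatable for a given seed.
    random.Random(seed).shuffle(lines)
    with open(outpath, "wb") as outfile:
        outfile.writelines(lines)
```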

johann-petrak, Sep 03 '18 11:09

Implement a Python utility function in the gate-lf-python-data library for doing this, either by loading all lines directly into memory, if possible, or by creating a list of starting offsets (and maybe lengths) and using seek.

Roughly:

idx.append((curoffset, curlinelength))
curoffset += curlinelength
...
# shuffle idx, then go through it and ...
thefile.seek(offset)
line = thefile.readline()  # in that case no need to store the length, but maybe a lower-level read with a given length is faster?
# write line to shuffled file
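Filled out into something runnable, the rough outline above might look like the following sketch. It is only one possible reading of the idea: the function name `shuffle_lines_by_offset` and its parameters are hypothetical, it stores only offsets and relies on `readline()` (so no lengths are needed), and it assumes the list of offsets itself fits in memory even when the file does not.

```python
import random


def shuffle_lines_by_offset(inpath, outpath, seed=None):
    """Shuffle the lines of inpath into outpath without loading the whole file."""
    offsets = []
    # Binary mode so that offsets are plain byte positions usable with seek().
    with open(inpath, "rb") as infile:
        # Pass 1: record the starting byte offset of every line.
        pos = 0
        for line in infile:
            offsets.append(pos)
            pos += len(line)
        # Shuffle the index only; a seed makes the order repeatable.
        random.Random(seed).shuffle(offsets)
        # Pass 2: seek to each line in shuffled order and copy it.
        with open(outpath, "wb") as outfile:
            for offset in offsets:
                infile.seek(offset)
                line = infile.readline()
                # Guard against a final line without a trailing newline.
                if not line.endswith(b"\n"):
                    line += b"\n"
                outfile.write(line)
```

The second pass does one random seek per line, so on large files it will probably be much slower than a sequential copy; whether a lower-level read using stored lengths helps would need to be measured.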

johann-petrak, Sep 03 '18 11:09