
Corpus reader can create batch size of 0 even when training prefixes are present

Open shuttle1987 opened this issue 6 years ago • 4 comments

There's a situation I ran into with test_fast in test_na.py: if the batch size is not specified, the CorpusReader creates a batch size of 0. This seems to be a bug.

shuttle1987 avatar Aug 19 '18 15:08 shuttle1987

https://github.com/persephone-tools/persephone/blob/d73db4597a9ff92f6ea8824e6070b662a403655b/persephone/tests/experiments/test_na.py#L97-L122

Specifically, on line 126, get_simple_model determines the batch size and passes it in as a parameter. A simple test case along these lines (see the sketch below) would make a nice regression test.
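
A minimal sketch of such a regression test. The Corpus stub, the CorpusReader constructor signature, and the .batch_size attribute are assumptions here, not confirmed persephone API:

    # Hypothetical regression test; the Corpus stub, constructor
    # signature, and .batch_size attribute are assumed, not confirmed.
    from unittest.mock import MagicMock

    from persephone.corpus_reader import CorpusReader

    def test_default_batch_size_is_positive():
        # Stub a corpus that reports three preprocessed training files.
        corpus = MagicMock()
        corpus.get_train_fns.return_value = (
            ["a.feat.npy", "b.feat.npy", "c.feat.npy"],
            ["a.phonemes", "b.phonemes", "c.phonemes"],
        )

        # Leave batch_size unspecified to exercise the default path.
        reader = CorpusReader(corpus)

        assert reader.batch_size > 0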

shuttle1987 avatar Aug 19 '18 15:08 shuttle1987

Added some tests for utils.make_batch in #189 and it appears the issue isn't from there. Will work on this more tomorrow.

shuttle1987 avatar Aug 19 '18 16:08 shuttle1987

I see in the constructor there's the following:

        # If the caller didn't supply num_train, fall back to a default
        # batch size (note: only in that case) and infer num_train from
        # the preprocessed training feature files.
        if not num_train:
            if not batch_size:
                batch_size = 64
            num_train = len(corpus.get_train_fns()[0])

This derives num_train from the number of feature files found for training.

Do we want to warn if no feature files are found? This class seems to have a precondition that the feature files are already preprocessed, so a warning would make sense here.
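
For illustration, the warning could slot in like this; a minimal standalone sketch mirroring the constructor logic above (the helper name, logger, and message are mine, not persephone's):

    import logging

    logger = logging.getLogger(__name__)

    def _resolve_train_defaults(corpus, num_train=None, batch_size=None):
        """Hypothetical standalone mirror of the constructor logic above."""
        if not num_train:
            if not batch_size:
                batch_size = 64
            train_fns = corpus.get_train_fns()[0]
            if not train_fns:
                # Precondition: features are expected to be preprocessed
                # already, so an empty list likely means a missing step.
                logger.warning("No preprocessed training feature files "
                               "found; was feature extraction run?")
            num_train = len(train_fns)
        return num_train, batch_size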

shuttle1987 avatar Aug 20 '18 05:08 shuttle1987

I just pushed a check in train_batch_gen() that throws an exception if the Corpus has no training utterances.
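
Roughly, the check could take this shape; a sketch of the idea rather than the committed code, with the exception type and message assumed:

    class CorpusReader:
        # ... constructor and other methods elided ...

        def train_batch_gen(self):
            # Fail loudly rather than silently yielding zero batches.
            train_fns = self.corpus.get_train_fns()[0]
            if not train_fns:
                raise RuntimeError(
                    "Corpus has no training utterances; cannot "
                    "generate training batches.")
            # ... batch generation proceeds as before ...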

I'm removing the bug status now. I'll keep the issue open, since there's an open question of whether to move some of that batch_size logic from the constructor into .train_batch_gen(), given that it's only relevant there. The reason I don't want to do this immediately is that it's probably a good idea to load the dev and test data in batches too, since conceivably they won't all fit in memory.

Perhaps the simplest option is to give the batch_size argument a reasonable kwarg default (32 or something) and use it when generating the train, dev, and test batches. That would let me get rid of most of the logic, including the exception about divisibility of num_train and batch_size that I added a while back. Who cares; the model should just feed the remainder in as a final batch smaller than the others.
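
Concretely, that simplification could look like the sketch below: one shared kwarg default, with the final partial batch yielded as-is (the function name and shape are illustrative, not persephone's actual API):

    def batch_gen(utterances, batch_size=32):
        """Yield consecutive batches; the final one may be smaller."""
        for start in range(0, len(utterances), batch_size):
            yield utterances[start:start + batch_size]

With 70 utterances and the default of 32, this yields batches of sizes 32, 32, and 6, so no divisibility exception is needed.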

oadams avatar Aug 20 '18 15:08 oadams