Corpus reader can create batch size of 0 even when training prefixes are present
I ran into a situation with `test_fast` in `test_na.py` where, if the batch size is not specified, the `CorpusReader` creates a batch size of 0. This seems to be a bug.
https://github.com/persephone-tools/persephone/blob/d73db4597a9ff92f6ea8824e6070b662a403655b/persephone/tests/experiments/test_na.py#L97-L122
Specifically, on line 126, `get_simple_model` determines the batch size and passes it in as a parameter. A simple test case like this could provide a nice regression test here.
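The regression test could pin down the invariant directly. This is just a sketch: `determine_batch_size` is a hypothetical stand-in for whatever selection `get_simple_model` does, not the real API.

```python
def determine_batch_size(num_train, cap=64):
    """Hypothetical batch-size selection: never return 0,
    even for a tiny corpus; fail loudly for an empty one."""
    if num_train <= 0:
        raise ValueError("corpus has no training utterances")
    return min(num_train, cap)

def test_batch_size_never_zero():
    # A tiny corpus should still get a positive batch size.
    assert determine_batch_size(3) == 3
    # An empty corpus should raise rather than silently yield 0.
    try:
        determine_batch_size(0)
    except ValueError:
        pass
    else:
        assert False, "expected ValueError for an empty corpus"
```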
I added some tests for `utils.make_batch` in #189 and it appears the issue isn't coming from there. I'll work on this more tomorrow.
I see the constructor contains the following:

```python
if not num_train:
    if not batch_size:
        batch_size = 64
    num_train = len(corpus.get_train_fns()[0])
```
This determines `num_train` from the number of feature files found for training. Do we want to warn if no feature files are found? It seems as though this class has a precondition that the feature files have already been preprocessed, so a warning would make sense here.
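The warning could slot into that same branch, something like the sketch below. The corpus API call is taken from the snippet above; everything else (function name, message) is an assumption for illustration.

```python
import logging

logger = logging.getLogger(__name__)

def resolve_train_size(corpus, num_train=None, batch_size=None):
    """Mirror of the constructor logic above, but warning when
    no preprocessed feature files are found."""
    if not num_train:
        if not batch_size:
            batch_size = 64
        num_train = len(corpus.get_train_fns()[0])
        if num_train == 0:
            logger.warning("No training feature files found; "
                           "has the corpus been preprocessed?")
    return num_train, batch_size
```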
I just pushed a check in train_batch_gen() that throws an exception if the Corpus has no training utterances.
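The check is roughly of this shape (a paraphrase, not the exact committed code; the exception class name is assumed):

```python
class PersephoneException(Exception):
    """Stand-in for the package's exception type."""

def train_batch_gen(train_fns, batch_size=32):
    """Yield training batches, failing fast on an empty corpus."""
    if not train_fns:
        raise PersephoneException(
            "Corpus has no training utterances; "
            "were the feature files preprocessed?")
    for i in range(0, len(train_fns), batch_size):
        yield train_fns[i:i + batch_size]
```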
I'm removing the bug status now. I'll keep the issue open, since there's an open question of whether to move some of that batch_size logic from the constructor into .train_batch_gen(), given that it's only relevant there. The reason I don't want to do this immediately is that it's probably a good idea to load the dev and test data in batches too, since conceivably they won't all fit in memory.
Perhaps the simplest option is to give the batch_size argument a reasonable kwarg default (32 or so) and use it when generating train, dev and test batches. That would let me get rid of most of the logic, including the exception about divisibility of num_train and batch_size that I added a while back. Who cares; the model should just feed the remainder in as a batch smaller than the others.
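The remainder handling amounts to very little code. A sketch (`make_batches` is a hypothetical helper, not the existing persephone API):

```python
def make_batches(items, batch_size=32):
    """Split items into batches; the final batch may simply be
    smaller than batch_size instead of requiring divisibility."""
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    return [items[i:i + batch_size]
            for i in range(0, len(items), batch_size)]
```

With 10 utterances and batch_size=4 this yields batches of sizes 4, 4 and 2, and the same helper could back the train, dev and test generators.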