Bug in handling of gzipped input files
The command line help indicates that gzipped input files are supported. However, if a gzipped training data file or validation data file is given, training fails with UnicodeDecodeError.
  File "/l/sgronroo/scratch/theanopy3/bin/theanolm", line 12, in <module>
    exec(compile(open(file).read(), file, 'exec'))
  File "/l/sgronroo/scratch/theanopy3/theanolm/bin/theanolm", line 46, in <module>
    main()
  File "/l/sgronroo/scratch/theanopy3/theanolm/bin/theanolm", line 41, in main
    args.command_function(args)
  File "/l/sgronroo/scratch/theanopy3/theanolm/theanolm/commands/train.py", line 303, in train
    trainer.train()
  File "/l/sgronroo/scratch/theanopy3/theanolm/theanolm/trainers/basictrainer.py", line 132, in train
    self._validate(perplexity)
  File "/l/sgronroo/scratch/theanopy3/theanolm/theanolm/trainers/localstatisticstrainer.py", line 57, in _validate
    perplexity = self.scorer.compute_perplexity(self.validation_iter)
  File "/l/sgronroo/scratch/theanopy3/theanolm/theanolm/scoring/textscorer.py", line 130, in compute_perplexity
    for word_ids, _, mask in batch_iter:
  File "/l/sgronroo/scratch/theanopy3/theanolm/theanolm/iterators/batchiterator.py", line 93, in next
    sequence = self._read_sequence()
  File "/l/sgronroo/scratch/theanopy3/theanolm/theanolm/iterators/batchiterator.py", line 168, in _read_sequence
    for word in utterance_from_line(line)]
  File "/l/sgronroo/scratch/theanopy3/theanolm/theanolm/iterators/batchiterator.py", line 20, in utterance_from_line
    line = line.decode('utf-8')
UnicodeDecodeError: 'utf-8' codec can't decode byte 0x8b in position 1: invalid start byte
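The problem appears to be caused by the mmap access to the file (SentencePointers in iterators/shufflingbatchiterator:66), which fails for gzipped files. mmap maps the raw compressed bytes, so the transparent unzipping (implemented in TextFileType, filetypes.py:95) has no effect. The byte 0x8b in the error message is consistent with this: it is the second byte of the gzip magic number (0x1f 0x8b).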
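To illustrate the suspected cause outside TheanoLM, here is a minimal sketch that reproduces the same error using only the standard library. The function names and the file name train.txt.gz are made up for this example and are not part of TheanoLM; the sketch only assumes that the iterator reads lines through mmap and then decodes them as UTF-8.

```python
import gzip
import mmap
import os
import tempfile


def first_line_via_mmap(path):
    """Read the first 'line' through mmap: these are the raw bytes on disk."""
    with open(path, "rb") as f:
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as m:
            return m.readline()


def first_line_via_gzip(path):
    """Read the first line through gzip.open, which decompresses on the fly."""
    with gzip.open(path, "rb") as f:
        return f.readline()


def demo():
    """Return (raw mmap bytes, decoded gzip text, decode error from mmap bytes)."""
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "train.txt.gz")  # hypothetical file name
        with gzip.open(path, "wb") as f:
            f.write("hello world\n".encode("utf-8"))

        raw = first_line_via_mmap(path)
        text = first_line_via_gzip(path).decode("utf-8")

    error = None
    try:
        raw.decode("utf-8")
    except UnicodeDecodeError as e:
        # The gzip header starts with 0x1f 0x8b; 0x8b is not a valid
        # UTF-8 start byte, so decoding fails at position 1.
        error = e
    return raw, text, error


if __name__ == "__main__":
    raw, text, error = demo()
    print(raw[:2])
    print(repr(text))
    print(error)
```

The mmap path sees the compressed bytes, while gzip.open sees the text, which matches the behaviour described above.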
I'm thinking about this kind of solution:
- Provide a temporary directory as a command line argument.
- Store two arrays using numpy.memmap(): one that contains the word IDs and one that contains the mask of each sequence.
- Iterate the data by accessing these arrays in random order.
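A rough sketch of what the proposal above could look like. This is not existing TheanoLM code: the function names, the file names word_ids.dat and mask.dat, the dtypes, and the choice to pad every sequence to the longest length are all assumptions made for illustration. It also assumes the input has already been decompressed and converted to word IDs.

```python
import os
import tempfile

import numpy as np


def build_memmaps(sequences, tmp_dir):
    """Write word IDs and masks to two disk-backed arrays in tmp_dir.

    sequences: list of lists of integer word IDs (already decoded).
    Rows are zero-padded to the longest sequence; mask is 1 for real words.
    """
    num_sequences = len(sequences)
    max_length = max(len(seq) for seq in sequences)
    shape = (num_sequences, max_length)

    # mode="w+" creates (or overwrites) the file, initially filled with zeros.
    word_ids = np.memmap(os.path.join(tmp_dir, "word_ids.dat"),
                         dtype=np.int64, mode="w+", shape=shape)
    mask = np.memmap(os.path.join(tmp_dir, "mask.dat"),
                     dtype=np.int8, mode="w+", shape=shape)

    for i, seq in enumerate(sequences):
        word_ids[i, :len(seq)] = seq
        mask[i, :len(seq)] = 1

    word_ids.flush()
    mask.flush()
    return word_ids, mask


def iterate_shuffled(word_ids, mask, seed=None):
    """Yield each sequence once, in random order, without padding."""
    rng = np.random.default_rng(seed)
    for i in rng.permutation(word_ids.shape[0]):
        length = int(mask[i].sum())
        yield word_ids[i, :length].tolist()


if __name__ == "__main__":
    # The temporary directory would come from the proposed command line argument.
    tmp_dir = tempfile.mkdtemp()
    ids, mask = build_memmaps([[1, 2, 3], [4, 5], [6]], tmp_dir)
    for sequence in iterate_shuffled(ids, mask, seed=0):
        print(sequence)
```

Padding to the longest sequence wastes disk space when lengths vary a lot; storing the sequences back to back with an offset array would be an alternative, at the cost of slightly more bookkeeping.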