seqlearn
[WIP] parallel perceptron training
I tried to implement the "iterative parameter mixing" strategy for distributed training of the structured perceptron:
Ryan McDonald, Keith Hall, and Gideon Mann (2010). Distributed training strategies for the structured perceptron. NAACL'10.
The idea is the following:
- the training data is split into N "shards" (this happens only once);
- for each shard a OneEpochPerceptron is created; this could happen on a different machine;
- all OneEpochPerceptrons start with the same weights (but with different training data);
- at the end of each iteration the learned weights from the different perceptrons are collected and mixed together; the mixed values are passed to all perceptrons for the next iteration (so all perceptrons start from the same state again).
So communication involves only transferring the learned weights, and each shard can keep its own training data.
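Here is a minimal sketch of that loop, stripped down to a plain binary perceptron and numpy (no seqlearn classes; the shard count, epoch count and the {-1, +1} labels are just for illustration), to show what gets computed where:

```python
import numpy as np

def one_epoch(w, X, y):
    """Run a single perceptron epoch over one shard, starting from weights w.
    Labels are assumed to be in {-1, +1}."""
    w = w.copy()
    for xi, yi in zip(X, y):
        if yi * np.dot(w, xi) <= 0:  # mistake -> perceptron update
            w += yi * xi
    return w

def iterative_parameter_mixing(X, y, n_shards=4, n_iter=10):
    # the split into shards happens only once
    shards = np.array_split(np.arange(len(y)), n_shards)
    w = np.zeros(X.shape[1])
    for _ in range(n_iter):
        # every shard starts the epoch from the same mixed weights;
        # in a distributed setting each call could run on a different machine
        shard_weights = [one_epoch(w, X[idx], y[idx]) for idx in shards]
        # uniform mixing: plain average of the per-shard weights
        w = np.mean(shard_weights, axis=0)
    return w
```

Only the weight vectors cross the shard boundary between iterations; each shard's X/y never has to move.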
ParallelStructuredPerceptron is an attempt to reimplement StructuredPerceptron in terms of OneEpochPerceptrons. It has an n_jobs parameter, and ideally it should use multiprocessing or multithreading for faster training (numpy/scipy release the GIL, and the bottleneck is the dot product, isn't it?). But I didn't manage to make multiprocessing work without copying each shard's X/y/lengths on every iteration, so n_jobs=N currently just creates N OneEpochPerceptrons and trains them sequentially.
Ideally, I want OneEpochPerceptron to be easy to use with IPython.parallel in a distributed environment, and ParallelStructuredPerceptron to be easy to use on a single machine.
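A rough usage sketch of what I have in mind for the single-machine case; the import path for the new class and the toy data are made up for illustration, and only n_jobs and random_state are settled as parameter names:

```python
import numpy as np
from seqlearn.perceptron import StructuredPerceptron
# import path for the new class is an assumption; ParallelStructuredPerceptron
# is what this PR currently calls it
from seqlearn.perceptron import ParallelStructuredPerceptron

# toy data: two sequences of length 3, four binary features per position
X = np.array([[1, 0, 0, 1], [0, 1, 0, 1], [0, 0, 1, 0],
              [1, 1, 0, 0], [0, 1, 1, 0], [1, 0, 1, 1]])
y = np.array(["A", "B", "A", "A", "B", "B"])
lengths = [3, 3]

seq = StructuredPerceptron(random_state=0).fit(X, y, lengths)
par = ParallelStructuredPerceptron(n_jobs=2, random_state=0).fit(X, y, lengths)

# with the changed sequence_ids shuffling both should learn exactly the same
# weights for the same random_state, so their predictions agree as well
assert (seq.predict(X, lengths) == par.predict(X, lengths)).all()
```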
Issues with the current implementation:
- the "parallel" part is not implemented in ParallelStructuredPerceptron (I'm not well versed in multiprocessing/joblib/... and I don't know how to make it work without copying the training data on each iteration; ideas are welcome);
- code duplication in SequenceShards vs SequenceKFold;
- code duplication in OneEpochPerceptron vs StructuredPerceptron vs ParallelStructuredPerceptron;
- OneEpochPerceptron uses the 'transform' method to learn the updated weights;
- I don't understand the original classes/class_range/n_classes code, so I may have broken something here; there is also code duplication;
- parameters are mixed uniformly; a mixing strategy that takes the loss into account is not implemented (see the sketch after this list);
- I'm not sure about the class names and code organization.
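For the loss-aware mixing, one option in the spirit of McDonald et al. would be to weight each shard by the number of mistakes it made during the epoch instead of averaging uniformly; a sketch (the helper name and the exact weighting are just an illustration):

```python
import numpy as np

def mix_by_loss(shard_weights, shard_mistakes):
    """Non-uniform parameter mixing: shards that made more mistakes get a
    larger mixing coefficient. shard_weights is a list of weight vectors,
    shard_mistakes the number of perceptron updates made on each shard."""
    mistakes = np.asarray(shard_mistakes, dtype=float)
    if mistakes.sum() == 0:           # no updates anywhere: plain average
        return np.mean(shard_weights, axis=0)
    mu = mistakes / mistakes.sum()    # mixing coefficients, sum to 1
    return np.average(shard_weights, axis=0, weights=mu)
```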
The sequence_ids shuffling method is changed so that ParallelStructuredPerceptron and StructuredPerceptron learn exactly the same weights given the same random_state.
With n_jobs=1, ParallelStructuredPerceptron is about 10% slower than StructuredPerceptron on my data; I think we could merge these classes when (and if) ParallelStructuredPerceptron is ready.