ml
Integrate the neural token splitter
Based on the paper and the existing code, we should be able to parse identifiers with ML instead of heuristics.
The model has been partially written by @warenlg; I don't remember where it is, can you please find Waren?
The splitting should be batched for performance reasons.
We should extend `TokenParser`.
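A batched extension of `TokenParser` might look like the sketch below. The class names, the `split_batch` method, and the model's `predict(list_of_str)` interface are assumptions for illustration, not the actual sourced-ml API:

```python
import re


class TokenParser:
    """Stand-in for the existing heuristic parser."""

    def split(self, identifier):
        # naive split on '_' and camelCase boundaries, for illustration only
        parts = re.split(r"_|(?<=[a-z])(?=[A-Z])", identifier)
        return [p.lower() for p in parts if p]


class NeuralTokenParser(TokenParser):
    """Splits identifiers with an ML model, feeding it fixed-size batches."""

    def __init__(self, model, batch_size=512):
        # `model` is assumed to expose predict(list_of_str) -> list of splits
        self.model = model
        self.batch_size = batch_size

    def split_batch(self, identifiers):
        # one model call per batch instead of one per identifier
        results = []
        for i in range(0, len(identifiers), self.batch_size):
            results.extend(self.model.predict(identifiers[i:i + self.batch_size]))
        return results
```

The point of `split_batch` is that the neural network sees one large batch at a time, which is where the performance win over per-identifier calls comes from.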
Besides, we need to write a benchmark to measure the performance of the heuristics vs. the ML model.
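A minimal benchmark harness could look like this sketch; the heuristic splitter and the sample identifiers are illustrative, and the ML side would be wrapped the same way:

```python
import re
import timeit


def heuristic_split(identifier):
    """Cheap rule-based baseline: split on '_' and camelCase boundaries."""
    return [p.lower() for p in re.split(r"_|(?<=[a-z])(?=[A-Z])", identifier) if p]


def heuristic_all(identifiers):
    """Split a whole list, so both sides are timed over the same batch."""
    return [heuristic_split(i) for i in identifiers]


def benchmark(split_all, identifiers, repeat=5):
    """Best wall-clock time in seconds to split the whole list once."""
    return min(timeit.repeat(lambda: split_all(identifiers), number=1, repeat=repeat))


if __name__ == "__main__":
    sample = ["getFileName", "user_id", "HTTPServerError"] * 1000
    print("heuristics: %.4f s" % benchmark(heuristic_all, sample))
    # for the ML side, wrap the batched call the same way, e.g.
    # benchmark(lambda ids: parser.split_batch(ids), sample)
```

Timing both sides over the same full list keeps the comparison fair for the batched model, which amortizes its per-call overhead.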
The model from the paper https://drive.google.com/file/d/1-vTJ1Ib-WVETNdmnzMqSW3PaFYlI2gvu/view?usp=sharing
@warenlg I mean the code. I remember that you coded one.
The code to train the model is here https://github.com/src-d/ml/blob/master/sourced/ml/cmd/train_id_split.py
A snippet that loads the model and demos some identifier splitting, aka split.py?
Yes, I already gave it to Tristan by DM on Slack.
Some insights about how it is going:
- Tried to train the model on my laptop + eGPU. Failed miserably due to memory usage. I suspect the dataset is replicated in memory: the original identifiers dataset is 2.3 GB, so training shouldn't use ~28 GB of RAM. I will investigate the memory usage during training on science-3 (I just got my credentials).
- Adding the model to `modelforge` is ongoing. The class itself is almost finished. I am not entirely familiar with `modelforge`, so it's going a bit slower than it should, but I'm starting to understand how it works.
- @warenlg gave me his old model weights so I can test my integration into `TokenParser` while training on science-3 at the same time. @vmarkovtsev
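Regarding the memory blow-up above: one way to avoid replicating the 2.3 GB identifiers dataset in RAM is to stream it in batches instead of loading it whole. A sketch, assuming the dataset is a CSV with an `identifier` column (the real schema may differ):

```python
import csv
from itertools import islice


def iter_identifier_batches(path, batch_size=512):
    """Yield lists of identifiers without materializing the whole file.

    Assumes a CSV with an 'identifier' column; adjust to the real schema.
    """
    with open(path, newline="") as fh:
        reader = csv.DictReader(fh)
        while True:
            batch = [row["identifier"] for row in islice(reader, batch_size)]
            if not batch:
                break
            yield batch
```

Feeding the training loop from such a generator keeps peak memory bounded by the batch size rather than the dataset size.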
@vmarkovtsev @zurk I was able to make the model work with the `modelforge` API, along with asdf saving and loading support.
I have a question regarding tests: since this is an ML model, i.e. not deterministic across different training runs, should we really compare results identifier by identifier, or rather evaluate the overall metrics of the model? For example, predicting ~100 identifiers and ensuring precision > 80%.
@warenlg this concerns you as well since this is your model
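A metric-based test could look like this sketch; the `precision` definition, the sample, and the `check_overall_precision` helper are all illustrative, not existing test code:

```python
def precision(predicted, expected):
    """Micro-averaged precision of predicted subtokens against ground truth."""
    tp = fp = 0
    for pred, exp in zip(predicted, expected):
        exp_set = set(exp)
        for token in pred:
            if token in exp_set:
                tp += 1
            else:
                fp += 1
    return tp / (tp + fp) if tp + fp else 0.0


# Hypothetical held-out sample; the real test would use ~100 identifiers
# with ground-truth splits.
SAMPLE = {
    "getFileName": ["get", "file", "name"],
    "user_id": ["user", "id"],
}


def check_overall_precision(split_fn, threshold=0.8):
    """Assert the splitter clears a precision bar over the whole sample."""
    preds = [split_fn(ident) for ident in SAMPLE]
    assert precision(preds, list(SAMPLE.values())) > threshold
```

This tolerates run-to-run variation in individual predictions while still catching real regressions in model quality.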
We should be able to reproduce the model as precisely as possible, bit for bit in the ideal case. That is why we fix all package versions, all random seeds, sort arrays, etc., to make the whole training process deterministic. If you see that the model differs from one training run to another, let's find out why. In that case, it is totally fine to compare results identifier by identifier. It also helps us see whether our new changes affect model performance.
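The seed-fixing mentioned above usually boils down to a helper like this sketch; the function name and default seed are made up, and the TensorFlow line is left as a comment since it depends on the TF version in the pipeline:

```python
import os
import random

import numpy as np


def fix_seeds(seed=420):
    """Pin every RNG we rely on so training runs are reproducible."""
    os.environ["PYTHONHASHSEED"] = str(seed)
    random.seed(seed)
    np.random.seed(seed)
    # with TensorFlow 1.x (as used at the time) you would also call:
    # tf.set_random_seed(seed)
```

Calling this once at the start of training, combined with pinned package versions and sorted inputs, is what makes identifier-by-identifier comparison viable.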
First, I would say, let's reproduce the model with the current training pipeline and the parameters from the paper, using the same resources, i.e. 2 GPUs (it should take about half a day):
- epochs: 10
- RNN seq len: 40
- batch size: 512
- optimizer: Adam
- learning rate: 0.001
And if we get the same precision and recall on the overall dataset, update the model in `modelforge`.
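For the record, the reproduction settings above as they might be collected for the training entry point (train_id_split.py); the dict keys are illustrative, not the script's actual argument names:

```python
# Hyperparameters from the paper for the reproduction run (key names are
# illustrative; map them to the real train_id_split.py arguments).
PAPER_PARAMS = {
    "epochs": 10,
    "seq_len": 40,        # RNN sequence length
    "batch_size": 512,
    "optimizer": "Adam",
    "learning_rate": 0.001,
}
```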