ml
Integrate the neural token splitter
Based on the paper and the existing code, we should be able to parse identifiers with ML instead of heuristics.
The model has been partially written by @warenlg; I don't remember where it is, can you please find Waren?
The splitting should be batched for performance reasons.
We should extend `TokenParser`.
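A batched extension of `TokenParser` might look like the sketch below. The class names, the `split_batch` method, and the model's `predict(list_of_str)` interface are assumptions for illustration, not the actual sourced-ml API:

```python
import re


class TokenParser:
    """Stand-in for the existing heuristic parser."""

    def split(self, identifier):
        # naive split on '_' and camelCase boundaries, for illustration only
        parts = re.split(r"_|(?<=[a-z])(?=[A-Z])", identifier)
        return [p.lower() for p in parts if p]


class NeuralTokenParser(TokenParser):
    """Splits identifiers with an ML model, feeding it fixed-size batches."""

    def __init__(self, model, batch_size=512):
        # `model` is assumed to expose predict(list_of_str) -> list of splits
        self.model = model
        self.batch_size = batch_size

    def split_batch(self, identifiers):
        # one model call per batch instead of one per identifier
        results = []
        for i in range(0, len(identifiers), self.batch_size):
            results.extend(self.model.predict(identifiers[i:i + self.batch_size]))
        return results
```

The point of `split_batch` is that the neural network sees one large batch at a time, which is where the performance win over per-identifier calls comes from.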
Besides, we need to write a benchmark to measure the performance of the heuristics vs. the ML model.
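A minimal benchmark harness could look like this sketch; the heuristic splitter and the sample identifiers are illustrative, and the ML side would be wrapped the same way:

```python
import re
import timeit


def heuristic_split(identifier):
    """Cheap rule-based baseline: split on '_' and camelCase boundaries."""
    return [p.lower() for p in re.split(r"_|(?<=[a-z])(?=[A-Z])", identifier) if p]


def heuristic_all(identifiers):
    """Split a whole list, so both sides are timed over the same batch."""
    return [heuristic_split(i) for i in identifiers]


def benchmark(split_all, identifiers, repeat=5):
    """Best wall-clock time in seconds to split the whole list once."""
    return min(timeit.repeat(lambda: split_all(identifiers), number=1, repeat=repeat))


if __name__ == "__main__":
    sample = ["getFileName", "user_id", "HTTPServerError"] * 1000
    print("heuristics: %.4f s" % benchmark(heuristic_all, sample))
    # for the ML side, wrap the batched call the same way, e.g.
    # benchmark(lambda ids: parser.split_batch(ids), sample)
```

Timing both sides over the same full list keeps the comparison fair for the batched model, which amortizes its per-call overhead.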
The model from the paper https://drive.google.com/file/d/1-vTJ1Ib-WVETNdmnzMqSW3PaFYlI2gvu/view?usp=sharing
@warenlg I mean the code. I remember that you coded one.
The code to train the model is here https://github.com/src-d/ml/blob/master/sourced/ml/cmd/train_id_split.py
A snippet that loads the model and demos some identifier splitting, aka split.py?
Yes, I already gave it to Tristan by DM on Slack.
Some insights about how it is going:
- Tried to train the model on my laptop + eGPU. Failed miserably due to memory usage. I suspect the dataset is replicated in memory: the original identifiers dataset is 2.3 GB, so training shouldn't use ~28 GB of RAM. I will investigate the memory usage during training on science-3 (I just got my credentials).
- Adding the model to `modelforge` is ongoing. The class itself is almost finished. I am not entirely familiar with `modelforge`, so it's going a bit slower than it should, but I'm starting to understand how it works.
- @warenlg gave me his old model weights so I can test my integration into `TokenParser` while training on science-3 at the same time. @vmarkovtsev
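Regarding the memory blow-up above: one way to avoid replicating the 2.3 GB identifiers dataset in RAM is to stream it in batches instead of loading it whole. A sketch, assuming the dataset is a CSV with an `identifier` column (the real schema may differ):

```python
import csv
from itertools import islice


def iter_identifier_batches(path, batch_size=512):
    """Yield lists of identifiers without materializing the whole file.

    Assumes a CSV with an 'identifier' column; adjust to the real schema.
    """
    with open(path, newline="") as fh:
        reader = csv.DictReader(fh)
        while True:
            batch = [row["identifier"] for row in islice(reader, batch_size)]
            if not batch:
                break
            yield batch
```

Feeding the training loop from such a generator keeps peak memory bounded by the batch size rather than the dataset size.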
@vmarkovtsev @zurk I was able to make the model work with the `modelforge` API, along with asdf saving and loading support.
I have a question regarding tests: since this is an ML model, i.e. not deterministic across different training runs, should we really compare results identifier by identifier, or rather evaluate the overall metrics of the model? For example, predicting ~100 identifiers and ensuring precision > 80%.
@warenlg this concerns you as well since this is your model
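A metric-based test could look like this sketch; the `precision` definition, the sample, and the `check_overall_precision` helper are all illustrative, not existing test code:

```python
def precision(predicted, expected):
    """Micro-averaged precision of predicted subtokens against ground truth."""
    tp = fp = 0
    for pred, exp in zip(predicted, expected):
        exp_set = set(exp)
        for token in pred:
            if token in exp_set:
                tp += 1
            else:
                fp += 1
    return tp / (tp + fp) if tp + fp else 0.0


# Hypothetical held-out sample; the real test would use ~100 identifiers
# with ground-truth splits.
SAMPLE = {
    "getFileName": ["get", "file", "name"],
    "user_id": ["user", "id"],
}


def check_overall_precision(split_fn, threshold=0.8):
    """Assert the splitter clears a precision bar over the whole sample."""
    preds = [split_fn(ident) for ident in SAMPLE]
    assert precision(preds, list(SAMPLE.values())) > threshold
```

This tolerates run-to-run variation in individual predictions while still catching real regressions in model quality.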
We should be able to reproduce the model as precisely as possible, bit for bit in the ideal case. That is why we fix all package versions, all random seeds, sort arrays, etc., to make the whole training process deterministic. If you see that the model differs from one training run to another, let's find out why. In that case, it is totally fine to compare results identifier by identifier. It also helps us see whether our new changes affect model performance.
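The seed-fixing mentioned above usually boils down to a helper like this sketch; the function name and default seed are made up, and the TensorFlow line is left as a comment since it depends on the TF version in the pipeline:

```python
import os
import random

import numpy as np


def fix_seeds(seed=420):
    """Pin every RNG we rely on so training runs are reproducible."""
    os.environ["PYTHONHASHSEED"] = str(seed)
    random.seed(seed)
    np.random.seed(seed)
    # with TensorFlow 1.x (as used at the time) you would also call:
    # tf.set_random_seed(seed)
```

Calling this once at the start of training, combined with pinned package versions and sorted inputs, is what makes identifier-by-identifier comparison viable.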
First, I would say, let's reproduce the model with the current training pipeline and the parameters from the paper, using the same resources, i.e. 2 GPUs (it should take about half a day):
- epochs: 10
- RNN seq len: 40
- batch size: 512
- optimizer: Adam
- learning rate: 0.001
And if we get the same precision and recall on the overall dataset, update the model in `modelforge`.
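For the record, the reproduction settings above as they might be collected for the training entry point (train_id_split.py); the dict keys are illustrative, not the script's actual argument names:

```python
# Hyperparameters from the paper for the reproduction run (key names are
# illustrative; map them to the real train_id_split.py arguments).
PAPER_PARAMS = {
    "epochs": 10,
    "seq_len": 40,        # RNN sequence length
    "batch_size": 512,
    "optimizer": "Adam",
    "learning_rate": 0.001,
}
```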