
Another test harness, raised from the dead

jakemannix opened this issue 4 years ago · 1 comment

TL;DR: I took an old branch with some test code, updated it against master, trained a new dummy ranking model in Python, and wired it into the JVM test harness. It no longer crashes, and the Python and Scala scores agree to within 1% (usually much closer than that).

@lastmansleeping pointed me to an old branch we'd been using in preparation for the Activate talk, when we really wanted to know for sure that the Scala and Python sides were behaving the same. I had written some code back then to read the model_predictions.csv file and compare scores, but we never cleaned it up for review.

What happens on this branch: take the feature_config in jvm/ml4ir-inference/src/test/resources/ranking_happy_path, put it in python/data, and train a simple ranking model with it:

docker-compose run ml4ir python3 ml4ir/applications/ranking/pipeline.py --data_dir ml4ir/applications/ranking/tests/data/tfrecord --feature_config ml4ir/applications/ranking/tests/data/configs/feature_config.yaml --run_id test_with_list_loss --data_format tfrecord --execution_mode train_inference_evaluate --loss_key rank_one_listnet

this will emit a model into python/models/test_with_list_loss and predictions into python/logs/test_with_list_loss. Both have been copied into (and checked in under) jvm/ml4ir-inference/src/test/resources/ranking_happy_path, where they are used by TensorFlowInferenceTest.testSavedModelBundleWithCSVData. That test runs all 1,500 examples from the training-data CSV file through Scala inference and checks that the two sets of scores agree to within 0.01. Most agree to within 1e-6, but a few are less precise (not sure why yet).
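For reference, here is a minimal sketch (in Scala, the inference side) of the kind of tolerance check described above. It is not the actual TensorFlowInferenceTest code: the CSV layout (python score in the last column) and the jvmScores argument are assumptions for illustration only.

```scala
import scala.io.Source

// Sketch only: assumes each row of model_predictions.csv ends with the python
// score, and that `jvmScores` holds the scala-side scores in the same order.
def checkParity(predictionsCsv: String,
                jvmScores: Seq[Double],
                tolerance: Double = 0.01): Unit = {
  val pyScores = Source.fromFile(predictionsCsv)
    .getLines()
    .drop(1)                                  // skip the header row
    .map(_.split(",").last.toDouble)
    .toSeq
  require(pyScores.size == jvmScores.size, "row count mismatch")
  pyScores.zip(jvmScores).zipWithIndex.foreach { case ((py, jvm), i) =>
    val delta = math.abs(py - jvm)
    assert(delta < tolerance, f"row $i: python=$py jvm=$jvm delta=$delta%.2e")
  }
}
```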

jakemannix · Apr 29 '21 08:04

> Most are within 1e-6, but a few are less precise (not sure why yet).

@jakemannix The beauty of this setup is that I can swap the test model for a far more complex (more features) L3 model, like the one we have for Account, say, and check whether the parity gets worse. I will do this and update here. At this point I suspect it could very well be due to the Java/C++/Python floating-point conversions.
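A hedged illustration of why the floating-point hypothesis is plausible: TensorFlow computes in float32, so a score narrows to roughly 7 significant digits somewhere in the pipeline even if both ends hold it as a 64-bit double. The value below is arbitrary; the point is only the size of the round-trip error.

```scala
// Arbitrary example value; imagine it is a score held as float64 in python.
val pyScore: Double = 0.8414709848078965
// Narrow to float32 (as a TF graph would) and widen back to double.
val roundTripped: Double = pyScore.toFloat.toDouble
// On the order of 1e-8: half an ulp at this magnitude is ~3e-8.
println(math.abs(pyScore - roundTripped))
// A single conversion loses ~1e-8 here; accumulated across a network's
// layers and two runtimes, deltas around 1e-6 would not be surprising.
```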

lastmansleeping · Apr 29 '21 08:04