Cache model loading in model card
Follow-up to #96 and #205.
The _load_model function
https://github.com/skops-dev/skops/blob/30ddea7015ac67df3f950da88f32fb035f9332d2/skops/card/_model_card.py#L168-L205
is currently called every time the model of a card is accessed, e.g. when repr is called. When the model is given as a path, it has to be loaded from disk each time, which can be expensive. Therefore, we would like to add a cache to _load_model.
A simple functools.lru_cache or functools.cache is, however, not sufficient: the argument, in this case the model path, could remain the same while the model on disk is overwritten by a new one. The cache key would need to be something else, like the md5 sum of the file that the path points to.
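To sketch the idea (all names here are illustrative, and pickle.load just stands in for the card's actual loading logic):

```python
import hashlib
import pickle

# Illustrative sketch of a content-keyed cache; not actual skops code.
_model_cache = {}  # maps file content hash -> loaded model

def _file_digest(path):
    """Hash the file contents, so an overwritten file gets a new cache key."""
    digest = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):  # 1 MiB chunks
            digest.update(chunk)
    return digest.hexdigest()

def load_model_cached(path):
    """Return the cached model if the file's bytes are unchanged."""
    key = _file_digest(path)
    if key not in _model_cache:
        with open(path, "rb") as f:
            _model_cache[key] = pickle.load(f)
    return _model_cache[key]
```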
Hi! I would like to help here. If we were to use the md5 sum of the file that the path points to, then we would have to calculate the md5 sum every time that _load_model is called?
> then we would have to calculate the md5 sum every time that _load_model is called?
Each time the function is called with a str or Path argument, yes.
I would assume that calculating the md5 should be really fast, but ideally that could be measured before we change the code.
Actually, it's better to use the md5 sum of the model file rather than loading the model file every time anyway. I could start a PR on this issue.
> Actually, it's better to use the md5 sum of the model file rather than loading the model file every time anyway.
Well, if loading the file takes 100 ms and calculating the md5 takes 99 ms, arguably it's better to just always load the file. That's why we should measure, just to be sure.
Instead of md5, we could also consider storing and checking the time the file was last modified using getmtime. Maybe there is some precedent for what the best approach would be.
Okay, I understand. Although I think that getmtime could make us reload the model file just because its modification time changed while its content stayed the same; in that case we could compute the hash of the file to double-check. But again, that supposes that md5 hashing is very fast.
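A rough sketch of that combined approach (purely illustrative, not actual skops code): use getmtime as a cheap fast path and only rehash when the modification time changed:

```python
import hashlib
import os
import pickle

_cache = {}  # maps path -> (mtime, content_hash, model)

def load_cached(path):
    mtime = os.path.getmtime(path)
    entry = _cache.get(path)
    if entry is not None and entry[0] == mtime:
        # Modification time unchanged: assume the content is unchanged.
        return entry[2]
    with open(path, "rb") as f:
        data = f.read()
    content_hash = hashlib.md5(data).hexdigest()
    if entry is not None and entry[1] == content_hash:
        # File was touched but the content is identical: keep the cached model.
        model = entry[2]
    else:
        model = pickle.loads(data)
    _cache[path] = (mtime, content_hash, model)
    return model
```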
Maybe we can use some metadata of the model to check whether it was changed or not.
The idea is not to "never reload a file twice"; the idea is to "not reload the same content in most cases". So if a solution avoids reloading the same content in 95% of cases, that's a good enough solution, especially if it keeps things simple.
Thanks for clarifying. I think I could do some testing on the loading times and the md5 hash of a model file, and give some feedback on how they compare?
Sure.
Sorry, I haven't yet posted the results of what I said I would test. I will give a response in the next 2-3 days :)
Hey! I have some results to share with you.
I have tested the loading time and the time taken to compute the md5sum of the model's pickle file on some examples. I used two models: the first is from examples/plot_model_card.py, which uses sklearn.ensemble.HistGradientBoostingClassifier, and the other is sklearn.ensemble.ExtraTreesClassifier. Here are the results:
Example 1 - sklearn.ensemble.HistGradientBoostingClassifier
Model size = 228K
time md5sum /tmp/skops-_gor4zib/skops-4e68jjra.pkl
3e58bbfade12f97a9bf1e989ed8ab151 /tmp/skops-_gor4zib/skops-4e68jjra.pkl
real 0m0.003s
Time of _load_model()
0.006422758102416992s
Example 2 - sklearn.ensemble.ExtraTreesClassifier with n_estimators=10000 (the number of trees in the forest)
Model size = 75M
time md5sum /tmp/skops-gdzd4gj2/skops-456h4_5e.pkl
ada4e1ea1fda4a72cfa52ebf6458c557 /tmp/skops-gdzd4gj2/skops-456h4_5e.pkl
real 0m0.097s
Time of _load_model()
0.9678699970245361s
Example 3 - sklearn.ensemble.ExtraTreesClassifier with n_estimators=100000 (the number of trees in the forest)
Model size = 749M
time md5sum /tmp/skops-8k741m3s/skops-x3d6hy5w.pkl
e693e402475155f07ae13a56dbb82b12 /tmp/skops-8k741m3s/skops-x3d6hy5w.pkl
real 0m0.944s
Time of _load_model()
8.120674848556519s
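(The "Time of _load_model()" numbers above were measured with something along these lines; a sketch, not the exact script:)

```python
import time
from skops.card._model_card import _load_model  # the helper linked above

# Sketch of the timing measurement; the path is a placeholder for the
# pickle files listed above.
model_path = "/tmp/skops-_gor4zib/skops-4e68jjra.pkl"
start = time.time()
model = _load_model(model_path)
print(f"{time.time() - start}s")
```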
Thanks a lot for doing the tests. A first conclusion would be that for small models, calling the hashing function is in the same ballpark as loading the model; only for big models do we notice a difference. But that difference can be quite significant, as shown in your last example.
For this test, could you please check what the times would be when calling the hashing functions from Python using hashlib? I would expect that to add a bit of overhead. Also, just in case we require a more secure hashing function, could you please check both md5 and, say, sha256? And maybe run the tests a few times using timeit (or %timeit in case you're using ipython/jupyter).
I measured the times with the timeit Python module. Here are three sets of results:
Test 1
Model 1 (228K) /tmp/skops-_gor4zib/skops-4e68jjra.pkl md5 time=0.0003169089904986322s
Model 2 (75M) /tmp/skops-gdzd4gj2/skops-456h4_5e.pkl md5 time=0.11238793900702149s
Model 3 (749M) /tmp/skops-8k741m3s/skops-x3d6hy5w.pkl md5 time=1.088716946018394s
Model 1 (228K) /tmp/skops-_gor4zib/skops-4e68jjra.pkl sha256 time=0.00016041501658037305s
Model 2 (75M) /tmp/skops-gdzd4gj2/skops-456h4_5e.pkl sha256 time=0.06882680201670155s
Model 3 (749M) /tmp/skops-8k741m3s/skops-x3d6hy5w.pkl sha256 time=0.6861052309977822s
Test 2
Model 1 (228K) /tmp/skops-_gor4zib/skops-4e68jjra.pkl md5 time=0.0002976770047098398s
Model 2 (75M) /tmp/skops-gdzd4gj2/skops-456h4_5e.pkl md5 time=0.11226405500201508s
Model 3 (749M) /tmp/skops-8k741m3s/skops-x3d6hy5w.pkl md5 time=1.101641189015936s
Model 1 (228K) /tmp/skops-_gor4zib/skops-4e68jjra.pkl sha256 time=0.0003235790063627064s
Model 2 (75M) /tmp/skops-gdzd4gj2/skops-456h4_5e.pkl sha256 time=0.07127582101384178s
Model 3 (749M) /tmp/skops-8k741m3s/skops-x3d6hy5w.pkl sha256 time=0.6806159189436585s
Test 3
Model 1 (228K) /tmp/skops-_gor4zib/skops-4e68jjra.pkl md5 time=0.0009365510195493698s
Model 2 (75M) /tmp/skops-gdzd4gj2/skops-456h4_5e.pkl md5 time=0.1162097500055097s
Model 3 (749M) /tmp/skops-8k741m3s/skops-x3d6hy5w.pkl md5 time=1.1024849910172634s
Model 1 (228K) /tmp/skops-_gor4zib/skops-4e68jjra.pkl sha256 time=0.0003798969555646181s
Model 2 (75M) /tmp/skops-gdzd4gj2/skops-456h4_5e.pkl sha256 time=0.06875280395615846s
Model 3 (749M) /tmp/skops-8k741m3s/skops-x3d6hy5w.pkl sha256 time=0.6962733599939384s
From these tests, it looks like sha256 is faster than md5. Let me know if you want me to share the code snippet that I've used.
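In the meantime, here is a sketch of the kind of harness that produces numbers in the above format (the file paths and the repeat count are placeholders, not the exact script used):

```python
import hashlib
import timeit

FILES = {
    "Model 1 (228K)": "/tmp/skops-_gor4zib/skops-4e68jjra.pkl",
    "Model 2 (75M)": "/tmp/skops-gdzd4gj2/skops-456h4_5e.pkl",
    "Model 3 (749M)": "/tmp/skops-8k741m3s/skops-x3d6hy5w.pkl",
}

def hash_file(path, algorithm):
    """Hash the file contents in chunks with the given hashlib algorithm."""
    digest = hashlib.new(algorithm)
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()

for name, path in FILES.items():
    for algorithm in ("md5", "sha256"):
        # Average over 10 runs to smooth out I/O noise.
        t = timeit.timeit(lambda: hash_file(path, algorithm), number=10) / 10
        print(f"{name} {path} {algorithm} time={t}s")
```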
So I assume the 3 tested models are the same as in your previous post.
If so, do I interpret the data correctly that calling md5 from Python (2nd comment) on the small model is 10x faster (0.0003 sec) than calling it from the command line (1st comment, 0.003 sec)? That's strange. But for the other models, there is no big difference, so I guess we can ignore this outlier.
Anyway, overall it seems like sha256 is the better choice and is consistently faster than actually loading the model, especially for larger models. Then I think it's safe to say we should go ahead and implement the caching.
Yes, your assumption is correct.
I hadn't spotted that outlier. I tested again from the command line and it gives me the same result.
Okay, then I will open a PR to implement the caching! :)
I just learned about cached_property, which was introduced in Python 3.8. Since we dropped 3.7, maybe this could be an option for us.
I think that we wouldn't be able to use cached_property here because it's only applicable to methods, while _load_model is a standalone function.
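For illustration, a toy example (not skops code) of how cached_property is used, which also shows a second problem: the cached value never invalidates, so a model overwritten on disk would not be picked up:

```python
from functools import cached_property
import pickle

class Card:
    def __init__(self, path):
        self.path = path

    @cached_property
    def model(self):
        # Computed once per instance and then stored on the instance;
        # a later change to the file on disk would NOT be noticed.
        with open(self.path, "rb") as f:
            return pickle.load(f)
```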
I opened a PR, #299, that implements the functionality we discussed. However, there's a small problem: I worked on a previous PR, #207, which is still waiting to be merged, but I had done that work in the main branch of my fork (my mistake). Therefore, when I branched off from main to implement the cached model loading, all my previous commits became attached to #299. I'm sorry if that's an issue.