huggingface_hub
Keras callback for pushing models to Hub
A callback to push models to the Hub. It's the best of both worlds between the transformers Keras callback and push_to_hub_callback(). I haven't written any tests for it yet because I couldn't decide on the best approach. I could either:
- Test by interfering with the training loop to check whether the model is pushed at specific intervals (which is probably a very, very bad idea), or
- Check the commit history.
Here are a couple of examples (pushing per epoch to a Hub model ID, and pushing by URL).
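The two ways of pointing the callback at a repo (a bare model ID or a full URL) could be normalized with a tiny resolver. This is a hypothetical helper for illustration only; the PR's actual argument handling may differ:

```python
def resolve_repo_id(target: str) -> str:
    """Accept either a bare repo id ("user/model") or a full Hub URL.

    Hypothetical sketch of how both callback targets could map to the
    same repo id; not the PR's real implementation.
    """
    prefix = "https://huggingface.co/"
    if target.startswith(prefix):
        return target[len(prefix):]
    return target
```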
I don't want a review yet because: 1. this branch is derived from the unapproved model card PR branch (I did that intentionally because I needed the model card writing functions, so many merge conflicts are coming lol), which needs to be merged first; 2. I haven't written any tests. You're free to try it in your notebooks though.
I finished the callback; the rest of the mixin file is not up to date (it's derived from the unapproved Keras model card branch, since I needed the model card writing files). I added one test case that passes at the moment. HOWEVER, the test only checks whether the files are pushed to the Hub at the end of training, not whether they are pushed every step or epoch, so it's not a good test. We can either check the commit history somehow (?) or I will have to somehow interfere with the training loop and check whether it's being pushed (which is bad). Also, unlike card writing in other cases, I added a flag indicating that the callback triggered the writing, so that it will overwrite (normally we just read). I'm asking for an early review. @nateraw @osanseviero
I wrote a test that records the time of the last commit after every push, and at the end of training checks that the number of commits matches the expected push frequency. TIL: I should unstage the commit; just not pushing doesn't solve it :D (see, I have two commits above :( )
I added tests for step strategy and end of training strategies as well.
I will change the code, since currently model card creation is called directly inside save_pretrained_keras, though it doesn't make sense to wait until the merge, since it will change many things.
I aligned things with main (now that we merged keras_model_card), because it might have affected the final tests later. Changes include:
- tensorboard override
- model card creation called in `save_pretrained_keras`

I also added an `override_card` arg under `save_pretrained_keras`, since card creation was done there. Did some nitpicking as well.
All the tests are failing (except the TensorFlow tests) because Callback is not being imported when I try to import it like this in keras_mixin.py. I didn't anticipate this because there was no problem locally:
```python
if is_tf_available():
    import tensorflow as tf
    from tensorflow.keras.callbacks import Callback
```
I'll see how I can fix it.
The issue is that ValidationCallback inherits/sub-classes from Callback, which is not imported unless you have TensorFlow.
Not super elegant, but I think something along these lines would work. There might be a better approach though:
```python
if is_tf_available():
    import tensorflow as tf

    callback = tf.keras.callbacks.Callback
else:
    # Fallback base class so the module can be imported without TensorFlow
    class Dummy(object):
        def __call__(self, *args, **kwargs):
            return None

    callback = Dummy


class ValidationCallback(callback):
    ...
```
@osanseviero I thought the problem was something more fundamental, because importing tf is no problem while importing Callback is. I'll do it this way for now but keep looking into it.
I fixed a bug in the steps strategy:
- generate dummy data with np.ones() instead of giving a list
- removed `show_layer_activations` and added tempfile to tests to align with main beforehand
- did some nitpicking

I will check once more before I ask for a review.
The test for the steps strategy was flaky when the number of batches was odd and the save steps were even, so I rounded down (floor), since that's what happens with leftover batches.
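The floor behavior above can be captured in a one-line helper. This is a sketch of the arithmetic only (the function name is made up, not from the PR):

```python
def expected_step_pushes(num_batches: int, save_steps: int) -> int:
    # Leftover batches smaller than a full save interval never trigger
    # a push, so the expected commit count is the floor division.
    return num_batches // save_steps
```

For example, 7 batches with save_steps=2 yields 3 pushes, which is what the test should assert rather than rounding up.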
@osanseviero We can return for steps and block for epochs.
I usually prefer consistency, since it will confuse users that one method always pushes while the other doesn't. Is there any downside to making the epoch one non-blocking?
@osanseviero Epochs take longer, so I think blocking will not cause any change in performance, but I'll change it now if you think being consistent is better.
Thanks for explaining! Yes, I feel consistency is better, and I would go with non-blocking, which will be appreciated for large models.
When blocking is False the user will see:
```
remote: error: cannot lock ref 'refs/heads/main': is at 66d0b2cc64c2c1ad77084cac6c6414fdd2636fc6 but expected adb34ff5f99b94c48df5a435815e2399374bb559
To https://huggingface.co/merve/callback-batch-test
 ! [remote rejected] main -> main (failed to update ref)
error: failed to push some refs to '....'
```
So according to the refs, HEAD is at adb34 and there's a mismatch.
I find it a bit confusing for users. I'll try to fix this by updating the ref; I also saw that pruning could help.
(I keep this like a diary, you don't have to answer)
I'm a bit surprised by this. Do we get the same error in the transformers callback? The code is very similar too.
@osanseviero I think this is because of the length of epochs/steps and the state of the push; it doesn't happen in transformers (because epochs take long). I tested by making the model I'm testing with bigger, and the error doesn't appear. It also doesn't affect the pushes (I see everything pushed on the other end and the tests pass); it's just confusing for users. The models I was testing with previously took a second or two per step or epoch to train, which I think is why it happened.
Ah I see, this makes sense, thanks for explaining! I don't think it's a big issue if fixing it would require a significant change in the code, but if we can reduce the noise that would be great.
I think another difference between the implementation here and transformers is that in transformers we specifically wait for the previous push to finish before starting another one. We keep a push_in_progress variable that we check before trying to push again. Could that be the source of the refs mismatch?
@LysandreJik I get the same thing even when I keep the last_job returned from the push (which has is_done in the command output). I don't know if people train models that are super fast every step, so I don't know if it's worth updating refs manually inside the callback itself. The tests fail for this reason: I made the models very fast to train so that the tests pass fast 😅 but now I get this 😄. I'll put last_job back regardless, thanks for the catch.
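The guard described above (skip a push while the previous one is still in flight, checked via last_job / is_done) can be sketched in isolation. PushTracker, maybe_push, and push_fn are made-up names for illustration; the real callback would wrap the job object returned by the non-blocking git push:

```python
class PushTracker:
    """Skip a new push while the previous one is still running,
    mimicking the push_in_progress check in the transformers callback.
    (Sketch only; class and method names here are invented.)"""

    def __init__(self, push_fn):
        self.push_fn = push_fn  # starts an async push, returns a job object
        self.last_job = None

    def maybe_push(self):
        if self.last_job is not None and not self.last_job.is_done:
            return False  # previous push still in flight: skip this one
        self.last_job = self.push_fn()
        return True
```

This avoids two pushes racing on `refs/heads/main`, which is one plausible source of the "cannot lock ref" noise.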
This is definitely an edge case 😅
```
Epoch 2/2
1/1 [==============================] - 0s 3ms/step - loss: 0.5654
```
I'll just get the commit SHA and update the ref.
It doesn't work.
Removing this PR from the v0.5 milestone as version v0.5 will be released in a bit.
I will come up with a way to test this without messing with refs or making the model bigger.
Let me preface this by admitting that I have very little experience with keras but here are my thoughts:
As a user, when I see a callback like this, I would expect it to work similarly to ModelCheckpoint. E.g. there, I have the option to monitor a specific metric like validation loss and only create a checkpoint if that metric improves.
Implementation-wise, I took a look at ModelCheckpoint code to see if it could be used as a base class with a few changes, but it does not encapsulate the i/o part, so this would probably be a bad idea. Still, even if a lot of code would need to be copy-pasted, at least we can have confidence that ModelCheckpoint has a long history of usage and is well tested, allowing PushToHubCallback to focus on the i/o.
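The monitor-and-only-push-on-improvement idea above can be sketched without touching ModelCheckpoint's i/o at all. ImprovementGate is a hypothetical helper (the PR's callback does not currently expose this); it mirrors ModelCheckpoint's monitor/mode logic:

```python
class ImprovementGate:
    """Only allow a push when the monitored metric improves,
    like ModelCheckpoint's monitor/mode behavior.
    (Hypothetical sketch, not part of the PR's API.)"""

    def __init__(self, mode: str = "min"):
        self.mode = mode
        # Start from the worst possible value so the first epoch always pushes
        self.best = float("inf") if mode == "min" else float("-inf")

    def should_push(self, value: float) -> bool:
        improved = value < self.best if self.mode == "min" else value > self.best
        if improved:
            self.best = value
        return improved
```

A callback's on_epoch_end could then call `gate.should_push(logs["val_loss"])` and only push to the Hub when it returns True.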
@BenjaminBossan thanks for the feedback! This is a really great idea, and I ran into it too when I made a callback for PyTorch Lightning. We are probably better off copying the default model checkpoint callback and adding the few lines for pushing somewhere within. Would love to hear others' thoughts.
@merveenoyan I added a feature (this PR) on skorch to save model checkpoints on HF Hub. I took a different approach there which works with existing callbacks instead of writing a new one. Maybe that could be a more elegant solution for keras as well? Take a look at this notebook to see how it works from a user perspective. Whether such an approach would work with keras ModelCheckpoint, I can't say for sure.
This PR by @BenjaminBossan uses upload_file, which makes sense for this case, since it doesn't mess up git refs (from what I see, and my guess). Should I wait until we migrate the mixins to that, or would you like me to do it soon? @osanseviero @LysandreJik @nateraw
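The upload_file-based approach avoids refs entirely because each file becomes its own HTTP commit with no local git clone. A rough sketch of what that could look like for a Keras-style callback hook; the class and argument names are invented, and the upload function is injected (in practice it could be huggingface_hub's upload_file, whose real keyword names may differ):

```python
import os


class UploadCheckpoint:
    """Sketch of an HTTP-based checkpoint push: each saved file is sent
    with an injected upload function as a standalone commit, so there is
    no local git clone and no ref state to corrupt.
    (Hypothetical names; not the PR's implementation.)"""

    def __init__(self, repo_id, upload_fn):
        self.repo_id = repo_id
        self.upload_fn = upload_fn  # e.g. huggingface_hub.upload_file
        self.uploaded = []

    def on_epoch_end(self, filepath):
        # One HTTP upload per file; nothing touches refs/heads/main locally
        self.upload_fn(
            path_or_fileobj=filepath,
            path_in_repo=os.path.basename(filepath),
            repo_id=self.repo_id,
        )
        self.uploaded.append(filepath)
```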
I would wait until #847 is merged, @Wauplin is actively working on this
@merveenoyan I don't know what "soon" means or how urgent this callback feature is, but just an update to say that keras mixins leveraging non-git uploads are on their way (see PR https://github.com/huggingface/huggingface_hub/pull/847). In particular, it will be easier to build a callback based on push_to_hub_keras since there will be no need for initialization with a git pull anymore. See also the API example from @LysandreJik in https://github.com/huggingface/huggingface_hub/pull/847#issue-1210196213 .
Yes, I don't think this has been widely requested or is urgent, and since waiting will make everyone's life easier, I would just wait.