
Keras callback for pushing models to Hub

Open merveenoyan opened this issue 3 years ago • 30 comments

The callback to push models to the Hub. It's the best of both worlds between the transformers Keras callback and push_to_hub_callback(). I haven't written any tests for it yet because I couldn't decide on the best way to do it. I could either:

  1. Test by interfering with the training loop to check whether the model is pushed at specific intervals (which is probably a very, very bad idea).
  2. Check the commit history.

Here are a couple of examples (pushing per epoch to a Hub model ID and pushing by URL).

I don't want a review yet because 1. it's a branch derived from the unapproved model card PR branch (I did this intentionally because I needed the model card writing functions, so many merge conflicts are coming, lol), which needs to be merged first, and 2. I didn't write any tests. You're free to try it in your notebooks though.

merveenoyan avatar Feb 24 '22 17:02 merveenoyan

I finished the callback; the rest of the mixin file is not up to date (it is derived from the unapproved Keras model card branch, since I needed the model card writing functions). I added one test case that passes at the moment. HOWEVER, the test only checks whether the files are pushed to the Hub at the end of training, not whether they are pushed every step or epoch, which is not a good test. We can either check the commit history somehow (?) or I will have to interfere with the training loop and check whether it's being pushed (which is bad). Also, unlike card writing in the other cases, I added a flag indicating that the callback called the writing so that it will overwrite (normally we just read). I'm asking for an early review. @nateraw @osanseviero

merveenoyan avatar Feb 25 '22 16:02 merveenoyan

I wrote a test that, at the end of every x, records the last time a commit was made, and at the end of training checks whether that frequency matches the number of commits. TIL: I should unstage the commit; just not pushing doesn't solve it :D (see, I have two commits above :( )
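For reference, a rough sketch of what a commit-history based check could look like; this assumes a huggingface_hub version that exposes HfApi.list_repo_commits (a newer API than what was available when this thread was written), and the helper name is made up:

from huggingface_hub import HfApi

# Sketch only: count the commits on the repo and compare them to the number of
# pushes the callback was expected to make. Assumes HfApi.list_repo_commits
# exists and that the repo started with a single initial commit.
def assert_push_count(repo_id: str, expected_pushes: int, initial_commits: int = 1):
    commits = HfApi().list_repo_commits(repo_id)
    actual_pushes = len(commits) - initial_commits
    assert actual_pushes == expected_pushes, (
        f"expected {expected_pushes} pushes, found {actual_pushes}"
    )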

merveenoyan avatar Mar 02 '22 13:03 merveenoyan

I added tests for the step strategy and the end-of-training strategy as well.

merveenoyan avatar Mar 04 '22 16:03 merveenoyan

I will change the code, since model card creation is currently called directly inside save_pretrained_keras. I thought it doesn't make sense to wait until the merge, since it will change many things.

merveenoyan avatar Mar 07 '22 10:03 merveenoyan

I aligned things with main (where we merged keras_model_card) because it might have affected the final tests later. Changes include:

  • tensorboard override
  • model card creation called in save_pretrained_keras; I also added an override_card arg to save_pretrained_keras, since card creation happens there. Did some nitpicking as well.

merveenoyan avatar Mar 07 '22 12:03 merveenoyan

All the tests are failing (except for the TensorFlow tests) because Callback is not being imported when I try to import it like this in keras_mixin.py; I didn't anticipate this because there was no problem on my local setup.

if is_tf_available():
    import tensorflow as tf
    from tensorflow.keras.callbacks import Callback

I'll see how I can fix it.

merveenoyan avatar Mar 07 '22 13:03 merveenoyan

The issue is that ValidationCallback inherits/sub-classes from Callback, which is not imported unless you have TensorFlow installed.

osanseviero avatar Mar 07 '22 21:03 osanseviero

Not super elegant, but I think something along these lines would work. There might be a better approach though:

if is_tf_available():
    import tensorflow as tf

    callback = tf.keras.callbacks.Callback
else:
    # Placeholder base class so the module can still be imported without TensorFlow.
    class Dummy(object):
        def __call__(self, *args, **kwargs):
            return None

    callback = Dummy


class ValidationCallback(callback):
    ...

osanseviero avatar Mar 07 '22 21:03 osanseviero

@osanseviero I thought the problem was something more fundamental, because importing tf is not a problem while importing Callback is. I'll do it this way for now but keep looking into it.

merveenoyan avatar Mar 08 '22 08:03 merveenoyan

I fixed a bug in the steps strategy:

  • generate dummy data with np.ones() instead of giving a list
  • removed show_layer_activations and added tempfile to tests to align with main beforehand
  • did some nitpicking. I will check once more before I ask for a review.

merveenoyan avatar Mar 14 '22 17:03 merveenoyan

The test for the steps strategy was flaky when the number of batches was odd and the save steps were even, so I rounded down (floor), since that's what happens with the leftover batches.
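As a small illustration of that rounding (not the PR's test code), the expected number of pushes for a save-per-steps strategy comes out of a floor division, since leftover batches that don't complete a full interval never trigger a push:

# Illustration only: expected pushes for a "steps" save strategy.
# Leftover batches that don't fill a complete save interval do not push,
# hence the floor division.
def expected_pushes(num_batches: int, save_steps: int, num_epochs: int = 1) -> int:
    return (num_batches // save_steps) * num_epochs

# 7 batches with a save interval of 2 -> 3 pushes; the last odd batch is dropped.
assert expected_pushes(num_batches=7, save_steps=2) == 3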

merveenoyan avatar Mar 14 '22 18:03 merveenoyan

@osanseviero We can return immediately (non-blocking) for steps and block for epochs.

merveenoyan avatar Mar 17 '22 14:03 merveenoyan

I usually prefer consistency, since it will confuse users if one method blocks on the push while the other doesn't. Is there any downside to making the epoch one non-blocking?

osanseviero avatar Mar 18 '22 08:03 osanseviero

@osanseviero Epochs take longer, so I think blocking won't cause any change in performance, but I'll change it now if you think being consistent is better.

merveenoyan avatar Mar 18 '22 15:03 merveenoyan

Thanks for explaining! Yes, I feel consistency is better, and I would go with non-blocking, which will be appreciated for large models.

osanseviero avatar Mar 21 '22 09:03 osanseviero

When blocking is False the user will see:

remote: error: cannot lock ref 'refs/heads/main': is at 66d0b2cc64c2c1ad77084cac6c6414fdd2636fc6 but expected adb34ff5f99b94c48df5a435815e2399374bb559
To https://huggingface.co/merve/callback-batch-test
 ! [remote rejected] main -> main (failed to update ref)
error: failed to push some refs to '....'

So according to the refs, HEAD is at adb34 and there's a mismatch. I find it a bit confusing for users. I'll try to fix this by updating the ref; I also saw that pruning could help. (I keep this like a diary, you don't have to answer.)

merveenoyan avatar Mar 22 '22 12:03 merveenoyan

I'm a bit surprised by this. Do we get the same error in the transformers callback? The code is very similar.

osanseviero avatar Mar 23 '22 11:03 osanseviero

@osanseviero I think this is because of the length of the epochs/steps and the state of the push; it doesn't happen in transformers (because epochs take long there). I tested this myself by making the model I'm testing with bigger, and the error doesn't appear. It also doesn't affect the pushes (I see everything pushed on the other end and the tests pass); it's just confusing for users. The models I was testing with previously would take a second or two per step or epoch to train, and I think that's why it happened.

merveenoyan avatar Mar 23 '22 13:03 merveenoyan

Ah I see, this makes sense, thanks for explaining! I don't think it's a big issue if fixing it would require a significant change in the code, but if we can reduce the noise that would be great.

osanseviero avatar Mar 23 '22 13:03 osanseviero

I think another difference between the implementation here and the one in transformers is that in transformers we specifically wait for the previous push to be finished before starting another one. We keep a push_in_progress variable that we check before trying to push again. Could that be the source of the refs mismatch?
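For reference, a minimal sketch of that pattern (not the exact transformers code, and the callback name is made up); it assumes the git-based Repository API, where push_to_hub(blocking=False) returns the commit URL together with a CommandInProgress whose is_done can be polled:

from tensorflow.keras.callbacks import Callback
from huggingface_hub import Repository, save_pretrained_keras

# Sketch of the "wait for the previous push" pattern; names are illustrative.
class HubPushCallback(Callback):
    def __init__(self, repo: Repository):
        super().__init__()
        self.repo = repo
        self.push_in_progress = None  # CommandInProgress of the last push, if any

    def on_epoch_end(self, epoch, logs=None):
        # Skip this push if the previous non-blocking push has not finished yet.
        if self.push_in_progress is not None and not self.push_in_progress.is_done:
            return
        save_pretrained_keras(self.model, self.repo.local_dir)
        _, self.push_in_progress = self.repo.push_to_hub(
            commit_message=f"Training in progress, epoch {epoch}", blocking=False
        )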

LysandreJik avatar Mar 24 '22 09:03 LysandreJik

@LysandreJik I get the same thing even when I return the last_job from the push (which is is_done in the command output). I don't know whether people push models that are super fast to train every step, so I don't know whether it's worth updating the refs manually inside the callback itself. The tests fail for this reason; I made the models very fast to train so that the tests pass quickly 😅 but now I get this 😄. I'll put last_job back regardless, thanks for the catch. This is definitely an edge case 😅

Epoch 2/2
1/1 [==============================] - 0s 3ms/step - loss: 0.5654

I thought I'd just get the commit SHA and update the ref, but it doesn't work.

merveenoyan avatar Mar 24 '22 09:03 merveenoyan

Removing this PR from the v0.5 milestone as version v0.5 will be released in a bit.

LysandreJik avatar Apr 05 '22 12:04 LysandreJik

I will come up with a way to test this without messing with the refs or making the model bigger.

merveenoyan avatar May 09 '22 12:05 merveenoyan

Let me preface this by admitting that I have very little experience with Keras, but here are my thoughts:

As a user, when I would see a callback like this, I would expect it to work similarly to ModelCheckpoint. E.g. there, I have the option to monitor a specific metric like validation loss and only create a checkpoint if that metric improves.

Implementation-wise, I took a look at the ModelCheckpoint code to see if it could be used as a base class with a few changes, but it does not encapsulate the I/O part, so this would probably be a bad idea. Still, even if a lot of code would need to be copy-pasted, at least we can have confidence that ModelCheckpoint has a long history of usage and is well tested, allowing PushToHubCallback to focus on the I/O.
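To make that concrete, a hedged sketch (not this PR's code) of a callback that only pushes when a monitored metric improves, similar in spirit to ModelCheckpoint(save_best_only=True); the class name is made up and the positional push_to_hub_keras call is an assumption about its signature:

import numpy as np
from tensorflow.keras.callbacks import Callback
from huggingface_hub import push_to_hub_keras

# Sketch only: push to the Hub when `monitor` improves, in the spirit of
# keras.callbacks.ModelCheckpoint(save_best_only=True). Names are illustrative.
class PushBestToHubCallback(Callback):
    def __init__(self, repo_name: str, monitor: str = "val_loss", mode: str = "min"):
        super().__init__()
        self.repo_name = repo_name
        self.monitor = monitor
        self.improved = np.less if mode == "min" else np.greater
        self.best = np.inf if mode == "min" else -np.inf

    def on_epoch_end(self, epoch, logs=None):
        current = (logs or {}).get(self.monitor)
        if current is None or not self.improved(current, self.best):
            return
        self.best = current
        push_to_hub_keras(
            self.model, self.repo_name,
            commit_message=f"Best {self.monitor}: {current:.4f} (epoch {epoch})",
        )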

BenjaminBossan avatar Jul 11 '22 10:07 BenjaminBossan

@BenjaminBossan thanks for the feedback! This is a really great idea, and I ran into it too when I made a callback for PyTorch Lightning. We're almost better off copying the default model checkpoint callback and adding the few lines for pushing somewhere within it. Would love to hear others' thoughts.

nateraw avatar Jul 11 '22 16:07 nateraw

@merveenoyan I added a feature (this PR) on skorch to save model checkpoints on HF Hub. I took a different approach there which works with existing callbacks instead of writing a new one. Maybe that could be a more elegant solution for keras as well? Take a look at this notebook to see how it works from a user perspective. Whether such an approach would work with keras ModelCheckpoint, I can't say for sure.

BenjaminBossan avatar Aug 04 '22 14:08 BenjaminBossan

This PR by @BenjaminBossan uses upload_file, which makes sense for this case since it doesn't mess up the git refs (from what I see, and my guess). Should I wait until we migrate the mixins to that, or would you like me to do it soon? @osanseviero @LysandreJik @nateraw
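For context, a minimal sketch of what an upload_file based push could look like from inside a plain Keras callback; the class name is made up, and saving to a single HDF5 file is only there to keep the single-file upload_file call simple:

import os
import tempfile
from tensorflow.keras.callbacks import Callback
from huggingface_hub import HfApi

# Sketch only: git-free checkpointing via upload_file, in the spirit of the
# skorch approach mentioned above. Not this PR's implementation.
class UploadFileCallback(Callback):
    def __init__(self, repo_id: str):
        super().__init__()
        self.repo_id = repo_id
        self.api = HfApi()

    def on_epoch_end(self, epoch, logs=None):
        with tempfile.TemporaryDirectory() as tmpdir:
            local_path = os.path.join(tmpdir, "model.h5")
            self.model.save(local_path)  # single-file HDF5 save
            self.api.upload_file(
                path_or_fileobj=local_path,
                path_in_repo="model.h5",
                repo_id=self.repo_id,
                commit_message=f"Checkpoint at epoch {epoch}",
            )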

merveenoyan avatar Aug 08 '22 12:08 merveenoyan

I would wait until #847 is merged; @Wauplin is actively working on this.

osanseviero avatar Aug 08 '22 14:08 osanseviero

@merveenoyan I don't know what "soon" means and how urgent this callback feature is, but just an update to say that the Keras mixins leveraging non-git uploads are on their way (see PR https://github.com/huggingface/huggingface_hub/pull/847). In particular, it will be easier to build a callback based on push_to_hub_keras since there will be no need for an initialization with a git pull anymore. See also the API example from @LysandreJik in https://github.com/huggingface/huggingface_hub/pull/847#issue-1210196213.

Wauplin avatar Aug 08 '22 14:08 Wauplin

Yes, I don't think this has been widely requested or is urgent, and since waiting will make everyone's life easier, I would just wait.

osanseviero avatar Aug 08 '22 15:08 osanseviero