MoViNet hub and source outputs differ
Prerequisites
Please answer the following questions for yourself before submitting an issue.
- [x] I am using the latest TensorFlow Model Garden release and TensorFlow 2.
- [x] I am reporting the issue to the correct repository. (Model Garden official or research directory)
- [x] I checked to make sure that this issue has not been filed already.
1. The entire URL of the file you are using
https://github.com/tensorflow/models/tree/master/official/projects/movinet
2. Describe the bug
A minimal code example will follow the bug description.
- Initialized a pretrained MoViNet A0 Stream model from hub: https://tfhub.dev/tensorflow/movinet/a0/stream/kinetics-600/classification/
- Initialized a pretrained MoViNet A0 Stream model from checkpoint: https://storage.googleapis.com/tf_model_garden/vision/movinet/movinet_a0_stream.tar.gz
The logit outputs differ considerably between the two models. I have validated that the model weights in the hub and the checkpoint are the same.
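For reference, a weight comparison of this kind can be sketched as follows. This is a minimal illustration, not the exact check used for this report. It assumes the weights of each model have already been collected into dictionaries keyed by a shared name; the helper `compare_weights` and that naming scheme are hypothetical and not part of Model Garden or TF Hub.

```python
import numpy as np


def compare_weights(weights_a, weights_b, atol=1e-6):
    """Compare two {name: array} mappings and report where they disagree.

    Hypothetical helper: returns names present on only one side, names
    whose shapes differ, and names whose values differ by more than atol.
    """
    names_a, names_b = set(weights_a), set(weights_b)
    report = {
        "only_in_a": sorted(names_a - names_b),
        "only_in_b": sorted(names_b - names_a),
        "shape_mismatch": [],
        "value_mismatch": [],
    }
    for name in sorted(names_a & names_b):
        a = np.asarray(weights_a[name])
        b = np.asarray(weights_b[name])
        if a.shape != b.shape:
            report["shape_mismatch"].append(name)
        elif not np.allclose(a, b, atol=atol, rtol=0.0):
            report["value_mismatch"].append(name)
    return report
```

If all four lists in the report come back empty, the two sets of weights agree within `atol`, which points the investigation at the model configuration or the input pipeline rather than at the stored values.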
3. Steps to reproduce
Before running the unit test, download and extract the checkpoint:
- wget https://storage.googleapis.com/tf_model_garden/vision/movinet/movinet_a0_stream.tar.gz -O movinet_a0_stream_.tar.gz -q
- tar -xvf movinet_a0_stream_.tar.gz
import unittest
from typing import Tuple, Dict
import tensorflow_hub as hub
import tensorflow as tf
from six.moves import urllib
from io import BytesIO
from PIL import Image
from official.projects.movinet.modeling import movinet
from official.projects.movinet.modeling import movinet_model
import numpy as np

model_id = 'a0'
num_classes = 600
H = W = 172
C = 3
T = 1
bs = 1
dummy_input = tf.random.normal(shape=[bs, T, H, W, 3])


def create_hub_model(model_id) -> Tuple[tf.keras.Model, Dict]:
    hub_url = f"https://tfhub.dev/tensorflow/movinet/{model_id}/stream/kinetics-600/classification/"
    model_hub = hub.KerasLayer(hub_url)
    init_states_fn = model_hub.resolved_object.signatures['init_states']
    init_states = init_states_fn(tf.shape(dummy_input))
    return model_hub, init_states


def create_local(model_id) -> Tuple[movinet.Movinet, Dict]:
    backbone = movinet.Movinet(
        model_id=model_id,
        causal=True,
        conv_type='2plus1d',
        se_type='2plus3d',
        activation='hard_swish',
        gating_activation='hard_sigmoid',
        use_positional_encoding=False,
        use_external_states=True,
    )
    backbone.trainable = False
    model = movinet_model.MovinetClassifier(
        backbone,
        num_classes=600,
        output_states=True
    )
    checkpoint_dir = f'movinet_{model_id}_stream'
    checkpoint_path = tf.train.latest_checkpoint(checkpoint_dir)
    checkpoint = tf.train.Checkpoint(model=model)
    status = checkpoint.restore(checkpoint_path).expect_partial()
    status.assert_existing_objects_matched()
    init_states_local = model.init_states(tf.shape(dummy_input))
    return model, init_states_local


class MyTestCase(unittest.TestCase):
    def test_hub_equal_source(self):
        model_hub, states_hub = create_hub_model(model_id)
        image_url = 'https://upload.wikimedia.org/wikipedia/commons/8/84/Ski_Famille_-_Family_Ski_Holidays.jpg'
        with urllib.request.urlopen(image_url) as f:
            image = Image.open(BytesIO(f.read())).resize((H, W))
        X = tf.reshape(np.array(image), [1, 1, H, W, 3])
        X = tf.cast(X, tf.float32) / 255
        y_hub, _ = model_hub({**states_hub, 'image': X})
        print(y_hub[0][0:5])
        model_local, states_local = create_local(model_id)
        y_local, _ = model_local({**states_local, 'image': X})
        print(y_local[0][0:5])
        tf.debugging.assert_near(y_local, y_hub, atol=1e-3)


if __name__ == '__main__':
    unittest.main()
4. Expected behavior
The output logits of the hub model and the checkpoint model should be close. However, they differ considerably.
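To make "differ considerably" measurable, the gap between the two logit vectors could be summarized with a small helper such as the one below. It is a sketch only: `summarize_logit_gap` is a hypothetical name, it works on plain Python sequences of floats, and the default tolerance simply mirrors the `atol=1e-3` used in the test.

```python
def summarize_logit_gap(logits_a, logits_b, atol=1e-3):
    """Summarize how far two equal-length logit vectors are from each other."""
    if len(logits_a) != len(logits_b):
        raise ValueError("logit vectors must have the same length")
    # Element-wise absolute differences between the two vectors.
    diffs = [abs(a - b) for a, b in zip(logits_a, logits_b)]
    # Index of the largest logit (predicted class) on each side.
    top1_a = max(range(len(logits_a)), key=logits_a.__getitem__)
    top1_b = max(range(len(logits_b)), key=logits_b.__getitem__)
    return {
        "max_abs_diff": max(diffs),
        "mean_abs_diff": sum(diffs) / len(diffs),
        "within_atol": max(diffs) <= atol,
        "same_top1": top1_a == top1_b,
    }
```

With the tensors from the test it could be called as `summarize_logit_gap(y_hub[0].numpy().tolist(), y_local[0].numpy().tolist())`. Reporting `same_top1` alongside the numeric gap helps separate small numerical drift from a genuinely different prediction.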
5. Additional context
Dependencies for the test:
- numpy
- Pillow==11.1.0
- six==1.17.0
- tensorflow[and-cuda]==2.18.1
- tensorflow_hub==0.16.1
- tf_models_official==2.18.00
6. System information
- OS Platform and Distribution - Ubuntu 22.04.5 LTS
- TensorFlow installed from (source or binary): binary
- TensorFlow version (use command below): 2.18.1
- Python version: 3.10.12
- CUDA/cuDNN version: cuda_12.8.r12.8
- GPU model and memory: NVIDIA GeForce RTX 4090, 24GB
Below are the two key changes needed:
- Enable positional encoding.
  Original: use_positional_encoding=False
  Update to: use_positional_encoding=True
- Add dropout and drop connect rates. These parameters are used in the TFHub model and must be explicitly set in the local model to ensure parity.
  Update classifier config: dropout_rate=0.2
  Update backbone config: drop_connect_rate=0.2
@Jiya873 Thank you for the reply.
- drop_connect_rate is not an input parameter to movinet.Movinet. Did you mean stochastic_depth_drop_rate?
- Regarding use_positional_encoding: when this parameter is set to True, the checkpoint and the model become incompatible. Note that the test runs the a0 stream model. The following exception is thrown on status.assert_existing_objects_matched():
AssertionError: Found 15 Python objects that were not bound to checkpointed values, likely due to changes in the Python program. Showing 10 of 15 unmatched objects: [<tf.Variable 'scale:0' shape=() dtype=float32, numpy=0.0>, <tf.Variable 'scale:0' shape=() dtype=float32, numpy=0.0>, <tf.Variable 'scale:0' shape=() dtype=float32, numpy=0.0>, <tf.Variable 'scale:0' shape=() dtype=float32, numpy=0.0>, <tf.Variable 'scale:0' shape=() dtype=float32, numpy=0.0>, <tf.Variable 'scale:0' shape=() dtype=float32, numpy=0.0>, <tf.Variable 'scale:0' shape=() dtype=float32, numpy=0.0>, <tf.Variable 'scale:0' shape=() dtype=float32, numpy=0.0>, <tf.Variable 'scale:0' shape=() dtype=float32, numpy=0.0>, <tf.Variable 'scale:0' shape=() dtype=float32, numpy=0.0>]
This aligns with the paper and documentation, which state that positional encodings are used in streaming models from model a3 and above, inclusive. The test used a0.
I've also tested the suggested changes on the a3 stream version, using the same test, and no AssertionError is thrown on assert_existing_objects_matched. However, the outputs of the hub and source models still do not match.
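The rule cited above (positional encodings only for stream models a3 and above) could be encoded in a small helper so the test picks the right setting per model id. This is a sketch based solely on the statement in this thread; `expects_positional_encoding` is a hypothetical name and is not part of the Model Garden API.

```python
# Assumption taken from this thread: streaming variants a3 and above use
# positional encodings, a0 through a2 do not.
_FIRST_ID_WITH_POSITIONAL_ENCODING = 3


def expects_positional_encoding(model_id: str) -> bool:
    """Return True for stream model ids 'a3'-'a5', False for 'a0'-'a2'."""
    valid = (
        len(model_id) == 2
        and model_id[0] == "a"
        and model_id[1] in "012345"
    )
    if not valid:
        raise ValueError(f"unexpected model id: {model_id!r}")
    return int(model_id[1]) >= _FIRST_ID_WITH_POSITIONAL_ENCODING
```

In `create_local`, the result could then be passed as `use_positional_encoding=expects_positional_encoding(model_id)` so that the a0 and a3 runs of the same test each build a backbone that matches their checkpoint.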