FlipReID

train & test

Open FatemehAnvari opened this issue 1 year ago • 13 comments

Hello, and thank you for your effort. I want to train and test a model using the provided code, but in both cases it gives me the following error:

Screenshot 2024-08-19 141631

Can you guide me?

FatemehAnvari avatar Aug 19 '24 10:08 FatemehAnvari

Thank you for your interest in our work. To resolve the error, please ensure you've correctly downloaded the Market-1501 dataset. Follow the instructions here to obtain the complete dataset. Once downloaded, verify that the 'bounding_box_train' folder contains 12936 .jpg files, 'bounding_box_test' contains 19732 .jpg files, and 'query' contains 3368 .jpg files.
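
A short script can check these counts. This is a minimal sketch that assumes the three folders sit directly under the dataset root; the helper names `count_jpgs` and `find_mismatches` are hypothetical and not part of the repository, while the folder names and expected counts are the ones quoted above.

```python
from pathlib import Path

# Expected .jpg counts for Market-1501, as quoted in the reply above.
EXPECTED_COUNTS = {
    "bounding_box_train": 12936,
    "bounding_box_test": 19732,
    "query": 3368,
}


def count_jpgs(dataset_root):
    """Return {folder_name: number of .jpg files} for each expected folder."""
    root = Path(dataset_root)
    counts = {}
    for folder_name in EXPECTED_COUNTS:
        folder = root / folder_name
        # A missing folder is reported as 0 files rather than raising.
        counts[folder_name] = len(list(folder.glob("*.jpg"))) if folder.is_dir() else 0
    return counts


def find_mismatches(dataset_root):
    """Return {folder_name: (found, expected)} for folders with a wrong count."""
    counts = count_jpgs(dataset_root)
    return {
        name: (counts[name], expected)
        for name, expected in EXPECTED_COUNTS.items()
        if counts[name] != expected
    }
```

Calling `find_mismatches` on the dataset root returns an empty dictionary when all three folders hold the expected number of files, and otherwise lists the folders whose counts differ.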

nixingyang avatar Aug 19 '24 11:08 nixingyang

Thank you for your quick and effective reply. I fixed the previous problem, but now I have a problem with the model. Thank you for your guidance.

Screenshot 2024-08-21 121552

FatemehAnvari avatar Aug 21 '24 08:08 FatemehAnvari

raise ValueError(
ValueError: Layer count mismatch when loading weights from file. Model expected 0 layers, found 2 saved layers.

FatemehAnvari avatar Aug 21 '24 08:08 FatemehAnvari

It might be a compatibility issue between your setup (Python 3.12 and likely the latest TensorFlow) and the code's original environment (TensorFlow 2.2.3 and Python 3.8). Here are two approaches to address this:

  • Use the code's original environment. Refer to the instructions here. However, this approach might not be feasible for newer GPUs.
  • Update the repository dependencies. Consider updating the repository to support the latest TensorFlow. You may train your own models as long as the final performance reflects the reported results.

nixingyang avatar Aug 21 '24 09:08 nixingyang

Thank you very much. I installed TensorFlow 2.14 and Python 3.10.12. The program now runs as far as the output below, but then it fails with an error whose cause I don't understand. I also think the network parameter counts look a bit illogical, don't you think so?

Please advise on the error.

Model: "training_model"


Layer (type) Output Shape Param # Connected to

input_11 (InputLayer) [(None, 384, 128, 3)] 0 []

inference_model (Functional) [(None, 2048), (None, 1024), (None, 1024)] 76299200 ['input_11[0][0]', 'tf.image.flip_left_right[0][0]']

tf.image.flip_left_right (TFOpLambda) (None, 384, 128, 3) 0 ['input_11[0][0]']

tf.operators.add (TFOpLambda) (None, 2048) 0 ['inference_model[0][0]']

tf.operators.add_2 (TFOpLambda) (None, 1024) 0 ['inference_model[0][1]']

tf.operators.add_4 (TFOpLambda) (None, 1024) 0 ['inference_model[0][2]']

tf.operators.add_1 (TFOpLambda) (None, 2048) 0 ['tf.operators.add[0][0]', 'inference_model[1][0]']

tf.operators.add_3 (TFOpLambda) (None, 1024) 0 ['tf.operators.add_2[0][0]', 'inference_model[1][1]']

tf.operators.add_5 (TFOpLambda) (None, 1024) 0 ['tf.operators.add_4[0][0]', 'inference_model[1][2]']

tf.math.truediv_2 (TFOpLambda) (None, 2048) 0 ['tf.operators.add_1[0][0]']

tf.math.truediv_3 (TFOpLambda) (None, 1024) 0 ['tf.operators.add_3[0][0]']

tf.math.truediv_4 (TFOpLambda) (None, 1024) 0 ['tf.operators.add_5[0][0]']

classification_model (Functional) [(None, 751), (None, 751), (None, 751)] 3092480 ['tf.math.truediv_2[0][0]', 'tf.math.truediv_3[0][0]', 'tf.math.truediv_4[0][0]']

tf.convert_to_tensor (TFOpLambda) (None, 2048) 0 ['inference_model[1][0]']

tf.cast (TFOpLambda) (None, 2048) 0 ['inference_model[0][0]']

tf.math.squared_difference (TFOpLambda) (None, 2048) 0 ['tf.convert_to_tensor[0][0]', 'tf.cast[0][0]']

tf.convert_to_tensor_1 (TFOpLambda) (None, 1024) 0 ['inference_model[1][1]']

tf.cast_1 (TFOpLambda) (None, 1024) 0 ['inference_model[0][1]']

tf.math.reduce_mean_3 (TFOpLambda) (None,) 0 ['tf.math.squared_difference[0][0]']

tf.math.squared_difference_1 (TFOpLambda) (None, 1024) 0 ['tf.convert_to_tensor_1[0][0]', 'tf.cast_1[0][0]']

tf.convert_to_tensor_2 (TFOpLambda) (None, 1024) 0 ['inference_model[1][2]']

tf.cast_2 (TFOpLambda) (None, 1024) 0 ['inference_model[0][2]']

tf.math.reduce_mean_4 (TFOpLambda) () 0 ['tf.math.reduce_mean_3[0][0]']

tf.math.reduce_mean_5 (TFOpLambda) (None,) 0 ['tf.math.squared_difference_1[0][0]']

tf.math.squared_difference_2 (TFOpLambda) (None, 1024) 0 ['tf.convert_to_tensor_2[0][0]', 'tf.cast_2[0][0]']

tf.operators.add_6 (TFOpLambda) () 0 ['tf.math.reduce_mean_4[0][0]']

tf.math.reduce_mean_6 (TFOpLambda) () 0 ['tf.math.reduce_mean_5[0][0]']

tf.math.reduce_mean_7 (TFOpLambda) (None,) 0 ['tf.math.squared_difference_2[0][0]']

tf.operators.add_7 (TFOpLambda) () 0 ['tf.operators.add_6[0][0]', 'tf.math.reduce_mean_6[0][0]']

tf.math.reduce_mean_8 (TFOpLambda) () 0 ['tf.math.reduce_mean_7[0][0]']

tf.operators.add_8 (TFOpLambda) () 0 ['tf.operators.add_7[0][0]', 'tf.math.reduce_mean_8[0][0]']

add_metric (AddMetric) () 0 ['tf.operators.add_8[0][0]']

tf.math.multiply (TFOpLambda) () 0 ['tf.operators.add_8[0][0]']

add_loss (AddLoss) () 0 ['tf.math.multiply[0][0]']

==================================================================================================
Total params: 79391680 (302.86 MB)
Trainable params: 40835072 (155.77 MB)
Non-trainable params: 38556608 (147.08 MB)


Summarizing inference_model_132322870669664 ... Model: "inference_model"


Layer (type) Output Shape Param # Connected to

input_1 (InputLayer) [(None, 384, 128, 3)] 0 []

preprocess_input (Functional) (None, 384, 128, 3) 0 ['input_1[0][0]']

features/init_block (Functional) (None, 96, 32, 64) 9664 ['preprocess_input[0][0]']

features/stage1 (Functional) (None, 96, 32, 256) 218624 ['features/init_block[0][0]']

features/stage2 (Functional) (None, 48, 16, 512) 1226752 ['features/stage1[0][0]']

features/stage3 (Functional) (None, 24, 8, 1024) 7118848 ['features/stage2[0][0]']

features/stage4_regional_branch (Functional) (None, 24, 8, 2048) 14987264 ['features/stage3[0][0]']

lambda_1 (Lambda) (None, 12, 8, 2048) 0 ['features/stage4_regional_branch[0][0]']

lambda_3 (Lambda) (None, 12, 8, 2048) 0 ['features/stage4_regional_branch[0][0]']

conv2d (Conv2D) (None, 12, 8, 1024) 18875392 ['lambda_1[0][0]']

conv2d_1 (Conv2D) (None, 12, 8, 1024) 18875392 ['lambda_3[0][0]']

features/stage4_global_branch (Functional) (None, 24, 8, 2048) 14987264 ['features/stage3[0][0]']

activation (Activation) (None, 12, 8, 1024) 0 ['conv2d[0][0]']

activation_1 (Activation) (None, 12, 8, 1024) 0 ['conv2d_1[0][0]']

tf.math.maximum (TFOpLambda) (None, 24, 8, 2048) 0 ['features/stage4_global_branch[0][0]']

tf.math.maximum_1 (TFOpLambda) (None, 12, 8, 1024) 0 ['activation[0][0]']

tf.math.maximum_2 (TFOpLambda) (None, 12, 8, 1024) 0 ['activation_1[0][0]']

tf.math.pow (TFOpLambda) (None, 24, 8, 2048) 0 ['tf.math.maximum[0][0]']

tf.math.pow_2 (TFOpLambda) (None, 12, 8, 1024) 0 ['tf.math.maximum_1[0][0]']

tf.math.pow_4 (TFOpLambda) (None, 12, 8, 1024) 0 ['tf.math.maximum_2[0][0]']

tf.math.reduce_mean (TFOpLambda) (None, 2048) 0 ['tf.math.pow[0][0]']

tf.math.reduce_mean_1 (TFOpLambda) (None, 1024) 0 ['tf.math.pow_2[0][0]']

tf.math.reduce_mean_2 (TFOpLambda) (None, 1024) 0 ['tf.math.pow_4[0][0]']

tf.math.pow_1 (TFOpLambda) (None, 2048) 0 ['tf.math.reduce_mean[0][0]']

tf.math.pow_3 (TFOpLambda) (None, 1024) 0 ['tf.math.reduce_mean_1[0][0]']

tf.math.pow_5 (TFOpLambda) (None, 1024) 0 ['tf.math.reduce_mean_2[0][0]']

lambda (Lambda) (None, 2048) 0 ['tf.math.pow_1[0][0]']

lambda_2 (Lambda) (None, 1024) 0 ['tf.math.pow_3[0][0]']

lambda_4 (Lambda) (None, 1024) 0 ['tf.math.pow_5[0][0]']

==================================================================================================
Total params: 76299200 (291.06 MB)
Trainable params: 37750784 (144.01 MB)
Non-trainable params: 38548416 (147.05 MB)


Summarizing preprocess_input_132323024335632 ... Model: "preprocess_input"


Layer (type) Output Shape Param #

input_2 (InputLayer) [(None, 384, 128, 3)] 0

tf.math.truediv (TFOpLambda) (None, 384, 128, 3) 0

tf.nn.bias_add (TFOpLambda) (None, 384, 128, 3) 0

tf.math.truediv_1 (TFOpLambda) (None, 384, 128, 3) 0

=================================================================
Total params: 0 (0.00 Byte)
Trainable params: 0 (0.00 Byte)
Non-trainable params: 0 (0.00 Byte)


Summarizing features/init_block_132323154296304 ... Model: "features/init_block"


Layer (type) Output Shape Param #

input_3 (InputLayer) [(None, 384, 128, 3)] 0

conv (ConvBlock) (None, 192, 64, 64) 9664

pool (MaxPool2d) (None, 96, 32, 64) 0

=================================================================
Total params: 9664 (37.75 KB)
Trainable params: 0 (0.00 Byte)
Non-trainable params: 9664 (37.75 KB)


Summarizing features/stage1_132323035711712 ... Model: "features/stage1"


Layer (type) Output Shape Param #

input_4 (InputLayer) [(None, 96, 32, 64)] 0

stage1/unit_0_1 (ResUnit) (None, 96, 32, 256) 76288

stage1/unit_0_2 (ResUnit) (None, 96, 32, 256) 71168

stage1/unit_0_3 (ResUnit) (None, 96, 32, 256) 71168

=================================================================
Total params: 218624 (854.00 KB)
Trainable params: 0 (0.00 Byte)
Non-trainable params: 218624 (854.00 KB)


Summarizing features/stage2_132323040862960 ... Model: "features/stage2"


Layer (type) Output Shape Param #

input_5 (InputLayer) [(None, 96, 32, 256)] 0

stage2/unit_1_1 (ResUnit) (None, 48, 16, 512) 381952

stage2/unit_1_2 (ResUnit) (None, 48, 16, 512) 281600

stage2/unit_1_3 (ResUnit) (None, 48, 16, 512) 281600

stage2/unit_1_4 (ResUnit) (None, 48, 16, 512) 281600

=================================================================
Total params: 1226752 (4.68 MB)
Trainable params: 0 (0.00 Byte)
Non-trainable params: 1226752 (4.68 MB)


Summarizing features/stage3_132323037229984 ... Model: "features/stage3"


Layer (type) Output Shape Param #

input_6 (InputLayer) [(None, 48, 16, 512)] 0

stage3/unit_2_1 (ResUnit) (None, 24, 8, 1024) 1517568

stage3/unit_2_2 (ResUnit) (None, 24, 8, 1024) 1120256

stage3/unit_2_3 (ResUnit) (None, 24, 8, 1024) 1120256

stage3/unit_2_4 (ResUnit) (None, 24, 8, 1024) 1120256

stage3/unit_2_5 (ResUnit) (None, 24, 8, 1024) 1120256

stage3/unit_2_6 (ResUnit) (None, 24, 8, 1024) 1120256

=================================================================
Total params: 7118848 (27.16 MB)
Trainable params: 0 (0.00 Byte)
Non-trainable params: 7118848 (27.16 MB)


Summarizing features/stage4_regional_branch_132323156030704 ... Model: "features/stage4_regional_branch"


Layer (type) Output Shape Param #

input_7 (InputLayer) [(None, 24, 8, 1024)] 0

unit_3_1 (ResUnit) (None, 24, 8, 2048) 6049792

unit_3_2 (ResUnit) (None, 24, 8, 2048) 4468736

unit_3_3 (ResUnit) (None, 24, 8, 2048) 4468736

=================================================================
Total params: 14987264 (57.17 MB)
Trainable params: 0 (0.00 Byte)
Non-trainable params: 14987264 (57.17 MB)


Summarizing features/stage4_global_branch_132323034999696 ... Model: "features/stage4_global_branch"


Layer (type) Output Shape Param #

input_7 (InputLayer) [(None, 24, 8, 1024)] 0

unit_3_1 (ResUnit) (None, 24, 8, 2048) 6049792

unit_3_2 (ResUnit) (None, 24, 8, 2048) 4468736

unit_3_3 (ResUnit) (None, 24, 8, 2048) 4468736

=================================================================
Total params: 14987264 (57.17 MB)
Trainable params: 0 (0.00 Byte)
Non-trainable params: 14987264 (57.17 MB)


Summarizing classification_model_132322843232480 ... Model: "classification_model"


Layer (type) Output Shape Param # Connected to

input_8 (InputLayer) [(None, 2048)] 0 []

input_9 (InputLayer) [(None, 1024)] 0 []

input_10 (InputLayer) [(None, 1024)] 0 []

batch_normalization (BatchNormalization) (None, 2048) 8192 ['input_8[0][0]']

batch_normalization_1 (BatchNormalization) (None, 1024) 4096 ['input_9[0][0]']

batch_normalization_2 (BatchNormalization) (None, 1024) 4096 ['input_10[0][0]']

dense (Dense) (None, 751) 1538048 ['batch_normalization[0][0]']

dense_1 (Dense) (None, 751) 769024 ['batch_normalization_1[0][0]']

dense_2 (Dense) (None, 751) 769024 ['batch_normalization_2[0][0]']

activation_2 (Activation) (None, 751) 0 ['dense[0][0]']

activation_3 (Activation) (None, 751) 0 ['dense_1[0][0]']

activation_4 (Activation) (None, 751) 0 ['dense_2[0][0]']

==================================================================================================
Total params: 3092480 (11.80 MB)
Trainable params: 3084288 (11.77 MB)
Non-trainable params: 8192 (32.00 KB)


Traceback (most recent call last):
  File "/content/FlipReID/solution.py", line 925, in <module>
    app.run(main)
  File "/usr/local/lib/python3.10/dist-packages/absl/app.py", line 308, in run
    _run_main(main, args)
  File "/usr/local/lib/python3.10/dist-packages/absl/app.py", line 254, in _run_main
    sys.exit(main(argv))
  File "/content/FlipReID/solution.py", line 789, in main
    visualize_model(model=training_model, output_folder_path=output_folder_path)
  File "/content/FlipReID/utils/vis_utils.py", line 23, in visualize_model
    item for item in model._layers  # pylint: disable=protected-access
AttributeError: 'Functional' object has no attribute '_layers'. Did you mean: 'layers'?

FatemehAnvari avatar Aug 24 '24 09:08 FatemehAnvari

  • AttributeError: 'Functional' object has no attribute '_layers'. Did you mean: 'layers'? is caused by API differences between TensorFlow versions. The provided code was written for TensorFlow 2.2.3, not TensorFlow 2.14. Could you try using an environment the same as this?
  • visualize_model only plots the model and is not essential. You may comment out this line and check whether the program runs afterward.
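
If you would rather keep the plotting step, one possible adaptation is to read the layer list in a version-tolerant way. This is only a sketch: `get_model_layers` is a hypothetical helper, not code from the repository, and the private `_layers` list and the public `layers` property are not guaranteed to hold identical contents across TensorFlow versions.

```python
def get_model_layers(model):
    """Return the model's layers, preferring the private list when it exists."""
    # Older tf.keras models exposed a private `_layers` list; the model in the
    # traceback above only offers the public `layers` property.
    layers = getattr(model, "_layers", None)
    if layers is None:
        layers = model.layers
    return list(layers)
```

The expression `model._layers` in utils/vis_utils.py could then be replaced by `get_model_layers(model)`, assuming the rest of visualize_model works with the public layer list.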

nixingyang avatar Aug 24 '24 09:08 nixingyang

I did what you said. Training now starts, but the first epoch has been running for about 1 hour and 40 minutes without any results. Is that normal?

Screenshot 2024-08-24 152312

FatemehAnvari avatar Aug 24 '24 11:08 FatemehAnvari

  • 1 hour and 40 minutes sounds too long if steps_per_epoch is set to 200.
  • I remember that the training procedure is very efficient, and the GPU utilization rate should be around 100% most of the time. You may check the output of nvidia-smi.
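
The nvidia-smi check can also be run from Python while training is in progress. A minimal sketch, assuming nvidia-smi is installed and on PATH; the function name `query_gpu_utilization` is hypothetical.

```python
import shutil
import subprocess


def query_gpu_utilization():
    """Return one 'utilization, memory used' line per GPU, or None if unavailable."""
    if shutil.which("nvidia-smi") is None:
        return None  # nvidia-smi was not found on PATH
    result = subprocess.run(
        [
            "nvidia-smi",
            "--query-gpu=utilization.gpu,memory.used",
            "--format=csv,noheader",
        ],
        capture_output=True,
        text=True,
        check=False,
    )
    if result.returncode != 0:
        return None  # the tool ran but could not query the driver
    return [line.strip() for line in result.stdout.splitlines() if line.strip()]
```

A utilization that stays near 0 % during training would suggest that the work is running on the CPU.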

nixingyang avatar Aug 24 '24 19:08 nixingyang

I checked. It doesn't seem to use the GPU, although it recognizes that a GPU is available.
torch.cuda.is_available():True
torch.device:cuda
Initiating the image augmentor ...
Perform training ...
Freeze layers in the backbone model for 20 epochs.

Epoch 1: LearningRateScheduler setting learning rate to 2e-06.
Epoch 1/20
WARNING:tensorflow:From /usr/local/lib/python3.10/site-packages/tensorflow/python/util/deprecation.py:660: calling map_fn_v2 (from tensorflow.python.ops.map_fn) with dtype is deprecated and will be removed in a future version.
Instructions for updating:
Use fn_output_signature instead
W0827 09:33:01.223699 136109813311296 deprecation.py:50] From /usr/local/lib/python3.10/site-packages/tensorflow/python/util/deprecation.py:660: calling map_fn_v2 (from tensorflow.python.ops.map_fn) with dtype is deprecated and will be removed in a future version.
Instructions for updating:
Use fn_output_signature instead
2024-08-27 09:33:13.213882: W tensorflow/tsl/framework/cpu_allocator_impl.cc:83] Allocation of 201326592 exceeds 10% of free system memory.
2024-08-27 09:33:13.590387: W tensorflow/tsl/framework/cpu_allocator_impl.cc:83] Allocation of 201326592 exceeds 10% of free system memory.
2024-08-27 09:33:14.429463: W tensorflow/tsl/framework/cpu_allocator_impl.cc:83] Allocation of 209780736 exceeds 10% of free system memory.
2024-08-27 09:33:14.468940: W tensorflow/tsl/framework/cpu_allocator_impl.cc:83] Allocation of 209780736 exceeds 10% of free system memory.
2024-08-27 09:33:14.759357: W tensorflow/tsl/framework/cpu_allocator_impl.cc:83] Allocation of 201326592 exceeds 10% of free system memory.

Screenshot 2024-08-27 130729

FatemehAnvari avatar Aug 27 '24 09:08 FatemehAnvari

Could you update the repository so that it is compatible with newer versions of the installed packages?

FatemehAnvari avatar Aug 27 '24 09:08 FatemehAnvari

Due to current workload constraints, I'm unable to update the repository for the latest TensorFlow version at this time. The repository is provided in its current state. However, I'm committed to assisting you with any questions or issues you may encounter. Before starting experiments with FlipReID, please verify that your environment is set up correctly. A straightforward way to do this is to try a simpler example like MNIST. This will help confirm that your GPU is working and TensorFlow is installed properly. Once your environment is ready, you have two options:

  • Use the recommended environment. The repository should work out of the box. Keep in mind that older TensorFlow versions may have compatibility limitations with newer GPUs due to dependencies like CUDA.
  • Update the repository and use the latest TensorFlow. This shouldn't be too complicated. However, you might not be able to load the pre-trained weights I've provided. If this is the case, you can train the models from scratch.
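
Before a full MNIST run, a lighter check is to ask TensorFlow itself which devices it sees. A PyTorch check such as torch.cuda.is_available() says nothing about whether TensorFlow can use the GPU. The sketch below assumes TensorFlow 2.x running in eager mode; `tensorflow_gpu_report` is a hypothetical helper name.

```python
try:
    import tensorflow as tf
except ImportError:  # TensorFlow is not installed in this interpreter
    tf = None


def tensorflow_gpu_report():
    """Return TensorFlow's view of the GPUs, or None if TensorFlow is missing."""
    if tf is None:
        return None
    gpus = tf.config.list_physical_devices("GPU")
    # Run a small matrix product and record the device TensorFlow placed it on.
    product = tf.matmul(
        tf.random.uniform((256, 256)), tf.random.uniform((256, 256))
    )
    return {
        "tensorflow_version": tf.__version__,
        "gpu_names": [gpu.name for gpu in gpus],
        "matmul_device": product.device,
    }
```

An empty `gpu_names` list, or a `matmul_device` string that names a CPU, would point to an installation problem (for example a CUDA mismatch) rather than to FlipReID itself.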

nixingyang avatar Aug 27 '24 10:08 nixingyang

Hello, dear engineer. I am back with a new question. I managed to run the code as far as training A and training B, but when saving the best model I get the following error:

Epoch 1: test_cosine_False_mAP_score improved from -inf to 0.03133, saving model to /content/FlipReID/output/Market1501_resnet50/training_model.h5
/usr/local/lib/python3.10/dist-packages/keras/src/engine/training.py:3103: UserWarning: You are saving your model as an HDF5 file via model.save(). This file format is considered legacy. We recommend using instead the native Keras format, e.g. model.save('my_model.keras').
  saving_api.save_model(
Traceback (most recent call last):
  File "/content/FlipReID/solution.py", line 925, in <module>
    app.run(main)
  File "/usr/local/lib/python3.10/dist-packages/absl/app.py", line 308, in run
    _run_main(main, args)
  File "/usr/local/lib/python3.10/dist-packages/absl/app.py", line 254, in _run_main
    sys.exit(main(argv))
  File "/content/FlipReID/solution.py", line 903, in main
    training_model.fit(x=train_generator,
  File "/usr/local/lib/python3.10/dist-packages/keras/src/utils/traceback_utils.py", line 70, in error_handler
    raise e.with_traceback(filtered_tb) from None
  File "/usr/lib/python3.10/json/encoder.py", line 199, in encode
    chunks = self.iterencode(o, _one_shot=True)
  File "/usr/lib/python3.10/json/encoder.py", line 257, in iterencode
    return _iterencode(o, 0)
ValueError: Unable to serialize VariableSpec(shape=(), dtype=tf.float32, trainable=True, alias_id=None) to JSON, because the TypeSpec class <class 'tensorflow.python.ops.resource_variable_ops.VariableSpec'> has not been registered.

FatemehAnvari avatar Aug 31 '24 09:08 FatemehAnvari

  • You can follow the instructions in the log and try model.save('my_model.keras') instead of model.save().
  • You can search for TypeSpec class <class 'tensorflow.python.ops.resource_variable_ops.VariableSpec'> has not been registered in the TensorFlow repository. This issue is again due to the newer version of TensorFlow.
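
A different workaround, not suggested in this thread, is to save only the weights, which skips writing the model configuration as JSON. Whether this avoids the VariableSpec error for FlipReID's training model is an assumption and has not been verified here. The toy model and the `weights_round_trip` helper below are hypothetical and only illustrate the save and reload calls.

```python
import os
import tempfile

try:
    import tensorflow as tf
except ImportError:  # TensorFlow is not installed in this interpreter
    tf = None


def build_toy_model():
    """Build a tiny stand-in model; the real training model is far larger."""
    return tf.keras.Sequential(
        [tf.keras.Input(shape=(4,)), tf.keras.layers.Dense(2, name="head")]
    )


def weights_round_trip():
    """Save weights only, reload them into a fresh model, and compare."""
    if tf is None:
        return None
    model = build_toy_model()
    restored = build_toy_model()
    with tempfile.TemporaryDirectory() as folder:
        # Only the weight values are written, not the architecture.
        path = os.path.join(folder, "toy.weights.h5")
        model.save_weights(path)
        restored.load_weights(path)
    pairs = zip(model.get_weights(), restored.get_weights())
    return all((a == b).all() for a, b in pairs)
```

If the repository saves the best model through a tf.keras.callbacks.ModelCheckpoint callback (an assumption about solution.py), passing `save_weights_only=True` to that callback would have the same effect. The architecture then has to be rebuilt in code before the weights are loaded.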

nixingyang avatar Aug 31 '24 11:08 nixingyang