Hi, I get an error message like this:
2022-10-12 15:43:57.254005: W tensorflow/core/grappler/optimizers/data/slack.cc:103] Could not find a final `prefetch` in the input pipeline to which to introduce slack.
I1012 15:43:57.996680 140468541171456 api.py:459] train_step begins...
I1012 15:44:07.279798 140468532778752 api.py:459] train_step begins...
INFO:tensorflow:batch_all_reduce: 369 all-reduces with algorithm = nccl, num_packs = 1
I1012 15:44:10.852259 140499206152832 cross_device_ops.py:897] batch_all_reduce: 369 all-reduces with algorithm = nccl, num_packs = 1
I1012 15:44:17.169317 140468541171456 api.py:446] Trainable variables:
I1012 15:44:17.426999 140468541171456 api.py:446] vit/stem_conv/kernel:0 (16, 16, 3, 768)
I1012 15:44:17.432081 140468541171456 api.py:446] vit/stem_conv/bias:0 (768,)
I1012 15:44:17.436969 140468541171456 api.py:446] vit/stem_ln/gamma:0 (768,)
....
INFO:tensorflow:batch_all_reduce: 369 all-reduces with algorithm = nccl, num_packs = 1
I1012 15:44:31.484436 140499206152832 cross_device_ops.py:897] batch_all_reduce: 369 all-reduces with algorithm = nccl, num_packs = 1
I1012 15:44:37.695064 140468532778752 api.py:459] train_step ends...
I1012 15:44:38.920633 140468541171456 api.py:459] train_step ends...
2022-10-12 15:45:08.671253: W tensorflow/core/framework/op_kernel.cc:1768] UNKNOWN: KeyError: 351529
Traceback (most recent call last):
  File "/root/anaconda3/envs/pix2seq/lib/python3.9/site-packages/tensorflow/python/ops/script_ops.py", line 271, in __call__
    ret = func(*args)
  File "/root/anaconda3/envs/pix2seq/lib/python3.9/site-packages/tensorflow/python/autograph/impl/api.py", line 642, in wrapper
    return func(*args, **kwargs)
  File "/tmp/__autograph_generated_filecefzj46v.py", line 22, in get_area
    retval__1 = ag__.converted_call(ag__.ld(np).asarray, ([ag__.ld(id_to_ann)[ag__.ld(i)]['area'] for i in ag__.ld(ids)],), dict(dtype=ag__.ld(np).float32), fscope_1)
  File "/tmp/__autograph_generated_filecefzj46v.py", line 22, in <listcomp>
KeyError: 351529

2022-10-12 15:45:08.671413: W tensorflow/core/framework/op_kernel.cc:1768] UNKNOWN: KeyError: 415619
Traceback (most recent call last):
  File "/root/anaconda3/envs/pix2seq/lib/python3.9/site-packages/tensorflow/python/ops/script_ops.py", line 271, in __call__
    ret = func(*args)
  File "/root/anaconda3/envs/pix2seq/lib/python3.9/site-packages/tensorflow/python/autograph/impl/api.py", line 642, in wrapper
    return func(*args, **kwargs)
  File "/tmp/__autograph_generated_filecefzj46v.py", line 22, in get_area
    retval__1 = ag__.converted_call(ag__.ld(np).asarray, ([ag__.ld(id_to_ann)[ag__.ld(i)]['area'] for i in ag__.ld(ids)],), dict(dtype=ag__.ld(np).float32), fscope_1)
  File "/tmp/__autograph_generated_filecefzj46v.py", line 22, in <listcomp>
KeyError: 415619
My GPUs are 2 × RTX 3070 with 8 GB each.
Is the GPU memory too small?
This looks like a data issue, since the complaint is a KeyError, probably related to an image id.
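If it helps to narrow this down, here is a rough sanity check that the failing ids actually exist in the COCO annotation json (the traceback shows `get_area` indexing `id_to_ann[i]['area']`, so the ids being looked up should come from that file). The path and filename below are placeholders; point them at whatever you pass via `--config.dataset.coco_annotations_dir`:

```python
import json

# Rough check: do the ids from the KeyErrors exist in the annotation json?
# Path/filename are assumptions for illustration, not the pix2seq defaults.
with open("/path/to/annotations/instances_train2017.json") as f:
    coco = json.load(f)

ann_ids = {ann["id"] for ann in coco["annotations"]}
img_ids = {img["id"] for img in coco["images"]}

for failing_id in (351529, 415619):
    print(failing_id,
          "in annotation ids" if failing_id in ann_ids else "NOT in annotation ids",
          "| in image ids" if failing_id in img_ids else "| NOT in image ids")
```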
It was an annotations error; I reloaded the annotations to fix it. But now I get a new error like this:
W1018 09:27:13.350448 139820555069248 checkpoint.py:213] Value in checkpoint could not be found in the restored object: (root).optimizer's state 'v' for (root).model.decoder.decoder.dec_layers.5.mlp.mlp_layers.0.dense1.bias
WARNING:tensorflow:Value in checkpoint could not be found in the restored object: (root).optimizer's state 'v' for (root).model.decoder.decoder.dec_layers.5.mlp.mlp_layers.0.dense2.kernel
W1018 09:27:13.350490 139820555069248 checkpoint.py:213] Value in checkpoint could not be found in the restored object: (root).optimizer's state 'v' for (root).model.decoder.decoder.dec_layers.5.mlp.mlp_layers.0.dense2.kernel
WARNING:tensorflow:Value in checkpoint could not be found in the restored object: (root).optimizer's state 'v' for (root).model.decoder.decoder.dec_layers.5.mlp.mlp_layers.0.dense2.bias
W1018 09:27:13.350531 139820555069248 checkpoint.py:213] Value in checkpoint could not be found in the restored object: (root).optimizer's state 'v' for (root).model.decoder.decoder.dec_layers.5.mlp.mlp_layers.0.dense2.bias
My tf==2.10.0
This looks like the specified checkpoint (either the pretrained checkpoint, or a checkpoint restored from a previous training run in the same model directory) does not match the configured architecture/encoder. Please check whether the architecture/encoder variant, depth, dim, etc. match.
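One quick way to compare is to list what the checkpoint actually contains and check the variable names and shapes against the configured model. A minimal sketch, assuming the checkpoint prefix lives in your model_dir (the path is a placeholder):

```python
import tensorflow as tf

# List every variable stored in the checkpoint so its names/shapes can be
# compared against the configured architecture.
ckpt_path = tf.train.latest_checkpoint("/path/to/model_dir")
if ckpt_path is None:
    raise SystemExit("No checkpoint found in model_dir")
for name, shape in tf.train.list_variables(ckpt_path):
    print(name, shape)
```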
I git cloned the repo and did not change anything. Which version of TF are you using? I put the Objects365 checkpoint into model_dir, and the command is:
python3 run.py --mode=train --model_dir=/data/c/Objects365-vitb-640/ --config=configs/config_det_finetune.py --config.dataset.data_dir=/data/c/pix2seq --config.dataset.coco_annotations_dir=/data/c/annotations --config.train.batch_size=8 --config.train.epochs=20 --config.optimization.learning_rate=3e-5
but I got the above error. config.dataset.data_dir points to my offline COCO TFDS.
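For reference, here is roughly how I check that the offline TFDS build loads from that data_dir (the dataset name/version are assumptions on my side):

```python
import tensorflow_datasets as tfds

# Load one example from the offline build to confirm data_dir is picked up
# and to eyeball the image id that the pipeline will look up.
ds = tfds.load("coco/2017", split="train", data_dir="/data/c/pix2seq")
for example in ds.take(1):
    print(list(example.keys()))
    print("image/id:", int(example["image/id"]))
```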
By the way, I wonder if this is wrong:

Well, I changed the code in model.py:
latest_ckpt, ckpt, self._verify_restored = utils.restore_from_checkpoint(
    model_dir, False, model=model, global_step=optimizer.iterations,
    optimizer=optimizer)
switching the second argument from False to True, i.e. using
checkpoint.restore(latest_ckpt).expect_partial()
which avoids the error. But I am still confused about why that works.
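My understanding of what that switch amounts to, as a sketch with generic tf.train.Checkpoint calls rather than the exact pix2seq utils.restore_from_checkpoint internals:

```python
import tensorflow as tf

# Generic restore example: expect_partial() tells TF not to warn about
# checkpoint values (e.g. the Adam slot 'v' entries above) that have no
# matching variable in the restored object graph.
model = tf.keras.Sequential([tf.keras.layers.Dense(4)])
optimizer = tf.keras.optimizers.Adam()
ckpt = tf.train.Checkpoint(model=model, optimizer=optimizer)

latest_ckpt = tf.train.latest_checkpoint("/data/c/Objects365-vitb-640")
if latest_ckpt:
    # Strict restore keeps the returned status object and can assert that
    # everything matched (e.g. status.assert_existing_objects_matched()).
    # Lenient restore silences the "Value in checkpoint could not be found
    # in the restored object" warnings:
    ckpt.restore(latest_ckpt).expect_partial()
```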
@chentingpc Hi, do you know how to debug inside strategy.run(...) in the train_multiple_steps function? I cannot step into the train_step function.
You should be able to use pdb in the code when running in eager mode.
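A sketch of what that looks like in practice (the TF calls below are standard; where exactly you place the breakpoint inside pix2seq's training loop is up to you):

```python
import tensorflow as tf

# Run @tf.function bodies eagerly so a pdb breakpoint inside train_step is
# hit even when the step is dispatched through strategy.run(...).
tf.config.run_functions_eagerly(True)

strategy = tf.distribute.MirroredStrategy()

@tf.function
def train_step(x):
    # import pdb; pdb.set_trace()  # uncomment to stop here in eager mode
    return x * 2

result = strategy.run(train_step, args=(tf.constant(1.0),))
print(result)
```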