sltunet
Loading pretrained model error
Hey, I'm back again! In Step 3 (Train SLTUnet Model), I moved the required files into the two folders referenced in train.sh and ran train.sh. When the code reached the step that loads the pretrained model, I got the warnings below:
INFO:tensorflow:Trying restore pretrained parameters
WARNING:tensorflow:No Existing Model detected
INFO:tensorflow:Trying restore existing parameters
WARNING:tensorflow:No Existing Model detected
How can I load the pretrained model? Is the pretrained model the one trained in Step 2? Thanks! This is my train.sh:
data=preprocessed-corpus/
feature=smkd-sign-features/
python3 run.py --mode train --parameters=\
hidden_size=256,embed_size=256,filter_size=4096,\
sep_layer=0,num_encoder_layer=6,num_decoder_layer=6,\
ctc_enable=True,ctc_alpha=0.3,ctc_repeated=True,\
src_bpe_dropout=0.2,tgt_bpe_dropout=0.2,bpe_dropout_stochastic_rate=0.6,\
initializer="uniform_unit_scaling",initializer_gain=0.5,\
dropout=0.3,label_smooth=0.1,attention_dropout=0.3,relu_dropout=0.5,residual_dropout=0.4,\
max_len=256,max_img_len=512,batch_size=80,eval_batch_size=32,\
token_size=1600,batch_or_token='token',beam_size=8,remove_bpe=True,decode_alpha=1.0,\
scope_name="transformer",buffer_size=50000,data_leak_ratio=0.1,\
img_feature_size=1024,img_aug_size=11,\
clip_grad_norm=0.0,\
num_heads=4,\
process_num=2,\
lrate=1.0,\
estop_patience=100,\
warmup_steps=4000,\
epoches=5000,\
update_cycle=16,\
gpus=[0],\
disp_freq=1,\
eval_freq=500,\
sample_freq=100,\
checkpoints=5,\
best_checkpoints=10,\
max_training_steps=30000,\
nthreads=8,\
beta1=0.9,\
beta2=0.998,\
random_seed=1234,\
src_codes="$data/ende.bpe",tgt_codes="$data/ende.bpe",\
src_vocab_file="$data/vocab.zero.drop",\
tgt_vocab_file="$data/vocab.zero.drop",\
img_train_file="$feature/train.h5",\
src_train_file="$data/train.bpe.en.shuf",\
tgt_train_file="$data/train.bpe.de.shuf",\
img_dev_file="$feature/dev.h5",\
src_dev_file="$data/dev.bpe.en",\
tgt_dev_file="$data/dev.bpe.de",\
img_test_file="$feature/test.h5",\
src_test_file="$data/test.bpe.en",\
tgt_test_file="$data/test.bpe.de",\
output_dir="train",\
test_output="",\
shared_source_target_embedding=True,\
Hey, the logging information is a little confusing here.
The pretrained model here doesn't mean the pretrained sign embeddings but a pretrained SLT model, so these warnings are normal and not a problem. More details are below:
INFO:tensorflow:Trying restore pretrained parameters
WARNING:tensorflow:No Existing Model detected
It tries to restore a separately pretrained SLT model, e.g. pretrained encoders or decoders, which we never used.
INFO:tensorflow:Trying restore existing parameters
WARNING:tensorflow:No Existing Model detected
It tries to restore from the existing working directory, i.e. output_dir. If your job gets interrupted, it should resume training from that directory.
Oh, I'm sorry, loading the pretrained model may not be the real problem. The actual error seems to be related to the h5 file.
Traceback (most recent call last):
File "/data1/wanjiarui/anaconda3/envs/slt/lib/python3.6/multiprocessing/process.py", line 258, in _bootstrap
self.run()
File "/data1/wanjiarui/sltunet/utils/queuer.py", line 125, in run
for data_chunk in self._data_chunk_iterable:
File "/data1/wanjiarui/sltunet/data.py", line 201, in batcher
for data in _handle_buffer(buffer):
File "/data1/wanjiarui/sltunet/data.py", line 184, in _handle_buffer
x, s, t, m, mask, spar, img_idx = self.to_matrix(batch, train)
File "/data1/wanjiarui/sltunet/data.py", line 136, in to_matrix
new_image = self.img_reader[img_key][()]
File "h5py/_objects.pyx", line 54, in h5py._objects.with_phil.wrapper
File "h5py/_objects.pyx", line 55, in h5py._objects.with_phil.wrapper
File "/data1/wanjiarui/anaconda3/envs/slt/lib/python3.6/site-packages/h5py/_hl/group.py", line 264, in __getitem__
oid = h5o.open(self.id, self._e(name), lapl=self._lapl)
File "h5py/_objects.pyx", line 54, in h5py._objects.with_phil.wrapper
File "h5py/_objects.pyx", line 55, in h5py._objects.with_phil.wrapper
File "h5py/h5o.pyx", line 190, in h5py.h5o.open
KeyError: "Unable to open object (object '5142_8' doesn't exist)"
When I ran Step 2, command 4, I got dev.h5, test.h5, train.h5 and train_(0-9).h5 in smkd/features. I then combined the different training features and moved dev/test/train.h5 to smkd-sign-features/, the path set in train.sh. Finally, I ran train.sh with the command below from the root path sltunet/ and got the KeyError above. I have run Step 2, command 4 and combined train.h5 twice, and both attempts ended in the same KeyError. How can I track down the problem? I'm really confused. Thanks!
sh example/train.sh
This could be checked by inspecting the source train file (src_train_file) and the resulting train.h5.
Could you please show a few lines from your train file, and also read train.h5 with h5py and check its keys? There might be some mismatch.
After I ran the command below, I got dev.h5, test.h5, train.h5 and train_(0-9).h5 in sltunet/smkd/features:
python main.py --load-weights avg/average.pt --phase features --device 0 --num-feature-aug 10 --work-dir exp/resnet34 --config baseline.yaml
This is my sign_feature_cmb.py file. Should I combine train.h5 and train_(0-9).h5 into a new h5 file, or only combine train_(0-9).h5? I guess the line writer = h5py.File('train.h5', 'w') may overwrite train.h5, because I ran this Python script in the same directory as those h5 files, so I ended up losing some data.
import sys
import glob
import h5py
files = glob.glob(sys.argv[1])
print(files)
writer = h5py.File('train.h5', 'w')
for i, f in enumerate(files):
    reader = h5py.File(f, 'r')
    for key in list(reader.keys()):
        writer.create_dataset("%s_%s" % (key, i), data=reader[key][()])
    reader.close()
writer.close()
Could you please list some keys from your train.h5? For example, 5142_8 is missing according to the error, so could you check which keys for 5142 are present in your training data?
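A minimal sketch of such a check, assuming the keys follow the '<sample>_<index>' pattern that sign_feature_cmb.py writes (e.g. 5142_8). The function names are hypothetical and not part of the repository, and the expected number of suffixes per sample (num_aug) is an assumption you need to set to match your own setup.

```python
from collections import defaultdict


def group_keys(keys):
    """Map each sample id to the sorted suffix indices found for it.

    Assumes keys look like '<sample>_<index>' (e.g. '5142_8');
    keys without a numeric suffix are skipped.
    """
    groups = defaultdict(list)
    for key in keys:
        sample, sep, aug = key.rpartition("_")
        if sep and aug.isdigit():
            groups[sample].append(int(aug))
    return {sample: sorted(augs) for sample, augs in groups.items()}


def missing_keys(keys, num_aug):
    """List the '<sample>_<index>' keys that are absent for index in range(num_aug)."""
    missing = []
    for sample, augs in sorted(group_keys(keys).items()):
        missing += ["%s_%d" % (sample, a) for a in range(num_aug) if a not in augs]
    return missing


def inspect_h5(path, sample, num_aug):
    """Print a short summary of the keys stored in an h5 file."""
    import h5py  # imported here so the helpers above work without h5py

    with h5py.File(path, "r") as reader:
        keys = list(reader.keys())
    print("total keys:", len(keys))
    print("suffixes stored for %s:" % sample, group_keys(keys).get(sample, []))
    print("first missing keys:", missing_keys(keys, num_aug)[:10])


# Example call (the path and num_aug are placeholders, adjust to your setup):
# inspect_h5("smkd-sign-features/train.h5", "5142", num_aug=11)
```

If a sample such as 5142 shows fewer suffixes than expected, the combined file is probably incomplete rather than the training text being wrong.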
I have solved this error. It happened because I ran sign_feature_cmb.py in the same directory as train.h5 and train_(0-9).h5. My directory layout is shown below.
When the script reaches writer = h5py.File('train.h5', 'w'), it opens train.h5 in write mode. If a train.h5 already exists in the current directory, it is truncated before the new content is written.
I changed the line to writer = h5py.File('train123.h5', 'w'). After the script finished, I moved the file to the right path and renamed it to train.h5.
My script path is:
smkd/features
├── dev.h5
├── test.h5
└── train
    ├── sign_feature_cmb.py
    ├── train_0.h5
    ├── train_1.h5
    ├── train_2.h5
    ├── train_3.h5
    ├── train_4.h5
    ├── train_5.h5
    ├── train_6.h5
    ├── train_7.h5
    ├── train_8.h5
    ├── train_9.h5
    └── train.h5
1 directory, 12 files
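For anyone hitting the same issue, here is a rough sketch of a variant of the script that cannot overwrite one of its own inputs. It is not the repository's script: the function names and the two-argument interface are my own assumptions. It keeps the same '<key>_<file index>' naming, but sorts the inputs (the original uses glob order) and drops the output file from the input list. Whether the original train.h5 should be one of the inputs depends on the pattern you pass; this sketch does not decide that.

```python
import glob
import os


def select_inputs(pattern, output_path):
    """Return the sorted files matching pattern, excluding the output file."""
    out = os.path.abspath(output_path)
    return sorted(f for f in glob.glob(pattern) if os.path.abspath(f) != out)


def combine(pattern, output_path):
    """Merge all matching h5 files into output_path."""
    import h5py  # imported lazily; only needed for the actual merge

    files = select_inputs(pattern, output_path)
    print(files)
    with h5py.File(output_path, "w") as writer:
        for i, f in enumerate(files):
            with h5py.File(f, "r") as reader:
                for key in reader.keys():
                    # Same key scheme as the original script: '<key>_<file index>'.
                    writer.create_dataset("%s_%s" % (key, i), data=reader[key][()])


# Example call (file names are placeholders):
# combine("train*h5", "train_combined.h5")
```

Writing to a separate name such as train_combined.h5 and renaming it afterwards is the same workaround as above, just built into the script.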
When I follow the instruction below from sltunet/example, I can't get a correctly combined train.h5, because Step 2 (extract sign features) leaves me with the flat directory listed after the command.
python sign_feature_cmb.py train\*h5
Directory after extraction:
smkd/features
├── dev.h5
├── test.h5
├── train_0.h5
├── train_1.h5
├── train_2.h5
├── train_3.h5
├── train_4.h5
├── train_5.h5
├── train_6.h5
├── train_7.h5
├── train_8.h5
├── train_9.h5
└── train.h5