
torch dataloader may be corrupting .npz files?

NickleDave opened this issue · 0 comments

Occasionally I get an error when running `vak predict` about a bad CRC-32 for file `'s.npy'`.

I have noticed that this happens with a dataset on which I have already run `vak predict` multiple times (e.g., while repeatedly testing for an unrelated bug), and that once the error occurs, I cannot run `vak predict` on that dataset again without hitting it.

I suspect this may be caused by `torch.utils.data.DataLoader` somehow corrupting the .npz file, perhaps because it uses multiprocessing and the file is not closed correctly in a worker process?
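One thing worth noting: `np.load` on an .npz file returns a lazy `NpzFile` that keeps the underlying zip archive open, and arrays are only read from disk when accessed. If a `DataLoader` worker inherits or mishandles that open handle, reads can misbehave. A defensive sketch (the function name and paths here are hypothetical, not vak's actual API) is to copy the arrays out eagerly inside a context manager so the file handle is closed before any multiprocessing happens:

```python
import numpy as np

def load_spect_eagerly(path):
    """Load all arrays from an .npz file into memory and close it immediately.

    np.load on an .npz is lazy: it keeps the zip archive open and only
    decompresses each member array on access. Copying every array here
    ensures no open file handle outlives this call.
    """
    with np.load(path) as npz:
        return {key: npz[key].copy() for key in npz.files}

# usage sketch (path is made up):
# spect_dict = load_spect_eagerly("some_spectrogram.npz")
# spect = spect_dict["s"]
```

Another quick way to test the multiprocessing hypothesis would be to run the same predict with the `DataLoader` set to `num_workers=0`, which keeps all loading in the main process.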

Here's a complete traceback from one occurrence:

```
$ vak predict pk92r45_predict_190511_v2.toml
Logging results to /home/pimienta/Documents/data/vocal/avani-data/pk92r45_MMAN/vak_outputs/predict
loading SpectScaler from path: /home/pimienta/Documents/data/vocal/avani-data/pk92r45_MMAN/vaktrain/results_211108_134532/StandardizeSpect
loading labelmap from path: /home/pimienta/Documents/data/vocal/avani-data/pk92r45_MMAN/vaktrain/results_211108_134532/labelmap.json
loading dataset to predict from csv path: /home/pimienta/Documents/data/vocal/avani-data/pk92r45_MMAN/vak_outputs/predict/190511_v2_prep_211125_210922.csv
will save annotations in .csv file: /home/pimienta/Documents/data/vocal/avani-data/pk92r45_MMAN/vak_outputs/pk92r45_190511.csv
dataset has timebins with duration: 0.002
shape of input to networks used for predictions: torch.Size([1, 152, 88])
instantiating models from model-config map:/n{'TweetyNet': {'optimizer': {'lr': 0.001}, 'network': {}, 'loss': {}, 'metrics': {}}}
loading checkpoint for TweetyNet from path: /home/pimienta/Documents/data/vocal/avani-data/pk92r45_MMAN/vaktrain/results_211108_134532/TweetyNet/checkpoints/max-val-acc-checkpoint.pt
Loading checkpoint from:
/home/pimienta/Documents/data/vocal/avani-data/pk92r45_MMAN/vaktrain/results_211108_134532/TweetyNet/checkpoints/max-val-acc-checkpoint.pt
running predict method of TweetyNet
batch 301 / 557:  54%|███████████████████████████████████████████████████████████                                                  | 302/557 [00:59<00:50,  5.04it/s]
Traceback (most recent call last):
  File "/home/pimienta/anaconda3/envs/vak040b3/bin/vak", line 8, in <module>
    sys.exit(main())
  File "/home/pimienta/anaconda3/envs/vak040b3/lib/python3.8/site-packages/vak/__main__.py", line 45, in main
    cli.cli(command=args.command, config_file=args.configfile)
  File "/home/pimienta/anaconda3/envs/vak040b3/lib/python3.8/site-packages/vak/cli/cli.py", line 30, in cli
    COMMAND_FUNCTION_MAP[command](toml_path=config_file)
  File "/home/pimienta/anaconda3/envs/vak040b3/lib/python3.8/site-packages/vak/cli/predict.py", line 42, in predict
    core.predict(
  File "/home/pimienta/anaconda3/envs/vak040b3/lib/python3.8/site-packages/vak/core/predict.py", line 227, in predict
    pred_dict = model.predict(pred_data=pred_data, device=device)
  File "/home/pimienta/anaconda3/envs/vak040b3/lib/python3.8/site-packages/vak/engine/model.py", line 478, in predict
    return self._predict(pred_data)
  File "/home/pimienta/anaconda3/envs/vak040b3/lib/python3.8/site-packages/vak/engine/model.py", line 347, in _predict
    for ind, batch in enumerate(progress_bar):
  File "/home/pimienta/anaconda3/envs/vak040b3/lib/python3.8/site-packages/tqdm/std.py", line 1178, in __iter__
    for obj in iterable:
  File "/home/pimienta/anaconda3/envs/vak040b3/lib/python3.8/site-packages/torch/utils/data/dataloader.py", line 517, in __next__
    data = self._next_data()
  File "/home/pimienta/anaconda3/envs/vak040b3/lib/python3.8/site-packages/torch/utils/data/dataloader.py", line 1199, in _next_data
    return self._process_data(data)
  File "/home/pimienta/anaconda3/envs/vak040b3/lib/python3.8/site-packages/torch/utils/data/dataloader.py", line 1225, in _process_data
    data.reraise()
  File "/home/pimienta/anaconda3/envs/vak040b3/lib/python3.8/site-packages/torch/_utils.py", line 429, in reraise
    raise self.exc_type(msg)
zipfile.BadZipFile: Caught BadZipFile in DataLoader worker process 2.
Original Traceback (most recent call last):
  File "/home/pimienta/anaconda3/envs/vak040b3/lib/python3.8/site-packages/torch/utils/data/_utils/worker.py", line 202, in _worker_loop
    data = fetcher.fetch(index)
  File "/home/pimienta/anaconda3/envs/vak040b3/lib/python3.8/site-packages/torch/utils/data/_utils/fetch.py", line 44, in fetch
    data = [self.dataset[idx] for idx in possibly_batched_index]
  File "/home/pimienta/anaconda3/envs/vak040b3/lib/python3.8/site-packages/torch/utils/data/_utils/fetch.py", line 44, in <listcomp>
    data = [self.dataset[idx] for idx in possibly_batched_index]
  File "/home/pimienta/anaconda3/envs/vak040b3/lib/python3.8/site-packages/vak/datasets/vocal_dataset.py", line 75, in __getitem__
    spect = spect_dict[self.spect_key]
  File "/home/pimienta/anaconda3/envs/vak040b3/lib/python3.8/site-packages/numpy/lib/npyio.py", line 253, in __getitem__
    return format.read_array(bytes,
  File "/home/pimienta/anaconda3/envs/vak040b3/lib/python3.8/site-packages/numpy/lib/format.py", line 763, in read_array
    data = _read_bytes(fp, read_size, "array data")
  File "/home/pimienta/anaconda3/envs/vak040b3/lib/python3.8/site-packages/numpy/lib/format.py", line 892, in _read_bytes
    r = fp.read(size - len(data))
  File "/home/pimienta/anaconda3/envs/vak040b3/lib/python3.8/zipfile.py", line 940, in read
    data = self._read1(n)
  File "/home/pimienta/anaconda3/envs/vak040b3/lib/python3.8/zipfile.py", line 1030, in _read1
    self._update_crc(data)
  File "/home/pimienta/anaconda3/envs/vak040b3/lib/python3.8/zipfile.py", line 958, in _update_crc
    raise BadZipFile("Bad CRC-32 for file %r" % self.name)
zipfile.BadZipFile: Bad CRC-32 for file 's.npy'
```
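Since an .npz file is just a zip archive, the stdlib `zipfile` module can check whether a given file on disk is actually corrupt, independent of the DataLoader. A minimal sketch (the function name is my own, not part of vak):

```python
import zipfile

def check_npz(path):
    """Check every member of an .npz (zip) archive against its stored CRC-32.

    ZipFile.testzip reads all members and returns the name of the first
    member with a bad CRC or header, or None if the archive is intact.
    """
    with zipfile.ZipFile(path) as zf:
        return zf.testzip()
```

Running this over the prepped dataset before and after a `vak predict` run would distinguish "the file was already corrupt on disk" from "something corrupts it during prediction".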

NickleDave · Nov 26 '21