
segtrain fails in v4.1.2

Open sven-nm opened this issue 2 years ago • 6 comments

Hi @mittagessen

Just pinpointing a little oddity here:

Running the following on 4.1.2 and torch 1.11 (note that ketos_sample actually contains the Italian subdir from the BiblIA dataset):

from kraken.lib.train import SegmentationModel, KrakenTrainer
import glob

ground_truth = glob.glob('/Users/me/drive/ketos_sample/*.xml')
training_files = ground_truth[:3] # training data is shuffled internally
evaluation_files = ground_truth[3:]
model = SegmentationModel(training_data=training_files, evaluation_data=evaluation_files, format_type='xml')
trainer = KrakenTrainer()
trainer.fit(model)

yields:

Traceback (most recent call last):
  File "/Users/sven/.local/lib/python3.10/site-packages/IPython/core/interactiveshell.py", line 3398, in run_code
    exec(code_obj, self.user_global_ns, self.user_ns)
  File "<ipython-input-9-45d4afebefac>", line 1, in <cell line: 1>
    trainer.fit(model)
  File "/Users/sven/opt/anaconda3/envs/kraken/lib/python3.10/site-packages/kraken/lib/train.py", line 96, in fit
    super().fit(*args, **kwargs)
  File "/Users/sven/opt/anaconda3/envs/kraken/lib/python3.10/site-packages/pytorch_lightning/trainer/trainer.py", line 770, in fit
    self._call_and_handle_interrupt(
  File "/Users/sven/opt/anaconda3/envs/kraken/lib/python3.10/site-packages/pytorch_lightning/trainer/trainer.py", line 723, in _call_and_handle_interrupt
    return trainer_fn(*args, **kwargs)
  File "/Users/sven/opt/anaconda3/envs/kraken/lib/python3.10/site-packages/pytorch_lightning/trainer/trainer.py", line 811, in _fit_impl
    results = self._run(model, ckpt_path=self.ckpt_path)
  File "/Users/sven/opt/anaconda3/envs/kraken/lib/python3.10/site-packages/pytorch_lightning/trainer/trainer.py", line 1236, in _run
    results = self._run_stage()
  File "/Users/sven/opt/anaconda3/envs/kraken/lib/python3.10/site-packages/pytorch_lightning/trainer/trainer.py", line 1323, in _run_stage
    return self._run_train()
  File "/Users/sven/opt/anaconda3/envs/kraken/lib/python3.10/site-packages/pytorch_lightning/trainer/trainer.py", line 1353, in _run_train
    self.fit_loop.run()
  File "/Users/sven/opt/anaconda3/envs/kraken/lib/python3.10/site-packages/pytorch_lightning/loops/base.py", line 204, in run
    self.advance(*args, **kwargs)
  File "/Users/sven/opt/anaconda3/envs/kraken/lib/python3.10/site-packages/pytorch_lightning/loops/fit_loop.py", line 266, in advance
    self._outputs = self.epoch_loop.run(self._data_fetcher)
  File "/Users/sven/opt/anaconda3/envs/kraken/lib/python3.10/site-packages/pytorch_lightning/loops/base.py", line 205, in run
    self.on_advance_end()
  File "/Users/sven/opt/anaconda3/envs/kraken/lib/python3.10/site-packages/pytorch_lightning/loops/epoch/training_epoch_loop.py", line 255, in on_advance_end
    self._run_validation()
  File "/Users/sven/opt/anaconda3/envs/kraken/lib/python3.10/site-packages/pytorch_lightning/loops/epoch/training_epoch_loop.py", line 311, in _run_validation
    self.val_loop.run()
  File "/Users/sven/opt/anaconda3/envs/kraken/lib/python3.10/site-packages/pytorch_lightning/loops/base.py", line 211, in run
    output = self.on_run_end()
  File "/Users/sven/opt/anaconda3/envs/kraken/lib/python3.10/site-packages/pytorch_lightning/loops/dataloader/evaluation_loop.py", line 201, in on_run_end
    self._on_evaluation_end()
  File "/Users/sven/opt/anaconda3/envs/kraken/lib/python3.10/site-packages/pytorch_lightning/loops/dataloader/evaluation_loop.py", line 277, in _on_evaluation_end
    self.trainer._call_callback_hooks("on_validation_end", *args, **kwargs)
  File "/Users/sven/opt/anaconda3/envs/kraken/lib/python3.10/site-packages/pytorch_lightning/trainer/trainer.py", line 1636, in _call_callback_hooks
    fn(self, self.lightning_module, *args, **kwargs)
  File "/Users/sven/opt/anaconda3/envs/kraken/lib/python3.10/site-packages/kraken/lib/train.py", line 123, in on_validation_end
    trainer.model.nn.save_model(f'{trainer.model.output}_{trainer.current_epoch}.mlmodel')
  File "/Users/sven/opt/anaconda3/envs/kraken/lib/python3.10/site-packages/kraken/lib/vgsl.py", line 506, in save_model
    mlmodel = MLModel(net_builder.spec)
  File "/Users/sven/opt/anaconda3/envs/kraken/lib/python3.10/site-packages/coremltools/models/model.py", line 346, in __init__
    self.__proxy__, self._spec, self._framework_error = _get_proxy_and_spec(
  File "/Users/sven/opt/anaconda3/envs/kraken/lib/python3.10/site-packages/coremltools/models/model.py", line 123, in _get_proxy_and_spec
    specification = _load_spec(filename)
  File "/Users/sven/opt/anaconda3/envs/kraken/lib/python3.10/site-packages/coremltools/models/utils.py", line 210, in load_spec
    raise Exception(
Exception: Unable to load libmodelpackage. Cannot make save spec.

I think this might be due to a conflict between coremltools and torch 1.11, as you always get these warnings: "WARNING:root:Torch version 1.11.0 has not been tested with coremltools. You may run into unexpected errors. Torch 1.10.2 is the most recent version that has been tested." However, forcing kraken 4.1.2 to run on torch 1.10.2 also fails. See below:

Traceback (most recent call last):
  File "/Users/sven/opt/anaconda3/envs/kraken/lib/python3.9/code.py", line 90, in runcode
    exec(code, self.locals)
  File "<input>", line 1, in <module>
  File "/Applications/PyCharm.app/Contents/plugins/python/helpers/pydev/_pydev_bundle/pydev_import_hook.py", line 21, in do_import
    module = self._system_import(name, *args, **kwargs)
  File "/Users/sven/opt/anaconda3/envs/kraken/lib/python3.9/site-packages/kraken/lib/train.py", line 25, in <module>
    import pytorch_lightning as pl
  File "/Applications/PyCharm.app/Contents/plugins/python/helpers/pydev/_pydev_bundle/pydev_import_hook.py", line 21, in do_import
    module = self._system_import(name, *args, **kwargs)
  File "/Users/sven/opt/anaconda3/envs/kraken/lib/python3.9/site-packages/pytorch_lightning/__init__.py", line 30, in <module>
    from pytorch_lightning.callbacks import Callback  # noqa: E402
  File "/Applications/PyCharm.app/Contents/plugins/python/helpers/pydev/_pydev_bundle/pydev_import_hook.py", line 21, in do_import
    module = self._system_import(name, *args, **kwargs)
  File "/Users/sven/opt/anaconda3/envs/kraken/lib/python3.9/site-packages/pytorch_lightning/callbacks/__init__.py", line 26, in <module>
    from pytorch_lightning.callbacks.pruning import ModelPruning
  File "/Applications/PyCharm.app/Contents/plugins/python/helpers/pydev/_pydev_bundle/pydev_import_hook.py", line 21, in do_import
    module = self._system_import(name, *args, **kwargs)
  File "/Users/sven/opt/anaconda3/envs/kraken/lib/python3.9/site-packages/pytorch_lightning/callbacks/pruning.py", line 31, in <module>
    from pytorch_lightning.core.lightning import LightningModule
  File "/Applications/PyCharm.app/Contents/plugins/python/helpers/pydev/_pydev_bundle/pydev_import_hook.py", line 21, in do_import
    module = self._system_import(name, *args, **kwargs)
  File "/Users/sven/opt/anaconda3/envs/kraken/lib/python3.9/site-packages/pytorch_lightning/core/__init__.py", line 16, in <module>
    from pytorch_lightning.core.lightning import LightningModule
  File "/Applications/PyCharm.app/Contents/plugins/python/helpers/pydev/_pydev_bundle/pydev_import_hook.py", line 21, in do_import
    module = self._system_import(name, *args, **kwargs)
  File "/Users/sven/opt/anaconda3/envs/kraken/lib/python3.9/site-packages/pytorch_lightning/core/lightning.py", line 41, in <module>
    from pytorch_lightning.loggers import LightningLoggerBase, LoggerCollection
  File "/Applications/PyCharm.app/Contents/plugins/python/helpers/pydev/_pydev_bundle/pydev_import_hook.py", line 21, in do_import
    module = self._system_import(name, *args, **kwargs)
  File "/Users/sven/opt/anaconda3/envs/kraken/lib/python3.9/site-packages/pytorch_lightning/loggers/__init__.py", line 18, in <module>
    from pytorch_lightning.loggers.tensorboard import TensorBoardLogger
  File "/Applications/PyCharm.app/Contents/plugins/python/helpers/pydev/_pydev_bundle/pydev_import_hook.py", line 21, in do_import
    module = self._system_import(name, *args, **kwargs)
  File "/Users/sven/opt/anaconda3/envs/kraken/lib/python3.9/site-packages/pytorch_lightning/loggers/tensorboard.py", line 26, in <module>
    from torch.utils.tensorboard import SummaryWriter
  File "/Applications/PyCharm.app/Contents/plugins/python/helpers/pydev/_pydev_bundle/pydev_import_hook.py", line 21, in do_import
    module = self._system_import(name, *args, **kwargs)
  File "/Users/sven/opt/anaconda3/envs/kraken/lib/python3.9/site-packages/torch/utils/tensorboard/__init__.py", line 4, in <module>
    LooseVersion = distutils.version.LooseVersion
AttributeError: module 'distutils' has no attribute 'version'

sven-nm (Aug 01 '22, 16:08)

I should add that running exactly the same thing on Linux yields another error:

/scratch/sven/anaconda3/envs/kraken/lib/python3.9/site-packages/kraken/lib/dataset.py:1162: ShapelyDeprecationWarning: The array interface is deprecated and will no longer work in Shapely 2.0. Convert the '.coords' to a numpy array instead.
  im, target = self.transform(im, target)
/scratch/sven/anaconda3/envs/kraken/lib/python3.9/site-packages/kraken/lib/dataset.py:1210: ShapelyDeprecationWarning: __getitem__ for multi-part geometries is deprecated and will be removed in Shapely 2.0. Use the `geoms` property to access the constituent parts of a multi-part geometry.
  start_sep = np.array((split(shp_line, split_pt)[0].buffer(self.line_width,
/scratch/sven/anaconda3/envs/kraken/lib/python3.9/site-packages/kraken/lib/dataset.py:1217: ShapelyDeprecationWarning: __getitem__ for multi-part geometries is deprecated and will be removed in Shapely 2.0. Use the `geoms` property to access the constituent parts of a multi-part geometry.
  end_sep = np.array((split(shp_line, split_pt)[-1].buffer(self.line_width,
Validation Sanity Check ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 0/10 -:--:-- 0:00:01 
Traceback (most recent call last):
  File "/scratch/sven/anaconda3/envs/kraken/lib/python3.9/code.py", line 90, in runcode
    exec(code, self.locals)
  File "<input>", line 11, in <module>
  File "/scratch/sven/anaconda3/envs/kraken/lib/python3.9/site-packages/kraken/lib/train.py", line 96, in fit
    super().fit(*args, **kwargs)
  File "/scratch/sven/anaconda3/envs/kraken/lib/python3.9/site-packages/pytorch_lightning/trainer/trainer.py", line 770, in fit
    self._call_and_handle_interrupt(
  File "/scratch/sven/anaconda3/envs/kraken/lib/python3.9/site-packages/pytorch_lightning/trainer/trainer.py", line 723, in _call_and_handle_interrupt
    return trainer_fn(*args, **kwargs)
  File "/scratch/sven/anaconda3/envs/kraken/lib/python3.9/site-packages/pytorch_lightning/trainer/trainer.py", line 811, in _fit_impl
    results = self._run(model, ckpt_path=self.ckpt_path)
  File "/scratch/sven/anaconda3/envs/kraken/lib/python3.9/site-packages/pytorch_lightning/trainer/trainer.py", line 1236, in _run
    results = self._run_stage()
  File "/scratch/sven/anaconda3/envs/kraken/lib/python3.9/site-packages/pytorch_lightning/trainer/trainer.py", line 1323, in _run_stage
    return self._run_train()
  File "/scratch/sven/anaconda3/envs/kraken/lib/python3.9/site-packages/pytorch_lightning/trainer/trainer.py", line 1345, in _run_train
    self._run_sanity_check()
  File "/scratch/sven/anaconda3/envs/kraken/lib/python3.9/site-packages/pytorch_lightning/trainer/trainer.py", line 1413, in _run_sanity_check
    val_loop.run()
  File "/scratch/sven/anaconda3/envs/kraken/lib/python3.9/site-packages/pytorch_lightning/loops/base.py", line 204, in run
    self.advance(*args, **kwargs)
  File "/scratch/sven/anaconda3/envs/kraken/lib/python3.9/site-packages/pytorch_lightning/loops/dataloader/evaluation_loop.py", line 155, in advance
    dl_outputs = self.epoch_loop.run(self._data_fetcher, dl_max_batches, kwargs)
  File "/scratch/sven/anaconda3/envs/kraken/lib/python3.9/site-packages/pytorch_lightning/loops/base.py", line 204, in run
    self.advance(*args, **kwargs)
  File "/scratch/sven/anaconda3/envs/kraken/lib/python3.9/site-packages/pytorch_lightning/loops/epoch/evaluation_epoch_loop.py", line 112, in advance
    batch = next(data_fetcher)
  File "/scratch/sven/anaconda3/envs/kraken/lib/python3.9/site-packages/pytorch_lightning/utilities/fetching.py", line 184, in __next__
    return self.fetching_function()
  File "/scratch/sven/anaconda3/envs/kraken/lib/python3.9/site-packages/pytorch_lightning/utilities/fetching.py", line 259, in fetching_function
    self._fetch_next_batch(self.dataloader_iter)
  File "/scratch/sven/anaconda3/envs/kraken/lib/python3.9/site-packages/pytorch_lightning/utilities/fetching.py", line 273, in _fetch_next_batch
    batch = next(iterator)
  File "/scratch/sven/anaconda3/envs/kraken/lib/python3.9/site-packages/torch/utils/data/dataloader.py", line 530, in __next__
    data = self._next_data()
  File "/scratch/sven/anaconda3/envs/kraken/lib/python3.9/site-packages/torch/utils/data/dataloader.py", line 1224, in _next_data
    return self._process_data(data)
  File "/scratch/sven/anaconda3/envs/kraken/lib/python3.9/site-packages/torch/utils/data/dataloader.py", line 1250, in _process_data
    data.reraise()
  File "/scratch/sven/anaconda3/envs/kraken/lib/python3.9/site-packages/torch/_utils.py", line 457, in reraise
    raise exception
RuntimeError: Caught RuntimeError in DataLoader worker process 0.
Original Traceback (most recent call last):
  File "/scratch/sven/anaconda3/envs/kraken/lib/python3.9/site-packages/torch/utils/data/_utils/worker.py", line 287, in _worker_loop
    data = fetcher.fetch(index)
  File "/scratch/sven/anaconda3/envs/kraken/lib/python3.9/site-packages/torch/utils/data/_utils/fetch.py", line 52, in fetch
    return self.collate_fn(data)
  File "/scratch/sven/anaconda3/envs/kraken/lib/python3.9/site-packages/torch/utils/data/_utils/collate.py", line 157, in default_collate
    return elem_type({key: default_collate([d[key] for d in batch]) for key in elem})
  File "/scratch/sven/anaconda3/envs/kraken/lib/python3.9/site-packages/torch/utils/data/_utils/collate.py", line 157, in <dictcomp>
    return elem_type({key: default_collate([d[key] for d in batch]) for key in elem})
  File "/scratch/sven/anaconda3/envs/kraken/lib/python3.9/site-packages/torch/utils/data/_utils/collate.py", line 136, in default_collate
    storage = elem.storage()._new_shared(numel)
  File "/scratch/sven/anaconda3/envs/kraken/lib/python3.9/site-packages/torch/storage.py", line 487, in _new_shared
    untyped_storage = module._UntypedStorage._new_shared(size * cls().element_size())
  File "/scratch/sven/anaconda3/envs/kraken/lib/python3.9/site-packages/torch/storage.py", line 172, in _new_shared
    return cls._new_using_filename(size)
RuntimeError: torch_shm_manager at "/scratch/sven/anaconda3/envs/kraken/lib/python3.9/site-packages/torch/bin/torch_shm_manager": could not generate a random directory for manager socket

I tried this with both 4.1.2 and 4.1.0. I use PyTorch quite often and have never had this issue.

sven-nm (Aug 01 '22, 17:08)

One last detail: the torch_shm_manager error was also raised when testing the same snippet in a Python 3.8 environment (previously tested only on 3.9 and 3.10) :)

sven-nm (Aug 24 '22, 07:08)

What's your pip freeze like?

PonteIneptique (Aug 24 '22, 07:08)

I just dug through the pytorch code to see how/where it tries to create that directory. Apparently you aren't allowed to create directories in /tmp. Their code should respect a manually set TMPDIR environment variable though, so pointing that at some other directory should help.
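
For the Linux error, a minimal sketch of that workaround layered on the snippet from the first comment (the /scratch/sven/tmp path is purely illustrative, any directory you can write to should do; exporting TMPDIR in the shell before starting Python works as well):

import os

# Point TMPDIR at a writable scratch directory *before* any DataLoader
# workers are spawned; torch_shm_manager reads it when creating the
# directory for its manager socket.
os.environ['TMPDIR'] = '/scratch/sven/tmp'  # illustrative path, use any rwx directory

from kraken.lib.train import SegmentationModel, KrakenTrainer
import glob

ground_truth = glob.glob('/Users/me/drive/ketos_sample/*.xml')
training_files = ground_truth[:3]
evaluation_files = ground_truth[3:]
model = SegmentationModel(training_data=training_files,
                          evaluation_data=evaluation_files,
                          format_type='xml')
trainer = KrakenTrainer()
trainer.fit(model)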

For the coreml error under Mac OS X I think it might be related to coremltools 5. If you could try downgrading to coremltools 4 it might resolve the issue (but I've got no way to check in the absence of an Apple device).
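
For reference, a possible way to attempt that downgrade (untested here; it simply pins to the last 4.x release on PyPI):

pip install "coremltools<5.0"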

mittagessen (Aug 24 '22, 10:08)

@PonteIneptique here it is:

absl-py==1.2.0
aiohttp==3.8.1
aiosignal==1.2.0
async-timeout==4.0.2
attrs==22.1.0
cachetools==5.2.0
certifi @ file:///opt/conda/conda-bld/certifi_1655968806487/work/certifi
charset-normalizer==2.1.0
click==8.1.3
commonmark==0.9.1
coremltools==5.2.0
frozenlist==1.3.1
fsspec==2022.7.1
google-auth==2.10.0
google-auth-oauthlib==0.4.6
grpcio==1.47.0
idna==3.3
imageio==2.21.1
importlib-metadata==4.12.0
importlib-resources==5.9.0
Jinja2==3.1.2
jsonschema==4.9.1
kraken==4.1.2
lxml==4.9.1
Markdown==3.4.1
MarkupSafe==2.1.1
mpmath==1.2.1
multidict==6.0.2
networkx==2.8.5
numpy==1.23.2
oauthlib==3.2.0
packaging==21.3
Pillow==9.2.0
pkgutil_resolve_name==1.3.10
protobuf==3.19.4
pyarrow==9.0.0
pyasn1==0.4.8
pyasn1-modules==0.2.8
pyDeprecate==0.3.2
Pygments==2.13.0
pyparsing==3.0.9
pyrsistent==0.18.1
python-bidi==0.4.2
pytorch-lightning==1.7.1
PyWavelets==1.3.0
PyYAML==6.0
regex==2022.7.25
requests==2.28.1
requests-oauthlib==1.3.1
rich==12.5.1
rsa==4.9
scikit-image==0.19.2
scipy==1.9.0
Shapely==1.8.2
six==1.16.0
sympy==1.10.1
tensorboard==2.10.0
tensorboard-data-server==0.6.1
tensorboard-plugin-wit==1.8.1
tifffile==2022.8.12
torch==1.11.0
torchmetrics==0.9.3
torchvision==0.12.0
tqdm==4.64.0
typing_extensions==4.3.0
urllib3==1.26.11
Werkzeug==2.2.2
yarl==1.8.1
zipp==3.8.1

sven-nm (Aug 24 '22, 12:08)

@mittagessen thanks a lot for digging into this.

I just dug through the pytorch code to see how/where it tries to create that directory. Apparently you aren't allowed to create directories in /tmp. Their code should respect a manually set TMPDIR environment variable though, so pointing that at some other directory should help.

I'll check this, but my TMPDIR already gets exported to a rwx dir by default. I'll also try to downgrade coremltools and keep you posted.
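
In case it helps narrow this down, a small diagnostic sketch (nothing kraken-specific, it just prints what the worker processes would inherit for their temp directory):

import os, tempfile

# TMPDIR as seen by this process, and the temp dir Python resolves from it
print(os.environ.get('TMPDIR'))
print(tempfile.gettempdir())
# whether that directory is writable and traversable for the current user
print(os.access(tempfile.gettempdir(), os.W_OK | os.X_OK))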

sven-nm (Aug 24 '22, 12:08)