kraken
segtrain fails in v 4.1.2
Hi @mittagessen
Just pointing out a small oddity. Running the following on kraken 4.1.2 and torch 1.11
(note that ketos_sample actually contains the Italian subdir from the BiblIA dataset):
from kraken.lib.train import SegmentationModel, KrakenTrainer
import glob
ground_truth = glob.glob('/Users/me/drive/ketos_sample/*.xml')
training_files = ground_truth[:3] # training data is shuffled internally
evaluation_files = ground_truth[3:]
model = SegmentationModel(training_data=training_files, evaluation_data=evaluation_files, format_type='xml')
trainer = KrakenTrainer()
trainer.fit(model)
yields:
Traceback (most recent call last):
File "/Users/sven/.local/lib/python3.10/site-packages/IPython/core/interactiveshell.py", line 3398, in run_code
exec(code_obj, self.user_global_ns, self.user_ns)
File "<ipython-input-9-45d4afebefac>", line 1, in <cell line: 1>
trainer.fit(model)
File "/Users/sven/opt/anaconda3/envs/kraken/lib/python3.10/site-packages/kraken/lib/train.py", line 96, in fit
super().fit(*args, **kwargs)
File "/Users/sven/opt/anaconda3/envs/kraken/lib/python3.10/site-packages/pytorch_lightning/trainer/trainer.py", line 770, in fit
self._call_and_handle_interrupt(
File "/Users/sven/opt/anaconda3/envs/kraken/lib/python3.10/site-packages/pytorch_lightning/trainer/trainer.py", line 723, in _call_and_handle_interrupt
return trainer_fn(*args, **kwargs)
File "/Users/sven/opt/anaconda3/envs/kraken/lib/python3.10/site-packages/pytorch_lightning/trainer/trainer.py", line 811, in _fit_impl
results = self._run(model, ckpt_path=self.ckpt_path)
File "/Users/sven/opt/anaconda3/envs/kraken/lib/python3.10/site-packages/pytorch_lightning/trainer/trainer.py", line 1236, in _run
results = self._run_stage()
File "/Users/sven/opt/anaconda3/envs/kraken/lib/python3.10/site-packages/pytorch_lightning/trainer/trainer.py", line 1323, in _run_stage
return self._run_train()
File "/Users/sven/opt/anaconda3/envs/kraken/lib/python3.10/site-packages/pytorch_lightning/trainer/trainer.py", line 1353, in _run_train
self.fit_loop.run()
File "/Users/sven/opt/anaconda3/envs/kraken/lib/python3.10/site-packages/pytorch_lightning/loops/base.py", line 204, in run
self.advance(*args, **kwargs)
File "/Users/sven/opt/anaconda3/envs/kraken/lib/python3.10/site-packages/pytorch_lightning/loops/fit_loop.py", line 266, in advance
self._outputs = self.epoch_loop.run(self._data_fetcher)
File "/Users/sven/opt/anaconda3/envs/kraken/lib/python3.10/site-packages/pytorch_lightning/loops/base.py", line 205, in run
self.on_advance_end()
File "/Users/sven/opt/anaconda3/envs/kraken/lib/python3.10/site-packages/pytorch_lightning/loops/epoch/training_epoch_loop.py", line 255, in on_advance_end
self._run_validation()
File "/Users/sven/opt/anaconda3/envs/kraken/lib/python3.10/site-packages/pytorch_lightning/loops/epoch/training_epoch_loop.py", line 311, in _run_validation
self.val_loop.run()
File "/Users/sven/opt/anaconda3/envs/kraken/lib/python3.10/site-packages/pytorch_lightning/loops/base.py", line 211, in run
output = self.on_run_end()
File "/Users/sven/opt/anaconda3/envs/kraken/lib/python3.10/site-packages/pytorch_lightning/loops/dataloader/evaluation_loop.py", line 201, in on_run_end
self._on_evaluation_end()
File "/Users/sven/opt/anaconda3/envs/kraken/lib/python3.10/site-packages/pytorch_lightning/loops/dataloader/evaluation_loop.py", line 277, in _on_evaluation_end
self.trainer._call_callback_hooks("on_validation_end", *args, **kwargs)
File "/Users/sven/opt/anaconda3/envs/kraken/lib/python3.10/site-packages/pytorch_lightning/trainer/trainer.py", line 1636, in _call_callback_hooks
fn(self, self.lightning_module, *args, **kwargs)
File "/Users/sven/opt/anaconda3/envs/kraken/lib/python3.10/site-packages/kraken/lib/train.py", line 123, in on_validation_end
trainer.model.nn.save_model(f'{trainer.model.output}_{trainer.current_epoch}.mlmodel')
File "/Users/sven/opt/anaconda3/envs/kraken/lib/python3.10/site-packages/kraken/lib/vgsl.py", line 506, in save_model
mlmodel = MLModel(net_builder.spec)
File "/Users/sven/opt/anaconda3/envs/kraken/lib/python3.10/site-packages/coremltools/models/model.py", line 346, in __init__
self.__proxy__, self._spec, self._framework_error = _get_proxy_and_spec(
File "/Users/sven/opt/anaconda3/envs/kraken/lib/python3.10/site-packages/coremltools/models/model.py", line 123, in _get_proxy_and_spec
specification = _load_spec(filename)
File "/Users/sven/opt/anaconda3/envs/kraken/lib/python3.10/site-packages/coremltools/models/utils.py", line 210, in load_spec
raise Exception(
Exception: Unable to load libmodelpackage. Cannot make save spec.
I think this might be due to a conflict between coremltools and torch 1.11, since you always get this warning: WARNING:root:Torch version 1.11.0 has not been tested with coremltools. You may run into unexpected errors. Torch 1.10.2 is the most recent version that has been tested.
However, forcing kraken 4.1.2 to run on torch 1.10.2 also fails. See below:
Traceback (most recent call last):
File "/Users/sven/opt/anaconda3/envs/kraken/lib/python3.9/code.py", line 90, in runcode
exec(code, self.locals)
File "<input>", line 1, in <module>
File "/Applications/PyCharm.app/Contents/plugins/python/helpers/pydev/_pydev_bundle/pydev_import_hook.py", line 21, in do_import
module = self._system_import(name, *args, **kwargs)
File "/Users/sven/opt/anaconda3/envs/kraken/lib/python3.9/site-packages/kraken/lib/train.py", line 25, in <module>
import pytorch_lightning as pl
File "/Applications/PyCharm.app/Contents/plugins/python/helpers/pydev/_pydev_bundle/pydev_import_hook.py", line 21, in do_import
module = self._system_import(name, *args, **kwargs)
File "/Users/sven/opt/anaconda3/envs/kraken/lib/python3.9/site-packages/pytorch_lightning/__init__.py", line 30, in <module>
from pytorch_lightning.callbacks import Callback # noqa: E402
File "/Applications/PyCharm.app/Contents/plugins/python/helpers/pydev/_pydev_bundle/pydev_import_hook.py", line 21, in do_import
module = self._system_import(name, *args, **kwargs)
File "/Users/sven/opt/anaconda3/envs/kraken/lib/python3.9/site-packages/pytorch_lightning/callbacks/__init__.py", line 26, in <module>
from pytorch_lightning.callbacks.pruning import ModelPruning
File "/Applications/PyCharm.app/Contents/plugins/python/helpers/pydev/_pydev_bundle/pydev_import_hook.py", line 21, in do_import
module = self._system_import(name, *args, **kwargs)
File "/Users/sven/opt/anaconda3/envs/kraken/lib/python3.9/site-packages/pytorch_lightning/callbacks/pruning.py", line 31, in <module>
from pytorch_lightning.core.lightning import LightningModule
File "/Applications/PyCharm.app/Contents/plugins/python/helpers/pydev/_pydev_bundle/pydev_import_hook.py", line 21, in do_import
module = self._system_import(name, *args, **kwargs)
File "/Users/sven/opt/anaconda3/envs/kraken/lib/python3.9/site-packages/pytorch_lightning/core/__init__.py", line 16, in <module>
from pytorch_lightning.core.lightning import LightningModule
File "/Applications/PyCharm.app/Contents/plugins/python/helpers/pydev/_pydev_bundle/pydev_import_hook.py", line 21, in do_import
module = self._system_import(name, *args, **kwargs)
File "/Users/sven/opt/anaconda3/envs/kraken/lib/python3.9/site-packages/pytorch_lightning/core/lightning.py", line 41, in <module>
from pytorch_lightning.loggers import LightningLoggerBase, LoggerCollection
File "/Applications/PyCharm.app/Contents/plugins/python/helpers/pydev/_pydev_bundle/pydev_import_hook.py", line 21, in do_import
module = self._system_import(name, *args, **kwargs)
File "/Users/sven/opt/anaconda3/envs/kraken/lib/python3.9/site-packages/pytorch_lightning/loggers/__init__.py", line 18, in <module>
from pytorch_lightning.loggers.tensorboard import TensorBoardLogger
File "/Applications/PyCharm.app/Contents/plugins/python/helpers/pydev/_pydev_bundle/pydev_import_hook.py", line 21, in do_import
module = self._system_import(name, *args, **kwargs)
File "/Users/sven/opt/anaconda3/envs/kraken/lib/python3.9/site-packages/pytorch_lightning/loggers/tensorboard.py", line 26, in <module>
from torch.utils.tensorboard import SummaryWriter
File "/Applications/PyCharm.app/Contents/plugins/python/helpers/pydev/_pydev_bundle/pydev_import_hook.py", line 21, in do_import
module = self._system_import(name, *args, **kwargs)
File "/Users/sven/opt/anaconda3/envs/kraken/lib/python3.9/site-packages/torch/utils/tensorboard/__init__.py", line 4, in <module>
LooseVersion = distutils.version.LooseVersion
AttributeError: module 'distutils' has no attribute 'version'
I may add that running exactly the same code on Linux yields a different error:
/scratch/sven/anaconda3/envs/kraken/lib/python3.9/site-packages/kraken/lib/dataset.py:1162: ShapelyDeprecationWarning: The array interface is deprecated and will no longer work in Shapely 2.0. Convert the '.coords' to a numpy array instead.
im, target = self.transform(im, target)
/scratch/sven/anaconda3/envs/kraken/lib/python3.9/site-packages/kraken/lib/dataset.py:1210: ShapelyDeprecationWarning: __getitem__ for multi-part geometries is deprecated and will be removed in Shapely 2.0. Use the `geoms` property to access the constituent parts of a multi-part geometry.
start_sep = np.array((split(shp_line, split_pt)[0].buffer(self.line_width,
/scratch/sven/anaconda3/envs/kraken/lib/python3.9/site-packages/kraken/lib/dataset.py:1217: ShapelyDeprecationWarning: __getitem__ for multi-part geometries is deprecated and will be removed in Shapely 2.0. Use the `geoms` property to access the constituent parts of a multi-part geometry.
end_sep = np.array((split(shp_line, split_pt)[-1].buffer(self.line_width,
Validation Sanity Check ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 0/10 -:--:-- 0:00:01
Traceback (most recent call last):
File "/scratch/sven/anaconda3/envs/kraken/lib/python3.9/code.py", line 90, in runcode
exec(code, self.locals)
File "<input>", line 11, in <module>
File "/scratch/sven/anaconda3/envs/kraken/lib/python3.9/site-packages/kraken/lib/train.py", line 96, in fit
super().fit(*args, **kwargs)
File "/scratch/sven/anaconda3/envs/kraken/lib/python3.9/site-packages/pytorch_lightning/trainer/trainer.py", line 770, in fit
self._call_and_handle_interrupt(
File "/scratch/sven/anaconda3/envs/kraken/lib/python3.9/site-packages/pytorch_lightning/trainer/trainer.py", line 723, in _call_and_handle_interrupt
return trainer_fn(*args, **kwargs)
File "/scratch/sven/anaconda3/envs/kraken/lib/python3.9/site-packages/pytorch_lightning/trainer/trainer.py", line 811, in _fit_impl
results = self._run(model, ckpt_path=self.ckpt_path)
File "/scratch/sven/anaconda3/envs/kraken/lib/python3.9/site-packages/pytorch_lightning/trainer/trainer.py", line 1236, in _run
results = self._run_stage()
File "/scratch/sven/anaconda3/envs/kraken/lib/python3.9/site-packages/pytorch_lightning/trainer/trainer.py", line 1323, in _run_stage
return self._run_train()
File "/scratch/sven/anaconda3/envs/kraken/lib/python3.9/site-packages/pytorch_lightning/trainer/trainer.py", line 1345, in _run_train
self._run_sanity_check()
File "/scratch/sven/anaconda3/envs/kraken/lib/python3.9/site-packages/pytorch_lightning/trainer/trainer.py", line 1413, in _run_sanity_check
val_loop.run()
File "/scratch/sven/anaconda3/envs/kraken/lib/python3.9/site-packages/pytorch_lightning/loops/base.py", line 204, in run
self.advance(*args, **kwargs)
File "/scratch/sven/anaconda3/envs/kraken/lib/python3.9/site-packages/pytorch_lightning/loops/dataloader/evaluation_loop.py", line 155, in advance
dl_outputs = self.epoch_loop.run(self._data_fetcher, dl_max_batches, kwargs)
File "/scratch/sven/anaconda3/envs/kraken/lib/python3.9/site-packages/pytorch_lightning/loops/base.py", line 204, in run
self.advance(*args, **kwargs)
File "/scratch/sven/anaconda3/envs/kraken/lib/python3.9/site-packages/pytorch_lightning/loops/epoch/evaluation_epoch_loop.py", line 112, in advance
batch = next(data_fetcher)
File "/scratch/sven/anaconda3/envs/kraken/lib/python3.9/site-packages/pytorch_lightning/utilities/fetching.py", line 184, in __next__
return self.fetching_function()
File "/scratch/sven/anaconda3/envs/kraken/lib/python3.9/site-packages/pytorch_lightning/utilities/fetching.py", line 259, in fetching_function
self._fetch_next_batch(self.dataloader_iter)
File "/scratch/sven/anaconda3/envs/kraken/lib/python3.9/site-packages/pytorch_lightning/utilities/fetching.py", line 273, in _fetch_next_batch
batch = next(iterator)
File "/scratch/sven/anaconda3/envs/kraken/lib/python3.9/site-packages/torch/utils/data/dataloader.py", line 530, in __next__
data = self._next_data()
File "/scratch/sven/anaconda3/envs/kraken/lib/python3.9/site-packages/torch/utils/data/dataloader.py", line 1224, in _next_data
return self._process_data(data)
File "/scratch/sven/anaconda3/envs/kraken/lib/python3.9/site-packages/torch/utils/data/dataloader.py", line 1250, in _process_data
data.reraise()
File "/scratch/sven/anaconda3/envs/kraken/lib/python3.9/site-packages/torch/_utils.py", line 457, in reraise
raise exception
RuntimeError: Caught RuntimeError in DataLoader worker process 0.
Original Traceback (most recent call last):
File "/scratch/sven/anaconda3/envs/kraken/lib/python3.9/site-packages/torch/utils/data/_utils/worker.py", line 287, in _worker_loop
data = fetcher.fetch(index)
File "/scratch/sven/anaconda3/envs/kraken/lib/python3.9/site-packages/torch/utils/data/_utils/fetch.py", line 52, in fetch
return self.collate_fn(data)
File "/scratch/sven/anaconda3/envs/kraken/lib/python3.9/site-packages/torch/utils/data/_utils/collate.py", line 157, in default_collate
return elem_type({key: default_collate([d[key] for d in batch]) for key in elem})
File "/scratch/sven/anaconda3/envs/kraken/lib/python3.9/site-packages/torch/utils/data/_utils/collate.py", line 157, in <dictcomp>
return elem_type({key: default_collate([d[key] for d in batch]) for key in elem})
File "/scratch/sven/anaconda3/envs/kraken/lib/python3.9/site-packages/torch/utils/data/_utils/collate.py", line 136, in default_collate
storage = elem.storage()._new_shared(numel)
File "/scratch/sven/anaconda3/envs/kraken/lib/python3.9/site-packages/torch/storage.py", line 487, in _new_shared
untyped_storage = module._UntypedStorage._new_shared(size * cls().element_size())
File "/scratch/sven/anaconda3/envs/kraken/lib/python3.9/site-packages/torch/storage.py", line 172, in _new_shared
return cls._new_using_filename(size)
RuntimeError: torch_shm_manager at "/scratch/sven/anaconda3/envs/kraken/lib/python3.9/site-packages/torch/bin/torch_shm_manager": could not generate a random directory for manager socket
I tried this with both 4.1.2 and 4.1.0. I use pytorch quite often and have never had this issue.
One last detail: the torch_shm_manager error was also raised when testing the same snippet in a Python 3.8 environment (previously tested only on 3.9 and 3.10) :)
What's your pip freeze like?
I just dug through the pytorch code to see how/where the directory is attempted to be created. Apparently you aren't allowed to create directories in /tmp. Their code should respect a manually set TMPDIR in the environment though, so pointing that to some other directory should help.
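A minimal sketch of that workaround, assuming the variable only needs to be set before torch (and therefore kraken) is imported so that DataLoader worker processes inherit it. The helper name `use_private_tmpdir` and the `kraken_tmp_` prefix are made up for illustration; whether torch_shm_manager actually honours TMPDIR here is the assumption stated above, not something this snippet verifies.

```python
import os
import tempfile


def use_private_tmpdir(base=None):
    """Create a fresh directory and point TMPDIR at it.

    `base` is the parent directory to create it in (hypothetical parameter);
    None falls back to Python's current default temp location.
    """
    # mkdtemp raises OSError if the parent directory is not writable
    path = tempfile.mkdtemp(prefix="kraken_tmp_", dir=base)
    os.environ["TMPDIR"] = path
    # tempfile caches its choice; clear it so Python re-reads TMPDIR too
    tempfile.tempdir = None
    return path


# Intended usage: call this first, then import kraken / torch, e.g.
#   use_private_tmpdir(os.path.expanduser("~"))
#   from kraken.lib.train import SegmentationModel, KrakenTrainer
```

Exporting TMPDIR in the shell before launching Python would have the same effect; doing it in-process just keeps the reproduction script self-contained.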
For the coremltools error under Mac OS X, I think it might be related to coremltools 5. If you could try downgrading to coremltools 4, it might resolve the issue (but I have no way to check without an Apple device).
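To see quickly whether an environment is affected by that suspicion, a small check along these lines could be run first. It is only a sketch: the function names are invented for illustration, and the threshold of major version 4 simply encodes the guess above rather than a confirmed compatibility boundary.

```python
from importlib import metadata


def parse_major(version):
    """Return the leading integer of a version string such as '5.2.0'."""
    return int(version.split(".")[0])


def installed_major(package):
    """Major version of an installed package, or None if it is not installed."""
    try:
        return parse_major(metadata.version(package))
    except metadata.PackageNotFoundError:
        return None


def needs_downgrade(package="coremltools", max_major=4):
    """True if `package` is installed with a major version above `max_major`."""
    major = installed_major(package)
    return major is not None and major > max_major
```

If `needs_downgrade()` returns True, installing with a version constraint such as `pip install "coremltools<5"` would be one way to try the downgrade.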
@PonteIneptique here it is :
absl-py==1.2.0
aiohttp==3.8.1
aiosignal==1.2.0
async-timeout==4.0.2
attrs==22.1.0
cachetools==5.2.0
certifi @ file:///opt/conda/conda-bld/certifi_1655968806487/work/certifi
charset-normalizer==2.1.0
click==8.1.3
commonmark==0.9.1
coremltools==5.2.0
frozenlist==1.3.1
fsspec==2022.7.1
google-auth==2.10.0
google-auth-oauthlib==0.4.6
grpcio==1.47.0
idna==3.3
imageio==2.21.1
importlib-metadata==4.12.0
importlib-resources==5.9.0
Jinja2==3.1.2
jsonschema==4.9.1
kraken==4.1.2
lxml==4.9.1
Markdown==3.4.1
MarkupSafe==2.1.1
mpmath==1.2.1
multidict==6.0.2
networkx==2.8.5
numpy==1.23.2
oauthlib==3.2.0
packaging==21.3
Pillow==9.2.0
pkgutil_resolve_name==1.3.10
protobuf==3.19.4
pyarrow==9.0.0
pyasn1==0.4.8
pyasn1-modules==0.2.8
pyDeprecate==0.3.2
Pygments==2.13.0
pyparsing==3.0.9
pyrsistent==0.18.1
python-bidi==0.4.2
pytorch-lightning==1.7.1
PyWavelets==1.3.0
PyYAML==6.0
regex==2022.7.25
requests==2.28.1
requests-oauthlib==1.3.1
rich==12.5.1
rsa==4.9
scikit-image==0.19.2
scipy==1.9.0
Shapely==1.8.2
six==1.16.0
sympy==1.10.1
tensorboard==2.10.0
tensorboard-data-server==0.6.1
tensorboard-plugin-wit==1.8.1
tifffile==2022.8.12
torch==1.11.0
torchmetrics==0.9.3
torchvision==0.12.0
tqdm==4.64.0
typing_extensions==4.3.0
urllib3==1.26.11
Werkzeug==2.2.2
yarl==1.8.1
zipp==3.8.1
@mittagessen thanks a lot for digging this out.
I just dug through the pytorch code to see how/where the directory is attempted to be created. Apparently you aren't allowed to create directories in /tmp. Their code should respect a manually set TMPDIR in the environment though, so pointing that to some other directory should help.
I'll check this, but my TMPDIR already gets exported to a rwx dir by default. I'll also try downgrading coremltools and keep you posted.
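For the record, a quick way to double-check that the exported directory really accepts new subdirectories, independent of torch. This is a sketch only: `check_tmpdir` is a made-up helper, and the fallback to /tmp when TMPDIR is unset is an assumption about the default.

```python
import os
import tempfile


def check_tmpdir(path=None):
    """Report whether `path` (default: $TMPDIR, else /tmp) accepts new subdirectories."""
    path = path or os.environ.get("TMPDIR", "/tmp")
    report = {
        "path": path,
        "exists": os.path.isdir(path),
        "rwx": os.access(path, os.R_OK | os.W_OK | os.X_OK),
        "can_mkdir": False,
    }
    try:
        # Actually try to create a randomly named directory, as a permission
        # bit check alone can be misleading (quotas, read-only mounts, ...)
        probe = tempfile.mkdtemp(dir=path)
    except OSError:
        return report
    os.rmdir(probe)  # clean up the probe directory
    report["can_mkdir"] = True
    return report
```

Calling `check_tmpdir()` in the same shell session that launches the training should show whether the directory in use is the one expected.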