DALI + Catalyst = π
Signed-off-by: Rishabh Singh [email protected]
Description
- [ ] Bug fix (non-breaking change which fixes an issue)
- [ ] New feature (non-breaking change which adds functionality)
- [ ] Breaking change (fix or feature that would cause existing functionality to not work as expected)
- [ ] Refactoring (Redesign of existing code that doesn't affect functionality)
- [x] Other (e.g. Documentation, Tests, Configuration)
What happened in this PR
Additional information
- Affected modules and functionalities:
- Key points relevant for the review:
Checklist
Tests
- [ ] Existing tests apply
- [ ] New tests added
- [ ] Python tests
- [ ] GTests
- [ ] Benchmark
- [ ] Other
- [ ] N/A
Documentation
- [ ] Existing documentation applies
- [ ] Documentation updated
- [ ] Docstring
- [ ] Doxygen
- [ ] RST
- [x] Jupyter
- [ ] Other
- [ ] N/A
DALI team only
Requirements
- [ ] Implements new requirements
- [ ] Affects existing requirements
- [ ] N/A
REQ IDs: N/A
JIRA TASK: N/A
Fixes: #3426
Hi @anonymousr007,
Thank you for your contribution. Let us review it and get back to you soon.
@anonymousr007 thank you for your hard work. I have added a couple of comments from my side. Let us know when the code is ready for another review round.
Please review. I guess something went wrong here.
Hi @anonymousr007,
The changes you made look good. Please also:
- rework the old define_graph style to the new functional API. You can refer to the https://github.com/NVIDIA/DALI/pull/2566, https://github.com/NVIDIA/DALI/pull/2721 and https://github.com/NVIDIA/DALI/pull/2577 PRs
- please add more narrative to the example, like in the PyTorch Lightning one, which adds an introduction and explains the main steps in it
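For reference, here is a rough sketch of what an MNIST pipeline could look like in the functional API. It is only an illustration, not the code from this PR: the name `build_mnist_pipeline`, its parameters, and the assumption that the data is stored as a Caffe2 LMDB and decoded on the GPU are all hypothetical and should be adapted to the notebook.

```python
def build_mnist_pipeline(data_path, batch_size=64, num_threads=2, device_id=0):
    """Build a DALI MNIST pipeline with the functional API (illustrative sketch)."""
    # Imported lazily so this helper can be defined on machines without DALI.
    from nvidia.dali import fn, pipeline_def, types

    @pipeline_def
    def mnist_pipeline():
        # Assumption: the dataset is an LMDB in Caffe2 format under data_path.
        jpegs, labels = fn.readers.caffe2(
            path=data_path, random_shuffle=True, name="Reader"
        )
        # "mixed" takes CPU input and produces decoded images on the GPU.
        images = fn.decoders.image(jpegs, device="mixed", output_type=types.GRAY)
        # Scale pixel values to [0, 1] and emit the CHW layout PyTorch expects.
        images = fn.crop_mirror_normalize(
            images, dtype=types.FLOAT, std=[255.0], output_layout="CHW"
        )
        # Move labels to the GPU and cast them to INT64 for PyTorch losses.
        labels = fn.cast(labels.gpu(), dtype=types.INT64)
        return images, labels

    # batch_size, num_threads and device_id are consumed by @pipeline_def.
    return mnist_pipeline(
        batch_size=batch_size, num_threads=num_threads, device_id=device_id
    )
```

Compared with the class-based style, the operators are called directly as functions inside the decorated function, so there is no separate constructor and `define_graph` method to keep in sync.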
!build
CI MESSAGE: [3295027]: BUILD STARTED
CI MESSAGE: [3295027]: BUILD PASSED
How much time does it take to merge?
Hi @anonymousr007,
If CI is green and you have approval from both reviewers, it should take no more than one business day. In this case, I see the basic tests have passed, but the more advanced ones failed:
[NbConvertApp] Converting notebook frameworks/pytorch/MNIST-catalyst-example.ipynb to notebook
Traceback (most recent call last):
File "/usr/local/bin/jupyter-nbconvert", line 8, in <module>
sys.exit(main())
File "/usr/local/lib/python3.8/dist-packages/jupyter_core/application.py", line 264, in launch_instance
return super(JupyterApp, cls).launch_instance(argv=argv, **kwargs)
File "/usr/local/lib/python3.8/dist-packages/traitlets/config/application.py", line 846, in launch_instance
app.start()
File "/usr/local/lib/python3.8/dist-packages/nbconvert/nbconvertapp.py", line 361, in start
self.convert_notebooks()
File "/usr/local/lib/python3.8/dist-packages/nbconvert/nbconvertapp.py", line 533, in convert_notebooks
self.convert_single_notebook(notebook_filename)
File "/usr/local/lib/python3.8/dist-packages/nbconvert/nbconvertapp.py", line 498, in convert_single_notebook
output, resources = self.export_single_notebook(notebook_filename, resources, input_buffer=input_buffer)
File "/usr/local/lib/python3.8/dist-packages/nbconvert/nbconvertapp.py", line 427, in export_single_notebook
output, resources = self.exporter.from_filename(notebook_filename, resources=resources)
File "/usr/local/lib/python3.8/dist-packages/nbconvert/exporters/exporter.py", line 181, in from_filename
return self.from_file(f, resources=resources, **kw)
File "/usr/local/lib/python3.8/dist-packages/nbconvert/exporters/exporter.py", line 199, in from_file
return self.from_notebook_node(nbformat.read(file_stream, as_version=4), resources=resources, **kw)
File "/usr/local/lib/python3.8/dist-packages/nbconvert/exporters/notebook.py", line 32, in from_notebook_node
nb_copy, resources = super().from_notebook_node(nb, resources, **kw)
File "/usr/local/lib/python3.8/dist-packages/nbconvert/exporters/exporter.py", line 143, in from_notebook_node
nb_copy, resources = self._preprocess(nb_copy, resources)
File "/usr/local/lib/python3.8/dist-packages/nbconvert/exporters/exporter.py", line 318, in _preprocess
nbc, resc = preprocessor(nbc, resc)
File "/usr/local/lib/python3.8/dist-packages/nbconvert/preprocessors/base.py", line 47, in __call__
return self.preprocess(nb, resources)
File "/usr/local/lib/python3.8/dist-packages/nbconvert/preprocessors/execute.py", line 84, in preprocess
self.preprocess_cell(cell, resources, index)
File "/usr/local/lib/python3.8/dist-packages/nbconvert/preprocessors/execute.py", line 105, in preprocess_cell
cell = self.execute_cell(cell, index, store_history=True)
File "/usr/local/lib/python3.8/dist-packages/nbclient/util.py", line 78, in wrapped
return just_run(coro(*args, **kwargs))
File "/usr/local/lib/python3.8/dist-packages/nbclient/util.py", line 57, in just_run
return loop.run_until_complete(coro)
File "/usr/lib/python3.8/asyncio/base_events.py", line 608, in run_until_complete
return future.result()
File "/usr/local/lib/python3.8/dist-packages/nbclient/client.py", line 862, in async_execute_cell
self._check_raise_for_error(cell, exec_reply)
File "/usr/local/lib/python3.8/dist-packages/nbclient/client.py", line 765, in _check_raise_for_error
raise CellExecutionError.from_cell_and_msg(cell, exec_reply_content)
nbclient.exceptions.CellExecutionError: An error occurred while executing the following cell:
------------------
runner = dl.SupervisedRunner()
runner.train(
model=model,
criterion=criterion,
optimizer=optimizer,
loaders=loaders,
num_epochs=1,
logdir="./logs",
valid_loader="valid",
valid_metric="loss",
minimize_valid_metric=True,
verbose=True,
callbacks=[
dl.AccuracyCallback(input_key="logits", target_key="targets", num_classes=10),
]
)
------------------
---------------------------------------------------------------------------
RuntimeError Traceback (most recent call last)
/tmp/ipykernel_4172/158253840.py in <module>
1 runner = dl.SupervisedRunner()
2
----> 3 runner.train(
4 model=model,
5 criterion=criterion,
/usr/local/lib/python3.8/dist-packages/catalyst/runners/runner.py in train(self, loaders, model, engine, trial, criterion, optimizer, scheduler, callbacks, loggers, seed, hparams, num_epochs, logdir, valid_loader, valid_metric, minimize_valid_metric, verbose, timeit, check, overfit, load_best_on_end, fp16, amp, apex, ddp)
513 self._load_best_on_end = load_best_on_end
514 # run
--> 515 self.run()
516
517 @torch.no_grad()
/usr/local/lib/python3.8/dist-packages/catalyst/core/runner.py in run(self)
852 self.exception = ex
853 self._run_event("on_experiment_end")
--> 854 self._run_event("on_exception")
855 return self
856
/usr/local/lib/python3.8/dist-packages/catalyst/core/runner.py in _run_event(self, event)
786 getattr(callback, event)(self)
787 if _has_str_intersections(event, ("_end", "_exception")):
--> 788 getattr(self, event)(self)
789
790 @abstractmethod
/usr/local/lib/python3.8/dist-packages/catalyst/core/runner.py in on_exception(self, runner)
778 def on_exception(self, runner: "IRunner"):
779 """Event handler."""
--> 780 raise self.exception
781
782 def _run_event(self, event: str) -> None:
/usr/local/lib/python3.8/dist-packages/catalyst/core/runner.py in run(self)
848 """
849 try:
--> 850 self._run_experiment()
851 except (Exception, KeyboardInterrupt) as ex:
852 self.exception = ex
/usr/local/lib/python3.8/dist-packages/catalyst/core/runner.py in _run_experiment(self)
838 self._run_event("on_experiment_start")
839 for self.stage_key in self.stages:
--> 840 self.engine.spawn(self._run_stage)
841 self._run_event("on_experiment_end")
842
/usr/local/lib/python3.8/dist-packages/catalyst/core/engine.py in spawn(self, fn, *args, **kwargs)
136 wrapped function (if needed).
137 """
--> 138 return fn(*args, **kwargs)
139
140 def setup_process(self, rank: int = -1, world_size: int = 1):
/usr/local/lib/python3.8/dist-packages/catalyst/core/runner.py in _run_stage(self, rank, world_size)
829 self._run_event("on_stage_start")
830 while self.stage_epoch_step < self.stage_epoch_len:
--> 831 self._run_epoch()
832 if self.need_early_stop:
833 self.need_early_stop = False
/usr/local/lib/python3.8/dist-packages/catalyst/core/runner.py in _run_epoch(self)
822 self._run_event("on_epoch_start")
823 for self.loader_key, self.loader in self.loaders.items():
--> 824 self._run_loader()
825 self._run_event("on_epoch_end")
826
/usr/local/lib/python3.8/dist-packages/catalyst/core/runner.py in _run_loader(self)
813 for self.loader_batch_step, self.batch in enumerate(self.loader):
814 with self.engine.autocast():
--> 815 self._run_batch()
816 if self.need_early_stop:
817 self.need_early_stop = False
/usr/local/lib/python3.8/dist-packages/catalyst/core/runner.py in _run_batch(self)
801 def _run_batch(self) -> None:
802 self._run_event("on_batch_start")
--> 803 self.handle_batch(batch=self.batch)
804 self.batch = self.engine.sync_device(self.batch)
805 self._run_event("on_batch_end")
/usr/local/lib/python3.8/dist-packages/catalyst/runners/supervised.py in handle_batch(self, batch)
197 batch: dictionary with data batches from DataLoader.
198 """
--> 199 self.batch = {**batch, **self.forward(batch)}
200
201
/usr/local/lib/python3.8/dist-packages/catalyst/runners/supervised.py in forward(self, batch, **kwargs)
180 dict with model output batch
181 """
--> 182 output = self._process_input(batch, **kwargs)
183 output = self._process_output(output)
184 return output
/usr/local/lib/python3.8/dist-packages/catalyst/runners/supervised.py in _process_input_str(self, batch, **kwargs)
143
144 def _process_input_str(self, batch: Mapping[str, Any], **kwargs):
--> 145 output = self.model(batch[self._input_key], **kwargs)
146 return output
147
/usr/local/lib/python3.8/dist-packages/torch/nn/modules/module.py in _call_impl(self, *input, **kwargs)
887 result = self._slow_forward(*input, **kwargs)
888 else:
--> 889 result = self.forward(*input, **kwargs)
890 for hook in itertools.chain(
891 _global_forward_hooks.values(),
/usr/local/lib/python3.8/dist-packages/torch/nn/parallel/data_parallel.py in forward(self, *inputs, **kwargs)
165 return self.module(*inputs[0], **kwargs[0])
166 replicas = self.replicate(self.module, self.device_ids[:len(inputs)])
--> 167 outputs = self.parallel_apply(replicas, inputs, kwargs)
168 return self.gather(outputs, self.output_device)
169
/usr/local/lib/python3.8/dist-packages/torch/nn/parallel/data_parallel.py in parallel_apply(self, replicas, inputs, kwargs)
175
176 def parallel_apply(self, replicas, inputs, kwargs):
--> 177 return parallel_apply(replicas, inputs, kwargs, self.device_ids[:len(replicas)])
178
179 def gather(self, outputs, output_device):
/usr/local/lib/python3.8/dist-packages/torch/nn/parallel/parallel_apply.py in parallel_apply(modules, inputs, kwargs_tup, devices)
84 output = results[i]
85 if isinstance(output, ExceptionWrapper):
---> 86 output.reraise()
87 outputs.append(output)
88 return outputs
/usr/local/lib/python3.8/dist-packages/torch/_utils.py in reraise(self)
427 # have message field
428 raise self.exc_type(message=msg)
--> 429 raise self.exc_type(msg)
430
431
RuntimeError: Caught RuntimeError in replica 0 on device 0.
Original Traceback (most recent call last):
File "/usr/local/lib/python3.8/dist-packages/torch/nn/parallel/parallel_apply.py", line 61, in _worker
output = module(*input, **kwargs)
File "/usr/local/lib/python3.8/dist-packages/torch/nn/modules/module.py", line 889, in _call_impl
result = self.forward(*input, **kwargs)
File "/usr/local/lib/python3.8/dist-packages/torch/nn/modules/container.py", line 119, in forward
input = module(input)
File "/usr/local/lib/python3.8/dist-packages/torch/nn/modules/module.py", line 889, in _call_impl
result = self.forward(*input, **kwargs)
File "/usr/local/lib/python3.8/dist-packages/torch/nn/modules/linear.py", line 94, in forward
return F.linear(input, self.weight, self.bias)
File "/usr/local/lib/python3.8/dist-packages/torch/nn/functional.py", line 1753, in linear
return torch._C._nn.linear(input, weight, bias)
RuntimeError: mat1 dim 1 must match mat2 dim 0
Can you run the notebook on your side and see if that is reproducible? In the meantime, please also add more description to the notebook itself to explain what happens in each step. You can check the PyTorch Lightning example for reference.
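As a hint for debugging: the final error means that the second dimension of the tensor reaching a linear layer did not equal that layer's number of input features. The log does not show the actual shapes, so the sketch below only illustrates the rule with made-up numbers. The helper names and the 784-feature layer are assumptions, not taken from the notebook.

```python
from math import prod


def flattened_features(sample_shape):
    """Number of values per sample once all non-batch dimensions are flattened."""
    return prod(sample_shape)


def check_linear_input(batch_shape, in_features):
    """Return the flattened batch shape, or raise if it cannot feed the layer."""
    features = flattened_features(batch_shape[1:])
    if features != in_features:
        raise ValueError(
            f"flattened input has {features} features, layer expects {in_features}"
        )
    return (batch_shape[0], features)


# Hypothetical: the first linear layer is sized for 28x28 grayscale images.
IN_FEATURES = 28 * 28

# A single-channel batch flattens to exactly 784 features per sample.
ok_shape = check_linear_input((64, 1, 28, 28), IN_FEATURES)

# A three-channel batch flattens to 2352 features and would be rejected.
try:
    check_linear_input((64, 3, 28, 28), IN_FEATURES)
    error_message = None
except ValueError as exc:
    error_message = str(exc)

print(ok_shape)
print(error_message)
```

Printing the shape of one batch from the DALI iterator right before `runner.train` and comparing it with the first layer of the model should show which side is off.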
I'm closing this pull request. Let us know if you want to still work on it.