DALI + Catalyst = 🚀

Signed-off-by: Rishabh Singh [email protected]

Description

[ ] Bug fix (non-breaking change which fixes an issue)
[ ] New feature (non-breaking change which adds functionality)
[ ] Breaking change (fix or feature that would cause existing functionality to not work as expected)
[ ] Refactoring (Redesign of existing code that doesn't affect functionality)
[x] Other (e.g. Documentation, Tests, Configuration)

What happened in this PR

Additional information

Affected modules and functionalities:

Key points relevant for the review:

Checklist

Tests

[ ] Existing tests apply
[ ] New tests added
[ ] Python tests
[ ] GTests
[ ] Benchmark
[ ] Other
[ ] N/A

Documentation

[ ] Existing documentation applies
[ ] Documentation updated
- [ ] Docstring
- [ ] Doxygen
- [ ] RST
- [x] Jupyter
- [ ] Other
[ ] N/A

DALI team only

Requirements

[ ] Implements new requirements
[ ] Affects existing requirements
[ ] N/A

REQ IDs: N/A

JIRA TASK: N/A

Fixes: #3426

Oct 26 '21 13:10 anonymousr007

Check out this pull request on ReviewNB

See visual diffs & provide feedback on Jupyter Notebooks.

Powered by ReviewNB

Oct 26 '21 13:10 review-notebook-app[bot]

Hi @anonymousr007,

Thank you for your contribution. Let us review it and get back to you soon.

Oct 26 '21 15:10 JanuszL

@anonymousr007 thank you for your hard work. I have added a couple of comments from my side. Let us know when the code is ready for another review round.

Oct 26 '21 15:10 JanuszL

Please review 😅, I guess something went wrong here.

Oct 26 '21 18:10 anonymousr007

Hi @anonymousr007,

The changes you made look good. Please also:

- Rework the old define_graph style to the new functional API. For reference, see the PRs https://github.com/NVIDIA/DALI/pull/2566, https://github.com/NVIDIA/DALI/pull/2721 and https://github.com/NVIDIA/DALI/pull/2577.
- Add more narrative to the example, as in the PyTorch Lightning one, which adds an introduction and explains the main steps.

Oct 26 '21 22:10 JanuszL

!build

Oct 29 '21 19:10 JanuszL

CI MESSAGE: [3295027]: BUILD STARTED

Oct 29 '21 19:10 dali-automaton

CI MESSAGE: [3295027]: BUILD PASSED

Oct 29 '21 20:10 dali-automaton

How long does it take to merge?

Oct 30 '21 14:10 anonymousr007

Hi @anonymousr007,

If CI is green and you have approval from both reviewers, it should take no more than one business day. In this case, I see that the basic tests have passed, but the more advanced ones failed:

[NbConvertApp] Converting notebook frameworks/pytorch/MNIST-catalyst-example.ipynb to notebook
Traceback (most recent call last):
  File "/usr/local/bin/jupyter-nbconvert", line 8, in <module>
    sys.exit(main())
  File "/usr/local/lib/python3.8/dist-packages/jupyter_core/application.py", line 264, in launch_instance
    return super(JupyterApp, cls).launch_instance(argv=argv, **kwargs)
  File "/usr/local/lib/python3.8/dist-packages/traitlets/config/application.py", line 846, in launch_instance
    app.start()
  File "/usr/local/lib/python3.8/dist-packages/nbconvert/nbconvertapp.py", line 361, in start
    self.convert_notebooks()
  File "/usr/local/lib/python3.8/dist-packages/nbconvert/nbconvertapp.py", line 533, in convert_notebooks
    self.convert_single_notebook(notebook_filename)
  File "/usr/local/lib/python3.8/dist-packages/nbconvert/nbconvertapp.py", line 498, in convert_single_notebook
    output, resources = self.export_single_notebook(notebook_filename, resources, input_buffer=input_buffer)
  File "/usr/local/lib/python3.8/dist-packages/nbconvert/nbconvertapp.py", line 427, in export_single_notebook
    output, resources = self.exporter.from_filename(notebook_filename, resources=resources)
  File "/usr/local/lib/python3.8/dist-packages/nbconvert/exporters/exporter.py", line 181, in from_filename
    return self.from_file(f, resources=resources, **kw)
  File "/usr/local/lib/python3.8/dist-packages/nbconvert/exporters/exporter.py", line 199, in from_file
    return self.from_notebook_node(nbformat.read(file_stream, as_version=4), resources=resources, **kw)
  File "/usr/local/lib/python3.8/dist-packages/nbconvert/exporters/notebook.py", line 32, in from_notebook_node
    nb_copy, resources = super().from_notebook_node(nb, resources, **kw)
  File "/usr/local/lib/python3.8/dist-packages/nbconvert/exporters/exporter.py", line 143, in from_notebook_node
    nb_copy, resources = self._preprocess(nb_copy, resources)
  File "/usr/local/lib/python3.8/dist-packages/nbconvert/exporters/exporter.py", line 318, in _preprocess
    nbc, resc = preprocessor(nbc, resc)
  File "/usr/local/lib/python3.8/dist-packages/nbconvert/preprocessors/base.py", line 47, in __call__
    return self.preprocess(nb, resources)
  File "/usr/local/lib/python3.8/dist-packages/nbconvert/preprocessors/execute.py", line 84, in preprocess
    self.preprocess_cell(cell, resources, index)
  File "/usr/local/lib/python3.8/dist-packages/nbconvert/preprocessors/execute.py", line 105, in preprocess_cell
    cell = self.execute_cell(cell, index, store_history=True)
  File "/usr/local/lib/python3.8/dist-packages/nbclient/util.py", line 78, in wrapped
    return just_run(coro(*args, **kwargs))
  File "/usr/local/lib/python3.8/dist-packages/nbclient/util.py", line 57, in just_run
    return loop.run_until_complete(coro)
  File "/usr/lib/python3.8/asyncio/base_events.py", line 608, in run_until_complete
    return future.result()
  File "/usr/local/lib/python3.8/dist-packages/nbclient/client.py", line 862, in async_execute_cell
    self._check_raise_for_error(cell, exec_reply)
  File "/usr/local/lib/python3.8/dist-packages/nbclient/client.py", line 765, in _check_raise_for_error
    raise CellExecutionError.from_cell_and_msg(cell, exec_reply_content)
nbclient.exceptions.CellExecutionError: An error occurred while executing the following cell:
------------------
runner = dl.SupervisedRunner()
runner.train(
    model=model,
    criterion=criterion,
    optimizer=optimizer,
    loaders=loaders,
    num_epochs=1,
    logdir="./logs",
    valid_loader="valid",
    valid_metric="loss",
    minimize_valid_metric=True,
    verbose=True,
    callbacks=[
        dl.AccuracyCallback(input_key="logits", target_key="targets", num_classes=10),
    ]
)
------------------
---------------------------------------------------------------------------
RuntimeError                              Traceback (most recent call last)
/tmp/ipykernel_4172/158253840.py in <module>
      1 runner = dl.SupervisedRunner()
      2 
----> 3 runner.train(
      4     model=model,
      5     criterion=criterion,
/usr/local/lib/python3.8/dist-packages/catalyst/runners/runner.py in train(self, loaders, model, engine, trial, criterion, optimizer, scheduler, callbacks, loggers, seed, hparams, num_epochs, logdir, valid_loader, valid_metric, minimize_valid_metric, verbose, timeit, check, overfit, load_best_on_end, fp16, amp, apex, ddp)
    513         self._load_best_on_end = load_best_on_end
    514         # run
--> 515         self.run()
    516 
    517     @torch.no_grad()
/usr/local/lib/python3.8/dist-packages/catalyst/core/runner.py in run(self)
    852             self.exception = ex
    853             self._run_event("on_experiment_end")
--> 854             self._run_event("on_exception")
    855         return self
    856 
/usr/local/lib/python3.8/dist-packages/catalyst/core/runner.py in _run_event(self, event)
    786             getattr(callback, event)(self)
    787         if _has_str_intersections(event, ("_end", "_exception")):
--> 788             getattr(self, event)(self)
    789 
    790     @abstractmethod
/usr/local/lib/python3.8/dist-packages/catalyst/core/runner.py in on_exception(self, runner)
    778     def on_exception(self, runner: "IRunner"):
    779         """Event handler."""
--> 780         raise self.exception
    781 
    782     def _run_event(self, event: str) -> None:
/usr/local/lib/python3.8/dist-packages/catalyst/core/runner.py in run(self)
    848         """
    849         try:
--> 850             self._run_experiment()
    851         except (Exception, KeyboardInterrupt) as ex:
    852             self.exception = ex
/usr/local/lib/python3.8/dist-packages/catalyst/core/runner.py in _run_experiment(self)
    838         self._run_event("on_experiment_start")
    839         for self.stage_key in self.stages:
--> 840             self.engine.spawn(self._run_stage)
    841         self._run_event("on_experiment_end")
    842 
/usr/local/lib/python3.8/dist-packages/catalyst/core/engine.py in spawn(self, fn, *args, **kwargs)
    136             wrapped function (if needed).
    137         """
--> 138         return fn(*args, **kwargs)
    139 
    140     def setup_process(self, rank: int = -1, world_size: int = 1):
/usr/local/lib/python3.8/dist-packages/catalyst/core/runner.py in _run_stage(self, rank, world_size)
    829         self._run_event("on_stage_start")
    830         while self.stage_epoch_step < self.stage_epoch_len:
--> 831             self._run_epoch()
    832             if self.need_early_stop:
    833                 self.need_early_stop = False
/usr/local/lib/python3.8/dist-packages/catalyst/core/runner.py in _run_epoch(self)
    822         self._run_event("on_epoch_start")
    823         for self.loader_key, self.loader in self.loaders.items():
--> 824             self._run_loader()
    825         self._run_event("on_epoch_end")
    826 
/usr/local/lib/python3.8/dist-packages/catalyst/core/runner.py in _run_loader(self)
    813             for self.loader_batch_step, self.batch in enumerate(self.loader):
    814                 with self.engine.autocast():
--> 815                     self._run_batch()
    816                 if self.need_early_stop:
    817                     self.need_early_stop = False
/usr/local/lib/python3.8/dist-packages/catalyst/core/runner.py in _run_batch(self)
    801     def _run_batch(self) -> None:
    802         self._run_event("on_batch_start")
--> 803         self.handle_batch(batch=self.batch)
    804         self.batch = self.engine.sync_device(self.batch)
    805         self._run_event("on_batch_end")
/usr/local/lib/python3.8/dist-packages/catalyst/runners/supervised.py in handle_batch(self, batch)
    197             batch: dictionary with data batches from DataLoader.
    198         """
--> 199         self.batch = {**batch, **self.forward(batch)}
    200 
    201 
/usr/local/lib/python3.8/dist-packages/catalyst/runners/supervised.py in forward(self, batch, **kwargs)
    180             dict with model output batch
    181         """
--> 182         output = self._process_input(batch, **kwargs)
    183         output = self._process_output(output)
    184         return output
/usr/local/lib/python3.8/dist-packages/catalyst/runners/supervised.py in _process_input_str(self, batch, **kwargs)
    143 
    144     def _process_input_str(self, batch: Mapping[str, Any], **kwargs):
--> 145         output = self.model(batch[self._input_key], **kwargs)
    146         return output
    147 
/usr/local/lib/python3.8/dist-packages/torch/nn/modules/module.py in _call_impl(self, *input, **kwargs)
    887             result = self._slow_forward(*input, **kwargs)
    888         else:
--> 889             result = self.forward(*input, **kwargs)
    890         for hook in itertools.chain(
    891                 _global_forward_hooks.values(),
/usr/local/lib/python3.8/dist-packages/torch/nn/parallel/data_parallel.py in forward(self, *inputs, **kwargs)
    165             return self.module(*inputs[0], **kwargs[0])
    166         replicas = self.replicate(self.module, self.device_ids[:len(inputs)])
--> 167         outputs = self.parallel_apply(replicas, inputs, kwargs)
    168         return self.gather(outputs, self.output_device)
    169 
/usr/local/lib/python3.8/dist-packages/torch/nn/parallel/data_parallel.py in parallel_apply(self, replicas, inputs, kwargs)
    175 
    176     def parallel_apply(self, replicas, inputs, kwargs):
--> 177         return parallel_apply(replicas, inputs, kwargs, self.device_ids[:len(replicas)])
    178 
    179     def gather(self, outputs, output_device):
/usr/local/lib/python3.8/dist-packages/torch/nn/parallel/parallel_apply.py in parallel_apply(modules, inputs, kwargs_tup, devices)
     84         output = results[i]
     85         if isinstance(output, ExceptionWrapper):
---> 86             output.reraise()
     87         outputs.append(output)
     88     return outputs
/usr/local/lib/python3.8/dist-packages/torch/_utils.py in reraise(self)
    427             # have message field
    428             raise self.exc_type(message=msg)
--> 429         raise self.exc_type(msg)
    430 
    431 
RuntimeError: Caught RuntimeError in replica 0 on device 0.
Original Traceback (most recent call last):
  File "/usr/local/lib/python3.8/dist-packages/torch/nn/parallel/parallel_apply.py", line 61, in _worker
    output = module(*input, **kwargs)
  File "/usr/local/lib/python3.8/dist-packages/torch/nn/modules/module.py", line 889, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/usr/local/lib/python3.8/dist-packages/torch/nn/modules/container.py", line 119, in forward
    input = module(input)
  File "/usr/local/lib/python3.8/dist-packages/torch/nn/modules/module.py", line 889, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/usr/local/lib/python3.8/dist-packages/torch/nn/modules/linear.py", line 94, in forward
    return F.linear(input, self.weight, self.bias)
  File "/usr/local/lib/python3.8/dist-packages/torch/nn/functional.py", line 1753, in linear
    return torch._C._nn.linear(input, weight, bias)
RuntimeError: mat1 dim 1 must match mat2 dim 0

Can you run the notebook on your side and see whether the failure is reproducible? In the meantime, please also add more description to the notebook itself, explaining what happens in each step. You can check the PyTorch Lightning example for reference.
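
For context on the final error: "mat1 dim 1 must match mat2 dim 0" is raised when the last dimension of the tensor reaching a torch.nn.Linear layer differs from that layer's in_features. The log does not show the actual shapes or the model definition, so the snippet below is only a rough sketch of that check in plain Python. The helper name `linear_input_error`, the batch shape (64, 1, 28, 28), and the in_features value 784 are assumptions chosen for illustration, not values taken from the notebook.

```python
from math import prod


def linear_input_error(input_shape, in_features, flatten=False):
    """Hypothetical helper: mimic the shape check done by a Linear layer.

    input_shape is (batch, *dims). With flatten=True, all non-batch
    dimensions are collapsed into one first, which is what placing
    torch.nn.Flatten() in front of the layer would do.
    Returns None when the shapes are compatible, otherwise a message.
    """
    shape = tuple(input_shape)
    if flatten:
        shape = (shape[0], prod(shape[1:]))
    last_dim = shape[-1]
    if last_dim != in_features:
        return (
            f"last input dim {last_dim} != in_features {in_features} "
            f"(input shape {shape})"
        )
    return None


# Assumed MNIST-like batch: 64 single-channel 28x28 images.
batch_shape = (64, 1, 28, 28)

# Fed directly to a layer expecting 784 features, the last dim (28) mismatches.
print(linear_input_error(batch_shape, in_features=784))

# After flattening, 1 * 28 * 28 = 784 features and the check passes.
print(linear_input_error(batch_shape, in_features=784, flatten=True))
```

When reproducing the failure, printing the shape of one batch from each loader and comparing it with the first layer of the model is usually enough to see which side needs to change.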

Nov 02 '21 09:11 JanuszL

I'm closing this pull request. Let us know if you still want to work on it.

Sep 12 '22 11:09 JanuszL