returnn Allow postponing dataset integrity checks in `NextGenHDFDataset`

Follow-up from #1315, where @JackTemaki noticed the NextGenHDFDataset goes through the entire data on startup to perform integrity checks -- and indeed RETURNN startup w/ that dataset is quite slow. This PR allows delaying these checks to training time, saving startup time at the cost of detecting data errors only later on.

May 05 '23 13:05 NeoLegends

Why do you put "chore:" into the description? Where do you get this from?

This is inconsistent to our normal commit messages, so we should not use this.

May 08 '23 11:05 albertz

Does this really need to be a new option? I personally prefer to have fewer options if possible, esp if not really needed. Can't we just always do those integrity checks lazily?

May 08 '23 11:05 albertz

Please check failing tests.

  returnn/datasets/hdf.py:548: WEAK WARNING PyIncorrectDocstringInspection: Missing parameter data in docstring
  returnn/datasets/hdf.py:548: WEAK WARNING PyIncorrectDocstringInspection: Missing parameter seq_name in docstring
  returnn/datasets/hdf.py:550: WARNING PyUnresolvedReferencesInspection: Function 'check_data_integrity' does not have a parameter 'numpy'
  returnn/datasets/hdf.py:550: WEAK WARNING PyIncorrectDocstringInspection: Unexpected parameter numpy in docstring
  returnn/datasets/hdf.py:551: WARNING PyUnresolvedReferencesInspection: Function 'check_data_integrity' does not have a parameter 'str'
  returnn/datasets/hdf.py:551: WEAK WARNING PyIncorrectDocstringInspection: Unexpected parameter str in docstring
  returnn/datasets/hdf.py:603: WEAK WARNING PyIncorrectDocstringInspection: Missing parameter data in docstring
  returnn/datasets/hdf.py:603: WEAK WARNING PyIncorrectDocstringInspection: Missing parameter seq_name in docstring
  returnn/datasets/hdf.py:605: WARNING PyUnresolvedReferencesInspection: Function 'check_data_integrity' does not have a parameter 'numpy'
  returnn/datasets/hdf.py:605: WEAK WARNING PyIncorrectDocstringInspection: Unexpected parameter numpy in docstring
  returnn/datasets/hdf.py:606: WARNING PyUnresolvedReferencesInspection: Function 'check_data_integrity' does not have a parameter 'str'
  returnn/datasets/hdf.py:606: WEAK WARNING PyIncorrectDocstringInspection: Unexpected parameter str in docstring
  returnn/datasets/hdf.py:668: WEAK WARNING PyIncorrectDocstringInspection: Missing parameter data in docstring
  returnn/datasets/hdf.py:668: WEAK WARNING PyIncorrectDocstringInspection: Missing parameter seq_name in docstring
  returnn/datasets/hdf.py:670: WARNING PyUnresolvedReferencesInspection: Function 'check_data_integrity' does not have a parameter 'numpy'
  returnn/datasets/hdf.py:670: WEAK WARNING PyIncorrectDocstringInspection: Unexpected parameter numpy in docstring
  returnn/datasets/hdf.py:671: WARNING PyUnresolvedReferencesInspection: Function 'check_data_integrity' does not have a parameter 'str'
  returnn/datasets/hdf.py:671: WEAK WARNING PyIncorrectDocstringInspection: Unexpected parameter str in docstring

Please use PyCharm to directly see those inspections. In general, be consistent to other code.

May 08 '23 11:05 albertz

I guess @JackTemaki and/or @patrick-wilken should otherwise review this.

May 08 '23 11:05 albertz

Can't we just always do those integrity checks lazily?

I don't really have an opinion for or against that here -- it's mainly detecting data format errors early vs. saving time. If you're sure your data is good there is no reason to do these checks eagerly.

May 08 '23 12:05 NeoLegends

returnn returnn copied to clipboard

Allow postponing dataset integrity checks in `NextGenHDFDataset`

returnn
returnn copied to clipboard