returnn
returnn copied to clipboard
Allow postponing dataset integrity checks in `NextGenHDFDataset`
Follow-up from #1315, where @JackTemaki noticed the NextGenHDFDataset
goes through the entire data on startup to perform integrity checks -- and indeed RETURNN startup w/ that dataset is quite slow. This PR allows delaying these checks to training time, saving startup time at the cost of detecting data errors only later on.
Why do you put "chore:" into the description? Where do you get this from?
This is inconsistent to our normal commit messages, so we should not use this.
Does this really need to be a new option? I personally prefer to have fewer options if possible, esp if not really needed. Can't we just always do those integrity checks lazily?
Please check failing tests.
returnn/datasets/hdf.py:548: WEAK WARNING PyIncorrectDocstringInspection: Missing parameter data in docstring
returnn/datasets/hdf.py:548: WEAK WARNING PyIncorrectDocstringInspection: Missing parameter seq_name in docstring
returnn/datasets/hdf.py:550: WARNING PyUnresolvedReferencesInspection: Function 'check_data_integrity' does not have a parameter 'numpy'
returnn/datasets/hdf.py:550: WEAK WARNING PyIncorrectDocstringInspection: Unexpected parameter numpy in docstring
returnn/datasets/hdf.py:551: WARNING PyUnresolvedReferencesInspection: Function 'check_data_integrity' does not have a parameter 'str'
returnn/datasets/hdf.py:551: WEAK WARNING PyIncorrectDocstringInspection: Unexpected parameter str in docstring
returnn/datasets/hdf.py:603: WEAK WARNING PyIncorrectDocstringInspection: Missing parameter data in docstring
returnn/datasets/hdf.py:603: WEAK WARNING PyIncorrectDocstringInspection: Missing parameter seq_name in docstring
returnn/datasets/hdf.py:605: WARNING PyUnresolvedReferencesInspection: Function 'check_data_integrity' does not have a parameter 'numpy'
returnn/datasets/hdf.py:605: WEAK WARNING PyIncorrectDocstringInspection: Unexpected parameter numpy in docstring
returnn/datasets/hdf.py:606: WARNING PyUnresolvedReferencesInspection: Function 'check_data_integrity' does not have a parameter 'str'
returnn/datasets/hdf.py:606: WEAK WARNING PyIncorrectDocstringInspection: Unexpected parameter str in docstring
returnn/datasets/hdf.py:668: WEAK WARNING PyIncorrectDocstringInspection: Missing parameter data in docstring
returnn/datasets/hdf.py:668: WEAK WARNING PyIncorrectDocstringInspection: Missing parameter seq_name in docstring
returnn/datasets/hdf.py:670: WARNING PyUnresolvedReferencesInspection: Function 'check_data_integrity' does not have a parameter 'numpy'
returnn/datasets/hdf.py:670: WEAK WARNING PyIncorrectDocstringInspection: Unexpected parameter numpy in docstring
returnn/datasets/hdf.py:671: WARNING PyUnresolvedReferencesInspection: Function 'check_data_integrity' does not have a parameter 'str'
returnn/datasets/hdf.py:671: WEAK WARNING PyIncorrectDocstringInspection: Unexpected parameter str in docstring
Please use PyCharm to directly see those inspections. In general, be consistent to other code.
I guess @JackTemaki and/or @patrick-wilken should otherwise review this.
Can't we just always do those integrity checks lazily?
I don't really have an opinion for or against that here -- it's mainly detecting data format errors early vs. saving time. If you're sure your data is good there is no reason to do these checks eagerly.