qlib
DRAFT add Data Health Checker
Description
First draft for a data health checker as discussed in #854. The checker receives a path to the data in CSV or qlib format (not implemented yet). It will convert the data to a DataFrame and perform basic checks for data completeness and correctness.
I am not too familiar with the qlib data handling yet, so I am hoping to get some first feedback on whether this goes in the right direction.
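To make the intended checks easier to discuss, here is a minimal sketch of the column and completeness checks on a DataFrame, assuming pandas. The function names `check_required_columns` and `check_missing_data` and the toy CSV are hypothetical and only for illustration; they are not the code in this PR. The list of required columns is taken from the error message shown further down.

```python
import io
import logging

import pandas as pd

# Required columns as listed in the checker's error message.
# The function names below are hypothetical, not part of qlib.
REQUIRED_COLUMNS = ["open", "high", "low", "close", "volume"]


def check_required_columns(df: pd.DataFrame, name: str) -> list:
    """Return the required columns that are absent from df and log an error."""
    missing = [col for col in REQUIRED_COLUMNS if col not in df.columns]
    if missing:
        logging.error(
            "%s: Missing columns %s of required columns %s.",
            name, missing, REQUIRED_COLUMNS,
        )
    return missing


def check_missing_data(df: pd.DataFrame) -> dict:
    """Count NaN cells per required column, skipping columns that are absent."""
    present = [col for col in REQUIRED_COLUMNS if col in df.columns]
    counts = df[present].isna().sum()
    return {col: int(n) for col, n in counts.items() if n > 0}


# Toy CSV (made up) mimicking #854: the column is named "vol", not "volume",
# and one "high" value is empty.
csv_text = """date,open,high,low,close,vol
2020-01-02,10.0,10.5,9.8,10.2,1000
2020-01-03,10.2,,10.0,10.4,1200
"""
df = pd.read_csv(io.StringIO(csv_text), parse_dates=["date"])
missing_columns = check_required_columns(df, "example.csv")
missing_data = check_missing_data(df)
```

In this sketch each check returns its findings instead of raising, so that one malformed file does not stop the scan of the remaining files.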
Motivation and Context
See #854. In that issue, a user got an unhelpful error message because their data did not adhere to the expected format (specifically, the "volume" column was named "vol"). When checking the data from #854 with this checker, the user would get:
[...]
ERROR:root:002645.SZ.csv: Missing columns ['volume'] of required columns ['open', 'high', 'low', 'close', 'volume'].
WARNING:root:002645.SZ.csv: Missing 'factor' column, trading unit will be disabled.
Summary of data health check (4220 files checked):
-----------------------
Problem                  Count  Affected columns
MISSING_REQUIRED_COLUMN  4220   {'volume'}
MISSING_DATA             0      -
LARGE_STEP_CHANGE        14     {'low', 'open', 'close', 'high'}
MISSING_FACTOR           4220   {'factor'}
Note: the large step change uses two configurable thresholds (one for price and one for volume) and checks only step changes in OHLCV columns.
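The step-change check from the note above could look roughly like the following sketch, assuming pandas. The threshold values, the function name `check_large_step_changes`, and the sample data are assumptions for illustration and may differ from what this PR implements; here a "step" is taken to be the absolute relative change between consecutive rows.

```python
import pandas as pd

# Hypothetical default thresholds: maximum allowed absolute relative change
# between two consecutive rows (0.5 means 50%).
PRICE_THRESHOLD = 0.5
VOLUME_THRESHOLD = 3.0


def check_large_step_changes(
    df: pd.DataFrame,
    price_threshold: float = PRICE_THRESHOLD,
    volume_threshold: float = VOLUME_THRESHOLD,
) -> set:
    """Return the OHLCV columns whose largest relative step exceeds its threshold."""
    affected = set()
    for col in ["open", "high", "low", "close", "volume"]:
        if col not in df.columns:
            continue  # absent columns are reported by a separate check
        threshold = volume_threshold if col == "volume" else price_threshold
        # Relative change against the previous row, computed by hand.
        relative_change = df[col].diff().abs() / df[col].shift().abs()
        if relative_change.max() > threshold:
            affected.add(col)
    return affected


# Made-up sample: "high" and "close" roughly double on the last row.
sample = pd.DataFrame(
    {
        "open": [10.0, 10.1, 10.3],
        "high": [10.5, 10.6, 21.0],
        "low": [9.8, 9.9, 10.0],
        "close": [10.2, 10.3, 20.4],
        "volume": [1000, 1200, 2000],
    }
)
affected_columns = check_large_step_changes(sample)
```

Using a separate, looser threshold for volume reflects that volume usually fluctuates far more from day to day than prices do.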
How Has This Been Tested?
No tests yet, as this is only a first draft.
- [ ] Pass the test by running: `pytest qlib/tests/test_all_pipeline.py` under upper directory of `qlib`.
- [ ] If you are adding a new feature, test on your own test scripts.
Screenshots of Test Results (if appropriate):
- Pipeline test:
- Your own tests:
Types of changes
- [ ] Fix bugs
- [ ] Add new feature
- [ ] Update documentation
@microsoft-github-policy-service agree
Add unit tests in qlib.test