qlib
Improve pit performance
Description
see https://github.com/microsoft/qlib/issues/1671
Consider PIT (point-in-time) data. Assume we have `T` trade days and `N` `report_period` records:
| | date | report_period | value |
|---|---|---|---|
| 0 | 2011-10-18 00:00:00 | 201103 | 0.318919 |
| 1 | 2012-03-23 00:00:00 | 201104 | 0.4039 |
| 2 | 2012-04-11 00:00:00 | 201004 | 0.403925 |
| 3 | 2012-04-11 00:00:00 | 200904 | 0.403925 |
We access the PIT table in 3 ways:
1. Observe the latest data on each trade day.
   Loop through the table once and keep only the latest `report_date` value. This costs O(N).
2. Observe the latest several `report_period` values, for expressions like `P(Mean($$roewa_q, 2))`.
   Read the data file once, then:
   - Loop through the trade days and slice `data[:tradeday]`.
   - Group by `report_period` and take the last item of each group.
   - Return the last `X` items.

   The algorithm could be improved by looping backward from the end until `X` different periods are found, but `groupby` uses a C-level loop, which should be faster.
3. Observe a specific period on each trade day.
   Get all data belonging to the given period.
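
To make the three access patterns concrete, here is a minimal pure-Python sketch over the sample table above. It is not the code in this PR: the function names (`latest_value`, `last_periods`, `period_value`) and the tuple layout are hypothetical, a plain dict stands in for the pandas `groupby(...).last()` step, and "latest" in way 1 is assumed to mean the newest `report_period` published on or before the trade day.

```python
from bisect import bisect_right
from datetime import date

# (publication date, report_period, value), sorted by publication date.
RECORDS = [
    (date(2011, 10, 18), 201103, 0.318919),
    (date(2012, 3, 23), 201104, 0.4039),
    (date(2012, 4, 11), 201004, 0.403925),
    (date(2012, 4, 11), 200904, 0.403925),
]


def latest_value(records, trade_days):
    """Way 1: single forward pass keeping the newest report_period seen so far."""
    out, i, best_period, best_value = [], 0, -1, None
    for day in trade_days:  # trade_days must be sorted ascending
        while i < len(records) and records[i][0] <= day:
            _, period, value = records[i]
            if period >= best_period:  # a later revision of the same period wins
                best_period, best_value = period, value
            i += 1
        out.append(best_value)  # None until the first record is published
    return out


def last_periods(records, day, x):
    """Way 2: latest value of each of the newest x (x >= 1) periods visible on `day`."""
    end = bisect_right([r[0] for r in records], day)  # plays the role of data[:tradeday]
    last_by_period = {}
    for _, period, value in records[:end]:  # later rows overwrite earlier ones
        last_by_period[period] = value
    return sorted(last_by_period.items())[-x:]


def period_value(records, trade_days, period):
    """Way 3: value of one fixed period as known on each trade day."""
    rows = [r for r in records if r[1] == period]
    return latest_value(rows, trade_days)
```

For example, `last_periods(RECORDS, date(2012, 4, 11), 2)` returns the entries for periods 201103 and 201104: the two rows published on 2012-04-11 belong to older periods, so they do not displace the newest two.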
How Has This Been Tested?
- [x] Pass the test by running: `pytest qlib/tests/test_all_pipeline.py` under the upper directory of `qlib`.
- [x] If you are adding a new feature, test on your own test scripts.
Screenshots of Test Results (if appropriate):
Types of changes
- [x] Fix bugs
- [x] Add new feature
- [ ] Update documentation
Can anyone fix the main branch? CI fails because of a problem on the main branch.
It seems that the index file mentioned here and the _next column in the data file will not be used in this PR. Are you going to delete them together?
https://github.com/microsoft/qlib/blob/98f569eed2252cc7fad0c120cad44f6181c3acf6/scripts/dump_pit.py#L198-L204
The whole `dump_pit.py` should be rewritten since we implement `FilePitStorage`. So the dump file should now look like:
s = FilePitStorage("000001.SZ", "ROE")
s.write(np_data)
@Fivele-Li, I think rewriting the dump scripts could be done in another PR, since the normal feature dump script should also be rewritten using `LocalFeatureStorage` and `LocalCalendarStorage`.