qlib Improve pit performance

Improve pit performance

Open PaleNeutron opened this issue 8 months ago • 3 comments

Description

see https://github.com/microsoft/qlib/issues/1671

Consider PIT (point-in-time) data. Assume we have T trade days and N `report_period` records:

|   | date | report_period | value |
|---|---|---|---|
| 0 | 2011-10-18 00:00:00 | 201103 | 0.318919 |
| 1 | 2012-03-23 00:00:00 | 201104 | 0.4039 |
| 2 | 2012-04-11 00:00:00 | 201004 | 0.403925 |
| 3 | 2012-04-11 00:00:00 | 200904 | 0.403925 |

We access the PIT table in three ways:

1. observe the latest data on each trade day

Loop through the table once and keep only the latest report_date value. This takes O(N) time.
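The single pass can be sketched as below. This is only an illustration, not the code in this PR: the function name `latest_value_per_trade_day` is hypothetical, the records are assumed to be in a pandas DataFrame already sorted by `date`, and "latest" is assumed to mean the most recent `report_period` published on or before the trade day.

```python
import pandas as pd

def latest_value_per_trade_day(pit: pd.DataFrame, trade_days) -> pd.Series:
    """Hypothetical helper: one forward pass over records sorted by `date`."""
    dates = pit["date"].to_numpy()
    periods = pit["report_period"].to_numpy()
    values = pit["value"].to_numpy()
    days = pd.DatetimeIndex(trade_days)  # assumed sorted ascending

    out = []
    i = 0
    best_period, best_value = None, float("nan")
    for day in days.to_numpy():
        # Consume every record published on or before this trade day.
        while i < len(dates) and dates[i] <= day:
            # Assumption: a later period wins; a revision of the same period
            # replaces the earlier value; revisions of older periods are ignored.
            if best_period is None or periods[i] >= best_period:
                best_period, best_value = periods[i], values[i]
            i += 1
        out.append(best_value)
    return pd.Series(out, index=days, name="value")

# The sample table from the description.
pit = pd.DataFrame({
    "date": pd.to_datetime(["2011-10-18", "2012-03-23", "2012-04-11", "2012-04-11"]),
    "report_period": [201103, 201104, 201004, 200904],
    "value": [0.318919, 0.4039, 0.403925, 0.403925],
})
latest = latest_value_per_trade_day(
    pit, ["2011-10-17", "2011-10-18", "2012-03-23", "2012-04-11"]
)
```

Each record and each trade day is visited once, so the cost is linear in N plus T. Trade days before the first publication get NaN.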

2. observe the latest few `report_period` values, for expressions like `P(Mean($$roewa_q, 2))`

Read the data file once.

Loop through the trade days and, for each one, slice `data[:tradeday]`, then:
- group by `report_period` and take the last item of each group;
- return the last X items.

The algorithm could be improved by looping back from the end until X different periods are found. However, groupby uses a C-level loop, which should be faster.
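A minimal pandas sketch of the slice-and-groupby approach follows. The function name `last_periods_per_trade_day` and the in-memory DataFrame are assumptions made for illustration; the PR reads the data from its own storage layer. Note that pandas `groupby` sorts group keys by default, so `tail(x)` here returns the x largest periods visible on that day.

```python
import pandas as pd

def last_periods_per_trade_day(pit: pd.DataFrame, trade_days, x: int) -> dict:
    """Hypothetical helper: latest value of the last `x` periods, per trade day."""
    # A stable sort keeps same-day records in their original order.
    pit = pit.sort_values("date", kind="stable")
    dates = pit["date"].to_numpy()
    result = {}
    for day in pd.DatetimeIndex(trade_days):
        # Rows [0, end) were published on or before `day`.
        end = dates.searchsorted(day.to_datetime64(), side="right")
        visible = pit.iloc[:end]
        # Last published value for each report_period, keys in ascending order.
        latest = visible.groupby("report_period")["value"].last()
        result[day] = latest.tail(x)
    return result

# The sample table from the description.
pit = pd.DataFrame({
    "date": pd.to_datetime(["2011-10-18", "2012-03-23", "2012-04-11", "2012-04-11"]),
    "report_period": [201103, 201104, 201004, 200904],
    "value": [0.318919, 0.4039, 0.403925, 0.403925],
})
windows = last_periods_per_trade_day(pit, ["2011-10-17", "2011-10-18", "2012-04-11"], x=2)
# An expression such as Mean(..., 2) would then reduce each window:
mean_last_two = windows[pd.Timestamp("2012-04-11")].mean()
```

On 2012-04-11 the window holds periods 201103 and 201104, because the two revisions published that day belong to older periods.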

3. observe a specific period from each trade day

Get all data belonging to the given period.
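One way to picture this access pattern, assuming the same in-memory DataFrame layout as the table above: filter the rows of the requested period, then look up the last value published on or before each trade day. The name `period_value_per_trade_day` is hypothetical and this is a sketch rather than the PR's implementation.

```python
import pandas as pd

def period_value_per_trade_day(pit: pd.DataFrame, trade_days, period: int) -> pd.Series:
    """Hypothetical helper: value of one report_period as seen on each trade day."""
    # Keep only the rows of the requested period, in publication order.
    rows = pit.loc[pit["report_period"] == period].sort_values("date", kind="stable")
    days = pd.DatetimeIndex(trade_days)
    # Index of the last row published on or before each trade day (-1 if none).
    pos = rows["date"].to_numpy().searchsorted(days.to_numpy(), side="right") - 1
    values = rows["value"].to_numpy()
    out = [values[p] if p >= 0 else float("nan") for p in pos]
    return pd.Series(out, index=days, name="value")

# The sample table from the description.
pit = pd.DataFrame({
    "date": pd.to_datetime(["2011-10-18", "2012-03-23", "2012-04-11", "2012-04-11"]),
    "report_period": [201103, 201104, 201004, 200904],
    "value": [0.318919, 0.4039, 0.403925, 0.403925],
})
seen = period_value_per_trade_day(pit, ["2011-10-18", "2012-03-23", "2012-04-11"], 201104)
```

Trade days before the period's first publication come back as NaN, and later revisions of the same period would replace the earlier value from their publication date onward.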

How Has This Been Tested?

[x] Pass the test by running: pytest qlib/tests/test_all_pipeline.py under upper directory of qlib.
[x] If you are adding a new feature, test on your own test scripts.

Screenshots of Test Results (if appropriate):

Types of changes

[x] Fix bugs
[x] Add new feature
[ ] Update documentation

Oct 20 '23 03:10 PaleNeutron

Can anyone fix the main branch? CI fails because of a problem on the main branch.

Nov 09 '23 12:11 PaleNeutron

It seems that the index file mentioned here and the _next column in the data file will not be used in this PR. Are you going to delete them together?

https://github.com/microsoft/qlib/blob/98f569eed2252cc7fad0c120cad44f6181c3acf6/scripts/dump_pit.py#L198-L204

The whole dump_pit.py should be rewritten since we now implement FilePitStorage. The dump script should then look like:

s = FilePitStorage("000001.SZ", "ROE")
s.write(np_data)

Nov 28 '23 10:11 PaleNeutron

@Fivele-Li, I think rewriting the dump scripts could be done in another PR, since the normal feature dump script should also be rewritten using LocalFeatureStorage and LocalCalendarStorage.

Dec 07 '23 06:12 PaleNeutron
