Improve pit performance

Open PaleNeutron opened this issue 8 months ago • 3 comments

Description

see https://github.com/microsoft/qlib/issues/1671

Consider PIT (point-in-time) data, and assume we have T trade days and N report_period records:

   date                 report_period  value
0  2011-10-18 00:00:00  201103         0.318919
1  2012-03-23 00:00:00  201104         0.4039
2  2012-04-11 00:00:00  201004         0.403925
3  2012-04-11 00:00:00  200904         0.403925

We access the PIT table in 3 ways:

1. Observe the latest data on each trade day

Just loop through the table once and keep only the latest report_date value. This takes O(N).
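A minimal sketch of this single pass, in plain Python rather than the PR's actual code. It assumes that "latest" means the greatest report_period published on or before the trade day, that records are sorted by date, and that dates are ISO strings; the function name `latest_per_trade_day` and the `RECORDS` list are hypothetical.

```python
# Sample rows from the table above: (date, report_period, value), sorted by date.
RECORDS = [
    ("2011-10-18", 201103, 0.318919),
    ("2012-03-23", 201104, 0.4039),
    ("2012-04-11", 201004, 0.403925),
    ("2012-04-11", 200904, 0.403925),
]


def latest_per_trade_day(trade_days, records):
    """Return {trade_day: value of the latest known report_period}.

    Both inputs must be sorted ascending by date. Each record is visited
    once, so the cost is O(T + N).
    """
    result = {}
    i = 0
    best = None  # (report_period, value) of the latest period seen so far
    for day in trade_days:
        # Consume every record published on or before this trade day.
        while i < len(records) and records[i][0] <= day:
            _, period, value = records[i]
            # A newer period, or a revision of the current one, replaces best.
            if best is None or period >= best[0]:
                best = (period, value)
            i += 1
        result[day] = best[1] if best is not None else None
    return result


if __name__ == "__main__":
    days = ["2011-10-17", "2011-10-18", "2012-03-23", "2012-04-11"]
    for day, value in latest_per_trade_day(days, RECORDS).items():
        print(day, value)
```

Under this assumption, the late revisions of 201004 and 200904 on 2012-04-11 do not displace the 201104 value, because their periods are older.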

2. Observe the latest several report_period values, for expressions like P(Mean($$roewa_q, 2))

Read the data file once.

  • Loop through the trade days and slice data[:tradeday].
    • Group by report_period and take the last item of each group.
    • Return the last X items.

The algorithm could be improved by looping back from the end until X different periods are found. However, groupby uses a C-level loop, which should be faster.
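The slice, group, and take-last-X steps can be illustrated as follows. This is only a sketch of the logic: it uses a plain dict in place of the C-level groupby mentioned above, assumes records are sorted by date with ISO date strings, and the names `last_x_periods` and `RECORDS` are hypothetical.

```python
from statistics import mean

# Sample rows from the table above: (date, report_period, value), sorted by date.
RECORDS = [
    ("2011-10-18", 201103, 0.318919),
    ("2012-03-23", 201104, 0.4039),
    ("2012-04-11", 201004, 0.403925),
    ("2012-04-11", 200904, 0.403925),
]


def last_x_periods(records, trade_day, x):
    """Return [(report_period, value)] for the X latest periods known on trade_day."""
    latest = {}
    for date, period, value in records:
        if date > trade_day:
            break  # records are sorted, so nothing later is visible yet
        # Equivalent of "group by report_period, keep the last item":
        # a later row for the same period overwrites the earlier one.
        latest[period] = value
    # Sort by period and keep the last X groups.
    return [(p, latest[p]) for p in sorted(latest)[-x:]]


if __name__ == "__main__":
    rows = last_x_periods(RECORDS, "2012-04-11", 2)
    print(rows)
    # Mean over the two latest periods, in the spirit of Mean($$roewa_q, 2).
    print(mean(v for _, v in rows))
```

Sorting by report_period before taking the last X items matters here: the two rows published on 2012-04-11 belong to older periods and should not be counted among the latest two.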

3. Observe a specific period from each trade day

Get all data belonging to the given period.
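One way this lookup might work is sketched below. The filter by period comes from the description; the per-trade-day step (taking the most recent revision published on or before each day) is an assumption about the intended semantics, and the names `period_series` and `RECORDS` are hypothetical.

```python
from bisect import bisect_right

# Sample rows from the table above: (date, report_period, value), sorted by date.
RECORDS = [
    ("2011-10-18", 201103, 0.318919),
    ("2012-03-23", 201104, 0.4039),
    ("2012-04-11", 201004, 0.403925),
    ("2012-04-11", 200904, 0.403925),
]


def period_series(records, period, trade_days):
    """Return {trade_day: latest published value of `period`, or None}."""
    # Keep only the rows of the requested period; date order is preserved.
    rows = [(date, value) for date, p, value in records if p == period]
    dates = [date for date, _ in rows]
    result = {}
    for day in trade_days:
        # Number of revisions published on or before this trade day.
        k = bisect_right(dates, day)
        result[day] = rows[k - 1][1] if k else None
    return result


if __name__ == "__main__":
    days = ["2012-03-22", "2012-03-23", "2012-04-11"]
    print(period_series(RECORDS, 201104, days))
```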

How Has This Been Tested?

  • [x] Pass the test by running: pytest qlib/tests/test_all_pipeline.py under upper directory of qlib.
  • [x] If you are adding a new feature, test on your own test scripts.

Screenshots of Test Results (if appropriate):

image

Types of changes

  • [x] Fix bugs
  • [x] Add new feature
  • [ ] Update documentation

PaleNeutron avatar Oct 20 '23 03:10 PaleNeutron

Can anyone fix the main branch? CI fails because of a problem on the main branch.

PaleNeutron avatar Nov 09 '23 12:11 PaleNeutron

It seems that the index file mentioned here and the _next column in the data file will not be used in this PR. Are you going to delete them together?

https://github.com/microsoft/qlib/blob/98f569eed2252cc7fad0c120cad44f6181c3acf6/scripts/dump_pit.py#L198-L204

The whole dump_pit.py should be rewritten now that we implement FilePitStorage. So the dump file should now look like:

s = FilePitStorage("000001.SZ", "ROE")
s.write(np_data)

PaleNeutron avatar Nov 28 '23 10:11 PaleNeutron

@Fivele-Li, I think rewriting the dump scripts could be done in another PR, since the normal feature dump script should also be rewritten using LocalFeatureStorage and LocalCalendarStorage.

PaleNeutron avatar Dec 07 '23 06:12 PaleNeutron