Why is missing data handled differently in the middle of the range than at the start and end?

Open wangyuelucky opened this issue 3 years ago • 2 comments

Hello, I have a question about the following scenario.

https://github.com/microsoft/qlib/blob/687edd79d0ee75fbf61bf1c1198ac130ef8f5b5c/scripts/dump_bin.py#L199-L202

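As the code above shows, when values are missing at the start or end of the calendars, no placeholder values are written at all; only arr[0] records the start_index. When values are missing in the middle of the calendars, however, they are filled with NaN.

What is the reason for handling it this way? What would be the impact of filling everything with NaN?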

Example 1:

calendars: 2017-01-01 2017-01-02 2017-01-03 2017-01-04 2017-01-05

The data for stock 0000001 is: 2017-01-01 5.6 2017-01-02 5.8 2017-01-05 5.9

The generated bin content is: [5.6, 5.8, nan, nan, 5.9]

Example 2:

calendars: 2017-01-01 2017-01-02 2017-01-03 2017-01-04 2017-01-05

The data for stock 0000001 is: 2017-01-02 5.6 2017-01-03 5.8 2017-01-04 5.9

The generated bin content is: [1, 5.6, 5.8, 5.9]

Expected

Would it be acceptable for the bin content generated in Example 2 to be [nan, 5.6, 5.8, 5.9, nan]?
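For reference, the layout described above can be sketched roughly as follows. This is a minimal illustration, not the code from dump_bin.py: `to_bin_array` is a hypothetical helper, and it assumes that the first element is always the start index, that every data date appears in the calendar, and that values are stored as little-endian float32.

```python
import numpy as np

def to_bin_array(calendar, series):
    """Hypothetical helper: lay out one feature as [start_index, v0, v1, ...]."""
    dates = sorted(series)
    start = calendar.index(dates[0])
    end = calendar.index(dates[-1])
    # Keep only the calendar days between the first and last observation;
    # days without data inside that window become NaN.
    values = [series.get(day, np.nan) for day in calendar[start:end + 1]]
    # Prepend the start index and store everything as little-endian float32.
    return np.hstack([[start], values]).astype("<f")

calendar = ["2017-01-01", "2017-01-02", "2017-01-03", "2017-01-04", "2017-01-05"]
example1 = to_bin_array(
    calendar, {"2017-01-01": 5.6, "2017-01-02": 5.8, "2017-01-05": 5.9}
)
example2 = to_bin_array(
    calendar, {"2017-01-02": 5.6, "2017-01-03": 5.8, "2017-01-04": 5.9}
)
print(example1)
print(example2)
```

Under these assumptions, Example 1 would also carry a leading start index of 0, and the two NaN values fall inside the stored window. In Example 2 nothing is missing inside the window, so no NaN is written and the start index 1 alone records where the data begins.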

wangyuelucky avatar Jul 26 '22 11:07 wangyuelucky

Has anyone run into a similar situation?

wangyuelucky avatar Aug 04 '22 01:08 wangyuelucky

I assume the reason feature values in the middle of the date range need to be NaN-filled is that the binary database requires the .bin file to be "compact" with regard to the calendar, so that the offset into the file can be easily calculated.

However, gaps at the beginning or the end of the .bin file do not break the "compactness" requirement.
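The offset calculation can be illustrated with a small sketch. `read_value` is a hypothetical reader, not qlib's actual implementation; it assumes the first element of the array is the start index and that positions outside the stored range are reported as NaN.

```python
import numpy as np

def read_value(bin_array, calendar_index):
    """Hypothetical reader: fetch one value by calendar position."""
    start = int(bin_array[0])
    # +1 skips the start-index header stored in the first slot.
    offset = calendar_index - start + 1
    if offset < 1 or offset >= len(bin_array):
        # Before the first or after the last stored day: nothing on disk.
        return float("nan")
    return float(bin_array[offset])

# Example 2 from the question: data starts at calendar index 1.
stored = np.array([1, 5.6, 5.8, 5.9], dtype="<f")
values = [read_value(stored, i) for i in range(5)]
print(values)
```

Because the stored values are contiguous, a single subtraction locates any day. A gap in the middle without NaN placeholders would shift every later value, whereas gaps at either end are already implied by the start index and the array length.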

pop0121 avatar Sep 13 '22 13:09 pop0121

This issue is stale because it has been open for three months with no activity. Remove the stale label or comment on the issue otherwise this will be closed in 5 days

github-actions[bot] avatar Dec 12 '22 15:12 github-actions[bot]