dump_bin DumpDataUpdate mode append data error
🐛 Bug Description
At first I used dump_bin's DumpDataAll mode to import the data, and it worked fine. Part of the imported data is as follows:

df[df['instrument']=='SH600306']
Out[35]:
      instrument   datetime    $volume   $factor    $close
41691   SH600306 2024-04-23  1022018.0  0.281253  0.686257
41692   SH600306 2024-04-24  1372334.0  0.281253  0.652507
41693   SH600306 2024-04-25   951008.0  0.281253  0.618756
41694   SH600306 2024-04-26  1968818.0  0.281253  0.587818
41695   SH600306 2024-04-29  1532764.0  0.281253  0.559693
But when I append new data with DumpDataUpdate, the result is wrong. The original (source) data after 2024-04-29 is as follows:

dfraw.loc[(dfraw['date']>'2024-04-29'),['instrument','date','close']]
Out[54]:
     instrument       date     close
4356   SH600306 2024-05-29  0.098438
4357   SH600306 2024-05-30  0.092813
4358   SH600306 2024-05-31  0.101251
4359   SH600306 2024-06-03  0.092813
4360   SH600306 2024-06-04  0.095626
4361   SH600306 2024-06-05  0.092813
4362   SH600306 2024-06-06  0.092813
4363   SH600306 2024-06-07  0.095626
4364   SH600306 2024-06-11  0.090001
4365   SH600306 2024-06-12  0.090001
4366   SH600306 2024-06-13  0.087188
4367   SH600306 2024-06-14  0.081563
Some of the data after the update is shown below. The close values that belong to 2024-05-29 through 2024-06-14 in the source data appear under 2024-04-30 through 2024-05-20 instead:

dfnew[dfnew.instrument=='SH600306']
Out[8]:
      instrument   datetime      $volume   $factor    $close
10288   SH600306 2024-04-22     363992.0  0.281253  0.722820
10289   SH600306 2024-04-23    1022018.0  0.281253  0.686257
10290   SH600306 2024-04-24    1372334.0  0.281253  0.652507
10291   SH600306 2024-04-25     951008.0  0.281253  0.618756
10292   SH600306 2024-04-26    1968818.0  0.281253  0.587818
10293   SH600306 2024-04-29    1532764.0  0.281253  0.559693
10294   SH600306 2024-04-30  188390272.0  0.281253  0.098438
10295   SH600306 2024-05-06  117053368.0  0.281253  0.092813
10296   SH600306 2024-05-07   99965448.0  0.281253  0.101251
10297   SH600306 2024-05-08   85975896.0  0.281253  0.092813
10298   SH600306 2024-05-09   46003664.0  0.281253  0.095626
10299   SH600306 2024-05-10   61825620.0  0.281253  0.092813
10300   SH600306 2024-05-13   26138518.0  0.281253  0.092813
10301   SH600306 2024-05-14   19884768.0  0.281253  0.095626
10302   SH600306 2024-05-15   24197052.0  0.281253  0.090001
10303   SH600306 2024-05-16   12483558.0  0.281253  0.090001
10304   SH600306 2024-05-17    9390678.0  0.281253  0.087188
10305   SH600306 2024-05-20   27141916.0  0.281253  0.081563
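A quick way to confirm this kind of mismatch is to compare the source rows and the imported rows pairwise. The sketch below is only an illustration: `compare_rows` is a hypothetical helper, not part of qlib, and the three rows in each list are the first three rows after 2024-04-29 copied from the outputs above.

```python
import math

# First three (date, close) rows after 2024-04-29, copied from the outputs above.
raw_rows = [("2024-05-29", 0.098438), ("2024-05-30", 0.092813), ("2024-05-31", 0.101251)]
imported_rows = [("2024-04-30", 0.098438), ("2024-05-06", 0.092813), ("2024-05-07", 0.101251)]


def compare_rows(raw, imported):
    """Return (values_match, dates_match) for two row lists in the same order."""
    values_match = len(raw) == len(imported) and all(
        math.isclose(r[1], i[1], rel_tol=1e-6) for r, i in zip(raw, imported)
    )
    dates_match = [r[0] for r in raw] == [i[0] for i in imported]
    return values_match, dates_match


values_match, dates_match = compare_rows(raw_rows, imported_rows)
print(values_match, dates_match)  # True False: same values, different dates
```

Matching values with different dates suggests the rows were written in the right order but at the wrong calendar positions.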
I debugged dump_bin.py to find the problem. After stepping through it to this point, I suspect the following code may be the cause:
def _data_to_bin(self, df: pd.DataFrame, calendar_list: List[pd.Timestamp], features_dir: Path):
    if df.empty:
        logger.warning(f"{features_dir.name} data is None or empty")
        return
    if not calendar_list:
        logger.warning("calendar_list is empty")
        return
    # align index
    _df = self.data_merge_calendar(df, calendar_list)
    if _df.empty:
        logger.warning(f"{features_dir.name} data is not in calendars")
        return
When the index is aligned, calendar_list does not contain dates such as 2024-05-06, on which SH600306 has no data, so those days do not seem to be filled in as empty values.
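To illustrate how such a shift could arise, here is a minimal sketch in plain Python. It does not use or reproduce qlib's code. It assumes only that stored values are positional, meaning the i-th stored value belongs to the i-th calendar day after the instrument's start. The calendar is an abbreviated toy one, and `append_naive`, `append_aligned`, and `as_dated` are hypothetical helpers. The close values are taken from the outputs above.

```python
import math

# Toy trading calendar (abbreviated and hypothetical; a real one lists every trading day).
calendar = ["2024-04-26", "2024-04-29", "2024-04-30", "2024-05-06", "2024-05-29", "2024-05-30"]

# Values already stored for the instrument; position i maps to calendar[start_index + i].
start_index = 0
stored = [0.587818, 0.559693]  # 2024-04-26, 2024-04-29

# New rows to append; the instrument has no rows for 2024-04-30 and 2024-05-06.
new_rows = {"2024-05-29": 0.098438, "2024-05-30": 0.092813}


def append_naive(stored, new_rows):
    """Append new values directly after the stored ones, ignoring the gap."""
    return stored + [new_rows[day] for day in sorted(new_rows)]


def append_aligned(stored, new_rows, calendar, start_index):
    """Walk every calendar day after the stored end; use NaN where no row exists."""
    next_pos = start_index + len(stored)
    last_pos = calendar.index(max(new_rows))
    padded = [new_rows.get(day, math.nan) for day in calendar[next_pos:last_pos + 1]]
    return stored + padded


def as_dated(values, calendar, start_index):
    """Read values back the way a positional reader would: by calendar offset."""
    return dict(zip(calendar[start_index:], values))


naive = as_dated(append_naive(stored, new_rows), calendar, start_index)
aligned = as_dated(append_aligned(stored, new_rows, calendar, start_index), calendar, start_index)
print(naive)    # 0.098438 is read back under 2024-04-30
print(aligned)  # 0.098438 is read back under 2024-05-29, with NaN in the gap
```

In the naive variant the value for 2024-05-29 is read back under 2024-04-30, which matches the symptom shown above. Whether dump_bin.py actually behaves this way would need to be confirmed against its source.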
My guess is that your data is not normalized, which causes this issue. I tried the following command:
python scripts/get_data.py qlib_data --target_dir <user data dir> --region cn
to download the data, confirmed that SH600306 exists in it, and then ran:
python scripts/data_collector/yahoo/collector.py update_data_to_bin --qlib_data_1d_dir <user data dir> --end_date <end date>
The incremental update on the downloaded data did not reproduce the behavior you described. I recommend using this method for incremental updates to the data.