Loading high frequency data is too slow.

Open heury opened this issue 3 years ago • 7 comments

I am planning to use qlib for intraday work.

I update the 1min data in real time to reflect the stock market, and I tried to load the last 10 minutes of data using the following code.

df = D.features(D.instruments('csi300'), ['$open', '$high', '$low', '$close', '$factor'], start_time='2020-09-14 09:30:00', end_time='2021-09-14 09:40:00', freq="1min")

I expected it to take less than 0.5s, but it takes more than 30s. That is much slower than a traditional database. What did I do wrong?

heury • Apr 16 '22 03:04

I tried your code and it takes less than 0.5s, so something must be wrong. Could you try setting dataset_cache=None when you init qlib? Also, can you show me all of your code?

bxdd • Apr 21 '22 08:04

I got the same result when I set dataset_cache=None in the init step. I used the following code to test it.

import time
import unittest

import qlib
from qlib.constant import REG_CN
from qlib.data import D


class MyTestCase(unittest.TestCase):
    def setUp(self):
        provider_uri = "~/.qlib/qlib_data/cn_data_1min"  # target_dir
        qlib.init(provider_uri=provider_uri, region=REG_CN, dataset_cache=None)

    def test_dataset(self):
        start = time.time()

        df = D.features(D.instruments('csi300'), ['$open', '$high', '$low', '$close', '$factor'], start_time='2020-09-14 09:30:00',
                        end_time='2021-09-14 09:40:00', freq="1min")

        # Wall-clock seconds for the single D.features call
        print(time.time()-start)
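
A single `time.time()` reading around the first call also includes any one-off costs, such as starting worker processes or filling caches. Below is a minimal sketch, assuming nothing about qlib itself, that repeats a call and reports each run separately so the first and later timings can be compared. `time_calls` and the stand-in workload are hypothetical names used only for illustration.

```python
import time
from typing import Callable, List


def time_calls(fn: Callable[[], object], repeats: int = 3) -> List[float]:
    """Call fn() `repeats` times and return the wall-clock seconds of each call."""
    durations = []
    for _ in range(repeats):
        start = time.perf_counter()  # monotonic clock, finer than time.time()
        fn()
        durations.append(time.perf_counter() - start)
    return durations


if __name__ == "__main__":
    # Stand-in workload; in the test above this would be a lambda
    # wrapping the D.features(...) call.
    runs = time_calls(lambda: sum(range(100_000)), repeats=3)
    for i, seconds in enumerate(runs):
        print(f"run {i}: {seconds:.4f}s")
```

If only the first run is slow, the cost is probably start-up rather than data loading; if every run is slow, the cost is paid on each call.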

heury • Apr 21 '22 10:04

How about setting expression_cache=None as well? And what hardware (memory, CPU) are you running the code on?

bxdd • Apr 25 '22 08:04

I also got the same result with expression_cache=None. I ran a profiling tool on the code above and got the following results.

[image: profiling results]

My system information is as follows.

CPU: AMD Ryzen 9 3900X 12-Core Processor 3.80 GHz
RAM: 64.0 GB
OS: Windows 10 Pro 64-bit

heury • Apr 26 '22 22:04

You can try running it on Linux. For high-frequency code, I think Linux may be the better choice.

bxdd • Apr 30 '22 07:04

OK, I will try it. Could you explain in more detail? I wonder why Linux is better than Windows for high-frequency code. I noticed that thread locks take a lot of time in the profiling result. Is that related to Windows?
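
On the thread-lock observation, one possible reading, offered as an assumption rather than a diagnosis of this case: time attributed to a lock `acquire` in a `cProfile` result often means the profiled thread was blocked waiting for work done elsewhere, because `cProfile` only records the thread it was enabled in. The sketch below reproduces that pattern with the standard library only. `wait_on_worker` and `profile_report` are hypothetical names.

```python
import cProfile
import io
import pstats
import time
from concurrent.futures import ThreadPoolExecutor


def wait_on_worker(delay: float = 0.2) -> None:
    """Hand work to another thread, then block until it finishes."""
    with ThreadPoolExecutor(max_workers=1) as pool:
        future = pool.submit(time.sleep, delay)
        future.result()  # the calling thread blocks here


def profile_report(top: int = 5) -> str:
    """Profile wait_on_worker() and return the top entries sorted by own time."""
    profiler = cProfile.Profile()
    profiler.enable()
    wait_on_worker()
    profiler.disable()
    buffer = io.StringIO()
    pstats.Stats(profiler, stream=buffer).sort_stats("tottime").print_stats(top)
    return buffer.getvalue()


if __name__ == "__main__":
    # The waiting time is expected to show up under a lock acquire entry,
    # not under time.sleep, which ran in the worker thread.
    print(profile_report())
```

If the real profile looks similar, the lock time would be a symptom of waiting on workers, and the slow part would have to be found by profiling inside the workers or by running with a single worker.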

heury • May 02 '22 08:05

Sorry, I don't know much about how Windows manages processes, but I can guess at a few possible reasons:

  1. Windows IPC (interprocess communication) may be less efficient, so transferring high-frequency data between parent and child processes takes more time.
  2. Process switching on Windows may be slower.
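
The first guess can be checked in part without qlib. When worker processes return results through Python's multiprocessing machinery, the data is pickled in the child and unpickled in the parent, so serialization is one component of the transfer cost. Below is a rough sketch that times this round trip for a synthetic payload. The payload shape (300 instruments, 240 one-minute bars, 5 fields) and the function names are assumptions chosen for illustration, and a plain list of floats will not pickle the same way as a real DataFrame.

```python
import pickle
import random
import time
from typing import List, Tuple


def make_payload(instruments: int = 300, bars: int = 240, fields: int = 5) -> List[List[float]]:
    """Build a synthetic table: one row per (instrument, bar), one float per field."""
    rng = random.Random(0)  # fixed seed so runs are comparable
    return [[rng.random() for _ in range(fields)] for _ in range(instruments * bars)]


def time_pickle_roundtrip(payload: object) -> Tuple[float, int]:
    """Return (seconds spent in dumps + loads, size of the pickled bytes)."""
    start = time.perf_counter()
    blob = pickle.dumps(payload, protocol=pickle.HIGHEST_PROTOCOL)
    restored = pickle.loads(blob)
    elapsed = time.perf_counter() - start
    assert restored == payload  # sanity check: the round trip is lossless
    return elapsed, len(blob)


if __name__ == "__main__":
    seconds, size = time_pickle_roundtrip(make_payload())
    print(f"round trip: {seconds:.4f}s for {size} bytes")
```

Running the same script on Windows and on Linux gives one data point on serialization alone; it does not cover pipe transfer or process start-up, which would need a separate measurement.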

There may be other causes, but each of them can be tested. If you have any ideas, please share them with me.

bxdd • Jun 03 '22 07:06

This issue is stale because it has been open for three months with no activity. Remove the stale label or comment on the issue; otherwise it will be closed in 5 days.

github-actions[bot] • Sep 01 '22 09:09