Randomly missing rows when requesting the same ticker multiple times in a row
Hey, here's a reproducible example. A different number of rows is returned when making multiple API requests back to back. I've verified that this happens across all time intervals except 1d (at least I haven't seen the problem with interval='1d').
import yfinance as yf
import pandas as pd
# today is 2022-11-21, so I can go as far back as 2022-09-24 for 2m data.
start_date = '2022-09-24'
df_nvda_2m1 = yf.Ticker('NVDA').history(start=start_date, prepost=True, threads=False, interval='2m')
df_nvda_2m1.shape # 14671 rows
df_nvda_2m2 = yf.Ticker('NVDA').history(start=start_date, prepost=True, threads=False, interval='2m')
df_nvda_2m2.shape # 14671 rows
df_nvda_2m3 = yf.Ticker('NVDA').history(start=start_date, prepost=True, threads=False, interval='2m')
df_nvda_2m3.shape # 14593 rows
This isn't a problem for a single ticker, since I won't call the same ticker 3 times in a row within a few seconds. But because the parallelized yf.download and yf.Tickers also return missing rows, I switched to pulling multiple tickers in a for loop, and even then I still get missing rows. Is this expected? If so, any suggestions on how to avoid it? If I run time.sleep(60) before requesting the next ticker, it will take over 3 days to download all tickers for a single interval.
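For reference, the loop is roughly this (a simplified sketch; the real ticker list is much longer and the pause is adjustable):

import time
import yfinance as yf

start_date = '2022-09-24'
tickers = ['NVDA', 'AAPL', 'MSFT']  # placeholder list; the real one has far more symbols
dfs = {}
for symbol in tickers:
    # same parameters as the repro above, one call per ticker
    dfs[symbol] = yf.Ticker(symbol).history(start=start_date, prepost=True, interval='2m')
    time.sleep(5)  # short pause between tickers; sleep(60) makes the full run take days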
I cannot reproduce using that code, and I am reluctant to run it enough times to reproduce because it could be Yahoo's spam-detection system limiting requests (and I don't want to trigger it myself).
Can you reproduce on your end and either (i) show the differences, or (ii) write the differing tables to file and upload them to a fork branch?
Btw, the solution to spam detection is to stop spamming, either by caching or by paying for Plus.
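For example, something like this with requests_cache, assuming your yfinance version accepts a custom session (the symbol and cache filename are just examples):

import requests_cache
import yfinance as yf

# repeated identical requests are answered from a local sqlite cache instead of hitting Yahoo
session = requests_cache.CachedSession('yfinance.cache')
session.headers['User-agent'] = 'my-program/1.0'

t = yf.Ticker('NVDA', session=session)
df = t.history(start='2022-09-24', prepost=True, interval='2m')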
I wasn't aware the yfinance library had a premium version. I'm also not sure how to avoid the spamming, or whether I want to migrate to the caching library.
I ran the code below 3 times, and each time the first df was missing 3 rows (here t is a yf.Ticker object).
>>> df1 = t.history(prepost=True, start='2022-11-18', interval='1m', threads=False)
>>> df1.shape
(2745, 7)
>>> df2 = t.history(prepost=True, start='2022-11-18', interval='1m', threads=False)
>>> df2.shape
(2748, 7)
Now check this out
>>> df1.index.min()
Timestamp('2022-11-18 04:00:00-0500', tz='America/New_York')
>>> df1.index.max()
Timestamp('2022-11-22 19:59:00-0500', tz='America/New_York')
>>> df2.index.min()
Timestamp('2022-11-18 04:00:00-0500', tz='America/New_York')
>>> df2.index.max()
Timestamp('2022-11-22 19:59:00-0500', tz='America/New_York')
>>> [i for i in df1.index if i not in df2.index]
[]
>>> [i for i in df2.index if i not in df1.index]
[Timestamp('2022-11-18 05:26:00-0500', tz='America/New_York'), Timestamp('2022-11-21 18:14:00-0500', tz='America/New_York'), Timestamp('2022-11-22 17:05:00-0500', tz='America/New_York')]
>>> df2[df2.index.isin(['2022-11-18 05:26:00-0500', '2022-11-21 18:14:00-0500', '2022-11-22 17:05:00-0500'])]
Open High Low Close Volume Dividends Stock Splits
Datetime
2022-11-18 05:26:00-05:00 158.190 158.19 158.18 158.180 0 0 0
2022-11-21 18:14:00-05:00 153.110 153.11 153.01 153.070 0 0 0
2022-11-22 17:05:00-05:00 160.295 160.34 160.25 160.295 0 0 0
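Side note: the same check can be written more directly with pandas' Index.difference:

>>> df1.index.difference(df2.index)  # empty
>>> df2.index.difference(df1.index)  # the 3 timestamps present in df2 but missing from df1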
I meant Yahoo Plus. I've never used it, only heard of it; I suppose they whitelist IP addresses.
Your issue is very difficult to debug because only you experience it. #689 raised a similar problem but appears unsolved. Is that code ever 100% successful? E.g. after avoiding Yahoo for a few hours, or even a day?
Yep, it appears I'm not the only one having this issue. I experience it much more frequently with intraday data. Oh well, I guess this is what happens when you use a free service. Is there no fix on the yfinance library's end?
I've quickly coded up a potential fix: a simple "cache" that combines multiple fetches. Call Ticker.history() twice (or more) and eventually it should return all the data.
It's in the Git branch dev/simple-prices-cache. ~~You'll need to pip install dateutils~~ <- that should already be installed by Pandas.
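The idea is roughly this (a simplified sketch of the combining step, not the actual branch code):

import yfinance as yf

t = yf.Ticker('NVDA')
kwargs = dict(start='2022-11-18', prepost=True, interval='1m')

# each fetch can be missing a few random rows
df_a = t.history(**kwargs)
df_b = t.history(**kwargs)

# union of both indexes: rows absent from df_a are filled in from df_b
combined = df_a.combine_first(df_b)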
@melgazar9 @paulmcq Any feedback on the branch?
Hey @ValueRaider, I don't seem to have this issue anymore after upgrading. I'll close this and reach out if I notice it again. Thanks!