
Randomly missing rows when calling the same ticker n number of times in a row

Open melgazar9 opened this issue 3 years ago • 6 comments

Hey, here's a reproducible example. Back-to-back API requests for the same ticker return different numbers of rows. I've seen this at every interval except 1d (at least, I haven't seen the problem with interval='1d').

import yfinance as yf
import pandas as pd

# today is 2022-11-21 so I can go as far back as 2022-09-24 for 2m data.
start_date = '2022-09-24'

df_nvda_2m1 = yf.Ticker('NVDA').history(start=start_date, prepost=True, threads=False, interval='2m')
df_nvda_2m1.shape # 14671 rows

df_nvda_2m2 = yf.Ticker('NVDA').history(start=start_date, prepost=True, threads=False, interval='2m')
df_nvda_2m2.shape # 14671 rows

df_nvda_2m3 = yf.Ticker('NVDA').history(start=start_date, prepost=True, threads=False, interval='2m')
df_nvda_2m3.shape # 14593 rows

This isn't a problem for a single ticker, since I won't call the same ticker 3 times in a row within a few seconds. But the parallelized version of yf.download and yf.Tickers also return missing rows, which is why I decided to pull multiple tickers in a for loop instead. Even then, I still get missing rows. Is this expected? If so, any suggestions on how to avoid it? If I run time.sleep(60) before each request to the next ticker, it will take over 3 days to download all tickers for just a single interval.

melgazar9 avatar Nov 22 '22 02:11 melgazar9

I cannot reproduce using that code, and I am reluctant to run it enough times to reproduce, because the cause could be Yahoo's spam detection system limiting requests (I don't want to trigger it myself).

Can you reproduce on your end and either (i) show the differences or (ii) write the differing tables to file and upload them to a fork branch?

Btw, the solution to spam detection is to stop spamming (by caching) or to pay for Plus.
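To illustrate the caching idea, here is a minimal in-memory sketch. It is not part of yfinance: `cached_history`, `_cache` and `fake_fetch` are hypothetical names, and `fake_fetch` is a stand-in that returns synthetic data so the example runs without touching Yahoo. In practice the fetch function would presumably wrap something like `yf.Ticker(ticker).history(...)`.

```python
import pandas as pd

# Hypothetical in-memory cache: (ticker, start, interval) -> DataFrame
_cache = {}

def cached_history(fetch, ticker, start, interval):
    """Return the cached frame if present; call fetch() only on a cache miss."""
    key = (ticker, start, interval)
    if key not in _cache:
        _cache[key] = fetch(ticker, start, interval)
    # Hand back a copy so callers cannot mutate the cached frame.
    return _cache[key].copy()

# Stand-in for a real request (assumption: real code would call yfinance here).
calls = []

def fake_fetch(ticker, start, interval):
    calls.append(ticker)  # record every "network" request
    idx = pd.date_range(start, periods=3, freq="2min", tz="America/New_York")
    return pd.DataFrame({"Close": [1.0, 2.0, 3.0]}, index=idx)

a = cached_history(fake_fetch, "NVDA", "2022-09-24", "2m")
b = cached_history(fake_fetch, "NVDA", "2022-09-24", "2m")
print(len(calls))  # the second call is served from the cache
```

A cache like this only avoids repeating identical requests. It does not by itself recover rows that were missing from the first response.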

ValueRaider avatar Nov 22 '22 11:11 ValueRaider

I wasn't aware yfinance library had a premium version. Also not sure how to avoid the spamming or if I want to migrate to the caching library.

I ran the code below 3 times, and each time the first df was missing 3 rows.

>>> df1 = t.history(prepost=True, start='2022-11-18', interval='1m', threads=False)
>>> df1.shape
(2745, 7)
>>> df2 = t.history(prepost=True, start='2022-11-18', interval='1m', threads=False)
>>> df2.shape
(2748, 7)

Now check this out

>>> df1.index.min()
Timestamp('2022-11-18 04:00:00-0500', tz='America/New_York')
>>> df1.index.max()
Timestamp('2022-11-22 19:59:00-0500', tz='America/New_York')
>>> df2.index.min()
Timestamp('2022-11-18 04:00:00-0500', tz='America/New_York')
>>> df2.index.max()
Timestamp('2022-11-22 19:59:00-0500', tz='America/New_York')

>>> [i for i in df1.index if i not in df2.index]
[]
>>> [i for i in df2.index if i not in df1.index]
[Timestamp('2022-11-18 05:26:00-0500', tz='America/New_York'), Timestamp('2022-11-21 18:14:00-0500', tz='America/New_York'), Timestamp('2022-11-22 17:05:00-0500', tz='America/New_York')]

>>> df2[df2.index.isin(['2022-11-18 05:26:00-0500', '2022-11-21 18:14:00-0500', '2022-11-22 17:05:00-0500'])]
                              Open    High     Low    Close  Volume  Dividends  Stock Splits
Datetime                                                                                    
2022-11-18 05:26:00-05:00  158.190  158.19  158.18  158.180       0          0             0
2022-11-21 18:14:00-05:00  153.110  153.11  153.01  153.070       0          0             0
2022-11-22 17:05:00-05:00  160.295  160.34  160.25  160.295       0          0             0
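The same comparison can be done with pandas index set operations instead of list comprehensions. The sketch below uses synthetic frames (the data and the dropped positions are made up for illustration) so it runs without any network access.

```python
import pandas as pd

tz = "America/New_York"
full = pd.date_range("2022-11-18 04:00", periods=6, freq="1min", tz=tz)

# df2 plays the role of the complete fetch; df1 simulates a fetch missing two rows.
df2 = pd.DataFrame({"Close": range(6)}, index=full)
df1 = df2.drop(full[[2, 4]])

# Timestamps present in one frame but absent from the other.
missing_from_df1 = df2.index.difference(df1.index)
missing_from_df2 = df1.index.difference(df2.index)

# Show the rows that the incomplete fetch lacks.
print(df2.loc[missing_from_df1])
```

`Index.difference` returns a sorted index of the labels found only on the left side, so an empty result means nothing is missing in that direction.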

melgazar9 avatar Nov 23 '22 08:11 melgazar9

I meant Yahoo Plus. I've never used it, only heard of it; I suppose they whitelist IP addresses.

Your issue is very difficult to debug because only you experience it. #689 raised a similar problem but does not appear to be solved. Is that code ever 100% successful? E.g. after avoiding Yahoo for a few hours or even a day?

ValueRaider avatar Nov 23 '22 22:11 ValueRaider

Yep, it appears I'm not the only one having this issue. I experience it much more frequently with intraday data. Oh well, I guess this is what happens when you use a free service. I guess there's no fix on the yfinance Python library's end, is there?

melgazar9 avatar Nov 23 '22 22:11 melgazar9

I've quickly coded up a potential fix: a simple "cache" that combines multiple fetches. Call Ticker.history() twice (or more) and it should eventually return all the data.

It's in Git branch dev/simple-prices-cache. ~You'll need to PIP install dateutils~ <- should already be installed by Pandas
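For readers who want the general idea without checking out the branch, here is a minimal sketch of combining several fetches into one frame. It is an assumption about the approach, not the code in dev/simple-prices-cache; `merge_fetches` is a hypothetical helper and the frames are synthetic.

```python
import pandas as pd

def merge_fetches(frames):
    """Union the rows of several fetches, keeping the first copy of each timestamp."""
    combined = pd.concat(frames)
    combined = combined[~combined.index.duplicated(keep="first")]
    return combined.sort_index()

idx = pd.date_range("2022-11-18 04:00", periods=5, freq="1min", tz="America/New_York")
full_df = pd.DataFrame({"Close": [10, 11, 12, 13, 14]}, index=idx)

# Two simulated fetches, each missing a different row.
fetch_a = full_df.drop(idx[1])
fetch_b = full_df.drop(idx[3])

merged = merge_fetches([fetch_a, fetch_b])
print(merged.shape)
```

Keeping the first copy of a duplicated timestamp is an arbitrary choice here. If two fetches can disagree on values for the same bar, a different tie-break (for example, preferring the most recent fetch) may be more appropriate.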

ValueRaider avatar Nov 23 '22 23:11 ValueRaider

@melgazar9 @paulmcq Any feedback on the branch?

ValueRaider avatar Jan 06 '23 23:01 ValueRaider

Hey @ValueRaider I don't seem to have this issue anymore after upgrading. I'll close and reach out if I notice it again. Thanks!

melgazar9 avatar Feb 19 '24 06:02 melgazar9