Inconsistent results between two successive runs
Describe bug
The OHLC data returned by yf.download() is not reproducible across runs. For example, if I download OHLC data for SPY twice, the numbers differ slightly. Ideally, they should be identical.
Simple code that reproduces your problem
Consider the following code. It downloads daily OHLC data for SPY and writes it to a file.
% cat sp500_daily_ohlc.py
import yfinance as yf
from datetime import datetime, date
df = yf.download(["SPY"], date(2023, 1, 1), date(2024, 7, 2))
time_stamp = datetime.now().strftime("%Y%m%d_%H%M%S")
file_name = f"daily_{time_stamp}.csv"
print(f"writing data into {file_name}")
df.to_csv(file_name)
Run the code twice
% python sp500_daily_ohlc.py
[*********************100%%**********************] 1 of 1 completed
writing data into daily_20240713_193451.csv
% python sp500_daily_ohlc.py
[*********************100%%**********************] 1 of 1 completed
writing data into daily_20240713_193457.csv
Ideally, these two files should be the same. But they are not.
% diff daily_20240713_193451.csv daily_20240713_193457.csv | wc -l
466
% diff daily_20240713_193451.csv daily_20240713_193457.csv | head -n 20
2,6c2,6
< 2023-01-03,384.3699951171875,386.42999267578125,377.8299865722656,380.82000732421875,372.7543029785156,74850700
< 2023-01-04,383.17999267578125,385.8800048828125,380.0,383.760009765625,375.63201904296875,85934100
< 2023-01-05,381.7200012207031,381.8399963378906,378.760009765625,379.3800048828125,371.34478759765625,76970500
< 2023-01-06,382.6099853515625,389.25,379.4100036621094,388.0799865722656,379.8604736328125,104189600
< 2023-01-09,390.3699951171875,393.70001220703125,387.6700134277344,387.8599853515625,379.6451721191406,73978100
---
> 2023-01-03,384.3699951171875,386.42999267578125,377.8299865722656,380.82000732421875,372.7542419433594,74850700
> 2023-01-04,383.17999267578125,385.8800048828125,380.0,383.760009765625,375.6319885253906,85934100
> 2023-01-05,381.7200012207031,381.8399963378906,378.760009765625,379.3800048828125,371.3447570800781,76970500
> 2023-01-06,382.6099853515625,389.25,379.4100036621094,388.0799865722656,379.8605041503906,104189600
> 2023-01-09,390.3699951171875,393.70001220703125,387.6700134277344,387.8599853515625,379.6451416015625,73978100
8,10c8,10
< 2023-01-11,392.2300109863281,395.6000061035156,391.3800048828125,395.5199890136719,387.1429138183594,68881100
< 2023-01-12,396.6700134277344,398.489990234375,392.4200134277344,396.9599914550781,388.5523986816406,90157700
< 2023-01-13,393.6199951171875,399.1000061035156,393.3399963378906,398.5,390.0597839355469,63903900
---
> 2023-01-11,392.2300109863281,395.6000061035156,391.3800048828125,395.5199890136719,387.14288330078125,68881100
> 2023-01-12,396.6700134277344,398.489990234375,392.4200134277344,396.9599914550781,388.55242919921875,90157700
> 2023-01-13,393.6199951171875,399.1000061035156,393.3399963378906,398.5,390.059814453125,63903900
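Since raw diff output makes it hard to tell which field changed, here is a minimal sketch that loads two runs and reports, per column, how many rows differ and by how much. The header names are an assumption about what to_csv wrote (the diff above starts at line 2 and never shows the header), and the helper name summarize_differences is made up for illustration. The sample rows are copied from the diff above.

```python
import io

import pandas as pd

# Assumed header: the diff output above does not show line 1 of the files.
HEADER = "Date,Open,High,Low,Close,Adj Close,Volume\n"

# First two rows of each run, copied from the diff above.
RUN_1 = HEADER + (
    "2023-01-03,384.3699951171875,386.42999267578125,377.8299865722656,"
    "380.82000732421875,372.7543029785156,74850700\n"
    "2023-01-04,383.17999267578125,385.8800048828125,380.0,"
    "383.760009765625,375.63201904296875,85934100\n"
)
RUN_2 = HEADER + (
    "2023-01-03,384.3699951171875,386.42999267578125,377.8299865722656,"
    "380.82000732421875,372.7542419433594,74850700\n"
    "2023-01-04,383.17999267578125,385.8800048828125,380.0,"
    "383.760009765625,375.6319885253906,85934100\n"
)


def summarize_differences(left, right):
    """Return one row per column: count of differing rows and largest absolute gap."""
    rows = []
    for col in left.columns:
        gap = (left[col] - right[col]).abs()
        rows.append(
            {
                "column": col,
                "rows_differing": int((gap > 0).sum()),
                "max_abs_diff": float(gap.max()),
            }
        )
    return pd.DataFrame(rows).set_index("column")


run_1 = pd.read_csv(io.StringIO(RUN_1), index_col="Date")
run_2 = pd.read_csv(io.StringIO(RUN_2), index_col="Date")
summary = summarize_differences(run_1, run_2)
print(summary)
```

To run it on the real files, replace the io.StringIO(...) arguments with the two CSV file names. On the sample rows, only the sixth column differs between the runs.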
Debug log
Code with debug mode enabled
% cat sp500_daily_ohlc.py
import yfinance as yf
from datetime import datetime, date
yf.enable_debug_mode()
df = yf.download(["SPY"], date(2023, 1, 1), date(2024, 7, 2))
time_stamp = datetime.now().strftime("%Y%m%d_%H%M%S")
file_name = f"daily_{time_stamp}.csv"
print(f"writing data into {file_name}")
df.to_csv(file_name)
Output on the first run
% python sp500_daily_ohlc.py
DEBUG Entering download()
DEBUG Disabling multithreading because DEBUG logging enabled
DEBUG Entering history()
DEBUG Entering history()
DEBUG SPY: Yahoo GET parameters: {'period1': '2023-01-01 00:00:00-05:00', 'period2': '2024-07-02 00:00:00-04:00', 'interval': '1d', 'includePrePost': False, 'events': 'div,splits,capitalGains'}
DEBUG Entering get()
DEBUG url=https://query2.finance.yahoo.com/v8/finance/chart/SPY
DEBUG params=frozendict.frozendict({'period1': 1672549200, 'period2': 1719892800, 'interval': '1d', 'includePrePost': False, 'events': 'div,splits,capitalGains'})
DEBUG Entering _get_cookie_and_crumb()
DEBUG cookie_mode = 'basic'
DEBUG Entering _get_cookie_and_crumb_basic()
DEBUG loaded persistent cookie
DEBUG reusing cookie
DEBUG crumb = 'tz63UYSUdiS'
DEBUG Exiting _get_cookie_and_crumb_basic()
DEBUG Exiting _get_cookie_and_crumb()
DEBUG response code=200
DEBUG Exiting get()
DEBUG SPY: yfinance received OHLC data: 2023-01-03 14:30:00 -> 2024-07-01 13:30:00
DEBUG SPY: OHLC after cleaning: 2023-01-03 09:30:00-05:00 -> 2024-07-01 09:30:00-04:00
DEBUG SPY: OHLC after combining events: 2023-01-03 00:00:00-05:00 -> 2024-07-01 00:00:00-04:00
DEBUG SPY: yfinance returning OHLC: 2023-01-03 00:00:00-05:00 -> 2024-07-01 00:00:00-04:00
DEBUG Exiting history()
DEBUG Exiting history()
DEBUG Exiting download()
writing data into daily_20240713_194042.csv
Output from the second run
% python sp500_daily_ohlc.py
DEBUG Entering download()
DEBUG Disabling multithreading because DEBUG logging enabled
DEBUG Entering history()
DEBUG Entering history()
DEBUG SPY: Yahoo GET parameters: {'period1': '2023-01-01 00:00:00-05:00', 'period2': '2024-07-02 00:00:00-04:00', 'interval': '1d', 'includePrePost': False, 'events': 'div,splits,capitalGains'}
DEBUG Entering get()
DEBUG url=https://query2.finance.yahoo.com/v8/finance/chart/SPY
DEBUG params=frozendict.frozendict({'period1': 1672549200, 'period2': 1719892800, 'interval': '1d', 'includePrePost': False, 'events': 'div,splits,capitalGains'})
DEBUG Entering _get_cookie_and_crumb()
DEBUG cookie_mode = 'basic'
DEBUG Entering _get_cookie_and_crumb_basic()
DEBUG loaded persistent cookie
DEBUG reusing cookie
DEBUG crumb = 'tz63UYSUdiS'
DEBUG Exiting _get_cookie_and_crumb_basic()
DEBUG Exiting _get_cookie_and_crumb()
DEBUG response code=200
DEBUG Exiting get()
DEBUG SPY: yfinance received OHLC data: 2023-01-03 14:30:00 -> 2024-07-01 13:30:00
DEBUG SPY: OHLC after cleaning: 2023-01-03 09:30:00-05:00 -> 2024-07-01 09:30:00-04:00
DEBUG SPY: OHLC after combining events: 2023-01-03 00:00:00-05:00 -> 2024-07-01 00:00:00-04:00
DEBUG SPY: yfinance returning OHLC: 2023-01-03 00:00:00-05:00 -> 2024-07-01 00:00:00-04:00
DEBUG Exiting history()
DEBUG Exiting history()
DEBUG Exiting download()
writing data into daily_20240713_194050.csv
Differences
% diff daily_20240713_194042.csv daily_20240713_194050.csv | wc -l
452
% diff daily_20240713_194042.csv daily_20240713_194050.csv | head -n 20
2c2
< 2023-01-03,384.3699951171875,386.42999267578125,377.8299865722656,380.82000732421875,372.75421142578125,74850700
---
> 2023-01-03,384.3699951171875,386.42999267578125,377.8299865722656,380.82000732421875,372.7542724609375,74850700
4,5c4,5
< 2023-01-05,381.7200012207031,381.8399963378906,378.760009765625,379.3800048828125,371.3447265625,76970500
< 2023-01-06,382.6099853515625,389.25,379.4100036621094,388.0799865722656,379.8604431152344,104189600
---
> 2023-01-05,381.7200012207031,381.8399963378906,378.760009765625,379.3800048828125,371.3447570800781,76970500
> 2023-01-06,382.6099853515625,389.25,379.4100036621094,388.0799865722656,379.8605041503906,104189600
7c7
< 2023-01-10,387.25,390.6499938964844,386.2699890136719,390.5799865722656,382.30755615234375,65358100
---
> 2023-01-10,387.25,390.6499938964844,386.2699890136719,390.5799865722656,382.3075256347656,65358100
11,13c11,13
< 2023-01-17,398.4800109863281,400.2300109863281,397.05999755859375,397.7699890136719,389.3452453613281,62677300
< 2023-01-18,399.010009765625,400.1199951171875,391.2799987792969,391.489990234375,383.1982116699219,99632300
< 2023-01-19,389.3599853515625,391.0799865722656,387.260009765625,388.6400146484375,380.40869140625,86958900
---
> 2023-01-17,398.4800109863281,400.2300109863281,397.05999755859375,397.7699890136719,389.34521484375,62677300
Bad data proof
No response
yfinance version
0.2.40
Python version
3.12.3
Operating system
Debian GNU/Linux 12 (bookworm)
Try the branch in PR #1984 (how to run)
I tried it, but it did not fix the issue. Consider the following code:
% cat sp500_daily_ohlc.py
import yfinance as yf
from datetime import datetime, date
auto_adjust = True
df = yf.download(["SPY"], start=date(2023, 1, 1), end=date(2024, 7, 2), auto_adjust=auto_adjust)
time_stamp = datetime.now().strftime("%Y%m%d_%H%M%S")
file_name = f"daily_auto_adjust_{auto_adjust}_{time_stamp}.csv"
print(f"writing data into {file_name}")
df.to_csv(file_name)
I ran this twice. Then I changed auto_adjust to False
% cat sp500_daily_ohlc.py
import yfinance as yf
from datetime import datetime, date
auto_adjust = False
df = yf.download(["SPY"], start=date(2023, 1, 1), end=date(2024, 7, 2), auto_adjust=auto_adjust)
time_stamp = datetime.now().strftime("%Y%m%d_%H%M%S")
file_name = f"daily_auto_adjust_{auto_adjust}_{time_stamp}.csv"
print(f"writing data into {file_name}")
df.to_csv(file_name)
I ran this twice as well. All four runs produced different files.
% md5sum daily_auto_adjust_*.csv
f470fbd816e09be9037edef54a0f4e59 daily_auto_adjust_False_20240714_142500.csv
001d65b7fa991f4d3aa2c1b8a99cd12d daily_auto_adjust_False_20240714_142503.csv
ee0e0a82eb6f7307e3a540f3cf30da26 daily_auto_adjust_True_20240714_142435.csv
3b1ff5053c992fcd5965230c96333dbd daily_auto_adjust_True_20240714_142442.csv
% diff daily_auto_adjust_False_20240714_142500.csv daily_auto_adjust_False_20240714_142503.csv | head -n 20
2,8c2,8
< 2023-01-03,384.3699951171875,386.42999267578125,377.8299865722656,380.82000732421875,372.7542724609375,74850700
< 2023-01-04,383.17999267578125,385.8800048828125,380.0,383.760009765625,375.63201904296875,85934100
< 2023-01-05,381.7200012207031,381.8399963378906,378.760009765625,379.3800048828125,371.3446960449219,76970500
< 2023-01-06,382.6099853515625,389.25,379.4100036621094,388.0799865722656,379.8604431152344,104189600
< 2023-01-09,390.3699951171875,393.70001220703125,387.6700134277344,387.8599853515625,379.6451416015625,73978100
< 2023-01-10,387.25,390.6499938964844,386.2699890136719,390.5799865722656,382.3075256347656,65358100
< 2023-01-11,392.2300109863281,395.6000061035156,391.3800048828125,395.5199890136719,387.1428527832031,68881100
---
> 2023-01-03,384.3699951171875,386.42999267578125,377.8299865722656,380.82000732421875,372.7543029785156,74850700
> 2023-01-04,383.17999267578125,385.8800048828125,380.0,383.760009765625,375.6319580078125,85934100
> 2023-01-05,381.7200012207031,381.8399963378906,378.760009765625,379.3800048828125,371.3447570800781,76970500
> 2023-01-06,382.6099853515625,389.25,379.4100036621094,388.0799865722656,379.8605041503906,104189600
> 2023-01-09,390.3699951171875,393.70001220703125,387.6700134277344,387.8599853515625,379.6451110839844,73978100
> 2023-01-10,387.25,390.6499938964844,386.2699890136719,390.5799865722656,382.30755615234375,65358100
> 2023-01-11,392.2300109863281,395.6000061035156,391.3800048828125,395.5199890136719,387.14288330078125,68881100
10c10
< 2023-01-13,393.6199951171875,399.1000061035156,393.3399963378906,398.5,390.0597839355469,63903900
---
> 2023-01-13,393.6199951171875,399.1000061035156,393.3399963378906,398.5,390.059814453125,63903900
% diff daily_auto_adjust_True_20240714_142435.csv daily_auto_adjust_True_20240714_142442.csv | head
2c2
< 2023-01-03,376.2290102149442,378.2453768730092,369.8275195343226,372.75421142578125,74850700
---
> 2023-01-03,376.2290410170059,378.2454078401519,369.827549812291,372.7542419433594,74850700
4,5c4,5
< 2023-01-05,373.635161717404,373.7526153349045,370.73786295793764,371.3447265625,76970500
< 2023-01-06,374.50629665205355,381.0056756304061,371.3740906518094,379.8604431152344,104189600
---
> 2023-01-05,373.63519242321297,373.7526460503659,370.73789342564294,371.3447570800781,76970500
> 2023-01-06,374.5063267394853,381.00570623999096,371.37412048760314,379.8604736328125,104189600
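With auto_adjust=True every price column differs, not just the last one. One plausible explanation is that the noise in Adj Close propagates into the other columns. This assumes yfinance derives the adjusted Open/High/Low by scaling each raw price by Adj Close / Close, which the numbers in this thread are consistent with but which is an assumption about the internals. A small arithmetic check on the 2023-01-03 row:

```python
# Raw 2023-01-03 prices from the auto_adjust=False output above
# (these two values are identical in every run shown in this thread).
raw_open = 384.3699951171875
raw_close = 380.82000732421875

# Last price column for 2023-01-03 in the two auto_adjust=True runs above.
# Assumption: with auto_adjust=True this column carries the Adj Close value.
adj_close_run_1 = 372.75421142578125
adj_close_run_2 = 372.7542419433594


def adjusted_open(open_price, close_price, adj_close):
    """Scale the raw open by the adjustment ratio Adj Close / Close."""
    return open_price * adj_close / close_price


open_run_1 = adjusted_open(raw_open, raw_close, adj_close_run_1)
open_run_2 = adjusted_open(raw_open, raw_close, adj_close_run_2)

# Compare with the Open values reported in the two auto_adjust=True files.
print(open_run_1, "vs reported 376.2290102149442")
print(open_run_2, "vs reported 376.2290410170059")
```

If the assumption holds, fixing the Adj Close jitter would also make the adjusted Open, High and Low columns reproducible, since the raw values do not change between runs.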
You should have seen a new warning message.
Yes, I see the warning message. But it does not help with the reproducibility. For example
df = yf.download(["SPY"], start=date(2023, 1, 1), end=date(2024, 7, 2))
gives the warning message. But neither
df = yf.download(["SPY"], start=date(2023, 1, 1), end=date(2024, 7, 2), auto_adjust=True)
nor
df = yf.download(["SPY"], start=date(2023, 1, 1), end=date(2024, 7, 2), auto_adjust=False)
gives reproducible results.
It's not meant to fix the reproducibility. Adj Close isn't consistent from Yahoo.
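To put a rough number on that inconsistency: the Adj Close values posted above look like 32-bit floats widened to 64 bits, and the run-to-run gaps are only a few float32 steps wide. The sketch below measures a gap in units of float32 spacing. The helper name float32_steps_apart is hypothetical, and treating the values as float32 is an interpretation of the posted numbers, not something confirmed by yfinance or Yahoo.

```python
import numpy as np


def float32_steps_apart(a, b):
    """Gap between a and b, measured in float32 spacing at their magnitude."""
    a32, b32 = np.float32(a), np.float32(b)
    # np.spacing gives the distance from a value to the next representable one.
    step = np.spacing(max(abs(a32), abs(b32)))
    return abs(float(a32) - float(b32)) / float(step)


# Adj Close for 2023-01-03 from the first pair of runs in this thread.
steps = float32_steps_apart(372.7543029785156, 372.7542419433594)
print(steps)
```

For this pair the gap is two float32 steps, which suggests the variation sits at the precision limit of the values rather than being a material difference in price.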
It's really hard to see the differences in your posts, @KamarajuKusumanchi.
I have a similar problem, but the differences in my files are only in the last digit, or sometimes the last two digits, e.g.
1.910714030265808
vs
1.91071403026580781
I always thought these were simple rounding errors, because they occur with every version and branch I use.
When I compare the files with a relative tolerance of 0.001% instead of a plain diff, they are equal.
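A tolerance-based comparison of this kind can be sketched with numpy.isclose. The relative tolerance of 1e-5 corresponds to the 0.001% mentioned above; the helper name frames_match and the CSV header are assumptions for illustration, and the sample rows are taken from the first diff in this thread.

```python
import io

import numpy as np
import pandas as pd


def frames_match(left, right, rel_tol=1e-5):
    """True if both frames have the same labels and all values agree within rel_tol."""
    if not (left.index.equals(right.index) and left.columns.equals(right.columns)):
        return False
    close = np.isclose(
        left.to_numpy(dtype=float),
        right.to_numpy(dtype=float),
        rtol=rel_tol,
        atol=0.0,  # purely relative comparison
    )
    return bool(close.all())


# Assumed header; the 2023-01-03 row from each of the first two runs.
header = "Date,Open,High,Low,Close,Adj Close,Volume\n"
row_1 = (
    "2023-01-03,384.3699951171875,386.42999267578125,377.8299865722656,"
    "380.82000732421875,372.7543029785156,74850700\n"
)
row_2 = (
    "2023-01-03,384.3699951171875,386.42999267578125,377.8299865722656,"
    "380.82000732421875,372.7542419433594,74850700\n"
)
run_1 = pd.read_csv(io.StringIO(header + row_1), index_col="Date")
run_2 = pd.read_csv(io.StringIO(header + row_2), index_col="Date")

exact = run_1.equals(run_2)                 # byte-for-byte style comparison
within_tolerance = frames_match(run_1, run_2)  # 0.001% relative tolerance
print(exact, within_tolerance)
```

Note that numpy.isclose treats NaN as unequal to NaN by default, so rows with missing prices would need equal_nan=True or separate handling.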