yfinance
Simple ticker request causes get_json() to be called 3 times
A simple ticker request, for example:
python -c "import yfinance as y;t=y.Ticker('TCEHY');print(t.cashflow);"
causes the get_json() (requests) function to be called 3 times!
There is no reason this should take 3 round trips. Everything we need can (or should?) be obtainable from one request, if it is called with the correct URL parameters.
@eabase I think that either there is a cache mechanism, such that effectively only one full fetch from the web is done, or that 3 different sets of data are fetched (info first, then additional data as the user requires it in subsequent calls after .Ticker()).
I don't think there is unnecessary redundancy here. Though if you are correct, a lot of time can be saved (in my scanner I fetch all Nasdaq symbols one by one, so cutting this time by ~1/3 sounds terrific).
Can you elaborate on where exactly these 3 get_json() calls are made, and I'll take a deeper look?
python -c "import yfinance as y;t=y.Ticker('TCEHY');print(t.cashflow);"
@asafravid Just put a print statement in get_json() and you'll see.
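Rather than editing get_json() in place, one way to count the calls is to wrap the function from the outside. This is a hedged sketch using a stand-in function; in practice something like utils.get_json = count_calls(utils.get_json) could be applied after importing yfinance:

```python
import functools

def count_calls(func):
    """Wrap func so every call is counted and printed."""
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        wrapper.calls += 1
        print(f"{func.__name__} call #{wrapper.calls}")
        return func(*args, **kwargs)
    wrapper.calls = 0
    return wrapper
```

Running the one-liner with the wrapper installed should then print one line per get_json() call.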
@eabase Checking, will update with findings.
@eabase -> typo, it's get_json()
(minor comment)
@eabase Yes, it's what I mentioned:
3 different sets of data are fetched (info, then additional data as the user requires it in subsequent calls after
.Ticker()
)
3 different urls are fetched:
ticker_url = "{}/{}".format(self._scrape_url, self.ticker)
# get info and sustainability
data = utils.get_json(ticker_url, proxy, self.session)
Then
# get fundamentals
data = utils.get_json(ticker_url + '/financials', proxy, self.session)
And then
# Analysis
data = utils.get_json(ticker_url + '/analysis', proxy, self.session)
So I don't think it's redundancy.
Your thoughts?
@asafravid There is no need to hit 3 different URLs. Put this in your browser:
https://query2.finance.yahoo.com/v11/finance/quoteSummary/FE?lang=en&region=US&modules=assetProfile%2CsummaryProfile%2CsummaryDetail%2CesgScores%2Cprice%2CincomeStatementHistory%2CincomeStatementHistoryQuarterly%2CbalanceSheetHistory%2CbalanceSheetHistoryQuarterly%2CcashflowStatementHistory%2CcashflowStatementHistoryQuarterly%2CdefaultKeyStatistics%2CfinancialData%2CcalendarEvents%2CsecFilings%2CrecommendationTrend%2CupgradeDowngradeHistory%2CinstitutionOwnership%2CfundOwnership%2CmajorDirectHolders%2CmajorHoldersBreakdown%2CinsiderTransactions%2CinsiderHolders%2CnetSharePurchaseActivity%2Cearnings%2CearningsHistory%2CearningsTrend%2CindustryTrend%2CindexTrend%2CsectorTrend
It will give you the following JSON (needs unzipping): FE.zip
However, I couldn't check how many requests it makes when using a browser. (Please check!) When I did the same thing using curl (but with the API-key URL) and a header, it ended up making one request for each of the 30 modules. So I immediately ran out of free requests (100), with the error in the JSON as: {"message":"Limit Exceeded"}. (We should check for this!)
Then I added keep-alive to the request header, and it seems to have made only one request, but I can't test more for the next 24 hours.
If using our request definition in get_json(), it seems we should add this header:
"Connection: keep-alive"
when doing this one. For other single-use requests, I think it should be close, but I don't know, because I don't know under what conditions the connection is closed.
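A minimal sketch of the header suggestion: a shared requests.Session already reuses TCP connections (HTTP keep-alive) via urllib3's connection pooling, and the header can also be set explicitly as suggested above. The User-Agent value here is an assumption, since Yahoo rejects some default agents:

```python
import requests

session = requests.Session()
session.headers.update({
    "Connection": "keep-alive",   # explicit, though Session pools connections by default
    "User-Agent": "Mozilla/5.0",  # assumption: Yahoo may reject default agents
})

# Hypothetical use inside get_json():
# response = session.get(ticker_url, proxies=proxy, timeout=30)
```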
Looking at all 30 modules, it is clear that we usually only need 1 or 2 at a time. This would save considerable time and bandwidth, as the full 30 take ~10-12 sec. to complete. Therefore I suggest building a lookup table of all the items in the full request, mapping each of them to its respective module, and then forming the request URL (with only the needed modules) from the lookup table.
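An illustrative sketch of that lookup table. The module names are real ones from the URL above, but the item-to-module mapping shown is a partial example, not the full table:

```python
from urllib.parse import urlencode

# Partial item -> quoteSummary module mapping (illustrative only)
ITEM_TO_MODULE = {
    "cashflow": "cashflowStatementHistory",
    "balance_sheet": "balanceSheetHistory",
    "financials": "incomeStatementHistory",
    "sustainability": "esgScores",
    "recommendations": "upgradeDowngradeHistory",
}

def build_quote_summary_url(symbol, items):
    """Build a quoteSummary URL requesting only the needed modules."""
    modules = sorted({ITEM_TO_MODULE[item] for item in items})
    query = urlencode({"modules": ",".join(modules)})
    return f"https://query2.finance.yahoo.com/v11/finance/quoteSummary/{symbol}?{query}"
```

For t.cashflow, this would request only cashflowStatementHistory instead of all 30 modules.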
I just noticed that the old JSON returned from the HTML scrape seems ~4x larger than the one above, because it contains all the descriptions and all sorts of additional webpage URL and link info.
Compare the above with the html, here obtained by this curl:
curl -s -A "curl/7.55.1" -H "Accept: application/json" -H "Content-Type: application/json" https://finance.yahoo.com/quoteSummary/FE/key-statistics?p={FE} | grep "root.App.main = " | sed -e "s/root.App.main = //" |sed -e 's/.$//' >from_html.json
Can you please list the 30 modules? I'm not sure getting all of them at once (albeit in a single request) will take less time than 3 or 4 modules (consider 8000 stocks/symbols scanned one by one).
If you can provide a code snippet implementing the LUT (maybe some dict?), I can check whether it takes less time than the current implementation. Surely these are 3 different web pages, right? You are saying I can get all that data at once? What is the URL for it? I need to see some code which implements your suggested single call + LUT, which we can take into the yfinance code, and I'll test its performance vs. the current implementation.
Hi @asafravid
Can you please list the 30 modules?
They're all listed in the 1st URL in my previous post, separated by %2C
(",").
if you can provide code snippet to implement the LUT
Unfortunately I don't have time to look into this at the moment; it was just a suggestion for how to (possibly?) do it better. It's probably a bit tedious to track down all the paths and put them in a LUT, but it should be doable in an afternoon. You could even construct it automatically by downloading the full response the very first time (once a day?) and using selected parts in subsequent requests.
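A hedged sketch of that automatic construction: given one parsed full quoteSummary response (all modules), record which module provides each field, so later requests can ask only for the owning module. The response shape assumed here matches the JSON from the URL above:

```python
def build_lut(full_response):
    """Map each field name to the quoteSummary module that provides it.

    full_response: parsed JSON from a request for all modules.
    """
    result = full_response["quoteSummary"]["result"][0]
    lut = {}
    for module, payload in result.items():
        if isinstance(payload, dict):
            for field in payload:
                lut.setdefault(field, module)
    return lut
```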
What is more disturbing is the complete lack of error checking for the request made in that function, and that there is no switch to get at least a minimal amount of debug info, such as the request headers sent or returned. A good start would be to add all the request exceptions shown here, and to check for limit exceeded and API-key issues, in case someone is using those. (Many issues here seem related to throttling.)
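A hedged sketch of what minimal error handling for get_json() could look like. The exception classes are the standard requests ones; the "Limit Exceeded" check mirrors the error body quoted earlier, and the function name is hypothetical:

```python
import requests

def get_json_checked(url, session, proxy=None):
    """Fetch JSON with basic error handling and a throttle check."""
    try:
        resp = session.get(url, proxies=proxy, timeout=30)
        resp.raise_for_status()
    except requests.exceptions.Timeout as err:
        raise RuntimeError(f"request timed out: {err}")
    except requests.exceptions.ConnectionError as err:
        raise RuntimeError(f"connection failed: {err}")
    except requests.exceptions.HTTPError as err:
        raise RuntimeError(f"HTTP error: {err}")
    data = resp.json()
    # Detect the throttling error body observed above
    if data.get("message") == "Limit Exceeded":
        raise RuntimeError("API request limit exceeded (throttled)")
    return data
```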
It seems to me that the real issue here is that you insist on scraping a "webpage" that just happens to have a JSON part embedded in it ("root.App.main = "), instead of requesting JSON directly using correct headers and properly constructed URLs. We should probably implement both, separately: first try to get the data from the JSON request, and second, get any missing items (if any) from the HTML (web scrape) with embedded JSON, as is currently done.
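For the fallback path, a hedged sketch of extracting the JSON object out of the "root.App.main = {...};" line embedded in the scraped HTML (same idea as the grep/sed pipeline above):

```python
import json
import re

def extract_embedded_json(html):
    """Pull the JSON object out of the 'root.App.main = {...};' line, or None."""
    match = re.search(r"root\.App\.main\s*=\s*(\{.*\});", html)
    if match is None:
        return None
    return json.loads(match.group(1))
```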
@eabase I see. We need to measure/compare the results of doing 30 modules at once vs. 3 specific URLs... Hopefully sometime in the future we can measure/check this (or someone else can help). It might indeed be the cause of the throttling. @eabase can you link the #s of the throttling issues? That way your important insights above could be tied to a concrete fix for those throttling issues.
I'll pin this issue, maybe someone with relevant skills has time to implement an improvement.
@fredrik-corneliusson Does your recent work address this?
@ValueRaider Yes, it should. However, v0.2 uses another way of fetching the data (and the output has changed), so it cannot be compared directly. Here are the different results for v0.1, v0.2rc2, and the dev branch with my changes (I shortened the URLs with lots of parameters): v0.1.87
$ python -c "import yfinance as y, logging as l;l.basicConfig(level=l.DEBUG);t=y.Ticker('TCEHY');print(t.cashflow);"
DEBUG:urllib3.connectionpool:Starting new HTTPS connection (1): finance.yahoo.com:443
DEBUG:urllib3.connectionpool:https://finance.yahoo.com:443 "GET /quote/TCEHY/holders HTTP/1.1" 200 None
DEBUG:urllib3.connectionpool:Starting new HTTPS connection (1): finance.yahoo.com:443
DEBUG:urllib3.connectionpool:https://finance.yahoo.com:443 "GET /quote/TCEHY HTTP/1.1" 200 None
DEBUG:urllib3.connectionpool:Starting new HTTPS connection (1): query1.finance.yahoo.com:443
DEBUG:urllib3.connectionpool:https://query1.finance.yahoo.com:443 "GET /ws/fundamentals-timeseries/v1/finance/timeseries/TCEHY?symbol=TCEHY... HTTP/1.1" 200 363
DEBUG:urllib3.connectionpool:Starting new HTTPS connection (1): finance.yahoo.com:443
DEBUG:urllib3.connectionpool:https://finance.yahoo.com:443 "GET /quote/TCEHY/financials HTTP/1.1" 200 None
DEBUG:urllib3.connectionpool:Starting new HTTPS connection (1): finance.yahoo.com:443
DEBUG:urllib3.connectionpool:https://finance.yahoo.com:443 "GET /quote/TCEHY/analysis HTTP/1.1" 200 None
v0.2rc2
$ python -c "import yfinance as y, logging as l;l.basicConfig(level=l.DEBUG);t=y.Ticker('TCEHY');print(t.cashflow);"
DEBUG:urllib3.connectionpool:Starting new HTTPS connection (1): finance.yahoo.com:443
DEBUG:urllib3.connectionpool:https://finance.yahoo.com:443 "GET /quote/TCEHY/holders HTTP/1.1" 200 None
DEBUG:urllib3.connectionpool:Starting new HTTPS connection (1): finance.yahoo.com:443
DEBUG:urllib3.connectionpool:https://finance.yahoo.com:443 "GET /quote/TCEHY HTTP/1.1" 200 None
DEBUG:urllib3.connectionpool:Starting new HTTPS connection (1): query1.finance.yahoo.com:443
DEBUG:urllib3.connectionpool:https://query1.finance.yahoo.com:443 "GET /ws/fundamentals-timeseries/v1/finance/timeseries/TCEHY?symbol=TCEHY&type=trailingPegRatio&period1=1652910095&period2=1668724895 HTTP/1.1" 200 363
DEBUG:urllib3.connectionpool:Starting new HTTPS connection (1): finance.yahoo.com:443
DEBUG:urllib3.connectionpool:https://finance.yahoo.com:443 "GET /quote/TCEHY/financials HTTP/1.1" 200 None
DEBUG:urllib3.connectionpool:Starting new HTTPS connection (1): query2.finance.yahoo.com:443
DEBUG:urllib3.connectionpool:https://query2.finance.yahoo.com:443 "GET /ws/fundamentals-timeseries/v1/finance/timeseries/TCEHY... HTTP/1.1" 200 None
DEBUG:urllib3.connectionpool:Starting new HTTPS connection (1): query2.finance.yahoo.com:443
DEBUG:urllib3.connectionpool:https://query2.finance.yahoo.com:443 "GET /ws/fundamentals-timeseries/v1/finance/timeseries/TCEHY... HTTP/1.1" 200 None
DEBUG:urllib3.connectionpool:Starting new HTTPS connection (1): finance.yahoo.com:443
DEBUG:urllib3.connectionpool:https://finance.yahoo.com:443 "GET /quote/TCEHY/balance-sheet HTTP/1.1" 200 None
DEBUG:urllib3.connectionpool:Starting new HTTPS connection (1): query2.finance.yahoo.com:443
DEBUG:urllib3.connectionpool:https://query2.finance.yahoo.com:443 "GET /ws/fundamentals-timeseries/v1/finance/timeseries/TCEHY?symbol=TCEHY... HTTP/1.1" 200 None
DEBUG:urllib3.connectionpool:Starting new HTTPS connection (1): query2.finance.yahoo.com:443
DEBUG:urllib3.connectionpool:https://query2.finance.yahoo.com:443 "GET /ws/fundamentals-timeseries/v1/finance/timeseries/TCEHY... HTTP/1.1" 200 None
DEBUG:urllib3.connectionpool:Starting new HTTPS connection (1): finance.yahoo.com:443
DEBUG:urllib3.connectionpool:https://finance.yahoo.com:443 "GET /quote/TCEHY/cash-flow HTTP/1.1" 200 None
DEBUG:urllib3.connectionpool:Starting new HTTPS connection (1): query2.finance.yahoo.com:443
DEBUG:urllib3.connectionpool:https://query2.finance.yahoo.com:443 "GET /ws/fundamentals-timeseries/v1/finance/timeseries/TCEHY... HTTP/1.1" 200 None
DEBUG:urllib3.connectionpool:Starting new HTTPS connection (1): query2.finance.yahoo.com:443
DEBUG:urllib3.connectionpool:https://query2.finance.yahoo.com:443 "GET /ws/fundamentals-timeseries/v1/finance/timeseries/TCEHY... HTTP/1.1" 200 None
DEBUG:urllib3.connectionpool:Starting new HTTPS connection (1): finance.yahoo.com:443
DEBUG:urllib3.connectionpool:https://finance.yahoo.com:443 "GET /quote/TCEHY/analysis
Current dev branch:
$ python -c "import yfinance as y, logging as l;l.basicConfig(level=l.DEBUG);t=y.Ticker('TCEHY');print(t.cashflow);"
DEBUG:urllib3.connectionpool:Starting new HTTPS connection (1): finance.yahoo.com:443
DEBUG:urllib3.connectionpool:https://finance.yahoo.com:443 "GET /quote/TCEHY/cash-flow HTTP/1.1" 200 None
DEBUG:urllib3.connectionpool:Starting new HTTPS connection (1): query2.finance.yahoo.com:443
DEBUG:urllib3.connectionpool:https://query2.finance.yahoo.com:443 "GET /ws/fundamentals-timeseries/v1/finance/timeseries/TCEHY... HTTP/1.1" 200 None