
Simple ticker request causes get_json() to be called 3 times

Open eabase opened this issue 3 years ago • 12 comments

A simple ticker request, for example a one-liner like this:

python -c "import yfinance as y;t=y.Ticker('TCEHY');print(t.cashflow);"

causes the get_json() function (and its underlying requests) to be called 3 times!

There is no reason this should need 3 round trips. Everything we need can (or should?) be fetched in one request, if it is called with the correct URL parameters.
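For illustration, a single combined request might be built like this. This is only a sketch: the endpoint path and module names are assumptions about Yahoo's quoteSummary API, not yfinance code.

```python
from urllib.parse import urlencode

# Sketch (not yfinance code): build one quoteSummary URL that requests
# several modules in a single call instead of fetching separate pages.
# Endpoint version and module names are assumptions about Yahoo's API.
BASE = "https://query2.finance.yahoo.com/v11/finance/quoteSummary"

def build_quote_summary_url(symbol, modules):
    # modules is a list of quoteSummary module names, joined with commas
    query = urlencode({"lang": "en", "region": "US",
                       "modules": ",".join(modules)})
    return "{}/{}?{}".format(BASE, symbol, query)

url = build_quote_summary_url("TCEHY",
                              ["summaryDetail", "cashflowStatementHistory"])
```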

eabase avatar Jan 02 '22 01:01 eabase

@eabase I think either there's a cache mechanism such that effectively only one full fetch from the web is done, or three different data sets are fetched (info first, then additional data as the user requires it in subsequent calls after the .Ticker()).

I don't think there's unnecessary redundancy here. Though if you are correct, a lot of time can be saved (in my scanner I fetch all NASDAQ symbols one by one, so cutting this time by ~1/3 sounds terrific).

Can you elaborate on where exactly these 3 get_json() calls are made, and I'll take a deeper look?

python -c "import yfinance as y;t=y.Ticker('TCEHY');print(t.cashflow);"

asafravid avatar Jan 02 '22 05:01 asafravid

@asafravid Just put a print statement in the get_json() and you'll see.

eabase avatar Jan 02 '22 21:01 eabase

@eabase Checking. Will update findings.

asafravid avatar Jan 02 '22 22:01 asafravid

@eabase -> typo: it's get_json() (minor comment)

asafravid avatar Jan 02 '22 22:01 asafravid

@eabase yes, it's what I mentioned:

3 different data sets are fetched (info, then additional data as the user requires it in subsequent calls after the .Ticker())

3 different URLs are fetched:

    ticker_url = "{}/{}".format(self._scrape_url, self.ticker)

    # get info and sustainability
    data = utils.get_json(ticker_url, proxy, self.session)

Then

    # get fundamentals
    data = utils.get_json(ticker_url + '/financials', proxy, self.session)

And then

    # Analysis
    data = utils.get_json(ticker_url + '/analysis', proxy, self.session)

So I don't think it's redundant.

Your thoughts?

asafravid avatar Jan 02 '22 22:01 asafravid

@asafravid There's no need to fetch 3 different URLs. Put this in your browser:

https://query2.finance.yahoo.com/v11/finance/quoteSummary/FE?lang=en&region=US&modules=assetProfile%2CsummaryProfile%2CsummaryDetail%2CesgScores%2Cprice%2CincomeStatementHistory%2CincomeStatementHistoryQuarterly%2CbalanceSheetHistory%2CbalanceSheetHistoryQuarterly%2CcashflowStatementHistory%2CcashflowStatementHistoryQuarterly%2CdefaultKeyStatistics%2CfinancialData%2CcalendarEvents%2CsecFilings%2CrecommendationTrend%2CupgradeDowngradeHistory%2CinstitutionOwnership%2CfundOwnership%2CmajorDirectHolders%2CmajorHoldersBreakdown%2CinsiderTransactions%2CinsiderHolders%2CnetSharePurchaseActivity%2Cearnings%2CearningsHistory%2CearningsTrend%2CindustryTrend%2CindexTrend%2CsectorTrend

It will give you the following JSON (needs unzipping): FE.zip

However, I couldn't check how many requests it makes when using a browser. (Please check!) When I did the same thing using curl (but with the API-key URL) and a header, it ended up making a request for each of the 30 modules. So I immediately ran out of free requests (100), with the error in the JSON being: {"message":"Limit Exceeded"}. (We should check for this!)

Then I added keep-alive to the request header, and it seems to have made only one request, but I can't test further for the next 24 hours.

If using our request definition in get_json(), it seems we should add the header "Connection: keep-alive" when doing this one. For other single-use requests, I think it should be "Connection: close", but I don't know, because I don't know under what conditions the connection is closed.
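As a sketch of that suggestion (assuming the requests library, which yfinance already uses): a shared Session reuses the underlying connection across calls, and the explicit header documents the intent. The helper name is hypothetical.

```python
import requests

# Sketch: one shared Session so repeated calls reuse the same TCP
# connection (urllib3 keeps connections alive by default; the explicit
# header just makes the intent visible).
session = requests.Session()
session.headers.update({
    "Connection": "keep-alive",
    "Accept": "application/json",
})

def get_json_via_session(url):
    # hypothetical helper; all calls share the session's connection pool
    resp = session.get(url, timeout=30)
    resp.raise_for_status()
    return resp.json()
```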


Looking at all 30 modules, it is clear that we usually only need 1 or 2 at a time. This would save considerable time and bandwidth, as the full 30 take ~10-12 sec. to complete. Therefore I suggest building a lookup table with all the items in the full request, mapping each of them to its respective module, and then forming the request URL (with only the needed modules) from the lookup table.
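A minimal sketch of such a lookup table. The property-to-module mapping below is illustrative and incomplete, not a verified list; module names are taken from the URL above.

```python
# Hypothetical LUT mapping yfinance-style properties to the quoteSummary
# modules that (presumably) contain them -- illustrative, not exhaustive.
MODULE_LUT = {
    "cashflow": ["cashflowStatementHistory", "cashflowStatementHistoryQuarterly"],
    "balance_sheet": ["balanceSheetHistory", "balanceSheetHistoryQuarterly"],
    "financials": ["incomeStatementHistory", "incomeStatementHistoryQuarterly"],
    "info": ["summaryProfile", "summaryDetail", "price"],
    "sustainability": ["esgScores"],
}

def modules_for(properties):
    """Collect the unique modules needed for the requested properties."""
    needed = []
    for prop in properties:
        for module in MODULE_LUT.get(prop, []):
            if module not in needed:
                needed.append(module)
    return needed
```

The returned list could then be joined with commas to form the modules= parameter of a single request.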


I just noticed that the old JSON returned from the HTML scrape seems 4x larger than the one above, because it contains all the descriptions and all sorts of additional webpage URL and link info.

Compare the above with the HTML version, here obtained by this curl:

curl -s -A "curl/7.55.1" -H "Accept: application/json" -H "Content-Type: application/json" https://finance.yahoo.com/quoteSummary/FE/key-statistics?p={FE} | grep "root.App.main = " | sed -e "s/root.App.main = //" |sed -e 's/.$//' >from_html.json
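The same extraction can be done in Python; a sketch equivalent to the grep/sed pipeline above (the function name is hypothetical):

```python
import json

def extract_embedded_json(html):
    """Pull the JSON literal that Yahoo embeds after 'root.App.main = '.

    Mirrors the grep/sed pipeline: take the rest of that line and strip
    the trailing ';' before decoding.
    """
    marker = "root.App.main = "
    for line in html.splitlines():
        if marker in line:
            payload = line.split(marker, 1)[1].rstrip().rstrip(";")
            return json.loads(payload)
    raise ValueError("embedded JSON marker not found")
```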

eabase avatar Jan 03 '22 00:01 eabase

Can you please list the 30 modules? I'm not sure getting all of them at once (albeit in a single request) will take less time than 3 or 4 modules (consider 8000 stocks/symbols scanned one by one).

If you can provide a code snippet implementing the LUT (maybe some dict?), I can check whether it takes less time than the current implementation. Surely these are 3 different web pages, right? You are saying I can get all that data at once? What is the URL for it? I need to see some code implementing your suggested single call + LUT, which we can take into the yfinance code, and I'll test its performance vs. the current implementation.

asafravid avatar Jan 03 '22 05:01 asafravid

Hi @asafravid

Can you please list the 30 modules?

They're all listed in the first URL in my previous post, separated by %2C (",").
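They can also be recovered programmatically; a sketch using the standard library (the URL is shortened here to three modules for brevity):

```python
from urllib.parse import urlparse, parse_qs

def list_modules(url):
    # parse_qs decodes the %2C separators back into literal commas
    return parse_qs(urlparse(url).query)["modules"][0].split(",")

short_url = ("https://query2.finance.yahoo.com/v11/finance/quoteSummary/FE"
             "?lang=en&region=US"
             "&modules=assetProfile%2CsummaryProfile%2CsummaryDetail")
```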

if you can provide code snippet to implement the LUT

Unfortunately I don't have time to look into this at the moment; it was just a suggestion for how to (possibly?) do it better. It's probably a bit tedious to track down all the paths and put them in a LUT, but it should be doable in an afternoon. You could even construct this automatically by downloading the full response the very first time (once a day?) and using selected parts in subsequent requests.


What is more disturbing is that there is a complete lack of error checking for the request made in that function, and no switch to get at least a minimal amount of debug info, such as the request headers sent or returned. A good start would be to handle all the request exceptions shown here, and to check for the limit-exceeded and API-key issues, in case someone is using those. (Many issues here seem related to throttling.)
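A sketch of what such a payload check could look like. YahooAPIError and check_payload are hypothetical names, not yfinance API; the error shape is the one observed above.

```python
class YahooAPIError(Exception):
    """Hypothetical exception for error payloads returned by Yahoo."""

def check_payload(data):
    # Surface the rate-limit error seen above instead of passing it
    # downstream as if it were normal data.
    if isinstance(data, dict) and data.get("message") == "Limit Exceeded":
        raise YahooAPIError("API rate limit exceeded -- back off before retrying")
    return data
```

get_json() could call this on every decoded response before returning it.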


It seems to me the real issue here is that you insist on scraping a "webpage" that just happens to have a JSON part embedded in it ("root.App.main = "), instead of requesting JSON directly using correct headers and properly constructed URLs. We should probably implement both, separately: first try to get the data from the JSON request, and second, get any remaining missing pieces from the HTML (web scrape) with embedded JSON, as is currently done.
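That two-step strategy could be sketched generically (the function and both callables are hypothetical, not yfinance code):

```python
def fetch_with_fallback(fetch_json_api, fetch_html_scrape):
    """Try the direct JSON endpoint first; fall back to the HTML scrape.

    Both arguments are callables returning a dict (or raising on failure);
    this is a sketch of the strategy, not an actual implementation.
    """
    try:
        data = fetch_json_api()
        if data:
            return data
    except Exception:
        pass  # e.g. network error or unexpected payload
    return fetch_html_scrape()
```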

eabase avatar Jan 03 '22 17:01 eabase

@eabase I see. We need to measure/compare the results of fetching 30 modules at once vs. 3 specific URLs... Hopefully sometime in the future I (or someone else who can help) will measure/check this. It might indeed be the cause of the throttling. @eabase can you link the #s of the throttling issues? That way your important insights above can be tied to a concrete fix for those issues.

asafravid avatar Jan 03 '22 19:01 asafravid

I'll pin this issue, maybe someone with relevant skills has time to implement an improvement.

ValueRaider avatar Oct 28 '22 14:10 ValueRaider

@fredrik-corneliusson Does your recent work address this?

ValueRaider avatar Nov 16 '22 16:11 ValueRaider

@ValueRaider Yes, it should. However, v0.2 uses another way of fetching the data (and the output has changed), so it cannot be compared directly. Here are the different results for v0.1, v0.2rc2 and the dev branch with my changes (I shortened the URLs that have lots of parameters):

v0.1.87

$ python -c "import yfinance as y, logging as l;l.basicConfig(level=l.DEBUG);t=y.Ticker('TCEHY');print(t.cashflow);"
DEBUG:urllib3.connectionpool:Starting new HTTPS connection (1): finance.yahoo.com:443
DEBUG:urllib3.connectionpool:https://finance.yahoo.com:443 "GET /quote/TCEHY/holders HTTP/1.1" 200 None
DEBUG:urllib3.connectionpool:Starting new HTTPS connection (1): finance.yahoo.com:443
DEBUG:urllib3.connectionpool:https://finance.yahoo.com:443 "GET /quote/TCEHY HTTP/1.1" 200 None
DEBUG:urllib3.connectionpool:Starting new HTTPS connection (1): query1.finance.yahoo.com:443
DEBUG:urllib3.connectionpool:https://query1.finance.yahoo.com:443 "GET /ws/fundamentals-timeseries/v1/finance/timeseries/TCEHY?symbol=TCEHY... HTTP/1.1" 200 363
DEBUG:urllib3.connectionpool:Starting new HTTPS connection (1): finance.yahoo.com:443
DEBUG:urllib3.connectionpool:https://finance.yahoo.com:443 "GET /quote/TCEHY/financials HTTP/1.1" 200 None
DEBUG:urllib3.connectionpool:Starting new HTTPS connection (1): finance.yahoo.com:443
DEBUG:urllib3.connectionpool:https://finance.yahoo.com:443 "GET /quote/TCEHY/analysis HTTP/1.1" 200 None

v0.2rc2

$ python -c "import yfinance as y, logging as l;l.basicConfig(level=l.DEBUG);t=y.Ticker('TCEHY');print(t.cashflow);"
DEBUG:urllib3.connectionpool:Starting new HTTPS connection (1): finance.yahoo.com:443
DEBUG:urllib3.connectionpool:https://finance.yahoo.com:443 "GET /quote/TCEHY/holders HTTP/1.1" 200 None
DEBUG:urllib3.connectionpool:Starting new HTTPS connection (1): finance.yahoo.com:443
DEBUG:urllib3.connectionpool:https://finance.yahoo.com:443 "GET /quote/TCEHY HTTP/1.1" 200 None
DEBUG:urllib3.connectionpool:Starting new HTTPS connection (1): query1.finance.yahoo.com:443
DEBUG:urllib3.connectionpool:https://query1.finance.yahoo.com:443 "GET /ws/fundamentals-timeseries/v1/finance/timeseries/TCEHY?symbol=TCEHY&type=trailingPegRatio&period1=1652910095&period2=1668724895 HTTP/1.1" 200 363
DEBUG:urllib3.connectionpool:Starting new HTTPS connection (1): finance.yahoo.com:443
DEBUG:urllib3.connectionpool:https://finance.yahoo.com:443 "GET /quote/TCEHY/financials HTTP/1.1" 200 None
DEBUG:urllib3.connectionpool:Starting new HTTPS connection (1): query2.finance.yahoo.com:443
DEBUG:urllib3.connectionpool:https://query2.finance.yahoo.com:443 "GET /ws/fundamentals-timeseries/v1/finance/timeseries/TCEHY... HTTP/1.1" 200 None
DEBUG:urllib3.connectionpool:Starting new HTTPS connection (1): query2.finance.yahoo.com:443
DEBUG:urllib3.connectionpool:https://query2.finance.yahoo.com:443 "GET /ws/fundamentals-timeseries/v1/finance/timeseries/TCEHY... HTTP/1.1" 200 None
DEBUG:urllib3.connectionpool:Starting new HTTPS connection (1): finance.yahoo.com:443
DEBUG:urllib3.connectionpool:https://finance.yahoo.com:443 "GET /quote/TCEHY/balance-sheet HTTP/1.1" 200 None
DEBUG:urllib3.connectionpool:Starting new HTTPS connection (1): query2.finance.yahoo.com:443
DEBUG:urllib3.connectionpool:https://query2.finance.yahoo.com:443 "GET /ws/fundamentals-timeseries/v1/finance/timeseries/TCEHY?symbol=TCEHY... HTTP/1.1" 200 None
DEBUG:urllib3.connectionpool:Starting new HTTPS connection (1): query2.finance.yahoo.com:443
DEBUG:urllib3.connectionpool:https://query2.finance.yahoo.com:443 "GET /ws/fundamentals-timeseries/v1/finance/timeseries/TCEHY... HTTP/1.1" 200 None
DEBUG:urllib3.connectionpool:Starting new HTTPS connection (1): finance.yahoo.com:443
DEBUG:urllib3.connectionpool:https://finance.yahoo.com:443 "GET /quote/TCEHY/cash-flow HTTP/1.1" 200 None
DEBUG:urllib3.connectionpool:Starting new HTTPS connection (1): query2.finance.yahoo.com:443
DEBUG:urllib3.connectionpool:https://query2.finance.yahoo.com:443 "GET /ws/fundamentals-timeseries/v1/finance/timeseries/TCEHY... HTTP/1.1" 200 None
DEBUG:urllib3.connectionpool:Starting new HTTPS connection (1): query2.finance.yahoo.com:443
DEBUG:urllib3.connectionpool:https://query2.finance.yahoo.com:443 "GET /ws/fundamentals-timeseries/v1/finance/timeseries/TCEHY... HTTP/1.1" 200 None
DEBUG:urllib3.connectionpool:Starting new HTTPS connection (1): finance.yahoo.com:443
DEBUG:urllib3.connectionpool:https://finance.yahoo.com:443 "GET /quote/TCEHY/analysis

Current dev branch:

$ python -c "import yfinance as y, logging as l;l.basicConfig(level=l.DEBUG);t=y.Ticker('TCEHY');print(t.cashflow);"
DEBUG:urllib3.connectionpool:Starting new HTTPS connection (1): finance.yahoo.com:443
DEBUG:urllib3.connectionpool:https://finance.yahoo.com:443 "GET /quote/TCEHY/cash-flow HTTP/1.1" 200 None
DEBUG:urllib3.connectionpool:Starting new HTTPS connection (1): query2.finance.yahoo.com:443
DEBUG:urllib3.connectionpool:https://query2.finance.yahoo.com:443 "GET /ws/fundamentals-timeseries/v1/finance/timeseries/TCEHY... HTTP/1.1" 200 None

fredrik-corneliusson avatar Nov 16 '22 22:11 fredrik-corneliusson