backend
import sources from newsapi.org
The commercial https://newsapi.org site appears to have ~30k sources categorized by country of publication and language. Might be worth scraping from their `sources` endpoint and importing that metadata for any sources we don't have.
https://newsapi.org/docs/endpoints/sources
Hi @rahulbot! I want to start contributing with this issue, but I am not able to find any documentation related to the codebase. Could you please guide me on how I can proceed with this? Thanks!
Hello - thanks for your offer. At a high level, this task involves fetching all the news sites listed on NewsAPI into a CSV file. After that we can review the CSV for import into our system. That plan splits the code nicely. A rough outline would look like this:
- Create a new `fetch-news-api-sources` repository
- Install the unofficial Python API client and sign up for their free API tier (that gives you 100 API hits per day)
- Read about their `sources` API endpoint, which lists "top" sources within their system
- Write a script that calls that endpoint via that unofficial API client and saves results to a CSV file
- Poke around and try to determine how to page through results, or make multiple calls, to see if you've fetched everything you can
Once that CSV file is in hand we can then format it correctly for ingest into our system (either via our API or our front-end source management tool). That would help us add any news sources we don't have already that are considered "top news".
Hi @rahulbot, I implemented the rough outline you mentioned above here. What else would be needed for this issue?
Thanks
Great work! Your spreadsheet has about 100 sources, but their homepage says they track "75,000 worldwide sources". I note that the `sources` endpoint says it returns a "subset of news publishers that top headlines are available from". Do you suppose they only return "top headlines" from these ~100 sources? Is there any way to download their larger list of 75k sources via a different endpoint? This short list is fine, but really not as helpful as a longer list would be.
Maybe do something clever like search their `everything` endpoint for each country name and then pull out source ids, names, and base domains from the urls of stories? They link in the documentation to this page as a "sources index" but I don't see a list of sources on there. Any creative ideas?
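One low-cost thing worth trying: the `/sources` endpoint also accepts `category`, `language`, and `country` filters, so iterating over country codes might surface sources that the unfiltered call omits. A sketch, where `COUNTRIES` is a small illustrative subset of the country codes NewsAPI documents, and `api` is a `newsapi-python` client instance:

```python
# Small illustrative subset of the ISO country codes NewsAPI documents.
COUNTRIES = ["us", "gb", "in", "de", "fr", "au", "ca"]


def sources_by_country(api, countries=COUNTRIES):
    """Merge /sources results across per-country calls, keyed by source id.

    Keying on the id deduplicates sources that show up for several countries.
    """
    merged = {}
    for cc in countries:
        resp = api.get_sources(country=cc)  # newsapi-python client call
        for src in resp.get("sources", []):
            merged[src["id"]] = src
    return merged
```

Whether this actually returns more than the ~100 unfiltered sources would need to be checked empirically.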
I have an idea.
The News API provides the query search parameter `q` in their `everything` endpoint. The documentation states:
`q` - Keywords or phrases to search for in the article title and body.
The idea is to use single letters in the search query, like `api.get_everything(q='a')`. This returned 1,825,471 results, and we still have 25 more letters to go. There are numbers too: `api.get_everything(q='1')` returned 651,390.
Problem: API calls are restricted to 1,000 per day on a single key, and on the developer plan we can only access the first 100 results. But a good thing is they allow a feature to exclude domains:
`excludeDomains` - A comma-separated string of domains (eg bbc.co.uk, techcrunch.com, engadget.com) to remove from the results.
Now in excludeDomains we can put the previously extracted sources plus the new ones. By doing this we'll be getting 100 different sources for every 5 API calls. I think after 1,000 calls we'll have around 20,000 sources, and within 3-4 days, or using 3-4 new API keys, we'll have all the sources they've got.
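The loop described above could be sketched like this. This is a sketch, assuming the unofficial `newsapi-python` client (whose `get_everything` accepts an `exclude_domains` keyword); `max_calls` is an arbitrary budget, not anything from their docs:

```python
from urllib.parse import urlparse


def domains_from_articles(articles):
    """Pull the base domain out of each article's url field."""
    domains = set()
    for art in articles:
        netloc = urlparse(art["url"]).netloc.lower()
        if netloc.startswith("www."):
            netloc = netloc[4:]
        if netloc:
            domains.add(netloc)
    return domains


def harvest_domains(api, max_calls=50):
    """Repeatedly call /everything, excluding domains already seen."""
    seen = set()
    for _ in range(max_calls):
        resp = api.get_everything(
            q="a",  # broad single-letter query, per the idea above
            exclude_domains=",".join(sorted(seen)),
            page_size=100,
        )
        new = domains_from_articles(resp.get("articles", [])) - seen
        if not new:  # no fresh domains: further calls with this q won't help
            break
        seen |= new
    return seen
```

Each call returns up to 100 stories, so the number of *new* domains per call will vary; the `seen` set does the deduplication.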
@rahulbot What do you think?
Good thinking - but remember that `everything` lists stories, not media sources... so there will be multiple stories per media source, i.e. each call would return somewhere between 1 and 100 media sources. Searching by character is a creative approach, and your idea of using `excludeDomains` is very clever. I'd say try it and see how many media sources you average per call. I'm not sure you even need the character-based filtering - you could just iterate over the sources you have already, but I'm sure there is a max length to `excludeDomains`. Can you filter by country in that endpoint? That might help (if you iterate over countries) because then the list of media would be smaller for each country.
Yes, you're right, the `q` parameter is not necessary. I'll give it a try and let you know how many sources I was able to extract; if I reach the max `excludeDomains` length I'll try using the country filter at that point and keep a set of unique sources.
I just have a small question @rahulbot: are you in charge of or a mentor for GSoC 2021? I shared a proposal and didn't get any feedback or response, and it would be very helpful to get some suggestions/feedback on the proposal.
Thanks
Got ~40 new sources, then:
{'status': 'error', 'code': 'rateLimited', 'message': 'You have made too many requests recently. Developer accounts are limited to 100 requests over a 24 hour period (50 requests available every 12 hours). Please upgrade to a paid plan if you need more requests.'}
We have to figure out a new way: new keys, upgrading the plan, or using temporary email providers like this one (this worked and got a new key instantly).
Extracted 5,731 different sources using a keyword filter (3,000 keywords). The URL length reaches ~16,000 characters (including the domain names to be excluded), and the response to the GET is "HTTP Error 414: Request URL Too Long". And also there is something wrong with their `everything` endpoint, because even after adding `exclude_domains` the excluded domains were still appearing in the results.
I think a larger list of keywords (`q` or `qInTitle`) would help. Code is here and sources. What next @rahulbot?
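One way around the HTTP 414 would be to split the exclusion list into batches that each stay under a URL-length budget, make one call per batch, and deduplicate client-side. A minimal sketch; `max_len=2000` is an assumed budget, not a documented limit:

```python
def chunk_domains(domains, max_len=2000):
    """Split domains into comma-joined strings no longer than max_len.

    Each returned string can be passed as one exclude_domains value,
    keeping the request URL under the server's length limit.
    """
    chunks, current = [], []
    for d in sorted(domains):
        candidate = ",".join(current + [d])
        if current and len(candidate) > max_len:
            chunks.append(",".join(current))
            current = [d]
        else:
            current.append(d)
    if current:
        chunks.append(",".join(current))
    return chunks
```

This doesn't explain the excluded domains reappearing in results, though; that looks like a server-side issue worth reporting to NewsAPI separately.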