
import sources from newsapi.org

Open rahulbot opened this issue 5 years ago • 10 comments

The commercial https://newsapi.org site appears to have ~30k sources categorized by country of publication and language. Might be worth scraping from their sources endpoint and importing that metadata for any sources we don't have.

https://newsapi.org/docs/endpoints/sources

rahulbot avatar Oct 29 '18 17:10 rahulbot

Hi @rahulbot! I want to start contributing with this issue, but I am not able to find any documentation related to the codebase. Could you please guide me on how I can proceed with this? Thanks!

YashJipkate avatar Mar 14 '21 14:03 YashJipkate

Hello - thanks for your offer. At a high level, this task involves fetching all the news sources listed on NewsAPI into a CSV file. After that we can review the CSV for import into our system. That splits the work nicely. A rough outline would look like this:

  1. Create a new fetch-news-api-sources repository
  2. Install the unofficial Python API client and sign up for their free API tier (that gives you 100 API hits per day)
  3. Read about their sources API endpoint, which lists "top" sources within their system
  4. Write a script that calls that endpoint via that unofficial API client and saves results to a CSV file
  5. Poke around and try to determine how to page through results, or make multiple calls to see if you've fetched everything you can

Once that CSV file is in hand we can then format it correctly for ingest into our system (either via our API or our front-end source management tool). That would help us add any news sources we don't have already that are considered "top news".
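Steps 4-5 above could be sketched roughly as follows. This is a minimal sketch, assuming the response shape documented for the /v2/sources endpoint ({"status": "ok", "sources": [...]} with id/name/description/url/category/language/country fields); the sources_to_csv helper name is made up here, and the CSV writing is kept separate from the API call so it can be tested offline:

```python
import csv

# Fields documented for each entry in the /v2/sources response.
SOURCE_FIELDS = ["id", "name", "description", "url", "category", "language", "country"]

def sources_to_csv(payload, path):
    """Write the `sources` list from a decoded /v2/sources JSON payload to a
    CSV file at `path`, returning the number of rows written.

    `payload` looks like:
    {"status": "ok", "sources": [{"id": "bbc-news", "name": "BBC News", ...}]}
    """
    sources = payload.get("sources", [])
    with open(path, "w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=SOURCE_FIELDS, extrasaction="ignore")
        writer.writeheader()
        for src in sources:
            writer.writerow(src)
    return len(sources)
```

With the unofficial Python client, the payload would come from something like NewsApiClient(api_key=...).get_sources(), and the result can then be reviewed before import.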

rahulbot avatar Mar 15 '21 01:03 rahulbot

Hi @rahulbot, following the rough outline you mentioned above, I implemented it here. What else would be needed for this issue?

Thanks

Spectre-ak avatar Apr 02 '21 09:04 Spectre-ak

Great work! Your spreadsheet has about 100 sources, but their homepage says they track "75,000 worldwide sources". I note that the sources endpoint says it returns a "subset of news publishers that top headlines are available from". Do you suppose they only return "top headlines" from these ~100 sources? Is there any way to download their larger list of 75k sources via a different endpoint? This short list is fine, but not as helpful as a longer list would be.

Maybe do something clever like search their everything endpoint for each country name and then pull out source ids, names, and base domains from the urls of stories? They link in the documentation to this page as a "sources index" but I don't see a list of sources on there. Any creative ideas?
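The "pull out source ids, names, and base domains from the urls of stories" idea could be sketched like this, assuming the documented /v2/everything response shape ({"articles": [{"source": {"id", "name"}, "url": ...}]}); the extract_sources helper name is hypothetical:

```python
from urllib.parse import urlparse

def extract_sources(payload):
    """Pull unique (source id, source name, base domain) triples out of a
    decoded /v2/everything response payload."""
    seen = set()
    results = []
    for article in payload.get("articles", []):
        src = article.get("source", {})
        domain = urlparse(article.get("url", "")).netloc
        # Strip a leading "www." so bbc.co.uk and www.bbc.co.uk collapse together.
        if domain.startswith("www."):
            domain = domain[4:]
        key = (src.get("id"), src.get("name"), domain)
        if domain and key not in seen:
            seen.add(key)
            results.append(key)
    return results
```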

rahulbot avatar Apr 02 '21 14:04 rahulbot

I have an idea

The News API provides the search parameter q in their everything endpoint. The documentation states that q is "Keywords or phrases to search for in the article title and body."

The idea is to use single letters in the search query, like api.get_everything(q='a'); this returned 1,825,471 results, and we still have 25 more letters to go. There are numbers too: api.get_everything(q='1') returned 651,390.

Problem: API calls are restricted to 1,000 per day on a single key, and on the developer plan we can only access the first 100 results. But a good thing is they offer a feature to exclude domains:

excludeDomains: A comma-separated string of domains (e.g. bbc.co.uk, techcrunch.com, engadget.com) to remove from the results.

Now in excludeDomains we can put the previously extracted sources plus the new ones. By doing this we'll be getting 100 different sources per 5 API calls. I think after 1,000 calls we'll have 20,000 sources, and within 3-4 days, or using 3-4 new API keys, we'll have all the sources they've got.
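The accumulation loop could be sketched like this, with the API call injected as a function so the logic can be tested offline; fetch_page is a stand-in for a call like api.get_everything(exclude_domains=...) via the unofficial client, plus parsing of one page of results into domains:

```python
def harvest_domains(fetch_page, max_calls=100):
    """Repeatedly call fetch_page(exclude_domains) and accumulate the domains
    it returns, excluding everything seen so far on each subsequent call.

    fetch_page takes a comma-separated excludeDomains string and returns an
    iterable of domains (e.g. parsed from one page of /v2/everything results).
    Stops early once a call yields nothing new, or after max_calls calls.
    """
    seen = set()
    for _ in range(max_calls):
        new = set(fetch_page(",".join(sorted(seen)))) - seen
        if not new:
            break
        seen |= new
    return seen
```

The early stop matters given the daily request quota: once a call returns only already-seen domains, further calls are wasted.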

@rahulbot What do you think?

Spectre-ak avatar Apr 02 '21 15:04 Spectre-ak

Good thinking - but remember that the everything endpoint lists stories, not media sources... so there will be multiple stories per media source, i.e. each call would return somewhere between 1 and 100 media sources. Searching by character is a creative approach, and your idea of using excludeDomains is very clever. I'd say try it and see how many media sources you average per call. I'm not sure you even need the character-based filtering - you could just iterate over the sources you have already, but I'm sure there is a max length to excludeDomains. Can you filter by country in that endpoint? That might help (if you iterate over countries) because then the list of media would be smaller for each country.

rahulbot avatar Apr 02 '21 16:04 rahulbot

Yes, you're right, the q parameter is not necessary. I'll give it a try and let you know how many sources I was able to extract. If I reach the max excludeDomains length, I'll try using the country filter at that point and keep a set of unique sources.

I just have a small question, @rahulbot: are you in charge of or a mentor for GSoC 2021? I shared a proposal and didn't get any feedback or response, and it would be very helpful to get some suggestions/feedback on it.

Thanks

Spectre-ak avatar Apr 02 '21 16:04 Spectre-ak

Got ~40 new sources, then: {'status': 'error', 'code': 'rateLimited', 'message': 'You have made too many requests recently. Developer accounts are limited to 100 requests over a 24 hour period (50 requests available every 12 hours). Please upgrade to a paid plan if you need more requests.'}

We have to figure out a new way: new keys, an upgraded plan, or using temporary email providers like this one (this worked and got a new key instantly).

Spectre-ak avatar Apr 02 '21 19:04 Spectre-ak

Extracted 5,731 different sources using a keyword filter (3,000 keywords). The URL length reaches ~16,000 characters (including the domain names to be excluded), and the response to a GET is "Request URL Too Long. HTTP Error 414. The request URL is too long." Also, there is something wrong with their everything endpoint, because even after adding exclude_domains the excluded ones were still appearing in the results.

I think a larger list of keywords (q or qInTitle) would help. The code is here and the sources here. What next, @rahulbot?
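One way around the HTTP 414, assuming the only hard limit is URL length: split the accumulated domain list into comma-joined chunks that each stay under a character budget, and rotate through the chunks across calls (accepting some duplicate results between chunks, since each call only excludes one chunk). The chunk_domains helper and the 1,500-character budget are assumptions, not anything from the NewsAPI docs:

```python
def chunk_domains(domains, max_len=1500):
    """Split a list of domain names into comma-joined strings, each at most
    max_len characters, suitable as separate excludeDomains values."""
    chunks, current, current_len = [], [], 0
    for d in domains:
        extra = len(d) + (1 if current else 0)  # +1 for the joining comma
        if current and current_len + extra > max_len:
            chunks.append(",".join(current))
            current, current_len = [], 0
            extra = len(d)
        current.append(d)
        current_len += extra
    if current:
        chunks.append(",".join(current))
    return chunks
```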

Spectre-ak avatar Apr 03 '21 19:04 Spectre-ak

Fixed?

temp mail address

sodlouz avatar Aug 31 '22 21:08 sodlouz