
add support for new associated press api

hroberts opened this issue 5 years ago • 24 comments

We just renewed access to the associated press feed, but they are no longer giving access to their old, rss-based feed. Instead, we have to ingest their custom api.

We should build this as a python module mediawords.crawler.download.feed.ap that has a single function called get_new_stories() or similar that goes to the api, downloads the list of the latest 100 stories, then downloads the content of each of those stories, and returns the stories as a list of dicts.

This task only needs to do the work to ingest and format the data. We have a somewhat complicated system on the perl side that does all of the crawler scheduling and integration. We will assign as a separate task the small bit of perl glue code that does the remaining work of sticking the resulting data into the database.

The python module should return a list of dicts with the following fields: url, publish_date, title, description, text. After ten minutes of poking at the api docs, I think we will have to first call the /feeds endpoint to get the list of most recent stories and then fetch a separate url for each story to get its actual content.
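To make the interface concrete, here is a rough sketch of what the module might look like; the two helper functions are hypothetical placeholders for the feed call and the per-story content call, not part of the spec:

# mediawords/crawler/download/feed/ap.py -- rough interface sketch, not the final implementation

from typing import Dict, List


def get_new_stories() -> List[Dict[str, str]]:
    """Fetch the latest AP stories and return them as a list of dicts with the
    keys url, publish_date, title, description, text."""
    stories = []
    for item in _fetch_feed_items():          # hypothetical helper: one call to the feed endpoint
        content = _fetch_item_content(item)   # hypothetical helper: one call per story for its content
        stories.append({
            'url': content['url'],
            'publish_date': content['publish_date'],
            'title': content['title'],
            'description': content['description'],
            'text': content['text'],
        })
    return stories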

For rss feeds, the crawler works on a model of fetching the rss feed as one job and then inserting any new stories as new job downloads. That would be a pain in this case because the ap content downloads would require code to handle api authentication and parsing of the custom ap content format. I think it will be much simpler to have the single get_new_stories() function just download each of the new pieces of content in serial. We should only get a couple hundred stories a day from the ap, so it should be quick to just download the handful we might get for a given feed download.

One problem we ran into with the previous ap integration is that the ap constantly updates its stories for the first few hours after they are published. Our system is not built to handle constant changes to stories (or to have different versions of stories). The old rss feed based system handled this problem by waiting 12 hours before actually downloading any of the urls found in the ap rss feed. We can't do that if we are downloading stories as we discover them, but maybe we can just ask the api for a feed of stories that are 12 hours old?

I will send the api instructions and authentication separately.

hroberts avatar May 15 '19 18:05 hroberts

Thanks. I'm reviewing the API documentation now. Data ingest is my specialty. :)

When we import the module and instantiate an instance, I'm assuming it would make sense to accept some parameters, such as the API key itself, max/min story time, etc. I'll continue reading up on the API docs and incorporate your specs into the module.

pushshift avatar May 15 '19 19:05 pushshift

You should put the api key into the configuration file (mediawords.yml) and access it via mediawords.util.config.get_config().
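For illustration, a minimal sketch of that config lookup, assuming an associated_press section in mediawords.yml with an apikey entry (the exact key names are assumptions):

from mediawords.util.config import get_config


def _get_ap_api_key() -> str:
    # Assumes mediawords.yml contains something like:
    #   associated_press:
    #       apikey: "..."
    config = get_config()
    return config['associated_press']['apikey']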

You should just return all of the stories each time. I will deal with deduping the stories with the existing ap stories in the database on the perl side.

I'm not sure what to do for the url field. I don't think the api will return a public url, but maybe you can figure out how to construct an apnews.com url out of one of the identifiers in the api response?

-hal

hroberts avatar May 15 '19 19:05 hroberts

I'm not sure what to do for the url field. I don't think the api will return a public url, but maybe you can figure out how to construct an apnews.com url out of one of the identifiers in the api response?

I'll take a look at that and see if it is possible.

pushshift avatar May 15 '19 19:05 pushshift

we also have access to associated press support folks if we need to ask questions.

hroberts avatar May 15 '19 19:05 hroberts

My only concern here is that if you want the script to collect all available stories from the feed and don't supply any type of minimum date, there is a possibility of the calls taking a long time if the feed goes back 30+ days. We also have a rate limit on that endpoint so after so many requests, we would have to throttle and wait for a new rate limit window.

I'm looking at the API now and will have more info on how far back it goes, but providing a min cutoff might prevent huge requests each time.

We could also just put in a default look-back window.

Edit: I just noticed your original message said 100 stories. We can go back further than this if needed. Do you just want to keep it at 100 and handle deduping on your end? What if we needed to go back further? (This API endpoint has a scrolling feature.)

pushshift avatar May 15 '19 19:05 pushshift

just get the max 100 you can get from a single page. the crawler will be running this function up to every five minutes depending on how often it returns new stories, so we don't have to delve far back into history. the AP is giving us a separate dump of the stories we missed when our old subscription lapsed.

it's just occurring to me now that even returning 100 stories will be expensive because we have to go fetch each story. ideally we want to only fetch new stories. I think that means I need to pass in a database handle and let you query the stories table to see whether a story with a given guid already exists.

here's a quick snippet of code that will query the stories table to figure out whether a story already exists with the given guid in the AP media source:

guid_exists = db.query(
    "select 1 from stories s join media m using (media_id) "
    "where m.name = 'AP' and s.guid = %(a)s",
    {'a': guid}).hash()
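As a rough sketch of how get_new_stories() might use that query to skip stories already in the database (the function name is illustrative, and the item['altids']['itemid'] lookup is an assumption about where the guid lives in the AP response):

def filter_new_items(db, items):
    # Keep only the feed items whose guid is not already present in the AP media source.
    new_items = []
    for item in items:
        guid = item['altids']['itemid']   # assumption: the AP itemid is used as the story guid
        guid_exists = db.query(
            "select 1 from stories s join media m using (media_id) "
            "where m.name = 'AP' and s.guid = %(a)s",
            {'a': guid}).hash()
        if not guid_exists:
            new_items.append(item)
    return new_items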


hroberts avatar May 15 '19 20:05 hroberts

I've set up a simple class to make API calls to the AP. I fetched one item from the feed. The fields returned are below. There is a uri returned with each item that is a link to another API endpoint to fetch actual content. This API endpoint actually gives a tremendous amount of data including regions associated with the story, categories and topics, etc. It also gives an actual url to the publicly available story. In this case, the url given was: https://apnews.com/88c67ea50aa14ed5aac9a1fc7d987455

Some remaining questions.

  1. Is the url above the url you need returned?

  2. What format do you want the publish_date in? (epoch seconds, ISO 8601, etc.)

  3. For title, description, text -- The API returns a title, headline and extended headline, among other things. What is needed for description and text? Do you want the extended headline mapped to description? For text, are you looking for the article text itself? I'm assuming we're trying to replicate the fields normally supplied by an RSS feed, so I can try to find appropriate substitute fields if those fields aren't available by name directly. I'll keep looking and exploring the API in the meantime.

  4. For guid comparison, I need to take a look at what id among itemid, etag and friendlykey we are presently using. I doubt it is the etag -- it is probably the itemid but I'll need to confirm.

{
  "api_version": "1.1",
  "api_mode": "live",
  "id": "r130QWc34F",
  "method": "/content/feed.GET",
  "org_name": "MIT Media Lab",
  "params": {
    "page_size": 1
  },
  "data": {
    "query": "",
    "updated": "2019-05-15T21:25:40.300Z",
    "current_item_count": 1,
    "next_page": "https://api.ap.org/media/v/content/feed?qt=r130QWc34F&seq=59924135",
    "items": [
      {
        "meta": {},
        "item": {
          "uri": "https://api.ap.org/media/v/content/88c67ea50aa14ed5aac9a1fc7d987455?qt=r130QWc34F&et=1a1aza0c0",
          "altids": {
            "itemid": "88c67ea50aa14ed5aac9a1fc7d987455",
            "etag": "88c67ea50aa14ed5aac9a1fc7d987455_1a1aza0c0",
            "friendlykey": "592204219666"
          },
          "version": 1,
          "type": "text",
          "versioncreated": "2019-05-15T21:20:51Z",
          "firstcreated": "2019-05-15T21:20:51Z",
          "pubstatus": "usable",
          "ednote": "Eds: Updates with details, quotes, background. Adds byline.",
          "signals": [
            "newscontent"
          ],
          "headline": "Crowds protest cuts in federal funding for Brazil schools",
          "bylines": [
            {
              "by": "By DIANE JEANTET",
              "title": "Associated Press"
            }
          ],
          "datelinelocation": {
            "city": "Rio De Janeiro",
            "countrycode": "BRA",
            "countryname": "Brazil",
            "geometry_geojson": {
              "type": "Point",
              "coordinates": [
                -43.18223,
                -22.90642
              ]
            }
          },
          "copyrightnotice": "Copyright 2019 The Associated Press. All rights reserved. This material may not be published, broadcast, rewritten or redistributed.",
          "usageterms": [
            "This content is intended for editorial use only. For other uses, additional clearances may be required."
          ],
          "provider": "AP",
          "infosource": [
            {
              "name": "AP",
              "type": "AP"
            }
          ],
          "renditions": {
            "nitf": {
              "title": "NITF Story Download",
              "rel": "Story",
              "format": "IIM",
              "type": "text",
              "mimetype": "text/xml",
              "fileextension": "xml",
              "words": 441,
              "contentid": "011d9107ebe74a62afd936688eaf4153",
              "href": "https://api.ap.org/media/v/content/88c67ea50aa14ed5aac9a1fc7d987455.1/download?type=text&format=NITF&rid=7d905dd31d3c4188b4b3942a211ad5af&cid=0&fid=011d9107ebe74a62afd936688eaf4153&trf=a1071&qt=r130QWc34F&dt=HJe2Am-chE&et=1a1aza0c0",
              "mediafilterid": "2"
            },
            "anpa": {
              "title": "ANPA Story Download",
              "rel": "Main",
              "format": "ANPA1312",
              "type": "text",
              "mimetype": "application/octet-stream",
              "fileextension": "anpa",
              "mediafilterid": "1",
              "words": 441,
              "contentid": "011d9107ebe74a62afd936688eaf4153",
              "href": "https://api.ap.org/media/v/content/88c67ea50aa14ed5aac9a1fc7d987455.1/download?type=text&format=ANPA&rid=7d905dd31d3c4188b4b3942a211ad5af&cid=0&fid=011d9107ebe74a62afd936688eaf4153&trf=a1071&qt=r130QWc34F&dt=HJe2Am-chE&et=1a1aza0c0"
            }
          }
        }
      }
    ]
  }
}
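As a sketch of how an item like the one above could be mapped onto the five required fields (the headline_extended field name, the apnews.com url construction, and the NITF-to-text conversion are assumptions pending answers to the questions above):

import xml.etree.ElementTree as ET


def _nitf_to_text(nitf_xml: str) -> str:
    # Very rough sketch: join the paragraph text out of the NITF rendition.
    root = ET.fromstring(nitf_xml)
    return "\n\n".join("".join(p.itertext()).strip() for p in root.iter('p'))


def _map_item(item: dict, nitf_xml: str) -> dict:
    return {
        'url': 'https://apnews.com/' + item['altids']['itemid'],   # assumes itemid matches the public apnews.com slug
        'publish_date': item['firstcreated'],
        'title': item['headline'],
        'description': item.get('headline_extended', ''),          # assumed field name for the extended headline
        'text': _nitf_to_text(nitf_xml),
    }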

pushshift avatar May 15 '19 21:05 pushshift

The code is working well. Here is an example of the output it produces (it will give a list, but this is an example of one of the dicts within the list): https://jsonblob.com/880e2e11-778f-11e9-8a3e-13fd4c8d9c41

Let me know if the publish_date needs to be converted. Also, take a look at the content key -- I'm using the complete xml from the nitf content rendition. This may or may not be compatible with the existing raw XML the code / system expects since it may differ from the old AP RSS feed.

If that's the case, we can modify it as needed. I'm signing off for the evening but will pick up tomorrow late morning and clean up the code a bit and then commit to a branch under my name.

Thanks!

pushshift avatar May 16 '19 04:05 pushshift

this looks great.


hroberts avatar May 16 '19 14:05 hroberts

I did a lot of testing of the code yesterday and noticed a few minor issues. Occasionally, a story will have no url associated with it (these stories appear to be things for lotto numbers, etc.). In that situation, the code just skips that particular story. If you would rather have the story returned with an empty string for the url, I can change that behavior.

I also noticed that the extended_headline is sometimes missing. This maps to the description field. In that situation, the story is still returned, but the description key has an empty string for a value.

The code is currently here: https://github.com/pushshift/ap_story_fetcher/blob/master/fetch_associated_press.py

I'm going to rename it and put it in the appropriate place, remove the api key environment variable, and use the method you specified. We should be able to start testing once that is done.

I did leave the db handle logic in place since, if this is called every five minutes, there will likely be duplicates between fetch jobs, so this limits the number of api calls by avoiding refetching the same stories. A call to the get_new_stories() method generally takes 30 seconds or less to return the stories.

pushshift avatar May 17 '19 09:05 pushshift

Pull request for new code and the added associated_press section in the config file. https://github.com/berkmancenter/mediacloud/pull/588

pushshift avatar May 17 '19 10:05 pushshift

I fixed a minor issue with a line of code that should have been removed with the last commit. Tested locally and it works well and appears to handle all errors gracefully.

I was going to add the ability to paginate backwards if no guids from the first pass of fetching 100 stories were seen in the DB. Unfortunately, after adding the code and testing, I discovered the AP API does not appear to have a "previous_page" parameter for the feed endpoint as it does for the search endpoint. The "next_page" parameter for the feed endpoint actually opens a long-poll connection to await new stories that hit the feed.

After reviewing the documentation here (https://api.ap.org/media/v/docs/api/Search-and-Feed/#feed), I don't see any easy way to page backwards via the feed endpoint. If we have a support email address, it may be worth asking them if this feature is possible so that we can grab all stories from the feed if the next fetch is delayed for whatever reason.

It may be a situation where the feed endpoint only holds up to 100 stories and to go back further, the search endpoint would need to be used.

This latest commit should work. Please let me know if you encounter any issues. Passing a DB handle is still helpful, however, since it prevents previously seen stories from being refetched and returned -- so it's worth passing to the method.

I have a copy of the code which handles pagination and can commit it at a later point if we find out if the feed endpoint has the capability of going back further.

pushshift avatar May 20 '19 16:05 pushshift

From my preliminary review of the search endpoint, it looks very robust and appears capable of doing exactly what we need to ingest more history if needed. I'm comparing it to the feed endpoint to make sure that the fields needed are available but so far it looks promising.

pushshift avatar May 20 '19 17:05 pushshift

I am adding a search method to the AssociatedPressAPI class. Here is my suggestion to maximize the amount of content we pull:

When the get_new_stories method is called, the first call made will be to the feeds endpoint. Since the maximum amount of content there is 100 stories, we can start with the feeds API call. The next call can then be to the search endpoint using the descending sort option. The get_new_stories method will keep track of all guids seen so that it doesn't request content for guids already found via the feeds endpoint or guids already in the database.

We can then page backwards via the search endpoint until either the max_stories parameter is reached or until guids are found within the mediacloud database.

Calls to the feed and search endpoint are fast and inexpensive -- most of the time taken by the script is looking up the actual story content. Using this method, we should get the maximum publicly available stories with each fetch job and also have the ability to get older stories in the event that a future fetch job fails or there is a longer than normal delay between fetch jobs.

I've already added the search endpoint and I am now adding the logic to handle using that endpoint within the get_new_stories method.
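A sketch of that flow; the feed()/search() wrappers, their parameters, and the helper functions are assumptions about the class rather than the committed code:

def get_new_stories(db, api, max_stories: int = 1000) -> list:
    seen_guids = set()
    stories = []

    def collect(items) -> bool:
        # Returns True once we reach guids that are already in the database.
        hit_known = False
        for item in items:
            guid = item['altids']['itemid']
            if guid in seen_guids:
                continue
            seen_guids.add(guid)
            if _guid_in_db(db, guid):                        # hypothetical helper: the query shown earlier
                hit_known = True
                continue
            stories.append(_fetch_and_map_story(api, item))  # hypothetical helper: content fetch + field mapping
        return hit_known

    # 1. Start with the feed endpoint (max 100 stories).
    done = collect(api.feed(page_size=100))

    # 2. Page backwards via the search endpoint, newest first, until we hit
    #    known guids or reach max_stories.
    page = 1
    while not done and len(stories) < max_stories:
        items = api.search(sort='versioncreated:desc', page=page, page_size=100)
        if not items:
            break
        done = collect(items)
        page += 1

    return stories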

pushshift avatar May 20 '19 18:05 pushshift

I have added support for the search endpoint and also added rate limit control to prevent going over the limit and hitting errors when a large request is made (max_stories set to some large value).

I've also broken out the code to make it a bit easier to write tests against (which I'll be starting shortly). I am still working on some issues using the custom ua requests object -- I left the typical requests module code in place for the time being since that is working very well.

I also addressed some previous points that @hroberts made (fixed some type checking issues, etc.)

Sometimes when story content is requested, the API gives back a 403 (forbidden) response. In that situation, the story is skipped. If that behavior needs to be modified, let me know (the change would be fairly easy to make).
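As a rough illustration of the skip-on-403 behavior together with a simple throttle (the apikey parameter name and the fixed-interval throttle are assumptions, not the PR's rate-limit code):

import time

import requests


def _fetch_content(url: str, api_key: str, min_interval: float = 1.0):
    # Space requests out to stay under the rate limit, and skip any story the
    # API refuses to serve with a 403.
    time.sleep(min_interval)
    response = requests.get(url, params={'apikey': api_key}, timeout=30)
    if response.status_code == 403:
        return None
    response.raise_for_status()
    return response.content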

Also, if you would like a README or some basic documentation, please let me know and I can work on adding that along with the tests.

The changes can be reviewed here: https://github.com/berkmancenter/mediacloud/pull/588/commits/fede536b6ff55c2ab3e1ba554d8db166c127b82e

pushshift avatar May 23 '19 13:05 pushshift

nice work!

A couple things, with apologies for taking so long to fiddle with getting the integration right. I think we're very close to being able to plug this in.

  • I'm not sure the paging mechanism you describe is going to work well. I don't think there's a use case for 'give me up to 100 stories'. Instead, I think we will want to do one of two things: either get all of the stories that we can or get all of the stories we don't already have (arguably we just want the latter, since the only case for 'get all the stories' is when we don't have any stories yet). If the stories are consistently sorted in a way that this works, it would be great to say 'get all the stories until we find a story that we already have'. Alternatively, we could say 'get all of the stories until we find one that is at least 24 hours old', which should be a good equivalent assuming the stories are sorted by date. For any of these, it might make sense to have some sort of backstop of max requests so that the request doesn't go crazy (or it might not, if the requests are fast enough and the 30 day api limit is enough of a backstop). Let me know if I'm missing something here.

  • The default use for this is that we only want to collect stories that have a publish date that is at least 12 hours old. This is so that we can be reasonably sure that we have a final or near-final version of each story. We can do this by telling the search endpoint that we want stories starting from 12 hours ago (I think). Or we can just collect all of the stories but only pass on the ones that have a >12 hour old publish date.

hroberts avatar May 23 '19 15:05 hroberts

@hroberts -- Let me make sure I understand and then suggest the best method from my point of view. First, the stories have a first created date and a version created date -- From working with the API, I believe the first created date reflects when the first version of the story was created. The version created date should reflect the most recent edit date.

Initially I thought going back and stopping when we find an id that we had previously seen would be the best approach, but then I realized that an older story could get knocked back to the top of the list because it was recently edited / had a new version. So stopping once we see a previous id might lead to situations where we stop prematurely.

Ideally I could compare the first created date for a story along with the ids from the database so I could get a sense for the max initial created date that we have in the database -- then I would know with confidence where we are in the timeline since using ids only can lead to some ambiguity.

  1. Do we store the story creation date in the DB itself (the first version date of the story), or do we store whatever date was present when we first encountered a new story (for instance, the first time we see a story that is currently version 5, etc.)? If we have those dates in the DB, I would know with high confidence about where we should stop when moving backwards through stories.

  2. If we don't have easy access to those dates and we have to rely solely on previously seen ids, then another approach would be to keep paging back until X% of the ids we see are in the database. For instance, if I page back 6 times via the search API and 90% of the ids are in the database, we could reasonably assume we've gone back far enough. I'd like to say 100% but there may be situations where an older story in AP's system pops up that wasn't available in the past -- so I'm not sure if we could depend on stopping when we hit 100% id coverage (stopping at 100% would be ideal, I just don't know if we'd ever actually see all ids in our system if we paged back far enough).

Either way, making this modification to the script will be trivial since the bulk of the code to handle moving backwards is already there.

TL;DR: If we have access to dates and I can use the DB handle to see the max story time in the DB, that would probably be ideal. I could go back one further page just to make sure and then wrap that call up.

pushshift avatar May 23 '19 15:05 pushshift

One additional comment for a line of code:

publish_date = content['firstcreated'] # There is a first created date and a version created date (last edit datetime?)

Should we return the first created date or the version date for the story? I believe first created is the correct choice here?
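If the date needs converting, a minimal parsing sketch based on the firstcreated values in the example item above (the output format is an assumption):

from datetime import datetime


def _parse_publish_date(content: dict) -> str:
    # firstcreated values look like "2019-05-15T21:20:51Z" in the example item.
    dt = datetime.strptime(content['firstcreated'], '%Y-%m-%dT%H:%M:%SZ')
    return dt.strftime('%Y-%m-%d %H:%M:%S')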

pushshift avatar May 23 '19 15:05 pushshift

I think we should use the create date as our publish date. publish dates are impossible to get exactly right, so we just try our best to get them within a day. the edit date could potentially be several days after the initial publication.

I think we should just keep paging back until we find a story with a date at least 24 hours old. for the purposes of deciding how far back to page, we should use whichever of the create date or the edited date is used to sort the results in the API response.

our entire crawler is based on the idea that we will collect RSS feeds often enough that we won't lose stories that scroll off the end. for very active feeds that means we collect them roughly every five minutes. if the crawler goes down at any point, we start seriously risking data loss after a couple of hours. if we ever have to have a longer downtime, we have the capability to run a backup crawler on AWS and then merge those stories into the core database later.

all of this means that our platform is already based on the idea of making sure we regularly collect each feed. so I think the more complicated approach you describe above is smart but would add more complexity than necessary. I think the cleaner solution is just to make the ap feed behave basically like an RSS feed with a day's worth of stories.


hroberts avatar May 23 '19 16:05 hroberts

Sounds good. Thanks!

pushshift avatar May 23 '19 17:05 pushshift

I removed the max_stories logic and replaced it with max_lookback (seconds). It defaults to going back until it finds an item over the limit. I left this configurable in the rare circumstance that we would need to look back further than the default of 86,400 seconds. Since it has a default, it doesn't have to be set and will work as is.

When checking the items from the feed endpoint (which has a max of 100 items), I don't compare each item to the max_lookback amount as it is processed. I wait until the first entire batch is fetched and use the oldest value from that batch. The reasoning is that if a very old story gets a revision, it would show up among the first items in the feed list because we use the original publish date, and comparing item by item would make the script terminate too quickly. (With the default lookback, we'll essentially always get the feed stories and usually 100-200 search endpoint stories.)

The script usually completes within a minute.
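A sketch of that stop condition; the helper names and the exact comparison are illustrative:

import time
from datetime import datetime, timezone


def _to_epoch(item: dict) -> float:
    # firstcreated values look like "2019-05-15T21:20:51Z"
    dt = datetime.strptime(item['firstcreated'], '%Y-%m-%dT%H:%M:%SZ')
    return dt.replace(tzinfo=timezone.utc).timestamp()


def _should_keep_paging(batch: list, max_lookback: int = 86400) -> bool:
    # Compare only the oldest item of a fetched batch against the lookback
    # window, so a single old story bumped by a recent revision does not end
    # the crawl prematurely.
    cutoff = time.time() - max_lookback
    oldest = min(_to_epoch(item) for item in batch)
    return oldest > cutoff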

pushshift avatar May 23 '19 19:05 pushshift

that sounds great. once you get the unit tests done, I will integrate. excited to get this into production.


hroberts avatar May 23 '19 21:05 hroberts

I have made a commit with some basic unit tests for the AP Fetcher, including fixture data. I went through some examples in the current repo for guidance and also followed the coding best practices to construct the tests. I tested with nosetests3, and all five tests passed.

I'm not sure how you handle fixture data, so if you have a set location for fixture data for mocking, I can move the files there. There will probably need to be a few minor edits but hopefully this style parallels the type of unit tests already in place.
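For reference, a minimal sketch of that kind of mocked test; the patched method name, the inline fixture item, and the argument-free get_new_stories() call are assumptions about the module internals, not the committed fixtures or tests:

from unittest import TestCase
from unittest.mock import patch

import mediawords.crawler.download.feed.ap as ap

FIXTURE_ITEM = {
    'altids': {'itemid': '88c67ea50aa14ed5aac9a1fc7d987455'},
    'headline': 'Crowds protest cuts in federal funding for Brazil schools',
    'firstcreated': '2019-05-15T21:20:51Z',
}


class TestAPFetcher(TestCase):

    # In a real test the per-story content fetch (and the db handle the method
    # takes) would also be mocked with fixture data.
    @patch.object(ap.AssociatedPressAPI, 'feed', return_value=[FIXTURE_ITEM])
    def test_get_new_stories_fields(self, mock_feed):
        stories = ap.get_new_stories()
        for story in stories:
            for field in ('url', 'publish_date', 'title', 'description', 'text'):
                self.assertIn(field, story)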

pushshift avatar May 28 '19 13:05 pushshift

I have added a get_and_add_new_stories function to the ap.py module and added a super simple crawl-ap.py daemon that just calls that function, sleeps for five minutes, then repeats forever.

I think all we need is the min_lookback option for get_new_stories, and we can turn this on live.
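A sketch of what that loop amounts to (whether get_and_add_new_stories() takes arguments is an assumption):

#!/usr/bin/env python3
# Sketch of the crawl-ap.py loop described above, not the committed script.

import time

from mediawords.crawler.download.feed.ap import get_and_add_new_stories


def main():
    while True:
        get_and_add_new_stories()   # fetch recent AP stories and add them to the database
        time.sleep(300)             # sleep five minutes, then repeat


if __name__ == '__main__':
    main()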

hroberts avatar Jun 06 '19 00:06 hroberts