
stories_public/list empty `story_tags` when more than 100 rows requested

Open pypt opened this issue 3 years ago • 2 comments

(Moved from #725.)

More confusingly, requesting more than 100 rows per page seems to make the story_tags disappear from the results.

This call returns a story (media_id 105831) with story_tags on it:

mc.storyList('robot', mc.dates_as_query_clause(dt.date(2020,8,2), dt.date(2020,8,3)), rows=100)[0]

But this call, with rows=200, returns the same story with no story_tags on it:

mc.storyList('robot', mc.dates_as_query_clause(dt.date(2020,8,2), dt.date(2020,8,3)), rows=200)[0]
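
Until this is fixed, one possible workaround is to keep each request at 100 rows or fewer and page through the results. The sketch below is only an illustration: `fetch_all_stories` and `fetch_page` are hypothetical names, and it assumes that `storyList` accepts a `last_processed_stories_id` argument and returns stories ordered by `processed_stories_id`. A fake fetcher stands in for the API so the snippet runs without a key.

```python
def fetch_all_stories(fetch_page, total, page_size=100):
    """Collect up to `total` stories in pages of at most `page_size` rows.

    `fetch_page(last_id, rows)` is a hypothetical callable that would wrap
    something like:
        mc.storyList(q, fq, last_processed_stories_id=last_id, rows=rows)
    """
    stories = []
    last_id = 0
    while len(stories) < total:
        rows = min(page_size, total - len(stories))
        page = fetch_page(last_id, rows)
        if not page:
            break  # no more results
        stories.extend(page)
        # Resume after the last story seen (assumed paging key).
        last_id = page[-1]["processed_stories_id"]
    return stories


# Fake data and fetcher, standing in for the real API (not real output).
FAKE_STORIES = [{"processed_stories_id": i} for i in range(1, 251)]
requested_rows = []


def fake_fetch_page(last_id, rows):
    requested_rows.append(rows)
    newer = [s for s in FAKE_STORIES if s["processed_stories_id"] > last_id]
    return newer[:rows]


result = fetch_all_stories(fake_fetch_page, total=200)
print(len(result), requested_rows)
```

With the real client, `fetch_page` could be a small lambda around `mc.storyList`, so that no single request asks for more than 100 rows.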

pypt, Sep 29 '20 12:09

Prep:

>>> import mediacloud.api, json, datetime as dt
>>> import pprint
>>> pp = pprint.PrettyPrinter(indent=4)
>>> mc = mediacloud.api.MediaCloud('YOUR_KEY')
>>> tag_sets_id = mediacloud.tags.TAG_SET_NYT_THEMES_VERSION
>>> q = '*'
>>> fq = mc.dates_as_query_clause(dt.date(2020,8,20), dt.date(2020,8,24))

99 rows - story_tags looks okay:

>>> pp.pprint(mc.storyList('robot', mc.dates_as_query_clause(dt.date(2020,8,2), dt.date(2020,8,3)), rows=99)[0])
{   'ap_syndicated': False,
    'collect_date': '2020-03-09 18:44:54.488650',
    'feeds': None,
    'guid': 'https://www.sinembargo.mx/08-02-2020/3727176',
    'language': 'es',
    'media_id': 105831,
    'media_name': 'sinembargo.mx',
    'media_url': 'http://sinembargo.mx/#spider',
    'metadata': {   'date_guess_method': {   'stories_id': 1543287159,
                                             'tag': 'guess_by_unknown',
                                             'tag_set': 'date_guess_method',
                                             'tag_sets_id': 508,
                                             'tags_id': 50741492},
                    'extractor_version': {   'stories_id': 1543287159,
                                             'tag': 'readability-lxml-0.7',
                                             'tag_set': 'extractor_version',
                                             'tag_sets_id': 1354,
                                             'tags_id': 81092444},
                    'geocoder_version': None,
                    'nyt_themes_version': None},
    'processed_stories_id': 1950370689,
    'publish_date': '2020-08-02 00:00:00',
    'stories_id': 1543287159,
    'story_tags': [   {   'stories_id': 1543287159,
                          'tag': 'guess_by_unknown',
                          'tag_set': 'date_guess_method',
                          'tag_sets_id': 508,
                          'tags_id': 50741492},
                      {   'stories_id': 1543287159,
                          'tag': 'readability-lxml-0.7',
                          'tag_set': 'extractor_version',
                          'tag_sets_id': 1354,
                          'tags_id': 81092444}],
    'title': 'Penaut, el robot que alimenta a personas en cuarentena por '
             'Coronavirus en un hotel de China',
    'url': 'https://www.sinembargo.mx/08-02-2020/3727176',
    'word_count': None}

100 rows - story_tags looks okay:

>>> pp.pprint(mc.storyList('robot', mc.dates_as_query_clause(dt.date(2020,8,2), dt.date(2020,8,3)), rows=100)[0])
{   'ap_syndicated': False,
    'collect_date': '2020-03-09 18:44:54.488650',
    'feeds': None,
    'guid': 'https://www.sinembargo.mx/08-02-2020/3727176',
    'language': 'es',
    'media_id': 105831,
    'media_name': 'sinembargo.mx',
    'media_url': 'http://sinembargo.mx/#spider',
    'metadata': {   'date_guess_method': {   'stories_id': 1543287159,
                                             'tag': 'guess_by_unknown',
                                             'tag_set': 'date_guess_method',
                                             'tag_sets_id': 508,
                                             'tags_id': 50741492},
                    'extractor_version': {   'stories_id': 1543287159,
                                             'tag': 'readability-lxml-0.7',
                                             'tag_set': 'extractor_version',
                                             'tag_sets_id': 1354,
                                             'tags_id': 81092444},
                    'geocoder_version': None,
                    'nyt_themes_version': None},
    'processed_stories_id': 1950370689,
    'publish_date': '2020-08-02 00:00:00',
    'stories_id': 1543287159,
    'story_tags': [   {   'stories_id': 1543287159,
                          'tag': 'guess_by_unknown',
                          'tag_set': 'date_guess_method',
                          'tag_sets_id': 508,
                          'tags_id': 50741492},
                      {   'stories_id': 1543287159,
                          'tag': 'readability-lxml-0.7',
                          'tag_set': 'extractor_version',
                          'tag_sets_id': 1354,
                          'tags_id': 81092444}],
    'title': 'Penaut, el robot que alimenta a personas en cuarentena por '
             'Coronavirus en un hotel de China',
    'url': 'https://www.sinembargo.mx/08-02-2020/3727176',
    'word_count': None}

101 rows - story_tags is empty:

>>> pp.pprint(mc.storyList('robot', mc.dates_as_query_clause(dt.date(2020,8,2), dt.date(2020,8,3)), rows=101)[0])
{   'ap_syndicated': False,
    'collect_date': '2020-03-09 18:44:54.488650',
    'feeds': None,
    'guid': 'https://www.sinembargo.mx/08-02-2020/3727176',
    'language': 'es',
    'media_id': 105831,
    'media_name': 'sinembargo.mx',
    'media_url': 'http://sinembargo.mx/#spider',
    'metadata': {   'date_guess_method': None,
                    'extractor_version': None,
                    'geocoder_version': None,
                    'nyt_themes_version': None},
    'processed_stories_id': 1950370689,
    'publish_date': '2020-08-02 00:00:00',
    'stories_id': 1543287159,
    'story_tags': [],
    'title': 'Penaut, el robot que alimenta a personas en cuarentena por '
             'Coronavirus en un hotel de China',
    'url': 'https://www.sinembargo.mx/08-02-2020/3727176',
    'word_count': None}
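
To check whether the whole response is affected or only some stories, a small helper along these lines might be useful. It is a sketch: `stories_missing_tags` is a hypothetical name, and it only assumes that each story is a dict with `stories_id` and `story_tags` keys, as in the output above. The sample data is made up to mirror that shape.

```python
def stories_missing_tags(stories):
    """Return the stories_id of every story whose story_tags is empty or absent."""
    return [s["stories_id"] for s in stories if not s.get("story_tags")]


# Stand-in data shaped like the responses above (not real API output).
sample = [
    {"stories_id": 1, "story_tags": [{"tag": "guess_by_unknown"}]},
    {"stories_id": 2, "story_tags": []},
    {"stories_id": 3},
]
missing = stories_missing_tags(sample)
print(missing)
```

Against the live API this could be called as `stories_missing_tags(mc.storyList(..., rows=101))` and compared with the same call using `rows=100`.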

pypt, Sep 29 '20 12:09

I think this natatime() call could be the one to blame, but I can't figure out how:

https://github.com/mediacloud/backend/blob/12a0c0e896bb6979469248d43f4a7f1ec23c3d8e/apps/webapp-api/src/perl/MediaWords/Controller/Api/V2/StoriesBase.pm#L279-L302
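
For readers unfamiliar with it, `natatime` (from Perl's List::MoreUtils) returns an iterator that hands back the list n items at a time. The Python sketch below mimics that chunking and shows one purely hypothetical way a chunked lookup could produce the symptom seen here: fine at 100 rows, tags missing from the first story at 101. It is not a diagnosis of the linked Perl code; `attach_tags_buggy`, `attach_tags_fixed`, and `lookup_tags` are invented for illustration.

```python
from itertools import islice


def natatime(n, items):
    """Yield successive lists of at most n items, similar to Perl's natatime."""
    it = iter(items)
    while True:
        chunk = list(islice(it, n))
        if not chunk:
            return
        yield chunk


def attach_tags_buggy(story_ids, lookup_tags, chunk_size=100):
    """Hypothetical bug: each chunk's result replaces the previous one."""
    tags_by_story = {}
    for chunk in natatime(chunk_size, story_ids):
        tags_by_story = lookup_tags(chunk)  # overwrites earlier chunks
    return tags_by_story


def attach_tags_fixed(story_ids, lookup_tags, chunk_size=100):
    """Same loop, but results from every chunk are merged."""
    tags_by_story = {}
    for chunk in natatime(chunk_size, story_ids):
        tags_by_story.update(lookup_tags(chunk))
    return tags_by_story


def fake_lookup(chunk):
    # Stand-in for a database query; every story gets one made-up tag.
    return {story_id: ["some_tag"] for story_id in chunk}


ids_100 = list(range(1, 101))
ids_101 = list(range(1, 102))
print(1 in attach_tags_buggy(ids_100, fake_lookup))
print(1 in attach_tags_buggy(ids_101, fake_lookup))
print(1 in attach_tags_fixed(ids_101, fake_lookup))
```

If the real code has a problem of this general kind, the first story would keep its tags only when everything fits in a single chunk, which would be consistent with the 100 versus 101 boundary above.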

pypt, Sep 29 '20 12:09