backend
recent stories missing tags (not getting processed fully?)
I noticed that recent stories don't have any tags on them. Perhaps some services aren't running as we transition?
import datetime as dt
import mediacloud.api
import mediacloud.tags

mc = mediacloud.api.MediaCloud(MY_API_KEY)  # MY_API_KEY is a placeholder for your own key
q = '*'
fq = mc.dates_as_query_clause(dt.date(2020,8,20), dt.date(2020,8,24))
tag_sets_id = mediacloud.tags.TAG_SET_NYT_THEMES_VERSION
# all stories
total = mc.storyCount(q, fq)['count']
# stories with nyt themes
with_themes = sum(t['count'] for t in mc.storyTagCount(q, fq, tag_sets_id=tag_sets_id))
print("{:.2%} stories have been processed for themes".format(with_themes/total))
This prints out that just 36% of stories between 8/20 and 8/24 have been processed by the theme engine. Of course we can go back and reprocess them, but this will skew results people see in certain Explorer and Topic Mapper widgets.
If I run the same thing with mediacloud.tags.TAG_SET_GEOCODER_VERSION to see how many have been run through CLIFF, I get the same result: 36%.
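To make the check easy to repeat for any tag set, here is a minimal sketch that wraps the calculation above in a helper. The function name tag_set_coverage is hypothetical and not part of the mediacloud client. It only uses the storyCount and storyTagCount calls shown above, and it assumes each story carries at most one tag from a version tag set, so that summing the tag counts does not double count stories.

```python
def tag_set_coverage(mc, q, fq, tag_sets_id):
    """Return the fraction of matching stories tagged within one tag set.

    mc is any client exposing storyCount and storyTagCount as used above.
    Assumes at most one tag per story in the tag set (hypothetical helper).
    """
    total = mc.storyCount(q, fq)['count']
    if total == 0:
        # No stories matched the query, so avoid dividing by zero.
        return 0.0
    tagged = sum(t['count'] for t in mc.storyTagCount(q, fq, tag_sets_id=tag_sets_id))
    return tagged / total
```

With a real client this could be called as tag_set_coverage(mc, '*', fq, mediacloud.tags.TAG_SET_GEOCODER_VERSION) and formatted with "{:.2%}".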
More confusingly, requesting more than 100 rows per page seems to make story_tags disappear from the results.
This code returns a story 105831 with story_tags on it:
mc.storyList('robot', mc.dates_as_query_clause(dt.date(2020,8,2), dt.date(2020,8,3)), rows=100)[0]
But this call, with rows=200, returns the same story with NO story_tags on it:
mc.storyList('robot', mc.dates_as_query_clause(dt.date(2020,8,2), dt.date(2020,8,3)), rows=200)[0]
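To narrow down where the tags go missing, a rough sketch like the following could count untagged stories for several page sizes. Both function names are hypothetical helpers, not part of the mediacloud client. The sketch assumes storyList returns a list of dicts and that a story without tags has a missing or empty story_tags entry.

```python
def count_untagged(stories):
    """Count stories whose story_tags entry is missing or empty."""
    return sum(1 for s in stories if not s.get('story_tags'))


def untagged_by_page_size(mc, q, fq, page_sizes=(100, 200)):
    """Map each rows value to the number of untagged stories in that page.

    mc is any client exposing storyList(q, fq, rows=...) as used above.
    """
    return {rows: count_untagged(mc.storyList(q, fq, rows=rows))
            for rows in page_sizes}
```

If the first 100 stories are untagged only when rows is above 100, that would point at the page size rather than at the stories themselves.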
Moved the second complaint to #729.