recent stories missing tags (not getting processed fully?)

Open rahulbot opened this issue 3 years ago • 2 comments

I noticed that recent stories don't have any tags on them. Perhaps some services aren't running as we transition?

import datetime as dt
import mediacloud.api
import mediacloud.tags

mc = mediacloud.api.MediaCloud(MY_API_KEY)  # MY_API_KEY is a placeholder for your own key
q = '*'
fq = mc.dates_as_query_clause(dt.date(2020,8,20), dt.date(2020,8,24))
tag_sets_id = mediacloud.tags.TAG_SET_NYT_THEMES_VERSION
# all stories
total = mc.storyCount(q, fq)['count']
# stories with nyt themes
with_themes = sum(t['count'] for t in mc.storyTagCount(q, fq, tag_sets_id=tag_sets_id))
print("{:.2%} stories have been processed for themes".format(with_themes/total))

This prints out that just 36% of stories between 8/20 and 8/24 have been processed by the theme engine. Of course we can go back and reprocess them, but this will skew results people see in certain Explorer and Topic Mapper widgets.

If I run the same thing with mediacloud.tags.TAG_SET_GEOCODER_VERSION to see how many have been run through CLIFF, I get the same result - 36%.
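
To make the comparison across tag sets easier to repeat, the check above could be wrapped in a small helper. This is only a sketch: `tag_coverage` is a hypothetical name, and it assumes `mc` is a client exposing `storyCount` and `storyTagCount` as used in the snippet above.

```python
def tag_coverage(mc, q, fq, tag_sets_id):
    """Return the fraction of matching stories that carry a tag from tag_sets_id.

    Hypothetical helper: `mc` is assumed to behave like the client used in
    this issue (storyCount returns {'count': N}; storyTagCount returns a list
    of dicts that each have a 'count').
    """
    total = mc.storyCount(q, fq)['count']
    if total == 0:
        # No stories matched, so avoid dividing by zero.
        return 0.0
    tagged = sum(t['count'] for t in mc.storyTagCount(q, fq, tag_sets_id=tag_sets_id))
    return tagged / total
```

Calling it once with TAG_SET_NYT_THEMES_VERSION and once with TAG_SET_GEOCODER_VERSION, using the same q and fq, should give the two percentages side by side.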

rahulbot avatar Sep 03 '20 18:09 rahulbot

More confusingly - asking for a page of more than 100 rows seems to make the story_tags disappear from the results.

This code returns a story 105831 with story_tags on it:

mc.storyList('robot', mc.dates_as_query_clause(dt.date(2020,8,2), dt.date(2020,8,3)), rows=100)[0]

But this call, with rows=200, returns the same story with NO story_tags on it:

mc.storyList('robot', mc.dates_as_query_clause(dt.date(2020,8,2), dt.date(2020,8,3)), rows=200)[0]
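
A side-by-side check of the two calls might look like the sketch below. `story_tags_by_rows` is a hypothetical name, and it assumes each story returned by `storyList` is a dict with a `stories_id` key and an optional `story_tags` list.

```python
def story_tags_by_rows(mc, q, fq, row_sizes=(100, 200)):
    """Map each rows value to (stories_id, number of story_tags) for the first story.

    Hypothetical helper: `mc` is assumed to expose storyList(q, fq, rows=...)
    as in the calls above. A rows value that returns no stories maps to None.
    """
    result = {}
    for rows in row_sizes:
        stories = mc.storyList(q, fq, rows=rows)
        if not stories:
            result[rows] = None
            continue
        first = stories[0]
        # Treat a missing or null story_tags field as "no tags".
        tags = first.get('story_tags') or []
        result[rows] = (first['stories_id'], len(tags))
    return result
```

If the behavior described here holds, the same stories_id would show a non-zero tag count for rows=100 and zero for rows=200.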

rahulbot avatar Sep 03 '20 19:09 rahulbot

Moved the second complaint to #729.

pypt avatar Sep 29 '20 12:09 pypt