Source metadata (e.g. end date and num_stories_X) incorrect

Open dsjen opened this issue 3 years ago • 3 comments

We've found a number of inconsistencies relating to end dates in source manager:

  • https://github.com/mitmedialab/MediaCloud-Web-Tools/issues/1953
  • https://github.com/mitmedialab/MediaCloud-Web-Tools/issues/1991

I think I've zeroed in on where the problem may lie. The mediaHealth endpoint returns a different end date from storyCount. For example:

mc.mediaHealth(1363086)
{'coverage_gaps': 1,
 'coverage_gaps_list': [{'expected_sentences': 569.03,
   'expected_stories': 2.69,
   'media_id': 1363086,
   'num_sentences': 102.71,
   'num_stories': 3.0,
   'stat_week': '2020-02-10 00:00:00-05:00'}],
 'end_date': '2020-02-17 00:00:00-05:00',
 'expected_sentences': 569.03,
 'expected_stories': 2.69,
 'has_active_feed': False,
 'is_healthy': False,
 'media_health_id': 1115839766,
 'media_id': 1363086,
 'num_sentences': 0,
 'num_sentences_90': 0,
 'num_sentences_w': 0,
 'num_sentences_y': 224.09,
 'num_stories': 0,
 'num_stories_90': 0,
 'num_stories_w': 0,
 'num_stories_y': 1.44,
 'start_date': '2019-02-11 00:00:00-05:00'}

fq='publish_day:[2010-01-01T00:00:00Z TO 2020-09-23T00:00:00Z]'
q='media_id:1363086 AND NOT tags_id_stories:8875452'
mc.storyCount(solr_query=q, solr_filter=fq, split=True)

{'counts': [{'count': 1, 'date': '2019-02-13 00:00:00'},
  {'count': 1, 'date': '2019-02-15 00:00:00'},
  {'count': 1, 'date': '2019-02-20 00:00:00'},
  ...
  {'count': 9, 'date': '2020-09-18 00:00:00'},
  {'count': 1, 'date': '2020-09-20 00:00:00'},
  {'count': 3, 'date': '2020-09-21 00:00:00'},
  {'count': 7, 'date': '2020-09-22 00:00:00'}]}

Note that media health reports an end date of 2020-02-17 00:00:00-05:00 and num_stories_90 = 0, while the final date in the split story count is 2020-09-22 00:00:00, roughly seven months later.
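To make the mismatch easy to re-check, here is a minimal sketch that compares the two responses. It assumes the response shapes shown above; `health_lag_days` is a hypothetical helper, not part of the Media Cloud client, and it compares calendar dates only, ignoring the -05:00 offset on the media health timestamp.

```python
from datetime import datetime


def health_lag_days(health, story_counts):
    """Days between the media health end_date and the last day with stories.

    Hypothetical helper: `health` is shaped like the mc.mediaHealth() output
    above, `story_counts` like the mc.storyCount(..., split=True) output.
    Assumption: only calendar dates are compared; time zones are ignored.
    """
    health_end = datetime.fromisoformat(health["end_date"]).date()
    last_story_day = max(
        datetime.fromisoformat(entry["date"]).date()
        for entry in story_counts["counts"]
    )
    return (last_story_day - health_end).days


# Values copied from the responses in this issue (counts list truncated).
health = {"end_date": "2020-02-17 00:00:00-05:00", "num_stories_90": 0}
counts = {"counts": [{"count": 3, "date": "2020-09-21 00:00:00"},
                     {"count": 7, "date": "2020-09-22 00:00:00"}]}

print(health_lag_days(health, counts))  # 218
```

A positive result means storyCount has stories dated after the media health end date, which is the situation described here.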

dsjen, Sep 23 '20 14:09

The media_health data is generated by a daily cron job. It must not be running. I will take a look.

-hal

hroberts, Sep 23 '20 16:09

The media health job has finished running and seems to have caught the data up. Can you please take a look and tell me if it looks better now?

hroberts, Sep 25 '20 14:09

Sorry, I hit the endpoint and the dates still don't match up. 😢

dsjen, Sep 25 '20 14:09