Source metadata (e.g. end date and num_stories_X) incorrect
We've found a number of inconsistencies relating to end dates in source manager:
- https://github.com/mitmedialab/MediaCloud-Web-Tools/issues/1953
- https://github.com/mitmedialab/MediaCloud-Web-Tools/issues/1991
I think I've zeroed in on where the problem may lie. The mediaHealth endpoint returns a different end date from storyCount. For example:
mc.mediaHealth(1363086)
{'coverage_gaps': 1,
'coverage_gaps_list': [{'expected_sentences': 569.03,
'expected_stories': 2.69,
'media_id': 1363086,
'num_sentences': 102.71,
'num_stories': 3.0,
'stat_week': '2020-02-10 00:00:00-05:00'}],
'end_date': '2020-02-17 00:00:00-05:00',
'expected_sentences': 569.03,
'expected_stories': 2.69,
'has_active_feed': False,
'is_healthy': False,
'media_health_id': 1115839766,
'media_id': 1363086,
'num_sentences': 0,
'num_sentences_90': 0,
'num_sentences_w': 0,
'num_sentences_y': 224.09,
'num_stories': 0,
'num_stories_90': 0,
'num_stories_w': 0,
'num_stories_y': 1.44,
'start_date': '2019-02-11 00:00:00-05:00'}
fq='publish_day:[2010-01-01T00:00:00Z TO 2020-09-23T00:00:00Z]'
q='media_id:1363086 AND NOT tags_id_stories:8875452'
mc.storyCount(solr_query=q, solr_filter=fq, split=True)
{'counts': [{'count': 1, 'date': '2019-02-13 00:00:00'},
{'count': 1, 'date': '2019-02-15 00:00:00'},
{'count': 1, 'date': '2019-02-20 00:00:00'},
...
{'count': 9, 'date': '2020-09-18 00:00:00'},
{'count': 1, 'date': '2020-09-20 00:00:00'},
{'count': 3, 'date': '2020-09-21 00:00:00'},
{'count': 7, 'date': '2020-09-22 00:00:00'}]}
Note that the end date in media health is 2020-02-17 00:00:00-05:00 and num_stories_90 = 0, while the final date in the split story count is 2020-09-22 00:00:00.
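To put a number on the mismatch, here is a minimal sketch that compares the two responses. It assumes only the dict shapes shown above; the helper name `end_date_gap_days` is made up for illustration and is not part of the API client, and the sample dicts are abridged copies of the output above.

```python
from datetime import datetime


def end_date_gap_days(health: dict, story_counts: dict) -> int:
    """Days between the last day with stories and mediaHealth's end_date.

    Hypothetical helper: only the calendar dates are compared, so the
    -05:00 offset on the mediaHealth timestamp is ignored.
    """
    health_end = datetime.fromisoformat(health["end_date"]).date()
    last_story_day = max(
        datetime.fromisoformat(row["date"]).date()
        for row in story_counts["counts"]
        if row["count"] > 0
    )
    return (last_story_day - health_end).days


# Abridged copies of the two responses shown above.
health = {"end_date": "2020-02-17 00:00:00-05:00", "num_stories_90": 0}
story_counts = {
    "counts": [
        {"count": 3, "date": "2020-09-21 00:00:00"},
        {"count": 7, "date": "2020-09-22 00:00:00"},
    ]
}

print(end_date_gap_days(health, story_counts))  # 218
```

A positive result means storyCount has stories newer than the media health end date, which is the inconsistency described here.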
The media_health data is generated by a daily cron job. It must not be running. I will take a look.
-hal
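One way to notice a stalled job from the client side is to check how old `end_date` is. This is only a sketch: `health_is_stale` is a hypothetical helper, and the 7-day threshold is an assumption based on the weekly `stat_week` values, not a documented property of the job.

```python
from datetime import date, datetime, timedelta


def health_is_stale(health: dict, today: date, max_age_days: int = 7) -> bool:
    """Return True if mediaHealth's end_date is older than max_age_days.

    max_age_days=7 is an assumed threshold, not taken from the backend.
    """
    end = datetime.fromisoformat(health["end_date"]).date()
    return today - end > timedelta(days=max_age_days)


# end_date copied from the mediaHealth response in the report.
health = {"end_date": "2020-02-17 00:00:00-05:00"}
print(health_is_stale(health, date(2020, 9, 23)))  # True
```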
On Wed, Sep 23, 2020 at 9:27 AM Dennis Jen wrote:
> We've found a number of inconsistencies relating to end dates in source manager: [...]
The media health job has finished running and seems to have caught the data up. Can you please take a look and tell me if it looks better now?
Sorry, I hit the endpoint and the dates still don't match up. 😢