backend
article text repeated numerous times in some stories
I'm seeing that some stories from the LA Times have the content of the story repeated multiple times in the same story object in our database. This is a data quality issue, but not one that will affect most research. I noticed this because I was extracting quotes for a project and saw the same quote showing up numerous times in some stories. I thought my pipeline was broken, but in fact the text we have for the story has the entire article text repeated multiple times.
The handful of examples I checked were all from the LA Times. For example:
- the text we have for story 1558515977 has the quote "Now is the time to prepare for the possibility of widespread community transmission" in it 25 times, while the original story does not
- the text we have for story 1539434634 has the quote "at annoys me is their press re" in it 30 times, while the original story does not
I'm just noting this to track it in case it causes problems for other projects that rely on custom processing of cached raw text.
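For anyone who wants to reproduce the counts above, here is a minimal sketch of the check. It assumes you already have the story's stored text as a Python string; `count_quote` is a hypothetical helper name, not something from the Media Cloud codebase.

```python
import re


def _normalize(s: str) -> str:
    # Collapse runs of whitespace so line wraps do not hide a match.
    return re.sub(r"\s+", " ", s).strip()


def count_quote(story_text: str, quote: str) -> int:
    """Count non-overlapping occurrences of `quote` in `story_text`."""
    return _normalize(story_text).count(_normalize(quote))
```

A count well above 1 for a quote that appears once in the published article is a hint that the stored text contains repeated copies.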
for some reason, there are 25 downloads for that first story.
should be easy to write a script to check for this for la times and other sources and fix it. harder to figure out why in the heck we downloaded the same story 25 times.
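A rough sketch of what such a check could look like, under some loud assumptions: the story text is available as a plain string, the repeated copies are separated by blank lines, and a paragraph is "long enough" to matter at 40 characters. None of the function names or thresholds below come from the Media Cloud code; they are illustrative only.

```python
import re
from collections import Counter


def split_paragraphs(text: str) -> list[str]:
    # Assumption: paragraphs (and repeated copies) are separated by blank lines.
    return [p.strip() for p in re.split(r"\n\s*\n", text) if p.strip()]


def repeat_factor(text: str, min_len: int = 40) -> int:
    """Highest number of times any paragraph of at least `min_len` characters occurs.

    Returns 0 if no paragraph is that long. Short paragraphs are ignored
    because bylines and pull quotes can legitimately repeat.
    """
    counts = Counter(p for p in split_paragraphs(text) if len(p) >= min_len)
    return max(counts.values(), default=0)


def dedupe_paragraphs(text: str) -> str:
    """Keep only the first occurrence of each paragraph, preserving order."""
    seen = set()
    kept = []
    for p in split_paragraphs(text):
        if p not in seen:
            seen.add(p)
            kept.append(p)
    return "\n\n".join(kept)
```

Running `repeat_factor` over stories from a source and flagging anything above, say, 2 would give a list of candidates to inspect. `dedupe_paragraphs` is a blunt fix, since it also drops paragraphs that are legitimately repeated in the original article, so it is probably safer to re-extract flagged stories than to rewrite their text in place.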
-hal
On Tue, May 12, 2020 at 8:31 AM rahulbot [email protected] wrote:
- the text we have for story 1558515977 has the quote "Now is the time to prepare for the possibility of widespread community transmission" in it 25 times, while the original story does not https://urldefense.proofpoint.com/v2/url?u=https-3A__www.latimes.com_california_story_2020-2D03-2D25_coronavirus-2Dwhy-2Dsanta-2Dclara-2Dbecame-2Dcalifornia-2Depicenter-2Dpandemic-3Futm-5Fsource-3Dfeedburner-26utm-5Fmedium-3Dfeed-26utm-5Fcampaign-3DFeed-253A-2Blatimes-252Fbusiness-2B-2528L.A.-2BTimes-2B-2D-2BBusiness-2529&d=DwMCaQ&c=WO-RGvefibhHBZq3fL85hQ&r=0c5FW2CrwCh84ocLICzUHjcwKK-QMUDy4RRw_n18mMo&m=Iqr-2uerohICzZPG6fX_LWvlqmuvJHKw1pJtSThhrRc&s=SnY8hmIw3Lo8vKNup5d7bVXDE6Ea_k-14i-qVbz1AEk&e=
- the text we have for story 1539434634 has the quote "at annoys me is their press re" in it 30 times, while the original story does not https://urldefense.proofpoint.com/v2/url?u=https-3A__www.latimes.com_california_story_2020-2D03-2D05_coronavirus-2Dca-2Dprincess-2Dcruise-2Dcontainment&d=DwMCaQ&c=WO-RGvefibhHBZq3fL85hQ&r=0c5FW2CrwCh84ocLICzUHjcwKK-QMUDy4RRw_n18mMo&m=Iqr-2uerohICzZPG6fX_LWvlqmuvJHKw1pJtSThhrRc&s=4mOZ1gapNfefkR-8eet4YRNZkHtmxWei1iO3VhG_tHY&e=