article text repeated numerous times in some stories

Open rahulbot opened this issue 4 years ago • 1 comments

I'm seeing that stories from the LA Times have the content of the story repeated multiple times in the same story object in our database. This is a data quality issue, but not one that will affect most research. I noticed this because I was extracting quotes for a project and saw the same quote showing up numerous times in some stories. I thought my pipeline was broken, but in fact the text we have for the story has the entire article text repeated multiple times.

The handful of examples I checked were all from the LA Times. For example:

  • the text we have for story 1558515977 has the quote "Now is the time to prepare for the possibility of widespread community transmission" in it 25 times, while the original story does not
  • the text we have for story 1539434634 has the quote "at annoys me is their press re" in it 30 times, while the original story does not

I'm just noting this to track it in case it causes problems for other projects that rely on custom processing of cached raw text.
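
A rough sketch of the kind of check that surfaces this, assuming the stored story text is already available as a plain string. The function name `count_quote` is hypothetical and not part of the Media Cloud codebase.

```python
def count_quote(story_text: str, quote: str) -> int:
    """Count non-overlapping occurrences of `quote` in `story_text`.

    Whitespace runs are collapsed to single spaces first, so line breaks
    inside the stored text do not hide a match.
    """
    def normalize(s: str) -> str:
        return " ".join(s.split())

    needle = normalize(quote)
    if not needle:
        return 0
    return normalize(story_text).count(needle)
```

A count well above 1 for a quote that appears once in the published article is a hint that the stored text contains repeated copies.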

rahulbot, May 12 '20 13:05

for some reason, there are 25 downloads for that first story.

should be easy to write a script to check for this for la times and other sources and fix it. harder to figure out why in the heck we downloaded the same story 25 times.
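
A minimal sketch of what such a check-and-fix script might look like. It assumes the stored text is simply several identical copies of the article joined line by line, which may not hold if the individual downloads differ. The function names and the `min_len` threshold are hypothetical, not existing Media Cloud code.

```python
from collections import Counter


def repetition_factor(text: str, min_len: int = 40) -> int:
    """Return how often the most-repeated long line appears in `text`.

    Lines shorter than `min_len` characters are ignored, since short
    lines can legitimately repeat within one article.
    """
    lines = [" ".join(ln.split()) for ln in text.splitlines()]
    counts = Counter(ln for ln in lines if len(ln) >= min_len)
    return max(counts.values(), default=1)


def collapse_repeats(text: str) -> str:
    """If `text` is exactly k copies of one block of lines, return one copy.

    Blank lines are dropped before comparing. Text that is not an exact
    repetition is returned with only its blank lines removed.
    """
    lines = [ln for ln in text.splitlines() if ln.strip()]
    n = len(lines)
    # Try the smallest block length first so we recover a single copy.
    for period in range(1, n // 2 + 1):
        if n % period == 0 and lines == lines[:period] * (n // period):
            return "\n".join(lines[:period])
    return "\n".join(lines)
```

Stories where `repetition_factor` returns more than 1 could be flagged for review, and `collapse_repeats` applied only where the text is an exact repetition.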

-hal

hroberts, May 14 '20 14:05