cite-classifications-wiki

Duplicate rows found in the parent dataset

Open iosonopersia opened this issue 4 years ago • 0 comments

Hi @Harshdeep1996, I'm working on the parent dataset (the 'citations_from_wikipedia.zip' file available on Zenodo).

I found some duplicated rows (approximately two thousand in each Parquet partition file), meaning rows that share the same 'id' and the same 'citations' value. Because of how this project's workflow builds the dataset, such rows are identical in every other column as well.

Those duplicated lines should be removed from the next edition of the dataset. As a suggestion, these lines of code could be used at some point during the workflow.
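
A minimal sketch of such a deduplication step, assuming pandas and the column names 'id' and 'citations' mentioned above. The helper name `drop_duplicate_rows` and the toy data are hypothetical, and the project's actual pipeline may use a different framework:

```python
import pandas as pd


def drop_duplicate_rows(df: pd.DataFrame) -> pd.DataFrame:
    """Keep the first occurrence of each ('id', 'citations') pair.

    Hypothetical helper: the column names are taken from the issue text.
    """
    deduplicated = df.drop_duplicates(subset=["id", "citations"], keep="first")
    # Renumber the index so it stays contiguous after rows are dropped.
    return deduplicated.reset_index(drop=True)


# Toy data standing in for one partition (values are made up for illustration).
sample = pd.DataFrame(
    {
        "id": [1, 1, 2, 2],
        "citations": [
            "{{cite web|a}}",
            "{{cite web|a}}",  # exact duplicate of the row above
            "{{cite web|b}}",
            "{{cite web|c}}",  # same id, different citation: kept
        ],
    }
)

deduplicated = drop_duplicate_rows(sample)
print(len(sample), len(deduplicated))
```

For a real partition, the frame could be loaded with `pd.read_parquet`, which needs a Parquet engine such as pyarrow installed. If the workflow runs on Spark instead, `DataFrame.dropDuplicates(["id", "citations"])` would play the same role.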

iosonopersia, Feb 01 '21 16:02