libraries.io

The data in the latest (1.6.0-2020-01-12) exported dataset in CSV format (in `repositories` and `projects_with_repository_data`) has misaligned columns and is therefore completely junk

Open KOLANICH opened this issue 3 years ago • 4 comments

The screenshots are from LibreOffice, but other software also sees the data as misaligned.

[Screenshots: a, c]

Also, some rows contain junk:

[Screenshot: b]

The `projects` file seems to be OK.

Other files haven't been tested.

KOLANICH avatar Nov 22 '21 17:11 KOLANICH

The issue you describe is caused by a simple column shift, at least for the projects_with_repository_fields file. It can be resolved by loading the file into a Pandas or Dask dataframe (in Python) with the `index_col=False` argument, or with the equivalent of this behavior in other languages.
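As a minimal sketch of that suggestion, assuming the shift comes from a trailing delimiter on every data row (the column names and values below are made up for illustration and are not taken from the dataset):

```python
import io

import pandas as pd

# Hypothetical sample: the header has 3 fields, but every data row ends
# with an extra delimiter, so each row has 4 fields.
raw = "id,name,stars\n1,foo,10,\n2,bar,20,\n"

# Default behavior: pandas treats the first field of each row as the index,
# so every value appears one column to the left of its header.
df_default = pd.read_csv(io.StringIO(raw))

# index_col=False tells pandas not to use the first column as the index,
# which realigns the values with the header.
df_fixed = pd.read_csv(io.StringIO(raw), index_col=False)

print(df_default)
print(df_fixed)
```

This only helps when every row is shifted in the same way; it does not repair rows that are shifted by different amounts.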

ftarlaci avatar Dec 30 '21 16:12 ftarlaci

The problem is that the rows are not uniformly shifted. Some lines are shifted by one amount and other lines by a different amount, so the same column contains different data on different lines (at least as far as exploration in LO Calc showed). Fixing the data would need nontrivial logic, which likely won't work reliably. So the data is completely junk.
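One rough way to check whether the shift is uniform is to compare each row's field count with the header's. The sketch below uses only the standard library; the helper name `field_count_histogram` and the sample rows are hypothetical, and it assumes a shift shows up as extra or missing fields, which may not hold for every kind of corruption:

```python
import csv
import io
from collections import Counter


def field_count_histogram(lines):
    """Tally rows by (number of fields in row - number of fields in header)."""
    reader = csv.reader(lines)
    header = next(reader)
    expected = len(header)
    return Counter(len(row) - expected for row in reader)


# Hypothetical sample with rows shifted by different amounts.
sample = io.StringIO(
    "id,name,stars\n"
    "1,foo,10\n"
    "2,bar,extra,20\n"
    "3,baz,x,y,30\n"
)

offsets = field_count_histogram(sample)
print(offsets)
```

A single nonzero key in the result would suggest a uniform shift; several distinct keys would suggest rows shifted by different amounts.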

Also, I am not going to use pandas; pandas is damn slow. I am going to use a custom importer in C++ built on Ben Strasser's fastest CSV parsing lib (the schema is defined at compile time).

KOLANICH avatar Dec 30 '21 20:12 KOLANICH

Well, isn't that the beauty of open source: you work on making it better if you can? Anyway, I would like to leave you with one of my favorite quotes: "Everyone in open source is doing everyone else a favor to varying levels of commitment. We should treat one another accordingly."

Good luck.

ftarlaci avatar Jan 03 '22 19:01 ftarlaci

You are right. But I am out of capacity to work on this project too. In fact, I am not even sure that these datasets are going to be useful for the study at all.

> Good luck.

Thanks.

KOLANICH avatar Jan 04 '22 07:01 KOLANICH