libraries.io
The data in the latest (1.6.0-2020-01-12) exported dataset in CSV format (in `repositories` and `projects_with_repository_data`) has its columns messed up, which makes it completely junk.
The screenshots are from LibreOffice, but other software sees the data as messed up too.
Also some rows contain junk:
The `projects` file seems to be OK.
Other files haven't been tested.
The issue you describe comes from a simple column shift, at least for the `projects_with_repository_fields` file. It can be resolved by loading the file into a Pandas or Dask dataframe (in Python) with the `index_col=False` argument, or the equivalent of this behavior in other languages.
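For illustration, here is a minimal sketch of what `index_col=False` changes. The sample rows and column names are made up and only imitate one possible cause of the shift (a trailing delimiter on every data row). Whether the real export is malformed in exactly this way is an assumption.

```python
import io

import pandas as pd

# Hypothetical miniature of the problem: the header has 3 names, but every
# data row ends with a trailing delimiter, so the parser sees 4 fields per row.
SAMPLE = "ID,Platform,Name\n1,NPM,left-pad,\n2,Pypi,requests,\n"


def load(text, **kwargs):
    """Parse CSV text with pandas, forwarding any extra read_csv options."""
    return pd.read_csv(io.StringIO(text), **kwargs)


# Default behaviour: the extra field makes pandas treat the first field as the
# index, so every remaining value lands one column to the left of its header.
shifted = load(SAMPLE)

# index_col=False tells pandas not to use the first column as the index,
# which keeps the values under the headers they belong to.
aligned = load(SAMPLE, index_col=False)

print(shifted)
print(aligned)
```

This only helps when every row is shifted by the same amount. Rows that are shifted differently would still end up misaligned.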
The problem is that the rows are not uniformly shifted. Some lines are shifted by one amount, other lines by another amount, so the same columns contain different data on different lines (at least as exploration in LO Calc has shown). Fixing the data would need nontrivial logic, which likely won't work reliably. So the data is completely junk.
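One rough way to check whether the shift is uniform is to count the fields on each row and compare that with the header width. The sketch below uses only the standard `csv` module. The sample data and the helper name `field_count_histogram` are hypothetical, and a shift that keeps the field count unchanged would not show up in this check.

```python
import csv
import io
from collections import Counter


def field_count_histogram(lines):
    """Return the header width and a Counter mapping row width -> row count."""
    reader = csv.reader(lines)
    header = next(reader)
    counts = Counter(len(row) for row in reader)
    return len(header), counts


# Hypothetical sample: three data rows with three different widths.
SAMPLE = io.StringIO(
    "ID,Platform,Name\n"
    "1,NPM,left-pad\n"
    "2,Pypi,requests,\n"
    "3,Go,cobra,,\n"
)

width, counts = field_count_histogram(SAMPLE)
print("header fields:", width)
for n_fields, n_rows in sorted(counts.items()):
    print(f"{n_rows} row(s) with {n_fields} field(s)")
```

For a real export, pass an open file instead, for example `open(path, newline="", encoding="utf-8")`. More than one distinct width in the result would suggest that a single loader option cannot realign every row.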
Also, I am not going to use pandas; pandas is damn slow. I am going to use a custom importer in C++ using Ben Strasser's fast CSV parsing lib (the schema is defined at compile time).
Well, isn't that the beauty of open source: you work on making it better if you can? Anyway, I would like to leave you with one of my favorite quotes: "Everyone in open source is doing everyone else a favor to varying levels of commitment. We should treat one another accordingly."
Good luck.
You are right. But I am out of capacity to work on this project too. In fact, I am not even sure that these datasets are going to be useful for the study at all.
Good luck.
Thanks.