vega-datasets

List all sources

Open domoritz opened this issue 9 years ago • 4 comments

domoritz • Oct 28 '16 14:10

Seems like this has fallen by the wayside, but should it ever come back to development, it would be cool if there were some elementary statistics for each of the datasets: how many rows of data, the names of the columns, the types of those columns, etc. Basically the same collection of things that Kaggle lists for many of its datasets.

mcnuttandrew • Jun 19 '19 18:06

It would be great to include the license of the source files as well.

santiagorr • Sep 30 '19 15:09

> it would be cool if there were some elementary statistics for each of the datasets

I think this issue could be split into at least 2 components:

  • Convert the SOURCES.md file into something machine-readable, like a JSON file or a folder of YAML files. We could adopt a process similar to what "awesome public datasets" ( https://github.com/awesomedata/awesome-public-datasets ) or "campusdata" ( https://github.com/CampusData/campusdata.github.io/blob/master/_data/rankings.yml ) did in the past.
  • Add metadata about each sample file in the repo. Perhaps we could keep a script around that programmatically generates this info and stores it. That way you could query for, say, a dataset with at least 1 datetime column, or a dataset with at least 3 quantitative columns and over 3000 rows.
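To make the second component concrete, here is a minimal sketch of what such a script could look like for CSV files, using only the Python standard library. Everything in it is an assumption for illustration: the function names (`infer_type`, `summarize_csv`, `find_datasets`), the three coarse type labels, the metadata layout, and the inline sample data are hypothetical and not taken from the repo.

```python
import csv
import io
import json
from datetime import datetime


def infer_type(values):
    """Guess a coarse type for a column of string cells (hypothetical labels)."""
    non_empty = [v for v in values if v != ""]
    if not non_empty:
        return "unknown"

    def all_parse(parser):
        # True only if every non-empty cell is accepted by the parser.
        for v in non_empty:
            try:
                parser(v)
            except ValueError:
                return False
        return True

    if all_parse(float):
        return "quantitative"
    if all_parse(datetime.fromisoformat):  # only ISO 8601 dates are recognized
        return "datetime"
    return "nominal"


def summarize_csv(name, text):
    """Return row count, column names, and inferred column types for CSV text."""
    reader = csv.DictReader(io.StringIO(text))
    rows = list(reader)
    columns = reader.fieldnames or []
    return {
        "name": name,
        "rows": len(rows),
        "columns": {c: infer_type([r.get(c) or "" for r in rows]) for c in columns},
    }


def find_datasets(metadata, min_rows=0, **min_columns_of_type):
    """Names of datasets with at least min_rows rows and N columns of each given type."""
    matches = []
    for entry in metadata:
        types = list(entry["columns"].values())
        enough_columns = all(
            types.count(t) >= n for t, n in min_columns_of_type.items()
        )
        if entry["rows"] >= min_rows and enough_columns:
            matches.append(entry["name"])
    return matches


# Made-up sample data, not a file from the repo.
SAMPLE = "date,city,temp\n2016-10-28,Seattle,11.5\n2016-10-29,Seattle,9.0\n"
metadata = [summarize_csv("sample.csv", SAMPLE)]
print(json.dumps(metadata, indent=2))
print(find_datasets(metadata, datetime=1))
```

The stored JSON could then answer the queries mentioned above, e.g. `find_datasets(metadata, datetime=1)` or `find_datasets(metadata, min_rows=3001, quantitative=3)`. A real script would also need to handle the JSON and TopoJSON files and non-ISO date formats, which this sketch ignores.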

In the meantime, at least 2 peer projects can cover some of the data exploration use cases for single-file data requests:

  • https://github.com/pandas-profiling/pandas-profiling
  • https://github.com/githubocto/flat-viewer (take any of the file URLs and add "flat" in front of "github", like https://flatgithub.com/vega/vega-datasets/blob/next/data/birdstrikes.csv?filename=data%2Fairports.csv&sha=05fcb7c07b1d76206856e75129fc1e79dc61735c )
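The URL rewrite for flat-viewer can be sketched as a small helper. This is only an illustration of the "add flat in front" rule described above: the function name `to_flat_url` is hypothetical, and whether the viewer needs the extra `filename` and `sha` query parameters seen in the example URL is not something this sketch assumes or checks.

```python
from urllib.parse import urlsplit, urlunsplit


def to_flat_url(github_url):
    """Swap the github.com host for flatgithub.com, leaving the rest of the URL as-is."""
    parts = urlsplit(github_url)
    if parts.netloc != "github.com":
        raise ValueError("expected a github.com URL, got: " + github_url)
    return urlunsplit(parts._replace(netloc="flatgithub.com"))


print(to_flat_url("https://github.com/vega/vega-datasets/blob/next/data/birdstrikes.csv"))
```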

hydrosquall • Sep 27 '21 23:09

world-110m.json looks like it could be from https://www.jsdelivr.com/package/npm/world-atlas?version=1.1.4&path=world (https://github.com/topojson/world-atlas).

domoritz • May 20 '22 20:05