List all sources
This seems to have fallen by the wayside, but should it ever come back into development: it would be cool if there were some elementary statistics for each of the datasets, such as the number of rows, the names of the columns, the types of those columns, etc. Basically the same collection of things that Kaggle lists for many of the datasets on there.
It would be great to include the license of the source files as well.
> it would be cool if there were some elementary statistics for each of the datasets
I think there are at least 2 components that this issue could be split up into
- Convert the `SOURCES.md` file into something machine readable, like a JSON file or a folder of YAML files. We could adopt a process similar to what "awesome public datasets" (https://github.com/awesomedata/awesome-public-datasets) or "campusdata" did in the past: https://github.com/CampusData/campusdata.github.io/blob/master/_data/rankings.yml .
- Add metadata about each sample file in the repo. Perhaps we might keep a script around that programmatically generates this info and stores it. This way you can do things like query for a dataset with at least 1 datetime column, or a dataset with at least 3 quantitative columns and over 3000 rows.
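To make the second point a bit more concrete, here is a rough sketch of what such a metadata script might look like for CSV files. Everything in it is an assumption for illustration, not something that exists in the repo: the function names (`infer_type`, `summarize_csv`, `matches`), the coarse type labels, and the inline sample data are all hypothetical, and the type inference is a deliberately naive heuristic.

```python
import csv
import io
from datetime import datetime


def infer_type(values):
    """Guess a coarse column type from string cells (naive, hypothetical heuristic)."""
    non_empty = [v for v in values if v != ""]
    if not non_empty:
        return "empty"
    # Try the strictest parser first; fall through to looser ones on failure.
    parsers = (("integer", int), ("number", float), ("datetime", datetime.fromisoformat))
    for name, parse in parsers:
        try:
            for v in non_empty:
                parse(v)
            return name
        except ValueError:
            continue
    return "string"


def summarize_csv(text):
    """Return the row count and a column-name -> type mapping for CSV text."""
    reader = csv.DictReader(io.StringIO(text))
    rows = list(reader)
    columns = reader.fieldnames or []
    return {
        "rows": len(rows),
        "columns": {c: infer_type([(r[c] or "") for r in rows]) for c in columns},
    }


def matches(summary, min_rows=0, min_quantitative=0, min_datetime=0):
    """Check a stored summary against simple query thresholds."""
    types = list(summary["columns"].values())
    quantitative = sum(t in ("integer", "number") for t in types)
    return (
        summary["rows"] >= min_rows
        and quantitative >= min_quantitative
        and types.count("datetime") >= min_datetime
    )


# Made-up sample data, only to exercise the functions above.
sample = "date,count,cost,state\n2020-01-01,3,10.5,CA\n2020-01-02,7,0,NY\n"
summary = summarize_csv(sample)
print(summary)
print(matches(summary, min_datetime=1))
```

The summaries could be dumped to JSON next to the machine-readable sources file, so that queries like "at least 3 quantitative columns and over 3000 rows" become a simple filter over the stored metadata. JSON and TopoJSON files would need their own handling.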
In the meantime, there are at least 2 peer projects that can fulfill some of the data exploration use cases for single-file data requests:
- https://github.com/pandas-profiling/pandas-profiling
- https://github.com/githubocto/flat-viewer (take any of the file URLs and add `flat` in front, like https://flatgithub.com/vega/vega-datasets/blob/next/data/birdstrikes.csv?filename=data%2Fairports.csv&sha=05fcb7c07b1d76206856e75129fc1e79dc61735c )
world-110m.json looks like it could be from https://www.jsdelivr.com/package/npm/world-atlas?version=1.1.4&path=world (https://github.com/topojson/world-atlas).