datproject-discussions
data importer tool
consider this: https://github.com/maxogden/dat-oakland-land-use
look at the package.json import script. it essentially does these commands:
wget -N http://data.openoakland.org/sites/default/files/Oakland_Parcels_06-01-13.zip
unzip -o Oakland_Parcels_06-01-13.zip
and then these as a pipe chain:
csv-join http://data.openoakland.org/sites/default/files/ParcelUseCodes2013_0.csv 'Use Code' Oakland_Parcels_06-01-13.csv 'Use code' |
  bcsv |
  trim-object-stream |
  dat import --json --primary "Assessor's Parcel Number (APN) sort format"
it would be pretty cool if we had something conceptually along the lines of gulp or grunt but way more minimal: basically take the transformations code in dat and make it a standalone module for hooking up data flow/pipe chains using modules from npm. we could call it pipechain or something, and you could make a json file with stuff for it to do, similar to dat transformations but covering the use case of getting data into dat in the first place
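to make the idea a bit more concrete, here is a rough sketch of what such a tool might do. everything in it is an assumption for illustration: the name pipechain, the config shape (a list of steps with `cmd` and `args`), and the helper names `shellQuote`, `toPipeline` and `runChain` are all hypothetical, and it is not how dat transformations are actually implemented. it just turns a JSON-style list of steps into a shell pipeline and hands it to the shell.

```javascript
// Hypothetical "pipechain" sketch: the config shape and all function
// names here are assumptions, not an existing module's API.
const { spawn } = require('child_process');

// Quote one argument for a POSIX shell. Plain words pass through;
// anything else is wrapped in single quotes, with embedded single
// quotes escaped as '\''.
function shellQuote(arg) {
  if (/^[A-Za-z0-9_\/:=.,@%+-]+$/.test(arg)) return arg;
  return "'" + arg.replace(/'/g, "'\\''") + "'";
}

// Turn a list of steps into one pipeline string: "a x | b | c y".
function toPipeline(steps) {
  return steps
    .map((step) => [step.cmd, ...(step.args || [])].map(shellQuote).join(' '))
    .join(' | ');
}

// Run the whole chain through the shell, sharing this process's stdio.
function runChain(steps) {
  return spawn(toPipeline(steps), { shell: true, stdio: 'inherit' });
}

// The import chain from dat-oakland-land-use, written as data.
// In a real tool this array would be loaded from a json file.
const importChain = [
  {
    cmd: 'csv-join',
    args: [
      'http://data.openoakland.org/sites/default/files/ParcelUseCodes2013_0.csv',
      'Use Code',
      'Oakland_Parcels_06-01-13.csv',
      'Use code'
    ]
  },
  { cmd: 'bcsv' },
  { cmd: 'trim-object-stream' },
  {
    cmd: 'dat',
    args: ['import', '--json', '--primary', "Assessor's Parcel Number (APN) sort format"]
  }
];

// Print the pipeline instead of running it, so the sketch is safe to
// try without the tools installed. Call runChain(importChain) to execute.
console.log(toPipeline(importChain));
```

building a shell string keeps the sketch tiny, but it assumes a POSIX shell; a real module would more likely spawn each step separately and pipe stdout to stdin, or require the npm modules directly and connect them as node streams.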
cc @mafintosh
a few more thoughts:
on the spectrum where grunt is on one end, gulp is in the middle and npm run is on the other end, I think we need something with a unified 'marketing' effort along the lines of gulp and grunt that is actually just npm run. the problem with npm run is that it's a feature lost in the sea of features in npm: it doesn't have its own readme, logo, name, or community
Huh. I did a study in 2011 of Kepler and Taverna workflow systems that found that basically 38% of the workflows used in bioinformatics were shims - essentially, data converters. I bring this up because there's already an extensive scientific literature on what ideal streaming data conversion might look like. I can look around for some papers if you want any, although it's not this field and would probably be pretty technical. You might want to ask @bmpvieira, seems up his alley.
Building a gulp-like system for dat would be pretty fantastic, I think. I only bring this up because it might be useful to look at best practices or suggestions before attempting a minimal system. Might not be, also. But I think a shimming tool like this would be awesome - and it would be very useful outside of the dat framework, as a whole. I know I would like to use it for linguistics data work.
I just created vinyl-dat, which provides a src and dest method for dat databases in gulp workflows. I know it's not as minimal as just using npm run, but it might be useful for people who already use gulp and just want to add dat.