datproject-discussions

data importer tool

Open max-mapper opened this issue 10 years ago • 3 comments

consider this: https://github.com/maxogden/dat-oakland-land-use

look at the package.json import script. it essentially does these commands

wget -N http://data.openoakland.org/sites/default/files/Oakland_Parcels_06-01-13.zip
unzip -o Oakland_Parcels_06-01-13.zip

and then these as a pipe chain

csv-join http://data.openoakland.org/sites/default/files/ParcelUseCodes2013_0.csv 'Use Code' Oakland_Parcels_06-01-13.csv 'Use code'
bcsv
trim-object-stream
dat import --json --primary "Assessor's Parcel Number (APN) sort format"

it would be pretty cool if we had something conceptually along the lines of gulp or grunt but way more minimal. basically take the transformations code in dat and make it a standalone module for hooking up data flow/pipe chains using modules from npm

we could call it pipechain or something, and you could make a json file with stuff for it to do, similar to dat transformations but more to cover the use case of getting data into dat in the first place
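
A minimal sketch of what such a json file and its runner could look like. Everything here is an assumption for illustration: the `setup` and `pipeline` keys, the `toPipeChain` and `run` helpers, and the config shape are hypothetical and are not an existing dat or npm API. The commands themselves are the ones listed above.

```javascript
// pipechain sketch: hypothetical, not an existing module
const { execSync } = require('child_process')

// Hypothetical config format (an assumption): "setup" commands run one
// after another, "pipeline" stages are joined into a single pipe chain.
const config = {
  setup: [
    'wget -N http://data.openoakland.org/sites/default/files/Oakland_Parcels_06-01-13.zip',
    'unzip -o Oakland_Parcels_06-01-13.zip'
  ],
  pipeline: [
    "csv-join http://data.openoakland.org/sites/default/files/ParcelUseCodes2013_0.csv 'Use Code' Oakland_Parcels_06-01-13.csv 'Use code'",
    'bcsv',
    'trim-object-stream',
    'dat import --json --primary "Assessor\'s Parcel Number (APN) sort format"'
  ]
}

// Join the pipeline stages into one shell command: "a | b | c".
function toPipeChain (stages) {
  return stages.join(' | ')
}

// Run the setup commands in order, then the pipe chain.
// Returns whatever the last stage of the chain writes to stdout.
function run (cfg) {
  for (const cmd of cfg.setup || []) {
    execSync(cmd, { stdio: 'inherit' })
  }
  return execSync(toPipeChain(cfg.pipeline), { encoding: 'utf8' })
}
```

Calling `run(config)` would execute the Oakland import, assuming `wget`, `unzip`, `csv-join`, `bcsv`, `trim-object-stream` and `dat` are all on the PATH. Handing the joined string to the shell keeps the sketch tiny; a real tool would more likely spawn each stage and pipe the streams together directly.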

cc @mafintosh

max-mapper, Jun 26 '14 05:06

a few more thoughts:

in the spectrum where grunt is on one end, gulp is in the middle and npm run is on the other end, I think we need something with a unified 'marketing' effort along the lines of gulp and grunt but that is actually just npm run. the problem with npm run is that it's a feature lost in the sea of features in npm: it doesn't have its own readme, logo, name, or community

max-mapper, Jun 26 '14 18:06

Huh. I did a study in 2011 of Kepler and Taverna workflow systems that found that basically 38% of the workflows used in bioinformatics were shims - essentially, data converters. I bring this up because there's already an extensive scientific literature on what ideal streaming data conversion might look like. I can look around for some papers if you want any, although it's not this field and would probably be pretty technical. You might want to ask @bmpvieira, seems up his alley.

Building a gulp-like system for dat would be pretty fantastic, I think. I only bring this up because it might be useful to look at best practices or suggestions before attempting a minimal system. Might not be, also. But I think a shimming tool like this would be awesome - and it would be very useful outside of the dat framework, as a whole. I know I would like to use it for linguistics data work.

RichardLitt, Aug 27 '14 03:08

I just created vinyl-dat, which provides a src and dest method for dat databases in gulp workflows. I know it's not as minimal as just using npm run, but it might be useful for people who are already using gulp and just want to add dat.

doowb, Aug 27 '14 04:08