stream_out for use in creating bulk load formats to databases
@jeroenooms I want to make creation of files for bulk load to Elasticsearch much faster. Right now I simply cat JSON lines to a file, which doesn't really scale to large files. The format looks like this (for the first two rows of mtcars):
{"index":{"_index":"mtcars","_type":"mtcars","_id":0}}
{"mpg":21,"cyl":6,"disp":160,"hp":110,"drat":3.9,"wt":2.62,"qsec":16.46,"vs":0,"am":1,"gear":4,"carb":4,"_row":"Mazda RX4"}
{"index":{"_index":"mtcars","_type":"mtcars","_id":1}}
{"mpg":21,"cyl":6,"disp":160,"hp":110,"drat":3.9,"wt":2.875,"qsec":17.02,"vs":0,"am":1,"gear":4,"carb":4,"_row":"Mazda RX4 Wag"}
So it has a pair of lines for each document to create. The first line specifies what action to take (create, delete, etc.) and where to put the document. Indices are like databases and types are like tables, with respect to SQL databases; IDs identify individual documents.
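To make the interleaving concrete, here's a hypothetical sketch (in Python, just to illustrate the format; `bulk_lines` is an invented name, not part of jsonlite or Elasticsearch) that emits one action line before each document line:

```python
import json

def bulk_lines(records, index, doc_type):
    """Yield alternating action/document lines in Elasticsearch bulk format.

    Each record produces two lines: an 'index' action line naming the
    target index, type, and id, followed by the document itself.
    """
    for i, rec in enumerate(records):
        # compact separators match the no-whitespace style shown above
        yield json.dumps({"index": {"_index": index, "_type": doc_type, "_id": i}},
                         separators=(",", ":"))
        yield json.dumps(rec, separators=(",", ":"))

rows = [{"mpg": 21, "cyl": 6, "_row": "Mazda RX4"},
        {"mpg": 21, "cyl": 6, "_row": "Mazda RX4 Wag"}]
print("\n".join(bulk_lines(rows, "mtcars", "mtcars")))
```

The key point is that the output is newline-delimited JSON, not one JSON document, so a plain serializer can't produce it in a single call.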
stream_out() can create the rows super fast for the actual data from a data.frame, but I don't see a way to create the other lines. Could a function be passed to stream_out(), perhaps, that creates additional output lines based on the data.frame input and any parameters passed in?
I think this sort of bulk format, or at least something similar, is used in other databases too.
Is this some standard format? It's a bit confusing to have the records with "real" data interwoven with the lines of metadata.
That's the format required by the Elasticsearch bulk load API. It's awkward to produce, but loading it into ES is much, much faster than sending proper JSON through the standard (but slower) non-bulk API.
CouchDB has a sort of similar format
{
"docs": [
{ "_id": "awsdflasdfsadf", "foo": "bar" },
{ "_id": "cczsasdfwuhfas", "bwah": "there" },
...
]
}
But that's proper JSON, so it's easier to produce.
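For contrast with the Elasticsearch case, the CouchDB-style payload is a single JSON object, so a one-call serialization suffices (again a Python sketch for illustration only):

```python
import json

rows = [{"_id": "awsdflasdfsadf", "foo": "bar"},
        {"_id": "cczsasdfwuhfas", "bwah": "there"}]

# The whole bulk payload is one ordinary JSON document: no interleaved
# metadata lines, just the records wrapped under a "docs" key.
payload = json.dumps({"docs": rows})
```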
@jeroenooms feel free to close this, I landed on a good enough solution.