
support for blob or binary attachments

Open kaerus opened this issue 12 years ago • 53 comments

I think it is essential to be able to store raw binary data, either as a blob or as attachments (as CouchDB does). The current workaround is to Base64-encode the data, which inflates it to roughly 133% of the original size (4 output bytes for every 3 input bytes). You end up having to shuffle more bytes and perform additional decoding, which wastes CPU cycles (and energy). So please add support for attachments or some other means of storing binary data in the database.
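
For reference, a quick sanity check of that overhead, as a minimal sketch in plain JavaScript (node or arangosh); the 1 MiB buffer is just an example payload:

var raw = new Buffer(1024 * 1024);        // 1 MiB of arbitrary bytes
var encoded = raw.toString('base64');     // the Base64 text you would have to store today
console.log(encoded.length / raw.length); // ~1.33, i.e. roughly a third more data to shuffle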

kaerus avatar Jun 18 '12 08:06 kaerus

Hi kaerus, I am helping the team with mostly non-technical tasks, e.g. I am collecting use cases for ArangoDB and extracting requirements for the roadmap. Can you tell me a bit more about what you do with ArangoDB so that we can find a solution for you? You can reach me here or via [email protected]. Thanks.

luebbert42 avatar Jun 18 '12 09:06 luebbert42

What kind of binary data do you have in mind regarding the size?

Do you have any opinions/ideas on what the server/client protocol should look like? JSON has no support for binary data as far as I can see. If using attachments, one could forgo any encoding and send the binary data in the HTTP body. If using blobs within a document, I have no idea how to encode them.

fceller avatar Jun 18 '12 09:06 fceller

Hi, thanks for the quick reply. My concerns are mainly with compressed binary data such as images, but also with other file formats such as PDF etc. However, I would like the option to compress arbitrary content (in a CMS I'm currently developing), since many web clients today support compressed content (Accept-Encoding: gzip, deflate). Compression, in my case, increases overall performance, especially when content is accessed over slow or high-latency networks. In short, being able to store and retrieve data in a raw binary format would make the user experience better in my case.

kaerus avatar Jun 18 '12 10:06 kaerus

fceller: Regarding the protocol, I was referring to CouchDB attachments for that specific reason. I have not put much thought into whether that is the optimal way to do it, but it is at least simple to interface with. Perhaps BSON could be a better alternative: http://bsonspec.org/

Size of data varies depending on usage, but anything from small jpegs (4KB) to streamable movies (1GB) I guess. :)

kaerus avatar Jun 18 '12 10:06 kaerus

Thinking about my use case a little more: smaller blobs (< 16 KB) could be embedded as Base64-encoded data, while larger files could be sent over the wire as multipart binary content and stored as document attachments. Another option is to store attachments directly on disk and only keep references in the database. However, I prefer to have everything in one place for simplicity.
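
To illustrate the embedding variant, a minimal arangosh-style sketch (the "assets" collection and the payload are just placeholders):

var payload = new Buffer('tiny binary payload');   // pretend this is a small image
db.assets.save({
  name: 'thumbnail.jpg',
  contentType: 'image/jpeg',
  data: payload.toString('base64')                 // embedded as Base64 text
});
// reading it back and decoding to raw bytes again
var doc = db.assets.firstExample({ name: 'thumbnail.jpg' });
var bytes = new Buffer(doc.data, 'base64');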

kaerus avatar Jun 18 '12 11:06 kaerus

fceller: What is your take on this? Is it better to keep things as they are and work around the lack of binary attachments with alternate storage, i.e. on a local disk or NAS, or is it feasible/practical to store binaries directly in ArangoDB? I don't know how this would affect memory utilization etc. in the database; perhaps a simple pro vs. con list from your side would clear things up.

kaerus avatar Jun 18 '12 11:06 kaerus

I have to think about this: storing very large files in the database will in principle work, but it might not be as performant as storing such files on a local disk and only storing the file name.

On the other hand, it might be nice to have just one server that a client has to talk to.

I need to check how CouchDB attachments work.

fceller avatar Jun 18 '12 11:06 fceller

The more I think about this, the more I lean towards keeping only references to files in ArangoDB. Why? Because my users could just as well store their attachments on other remote servers, such as Amazon cloud storage. I will prototype a solution based on this approach while you guys come up with a solution. :) I still think the ability to store binaries is essential, and I like the idea of being able to implement quota restrictions using the journalSize feature.
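
A rough sketch of what such a reference-only document could look like (the "files" collection, field names, and the S3 URL are made up for illustration):

db.files.save({
  filename: 'report.pdf',
  contentType: 'application/pdf',
  size: 382144,                          // bytes, as reported by the external store
  storage: 's3',                         // or 'local-disk', 'nas', ...
  url: 'https://example-bucket.s3.amazonaws.com/reports/report.pdf',
  uploadedAt: new Date().toISOString()
});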

kaerus avatar Jun 18 '12 12:06 kaerus

This is indeed a good argument. There are specialized distributed file servers out there which will do a better job when dealing with files.

For smaller files: yes, it should be possible. The question remains how to implement this.

fceller avatar Jun 18 '12 16:06 fceller

I've discussed the issue with Martin:

Since we generally try to do things efficiently, such large unstructured objects should be transferred in binary form and then also stored in binary form.

To handle such blobs efficiently, however, they have to be treated differently from other documents. The semantics should be different (that is, you do not want synchronous replication for such large objects), and they should be stored differently (because our normal copying garbage collection is rather inefficient for objects of that size).

And I did not want to try to hide this different treatment from the developer (which, given the different replication semantics, would not be possible anyway). So there should be dedicated functions for storing, querying, and deleting such blobs.

fceller avatar Jun 19 '12 07:06 fceller

From my point of view, the best approach would be to create a separate document for each file, referencing the 'parent document'. That way you avoid having to do a merge if you get revision collisions when a user is updating a document and uploading attachments at the same time. A graph relation could be used, I guess; I will attempt that.
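
A minimal sketch of that layout in arangosh, linking each attachment document to its parent via an edge (all collection names are placeholders):

var parent = db.articles.save({ title: 'My article' });
var attachment = db.attachments.save({
  filename: 'figure1.png',
  contentType: 'image/png',
  data: new Buffer('...').toString('base64')   // or just a reference to external storage
});
// the edge collection "has_attachment" records the relation
db.has_attachment.save(parent._id, attachment._id, { position: 1 });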

kaerus avatar Jun 19 '12 20:06 kaerus

I'm adding this quote from another thread. "I think that arangodb should be data agnostic. Perhaps you need to create a binaries api and have a separate process that handles binary file attachments."

Although that is not really agnostic (blame JSON), you can at least store data in any format. Do you know of any way to make JSON work with multipart HTTP content? I'm wondering if it's possible to send both {json stuff} and ----- files ------ bundled in the same request.

kaerus avatar Jun 27 '12 12:06 kaerus

Adding these for reference. Multipart/form-data: http://www.ietf.org/rfc/rfc2388.txt
FormData: https://developer.mozilla.org/en/XMLHttpRequest/FormData
https://developer.mozilla.org/en/DOM/XMLHttpRequest/FormData/Using_FormData_Objects
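
For completeness, a minimal browser-side sketch of that multipart idea, bundling a JSON part and a file part in one request via FormData (the /upload endpoint and the fileInput element are hypothetical):

var form = new FormData();
form.append('meta', JSON.stringify({ title: 'Holiday photo', tags: ['beach'] }));
form.append('file', fileInput.files[0]);   // a File/Blob from an <input type="file">

var xhr = new XMLHttpRequest();
xhr.open('POST', '/upload');               // hypothetical endpoint
xhr.send(form);                            // the browser adds the multipart boundary itself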

kaerus avatar Jun 27 '12 13:06 kaerus

The possibility to store binaries in the key store would also be nice.

kaerus avatar Aug 14 '12 11:08 kaerus

To summarize: the implementation will:

  • add a new marker for blobs. Blobs are like documents plus a binary attachment. Maybe someone finds a more suitable name than "blob". Attachment? File?

  • add a type to the collection: a collection can either be a "normal" one, where all documents are just documents; an "edge" collection, where all documents are edges; or a "file" collection, where all documents are blobs/files/attachments.

  • there is a special request to get/set the attachment, something like

    GET /_api/attachment/
    PUT /_api/attachment/
    POST /_api/attachment/

    It must be possible to define a "content-type" using "PUT" or "POST". That type is then returned by "GET" (a usage sketch follows below this list).

  • It should be possible (or maybe even the default?) to use a file collection with the key/value api
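
To illustrate the proposed attachment API (just a sketch of the proposal above, not a confirmed or implemented interface; the collection "photos", the key, and jpegBlob are made up), a client could send the raw bytes directly:

var xhr = new XMLHttpRequest();
xhr.open('PUT', '/_api/attachment/photos/12345');    // hypothetical collection/key
xhr.setRequestHeader('Content-Type', 'image/jpeg');  // stored alongside the blob, returned on GET
xhr.send(jpegBlob);                                   // jpegBlob: a Blob/File obtained elsewhere; raw bytes, no Base64 detour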

fceller avatar Aug 17 '12 11:08 fceller

Great news, although I'm a little unsure of the implications of having a special collection for binaries. But perhaps that is a good feature; anyhow, I'm looking forward to trying this out. :)

kaerus avatar Aug 21 '12 13:08 kaerus

Yes, we decided to use a dedicated collection because replication will be different for large blobs.


fceller avatar Aug 21 '12 13:08 fceller

It's been a while since this issue was last updated, so I don't know where it's at. However, I'd like to add something. :D

Coming from MongoDB, I really liked their implementation of GridFS -> http://www.mongodb.org/display/DOCS/GridFS & http://www.mongodb.org/display/DOCS/GridFS+Specification

A nice use case: when sharding with keys that define geographical areas, one could serve user-specific files from the locations closest to the user.

It would be nice if you could keep that use-case in mind for ArangoDB. It would be a top feature.

frankmayer avatar Oct 02 '12 19:10 frankmayer

Jan has started on the blob collections. I think he will still need a few more days.

fceller avatar Oct 02 '12 20:10 fceller

Ah, great. Will the aforementioned use case (file location based on geo-sharding) be possible with this implementation?

It would really be a great feature.

frankmayer avatar Oct 02 '12 20:10 frankmayer

I assume that this is a feature of sharding? We should keep that use case in mind when implementing sharding.

fceller avatar Oct 02 '12 20:10 fceller

Yes, actually it's a feature of sharding. If you implement a special collection for files, as described above, then this collection could also be sharded by a specific shard key, which could be a geo key.

Also, what's interesting about MongoDB's GridFS is that files have their own metadata, which can be modified (I am also thinking about permissions here, see the record-level permissions discussion => https://groups.google.com/forum/?fromgroups=#!topic/arangodb/OxvfE-H_ug0), as well as file paths, which gives some flexibility.

frankmayer avatar Oct 02 '12 20:10 frankmayer

#61

weinberger avatar Feb 12 '13 16:02 weinberger

The roadmap at https://www.arangodb.com/roadmap/ mentions binary data: "Support for big data blobs."

What's the plan to implement this?

frankgerhardt avatar Oct 18 '15 21:10 frankgerhardt

In the use cases with binary data we have seen so far, the actual binary data was kept in some other place, like S3 or a distributed filesystem, and the metadata (pathname etc.) was kept in ArangoDB. This approach works very well and even has some advantages over keeping everything in the database. Nevertheless, I would be interested in your particular use case for binary blobs in ArangoDB. This would allow us to judge the actual need for binary blobs in ArangoDB better. If you do not want to post this here publicly, you can reach me at [email protected] privately.

neunhoef avatar Oct 19 '15 10:10 neunhoef

+1 for implementing blob storage. We use MongoDB's GridFS, and its usage pattern is very good.

thearchitect avatar Nov 10 '15 13:11 thearchitect

https://github.com/Kronos-Integration/archive-arangodb/blob/master/doc/index.adoc <- seems to be a .js implementation; referenced here since it may be useful for others who want to do similar things with Foxx.

dothebart avatar Dec 17 '15 16:12 dothebart

Jan has created an example Foxx service handling binary data: http://jsteemann.github.io/blog/2014/10/15/handling-binary-data-in-foxx/ However, you need to store the payload itself in a raw file.

dothebart avatar Jan 28 '16 14:01 dothebart

For internally supported data types (since 3.0), see the VelocyPack specs.

What's problematic is the data transport in and out of the system via JSON over HTTP. You can't have binary data in JSON, for instance (unless you Base64-encode it). The upcoming binary communication protocol VelocyStream will open up more possibilities.

Simran-B avatar Oct 18 '16 23:10 Simran-B

This is all very encouraging. My understanding is that in 3.0 VelocyPack is implemented for the storage layer, and in 3.1 VelocyStream will bring it to the Java driver at least.

My question is: what is the story with respect to Foxx? My understanding is that Foxx scripts deal with documents that are backed directly by the internal representation. So if I, for example, write this:

var res = db.example.insert({ blob: new Buffer("AEFCY2RFRg==", 'base64') }); // 7 raw bytes, stored as a Buffer
var doc = db.example.document(res._id);                                      // read the document back
assert(doc.blob instanceof Buffer);                                          // was the binary type preserved?
assert(doc.blob.length === 7);                                               // ... with the exact byte length?

Will the asserts pass in Foxx in 3.1 as I would hope / expect? How about 3.0?

Edit: fixed typo in code...

vsivsi avatar Nov 04 '16 19:11 vsivsi