
NIP-95 - Storage and Shared File

Open frbitten opened this issue 2 years ago • 143 comments

Initially this NIP was together with the PR of NIP-94 but I thought it better to separate it because it will require a greater discussion and it is not necessary to link the approval of one NIP with the other.

I suggest you read #337 before starting the discussion here.

@Egge7 we can continue the discussion here.

frbitten avatar Mar 10 '23 11:03 frbitten

@Egge7 Is BSON not supported by NOSTR just because no relay or client has implemented support for it? Or is it due to some BSON data parser issue where it is not expecting BSON to exist in the JSON string? I've never worked with BSON and I don't know what implications its use in the event would have.

frbitten avatar Mar 10 '23 11:03 frbitten

@Egge7 Is BSON not supported by NOSTR just because no relay or client has implemented support for it? Or is it due to some BSON data parser issue where it is not expecting BSON to exist in the JSON string? I've never worked with BSON and I don't know what implications its use in the event would have.

BSON cannot exist inside JSON, as JSON is text and BSON is binary. BSON is a binary representation of a JSON object, which enables it to hold raw data keys (this makes it great for storage as you do not lose space due to encoding). WebSockets can transmit blobs as well as text, but everything on nostr is designed to be text, which is why I proposed a sub-protocol for binary transmissions.

Egge21M avatar Mar 10 '23 12:03 Egge21M

Is there a need for the hash tag now that the contents are in the json itself?

vitorpamplona avatar Mar 12 '23 16:03 vitorpamplona

I don't think so, since the id should contain the data and the signature will verify nothing changed.

It would be pretty neat for large files to chunk them out across relays though, and then you could reduce load off one relay and query many for different chunks of the file. In that case you would probably want the hash for verification when you reassemble everything.

nschairer avatar Mar 16 '23 12:03 nschairer

Yep, breaking the file down into multiple events could be interesting to avoid connection hiccups requiring apps to restart the download from scratch.

It could be very simple with a new "File Header" kind

{
  "id": <32-bytes lowercase hex-encoded sha256 of the serialized event data>,
  "pubkey": <32-bytes lowercase hex-encoded public key of the event creator>,
  "created_at": <unix timestamp in seconds>,
  "kind": 30065,
  "tags": [
    ["d", <string with name of file>],
    ["decrypt",<algorithm>,<Decryption Params>],
    ["p", <32-bytes hex of a pubkey>, <recommended relay URL>],
    ["hash",< SHA256 hexencoded string of the complete raw data>],
    ["format",<"Base64" or "BSON">], 
    ["e", <part1>, <relay>],
    ["e", <part2>, <relay>],
    ["e", <part...>, <relay>],
    ["e", <partn>, <relay>],
  ],
  "content": "",
  "sig": <64-bytes hex of the signature of the sha256 hash of the serialized event data, which is the same as the "id" field>
}

vitorpamplona avatar Mar 16 '23 12:03 vitorpamplona
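A minimal sketch of how a client might produce such a header and its parts, assuming the NIP-01 id serialization (sha256 over the compact JSON array `[0, pubkey, created_at, kind, tags, content]`). `PART_KIND`, `build_file_events`, the relay URL, and the choice to leave part events without tags are illustrative assumptions, not part of the proposal; events are left unsigned and the optional `decrypt` and `p` tags are omitted.

```python
import base64
import hashlib
import json

HEADER_KIND = 30065  # kind used in the "File Header" proposal above
PART_KIND = 1064     # hypothetical placeholder kind for a part event


def event_id(pubkey, created_at, kind, tags, content):
    """NIP-01 style id: sha256 of the compact JSON serialization of the event."""
    serialized = json.dumps(
        [0, pubkey, created_at, kind, tags, content],
        separators=(",", ":"),
        ensure_ascii=False,
    )
    return hashlib.sha256(serialized.encode("utf-8")).hexdigest()


def build_file_events(data, name, pubkey, created_at, relay, chunk_size):
    """Split `data` into base64 part events and build a header citing them.

    Returns (header, parts). Events are unsigned, so "sig" is omitted.
    """
    parts = []
    for offset in range(0, len(data), chunk_size):
        chunk = data[offset:offset + chunk_size]
        content = base64.b64encode(chunk).decode("ascii")
        part = {
            "pubkey": pubkey,
            "created_at": created_at,
            "kind": PART_KIND,
            "tags": [],
            "content": content,
        }
        part["id"] = event_id(pubkey, created_at, PART_KIND, part["tags"], content)
        parts.append(part)

    tags = [
        ["d", name],
        ["hash", hashlib.sha256(data).hexdigest()],  # hash of the complete raw data
        ["format", "Base64"],
    ]
    # The order of the "e" tags defines the order of the parts.
    tags += [["e", part["id"], relay] for part in parts]

    header = {
        "pubkey": pubkey,
        "created_at": created_at,
        "kind": HEADER_KIND,
        "tags": tags,
        "content": "",
    }
    header["id"] = event_id(pubkey, created_at, HEADER_KIND, tags, "")
    return header, parts
```

In this sketch the header can only be built after every part id is known, which is why the parts carry no reference back to the header.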

Here is a link to the second layer protocol I proposed a couple of weeks back: https://github.com/nostr-ing/nostr-ing-protocol

It is pretty much the same approach, but moved to a second layer that operates on raw data, so it's much more efficient than doing all of this directly on nostr. It uses chunks by default in order to facilitate streams of live data too, if applicable.

Now I have to say that I have not spent much time on this since then and never implemented it fully, but I still believe that this is a good approach to handling data transmission on top of nostr.

Egge21M avatar Mar 16 '23 13:03 Egge21M

If you use gzip to compress the entire JSON automatically when connecting with relays, Base64 is not that bad: https://lemire.me/blog/2019/01/30/what-is-the-space-overhead-of-base64-encoding/

vitorpamplona avatar Mar 16 '23 13:03 vitorpamplona

Would be interesting to see performance comparison between @Egge7 's protocol and just keeping it in nostr with base64 chunks when performing encoding/decoding + reassembling + hash validation client / relay side

nschairer avatar Mar 17 '23 13:03 nschairer

Would be interesting to see performance comparison between @Egge7 's protocol and just keeping it in nostr with base64 chunks when performing encoding/decoding + reassembling + hash validation client / relay side

As I said, I only started spec'ing and never actually implemented anything. You can definitely do so and then do some benchmarking. However, even without benchmarking we know that base64 takes up about 33% more space than raw binary data, because each 8-bit output character carries only 6 bits of payload.

Egge21M avatar Mar 17 '23 13:03 Egge21M
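A small sketch of the size arithmetic discussed above: base64 emits 4 characters for every 3 input bytes, so the overhead is 4/3 - 1, about 33%. It also measures how much of that a DEFLATE pass recovers, using `zlib` as a stand-in for the gzip compression mentioned earlier in the thread. The 300,000-byte random sample is an arbitrary assumption; real files will compress differently.

```python
import base64
import os
import zlib


def base64_overhead(raw):
    """Fractional size increase of base64 text over the raw bytes."""
    encoded = base64.b64encode(raw)
    return len(encoded) / len(raw) - 1


# Random bytes are incompressible, so any saving comes from the encoding alone.
raw = os.urandom(300_000)
encoded = base64.b64encode(raw)
compressed = zlib.compress(encoded, 9)

print(f"raw: {len(raw)}  base64: {len(encoded)}  deflated base64: {len(compressed)}")
print(f"base64 overhead: {base64_overhead(raw):.1%}")
```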

The "hash" tag really makes no sense. It ended up coming in via copy/paste from NIP-94. I will remove it.

I also changed it to be a regular event, as being replaceable could cause more trouble than ease. With that, the kind changed to 1064.

frbitten avatar Mar 17 '23 13:03 frbitten

This suggestion of dividing the data into several parts is very good. But perhaps the use of the e tag is not the ideal way.

Because I need to know the event id of each part and what is its position in the order of the data.

For example, I have a file divided into 3 parts. In event 0 you would have tags for events 1 and 2. In event 1 you would have tags for events 0 and 2. In event 2 you would have tags for events 0 and 1.

I can first receive any event from the list and from there I have to be able to reconstruct all of them. And know in which position I include the first one I received.

I think the tag idea is good, but the tags could be put on the NIP-94 event. That way only one event would represent the whole file, and it would carry the list of parts to be downloaded.

frbitten avatar Mar 17 '23 14:03 frbitten

The issue with that idea is that every part cites every other part. They become heavier objects to parse. Instead, my proposal was to do just one header kind that cites all parts, has the hash, decrypting info, etc. Each part only has its contents and cites the header.

vitorpamplona avatar Mar 17 '23 14:03 vitorpamplona

The issue with that idea is that every part cites every other part. They become heavier objects to parse. Instead, my proposal was to do just one header kind that cites all parts, has the hash, decrypting info, etc. Each part only has its contents and cites the header.

That's why I suggested the NIP-94 (#337 ) which is a file sharing header.

frbitten avatar Mar 17 '23 14:03 frbitten

That's different. NIP-94 is just a hashed URL. There are no file "parts" baked into it. This one is for stored-in-relay files.

vitorpamplona avatar Mar 17 '23 14:03 vitorpamplona

It can be used in the same way.

As I put in the description, NIP-94 must be used to announce the NIP-95 file by referencing its event, because a NIP-95 event is not returned in broad searches, only when it is explicitly requested.

I can reference the event by the e tag, and use multiple tags for multiple parts. The URL tag can also be used in this case if the relay provides the option to download NIP-95 event data via HTTP.

You don't need to create another kind for this. It is perfectly possible to use the NIP-94

frbitten avatar Mar 17 '23 16:03 frbitten

I would separate them. It's too confusing to make them the same event and there might be reasons to separate a search for files in url from a search for files inside relays. Clients might want to support one but not the other, etc.

There is no shortage of integers for each kind.

vitorpamplona avatar Mar 17 '23 16:03 vitorpamplona

I would separate them. It's too confusing to make them the same event and there might be reasons to separate a search for files in url from a search for files inside relays. Clients might want to support one but not the other, etc.

There is no shortage of integers for each kind.

I see no problem in creating another specific event. But the operation would be the same as the NIP-94 and would only remove the url tag.

I'll put the definition of this new event here in the description.

frbitten avatar Mar 17 '23 16:03 frbitten

Hummmm, NIP-94 shouldn't describe file parts. This new one should describe how to reassemble and test hashes.

vitorpamplona avatar Mar 17 '23 16:03 vitorpamplona
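A minimal sketch of what "reassemble and test hashes" could look like on the client side, assuming a header event whose "e" tags list the parts in order and whose "hash" tag holds the SHA-256 of the complete raw data, as in the header proposal earlier in the thread. The function name `reassemble` and the `parts_by_id` mapping are hypothetical.

```python
import base64
import hashlib


def reassemble(header, parts_by_id):
    """Rebuild the raw file from part events and check it against the header hash.

    `parts_by_id` maps event id -> part event (a dict with a base64 "content").
    Raises ValueError if a part is missing or the hash does not match.
    """
    tags = header["tags"]
    part_ids = [tag[1] for tag in tags if tag[0] == "e"]  # tag order = part order
    expected = next((tag[1] for tag in tags if tag[0] == "hash"), None)

    chunks = []
    for part_id in part_ids:
        if part_id not in parts_by_id:
            raise ValueError(f"missing part {part_id}")
        chunks.append(base64.b64decode(parts_by_id[part_id]["content"], validate=True))

    data = b"".join(chunks)
    if expected is not None and hashlib.sha256(data).hexdigest() != expected:
        raise ValueError("hash mismatch after reassembly")
    return data
```

Because the order comes from the header, the parts can arrive from different relays in any order and still be put back together.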

[...] a possible solution is for this NIP not to be recorded in the database, but on disk, the file name being the event id [...]

I think I get @frbitten's idea. The relay is expected to save the file data in one complete disk file (or blob/bson at db). Then, it will serve it as an event (also, he says it could optionally serve from an https url, but this wouldn't be nostr).

For this to work, the relay would recreate the event dynamically (with same created_at and all other keys), upon request, before serving it to a client. ~So, maybe the one who uploaded won't be the owner of the NIP-95 event (or else, relay wouldn't be able to recreate the event with same id). The private key would be one the relay owns.~

~Better yet, the private key used to sign the event when recreating it dynamically should be a shared one, so that other relays could use it to sign and recreate the event themselves – I think the NIP-94 event id could be used as a private key.~

People later suggested it would be better to split the file into chunks. Now it won't be only one disk file/db record.

NIP-94 could have (when referencing a NIP-95 event instead of a simple url tag) tags pointing to the parts like @vitorpamplona said:

{
  // ...,
  tags: [
    // ...,
    ["e", /* NIP-95 event id */, /* recommended relay url */], // first part
    ["e", /* NIP-95 event id */, /* recommended relay url */] // second part
  ]
}

While a NIP-95 event would be

{
  id: 'an id',
  pubkey: 'a pubkey',
  tags: [
    ["e", /* NIP-94 event id */, /* recommended relay url */] // file header
  ],
  content: "string with sliced base64 data",
  created_at: 11111111
}

So NIP-95 wouldn't be stored as an event. It will be a file with created_at as file metadata, compressed base64 data and filename taken from the event.id. I think the e tag can't be a filesystem metadata like created_at can, so maybe it should be inside an auxiliary event-id.json. Well so better to put all metadata inside the json file like info about how the relay stored the main file, like the ~file extension (e.g. .gzip)~ used compression, created_at, stringified tags array.

🤔 All this to save like 28% of space in comparison to storing the uncompressed base64 event. Worth it!

arthurfranca avatar Mar 17 '23 19:03 arthurfranca

I am not sure why you would do all that. You can simply receive the event, take the base64 contents, convert it back to binary, save it on disk with the event ID as filename, and store the JSON without the .content field in the database. When a client requests it, pull the event from the database, use the Id to fetch the .content from the disk, recreate the base64 version, and send it signed by the original author as if nothing has happened.

Easy.

vitorpamplona avatar Mar 17 '23 19:03 vitorpamplona
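A rough sketch of the storage flow described above, under stated assumptions: the class name `BlobBackedStore` is hypothetical, a plain dict stands in for the relay's database, and the sample event is unsigned. One detail the sketch adds as an assumption: the binary is only moved to disk when re-encoding it reproduces the exact original base64 string, since otherwise the recreated event would no longer match its id and signature.

```python
import base64
import binascii
import tempfile
from pathlib import Path


class BlobBackedStore:
    def __init__(self, blob_dir):
        self.blob_dir = Path(blob_dir)
        self.db = {}  # stand-in for the relay database: event id -> event dict

    def store(self, event):
        content = event["content"]
        try:
            raw = base64.b64decode(content, validate=True)
            canonical = base64.b64encode(raw).decode("ascii") == content
        except (binascii.Error, ValueError):
            canonical = False
        if not canonical:
            # Re-encoding would not reproduce the signed content; keep it as-is.
            self.db[event["id"]] = dict(event)
            return
        # Binary goes to disk under the event id; the rest goes to the database.
        (self.blob_dir / event["id"]).write_bytes(raw)
        self.db[event["id"]] = {k: v for k, v in event.items() if k != "content"}

    def load(self, event_id):
        event = dict(self.db[event_id])
        if "content" not in event:
            raw = (self.blob_dir / event_id).read_bytes()
            event["content"] = base64.b64encode(raw).decode("ascii")
        return event


with tempfile.TemporaryDirectory() as blob_dir:
    relay = BlobBackedStore(blob_dir)
    event = {
        "id": "ab" * 32,  # placeholder id, not a real event hash
        "kind": 1064,
        "tags": [],
        "content": base64.b64encode(b"\x89PNG fake image bytes").decode("ascii"),
    }
    relay.store(event)
    on_disk = (Path(blob_dir) / event["id"]).read_bytes()
    restored = relay.load(event["id"])
```

The client receives `restored`, which is identical to the event it would have received had the relay stored the JSON whole.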

In fact, relays should do that for all events whose .content is bigger than.. say... 1MB... unless their DB solution handles binary content in disk really well.

vitorpamplona avatar Mar 17 '23 19:03 vitorpamplona

They are all valid options. The point is that the way to store it is up to the relay. Another option is to use a NoSQL database such as MongoDB, which already supports large documents (up to 16 MB), and put everything in the database. With that, the relay would also gain the ability to replicate the database and files across several clustered instances to meet a large demand from clients.

NOSTR defines the form of communication. How the data is stored is up to the relay to define the best format. I see no problem spending a little more on the transfer for simplicity. As I mentioned above, I can't imagine anyone wanting to store gigabytes that way. The trend will be small files. The Relay can also define maximum limits.

frbitten avatar Mar 17 '23 19:03 frbitten

I see this NIP being used for profile pictures, images, and short videos inside posts. Large files, but not THAT large.

vitorpamplona avatar Mar 17 '23 19:03 vitorpamplona

I am not sure why you would do all that.

I was trying to figure out what @frbitten suggested when he also said "the file name being the event id. So it can be easily found and searched. And because it is not in the database, it does not interfere with the indexing of common events."

If the event isn't stored, relay wouldn't be indexing the event.id, created_at and e tag (although e tag could be ee instead just to avoid indexing if it is expected to request NIP-95 by its event id only). So it would save a bit of space.

I am not sure why you would do all that. You can simply receive the event, take the base64 contents, convert it back to binary, save it on disk with the event ID as filename, and store the JSON without the .content field in the database. When a client requests it, pull the event from the database, use the Id to fetch the .content from the disk, recreate the base64 version, and send it signed by the original author as if nothing has happened.

Easier indeed.

arthurfranca avatar Mar 17 '23 19:03 arthurfranca

That's exactly the idea. This event is outside of all indexing, searching, and nostr relay processing. It will only be returned if requested by its own ID. This avoids a lot of processing and sending unwanted data

frbitten avatar Mar 17 '23 21:03 frbitten

Nostr is great for some things and not for other things. The protocol is not designed to handle large files. I think we should work on a Nostr-like protocol designed for storing large blobs instead.

Semisol avatar Mar 27 '23 16:03 Semisol

Events should not be used for storing data. There's the issue with base64 encoding increasing storage usage by 33%, there's the issue with being not streamable since a single WS message cannot be streamed, there's the issue of WS message size limits, ...

The increased data size is a price to pay for ease. As described at the beginning, the idea is not to store large data, but small information that is constantly used by applications or that needs to be shared easily. Currently there is nothing simple and practical to share small data. The WS limits are quite large and easily meet this need. Relays can define their maximum limits that are accepted.

Another important detail is that NOSTR is not a storage protocol but a communication protocol. The relay can choose how to store the data in the most efficient way for its case.

frbitten avatar Mar 28 '23 09:03 frbitten

Taking a step back… Having Nostr be the source for files (rather than just having text that may link to files elsewhere) increases the chance of problems with illegal content. Do we really want to be the people providing censorship resistance for CSAM (child sexual abuse materials)? Yes, by potentially linking to problem content, we're already involved. But actually distributing and hosting the files in question is a whole other level of involvement.

NIP-94 seems like a great idea. But this one worries me. I get where there are use cases where this may be really beneficial, but are they sufficient to outweigh the problems created by unintended consequences? I'd rather see the actual hosting of large files be something separate from Nostr and just have Nostr focus on the types of things it's already showing it can do well.

s3x-jay avatar Apr 20 '23 14:04 s3x-jay

Taking a step back… Having Nostr be the source for files (rather than just having text that may link to files elsewhere) increases the chance of problems with illegal content. Do we really want to be the people providing censorship resistance for CSAM (child sexual abuse materials)? Yes, by potentially linking to problem content, we're already involved. But actually distributing and hosting the files in question is a whole other level of involvement.

NIP-94 seems like a great idea. But this one worries me. I get where there are use cases where this may be really beneficial, but are they sufficient to outweigh the problems created by unintended consequences? I'd rather see the actual hosting of large files be something separate from Nostr and just have Nostr focus on the types of things it's already showing it can do well.

The same happens with notes. Nothing prevents someone from taking text from a book and putting it in a note, or using the markdown NIP to spread illegal content, or giving instructions on how to buy illegal products. Sharing a link to a copyrighted image in a note is the same issue. If you think about it that way, we have to do away with NOSTR and anything decentralized and rely on centralized companies and platforms.

Most only think of the NIP-95 for sharing large files, but its biggest use is for small things for quick and simple access.

frbitten avatar Apr 20 '23 15:04 frbitten

This is a bad idea. WebSockets have a limited frame size, base64 is inefficient, and a lot more with using Nostr (such as having to put this data in the DB).

@Semisol I don't know which frame limit you are referring to. In the WebSocket specification the limit is practically infinite (9,223,372,036,854,775,807 bytes ~= 9.22 exabytes); maybe it's a limitation of the language you're used to. WebSocket is designed to break a large message into smaller frames, send all of them, and reassemble the message at the destination (https://www.rfc-editor.org/rfc/rfc6455#section-10.4). And even if a language defines a frame limit, that limit applies to everything: I could hit it with a standard NOSTR note as well.

Yes base64 is inefficient, as transferring data in JSON is too. But NOSTR is not a protocol with an emphasis on efficiency, but on its simplicity and practicality. Therefore, I do not see these as issues that impede the NIP. You may find it a bad idea and it's your right but everyone is free to implement the NIPs they think are most important in their relay. So I don't see a technical reason for a pull request rejection.

Each relay can create their limits that they are willing to work with and implement the NIPs they find useful.

frbitten avatar Apr 24 '23 09:04 frbitten
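A short sketch of where the 9,223,372,036,854,775,807-byte figure comes from, assuming the payload-length encoding of RFC 6455 section 5.2: lengths up to 125 fit in the 7-bit field, the value 126 announces a 16-bit extended length, and 127 announces a 64-bit extended length whose most significant bit must be 0. The function name is hypothetical and the mask bit is left out for brevity.

```python
import struct

# Largest payload a single frame can declare: 64-bit length with the top bit clear.
MAX_PAYLOAD = 2**63 - 1


def encode_payload_length(n):
    """Return the payload-length bytes of a frame header (mask bit not set)."""
    if n < 0 or n > MAX_PAYLOAD:
        raise ValueError("payload length out of range for a single frame")
    if n <= 125:
        return bytes([n])                           # 7-bit length
    if n <= 0xFFFF:
        return bytes([126]) + struct.pack("!H", n)  # 16-bit extended length
    return bytes([127]) + struct.pack("!Q", n)      # 64-bit extended length


for size in (125, 126, 65_535, 65_536, MAX_PAYLOAD):
    print(size, len(encode_payload_length(size)))
```

In practice the limits relays and libraries enforce are configuration choices far below this ceiling, which matches the point that each relay can set its own maximum.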