
Externalized attachments

Open fflorent opened this issue 1 year ago • 11 comments

The idea has been proposed by @paulfitz https://github.com/gristlabs/grist-core/issues/886#issuecomment-2125243918

If I understand correctly (and I think this is also something we would like on the ANCT side), the idea is to allow creating an attachment column whose files are stored not in the document but in external storage such as S3.

There are several advantages, notably a smaller Grist document, which in turn would allow raising the attachment size limit.

fflorent avatar Jun 06 '24 14:06 fflorent

Thanks for opening this @fflorent ! Going to dump in some thoughts a developer prepared about this in 2022. Grist has changed somewhat since, and this was not a plan, just one person's thoughts (I personally disagree with some of it), but the set of concerns raised could be helpful for thinking about this project.

Externalizing attachments

External store

We need a generic interface for storing and retrieving file data that can be implemented in different ways. One obvious kind of data store is an S3-compatible store. Theoretically, the local filesystem might also work.
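To make the idea of a generic interface concrete, here is a minimal sketch. The names AttachmentStore and MemoryAttachmentStore are hypothetical and are not grist-core's actual API; the methods are kept synchronous for brevity, whereas an S3-backed implementation would return Promises.

```typescript
// Hypothetical interface: the names below are illustrative, not grist-core's API.
// Kept synchronous for brevity; a real S3-backed store would return Promises.
interface AttachmentStore {
  put(fileIdent: string, data: Uint8Array): void;
  get(fileIdent: string): Uint8Array | undefined;
  has(fileIdent: string): boolean;
  remove(fileIdent: string): void;
}

// Simplest possible implementation, mainly useful for tests.
class MemoryAttachmentStore implements AttachmentStore {
  private files = new Map<string, Uint8Array>();

  put(fileIdent: string, data: Uint8Array): void {
    // Copy so later mutation of the caller's buffer cannot change stored data.
    this.files.set(fileIdent, data.slice());
  }
  get(fileIdent: string): Uint8Array | undefined {
    return this.files.get(fileIdent);
  }
  has(fileIdent: string): boolean {
    return this.files.has(fileIdent);
  }
  remove(fileIdent: string): void {
    this.files.delete(fileIdent);
  }
}

const memStore: AttachmentStore = new MemoryAttachmentStore();
memStore.put("abc123.png", new Uint8Array([1, 2, 3]));
console.log(memStore.has("abc123.png")); // true
memStore.remove("abc123.png");
console.log(memStore.has("abc123.png")); // false
```

A filesystem or S3-compatible store would implement the same four operations, so the rest of the code would not need to know which one is configured.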

Migration

Once at least one store is implemented, _gristsys_Files could be deprecated, and a special migration could move the data from there to the external store. The usual Python migration system won’t be enough on its own because the data engine doesn’t see _gristsys_Files, but maybe it could make an external call to node to deal with that.

Downloading documents

We still want to be able to download a single self-contained .grist sqlite file containing all the attachments. When this happens, we’d need to:

  1. Make a copy of the database file
  2. Download all the externalized attachments and put them in the copy, perhaps back in _gristsys_Files
  3. Give that to the user to download
  4. When the document is uploaded again, perform the same process as the migration to move data to the external store.

This would also allow using downloaded documents in older versions of Grist.
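The download and re-upload steps above can be sketched with a toy model. Everything here is hypothetical: ToyDoc stands in for the sqlite file, inlineFiles for _gristsys_Files, and a plain Map for the external store.

```typescript
// Toy model: a "document" is its list of attachment idents plus an inline
// file table standing in for _gristsys_Files. All names are hypothetical.
interface ToyDoc {
  attachmentIdents: string[];
  inlineFiles: Map<string, Uint8Array>;
}
type ExternalFiles = Map<string, Uint8Array>;

// Steps 1-3: copy the document and pull every externalized file into the copy.
function makeSelfContainedCopy(doc: ToyDoc, external: ExternalFiles): ToyDoc {
  const copy: ToyDoc = {
    attachmentIdents: doc.attachmentIdents.slice(),
    inlineFiles: new Map<string, Uint8Array>(),
  };
  doc.inlineFiles.forEach((data, ident) => copy.inlineFiles.set(ident, data));
  for (const ident of copy.attachmentIdents) {
    if (copy.inlineFiles.has(ident)) { continue; }
    const data = external.get(ident);
    if (!data) { throw new Error(`missing external attachment: ${ident}`); }
    copy.inlineFiles.set(ident, data);
  }
  return copy;
}

// Step 4: on upload, move inline files out to the external store,
// which is the same operation as the one-off migration.
function externalizeOnUpload(doc: ToyDoc, external: ExternalFiles): void {
  doc.inlineFiles.forEach((data, ident) => external.set(ident, data));
  doc.inlineFiles.clear();
}

const externalFiles: ExternalFiles = new Map<string, Uint8Array>();
externalFiles.set("abc123.png", new Uint8Array([1, 2, 3]));
const liveDoc: ToyDoc = {
  attachmentIdents: ["abc123.png"],
  inlineFiles: new Map<string, Uint8Array>(),
};

const downloaded = makeSelfContainedCopy(liveDoc, externalFiles);
console.log(downloaded.inlineFiles.size); // 1

const otherInstallation: ExternalFiles = new Map<string, Uint8Array>();
externalizeOnUpload(downloaded, otherInstallation);
console.log(otherInstallation.size, downloaded.inlineFiles.size); // 1 0
```

The live document is never modified by the download, only the copy is.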

Serving attachments without the DocWorker

Currently the client uses a special DocWorker API to view and download attachments. To serve the files, the DocWorker retrieves them from _gristsys_Files. In the first iteration of work, this would be changed to retrieving them from the external store instead. But in the long term, it would be nice if the client could bypass the DocWorker and retrieve the files directly from the store. S3 would work well for this, but other types of store may not allow this.

Deleting externalized attachments

Attachments are likely to contain sensitive data, and storing them longer than necessary is a security risk. When a user deletes an attachment, it’s reasonable for them to expect it to actually be deleted eventually, just like any other data, so that it can’t be leaked. This applies whether they deleted a row, a document, or an entire organisation. We can’t actually fully delete the data immediately in the first case because deleted rows still live in the snapshot history, but we should delete them eventually.

This is like the problem of tracking attachments referenced within a document, on a much larger scale. In this case actually tracking the references (or maybe just their counts) from documents to the external store seems essential. These would need to be updated whenever a document is copied or deleted within a Grist installation. We’d need to consider:

  • “Duplicate Document”
  • “Work on a copy”
  • Other ways of ‘forking’ such as from fiddle mode or templates
  • Creation and pruning of snapshots.
  • Deleting a document permanently.

Downloading a document ‘disconnects’ it from the Grist installation so it doesn’t need to be counted. It has its own copy of the attachments so it should either delete or ignore the metadata about the externalized data.
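As a rough illustration of the reference tracking described above, here is a minimal sketch. AttachmentRefTracker and its methods are hypothetical names; a real implementation would persist this state (probably in the home database) and treat snapshots as referrers too.

```typescript
// Hypothetical tracker: records which documents (including copies, forks and
// snapshots) reference each externalized file.
class AttachmentRefTracker {
  private filesByDoc = new Map<string, Set<string>>();

  registerDoc(docId: string, fileIdents: string[]): void {
    this.filesByDoc.set(docId, new Set(fileIdents));
  }

  // "Duplicate Document", "Work on a copy", forks and snapshots all add a referrer.
  copyDoc(sourceDocId: string, newDocId: string): void {
    const source = this.filesByDoc.get(sourceDocId);
    if (!source) { throw new Error(`unknown document: ${sourceDocId}`); }
    this.filesByDoc.set(newDocId, new Set(source));
  }

  // Permanent deletion or snapshot pruning. Returns the files that no
  // remaining document references, which are now safe to delete.
  removeDoc(docId: string): string[] {
    const removed = this.filesByDoc.get(docId);
    if (!removed) { return []; }
    this.filesByDoc.delete(docId);
    const orphaned: string[] = [];
    removed.forEach((ident) => {
      if (!this.isReferenced(ident)) { orphaned.push(ident); }
    });
    return orphaned;
  }

  private isReferenced(ident: string): boolean {
    let found = false;
    this.filesByDoc.forEach((idents) => {
      if (idents.has(ident)) { found = true; }
    });
    return found;
  }
}

const tracker = new AttachmentRefTracker();
tracker.registerDoc("doc1", ["a.png", "b.pdf"]);
tracker.copyDoc("doc1", "doc1-copy");
console.log(tracker.removeDoc("doc1"));      // []
console.log(tracker.removeDoc("doc1-copy")); // [ 'a.png', 'b.pdf' ]
```

Nothing is deleted while any copy still references a file; only removing the last referrer reports the files as orphaned.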

Encryption

An alternative to tracking attachment references to allow deleting them is to encrypt the attachment data to avoid the need to delete it. Each attachment file would have a unique encryption key stored only in the corresponding row of _grist_Attachments. Once all copies of that row are fully deleted, the encryption key should be lost, and decrypting the data in the external store should become impossible. That means we don’t ever have to delete the actual data, so we don’t need to keep track of references to it.

Another security benefit of encryption is that if someone gains access to the data in the external attachments store, they can’t actually read it unless they also have the referencing documents.
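A minimal sketch of the per-attachment key idea, using Node's built-in crypto module. The choice of AES-256-GCM and the function and field names are assumptions made for illustration; the notes above do not specify a cipher or a storage layout.

```typescript
import { createCipheriv, createDecipheriv, randomBytes } from "node:crypto";

// What would be written to the external store (hypothetical layout).
interface EncryptedAttachment {
  iv: Buffer;
  authTag: Buffer;
  ciphertext: Buffer;
}

// Encrypt with a fresh random key. In the scheme described above, the key
// would live only in the attachment's row inside the document.
function encryptAttachment(plain: Buffer): { key: Buffer; blob: EncryptedAttachment } {
  const key = randomBytes(32); // AES-256 key
  const iv = randomBytes(12);  // 96-bit nonce, the usual size for GCM
  const cipher = createCipheriv("aes-256-gcm", key, iv);
  const ciphertext = Buffer.concat([cipher.update(plain), cipher.final()]);
  return { key, blob: { iv, authTag: cipher.getAuthTag(), ciphertext } };
}

// Decrypt; throws if the key is wrong or the stored data was tampered with.
function decryptAttachment(key: Buffer, blob: EncryptedAttachment): Buffer {
  const decipher = createDecipheriv("aes-256-gcm", key, blob.iv);
  decipher.setAuthTag(blob.authTag);
  return Buffer.concat([decipher.update(blob.ciphertext), decipher.final()]);
}

const { key, blob } = encryptAttachment(Buffer.from("contents of scan.pdf"));
console.log(decryptAttachment(key, blob).toString()); // contents of scan.pdf
```

Once every copy of the key is gone, the blob left in the external store is unreadable, which is the property that would let the stored data go undeleted.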

One downside is that serving attachments directly from S3 instead of the DocWorker becomes more tricky. Decrypting and displaying a single encrypted file in the browser using SubtleCrypto and createObjectURL seems straightforward. But it’s a lot more delicate to handle a user scrolling through a grid filled with thumbnails, displaying them all efficiently and then reclaiming memory after they disappear from view.

Access Control

Would need thinking about. Important to preserve the property that the existing metadata (particularly fileIdent) is not enough to download the file, so that access is properly revoked even if someone has a past copy of the metadata. It might also be nice if the download URL couldn’t be computed purely from the file content, so that someone with a local copy of a file can’t test whether it exists in the document.

paulfitz avatar Jun 06 '24 14:06 paulfitz

I added this to the other issue already. If you are looking for inspiration, HackMD has a pretty nice integration of Drag&Drop Image upload, which just puts files in a folder on the server using a random hash. UX-wise it is a pretty nice seamless integration and works flawlessly.

Sieboldianus avatar Sep 10 '24 04:09 Sieboldianus

Just to note that architecture work for this feature has started, with some initial thinking from @Spoffy here: https://docs.google.com/document/d/1ST_DuR22llDyx4PVAMdBlHDz22L5QIJ4qIyLMOUOtZ8/edit#heading=h.guzwiuwiigkd If anyone would like edit rights, let me know.

There are trade-offs to be made since Grist will no longer be a standalone single-file format. We're trying to come up with a design that still makes moving Grist docs between installations practical.

paulfitz avatar Oct 09 '24 16:10 paulfitz

@Spoffy I updated the proposed change to _grist_Files as follows:

  • Removed a separate new identifier column. How about we work with the existing ident column, improving it as needed. It is a little funky but that's ok, and better than having two very similar columns.
  • I converted the boolean column to an enum with just a single non-null value. It is equivalent and every boolean I've let into a schema I've eventually regretted.

If you combine these two changes, you end up with a very harmless low-commitment schema change that seems safe to do even without all details worked out?

paulfitz avatar Oct 10 '24 22:10 paulfitz

Paul and I have an initial design for this now, found here.

The latest design / implementation is in the doc, so I won't copy and paste it here.

Next steps are prototyping this approach and running some basic tests to ensure we haven't missed anything that would cause a big problem.

Spoffy avatar Oct 28 '24 17:10 Spoffy

Quick update on this. The prototype is underway now, and we've got basic attachment functionality working with MinIO and Filesystem storage.

Spoffy avatar Nov 13 '24 00:11 Spoffy

The first PR for this is open, albeit not yet ready for full review and merging: https://github.com/gristlabs/grist-core/pull/1320

It covers the necessary core document changes to make external attachments function, with just about enough configuration to let it be testable.

The full scope of that PR is in the PR description :slightly_smiling_face:

@fflorent @vviers @hexaltation - if any of you want to have a first look, now is a good time :slightly_smiling_face:

Spoffy avatar Nov 25 '24 22:11 Spoffy

@fflorent @hexaltation @vviers

#1358 just got merged into main, so external attachments should now be available using the GRIST_EXTERNAL_ATTACHMENTS_MODE=snapshots environment variable. This will make Grist store attachments in the same external storage as your snapshots. Only MinIO (or S3 via the MinIO client) is supported right now.

It also adds a few API endpoints that should help while we add the UI, which is in a separate PR:

  • POST /api/docs/:docId/attachments/transferAll - Starts all files transferring from their current storage to the configured one for a specific doc.
  • GET /api/docs/:docId/attachments/transferStatus - Gets the status of the active transfer.
  • GET /api/docs/:docId/attachments/store - Gets the configured attachment store for a doc.
  • POST /api/docs/:docId/attachments/store - Sets the configured attachment store for a doc (requires a body in this shape: { type: 'internal' } or { type: 'external' }).
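
For scripting against these endpoints, the four calls can be described with a small helper. This is only a sketch: attachmentApi, ApiCall and the example host are hypothetical, and only the paths, methods and body shapes are taken from the list above.

```typescript
// Hypothetical helper describing the four endpoints listed above.
type StoreType = "internal" | "external";

interface ApiCall {
  method: "GET" | "POST";
  url: string;
  body?: string;
}

function attachmentApi(baseUrl: string, docId: string) {
  const root = `${baseUrl}/api/docs/${encodeURIComponent(docId)}/attachments`;
  return {
    // Which store is configured for this doc.
    getStore: (): ApiCall => ({ method: "GET", url: `${root}/store` }),
    // Switch the doc between internal and external storage.
    setStore: (type: StoreType): ApiCall =>
      ({ method: "POST", url: `${root}/store`, body: JSON.stringify({ type }) }),
    // Start moving all files to the configured store.
    transferAll: (): ApiCall => ({ method: "POST", url: `${root}/transferAll` }),
    // Progress of the active transfer.
    transferStatus: (): ApiCall => ({ method: "GET", url: `${root}/transferStatus` }),
  };
}

const api = attachmentApi("https://grist.example.com", "myDocId");
console.log(api.setStore("external").url);  // https://grist.example.com/api/docs/myDocId/attachments/store
console.log(api.setStore("external").body); // {"type":"external"}
```

Each ApiCall can then be passed to fetch or any other HTTP client, with a JSON content type and whatever authentication your installation uses (assumed here to be an API key sent as a Bearer token).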

@fflorent - If you could test this out before we run the release, that'd be great, just to make sure it's working at your end.

Spoffy avatar Feb 04 '25 19:02 Spoffy

Hello @Spoffy

thanks a lot for the job :)

I wrote a script to test and it works for small documents. Not yet tested on our 1GB files.

GET old store
{"type":"external"}

POST store type
POST BODY : {"type":"external"}
{"store":"hL5WbM2TsZjDYT3ghqCHeU-snapshots"}

GET new store
{"type":"external"}

Start transfer
https://preprod.of.grist.gouv.fr:443/api/docs/xxxxxxxxxxxxxxxxxx/attachments
{"status":{"pendingTransferCount":3,"isRunning":true},"locationSummary":"internal"}

get transfert status step 1
{"status":{"pendingTransferCount":0,"isRunning":false},"locationSummary":"external"}
get transfert status step 2
{"status":{"pendingTransferCount":0,"isRunning":false},"locationSummary":"external"}
get transfert status step 3
{"status":{"pendingTransferCount":0,"isRunning":false},"locationSummary":"external"}
get transfert status step 4
{"status":{"pendingTransferCount":0,"isRunning":false},"locationSummary":"external"}
get transfert status step 5
{"status":{"pendingTransferCount":0,"isRunning":false},"locationSummary":"external"}
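
For anyone scripting the same check, the status responses above can be polled until the transfer is done. A minimal sketch: the response type is inferred from the output shown here and may be incomplete, and fetchStatus is a hypothetical callback that performs the GET on transferStatus.

```typescript
// Shape inferred from the responses logged above; may be incomplete.
interface TransferStatusResponse {
  status: { pendingTransferCount: number; isRunning: boolean };
  locationSummary: string;
}

// A transfer is done when nothing is running and nothing is pending.
function isTransferFinished(response: TransferStatusResponse): boolean {
  return !response.status.isRunning && response.status.pendingTransferCount === 0;
}

// Poll until finished. fetchStatus is supplied by the caller (hypothetical).
async function waitForTransfer(
  fetchStatus: () => Promise<TransferStatusResponse>,
  intervalMs = 1000,
  maxAttempts = 60,
): Promise<TransferStatusResponse> {
  for (let attempt = 0; attempt < maxAttempts; attempt++) {
    const response = await fetchStatus();
    if (isTransferFinished(response)) { return response; }
    await new Promise<void>((resolve) => setTimeout(resolve, intervalMs));
  }
  throw new Error("attachment transfer did not finish in time");
}

const started: TransferStatusResponse = JSON.parse(
  '{"status":{"pendingTransferCount":3,"isRunning":true},"locationSummary":"internal"}');
const finished: TransferStatusResponse = JSON.parse(
  '{"status":{"pendingTransferCount":0,"isRunning":false},"locationSummary":"external"}');
console.log(isTransferFinished(started));  // false
console.log(isTransferFinished(finished)); // true
```

For 1GB files, intervalMs and maxAttempts would need to be raised considerably.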

hexaltation avatar Feb 12 '25 14:02 hexaltation

I understand that this would not be within the scope of the prototype, but perhaps OpenDAL could be used for the final implementation. It's a library that acts as an interface for various storage solutions, ranging from the local file system and S3 to Dropbox and MongoDB. This may help to simplify the implementation and make it significantly easier to support other storage backends.


It supports the following storage backends.

Type | Services
--- | ---
Standard Storage Protocols | ftp http sftp webdav
Object Storage Services | azblob cos gcs obs oss s3 b2 openstack_swift upyun vercel_blob
File Storage Services | fs alluxio azdls azfile compfs dbfs gridfs hdfs hdfs_native ipfs webhdfs
Consumer Cloud Storage Service | aliyun_drive gdrive onedrive dropbox koofr pcloud seafile yandex_disk
Key-Value Storage Services | cacache cloudflare_kv dashmap memory etcd foundationdb persy redis rocksdb sled redb tikv
Database Storage Services | d1 mongodb mysql postgresql sqlite surrealdb
Cache Storage Services | ghac memcached mini_moka moka vercel_artifacts
Git Based Storage Services | huggingface

Note that it supports SQLite, so in-document storage of attachments could simply be another backing implementation.

QazCetelic avatar Jun 13 '25 06:06 QazCetelic

@QazCetelic anything that can implement this interface https://github.com/gristlabs/grist-core/blob/f1415b813473ffd3758dafa76f5a4824fd63906c/app/server/lib/ExternalStorage.ts#L24-L78 could be used as an external store. For attachments specifically, versioning isn't necessary, so it could be an even simpler subset of the functionality. Grist supports having multiple interfaces to external storage https://github.com/gristlabs/grist-core/blob/f1415b813473ffd3758dafa76f5a4824fd63906c/app/server/lib/ICreate.ts#L141-L147 so adding something for OpenDAL could be quite practical.

paulfitz avatar Jun 13 '25 15:06 paulfitz

FYI, this issue still shows as Open.

Enro avatar Oct 09 '25 15:10 Enro

@paulfitz I think I lack the permissions for that, could you unpin this issue?

We can still see it here:

Image

fflorent avatar Nov 19 '25 13:11 fflorent

Done, thanks for the reminder @fflorent.

paulfitz avatar Nov 19 '25 14:11 paulfitz