sipi Support for non-image files

I'm trying to implement support for uploading PDF and CSV documents for our friends in Lausanne. I'd like to upload a PDF file to Sipi and store it in a temporary directory, then move it to a permanent directory. I can't use the directory tmp under imgroot, because then when I try to load the file from tmp (using just a normal URL, not a IIIF URL), Sipi tries to redirect to info.json. I guess this is because everything under imgroot is assumed to be an image.

So I guess I need another tmp directory under server. But then I have another problem: the filename hashing only works under imgroot. Could it be made to work under server as well?

Also, I'm wondering whether we couldn't just use the operating system's /tmp directory, which benefits from some optimisations (e.g. on Linux I think it can be cached in memory). We could make /tmp/sipi/images, /tmp/sipi/server, etc. Would this be possible?

Feb 06 '19 17:02 benjamingeer

We run Sipi inside a container. There is no system tmp folder per se that we can use. We would need to mount an external folder into tmp, but then it is the same thing as the others.

I would prefer to have a complete directory structure under a single folder as the default setting, e.g.,

assets
|- cache
|- images
|- server
|- tmp
|- whateverelse

so that only one folder needs to be mounted.

Also, for long-term preservation, we would need technical metadata. But this is a separate issue.

Feb 06 '19 19:02 subotic

@subotic OK, I guess I misunderstood. I thought you said there was already a Docker mount point for /tmp, and you were surprised when I said that Knora’s Sipi scripts don’t use it.

Feb 06 '19 20:02 benjamingeer

On Travis we mount tmp, because it was needed for some tests. In production I forgot. I can always add it if necessary. Not hard to do. I would simply prefer if we could simplify it. Less stuff that can go wrong.

Feb 06 '19 21:02 subotic

/tmp used to be needed for what was called the non-GUI upload case. Is that no longer the case?

Feb 07 '19 09:02 loicjaouen

After discussion with @lrosenth and @subotic:

All content stored by Sipi will be under one directory, which could be called assets and will have project-specific directories under it, like this:

assets
   |- 0801
       |- A
       |- B
       |- C
          |- 1W6YRMj8VAT-GSQtJWgILX5.jp2
          |- 2pEDmjZo6X2-G8UovBGLixa.pdf
          |- 7phnClRcYeX-DxPQ7qgKZfA.csv
       |- D
       |- E
   |- 0803
       |- A
       |- B
       |- C
       |- D
       |- E
   |- tmp
       |- A
       |- B
       |- C
       |- D
       |- E
   |- cache

This makes it easier to move or back up a project's files, because there's just one directory of files per project. Sipi will determine the file type when the file is requested, and respond appropriately: if the base file URL is requested and the file isn't an image, Sipi will return the file instead of info.json.
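
To make the planned behaviour concrete, here is a minimal sketch in Python (not Sipi's actual code). The function names, the returned strings and the extension list are assumptions made for illustration; in particular, Sipi may well detect the file type from the file contents rather than from the extension.

```python
from pathlib import PurePosixPath

# Assumption: a hypothetical set of extensions treated as images.
# The real server may inspect file contents instead.
IMAGE_EXTENSIONS = {".jp2", ".jpx", ".tif", ".tiff", ".jpg", ".jpeg", ".png"}


def asset_path(root, project, subdir, filename):
    """Build the path of a stored file, e.g. assets/0801/C/<filename>."""
    return PurePosixPath(root) / project / subdir / filename


def respond_to_base_url(filename):
    """Decide what a request for the base file URL should produce."""
    ext = PurePosixPath(filename).suffix.lower()
    if ext in IMAGE_EXTENSIONS:
        # Images keep the IIIF behaviour: redirect to info.json.
        return "redirect:info.json"
    # Anything else (PDF, CSV, ...) is returned as the file itself.
    return "serve:file"


pdf = asset_path("assets", "0801", "C", "2pEDmjZo6X2-G8UovBGLixa.pdf")
decision = respond_to_base_url(pdf.name)
```

With the example tree above, the .jp2 file would still redirect to info.json, while the .pdf and .csv files would be served directly.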

@lrosenth expects to be able to do this next week.

Feb 07 '19 13:02 benjamingeer

There is also the case of storing images and only serving them over non-IIIF URLs, e.g., icons, watermarks, etc. Could this case also be covered?

Feb 08 '19 07:02 subotic

> storing images and only serving them over non-IIIF URLs, e.g., icons, watermarks, etc.

Couldn't these be served over IIIF URLs, too?

Feb 08 '19 08:02 benjamingeer

For icons and such, I guess so. @kilchenmann will know for sure. We only need to make sure that all assets that are served through Sipi are referenced somewhere in webapi (e.g., in the project), and only accessed by URLs provided by webapi. A client shouldn't be allowed to upload things to Sipi without webapi knowing about it (eventually).

The only special case is the watermark image. I think it needs to be an absolute path to the image on disk.

Feb 08 '19 11:02 subotic

For icons (for resource classes) we want to use an existing library, and we store only the name of the icon. So we don't need to upload an image there. But for a project logo we should use the IIIF URL.

Feb 08 '19 13:02 kilchenmann

Re-posting my question here:

As discussed in the last developer meeting, we will need Knora and Sipi to handle PDF really soon now. From what I understood from @lrosenth, it doesn't involve a lot of work on Sipi...

If we validate/convert the PDF/A ourselves, when could we expect this support for non-image to be ready?

Jul 17 '19 13:07 mrivoal

End of July at the latest!

Jul 17 '19 14:07 lrosenth

I’ll be on holiday during the first two weeks of August, and will be able to work on the Knora side of this when I get back.

Jul 17 '19 14:07 benjamingeer

Don’t forget we still need support for text files.

Sep 03 '19 08:09 benjamingeer

With the current Sipi, the /knora.json route doesn't return originalFilename or originalMimeType for a PDF file. This means that we have to make these properties optional for file values in knora-base. Is that what we want to do?

Oct 21 '19 13:10 benjamingeer

There is no way to store the original filename and mime type within a PDF header, since PDFs are treated as "blobs" when uploading. However, knora.json now returns the internal name and internal mime type (which is the same as the original, since a PDF is not modified by Sipi) as originalFilename and originalMimetype. This could be a problem if an upload script changes the original name, but it does not break knora-base. The only workaround would be to use sidecar files, but I consider this problematic...
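
The fallback described above could be sketched like this in Python. This is only an illustration of the logic, not the actual knora.json implementation or response format; the function name, the `embedded` parameter and the `internalMimeType` key are hypothetical.

```python
def knora_json_fields(internal_filename, internal_mimetype, embedded=None):
    """Return file metadata, falling back to the internal values.

    `embedded` stands for original-file metadata stored in the file
    header (assumed to be available for images only). For blobs such
    as PDFs it is missing, so the internal name and mime type are
    reported as the original ones.
    """
    embedded = embedded or {}
    return {
        "internalFilename": internal_filename,
        "internalMimeType": internal_mimetype,  # hypothetical key name
        "originalFilename": embedded.get("originalFilename", internal_filename),
        "originalMimetype": embedded.get("originalMimetype", internal_mimetype),
    }


pdf_info = knora_json_fields("2pEDmjZo6X2-G8UovBGLixa.pdf", "application/pdf")
```

Here `pdf_info["originalFilename"]` equals the internal name, which is exactly the case that becomes a problem when the upload script has renamed the file.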

Jul 02 '20 21:07 lrosenth

In the future, we will need to find a way to store this kind of information. My gut feeling is that this is the job of dsp-api: Sipi is a media server and shouldn't be responsible for storing preservation metadata. So everything that would go into a sidecar file would go into dsp-api.

Jul 03 '20 05:07 subotic

Knora already stores originalFilename and originalMimetype if Sipi provides that information.

Jul 03 '20 05:07 benjamingeer

So for PDFs, Sipi doesn't/cannot record the original filename. What if someone wants to upload different PDFs under the same name? Shouldn't every uploaded file get a unique name?

Jul 03 '20 11:07 subotic

Knora's upload script always makes a random internalFilename for the uploaded file. The originalFilename is only remembered as part of the FileValue object in the triplestore.
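
As a rough sketch of that idea in Python (the real upload script is not written this way, and its filename format differs): a random internal name is generated for storage, and the original name is kept only in a separate record. The function name and the use of `secrets.token_hex` are assumptions for illustration.

```python
import secrets
from pathlib import PurePosixPath


def make_file_value(original_filename):
    """Pair a random internal filename with the original one.

    The internal name is what would be stored on disk; the original
    name only lives in the returned record (standing in for the
    FileValue in the triplestore).
    """
    ext = PurePosixPath(original_filename).suffix.lower()
    internal = secrets.token_hex(16) + ext  # 32 random hex characters
    return {
        "internalFilename": internal,
        "originalFilename": original_filename,
    }


first = make_file_value("report.pdf")
second = make_file_value("report.pdf")
```

Two uploads of files called report.pdf get different internal names, so they do not collide on disk.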

Jul 03 '20 11:07 benjamingeer
