Support for non-image files
I'm trying to implement support for uploading PDF and CSV documents for our friends in Lausanne. I'd like to upload a PDF file to Sipi and store it in a temporary directory, then move it to a permanent directory. I can't use the directory tmp under imgroot, because then when I try to load the file from tmp (using just a normal URL, not a IIIF URL), Sipi tries to redirect to info.json. I guess this is because everything under imgroot is assumed to be an image.
So I guess I need another tmp directory under server. But then I have another problem: the filename hashing only works under imgroot. Could it be made to work under server as well?
Also, I'm wondering whether we couldn't just use the operating system's /tmp directory, which benefits from some optimisations (e.g. on Linux I think it can be cached in memory). We could make /tmp/sipi/images, /tmp/sipi/server, etc. Would this be possible?
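The upload flow described above (write to a temporary directory, then move to a permanent one) can be sketched roughly as follows. This is only an illustration in Python, not Sipi's actual code; the function name `store_upload` and the directory arguments are hypothetical.

```python
import shutil
from pathlib import Path


def store_upload(data: bytes, filename: str, tmp_dir: Path, permanent_dir: Path) -> Path:
    """Write an upload into tmp_dir, then move it into permanent_dir.

    Hypothetical helper: the directory names and this two-step flow are
    assumptions based on the description in this thread.
    """
    tmp_dir.mkdir(parents=True, exist_ok=True)
    permanent_dir.mkdir(parents=True, exist_ok=True)

    # Step 1: stage the uploaded bytes in the temporary directory.
    staged = tmp_dir / filename
    staged.write_bytes(data)

    # Step 2: move the staged file to its permanent location.
    final = permanent_dir / filename
    shutil.move(str(staged), str(final))
    return final
```

If both directories live under the same mounted folder, the move is normally a cheap rename on the same filesystem rather than a copy.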
We run Sipi inside a container. There is no system tmp folder per se that we can use. We would need to mount an external folder into tmp, but then it is the same thing as the others.
I would prefer to have a complete directory structure under a single folder as the default setting, e.g.,
assets
|- cache
|- images
|- server
|- tmp
|- whateverelse
so that only one folder needs to be mounted.
Also, for long-term preservation, we would need technical metadata. But this is a separate issue.
@subotic OK, I guess I misunderstood. I thought you said there was already a Docker mount point for /tmp, and you were surprised when I said that Knora’s Sipi scripts don’t use it.
On Travis we mount tmp, because it was needed for some tests. In production I forgot. I can always add it if necessary. Not hard to do. I would simply prefer if we could simplify it. Less stuff that can go wrong.
/tmp used to be needed for what was called the non-GUI upload case. Is that no longer the case?
After discussion with @lrosenth and @subotic:
All content stored by Sipi will be under one directory, which could be called assets and will have project-specific directories under it, like this:
assets
   |- 0801
       |- A
       |- B
       |- C
          |- 1W6YRMj8VAT-GSQtJWgILX5.jp2
          |- 2pEDmjZo6X2-G8UovBGLixa.pdf
          |- 7phnClRcYeX-DxPQ7qgKZfA.csv
       |- D
       |- E
   |- 0803
       |- A
       |- B
       |- C
       |- D
       |- E
   |- tmp
       |- A
       |- B
       |- C
       |- D
       |- E
   |- cache
This makes it easier to move or back up a project's files, because there's just one directory of files per project. Sipi will determine the file type when the file is requested, and respond appropriately: if the base file URL is requested and the file isn't an image, Sipi will return the file instead of info.json.
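To make the proposal concrete, here is a rough Python sketch of the path layout and of the "info.json or the file itself" decision. It is not Sipi's implementation: the function names and the extension list are hypothetical, and checking the extension is only a stand-in for however Sipi actually determines the file type.

```python
from pathlib import Path

# Assumption: a small, illustrative set of image extensions.
IMAGE_EXTENSIONS = {".jp2", ".jpx", ".tif", ".tiff", ".jpg", ".jpeg", ".png"}


def asset_path(assets_root: Path, project: str, subdir: str, filename: str) -> Path:
    """Build a path following the layout assets/<project>/<subdir>/<filename>."""
    return assets_root / project / subdir / filename


def response_for_base_url(filename: str) -> str:
    """Decide what a request for the base file URL should return.

    Images get "info.json"; everything else gets "file", meaning the
    file itself is returned.
    """
    if Path(filename).suffix.lower() in IMAGE_EXTENSIONS:
        return "info.json"
    return "file"
```

With the example tree above, `asset_path(Path("assets"), "0801", "C", "2pEDmjZo6X2-G8UovBGLixa.pdf")` points at the PDF, and `response_for_base_url` would choose to return the file rather than info.json for it.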
@lrosenth expects to be able to do this next week.
There is also the case of storing images and only serving them over non-IIIF URLs, e.g., icons, watermarks, etc. Could this case also be covered?
storing images and only serving them over non-IIIF URLs, e.g., icons, watermarks, etc.
Couldn't these be served over IIIF URLs, too?
For icons and such, I guess so. @kilchenmann will know for sure. We only need to make sure that all assets served through Sipi are referenced somewhere in webapi (e.g., in the project), and only accessed by URLs provided by webapi. A client shouldn't be allowed to upload things to Sipi without webapi knowing about it (eventually).
The only special case is the watermark image. I think it needs to be an absolute path to the image on disk.
For icons (for resource classes) we want to use an existing library, and we store only its name. So we don't need to upload an image there. But for a project logo we should use the IIIF URL.
Re-posting my question here:
As discussed in the last developer meeting, we will need Knora and Sipi to handle PDF really soon now. From what I understood from @lrosenth, it doesn't involve a lot of work on Sipi...
If we validate/convert the PDF/A ourselves, when could we expect this support for non-image to be ready?
End of July at latest!
(Sent by email in reply to Marion Rivoal's comment of 17 Jul 2019, 15:58, quoted above.)
I’ll be on holiday during the first two weeks of August, and will be able to work on the Knora side of this when I get back.
Don’t forget we still need support for text files.
With the current Sipi, the /knora.json route doesn't return originalFilename or originalMimeType for a PDF file. This means that we have to make these properties optional for file values in knora-base. Is that what we want to do?
There is no way to store the original filename and MIME type within a PDF header, since PDFs are treated as "blobs" when uploading. However, knora.json now returns the internal name and internal MIME type (which is the same as the original, since Sipi does not modify a PDF) as originalFilename and originalMimetype.
This could be a problem if an upload script changes the original name, but it does not break knora-base. The only workaround would be to use sidecar files, but I consider this problematic...
In the future, we will need to find a way to store this kind of information. My gut feeling is that this should be the job of dsp-api: Sipi is a media server and shouldn't be responsible for storing preservation metadata. So everything that would go into a sidecar file would go into dsp-api instead.
Knora already stores originalFilename and originalMimetype if Sipi provides that information.
So for PDFs, Sipi doesn't/cannot record the original filename. What if someone wants to upload different PDFs under the same name? Shouldn't every uploaded file get a unique name?
Knora's upload script always makes a random internalFilename for the uploaded file. The originalFilename is only remembered as part of the FileValue object in the triplestore.
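As an illustration of that idea, a random internal filename can be generated while the original name is kept alongside it. This Python sketch is an assumption about the general approach, not the actual Knora upload script; `UploadedFile` and `make_internal_filename` are hypothetical names, and the real naming scheme may differ.

```python
import secrets
from dataclasses import dataclass
from pathlib import Path


@dataclass
class UploadedFile:
    internal_filename: str  # random name used for storage
    original_filename: str  # name supplied by the uploader, kept as metadata


def make_internal_filename(original_filename: str) -> UploadedFile:
    """Pair a random, URL-safe internal name with the original filename."""
    suffix = Path(original_filename).suffix.lower()  # keep the extension, e.g. ".pdf"
    token = secrets.token_urlsafe(16)  # random token; collisions are very unlikely
    return UploadedFile(
        internal_filename=f"{token}{suffix}",
        original_filename=original_filename,
    )
```

Two uploads both named `report.pdf` would then get different internal names, while each record still remembers `report.pdf` as the original filename.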