StorageIO.isBelowIngestSizeLimit() ignores instance-wide defaults
That is, if the ingest size limit property is not defined for the specific store driver, no limit is assumed, even if an instance-wide limit is configured (via a setting). This store-specific ingest limit mechanism also does not appear to support defining limits for specific formats, as can be done with the instance-wide limits.
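For illustration, a minimal sketch of the fallback order I would expect, with stand-in lookups (`perStoreLimit`, `instanceWideLimit`, the ~2 GB value) in place of the actual per-store property and instance-wide setting; none of these names are the real Dataverse API:

```java
// Illustrative sketch only; the lookups below are stand-ins, not the real
// Dataverse API. The point is the fallback order: per-store limit first,
// then the instance-wide setting, and only then "no limit".
public class IngestSizeLimitSketch {

    // Returns null when no limit is configured at this level.
    static Long perStoreLimit(String storeDriverId) {
        return null; // assume the property is not defined for this store
    }

    static Long instanceWideLimit() {
        return 2_000_000_000L; // assumed instance-wide default, ~2 GB
    }

    static boolean isBelowIngestSizeLimit(String storeDriverId, long fileSize) {
        Long limit = perStoreLimit(storeDriverId);
        if (limit == null) {
            // Fall back to the instance-wide default instead of assuming "no limit".
            limit = instanceWideLimit();
        }
        return limit == null || fileSize <= limit;
    }

    public static void main(String[] args) {
        // A 5 GB file on a store with no per-store limit:
        System.out.println(isBelowIngestSizeLimit("s3", 5_000_000_000L)); // false with the fallback
    }
}
```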
What happens then is that IngestService marks the file as "scheduled for ingest" (`ingestStatus=66`). This generally does not result in anything fatal; i.e., no tabular files over the instance-wide limits end up getting ingested, because the size limit is enforced later, before the file is put on the JMS ingest queue. But, based on some reports, it is possible for files to get stuck in that state (presumably when something else goes wrong before or during the final version save, so it never reaches the "startIngestJobs" point where the JMS queue is initiated, as described just above). The end result is that the files are shown with the "Ingest in Progress" label on the dataset page. (Note, though, that these will be reset the next time any files are added to the version, because the same size check is performed again on all "scheduled for ingest" files when the JMS queue is initialized, and the ones above the limit will be knocked down to `ingestStatus=none`.)
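Roughly, the two-step flow described above can be sketched as follows; class and method names are illustrative stand-ins, not the actual IngestService code, and only the `ingestStatus` values come from the description above:

```java
import java.util.ArrayList;
import java.util.List;

// Sketch of the two-step flow: the file is marked as scheduled up front,
// and the size limit is only re-checked when the ingest jobs are started.
public class IngestFlowSketch {

    static class DataFileStub {
        long size;
        String ingestStatus = "none"; // stands in for ingestStatus=none
        DataFileStub(long size) { this.size = size; }
    }

    static final long INSTANCE_WIDE_LIMIT = 2_000_000_000L; // assumed value

    // Step 1: when the file is added, the store-level check returns true if no
    // per-store limit is defined, so the file gets marked "scheduled for ingest".
    static void scheduleForIngest(DataFileStub f, boolean isBelowIngestSizeLimit) {
        if (isBelowIngestSizeLimit) {
            f.ingestStatus = "scheduled"; // i.e. ingestStatus=66
        }
    }

    // Step 2: when the JMS queue is initialized, the limit is checked again;
    // oversized files are knocked back down to "none", the rest are queued.
    static List<DataFileStub> startIngestJobs(List<DataFileStub> files) {
        List<DataFileStub> queued = new ArrayList<>();
        for (DataFileStub f : files) {
            if (!"scheduled".equals(f.ingestStatus)) {
                continue;
            }
            if (f.size > INSTANCE_WIDE_LIMIT) {
                f.ingestStatus = "none"; // reset; otherwise it lingers as "Ingest in Progress"
            } else {
                queued.add(f); // stand-in for putting the file on the JMS ingest queue
            }
        }
        return queued;
    }
}
```

If the second step is never reached (e.g. the version save fails), nothing ever resets the status, which matches the stuck "Ingest in Progress" reports.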
Somewhat more seriously, inside the IngestService bean, `.isBelowIngestSizeLimit()` is now also used to decide whether to attempt to extract other types of metadata from files - specifically, from netCDF/HDF files. In this case the side effect is that IngestService will attempt to read even a terabyte-size file, unless a lower limit is defined on the store (it is important to keep in mind that, unlike tabular data ingest, this metadata extraction happens in real time, synchronously).
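One possible shape of a guard at that call site is sketched below; `StorageIOLike`, `getSize()`, and the limit value are assumptions for illustration, and only `isBelowIngestSizeLimit()` itself comes from the actual code:

```java
// Illustrative guard for the synchronous netCDF/HDF metadata extraction path;
// StorageIOLike and the limit value are stand-ins, not the actual Dataverse code.
public class MetadataExtractionGuardSketch {

    interface StorageIOLike {
        boolean isBelowIngestSizeLimit(); // currently true when no per-store limit exists
        long getSize();                   // stand-in accessor for the file size
    }

    static final long INSTANCE_WIDE_LIMIT = 2_000_000_000L; // assumed fallback value

    static void maybeExtractNetcdfHdfMetadata(StorageIOLike storageIO) {
        // Today this branch is taken even for a terabyte-size file on a store
        // with no limit defined; checking an instance-wide fallback here (or
        // inside isBelowIngestSizeLimit() itself) would keep the synchronous
        // extraction from reading such files.
        if (storageIO.isBelowIngestSizeLimit() && storageIO.getSize() <= INSTANCE_WIDE_LIMIT) {
            extractMetadata(storageIO); // runs in real time, synchronously
        }
    }

    static void extractMetadata(StorageIOLike storageIO) {
        // placeholder for the actual netCDF/HDF metadata extraction
    }
}
```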