Reconcile handling of the "file size"

Open dgoldenberg1234 opened this issue 8 years ago • 1 comments

We've agreed that we'll need two distinct "system level" fields to maintain the content size:

  • the "original_content_size" - the size of an input file as it got pushed into the framework, e.g. by a scanner
  • the "final_content_size" - the size of the processed content. E.g. if Tika has run, this would be the size of the extracted text.

Right now, we have a FIELD_FILE_SIZE on the Document interface. That'll need to be refactored accordingly.

Any references to the "file_size" literal, e.g. in SimpleFileWatchScanner, will need to be refactored accordingly.
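As a rough illustration of the refactoring described above, the single FIELD_FILE_SIZE constant could give way to two constants. This is only a sketch: the class name `SizeFieldsDemo`, the constant names, and the `scan` helper are hypothetical, and a plain map stands in for a Document.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical sketch; these constants are assumed to live on the Document
// interface in the real code, replacing FIELD_FILE_SIZE.
class SizeFieldsDemo {
    static final String FIELD_ORIGINAL_CONTENT_SIZE = "original_content_size";
    static final String FIELD_FINAL_CONTENT_SIZE = "final_content_size";

    // Scanner-like entry point: records the size of the raw input as it
    // enters the framework, using the constant instead of a string literal.
    static Map<String, String> scan(byte[] rawInput) {
        Map<String, String> fields = new LinkedHashMap<>();
        fields.put(FIELD_ORIGINAL_CONTENT_SIZE, String.valueOf(rawInput.length));
        return fields;
    }

    public static void main(String[] args) {
        Map<String, String> fields = scan(new byte[2048]);
        System.out.println(fields); // prints {original_content_size=2048}
    }
}
```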

dgoldenberg1234 · Apr 01 '16 19:04

These are conventions for field names, not something integral to the running of the system. I don't think we want to make anything in this ticket specific to Tika, but if the Tika processor is to be enhanced, it can define its own default field. We should probably allow these names to be tweaked by config. "original_content_size" can be the size of the byte[] created at the scanner/listener entry point, and "content_size" can be the 'normal' field that a processor sets if it changes the byte[]. Processors are free to additionally set their own record of the size of the source data, such as tika_content_size or stax_spliter_content_size.
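A minimal sketch of that convention, assuming the field names are passed in as configuration with the defaults named above. Everything here (`ContentSizeSketch`, `SizeFieldNames`, `ingest`, `recordProcessed`) is hypothetical and not part of JesterJ; a plain map stands in for a Document, and how the names would actually be loaded from config is left out.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical sketch; none of these types exist in JesterJ.
class ContentSizeSketch {

    // Holds the (configurable) names of the two size fields.
    static final class SizeFieldNames {
        final String original;
        final String current;

        SizeFieldNames(String original, String current) {
            this.original = original;
            this.current = current;
        }

        // Default names taken from the discussion above.
        static SizeFieldNames defaults() {
            return new SizeFieldNames("original_content_size", "content_size");
        }
    }

    // Scanner/listener entry point: record the size of the byte[] as created.
    static Map<String, String> ingest(byte[] raw, SizeFieldNames names) {
        Map<String, String> fields = new LinkedHashMap<>();
        fields.put(names.original, String.valueOf(raw.length));
        return fields;
    }

    // A processor that replaces the byte[] sets the 'normal' size field;
    // the original size is left untouched.
    static void recordProcessed(Map<String, String> fields, byte[] processed,
                                SizeFieldNames names) {
        fields.put(names.current, String.valueOf(processed.length));
    }

    public static void main(String[] args) {
        SizeFieldNames names = SizeFieldNames.defaults();
        Map<String, String> fields = ingest(new byte[100], names);
        recordProcessed(fields, new byte[40], names);
        System.out.println(fields);
        // prints {original_content_size=100, content_size=40}
    }
}
```

A processor that wants its own record, such as tika_content_size, would simply put an additional field alongside these two.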

nsoft · Apr 19 '16 23:04

I think I'm going to wait until someone needs this to implement, and at that time address the specific need.

nsoft · Feb 21 '23 18:02