Raw metadata not being populated for PDF documents

Open blevine opened this issue 4 years ago • 4 comments

Describe the bug

I'm running a test on 3 PDF documents. The properties under "file", "path", and "attributes" appear to be extracted and stored in Elasticsearch correctly. However, the raw metadata is not: the metadata property is an empty object {}.

Job Settings

---
name: "test_job"
fs:
  url: "/Users/blevine/work/es/fulfillment/test_docs"
  update_rate: "15m"
  excludes:
  - "*/~*"
  json_support: false
  filename_as_id: true
  add_filesize: true
  remove_deleted: true
  add_as_inner_object: false
  store_source: false
  index_content: false
  attributes_support: true
  raw_metadata: true
  xml_support: false
  index_folders: false
  lang_detect: false
  continue_on_error: false
  ocr:
    language: "eng"
    enabled: true
    pdf_strategy: "ocr_and_text"
  follow_symlinks: false
elasticsearch:
  nodes:
  - url: "http://127.0.0.1:9200"
  bulk_size: 100
  flush_interval: "5s"
  byte_size: "10mb"

Expected behavior

Raw metadata should be populated.

Note: exiftool returns the following:

[{
  "SourceFile": "BOOK_COS_Release_Notes.pdf",
  "ExifToolVersion": 12.08,
  "FileName": "BOOK_COS_Release_Notes.pdf",
  "Directory": ".",
  "FileSize": "5.6 MB",
  "FileModifyDate": "2020:10:27 12:18:40-04:00",
  "FileAccessDate": "2020:10:27 12:19:04-04:00",
  "FileInodeChangeDate": "2020:10:27 12:18:40-04:00",
  "FilePermissions": "rw-r--r--",
  "FileType": "PDF",
  "FileTypeExtension": "pdf",
  "MIMEType": "application/pdf",
  "PDFVersion": 1.5,
  "Linearized": "No",
  "Encryption": "Standard V1.3 (40-bit)",
  "UserAccess": "Extract",
  "Title": "Release Notes",
  "Creator": "DocBook XSL Stylesheets V1.79.1",
  "Author": "Ab Initio",
  "Producer": "XEP 4.30.961",
  "Product": "Co>Operating System",
  "Lang": "en-US",
  "Internal": false,
  "ProdVer": "4.0.2",
  "Trapped": false,
  "CreateDate": "2020:10:14 19:26:12Z",
  "ModifyDate": "2020:10:14 19:26:12Z",
  "PageCount": 286,
  "PageMode": "UseOutlines",
  "Warning": "XMP format error (no closing tag for ?)",
  "TaggedPDF": "Yes",
  "Language": "en"
}]

Versions:

  • OS: macOS
  • Version: 2.7-SNAPSHOT (fscrawler-es7-2.7-20201027.085254-130)

blevine, Oct 27 '20 21:10

Attaching the metadata output from Apache Tika: metadata.txt

blevine, Oct 27 '20 22:10

Additional information: it seems that if I set index_content: false, metadata is skipped. When it is set to true, all metadata, including the raw metadata, is output as expected. I can tell that the metadata is being skipped because when I set index_content: true, I get a few warning messages that say: "Metadata is not encrypted, but was expected to be". I don't see these warnings when I set index_content: false. The problem is, I really don't want to include the content in my index.

blevine, Oct 28 '20 01:10

Ok. I took a look at the source and saw that metadata is processed only when index_content is true. I'm not sure why this is, and the documentation isn't explicit about this behavior. I worked around it by setting indexed_chars to 0. So I guess you can close this issue if this is really the expected behavior, or consider it a documentation bug.
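
For anyone hitting the same thing, the workaround might look like the following job settings fragment. This is a minimal sketch, not a verified configuration: it assumes index_content is left at true (so that the metadata processing still runs) and that indexed_chars sits under fs alongside the other settings shown earlier in this issue.

```yaml
---
name: "test_job"
fs:
  url: "/Users/blevine/work/es/fulfillment/test_docs"
  # Assumption: content processing stays enabled, because metadata is
  # only processed when index_content is true.
  index_content: true
  # Workaround from this thread: limit extracted content to zero characters.
  indexed_chars: 0
  raw_metadata: true
```

The other settings from the original job file are omitted here for brevity and would stay unchanged.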

blevine, Oct 28 '20 02:10

I think that the workaround you found is very good. Would you like to add a note in the documentation?

I'd probably add the note here: https://fscrawler.readthedocs.io/en/latest/admin/fs/local-fs.html#ignore-content

dadoonet, Jan 20 '21 14:01