fscrawler
Raw metadata not being populated for PDF documents
Describe the bug
I'm running a test on 3 PDF documents. The properties under "file", "path", "attributes" appear to be extracted and stored in Elasticsearch correctly. However, the raw metadata is not: the metadata property is an empty object {}.
Job Settings
---
name: "test_job"
fs:
  url: "/Users/blevine/work/es/fulfillment/test_docs"
  update_rate: "15m"
  excludes:
  - "*/~*"
  json_support: false
  filename_as_id: true
  add_filesize: true
  remove_deleted: true
  add_as_inner_object: false
  store_source: false
  index_content: false
  attributes_support: true
  raw_metadata: true
  xml_support: false
  index_folders: false
  lang_detect: false
  continue_on_error: false
  ocr:
    language: "eng"
    enabled: true
    pdf_strategy: "ocr_and_text"
  follow_symlinks: false
elasticsearch:
  nodes:
  - url: "http://127.0.0.1:9200"
  bulk_size: 100
  flush_interval: "5s"
  byte_size: "10mb"
Expected behavior
Raw metadata should be populated.
Note: exiftool returns the following
[{
"SourceFile": "BOOK_COS_Release_Notes.pdf",
"ExifToolVersion": 12.08,
"FileName": "BOOK_COS_Release_Notes.pdf",
"Directory": ".",
"FileSize": "5.6 MB",
"FileModifyDate": "2020:10:27 12:18:40-04:00",
"FileAccessDate": "2020:10:27 12:19:04-04:00",
"FileInodeChangeDate": "2020:10:27 12:18:40-04:00",
"FilePermissions": "rw-r--r--",
"FileType": "PDF",
"FileTypeExtension": "pdf",
"MIMEType": "application/pdf",
"PDFVersion": 1.5,
"Linearized": "No",
"Encryption": "Standard V1.3 (40-bit)",
"UserAccess": "Extract",
"Title": "Release Notes",
"Creator": "DocBook XSL Stylesheets V1.79.1",
"Author": "Ab Initio",
"Producer": "XEP 4.30.961",
"Product": "Co>Operating System",
"Lang": "en-US",
"Internal": false,
"ProdVer": "4.0.2",
"Trapped": false,
"CreateDate": "2020:10:14 19:26:12Z",
"ModifyDate": "2020:10:14 19:26:12Z",
"PageCount": 286,
"PageMode": "UseOutlines",
"Warning": "XMP format error (no closing tag for ?)",
"TaggedPDF": "Yes",
"Language": "en"
}]
Versions:
- OS: MacOS
- Version 2.7-SNAPSHOT (fscrawler-es7-2.7-20201027.085254-130)
Attaching metadata output from Apache Tika metadata.txt
Additional information: it seems that if I set index_content: false, metadata is skipped. When it is set to true, all metadata, including the raw metadata, is output as expected. I can tell that the metadata is being skipped because when I set index_content: true, I get a few warning messages that say: "Metadata is not encrypted, but was expected to be". I don't see these warnings when I set index_content: false. The problem is that I really don't want to include the content in my index.
Ok. I took a look at the source and saw that metadata is processed only when index_content is true. I'm not really sure why this is and the documentation isn't really explicit about this behavior. I worked around it by setting indexed_chars to 0. So I guess you can close this issue if this is really the expected behavior or consider it a documentation bug.
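For anyone hitting the same thing, the workaround would look roughly like this in the job settings. This is only a sketch showing the keys that matter; the rest of the job file stays as posted above, and placing indexed_chars under fs alongside the other keys is an assumption on my part.

```yaml
name: "test_job"
fs:
  # Keep content processing on so that metadata extraction still runs
  index_content: true
  # Assumed placement under fs: extract 0 characters, so no content is indexed
  indexed_chars: 0
  # Still required to get the raw metadata into the document
  raw_metadata: true
```

With this combination the documents are still parsed, so the metadata is populated, but no extracted text should end up in the index.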
I think that the workaround you found is very good. Would you like to add a note in the documentation?
I'd probably add the note here: https://fscrawler.readthedocs.io/en/latest/admin/fs/local-fs.html#ignore-content