Feature/Bug: Add a pdf filetype check on processing

Open toliver38 opened this issue 10 months ago • 0 comments

When uploading files through the new web UI, there is currently no validation to ensure that files labeled as PDF are actually PDF files. During a recent upload of multiple files, one file was labeled as a PDF but was not in PDF format. This caused the processing pipeline to hit an error, and the system now seems stuck, unable to process further files. From what I can tell, the same issue will occur via the API, as there is no file type check.

Issue Details:

  • The system attempted to process a non-PDF file, leading to an error where the file header did not match the expected PDF format.
  • Logs show an unclear error message and suggest the system is stuck, as the same series of messages for the same file keeps repeating:

2025-01-02 22:32:54.130 INFO  [140499353970496] Client:86 | Subscribing on Topic :persistent://tg/flow/document-load

2025-01-02 22:32:54.131 INFO  [140498914039488] HandlerBase:111 | [persistent://tg/flow/document-load, decoding.pdf, 0] Getting connection from pool

2025-01-02 22:32:54.131 INFO  [140498914039488] BinaryProtoLookupService:85 | Lookup response for persistent://tg/flow/document-load, lookup-broker-url pulsar://localhost:6650, from [192.168.0.2:41392 -> 192.168.0.11:6650] 

2025-01-02 22:32:54.134 INFO  [140498914039488] ConsumerImpl:300 | [persistent://tg/flow/document-load, decoding.pdf, 0] Created consumer on broker [192.168.0.2:41394 -> 192.168.0.11:6650] 

{'pulsar_host': 'pulsar://pulsar:6650', 'log_level': <LogLevel.INFO: 'info'>, 'metrics': True, 'metrics_port': 8000, 'input_queue': 'persistent://tg/flow/document-load', 'subscriber': 'decoding.pdf', 'output_queue': 'persistent://tg/flow/text-document-load'}

PDF inited

PDF message received

Decoding https://trustgraph.ai/doc/0fe61ab7-2bf4-4b3a-8ce6-d8c013b6458e...

invalid pdf header: b'{\n  "'
  • Subsequent files are not being processed, potentially due to the system halting further operations.

Proposed Solution:

  • Implement a file type check before processing begins. This would verify that files labeled as PDFs match the correct header signature for PDF files (e.g., %PDF- at the beginning of the file).
  • Provide a clear error message to the user if a file fails the type check.
  • Add error handling to ensure the system can continue processing valid files even if an invalid file is encountered.
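A minimal sketch of the header check and the clearer error message proposed above. The names `PDF_MAGIC`, `InvalidPdfError`, `looks_like_pdf`, and `require_pdf` are hypothetical and not part of TrustGraph; the check is deliberately strict and assumes the `%PDF-` signature sits at the very start of the file.

```python
# Hypothetical helper names; nothing here is an existing TrustGraph API.
PDF_MAGIC = b"%PDF-"


class InvalidPdfError(ValueError):
    """Raised when a file labeled as PDF does not start with the PDF signature."""


def looks_like_pdf(data: bytes) -> bool:
    # Assumption: the signature must be the first bytes of the file.
    return data.startswith(PDF_MAGIC)


def require_pdf(data: bytes, name: str = "<unnamed>") -> None:
    # Fail early with a message that names the file and shows what was found.
    if not looks_like_pdf(data):
        raise InvalidPdfError(
            f"{name}: not a PDF (expected header {PDF_MAGIC!r}, "
            f"got {data[:len(PDF_MAGIC)]!r})"
        )
```

Calling `require_pdf(payload, filename)` at upload time, or right before decoding, would reject the JSON payload from the log above before it reaches the PDF parser.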

Steps to Reproduce:

  1. Upload a set of files through the TrustGraph web UI, including a non-PDF file with a .pdf extension.
  2. Observe that the processing pipeline encounters an error and halts further processing.

Expected Behavior:

  • The system should validate files during the upload phase or during processing, reject invalid files, and raise an error.
  • Processing should continue for valid files without interruption.

Logs and Screenshots:

  • Logs provided above. Additional logs/screenshots can be shared upon request.

Priority: Medium

Let me know if further information is needed to address this issue.

This could be as straightforward as a try/except around the decode step here: https://github.com/trustgraph-ai/trustgraph/blob/ee9837c9ca628105170540f71871c010b167b06e/trustgraph-flow/trustgraph/decoding/pdf/pdf_decoder.py#L55-L69
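Roughly what I have in mind, as a sketch rather than a patch against the linked file. `decode_all` and its `decode` argument are hypothetical stand-ins for the real decoder; the point is only that an exception from one document gets recorded and the loop moves on to the next one.

```python
from typing import Callable, Iterable, List, Tuple


def decode_all(
    documents: Iterable[Tuple[str, bytes]],
    decode: Callable[[bytes], str],
) -> Tuple[List[Tuple[str, str]], List[Tuple[str, str]]]:
    """Decode each (doc_id, data) pair, collecting failures instead of halting."""
    decoded: List[Tuple[str, str]] = []
    failed: List[Tuple[str, str]] = []
    for doc_id, data in documents:
        try:
            decoded.append((doc_id, decode(data)))
        except Exception as exc:
            # Record which document failed and why, then keep going.
            failed.append((doc_id, f"{type(exc).__name__}: {exc}"))
    return decoded, failed
```

The `failed` list could then be logged or reported back to the user, so one bad upload does not block the valid files queued behind it.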

toliver38, Jan 04 '25 12:01