kernel-memory
kernel-memory copied to clipboard
[Bug] Endless pipeline loop when trying to import corrupted document
Context / Scenario
In service mode (using a queue), when trying to import a corrupted document (i.e., invalid PDF file), decoder will throw an exception, but then the message will be put again in the queue, generating an endless loop:
warn: Microsoft.KernelMemory.Pipeline.Queue.DevTools.SimpleQueues[0]
Message '20240416.100334.6972903.f0f10e85d780488d927f6c03891d20d5' processing failed with exception,
putting message back in the queue.
Message content: {"index":"default","document_id":"98d5b5a2d38e4cf49dbda5b50c4502c2202404161003346000416","execution_id":"e6f9118e170745168f1d0bc1a9cd39b5","steps":["extract","partition","gen_embeddings","save_records"]}
What happened?
If decoding fails with an exception, I expect that the document will be marked as failed to process
and not process again. I see also that in the DataPipelineStatus class, there is a Failed
property, but it is always false
:
https://github.com/microsoft/kernel-memory/blob/babedc0ef8aeceacc61d607f2fe5fd43d27aa3f8/service/Abstractions/Pipeline/DataPipeline.cs#L432-L448
Importance
I cannot use Kernel Memory
Platform, Language, Versions
Kernel Memory v0.36.240415.2
Relevant log output
dbug: Microsoft.KernelMemory.Handlers.TextExtractionHandler[0]
Extracting text, pipeline 'default/577dbc0439a74a0b8a4996be204b45e3202404161021070293739'
dbug: Microsoft.KernelMemory.Handlers.TextExtractionHandler[0]
Extracting text from file 'Iceberg.pdf' mime type 'application/pdf' using extractor 'Microsoft.KernelMemory.DataFormats.Pdf.PdfDecoder'
dbug: Microsoft.KernelMemory.DataFormats.Pdf.PdfDecoder[0]
Extracting text from PDF file
warn: Microsoft.KernelMemory.Pipeline.Queue.DevTools.SimpleQueues[0]
Message '20240416.102107.1238584.0a0332828e5f4538ba32bc09b4998acc' processing failed with exception, putting message back in the queue. Message content: {"index":"default","document_id":"577dbc0439a74a0b8a4996be204b45e3202404161021070293739","execution_id":"86d031771dc445d1bd085f110075266e","steps":["extract","partition","gen_embeddings","save_records"]}
UglyToad.PdfPig.Core.PdfDocumentFormatException: Could not find the version header comment at the start of the document.
at UglyToad.PdfPig.Parser.FileStructure.FileHeaderParser.Parse(ISeekableTokenScanner scanner, IInputBytes inputBytes, Boolean isLenientParsing, ILog log)
at UglyToad.PdfPig.Parser.PdfDocumentFactory.OpenDocument(IInputBytes inputBytes, ISeekableTokenScanner scanner, InternalParsingOptions parsingOptions)
at UglyToad.PdfPig.Parser.PdfDocumentFactory.Open(IInputBytes inputBytes, ParsingOptions options)
at UglyToad.PdfPig.Parser.PdfDocumentFactory.Open(Stream stream, ParsingOptions options)
at UglyToad.PdfPig.PdfDocument.Open(Stream stream, ParsingOptions options)
at Microsoft.KernelMemory.DataFormats.Pdf.PdfDecoder.DecodeAsync(Stream data, CancellationToken cancellationToken)
at Microsoft.KernelMemory.DataFormats.Pdf.PdfDecoder.DecodeAsync(BinaryData data, CancellationToken cancellationToken)
at Microsoft.KernelMemory.Handlers.TextExtractionHandler.ExtractTextAsync(FileDetails uploadedFile, BinaryData fileContent, CancellationToken cancellationToken)
at Microsoft.KernelMemory.Handlers.TextExtractionHandler.InvokeAsync(DataPipeline pipeline, CancellationToken cancellationToken)
at Microsoft.KernelMemory.Pipeline.DistributedPipelineOrchestrator.RunPipelineStepAsync(DataPipeline pipeline, IPipelineStepHandler handler, CancellationToken cancellationToken)
at Microsoft.KernelMemory.Pipeline.DistributedPipelineOrchestrator.<>c__DisplayClass5_0.<<AddHandlerAsync>b__0>d.MoveNext()
--- End of stack trace from previous location ---
at Microsoft.KernelMemory.Pipeline.Queue.DevTools.SimpleQueues.<>c__DisplayClass19_0.<<OnDequeue>b__0>d.MoveNext()
info: Microsoft.KernelMemory.Pipeline.Queue.DevTools.SimpleQueues[0]
Message received
dbug: Microsoft.KernelMemory.Handlers.TextExtractionHandler[0]
Extracting text, pipeline 'default/577dbc0439a74a0b8a4996be204b45e3202404161021070293739'
dbug: Microsoft.KernelMemory.Handlers.TextExtractionHandler[0]
Extracting text from file 'Iceberg.pdf' mime type 'application/pdf' using extractor 'Microsoft.KernelMemory.DataFormats.Pdf.PdfDecoder'
dbug: Microsoft.KernelMemory.DataFormats.Pdf.PdfDecoder[0]
Extracting text from PDF file
warn: Microsoft.KernelMemory.Pipeline.Queue.DevTools.SimpleQueues[0]
Message '20240416.102107.1238584.0a0332828e5f4538ba32bc09b4998acc' processing failed with exception, putting message back in the queue. Message content: {"index":"default","document_id":"577dbc0439a74a0b8a4996be204b45e3202404161021070293739","execution_id":"86d031771dc445d1bd085f110075266e","steps":["extract","partition","gen_embeddings","save_records"]}