PdfPig icon indicating copy to clipboard operation
PdfPig copied to clipboard

Bug of Multithreading

Open AlexanderLeung9 opened this issue 4 years ago • 23 comments

The PdfDocument can't GetPage() in multithreading environment, no matter use a PdfDocument through all threads or use a PdfDocument per thread.

AlexanderLeung9 avatar Apr 21 '20 06:04 AlexanderLeung9

Hi, sorry to hear you're encountering an issue. As far as I'm aware opening of PdfDocuments across threads should work ok, but access to the contents of a document is not thread safe. So for example the following code is fine:

Parallel.Foreach(filenames, f => {
  using (var document = PdfDocument.Open(f))
  {
      foreach (var page in document.GetPages())
     {
           // do something with the pages.
     }
  }
})

Whereas doing this will not work

using (var document = PdfDocument.Open(f))
{
   Parallel.Foreach(document.GetPages(), p => {
     // do something with the page
  })
}

Since the document uses a single reader for the stream/bytes. If you'd be able to provide an example of the code which is displaying the problem I'll check if it's working/failing as intended.

EliotJones avatar Apr 21 '20 12:04 EliotJones

private static List<UPC.Page> ConvertPdfToPages(string pdfFilePath, Tuple<ushort, ushort> pdfPageRange) { const int cMaxPageCountPerTask = 5;

        int pageCount = pdfPageRange.Item2 - pdfPageRange.Item1 + 1;
        int taskNum = pageCount / cMaxPageCountPerTask;
        if (pageCount % cMaxPageCountPerTask != 0)
            ++taskNum;

        Task<UPC.Page[]>[] tasks = new Task<UPC.Page[]>[taskNum];

        for (int i = 0; i != taskNum; ++i)
        {
            tasks[i] = new Task<UPC.Page[]>((index) =>
            {
                int j = (int)index;

                int fromPage = pdfPageRange.Item1 + cMaxPageCountPerTask * j;
                int toPage = pdfPageRange.Item1 + cMaxPageCountPerTask * (j + 1) - 1;
                if (toPage > pdfPageRange.Item2)
                    toPage = pdfPageRange.Item2;

                UPC.Page[] pages = new UPC.Page[toPage - fromPage + 1];

                using (PdfDocument document = PdfDocument.Open(pdfFilePath))
                {
                    for (int k = 0; k != pages.Length; ++k)
                    {
                        pages[k] = document.GetPage(fromPage + k);
                    }
                }

                return pages;
            }, i);
        }

        Array.ForEach<Task>(tasks, x => x.Start());
        Task.WaitAll(tasks);

        List<UPC.Page> results = new List<UPC.Page>();
        foreach (var item in tasks)
        {
            results.AddRange(item.Result);
        }

        return results;
    }

AlexanderLeung9 avatar Apr 21 '20 12:04 AlexanderLeung9

Besides, the function can work when use Sautinsoft.PdfFocus.

AlexanderLeung9 avatar Apr 21 '20 12:04 AlexanderLeung9

Hi, sorry to hear you're encountering an issue. As far as I'm aware opening of PdfDocuments across threads should work ok, but access to the contents of a document is not thread safe. So for example the following code is fine:

Parallel.Foreach(filenames, f => {
  using (var document = PdfDocument.Open(f))
  {
      foreach (var page in document.GetPages())
     {
           // do something with the pages.
     }
  }
})

Whereas doing this will not work

using (var document = PdfDocument.Open(f))
{
   Parallel.Foreach(document.GetPages(), p => {
     // do something with the page
  })
}

Since the document uses a single reader for the stream/bytes. If you'd be able to provide an example of the code which is displaying the problem I'll check if it's working/failing as intended.

Hi, could you fix the problem since I have posted an example of code?

AlexanderLeung9 avatar Apr 24 '20 06:04 AlexanderLeung9

Hi,

Running your code I didn't seem to get any error. Are you getting an exception and if so would you be able to share the exception message please?

I suspect what might be happening is you're access this UPC.Page methods after the PdfDocument has been disposed. Since pages rely on accessing the document for things like Hyperlinks, images, annotations, etc they need to be evaluated prior to disposing of the document. For your use case this would be something like:

// Store the things you'll need for your code
private class ThingsINeed
{
    public UPC.PageSize Size { get; set; }

    public string Text { get; set; }

    public IReadOnlyList<UPC.IPdfImage> Images { get; set; }
}

// Only call methods on `Page` before the document is disposed and store the results
            Task<ThingsINeed[]>[] tasks = new Task<ThingsINeed[]>[taskNum];

            for (int i = 0; i != taskNum; ++i)
            {
                tasks[i] = new Task<ThingsINeed[]>((index) =>
                {
                    int j = (int)index;

                    int fromPage = pdfPageRange.Item1 + cMaxPageCountPerTask * j;
                    int toPage = pdfPageRange.Item1 + cMaxPageCountPerTask * (j + 1) - 1;
                    if (toPage > pdfPageRange.Item2)
                        toPage = pdfPageRange.Item2;

                    ThingsINeed[] pages = new ThingsINeed[toPage - fromPage + 1];

                    using (PdfDocument document = PdfDocument.Open(pdfFilePath))
                    {
                        for (int k = 0; k != pages.Length; ++k)
                        {
                            var page = document.GetPage(fromPage + k);
                            pages[k] = new ThingsINeed
                            {
                                Size = page.Size,
                                Text = page.Text,
                                Images = page.GetImages().ToList()
                            };
                        }
                    }

                    return pages;
                }, i);
            }

EliotJones avatar Apr 25 '20 08:04 EliotJones

Hi,

Thank you for checking.

However, the code crashes before I call any methods of attributes of UPC.Page. The crashed code line is as follows: pages[k] = document.GetPage(fromPage + k);

($exception).Message: 源数组长度不足。请检查 srcIndex 和长度以及数组的下限。(The length of source array is insufficient, please check srcIndex, length, and the lower bound of array.)

($exception).Data: System.Collections.ListDictionaryInternal

($exception).StackTrace:
   在 System.Array.Copy(Array sourceArray, Int32 sourceIndex, Array destinationArray, Int32 destinationIndex, Int32 length, Boolean reliable)
   在 System.Collections.Generic.List`1.RemoveAt(Int32 index)
   在 UglyToad.PdfPig.Fonts.CompactFontFormat.Dictionaries.CompactFontFormatTopLevelDictionaryReader.ApplyOperation(CompactFontFormatTopLevelDictionary dictionary, List`1 operands, OperandKey key, IReadOnlyList`1 stringIndex)
   在 UglyToad.PdfPig.Fonts.CompactFontFormat.Dictionaries.CompactFontFormatDictionaryReader`2.ReadDictionary(TBuilder builder, CompactFontFormatData data, IReadOnlyList`1 stringIndex)
   在 UglyToad.PdfPig.Fonts.CompactFontFormat.Dictionaries.CompactFontFormatTopLevelDictionaryReader.Read(CompactFontFormatData data, IReadOnlyList`1 stringIndex)
   在 UglyToad.PdfPig.Fonts.CompactFontFormat.CompactFontFormatIndividualFontParser.Parse(CompactFontFormatData data, String name, IReadOnlyList`1 topDictionaryIndex, IReadOnlyList`1 stringIndex, CompactFontFormatIndex globalSubroutineIndex)
   在 UglyToad.PdfPig.Fonts.CompactFontFormat.CompactFontFormatParser.Parse(CompactFontFormatData data)
   在 UglyToad.PdfPig.PdfFonts.Parser.Parts.CidFontFactory.ReadDescriptorFile(FontDescriptor descriptor)
   在 UglyToad.PdfPig.PdfFonts.Parser.Parts.CidFontFactory.Generate(DictionaryToken dictionary, Boolean isLenientParsing)
   在 UglyToad.PdfPig.PdfFonts.Parser.Handlers.Type0FontHandler.ParseDescendant(DictionaryToken dictionary, Boolean isLenientParsing)
   在 UglyToad.PdfPig.PdfFonts.Parser.Handlers.Type0FontHandler.Generate(DictionaryToken dictionary, Boolean isLenientParsing)
   在 UglyToad.PdfPig.PdfFonts.FontFactory.Get(DictionaryToken dictionary, Boolean isLenientParsing)
   在 UglyToad.PdfPig.Content.ResourceStore.LoadFontDictionary(DictionaryToken fontDictionary, Boolean isLenientParsing)
   在 UglyToad.PdfPig.Content.ResourceStore.LoadResourceDictionary(DictionaryToken resourceDictionary, Boolean isLenientParsing)
   在 UglyToad.PdfPig.Parser.PageFactory.Create(Int32 number, DictionaryToken dictionary, PageTreeMembers pageTreeMembers, Boolean isLenientParsing)
   在 UglyToad.PdfPig.Content.Pages.GetPage(Int32 pageNumber)
   在 UglyToad.PdfPig.PdfDocument.GetPage(Int32 pageNumber)
   在 PDFParser.StreamModeParser.Parser.<>c__DisplayClass0_0.<ConvertPdfToPages>b__1(Object index) 位置 F:\PDFParser\StreamModeParser.cs:行号 60
   在 System.Threading.Tasks.Task`1.InnerInvoke()
   在 System.Threading.Tasks.Task.Execute()

FYI Thanks and Regards.

AlexanderLeung9 avatar Apr 25 '20 09:04 AlexanderLeung9

By the way, the version of PdfPig I am referencing is 0.1.0. @EliotJones

AlexanderLeung9 avatar Apr 25 '20 09:04 AlexanderLeung9

Ah, ok, based on the error message this looks like a bug in how PdfPig handles Compact Font Format (CFF) fonts. This will be specific to this file.

Are you able to share the file at all so I can try and work out where it's going wrong please?

Would you also be able to check against version 0.1.1 of PdfPig please, there are some bug fixes between the versions that might fix this.

EliotJones avatar Apr 25 '20 09:04 EliotJones

OK, it was fine after I update the PdfPig to version 0.1.1. Thanks a lot! If you want the sample PDF file, you can send me your email address. I'm sorry it is inconvenient to publish the sample file.

AlexanderLeung9 avatar Apr 25 '20 11:04 AlexanderLeung9

Thanks, glad to hear it worked ok, don't worry about the sample file since it works now.

EliotJones avatar Apr 25 '20 14:04 EliotJones

Hi, the code went wrong again after I upgraded.

    public static List<Page> ConvertPdfToPages(string pdfFile, Tuple<ushort, ushort> pdfPageRange)
    {
        const int cMaxPageCountPerTask = 5;

        int pageCount = pdfPageRange.Item2 - pdfPageRange.Item1 + 1;
        int taskNum = pageCount / cMaxPageCountPerTask;
        if (pageCount % cMaxPageCountPerTask != 0)
            ++taskNum;

        Task<Page[]>[] tasks = new Task<Page[]>[taskNum];

        for (int i = 0; i != taskNum; ++i)
        {
            tasks[i] = new Task<Page[]>((index) =>
            {
                int j = (int)index;

                int fromPage = pdfPageRange.Item1 + cMaxPageCountPerTask * j;
                int toPage = pdfPageRange.Item1 + cMaxPageCountPerTask * (j + 1) - 1;
                if (toPage > pdfPageRange.Item2)
                    toPage = pdfPageRange.Item2;

                Page[] pages = new Page[toPage - fromPage + 1];

                using (PdfDocument document = PdfDocument.Open(pdfFile))
                {
                    for (int k = 0; k != pages.Length; ++k)
                    {
                        pages[k] = document.GetPage(fromPage + k);
                    }
                }

                return pages;
            }, i);
        }

        Array.ForEach<Task>(tasks, x => x.Start());
        Task.WaitAll(tasks);

        List<Page> results = new List<Page>();
        foreach (var item in tasks)
        {
            results.AddRange(item.Result);
        }

        return results;
    }

My colleague reported that code crashes by accident on Task.WaitAll(tasks), and after I changed the code into single thread version, the problem disappeared. So I think the problem should raise in below:

                using (PdfDocument document = PdfDocument.Open(pdfFile))
                {
                    for (int k = 0; k != pages.Length; ++k)
                    {
                        pages[k] = document.GetPage(fromPage + k);
                    }
                }

Is there any static or global thing that document.GetPage() will access but not thread-safe? Could you have a inspection on it? Thank you!

AlexanderLeung9 avatar May 19 '20 01:05 AlexanderLeung9

Hi, sorry to hear this has reoccurred, when you say it started happening again after you upgraded version, which version are you now using? I think there might have been some new thread-unsafe changes to do with performance but I don't recall which version they were new in.

EliotJones avatar May 23 '20 15:05 EliotJones

It's the newest version, 0.1.1.0.

AlexanderLeung9 avatar May 23 '20 22:05 AlexanderLeung9

Sorry I haven't been able to work on this much due to life. Are you able to share the full exception details you or your colleague is getting please? I'm assuming it's due to some edge case in the documents you're working with hitting a rare code path that isn't thread safe but it's easier to start from the place the error is occurring and work backwards.

EliotJones avatar Jun 01 '20 08:06 EliotJones

I'm sorry I can't raise the exception again, because the code has already been in production environment with a single thread version. They won't easily tough the code, so I have no chance to test it. And the problem only rises in their testing environment, but is safe in my developing environment.

AlexanderLeung9 avatar Jun 01 '20 11:06 AlexanderLeung9

I'm keeping this open in case anyone else has the same issue because it's not yet been fixed. But given the nature of threading bugs it's not going to be possible to fix without an example to reproduce it. Sorry I can't help further, at the moment.

EliotJones avatar Jul 26 '20 13:07 EliotJones

FYI am experiencing some issues as well, process works 100% if I put a lock around opening the documents (shown below). If I don't have the lock and have a high level on concurrency (say 16 threads) it will crash essentially every time with the shown stacktrace (or similar). Haven't had a chance to dig in yet.

lock (locker)
{
    document = PdfDocument.Open(contents);
}
UglyToad.PdfPig.Core.PdfDocumentFormatException: Could not find dictionary associated with reference in pages kids array: 662 0.
   at UglyToad.PdfPig.Parser.CatalogFactory.ProcessPagesNode(IndirectReference reference, DictionaryToken nodeDictionary, IndirectReference parentReference, Boolean isRoot, IPdfTokenScanner pdfTokenScanner, Boolean isLenientParsing, Int32& pageNumber) in C:\source\Github\PdfPig\src\UglyToad.PdfPig\Parser\CatalogFactory.cs:line 133
   at UglyToad.PdfPig.Parser.CatalogFactory.Create(IndirectReference rootReference, DictionaryToken dictionary, IPdfTokenScanner scanner, Boolean isLenientParsing) in C:\source\Github\PdfPig\src\UglyToad.PdfPig\Parser\CatalogFactory.cs:line 51
   at UglyToad.PdfPig.Parser.PdfDocumentFactory.OpenDocument(IInputBytes inputBytes, ISeekableTokenScanner scanner, ILog log, Boolean isLenientParsing, IReadOnlyList`1 passwords, Boolean clipPaths) in C:\source\Github\PdfPig\src\UglyToad.PdfPig\Parser\PdfDocumentFactory.cs:line 142
   at UglyToad.PdfPig.Parser.PdfDocumentFactory.Open(IInputBytes inputBytes, ParsingOptions options) in C:\source\Github\PdfPig\src\UglyToad.PdfPig\Parser\PdfDocumentFactory.cs:line 78

plaisted avatar Jan 29 '21 23:01 plaisted

@plaisted Fine, got it! Thanks a lot!

AlexanderLeung9 avatar Feb 19 '21 03:02 AlexanderLeung9

I'd have a very easy repro @EliotJones if you're still interested and using v0.1.4:

using (var document = PdfDocument.Open(@"529-Article Text-2712-1-10-20130506.pdf"))
{
    Parallel.For(0, document.NumberOfPages, (pageIndex, ls) =>
    {
        var pageNumber = pageIndex + 1;
        var page = document.GetPage(pageNumber);
    });
}

That file came up as a sample .pdf googling around real quick for one that's large enough / has enough pages and can be download at this URL.

and the exception is:

UglyToad.PdfPig.Core.PdfDocumentFormatException
  HResult=0x80131500
  Message=Expected name as dictionary key, instead got: 1
  Source=UglyToad.PdfPig.Tokenization
  StackTrace:
   at UglyToad.PdfPig.Tokenization.DictionaryTokenizer.ConvertToDictionary(List`1 tokens)
   at UglyToad.PdfPig.Tokenization.DictionaryTokenizer.TryTokenize(Byte currentByte, IInputBytes inputBytes, IToken& token)
   at UglyToad.PdfPig.Tokenization.Scanner.CoreTokenScanner.MoveNext()
   at UglyToad.PdfPig.Tokenization.Scanner.PdfTokenScanner.MoveNext()
   at UglyToad.PdfPig.Tokenization.Scanner.PdfTokenScanner.Get(IndirectReference reference)
   at UglyToad.PdfPig.Parser.Parts.DirectObjectFinder.Get[T](IndirectReference reference, IPdfTokenScanner scanner)
   at UglyToad.PdfPig.Parser.Parts.DirectObjectFinder.Get[T](IToken token, IPdfTokenScanner scanner)
   at UglyToad.PdfPig.Content.ResourceStore.LoadFontDictionary(DictionaryToken fontDictionary)
   at UglyToad.PdfPig.Content.ResourceStore.LoadResourceDictionary(DictionaryToken resourceDictionary)
   at UglyToad.PdfPig.Parser.PageFactory.Create(Int32 number, DictionaryToken dictionary, PageTreeMembers pageTreeMembers, Boolean clipPaths)
   at UglyToad.PdfPig.Content.Pages.GetPage(Int32 pageNumber, Boolean clipPaths)
   at UglyToad.PdfPig.PdfDocument.GetPage(Int32 pageNumber)
   at PdfNetCoreTextParse.Program.<>c__DisplayClass4_0.<Main>b__0(Int32 pageIndex, ParallelLoopState ls) in D:\git\Playground\PdfParser\PdfParser\Program.cs:line 47
   at System.Threading.Tasks.Parallel.<>c__DisplayClass19_0`1.<ForWorker>b__1(RangeWorker& currentWorker, Int32 timeout, Boolean& replicationDelegateYieldedBeforeCompletion)

  This exception was originally thrown at this call stack:
    UglyToad.PdfPig.Tokenization.DictionaryTokenizer.ConvertToDictionary(System.Collections.Generic.List<UglyToad.PdfPig.Tokens.IToken>)
    UglyToad.PdfPig.Tokenization.DictionaryTokenizer.TryTokenize(byte, UglyToad.PdfPig.Core.IInputBytes, out UglyToad.PdfPig.Tokens.IToken)
    UglyToad.PdfPig.Tokenization.Scanner.CoreTokenScanner.MoveNext()
    UglyToad.PdfPig.Tokenization.Scanner.PdfTokenScanner.MoveNext()
    UglyToad.PdfPig.Tokenization.Scanner.PdfTokenScanner.Get(UglyToad.PdfPig.Core.IndirectReference)
    UglyToad.PdfPig.Parser.Parts.DirectObjectFinder.Get<T>(UglyToad.PdfPig.Core.IndirectReference, UglyToad.PdfPig.Tokenization.Scanner.IPdfTokenScanner)
    UglyToad.PdfPig.Parser.Parts.DirectObjectFinder.Get<T>(UglyToad.PdfPig.Tokens.IToken, UglyToad.PdfPig.Tokenization.Scanner.IPdfTokenScanner)
    UglyToad.PdfPig.Content.ResourceStore.LoadFontDictionary(UglyToad.PdfPig.Tokens.DictionaryToken)
    UglyToad.PdfPig.Content.ResourceStore.LoadResourceDictionary(UglyToad.PdfPig.Tokens.DictionaryToken)
    UglyToad.PdfPig.Parser.PageFactory.Create(int, UglyToad.PdfPig.Tokens.DictionaryToken, UglyToad.PdfPig.Content.PageTreeMembers, bool)
    ...
    [Call Stack Truncated]

One can work around by using a locker object around the document.GetPage(...) call, i.e. like this:

private static readonly object _locker = new object();

// ...

Page page = null;
lock (_locker)
{
    page = document.GetPage(pageNumber);
}

or something like a Semaphore and the usual concurrency controlling mechanisms, but that's a bit non-obvious given the exception. Ideally it would be handled inside the .GetPage() method / the underlying infrastructure and allow real concurrent reading OR throw a somewhat more self-explanatory exception.

joergbattermann avatar Aug 13 '21 20:08 joergbattermann

Hi @joergbattermann thanks for the additional information. This crash is actually expected in this case since access inside a single document is not thread-safe, since we're always working with a single access to the underlying document's content stream. Access to the data in a single document should only occur on a single thread but it should be possible to open and work with separate documents on separate threads. So for example:

var files = Directory.GetFiles('*.pdf');
Parallel.ForEach(files, file => {
  // Work with the file on this thread.
});

Should be supported, whereas parallelising access to the pages inside a single document is not expected to be supported.

EliotJones avatar Aug 14 '21 17:08 EliotJones

One additional Q here: once the pages are retrieved in an interlocked way, so always exactly one thread accessing the document to call .GetPage(number) of it, is it safe to work on / with those pages concurrently.. more precisely, to retrieve the texts on them using the various word extractors and page segmentors?

joergbattermann avatar Aug 14 '21 17:08 joergbattermann

@joergbattermann currently for letters, GraphicsStateOperations and Paths and methods GetText and GetWords yes, however for page.GetImages() and page.GetHyperlinks() it is not since that evaluates the images from the content stream on invocation. In order to improve text performance in future I may move to a lazily evaluated model for everything on Page but for now those 3 properties and 2 methods and all properties of Letter itself should be safe.

Edit: In general I'd encourage a copy-what-you-use model for thread safe access, e.g. doing var mySafeLetters = page.Letters.ToList() or similar just so future internal changes to thread safety don't break your code in mysterious and difficult to diagnose ways.

EliotJones avatar Aug 14 '21 18:08 EliotJones

Thanks for the clarification Eliot, much appreciated!

joergbattermann avatar Aug 14 '21 18:08 joergbattermann