pydot icon indicating copy to clipboard operation
pydot copied to clipboard

Data Race in parse_dot_data.

Open wildoranges opened this issue 1 year ago • 12 comments

in function parse_dot_data, a global variable top_graphs is used. This global variable will be overwritten repeatedly in multithreading situations. It is a data race and may produce unexpected results.

https://github.com/pydot/pydot/blob/1270c6d8e7e77e8dbdad460d22d6448658893717/src/pydot/dot_parser.py#L499-L516

The following is an example used with ThreadPoolExecutor that may cause data race if the user does not know the implementation of pydot(the call chain is graph_from_dot_file -> graph_from_dot_data -> parse_dot_data):

import pydot
from tqdm import tqdm
from concurrent.futures import ThreadPoolExecutor, as_completed

with ThreadPoolExecutor(max_workers=max_workers) as executor:
    futures = [executor.submit(pydot.graph_from_dot_file,  dot_file) for dot_file in file_list]
    for future in tqdm(as_completed(futures), total=len(futures))
        future.result()

wildoranges avatar Aug 05 '24 08:08 wildoranges

Thanks for reporting this.

This is a legacy thing, left over from the way pydot was written a long time ago.

I think the easiest way out is to make top_graphs a thread-local variable.

lkk7 avatar Aug 05 '24 20:08 lkk7

Yeah, the entire parser is implemented in the fashion of graphviz's own, which is notoriously not thread-safe and practically made of global variables.

Personally, I'd prefer to see the parser in dot_parser.py turned into a class, with each parsing run performed by a new instance of that class, all of the functions turned into class methods, and all of the globals migrated to instance variables. (graphparser itself, the actual pyparsing parser definition, is also a global, and since it maintains state I worry about its reentrancy as well.)

ferdnyc avatar Aug 05 '24 22:08 ferdnyc

(It'd also make debugging f---loads easier if we could actually access all of the components of the parser, from outside the module. Right now, everything except the global graphparser is a local variable inside the graph_definition() function, making it hard to get at, say, the node_stmt or attr_list definition in isolation. The first thing I usually do when working in the parser code is rip out the entire graph_parser function definition and pull its contents up to the module level where I can see it all.)

ferdnyc avatar Aug 05 '24 22:08 ferdnyc

dot_parser.py turned into a class

That was my first thought. But I didn't want to break APIs again.

But now I realize that dot_parser.py is more internal than I remembered. I'm sure some unlucky legacy user will have their code broken because they use an obscure function from dot_parser directly, but that's something we can accept.

lkk7 avatar Aug 06 '24 07:08 lkk7

Personally, I'd prefer to see the parser in dot_parser.py turned into a class, with each parsing run performed by a new instance of that class, all of the functions turned into class methods, and all of the globals migrated to instance variables. (graphparser itself, the actual pyparsing parser definition, is also a global, and since it maintains state I worry about its reentrancy as well.)

So, my first attempt at doing that not only broke, but it broke the OLD parser as well. (I copied dot_parser.py to a new file, class-ified the parser, and then made it conditional in pydot/__init__.py which parser would be used. I was very surprised that that alone caused it to break.)

That's OK, tho, turns out that breakage was just the code to test the logging, interacting badly with unittest-parallel because it reloads all of the modules. If I either ripped out the logging test, OR switched from unittest-parallel to regular unittest, I was back to things working OK with the OLD dot_parser.py code.

However, my new code still breaks with random ERROR results from various tests.

Turns out pyparsing is simply not thread-safe, itself (pyparsing/pyparsing#89), and if we want to make it possible to call the parser from multiple threads we'll have to put a lock on it so that only one thread at a time can be in the parser. There's no such lock in pyparsing because, as pyparsing's lead dev @ptmcg says in that issue,

if we protect the whole of pyparsing with a lock

This is something that is more easily/cleanly done in the calling code than in pyparsing itself, especially since there are so few parse-time entry points.

The same may be true here, as well. We can possibly make dot_parser safe for multiple threads, but we'd be doing it by locking the code so that only one thread at a time can be in it. It may be easier to just not try calling it from multiple threads at the same time, because that's not going to work, and it seems there's nothing pydot can do to make that work regardless.

ferdnyc avatar Aug 07 '24 01:08 ferdnyc

(The other possibility is, we could simply copy.deepcopy and reset the global top_graphs before returning it from each parse_dot_data() call, so that the results of previous parsing runs won't be impacted by any future runs. It'll slow things down slightly, but may be enough thread-safety for most users. Which is good because it's also apparently as much thread-safety as we can provide.)

ferdnyc avatar Aug 07 '24 01:08 ferdnyc

We might not even need a deepcopy, a shallow copy might be sufficient come to think of it. Never mind, we're already effectively doing a shallow copy by returning list(tokens), but clearly that's not sufficient because those tokens are still references into top_graphs. (In fact, we already do reset top_graphs on each call, so the thread-safety issue must really be in the graphparser.)

ferdnyc avatar Aug 07 '24 01:08 ferdnyc

Aaand, #TIL that if you copy.deepcopy a pydot.Dot object, it loses all of its instance variables!

Not sure why, yet, but clearly a bug. Possibly in our pickling code.

But it definitely breaks things, when pydot.Dot.create() can't find self.prog because it DOESN'T EXIST.

Edit: Yeah, changing __setstate__ to:

def __setstate__(self, state):
    self.__init__(obj_dict=state)

fixes it. Turns out, the objects created when using __getstate__ and __setstate__ don't normally get __init__'ed.

...While that fixes it, it isn't the correct fix, because it means that a copied Dot object's instance variables (prog, shape_files, and formats) will get reset to default when copied/pickled.

ferdnyc avatar Aug 07 '24 02:08 ferdnyc

Turns out top_graphs doesn't need to be global at all. #404 eliminates it, with no ill effects. (However, this WON'T overcome pyparsing's lack of thread safety and magically make dot_parser thread safe.)

ferdnyc avatar Aug 07 '24 03:08 ferdnyc

(The deepcopy digression was a dead end — you can't deepcopy frozendicts, by design. So that went nowhere. But it wasn't necessary anyway, and I'm glad I tried it because it exposed the broken copying/pickling of pydot.Dot objects.)

ferdnyc avatar Aug 07 '24 03:08 ferdnyc

Well, if pyparsing did not set a lock beacuse

This is something that is more easily/cleanly done in the calling code than in pyparsing itself, especially since there are so few parse-time entry points.

Then we can take the same position. parse_dot_data is the only dangerous place and we could just strongly annotate the fact that the function is not thread-safe.

lkk7 avatar Aug 07 '24 21:08 lkk7

Reopening, as I'm hopeful that #497 can result in a thread-safe Pydot after all.

ferdnyc avatar Sep 03 '25 20:09 ferdnyc