Data Race in parse_dot_data.
In the function parse_dot_data, a global variable top_graphs is used. This global is overwritten on every call, so in multithreaded use it constitutes a data race and may produce unexpected results.
https://github.com/pydot/pydot/blob/1270c6d8e7e77e8dbdad460d22d6448658893717/src/pydot/dot_parser.py#L499-L516
The following example, using ThreadPoolExecutor, can trigger the data race if the user is unaware of pydot's implementation (the call chain is graph_from_dot_file -> graph_from_dot_data -> parse_dot_data):
```python
import pydot
from tqdm import tqdm
from concurrent.futures import ThreadPoolExecutor, as_completed

max_workers = 4    # example value
file_list = [...]  # paths of the .dot files to parse

with ThreadPoolExecutor(max_workers=max_workers) as executor:
    futures = [executor.submit(pydot.graph_from_dot_file, dot_file)
               for dot_file in file_list]
    for future in tqdm(as_completed(futures), total=len(futures)):
        future.result()
```
Thanks for reporting this.
This is a legacy thing, left over from the way pydot was written a long time ago.
I think the easiest way out is to make top_graphs a thread-local variable.
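A minimal sketch of that thread-local idea (the names mirror pydot's internals, but the parsing itself is stubbed out, so this is an illustration, not pydot's actual code):

```python
import threading

# Hypothetical sketch: per-thread state instead of a module-level global.
_local = threading.local()

def parse_dot_data(data):
    # Each thread gets its own top_graphs, so concurrent calls no longer
    # overwrite each other's results.
    _local.top_graphs = []
    _local.top_graphs.append(f"graph:{data}")  # stand-in for real parsing
    return list(_local.top_graphs)

results = {}

def worker(name):
    results[name] = parse_dot_data(name)

threads = [threading.Thread(target=worker, args=(f"g{i}",)) for i in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()
# Every thread's result contains only the graph it parsed itself.
```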
Yeah, the entire parser is implemented in the fashion of graphviz's own, which is notoriously not thread-safe and practically made of global variables.
Personally, I'd prefer to see the parser in dot_parser.py turned into a class, with each parsing run performed by a new instance of that class, all of the functions turned into class methods, and all of the globals migrated to instance variables. (graphparser itself, the actual pyparsing parser definition, is also a global, and since it maintains state I worry about its reentrancy as well.)
(It'd also make debugging f---loads easier if we could actually access all of the components of the parser, from outside the module. Right now, everything except the global graphparser is a local variable inside the graph_definition() function, making it hard to get at, say, the node_stmt or attr_list definition in isolation. The first thing I usually do when working in the parser code is rip out the entire graph_parser function definition and pull its contents up to the module level where I can see it all.)
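For illustration, a rough sketch of what that refactor might look like (DotParser and its methods are hypothetical names; the grammar and parse actions are stubbed):

```python
# Hypothetical sketch of the class-based design: the module globals become
# instance variables, and every parse run gets a fresh instance.
class DotParser:
    def __init__(self):
        self.top_graphs = []   # was the module-level global top_graphs
        self.grammar = None    # would hold the pyparsing grammar (graphparser)

    def parse(self, data):
        # Stand-in for running self.grammar over the input.
        self.top_graphs.append(data)
        return list(self.top_graphs)

def graph_from_dot_data(data):
    # A new parser instance per call: no mutable state shared between runs.
    return DotParser().parse(data)
```

It would also expose components like node_stmt or attr_list as instance attributes, addressing the debuggability complaint above.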
> dot_parser.py turned into a class
That was my first thought. But I didn't want to break APIs again.
But now I realize that dot_parser.py is more internal than I remembered.
I'm sure some unlucky legacy user will have their code broken because they use an obscure function from dot_parser directly, but that's something we can accept.
> Personally, I'd prefer to see the parser in dot_parser.py turned into a class, with each parsing run performed by a new instance of that class, all of the functions turned into class methods, and all of the globals migrated to instance variables. (graphparser itself, the actual pyparsing parser definition, is also a global, and since it maintains state I worry about its reentrancy as well.)
So, my first attempt at doing that not only broke, but it broke the OLD parser as well. (I copied dot_parser.py to a new file, class-ified the parser, and then made it conditional in pydot/__init__.py which parser would be used. I was very surprised that that alone caused it to break.)
That's OK, tho, turns out that breakage was just the code to test the logging, interacting badly with unittest-parallel because it reloads all of the modules. If I either ripped out the logging test, OR switched from unittest-parallel to regular unittest, I was back to things working OK with the OLD dot_parser.py code.
However, my new code still breaks with random ERROR results from various tests.
Turns out pyparsing is simply not thread-safe itself (pyparsing/pyparsing#89), and if we want to make it possible to call the parser from multiple threads we'll have to put a lock on it so that only one thread at a time can be in the parser. There's no such lock in pyparsing because, as pyparsing's lead dev @ptmcg says in that issue about protecting the whole of pyparsing with a lock:

> This is something that is more easily/cleanly done in the calling code than in pyparsing itself, especially since there are so few parse-time entry points.
The same may be true here, as well. We can possibly make dot_parser safe for multiple threads, but we'd be doing it by locking the code so that only one thread at a time can be in it. It may be easier to just not try calling it from multiple threads at the same time, because that's not going to work, and it seems there's nothing pydot can do to make that work regardless.
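If we did go the locking route, it would amount to something like this (a sketch with the real parser stubbed out, not pydot's actual code):

```python
import threading

# Hypothetical sketch: a module-level lock serializing entry to the parser,
# since the underlying pyparsing grammar is not reentrant.
_parse_lock = threading.Lock()

def _unsafe_parse(data):
    return [data]  # stand-in for the real, non-thread-safe parser

def parse_dot_data(data):
    # Only one thread at a time may be inside the parser.
    with _parse_lock:
        return _unsafe_parse(data)
```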
(The other possibility is, we could simply copy.deepcopy and reset the global top_graphs before returning it from each parse_dot_data() call, so that the results of previous parsing runs won't be impacted by any future runs. It'll slow things down slightly, but may be enough thread-safety for most users. Which is good because it's also apparently as much thread-safety as we can provide.)
We might not even need a deepcopy, a shallow copy might be sufficient come to think of it. Never mind, we're already effectively doing a shallow copy by returning list(tokens), but clearly that's not sufficient because those tokens are still references into top_graphs. (In fact, we already do reset top_graphs on each call, so the thread-safety issue must really be in the graphparser.)
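To spell out why list(tokens) doesn't help: a shallow copy makes a new outer list, but its elements are still the same objects.

```python
# Demonstration of why a shallow copy is not enough: the new list still
# holds references to the same underlying objects.
top_graphs = [["node_a"]]
shallow = list(top_graphs)   # same effect as returning list(tokens)
shallow[0].append("node_b")  # mutating a shared element...
# ...is visible through the original list as well; both see the change.
```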
Aaand, #TIL that if you copy.deepcopy a pydot.Dot object, it loses all of its instance variables!
Not sure why, yet, but clearly a bug. Possibly in our pickling code.
But it definitely breaks things, when pydot.Dot.create() can't find self.prog because it DOESN'T EXIST.
Edit: Yeah, changing __setstate__ to:
```python
def __setstate__(self, state):
    self.__init__(obj_dict=state)
```

fixes it. Turns out, the objects created when using __getstate__ and __setstate__ don't normally get __init__'ed.
...While that fixes it, it isn't the correct fix, because it means that a copied Dot object's instance variables (prog, shape_files, and formats) will get reset to default when copied/pickled.
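The underlying pickle behavior can be shown in isolation (Widget is a made-up class for demonstration, not pydot's actual code):

```python
import pickle

# Demonstration that pickle bypasses __init__ when __getstate__/__setstate__
# are defined: attributes set only in __init__ are lost on round-trip.
class Widget:
    def __init__(self):
        self.prog = "dot"     # set only in __init__

    def __getstate__(self):
        return {"data": 123}  # the pickled state omits prog

    def __setstate__(self, state):
        # __init__ is never called on unpickling; only this method runs.
        self.__dict__.update(state)

w = pickle.loads(pickle.dumps(Widget()))
# w.data exists, but w.prog does not: __init__ was skipped.
```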
Turns out top_graphs doesn't need to be global at all. #404 eliminates it, with no ill effects. (However, this WON'T overcome pyparsing's lack of thread safety and magically make dot_parser thread safe.)
(The deepcopy digression was a dead end — you can't deepcopy frozendicts, by design. So that went nowhere. But it wasn't necessary anyway, and I'm glad I tried it because it exposed the broken copying/pickling of pydot.Dot objects.)
Well, if pyparsing did not add a lock because:

> This is something that is more easily/cleanly done in the calling code than in pyparsing itself, especially since there are so few parse-time entry points.
Then we can take the same position. parse_dot_data is the only dangerous place, and we could simply document prominently that the function is not thread-safe.
Reopening, as I'm hopeful that #497 can result in a thread-safe Pydot after all.