Special characters (e.g. &) are not escaped by sklearn.tree.export_graphviz
Describe the bug
Exporting a decision tree where the feature_names or class_names contain special characters (particularly &, < and >) results in invalid graphviz output, as those characters have specific meanings to graphviz. Escaping them to &amp;, &lt; and &gt; results in correct output. This can of course be done by the user, but it's something I think scikit-learn should handle internally.
Steps/Code to Reproduce
from sklearn.datasets import load_iris
from sklearn import tree
iris = load_iris()
clf = tree.DecisionTreeClassifier()
clf = clf.fit(iris.data, iris.target)
target_names = ["setosa & 123", "versicolor", "virginca"]
# target_names = ["setosa &amp; 123", "versicolor", "virginca"]  # This one works
tree.export_graphviz(
    clf,
    out_file="tree.dot",
    feature_names=iris.feature_names,
    class_names=target_names,
    filled=True,
    special_characters=True,
)
Then run graphviz
dot tree.dot -Tsvg -o tree.svg
Expected Results
Graphviz successfully converts to SVG without error.
Actual Results
Error: not well-formed (invalid token) in line 1
... <br/>class = setosa & 123 ...
in label of node 0
Error: not well-formed (invalid token) in line 1
... <br/>class = setosa & 123 ...
in label of node 1
Although SVG output is written to disk it is not correct.
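For illustration, the same kind of failure can be reproduced without graphviz. This is only a sketch, on the assumption that graphviz parses HTML-like labels with an XML parser (the "not well-formed (invalid token)" wording suggests so). The helper name is_well_formed is hypothetical and not part of scikit-learn or graphviz.

```python
import xml.etree.ElementTree as ET


def is_well_formed(label_body):
    """Return True if the label body parses as XML inside a dummy root element."""
    try:
        ET.fromstring(f"<label>{label_body}</label>")
    except ET.ParseError:
        return False
    return True


# The bare ampersand, as currently written to the .dot file, is rejected.
print(is_well_formed("<br/>class = setosa & 123"))      # False
# The escaped form parses without error.
print(is_well_formed("<br/>class = setosa &amp; 123"))  # True
```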
Versions
System:
python: 3.8.10 (default, Nov 22 2023, 10:22:35) [GCC 9.4.0]
executable: /home/domdf/Python/01 GitHub Repos/13 GunShotMatch/gunshotmatch-cli/venv/bin/python3
machine: Linux-5.15.0-92-generic-x86_64-with-glibc2.29
Python dependencies:
sklearn: 1.3.2
pip: 23.3.2
setuptools: 69.0.3
numpy: 1.24.4
scipy: 1.10.1
Cython: None
pandas: 2.0.3
matplotlib: 3.7.4
joblib: 1.3.2
threadpoolctl: 3.2.0
Built with OpenMP: True
threadpoolctl info:
user_api: openmp
internal_api: openmp
num_threads: 16
prefix: libgomp
filepath: /home/domdf/Python/01 GitHub Repos/13 GunShotMatch/gunshotmatch-cli/venv/lib/python3.8/site-packages/scikit_learn.libs/libgomp-a34b3233.so.1.0.0
version: None
user_api: blas
internal_api: openblas
num_threads: 16
prefix: libopenblas
filepath: /home/domdf/Python/01 GitHub Repos/13 GunShotMatch/gunshotmatch-cli/venv/lib/python3.8/site-packages/numpy.libs/libopenblas64_p-r0-15028c96.3.21.so
version: 0.3.21
threading_layer: pthreads
architecture: Zen
user_api: blas
internal_api: openblas
num_threads: 16
prefix: libopenblas
filepath: /home/domdf/Python/01 GitHub Repos/13 GunShotMatch/gunshotmatch-cli/venv/lib/python3.8/site-packages/scipy.libs/libopenblasp-r0-41284840.3.18.so
version: 0.3.18
threading_layer: pthreads
architecture: Zen
Hi @domdfcoding can I contribute towards resolving this bug? I'm a first-timer interested in contributing to this project.
So this is a bug, but at the same time I think its priority is rather low:
- using tree.plot_tree is recommended instead of tree.export_graphviz. If there are things that you cannot do or don't like with tree.plot_tree, I would say that investing time in improving tree.plot_tree may be a more useful thing to do
- if you really need to use graphviz output, a reasonable work-around is to escape special characters in target_names.
I am a bit worried about trying to support complicated things in the dot output. If this is a simple replacement for a few characters (&, < and >), why not. If you need to read the dot format spec for a few days and cover all the edge cases, I don't think this is worth our time.
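For reference, the user-side work-around could look roughly like the sketch below. It assumes special_characters=True (so labels are written as HTML-like labels) and only handles &, < and >. The helper name escape_names is hypothetical; html.escape is from the Python standard library.

```python
from html import escape


def escape_names(names):
    """Escape &, < and > in each name; quote characters are left untouched."""
    return [escape(str(name), quote=False) for name in names]


target_names = ["setosa & 123", "versicolor", "virginca"]
safe_names = escape_names(target_names)
print(safe_names)  # ['setosa &amp; 123', 'versicolor', 'virginca']
```

The escaped list would then be passed as class_names=safe_names to tree.export_graphviz, and feature_names could be treated the same way. Names that are already escaped should not go through this helper again, since "&amp;" would become "&amp;amp;".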