
Special characters (e.g. &) are not escaped by sklearn.tree.export_graphviz

Open domdfcoding opened this issue 1 year ago • 2 comments

Describe the bug

Exporting a decision tree where the feature_names or class_names contain special characters (particularly &<>) results in invalid graphviz output, as those characters have specific meanings to graphviz. Escaping to &amp;, &lt; and &gt; results in correct output. This can of course be done by the user but it's something I think scikit-learn should handle internally.

Steps/Code to Reproduce

from sklearn.datasets import load_iris
from sklearn import tree

# Fit a default decision tree on the iris dataset
iris = load_iris()
clf = tree.DecisionTreeClassifier()
clf = clf.fit(iris.data, iris.target)

# The unescaped "&" in the first class name triggers the error
target_names = ["setosa & 123", "versicolor", "virginica"]
# target_names = ["setosa &amp; 123", "versicolor", "virginica"]  # This one works

tree.export_graphviz(
    clf,
    out_file="tree.dot",
    feature_names=iris.feature_names,
    class_names=target_names,
    filled=True,
    special_characters=True,
)

Then run graphviz

dot tree.dot -Tsvg -o tree.svg 

Expected Results

Graphviz successfully converts to SVG without error.

Actual Results

Error: not well-formed (invalid token) in line 1 
... <br/>class = setosa & 123 ...
in label of node 0
Error: not well-formed (invalid token) in line 1 
... <br/>class = setosa & 123 ...
in label of node 1

Although SVG output is written to disk, it is not correct.
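For readers who want to see why the label is rejected: with special_characters=True the labels are written as Graphviz HTML-like labels, which are parsed as XML, and a bare & is not valid XML content. The sketch below reproduces that check with Python's standard XML parser rather than with Graphviz itself. The helper name is_well_formed and the dummy <label> wrapper are illustrative assumptions, not part of scikit-learn or Graphviz.

```python
import xml.etree.ElementTree as ET


def is_well_formed(label: str) -> bool:
    """Return True if `label` parses as XML content inside a dummy root element.

    Hypothetical helper for illustration; it only approximates the
    well-formedness check that an XML-based label parser performs.
    """
    try:
        ET.fromstring(f"<label>{label}</label>")
    except ET.ParseError:
        return False
    return True


# The fragment from the error message: a bare "&" is rejected
print(is_well_formed("<br/>class = setosa & 123"))      # False
# The escaped form is accepted
print(is_well_formed("<br/>class = setosa &amp; 123"))  # True
```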

Versions

System:
    python: 3.8.10 (default, Nov 22 2023, 10:22:35)  [GCC 9.4.0]
executable: /home/domdf/Python/01 GitHub Repos/13 GunShotMatch/gunshotmatch-cli/venv/bin/python3
   machine: Linux-5.15.0-92-generic-x86_64-with-glibc2.29

Python dependencies:
      sklearn: 1.3.2
          pip: 23.3.2
   setuptools: 69.0.3
        numpy: 1.24.4
        scipy: 1.10.1
       Cython: None
       pandas: 2.0.3
   matplotlib: 3.7.4
       joblib: 1.3.2
threadpoolctl: 3.2.0

Built with OpenMP: True

threadpoolctl info:
       user_api: openmp
   internal_api: openmp
    num_threads: 16
         prefix: libgomp
       filepath: /home/domdf/Python/01 GitHub Repos/13 GunShotMatch/gunshotmatch-cli/venv/lib/python3.8/site-packages/scikit_learn.libs/libgomp-a34b3233.so.1.0.0
        version: None

       user_api: blas
   internal_api: openblas
    num_threads: 16
         prefix: libopenblas
       filepath: /home/domdf/Python/01 GitHub Repos/13 GunShotMatch/gunshotmatch-cli/venv/lib/python3.8/site-packages/numpy.libs/libopenblas64_p-r0-15028c96.3.21.so
        version: 0.3.21
threading_layer: pthreads
   architecture: Zen

       user_api: blas
   internal_api: openblas
    num_threads: 16
         prefix: libopenblas
       filepath: /home/domdf/Python/01 GitHub Repos/13 GunShotMatch/gunshotmatch-cli/venv/lib/python3.8/site-packages/scipy.libs/libopenblasp-r0-41284840.3.18.so
        version: 0.3.18
threading_layer: pthreads
   architecture: Zen

domdfcoding avatar Feb 01 '24 10:02 domdfcoding

Hi @domdfcoding can I contribute towards resolving this bug? I'm a first-timer interested in contributing to this project.

jatindyerawadekar avatar Feb 02 '24 22:02 jatindyerawadekar

So this is a bug, but at the same time I think its priority is rather low:

  • using tree.plot_tree is recommended instead of tree.export_graphviz. If there are things that you cannot do or don't like with tree.plot_tree, I would say that investing time in improving tree.plot_tree may be a more useful thing to do
  • if you really need to use graphviz output, a reasonable workaround is to escape special characters in target_names.
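A minimal sketch of that workaround, assuming special_characters=True as in the report: escape the names with the standard library's html.escape before handing them to export_graphviz. The helper name escape_names is hypothetical. quote=False limits the escaping to &, < and >, the three characters mentioned in the issue.

```python
from html import escape


def escape_names(names):
    """Escape &, < and > in each name (hypothetical helper, not a scikit-learn API)."""
    return [escape(str(name), quote=False) for name in names]


target_names = ["setosa & 123", "versicolor", "virginica"]
safe_names = escape_names(target_names)
print(safe_names)  # ['setosa &amp; 123', 'versicolor', 'virginica']

# The escaped list can then be passed in place of the raw names, e.g.
#   tree.export_graphviz(clf, out_file="tree.dot", class_names=safe_names,
#                        special_characters=True)
# and the same helper can be applied to feature_names if they contain
# these characters.
```

Names that are already escaped should not be passed through the helper a second time, since &amp; would become &amp;amp;.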

I am a bit worried about trying to support complicated things in the dot output. If this is a simple replacement of a few characters (&, < and >), why not. If you need to read the dot format spec for a few days and cover all the edge cases, I don't think this is worth our time.

lesteve avatar Feb 08 '24 07:02 lesteve