MOTHBALLED-graphviz
[Dot] dot -Txdot gives incorrect length in T record with UTF-8 strings
Ported Issue from Mantis
Original ID: 2573
Reported By: SNoiraud
SEVERITY: MAJOR
Submitted: 2015-10-11 11:02:47
OS: UBUNTU TRUSTY
OS BUILD: 15.04
DESCRIPTION
With the following source:
digraph GRAMPS_graph
{
    _a [ shape="box" style="solid" label=<<TABLE BORDER="0"><TR><TD>ëï éà€ùǜ<BR/>Next line</TD></TR></TABLE>> URL="P_a" ];
}
xdot gives 16 characters for "ëï éà€ùǜ" instead of 8.
STEPS TO REPRODUCE
Download the bug.dot file and run: dot -Txdot -o bug.out bug.dot
In the generated output, we get:
_ldraw_="F 14 11 -Times-Roman c 7 -#000000 T 15 27.3 -1 46 16 -ëï éà€ùǜ F 14 ...
instead of :
_ldraw_="F 14 11 -Times-Roman c 7 -#000000 T 15 27.3 -1 46 8 -ëï éà€ùǜ F 14 ..
I added the following to Mantis bug 2573: For the moment, I use the following workaround in Python:
num = self.read_int()
pos = self.buf.find("-", self.pos) + 1
npos = pos + num
# workaround for graphviz < 2.39
if float(dotversion_str) < float(2.39):
    # we must find " F " if we have at least one utf8 char.
    # if not found, this means the string is the last field in the buffer.
    nb_utf8_chars = int(num - len("".join(i for i in self.buf[self.pos:npos] if ord(i)<128)))
    if nb_utf8_chars > 0:
        end_pos = self.buf.find(" F ", self.pos) + 1
        if end_pos > 0:
            npos = end_pos
# end of workaround.
self.pos = npos
res = self.buf[pos:self.pos]
It works for all my cases.
Any news on the UTF-8 characters?
Is this just an issue with -Txdot?
There are no issues that I can see when rendering your graph with -Tpng
I think so. I never tested other possibilities.
(Sorry for taking so long to respond.)
This is not a bug; the output is working as specified: the text consists of the n bytes following '-'. Graphviz takes a black-box approach to text. It has no idea how many characters are encoded, only the number of bytes in the encoding. It relies on other libraries to provide it with width and height information. Your workaround may handle the graphs you create, but it is certainly possible to have a label with a T opcode followed by an opcode other than F.
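To make the byte/character distinction concrete for the label in this report, here is a minimal Python sketch. The label is written with Unicode escapes so the result does not depend on how the file is saved; the variable names are only illustrative.

```python
# Label from the report ("ëï éà€ùǜ"), written with escapes to avoid
# any dependence on the source file's encoding.
label = "\u00eb\u00ef \u00e9\u00e0\u20ac\u00f9\u01dc"

n_chars = len(label)                  # number of Unicode code points
n_bytes = len(label.encode("utf-8"))  # number of bytes in the UTF-8 encoding

print(n_chars, n_bytes)  # prints "8 16"
```

The 16 in the T record matches the UTF-8 byte count, not the 8 visible characters.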
If you are using Python, you can read the various xdot attributes as byte arrays rather than strings. Then the number of bytes specified in the T opcode will give you what you need to pull out the necessary bytes. This should also use less memory.
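A rough sketch of that byte-based approach, assuming the _ldraw_ value is available as UTF-8 bytes and that its fields are separated by single spaces, as in the output quoted in the report. read_text_op is a hypothetical helper written for this example; it is not part of Graphviz or of any Python library.

```python
def read_text_op(buf: bytes, pos: int = 0):
    """Parse one xdot text operation 'T x y j w n -<n bytes>' at buf[pos].

    Hypothetical helper: returns the decoded text and the offset of the
    first byte after it. Assumes single-space separators and UTF-8 text.
    """
    fields = []
    # Read the opcode and the five numeric fields (x, y, j, w, n).
    for _ in range(6):
        end = buf.index(b" ", pos)
        fields.append(buf[pos:end])
        pos = end + 1
    if fields[0] != b"T":
        raise ValueError("not a T operation")
    n = int(fields[5])  # length of the text in bytes, not characters
    if buf[pos:pos + 1] != b"-":
        raise ValueError("expected '-' before the text")
    start = pos + 1
    end = start + n
    return buf[start:end].decode("utf-8"), end


# Usage: the T record from the report, followed here by a 'c' opcode
# instead of 'F' to show that no particular next opcode is needed.
sample = (
    "T 15 27.3 -1 46 16 -\u00eb\u00ef \u00e9\u00e0\u20ac\u00f9\u01dc c 7 -#000000"
).encode("utf-8")
text, next_pos = read_text_op(sample)
rest = sample[next_pos:]
```

Because slicing is done on bytes, the length in the T record can be used directly, and the text is decoded only after it has been cut out.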