MOTHBALLED-graphviz [Dot] dot -Xdot gives incorrect length in T record with utf-8 strings

Ported Issue from Mantis Original ID: 2573 Reported By: SNoiraud

SEVERITY: MAJOR Submitted: 2015-10-11 11:02:47

OS: UBUNTU TRUSTY

OS BUILD: 15.04

DESCRIPTION

With the following source:

digraph GRAMPS_graph
{
  _a [ shape="box" style="solid" label=<<TABLE BORDER="0"><TR><TD>ëï éà€ùǜ<BR/>Next line</TD></TR></TABLE>> URL="P_a" ];
}

xdot gives 16 characters for "ëï éà€ùǜ" instead of 8.

STEPS TO REPRODUCE

Download the bug.dot file and run:

    dot -Txdot -o bug.out bug.dot

In the generated output, we get:

    _ldraw_="F 14 11 -Times-Roman c 7 -#000000 T 15 27.3 -1 46 16 -ëï éà€ùǜ F 14 ...

instead of:

    _ldraw_="F 14 11 -Times-Roman c 7 -#000000 T 15 27.3 -1 46 8 -ëï éà€ùǜ F 14 ..

Jul 04 '16 08:07 GadgetSteve

I added the following to Mantis bug 2573: for the moment, I use the following workaround in Python:

    num = self.read_int()
    pos = self.buf.find("-", self.pos) + 1
    npos = pos + num
    # workaround for graphviz < 2.39
    if float(dotversion_str) < float(2.39):
        # we must find " F " if we have at least one utf8 char.
        # if not found, this means the string is the last field in the buffer.
        nb_utf8_chars = int(num - len("".join(i for i in self.buf[self.pos:npos] if ord(i)<128)))
        if nb_utf8_chars > 0:
            end_pos = self.buf.find(" F ", self.pos) + 1
            if end_pos > 0:
                npos = end_pos
    # end of workaround.
    self.pos = npos
    res = self.buf[pos:self.pos]

It works for all my cases.

Any news on UTF-8 characters?

Aug 24 '17 08:08 SNoiraud

Is this just an issue with -Txdot?

There are no issues that I can see when rendering your graph with -Tpng

Aug 24 '17 12:08 ellson

I think so. I never tested other possibilities.

Aug 25 '17 10:08 SNoiraud

(Sorry for taking so long to respond.)

This is not a bug; the output is working as specified: the text consists of the n bytes following '-'. Graphviz takes a black-box approach to text. It has no idea how many characters are encoded, only the number of bytes in the encoding. It relies on other libraries to provide it with width and height information. Your workaround may handle the graphs you create, but it is certainly possible to have a label with a T opcode followed by an opcode other than F.
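To illustrate the difference between the two counts, here is a small Python sketch (not part of the original report). The variable names are made up for the example, and the label is written with Unicode escapes so that the precomposed characters from the report are unambiguous.

```python
# The label from the report: "ëï éà€ùǜ", written with explicit code points.
label = "\u00eb\u00ef \u00e9\u00e0\u20ac\u00f9\u01dc"

# Number of Unicode characters (code points) in the label.
n_chars = len(label)

# Number of bytes once the label is encoded as UTF-8, which is what
# the length field of the T record counts.
n_bytes = len(label.encode("utf-8"))

print(n_chars, n_bytes)  # prints: 8 16
```

Each accented letter takes two bytes in UTF-8, the euro sign takes three, and the space takes one, which adds up to the 16 reported by xdot.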

If you are using Python, you can read the various xdot attributes as bytearrays rather than strings. The number of bytes specified in the T opcode then tells you exactly how many bytes to pull out. This should also use less memory.
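A minimal sketch of that byte-oriented approach might look like the following. It is only an illustration: `read_t_record` is a hypothetical helper, not part of Graphviz or any xdot library, and it assumes the fields of the record (`T x y j w n -text`) are separated by single spaces, as in the output quoted in this report.

```python
def read_t_record(buf: bytes, pos: int = 0):
    """Parse one xdot text record 'T x y j w n -text' starting at buf[pos].

    Hypothetical helper for illustration. Returns the parsed fields and the
    offset of the first byte after the text.
    """
    tokens = []
    # Read the opcode letter and the five numeric fields: T, x, y, j, w, n.
    # Assumption: fields are separated by exactly one space.
    for _ in range(6):
        end = buf.index(b" ", pos)
        tokens.append(buf[pos:end].decode("ascii"))
        pos = end + 1

    if tokens[0] != "T":
        raise ValueError("not a T record")
    x, y = float(tokens[1]), float(tokens[2])
    j = int(tokens[3])   # justification; may be negative, e.g. -1
    w = float(tokens[4])
    n = int(tokens[5])   # length of the text in BYTES, not characters

    if buf[pos:pos + 1] != b"-":
        raise ValueError("missing '-' before the text")
    start = pos + 1
    stop = start + n
    # Slice exactly n bytes, then decode them to get the label.
    text = buf[start:stop].decode("utf-8")
    return x, y, j, w, text, stop


# Usage with the record from this report (label given as escapes).
label = "\u00eb\u00ef \u00e9\u00e0\u20ac\u00f9\u01dc"
sample = ("T 15 27.3 -1 46 16 -" + label + " F 14 11 -Times-Roman").encode("utf-8")
x, y, j, w, text, next_pos = read_t_record(sample)
rest = sample[next_pos:]
print(text, rest)
```

Because the slice is taken on bytes, the parser does not need to know which opcode follows the text, so it avoids searching for " F " as the workaround above does.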

Aug 30 '17 15:08 emden
