mkdocs_puml

Encoder Doesn't Match PlantUML Output

Open OnceUponALoop opened this issue 1 year ago • 2 comments

Describe the bug: encoder.py can't replicate PlantUML's encoding.

To Reproduce: Run pytest and see the result below.

Expected behavior: Passing test

Screenshots

=================================================== FAILURES ====================================================
__________________________________________________ test_encode __________________________________________________

diagram_and_encoded = ('@startuml\nBob -> Alice : hello\n@enduml', 'SoWkIImgAStDuNBAJrBGjLDmpCbCJbMmKiX8pSd9vt98pKi1IW80')

    def test_encode(diagram_and_encoded: tuple[str, str]):
        # Ensures the encoded output matches the expected result.
        diagram, expected = diagram_and_encoded
        diagram = diagram.strip()
        print(f"Diagram: {diagram}")
        encoded = encode(f"{diagram}")
    
>       assert encoded == expected
E       AssertionError: assert 'SoWkIImgAStDuNBAJrBGjLDmpCbCJbMmKiX8pSd9vt98pKifpSq100==' == 'SoWkIImgAStDuNBAJrBGjLDmpCbCJbMmKiX8pSd9vt98pKi1IW80'
E         
E         - SoWkIImgAStDuNBAJrBGjLDmpCbCJbMmKiX8pSd9vt98pKi1IW80
E         ?                                                 ---
E         + SoWkIImgAStDuNBAJrBGjLDmpCbCJbMmKiX8pSd9vt98pKifpSq100==
E         ?                                                ++++  +++

tests/test_encoder.py:11: AssertionError
--------------------------------------------- Captured stdout call ----------------------------------------------
Diagram: @startuml
Bob -> Alice : hello
@enduml


============================================ short test summary info ============================================
FAILED tests/test_encoder.py::test_encode - AssertionError: assert 'SoWkIImgAStDuNBAJrBGjLDmpCbCJbMmKiX8pSd9vt98pKifpSq100==' == 'SoWkIImgAStDuNBAJrBGjLDmpCbCJbMmKiX8pSd9vt98pKi1IW80'

Desktop (please complete the following information):

  • OS: Fedora 49
  • Dependency versions: mkdocs == 1.6


OnceUponALoop commented on Sep 07 '24 04:09

@MikhailKravets I had a whole writeup here that I somehow didn't submit, so here's an attempt at recreating it a few days later.

I looked into this and it seems there's no real consensus on this custom encoding. Here's what I have.

Conclusion

Considering this is a mile long, I'll start with the conclusion.

We can strip out the @startuml/@enduml markers and encode only the content. That should match the PlantUML jar's output and reduce the payload size (at the cost of regex parsing).

The very last script in the writeup (encoder_v2.py) implements this, but it's not a good implementation. It was a 1am bash-head-against-keyboard kind of implementation and can probably be cleaned up significantly.

Test Data

I'm testing with the following, defined in a temporary temp.puml file.

@startuml
Alice -> Bob: test
Bob -> Alice: test
@enduml

PlantUML Results

Java

I ran plantuml with the -encodeurl option

java -jar ~/downloads/plantuml-1.2024.6.jar -encodeurl temp.puml

Syp9J4vLqBLJSCfFib8eIIqkuGAoG0AE81c84000

Web

Here's where things get interesting. When I paste it into the PlantUML server I get the following:

SoWkIImgAStDuNBCoKnELT2rKt3AJx9IA4ajBk42ia02O1cea4DgNWfGDG00

Now if I add a trailing newline, like in my temp.puml, I get a different result:

SoWkIImgAStDuNBCoKnELT2rKt3AJx9IA4ajBk42ia02O1cea4DgNWf8DG00

Finally, if I run it with just the drawing, without the markers, I get this:

Syp9J4vLqBLJSCfFib8eIIqkuGAoG09W6OWG0000

None of these match what the jar file gave us.
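One way to check whether these strings differ in content or only in how the bytes were compressed is to decode them back to text. The sketch below is an assumption on my part (the `decode` helper is not part of mkdocs_puml or PlantUML); it reverses the alphabet mapping and inflates the result as raw DEFLATE.

```python
import base64
import zlib

# Standard base64 alphabet and PlantUML's reordered 64-character alphabet.
_STD = b"ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/"
_PUML = b"0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz-_"
_FROM_PUML = bytes.maketrans(_PUML, _STD)


def decode(encoded: str) -> str:
    """Turn a PlantUML-encoded string back into diagram text."""
    data = encoded.encode("ascii")
    # '0' is six zero bits in PlantUML's alphabet, so it is safe filler
    # if the input length is not already a multiple of 4.
    data += b"0" * (-len(data) % 4)
    raw = base64.b64decode(data.translate(_FROM_PUML))
    # decompressobj stops at the end of the DEFLATE stream and ignores
    # the zero padding bytes that follow it.
    inflater = zlib.decompressobj(wbits=-15)
    return inflater.decompress(raw).decode("utf-8")
```

Calling `decode()` on each of the strings above and comparing the returned text would show whether the jar and the server encoded the same input. If the text is identical but the strings are not, the difference comes from the compression step rather than from the alphabet mapping.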

Encoding Samples

I went through the encoding samples that plantuml has on their site and tried to run this sample against them.

Perl

encode_pl.pl

use strict;
use warnings;
use Encode qw(encode);
use Compress::Zlib qw(deflateInit);
use MIME::Base64;

sub utf8_encode {
    return encode('UTF-8', $_[0]);
}

sub _compress_with_deflate {
    my ($data) = @_;
    my $deflate = deflateInit(-Level => 9, -WindowBits => -15);  # -15 for raw deflate
    my $compressed = $deflate->deflate($data);
    $compressed .= $deflate->flush();
    return $compressed;
}

sub encode6bit {
    my $b = $_[0];
    if ($b < 10) {
        return chr(48 + $b);
    }
    $b -= 10;
    if ($b < 26) {
        return chr(65 + $b);
    }
    $b -= 26;
    if ($b < 26) {
        return chr(97 + $b);
    }
    $b -= 26;
    if ($b == 0) {
        return '-';
    }
    if ($b == 1) {
        return '_';
    }
    return '?';
}

sub append3bytes {
    my ($b1, $b2, $b3) = @_;
    my $c1 = $b1 >> 2;
    my $c2 = (($b1 & 0x3) << 4) | ($b2 >> 4);
    my $c3 = (($b2 & 0xF) << 2) | ($b3 >> 6);
    my $c4 = $b3 & 0x3F;
    my $r = "";
    $r .= encode6bit($c1 & 0x3F);
    $r .= encode6bit($c2 & 0x3F);
    $r .= encode6bit($c3 & 0x3F);
    $r .= encode6bit($c4 & 0x3F);
    return $r;
}

sub encode64 {
    my $c = $_[0];
    my $str = "";
    my $len = length $c;
    for (my $i = 0; $i < $len; $i += 3) {
        if ($i + 2 == $len) {
            $str .= append3bytes(ord(substr($c, $i, 1)), ord(substr($c, $i + 1, 1)), 0);
        }
        elsif ($i + 1 == $len) {
            $str .= append3bytes(ord(substr($c, $i, 1)), 0, 0);
        }
        else {
            $str .= append3bytes(
                ord(substr($c, $i, 1)),
                ord(substr($c, $i + 1, 1)),
                ord(substr($c, $i + 2, 1))
            );
        }
    }
    return $str;
}

sub encode_p {
    my $data = utf8_encode($_[0]);
    print "UTF-8 encoded: ", unpack("H*", $data), "\n";
    my $compressed = _compress_with_deflate($data);
    print "Compressed: ", unpack("H*", $compressed), "\n";
    return encode64($compressed);
}

sub run_test {
    my ($test_content, $expected) = @_;
    my $encoded = encode_p($test_content);
    print "Test content: $test_content\n";
    print "Encoded : $encoded\n";
    print "Expected: $expected\n";
    print "Matches: ", ($encoded eq $expected ? "True" : "False"), "\n\n";
}

# Test case 1: With @startuml and @enduml
my $test_content1 = "\@startuml\nAlice -> Bob: test\nBob -> Alice: test\n\@enduml";
run_test($test_content1, "SoWkIImgAStDuNBCoKnELT2rKt3AJx9IA4ajBk42ia02O1cea4DgNWfGDG00");

# Test case 2: Without @startuml and @enduml
my $test_content2 = "Alice -> Bob: test\nBob -> Alice: test";
run_test($test_content2, "Syp9J4vLqBLJSCfFib8eIIqkuGAoG0AE81c84000");

The results are

perl encode_pl.pl                                               
UTF-8 encoded: 407374617274756d6c0a416c696365202d3e20426f623a20746573740a426f62202d3e20416c6963653a20746573740a40656e64756d6c
Compressed: 73282e492c2a29cdcde172ccc94c4e55d0b55370ca4fb25228492d2ee102b240026019a890436a5e4a696e0e00
Test content: @startuml
Alice -> Bob: test
Bob -> Alice: test
@enduml
Encoded : SoWkIImgAStDuNBCoKnELT2rKt3AJx9IA4ajBk42ia02O1cea4DgNaffRWu0
Expected: SoWkIImgAStDuNBCoKnELT2rKt3AJx9IA4ajBk42ia02O1cea4DgNWfGDG00
Matches: False

UTF-8 encoded: 416c696365202d3e20426f623a20746573740a426f62202d3e20416c6963653a2074657374
Compressed: 73ccc94c4e55d0b55370ca4fb25228492d2ee102b240028e2019881000
Test content: Alice -> Bob: test
Bob -> Alice: test
Encoded : Syp9J4vLqBLJSCfFib8eIIqkuGAoG0AE81c84000
Expected: Syp9J4vLqBLJSCfFib8eIIqkuGAoG0AE81c84000
Matches: True

PHP

encode_php.php

<?php
function encodep($text) {
    // Note: utf8_encode() is deprecated as of PHP 8.2; the test input here is plain ASCII.
    $data = utf8_encode($text);
    echo "UTF-8 encoded: " . bin2hex($data) . "\n";
    $compressed = gzdeflate($data, 9);
    echo "Compressed: " . bin2hex($compressed) . "\n";
    return encode64($compressed);
}

function encode6bit($b) {
    if ($b < 10) {
        return chr(48 + $b);
    }
    $b -= 10;
    if ($b < 26) {
        return chr(65 + $b);
    }
    $b -= 26;
    if ($b < 26) {
        return chr(97 + $b);
    }
    $b -= 26;
    if ($b == 0) {
        return '-';
    }
    if ($b == 1) {
        return '_';
    }
    return '?';
}

function append3bytes($b1, $b2, $b3) {
    $c1 = $b1 >> 2;
    $c2 = (($b1 & 0x3) << 4) | ($b2 >> 4);
    $c3 = (($b2 & 0xF) << 2) | ($b3 >> 6);
    $c4 = $b3 & 0x3F;
    $r = "";
    $r .= encode6bit($c1 & 0x3F);
    $r .= encode6bit($c2 & 0x3F);
    $r .= encode6bit($c3 & 0x3F);
    $r .= encode6bit($c4 & 0x3F);
    return $r;
}

function encode64($c) {
    $str = "";
    $len = strlen($c);
    for ($i = 0; $i < $len; $i+=3) {
        if ($i+2==$len) {
            $str .= append3bytes(ord(substr($c, $i, 1)), ord(substr($c, $i+1, 1)), 0);
        } else if ($i+1==$len) {
            $str .= append3bytes(ord(substr($c, $i, 1)), 0, 0);
        } else {
            $str .= append3bytes(ord(substr($c, $i, 1)), ord(substr($c, $i+1, 1)),
                ord(substr($c, $i+2, 1)));
        }
    }
    return $str;
}

function run_test($test_content, $expected) {
    $encoded = encodep($test_content);
    echo "Test content: $test_content\n";
    echo "Encoded: " . $encoded . "\n";
    echo "Expected: $expected\n";
    echo "Matches: " . ($encoded === $expected ? "True" : "False") . "\n\n";
}

// Test case 1: With @startuml and @enduml
$test_content1 = "@startuml\nAlice -> Bob: test\nBob -> Alice: test\n@enduml";
run_test($test_content1, "SoWkIImgAStDuNBCoKnELT2rKt3AJx9IA4ajBk42ia02O1cea4DgNWfGDG00");

// Test case 2: Without @startuml and @enduml
$test_content2 = "Alice -> Bob: test\nBob -> Alice: test";
run_test($test_content2, "Syp9J4vLqBLJSCfFib8eIIqkuGAoG0AE81c84000");
?>

The output of which is

php encode_php.php
UTF-8 encoded: 407374617274756d6c0a416c696365202d3e20426f623a20746573740a426f62202d3e20416c6963653a20746573740a40656e64756d6c
Compressed: 73282e492c2a29cdcde172ccc94c4e55d0b55370ca4fb25228492d2ee102b240026019a890436a5e4a696e0e00
Test content: @startuml
Alice -> Bob: test
Bob -> Alice: test
@enduml
Encoded: SoWkIImgAStDuNBCoKnELT2rKt3AJx9IA4ajBk42ia02O1cea4DgNaffRWu0
Expected: SoWkIImgAStDuNBCoKnELT2rKt3AJx9IA4ajBk42ia02O1cea4DgNWfGDG00
Matches: False

UTF-8 encoded: 416c696365202d3e20426f623a20746573740a426f62202d3e20416c6963653a2074657374
Compressed: 73ccc94c4e55d0b55370ca4fb25228492d2ee102b240028e2019881000
Test content: Alice -> Bob: test
Bob -> Alice: test
Encoded: Syp9J4vLqBLJSCfFib8eIIqkuGAoG0AE81c84000
Expected: Syp9J4vLqBLJSCfFib8eIIqkuGAoG0AE81c84000
Matches: True

Poking Around

I'm inclined to standardize on the jar output, so I went looking at the PlantUML source. The first thing I noticed was that it strips out the markers before encoding. That explains why its output is shorter, and it matches the behavior we saw in the Perl and PHP scripts, where only the marker-free test case matched.
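To put a number on the size difference, here is a small sketch (the `encoded_length` helper is hypothetical, not from PlantUML or mkdocs_puml). It compresses the text as raw DEFLATE and computes how many characters the 3-bytes-to-4-characters encoding would produce.

```python
import zlib


def encoded_length(text: str) -> int:
    """Length of the PlantUML-style encoding of `text`, in characters."""
    deflater = zlib.compressobj(level=9, wbits=-15)  # raw DEFLATE
    raw = deflater.compress(text.encode("utf-8")) + deflater.flush()
    # Every group of up to 3 compressed bytes becomes 4 output characters.
    return 4 * ((len(raw) + 2) // 3)


with_markers = "@startuml\nAlice -> Bob: test\nBob -> Alice: test\n@enduml"
without_markers = "Alice -> Bob: test\nBob -> Alice: test"

print(encoded_length(with_markers), encoded_length(without_markers))
```

In the Perl and PHP runs above, the same two inputs came out at 60 and 40 characters respectively.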

Before going down that path I wanted to first try and understand why this was happening, so I wasted a bunch of time trying to logic around it.

encoder_v1.py

import zlib

# Base64-like encoding using PlantUML's specific character set
encode6bit = "0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz-_"

def encode_6bit(b):
    return encode6bit[b & 0x3F]

def append_3bytes(b1, b2, b3):
    c1 = b1 >> 2
    c2 = ((b1 & 0x3) << 4) | (b2 >> 4)
    c3 = ((b2 & 0xF) << 2) | (b3 >> 6)
    c4 = b3 & 0x3F
    
    print(f"\n=========================================================================\n")
    print(f"Input bytes: {b1:02x} {b2:02x} {b3:02x}")
    print(f"Binary representation: {b1:08b} {b2:08b} {b3:08b}")
    
    print(f"\nCalculations:")
    print(f"c1 = b1 >> 2:                    {b1:08b} >> 2 = {c1:06b} ({c1})")
    print(f"c2 = ((b1 & 0x3) << 4) | (b2 >> 4):")
    print(f"    (b1 & 0x3):                  {b1:08b} & 00000011 = {b1 & 0x3:08b}")
    print(f"    (b1 & 0x3) << 4:             {(b1 & 0x3) << 4:08b}")
    print(f"    b2 >> 4:                     {b2:08b} >> 4 = {b2 >> 4:04b}")
    print(f"    Combined:                    {c2:06b} ({c2})")
    print(f"c3 = ((b2 & 0xF) << 2) | (b3 >> 6):")
    print(f"    (b2 & 0xF):                  {b2:08b} & 00001111 = {b2 & 0xF:04b}")
    print(f"    (b2 & 0xF) << 2:             {(b2 & 0xF) << 2:06b}")
    print(f"    b3 >> 6:                     {b3:08b} >> 6 = {b3 >> 6:02b}")
    print(f"    Combined:                    {c3:06b} ({c3})")
    print(f"c4 = b3 & 0x3F:                  {b3:08b} & 00111111 = {c4:06b} ({c4})")
    
    print(f"\nEncoding:")
    print(f"Encoded c1 ({c1:06b}): {encode_6bit(c1)}")
    print(f"Encoded c2 ({c2:06b}): {encode_6bit(c2)}")
    print(f"Encoded c3 ({c3:06b}): {encode_6bit(c3)}")
    print(f"Encoded c4 ({c4:06b}): {encode_6bit(c4)}")
    
    result = encode_6bit(c1) + encode_6bit(c2) + encode_6bit(c3) + encode_6bit(c4)
    print(f"\nFinal result: {result}")
    
    return result

def encode64(data):
    result = []
    print(f"Compressed data (hex): {data.hex()}")
    print(f"Compressed data (bytes): {list(data)}")
    for i in range(0, len(data), 3):
        if i + 2 < len(data):
            result.append(append_3bytes(data[i], data[i+1], data[i+2]))
        elif i + 1 < len(data):
            print(f"Processing last 2 bytes: {data[i]:02x} {data[i+1]:02x}")
            result.append(append_3bytes(data[i], data[i+1], 0))
        else:
            print(f"Processing last byte: {data[i]:02x}")
            result.append(append_3bytes(data[i], 0, 0))
    return ''.join(result)

def encode(text):
    # Clean the input (trim whitespace)
    cleaned_text = text.strip()

    print("--------------------------------------------------------------------")
    print(f"Cleaned text:")
    print(cleaned_text)
    print("--------------------------------------------------------------------")

    # Compress the entire cleaned text using raw DEFLATE
    compressor = zlib.compressobj(level=9, wbits=-15)  # -15 for raw DEFLATE
    compressed_data = compressor.compress(cleaned_text.encode('utf-8'))
    compressed_data += compressor.flush()

    print(f"Compressed data (hex): {compressed_data.hex()}")

    # Encode the compressed data
    encoded = encode64(compressed_data)

    print(f"Encoded: {encoded}")

    return encoded

def run_test(test_content, expected):
    encoded = encode(test_content)
    print("\n--------------------------------------------------------------------")
    print(f"Test content: {test_content}")
    print("--------------------------------------------------------------------")
    print(f"Encoded : {encoded}")
    print(f"Expected: {expected}")
    print(f"Matches : {encoded == expected}")

# Example test case
if __name__ == "__main__":
    test_content = "@startuml\nAlice -> Bob: test\nBob -> Alice: test\n@enduml"
    run_test(test_content, "SoWkIImgAStDuNBCoKnELT2rKt3AJx9IA4ajBk42ia02O1cea4DgNWfGDG00")


And the output. It's a little much, but let's chalk it up to a learning exercise.

python encode_test.py 
--------------------------------------------------------------------
Cleaned text:
@startuml
Alice -> Bob: test
Bob -> Alice: test
@enduml
--------------------------------------------------------------------
Compressed data (hex): 73282e492c2a29cdcde172ccc94c4e55d0b55370ca4fb25228492d2ee102b240026019a890436a5e4a696e0e00
Compressed data (hex): 73282e492c2a29cdcde172ccc94c4e55d0b55370ca4fb25228492d2ee102b240026019a890436a5e4a696e0e00
Compressed data (bytes): [115, 40, 46, 73, 44, 42, 41, 205, 205, 225, 114, 204, 201, 76, 78, 85, 208, 181, 83, 112, 202, 79, 178, 82, 40, 73, 45, 46, 225, 2, 178, 64, 2, 96, 25, 168, 144, 67, 106, 94, 74, 105, 110, 14, 0]

=========================================================================

Input bytes: 73 28 2e
Binary representation: 01110011 00101000 00101110

Calculations:
c1 = b1 >> 2:                    01110011 >> 2 = 011100 (28)
c2 = ((b1 & 0x3) << 4) | (b2 >> 4):
    (b1 & 0x3):                  01110011 & 00000011 = 00000011
    (b1 & 0x3) << 4:             00110000
    b2 >> 4:                     00101000 >> 4 = 0010
    Combined:                    110010 (50)
c3 = ((b2 & 0xF) << 2) | (b3 >> 6):
    (b2 & 0xF):                  00101000 & 00001111 = 1000
    (b2 & 0xF) << 2:             100000
    b3 >> 6:                     00101110 >> 6 = 00
    Combined:                    100000 (32)
c4 = b3 & 0x3F:                  00101110 & 00111111 = 101110 (46)

Encoding:
Encoded c1 (011100): S
Encoded c2 (110010): o
Encoded c3 (100000): W
Encoded c4 (101110): k

Final result: SoWk

=========================================================================

Input bytes: 49 2c 2a
Binary representation: 01001001 00101100 00101010

Calculations:
c1 = b1 >> 2:                    01001001 >> 2 = 010010 (18)
c2 = ((b1 & 0x3) << 4) | (b2 >> 4):
    (b1 & 0x3):                  01001001 & 00000011 = 00000001
    (b1 & 0x3) << 4:             00010000
    b2 >> 4:                     00101100 >> 4 = 0010
    Combined:                    010010 (18)
c3 = ((b2 & 0xF) << 2) | (b3 >> 6):
    (b2 & 0xF):                  00101100 & 00001111 = 1100
    (b2 & 0xF) << 2:             110000
    b3 >> 6:                     00101010 >> 6 = 00
    Combined:                    110000 (48)
c4 = b3 & 0x3F:                  00101010 & 00111111 = 101010 (42)

Encoding:
Encoded c1 (010010): I
Encoded c2 (010010): I
Encoded c3 (110000): m
Encoded c4 (101010): g

Final result: IImg

=========================================================================

Input bytes: 29 cd cd
Binary representation: 00101001 11001101 11001101

Calculations:
c1 = b1 >> 2:                    00101001 >> 2 = 001010 (10)
c2 = ((b1 & 0x3) << 4) | (b2 >> 4):
    (b1 & 0x3):                  00101001 & 00000011 = 00000001
    (b1 & 0x3) << 4:             00010000
    b2 >> 4:                     11001101 >> 4 = 1100
    Combined:                    011100 (28)
c3 = ((b2 & 0xF) << 2) | (b3 >> 6):
    (b2 & 0xF):                  11001101 & 00001111 = 1101
    (b2 & 0xF) << 2:             110100
    b3 >> 6:                     11001101 >> 6 = 11
    Combined:                    110111 (55)
c4 = b3 & 0x3F:                  11001101 & 00111111 = 001101 (13)

Encoding:
Encoded c1 (001010): A
Encoded c2 (011100): S
Encoded c3 (110111): t
Encoded c4 (001101): D

Final result: AStD

=========================================================================

Input bytes: e1 72 cc
Binary representation: 11100001 01110010 11001100

Calculations:
c1 = b1 >> 2:                    11100001 >> 2 = 111000 (56)
c2 = ((b1 & 0x3) << 4) | (b2 >> 4):
    (b1 & 0x3):                  11100001 & 00000011 = 00000001
    (b1 & 0x3) << 4:             00010000
    b2 >> 4:                     01110010 >> 4 = 0111
    Combined:                    010111 (23)
c3 = ((b2 & 0xF) << 2) | (b3 >> 6):
    (b2 & 0xF):                  01110010 & 00001111 = 0010
    (b2 & 0xF) << 2:             001000
    b3 >> 6:                     11001100 >> 6 = 11
    Combined:                    001011 (11)
c4 = b3 & 0x3F:                  11001100 & 00111111 = 001100 (12)

Encoding:
Encoded c1 (111000): u
Encoded c2 (010111): N
Encoded c3 (001011): B
Encoded c4 (001100): C

Final result: uNBC

=========================================================================

Input bytes: c9 4c 4e
Binary representation: 11001001 01001100 01001110

Calculations:
c1 = b1 >> 2:                    11001001 >> 2 = 110010 (50)
c2 = ((b1 & 0x3) << 4) | (b2 >> 4):
    (b1 & 0x3):                  11001001 & 00000011 = 00000001
    (b1 & 0x3) << 4:             00010000
    b2 >> 4:                     01001100 >> 4 = 0100
    Combined:                    010100 (20)
c3 = ((b2 & 0xF) << 2) | (b3 >> 6):
    (b2 & 0xF):                  01001100 & 00001111 = 1100
    (b2 & 0xF) << 2:             110000
    b3 >> 6:                     01001110 >> 6 = 01
    Combined:                    110001 (49)
c4 = b3 & 0x3F:                  01001110 & 00111111 = 001110 (14)

Encoding:
Encoded c1 (110010): o
Encoded c2 (010100): K
Encoded c3 (110001): n
Encoded c4 (001110): E

Final result: oKnE

=========================================================================

Input bytes: 55 d0 b5
Binary representation: 01010101 11010000 10110101

Calculations:
c1 = b1 >> 2:                    01010101 >> 2 = 010101 (21)
c2 = ((b1 & 0x3) << 4) | (b2 >> 4):
    (b1 & 0x3):                  01010101 & 00000011 = 00000001
    (b1 & 0x3) << 4:             00010000
    b2 >> 4:                     11010000 >> 4 = 1101
    Combined:                    011101 (29)
c3 = ((b2 & 0xF) << 2) | (b3 >> 6):
    (b2 & 0xF):                  11010000 & 00001111 = 0000
    (b2 & 0xF) << 2:             000000
    b3 >> 6:                     10110101 >> 6 = 10
    Combined:                    000010 (2)
c4 = b3 & 0x3F:                  10110101 & 00111111 = 110101 (53)

Encoding:
Encoded c1 (010101): L
Encoded c2 (011101): T
Encoded c3 (000010): 2
Encoded c4 (110101): r

Final result: LT2r

=========================================================================

Input bytes: 53 70 ca
Binary representation: 01010011 01110000 11001010

Calculations:
c1 = b1 >> 2:                    01010011 >> 2 = 010100 (20)
c2 = ((b1 & 0x3) << 4) | (b2 >> 4):
    (b1 & 0x3):                  01010011 & 00000011 = 00000011
    (b1 & 0x3) << 4:             00110000
    b2 >> 4:                     01110000 >> 4 = 0111
    Combined:                    110111 (55)
c3 = ((b2 & 0xF) << 2) | (b3 >> 6):
    (b2 & 0xF):                  01110000 & 00001111 = 0000
    (b2 & 0xF) << 2:             000000
    b3 >> 6:                     11001010 >> 6 = 11
    Combined:                    000011 (3)
c4 = b3 & 0x3F:                  11001010 & 00111111 = 001010 (10)

Encoding:
Encoded c1 (010100): K
Encoded c2 (110111): t
Encoded c3 (000011): 3
Encoded c4 (001010): A

Final result: Kt3A

=========================================================================

Input bytes: 4f b2 52
Binary representation: 01001111 10110010 01010010

Calculations:
c1 = b1 >> 2:                    01001111 >> 2 = 010011 (19)
c2 = ((b1 & 0x3) << 4) | (b2 >> 4):
    (b1 & 0x3):                  01001111 & 00000011 = 00000011
    (b1 & 0x3) << 4:             00110000
    b2 >> 4:                     10110010 >> 4 = 1011
    Combined:                    111011 (59)
c3 = ((b2 & 0xF) << 2) | (b3 >> 6):
    (b2 & 0xF):                  10110010 & 00001111 = 0010
    (b2 & 0xF) << 2:             001000
    b3 >> 6:                     01010010 >> 6 = 01
    Combined:                    001001 (9)
c4 = b3 & 0x3F:                  01010010 & 00111111 = 010010 (18)

Encoding:
Encoded c1 (010011): J
Encoded c2 (111011): x
Encoded c3 (001001): 9
Encoded c4 (010010): I

Final result: Jx9I

=========================================================================

Input bytes: 28 49 2d
Binary representation: 00101000 01001001 00101101

Calculations:
c1 = b1 >> 2:                    00101000 >> 2 = 001010 (10)
c2 = ((b1 & 0x3) << 4) | (b2 >> 4):
    (b1 & 0x3):                  00101000 & 00000011 = 00000000
    (b1 & 0x3) << 4:             00000000
    b2 >> 4:                     01001001 >> 4 = 0100
    Combined:                    000100 (4)
c3 = ((b2 & 0xF) << 2) | (b3 >> 6):
    (b2 & 0xF):                  01001001 & 00001111 = 1001
    (b2 & 0xF) << 2:             100100
    b3 >> 6:                     00101101 >> 6 = 00
    Combined:                    100100 (36)
c4 = b3 & 0x3F:                  00101101 & 00111111 = 101101 (45)

Encoding:
Encoded c1 (001010): A
Encoded c2 (000100): 4
Encoded c3 (100100): a
Encoded c4 (101101): j

Final result: A4aj

=========================================================================

Input bytes: 2e e1 02
Binary representation: 00101110 11100001 00000010

Calculations:
c1 = b1 >> 2:                    00101110 >> 2 = 001011 (11)
c2 = ((b1 & 0x3) << 4) | (b2 >> 4):
    (b1 & 0x3):                  00101110 & 00000011 = 00000010
    (b1 & 0x3) << 4:             00100000
    b2 >> 4:                     11100001 >> 4 = 1110
    Combined:                    101110 (46)
c3 = ((b2 & 0xF) << 2) | (b3 >> 6):
    (b2 & 0xF):                  11100001 & 00001111 = 0001
    (b2 & 0xF) << 2:             000100
    b3 >> 6:                     00000010 >> 6 = 00
    Combined:                    000100 (4)
c4 = b3 & 0x3F:                  00000010 & 00111111 = 000010 (2)

Encoding:
Encoded c1 (001011): B
Encoded c2 (101110): k
Encoded c3 (000100): 4
Encoded c4 (000010): 2

Final result: Bk42

=========================================================================

Input bytes: b2 40 02
Binary representation: 10110010 01000000 00000010

Calculations:
c1 = b1 >> 2:                    10110010 >> 2 = 101100 (44)
c2 = ((b1 & 0x3) << 4) | (b2 >> 4):
    (b1 & 0x3):                  10110010 & 00000011 = 00000010
    (b1 & 0x3) << 4:             00100000
    b2 >> 4:                     01000000 >> 4 = 0100
    Combined:                    100100 (36)
c3 = ((b2 & 0xF) << 2) | (b3 >> 6):
    (b2 & 0xF):                  01000000 & 00001111 = 0000
    (b2 & 0xF) << 2:             000000
    b3 >> 6:                     00000010 >> 6 = 00
    Combined:                    000000 (0)
c4 = b3 & 0x3F:                  00000010 & 00111111 = 000010 (2)

Encoding:
Encoded c1 (101100): i
Encoded c2 (100100): a
Encoded c3 (000000): 0
Encoded c4 (000010): 2

Final result: ia02

=========================================================================

Input bytes: 60 19 a8
Binary representation: 01100000 00011001 10101000

Calculations:
c1 = b1 >> 2:                    01100000 >> 2 = 011000 (24)
c2 = ((b1 & 0x3) << 4) | (b2 >> 4):
    (b1 & 0x3):                  01100000 & 00000011 = 00000000
    (b1 & 0x3) << 4:             00000000
    b2 >> 4:                     00011001 >> 4 = 0001
    Combined:                    000001 (1)
c3 = ((b2 & 0xF) << 2) | (b3 >> 6):
    (b2 & 0xF):                  00011001 & 00001111 = 1001
    (b2 & 0xF) << 2:             100100
    b3 >> 6:                     10101000 >> 6 = 10
    Combined:                    100110 (38)
c4 = b3 & 0x3F:                  10101000 & 00111111 = 101000 (40)

Encoding:
Encoded c1 (011000): O
Encoded c2 (000001): 1
Encoded c3 (100110): c
Encoded c4 (101000): e

Final result: O1ce

=========================================================================

Input bytes: 90 43 6a
Binary representation: 10010000 01000011 01101010

Calculations:
c1 = b1 >> 2:                    10010000 >> 2 = 100100 (36)
c2 = ((b1 & 0x3) << 4) | (b2 >> 4):
    (b1 & 0x3):                  10010000 & 00000011 = 00000000
    (b1 & 0x3) << 4:             00000000
    b2 >> 4:                     01000011 >> 4 = 0100
    Combined:                    000100 (4)
c3 = ((b2 & 0xF) << 2) | (b3 >> 6):
    (b2 & 0xF):                  01000011 & 00001111 = 0011
    (b2 & 0xF) << 2:             001100
    b3 >> 6:                     01101010 >> 6 = 01
    Combined:                    001101 (13)
c4 = b3 & 0x3F:                  01101010 & 00111111 = 101010 (42)

Encoding:
Encoded c1 (100100): a
Encoded c2 (000100): 4
Encoded c3 (001101): D
Encoded c4 (101010): g

Final result: a4Dg

=========================================================================

Input bytes: 5e 4a 69
Binary representation: 01011110 01001010 01101001

Calculations:
c1 = b1 >> 2:                    01011110 >> 2 = 010111 (23)
c2 = ((b1 & 0x3) << 4) | (b2 >> 4):
    (b1 & 0x3):                  01011110 & 00000011 = 00000010
    (b1 & 0x3) << 4:             00100000
    b2 >> 4:                     01001010 >> 4 = 0100
    Combined:                    100100 (36)
c3 = ((b2 & 0xF) << 2) | (b3 >> 6):
    (b2 & 0xF):                  01001010 & 00001111 = 1010
    (b2 & 0xF) << 2:             101000
    b3 >> 6:                     01101001 >> 6 = 01
    Combined:                    101001 (41)
c4 = b3 & 0x3F:                  01101001 & 00111111 = 101001 (41)

Encoding:
Encoded c1 (010111): N
Encoded c2 (100100): a
Encoded c3 (101001): f
Encoded c4 (101001): f

Final result: Naff

=========================================================================

Input bytes: 6e 0e 00
Binary representation: 01101110 00001110 00000000

Calculations:
c1 = b1 >> 2:                    01101110 >> 2 = 011011 (27)
c2 = ((b1 & 0x3) << 4) | (b2 >> 4):
    (b1 & 0x3):                  01101110 & 00000011 = 00000010
    (b1 & 0x3) << 4:             00100000
    b2 >> 4:                     00001110 >> 4 = 0000
    Combined:                    100000 (32)
c3 = ((b2 & 0xF) << 2) | (b3 >> 6):
    (b2 & 0xF):                  00001110 & 00001111 = 1110
    (b2 & 0xF) << 2:             111000
    b3 >> 6:                     00000000 >> 6 = 00
    Combined:                    111000 (56)
c4 = b3 & 0x3F:                  00000000 & 00111111 = 000000 (0)

Encoding:
Encoded c1 (011011): R
Encoded c2 (100000): W
Encoded c3 (111000): u
Encoded c4 (000000): 0

Final result: RWu0
Encoded: SoWkIImgAStDuNBCoKnELT2rKt3AJx9IA4ajBk42ia02O1cea4DgNaffRWu0

--------------------------------------------------------------------
Test content: @startuml
Alice -> Bob: test
Bob -> Alice: test
@enduml
--------------------------------------------------------------------
Encoded : SoWkIImgAStDuNBCoKnELT2rKt3AJx9IA4ajBk42ia02O1cea4DgNaffRWu0
Expected: SoWkIImgAStDuNBCoKnELT2rKt3AJx9IA4ajBk42ia02O1cea4DgNWfGDG00
Matches : False

I still couldn't figure out why I could never get it closer, so I threw in the towel and went back to implementing it by replicating the Java code. Here's my attempt at it.

encoder_v2.py

import zlib
import re

# String cleaning logic based on the Java ArobaseStringCompressor
class ArobaseStringCompressor:
    pattern = re.compile(r"(?s)^[\s]*(@startuml[^\n\r]*)?[\s]*(.*?)[\s]*(@enduml)?[\s]*$")

    def compress(self, data):
        lines = data.splitlines()
        result = []
        start_done = False

        for line in lines:
            if line.startswith("@startuml"):
                start_done = True
            elif line.startswith("@enduml"):
                return '\n'.join(result)
            elif start_done:
                result.append(line)

        if not start_done:
            return self.compress_old('\n'.join(lines))
        return '\n'.join(result)

    def compress_old(self, s):
        match = self.pattern.match(s)
        if match:
            return self.clean(match.group(2))
        return ""

    def clean(self, s):
        s = s.strip()
        s = re.sub(r"@enduml[^\n\r]*", "", s)
        s = re.sub(r"@startuml[^\n\r]*", "", s)
        return s.strip()

# Custom Deflate Compressor
class CustomDeflate:
    def compress(self, data):
        compressor = zlib.compressobj(level=9, wbits=-15)
        compressed = compressor.compress(data) + compressor.flush()

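        # Heuristic: cut the stream at the first 0x0e 0x00 byte pair, if present.
        # That pair can also occur inside valid compressed data, so this is fragile.
        truncate_index = compressed.find(b'\x0e\x00')
        if truncate_index != -1:
            compressed = compressed[:truncate_index]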
        return compressed

# Base64-like encoding using PlantUML's custom character set
def encode_6bit(b):
    if b < 10:
        return chr(48 + b)
    b -= 10
    if b < 26:
        return chr(65 + b)
    b -= 26
    if b < 26:
        return chr(97 + b)
    b -= 26
    if b == 0:
        return '-'
    if b == 1:
        return '_'
    return '?'

def append_3bytes(b1, b2, b3):
    c1 = b1 >> 2
    c2 = ((b1 & 0x3) << 4) | (b2 >> 4)
    c3 = ((b2 & 0xF) << 2) | (b3 >> 6)
    c4 = b3 & 0x3F

    print(f"\nProcessing bytes: {b1:02x} {b2:02x} {b3:02x}")
    print(f"b1 (binary): {b1:08b}")
    print(f"b2 (binary): {b2:08b}")
    print(f"b3 (binary): {b3:08b}")
    print(f"c1 = b1 >> 2: {c1:06b} ({c1})")
    print(f"c2 = ((b1 & 0x3) << 4) | (b2 >> 4): {c2:06b} ({c2})")
    print(f"c3 = ((b2 & 0xF) << 2) | (b3 >> 6): {c3:06b} ({c3})")
    print(f"c4 = b3 & 0x3F: {c4:06b} ({c4})")
    encoded_c1 = encode_6bit(c1)
    encoded_c2 = encode_6bit(c2)
    encoded_c3 = encode_6bit(c3)
    encoded_c4 = encode_6bit(c4)
    print(f"Encoded c1: {encoded_c1} (from {c1})")
    print(f"Encoded c2: {encoded_c2} (from {c2})")
    print(f"Encoded c3: {encoded_c3} (from {c3})")
    print(f"Encoded c4: {encoded_c4} (from {c4})")
    result = encoded_c1 + encoded_c2 + encoded_c3 + encoded_c4
    print(f"Result: {result}")
    return result

def encode64(data):
    result = []
    for i in range(0, len(data), 3):
        if i + 2 < len(data):
            result.append(append_3bytes(data[i], data[i+1], data[i+2]))
        elif i + 1 < len(data):
            result.append(append_3bytes(data[i], data[i+1], 0))
        else:
            result.append(append_3bytes(data[i], 0, 0))
    return ''.join(result)

def encode(text):
    # Clean and compress the input using the ArobaseStringCompressor
    compressor = ArobaseStringCompressor()
    cleaned_text = compressor.compress(text)

    print(f"Cleaned text: {cleaned_text}")

    # Compress the diagram using the custom deflate compressor
    deflater = CustomDeflate()
    compressed = deflater.compress(cleaned_text.encode('utf-8'))

    print(f"Compressed data (hex): {compressed.hex()}")

    # Encode to PlantUML's base64-like format
    return encode64(compressed)

def run_test(test_content, expected):
    encoded = encode(test_content)
    print(f"\nTest content: {test_content}")
    print(f"Encoded : {encoded}")
    print(f"Expected: {expected}")
    print(f"Matches : {encoded == expected}\n")


# Example usage with the custom deflate and encoding process
if __name__ == "__main__":
    # Test case 1: Without @startuml and @enduml
    test_content1 = "Alice -> Bob: test\r\nBob -> Alice: test"
    run_test(test_content1, "Syp9J4vLqBLJSCfFib8eIIqkuGAoG0AE81c84000")

The output

Cleaned text: Alice -> Bob: test
Bob -> Alice: test
Compressed data (hex): 73ccc94c4e55d0b55370ca4fb25228492d2ee102b240028e2019881000

Processing bytes: 73 cc c9
b1 (binary): 01110011
b2 (binary): 11001100
b3 (binary): 11001001
c1 = b1 >> 2: 011100 (28)
c2 = ((b1 & 0x3) << 4) | (b2 >> 4): 111100 (60)
c3 = ((b2 & 0xF) << 2) | (b3 >> 6): 110011 (51)
c4 = b3 & 0x3F: 001001 (9)
Encoded c1: S (from 28)
Encoded c2: y (from 60)
Encoded c3: p (from 51)
Encoded c4: 9 (from 9)
Result: Syp9

Processing bytes: 4c 4e 55
b1 (binary): 01001100
b2 (binary): 01001110
b3 (binary): 01010101
c1 = b1 >> 2: 010011 (19)
c2 = ((b1 & 0x3) << 4) | (b2 >> 4): 000100 (4)
c3 = ((b2 & 0xF) << 2) | (b3 >> 6): 111001 (57)
c4 = b3 & 0x3F: 010101 (21)
Encoded c1: J (from 19)
Encoded c2: 4 (from 4)
Encoded c3: v (from 57)
Encoded c4: L (from 21)
Result: J4vL

Processing bytes: d0 b5 53
b1 (binary): 11010000
b2 (binary): 10110101
b3 (binary): 01010011
c1 = b1 >> 2: 110100 (52)
c2 = ((b1 & 0x3) << 4) | (b2 >> 4): 001011 (11)
c3 = ((b2 & 0xF) << 2) | (b3 >> 6): 010101 (21)
c4 = b3 & 0x3F: 010011 (19)
Encoded c1: q (from 52)
Encoded c2: B (from 11)
Encoded c3: L (from 21)
Encoded c4: J (from 19)
Result: qBLJ

Processing bytes: 70 ca 4f
b1 (binary): 01110000
b2 (binary): 11001010
b3 (binary): 01001111
c1 = b1 >> 2: 011100 (28)
c2 = ((b1 & 0x3) << 4) | (b2 >> 4): 001100 (12)
c3 = ((b2 & 0xF) << 2) | (b3 >> 6): 101001 (41)
c4 = b3 & 0x3F: 001111 (15)
Encoded c1: S (from 28)
Encoded c2: C (from 12)
Encoded c3: f (from 41)
Encoded c4: F (from 15)
Result: SCfF

Processing bytes: b2 52 28
b1 (binary): 10110010
b2 (binary): 01010010
b3 (binary): 00101000
c1 = b1 >> 2: 101100 (44)
c2 = ((b1 & 0x3) << 4) | (b2 >> 4): 100101 (37)
c3 = ((b2 & 0xF) << 2) | (b3 >> 6): 001000 (8)
c4 = b3 & 0x3F: 101000 (40)
Encoded c1: i (from 44)
Encoded c2: b (from 37)
Encoded c3: 8 (from 8)
Encoded c4: e (from 40)
Result: ib8e

Processing bytes: 49 2d 2e
b1 (binary): 01001001
b2 (binary): 00101101
b3 (binary): 00101110
c1 = b1 >> 2: 010010 (18)
c2 = ((b1 & 0x3) << 4) | (b2 >> 4): 010010 (18)
c3 = ((b2 & 0xF) << 2) | (b3 >> 6): 110100 (52)
c4 = b3 & 0x3F: 101110 (46)
Encoded c1: I (from 18)
Encoded c2: I (from 18)
Encoded c3: q (from 52)
Encoded c4: k (from 46)
Result: IIqk

Processing bytes: e1 02 b2
b1 (binary): 11100001
b2 (binary): 00000010
b3 (binary): 10110010
c1 = b1 >> 2: 111000 (56)
c2 = ((b1 & 0x3) << 4) | (b2 >> 4): 010000 (16)
c3 = ((b2 & 0xF) << 2) | (b3 >> 6): 001010 (10)
c4 = b3 & 0x3F: 110010 (50)
Encoded c1: u (from 56)
Encoded c2: G (from 16)
Encoded c3: A (from 10)
Encoded c4: o (from 50)
Result: uGAo

Processing bytes: 40 02 8e
b1 (binary): 01000000
b2 (binary): 00000010
b3 (binary): 10001110
c1 = b1 >> 2: 010000 (16)
c2 = ((b1 & 0x3) << 4) | (b2 >> 4): 000000 (0)
c3 = ((b2 & 0xF) << 2) | (b3 >> 6): 001010 (10)
c4 = b3 & 0x3F: 001110 (14)
Encoded c1: G (from 16)
Encoded c2: 0 (from 0)
Encoded c3: A (from 10)
Encoded c4: E (from 14)
Result: G0AE

Processing bytes: 20 19 88
b1 (binary): 00100000
b2 (binary): 00011001
b3 (binary): 10001000
c1 = b1 >> 2: 001000 (8)
c2 = ((b1 & 0x3) << 4) | (b2 >> 4): 000001 (1)
c3 = ((b2 & 0xF) << 2) | (b3 >> 6): 100110 (38)
c4 = b3 & 0x3F: 001000 (8)
Encoded c1: 8 (from 8)
Encoded c2: 1 (from 1)
Encoded c3: c (from 38)
Encoded c4: 8 (from 8)
Result: 81c8

Processing bytes: 10 00 00
b1 (binary): 00010000
b2 (binary): 00000000
b3 (binary): 00000000
c1 = b1 >> 2: 000100 (4)
c2 = ((b1 & 0x3) << 4) | (b2 >> 4): 000000 (0)
c3 = ((b2 & 0xF) << 2) | (b3 >> 6): 000000 (0)
c4 = b3 & 0x3F: 000000 (0)
Encoded c1: 4 (from 4)
Encoded c2: 0 (from 0)
Encoded c3: 0 (from 0)
Encoded c4: 0 (from 0)
Result: 4000

Test content: Alice -> Bob: test
Bob -> Alice: test
Encoded : Syp9J4vLqBLJSCfFib8eIIqkuGAoG0AE81c84000
Expected: Syp9J4vLqBLJSCfFib8eIIqkuGAoG0AE81c84000
Matches : True
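
For comparison, the 6-bit step can be written more compactly with a lookup alphabet, and paired with a decoder to inspect what an encoded string contains. This is only a sketch: `PLANTUML_ALPHABET`, `encode64_compact`, `decode64`, and `inflate` are illustrative names, not part of mkdocs_puml or PlantUML, and `inflate` assumes its input is a complete raw deflate stream.

```python
import string
import zlib

# Same character order as encode_6bit above: 0-9, A-Z, a-z, '-', '_'
PLANTUML_ALPHABET = string.digits + string.ascii_uppercase + string.ascii_lowercase + "-_"
_INDEX = {ch: i for i, ch in enumerate(PLANTUML_ALPHABET)}


def encode64_compact(data: bytes) -> str:
    """Encode bytes 3 at a time into 4 characters, zero-padding the last group."""
    out = []
    for i in range(0, len(data), 3):
        n = int.from_bytes(data[i:i + 3].ljust(3, b"\x00"), "big")
        out.extend(PLANTUML_ALPHABET[(n >> shift) & 0x3F] for shift in (18, 12, 6, 0))
    return "".join(out)


def decode64(text: str) -> bytes:
    """Inverse of encode64_compact; any zero padding bytes are kept."""
    out = bytearray()
    for i in range(0, len(text), 4):
        n = 0
        for ch in text[i:i + 4].ljust(4, "0"):
            n = (n << 6) | _INDEX[ch]
        out += n.to_bytes(3, "big")
    return bytes(out)


def inflate(data: bytes) -> str:
    """Decompress a raw deflate stream, ignoring bytes after its end."""
    return zlib.decompressobj(-15).decompress(data).decode("utf-8")


# Compressed bytes copied from the "Compressed data (hex)" line above
raw = bytes.fromhex("73ccc94c4e55d0b55370ca4fb25228492d2ee102b240028e2019881000")
print(encode64_compact(raw))  # Syp9J4vLqBLJSCfFib8eIIqkuGAoG0AE81c84000
```

Running `inflate(decode64(...))` on both the expected and the actual string from a failing test shows which source text each one was built from, which is usually quicker than comparing the encoded characters.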

OnceUponALoop commented on Sep 13 '24 03:09

Thanks for your research @OnceUponALoop! That's actually a good idea. I'll look further into the best way to implement it.

MikhailKravets commented on Oct 02 '24 11:10

Hi @OnceUponALoop, we can't strip the @startuml / @enduml tokens from every diagram. For instance, @startgantt should always be present. If we implemented this in the package, the speedup would be marginal. The biggest difficulty is the links inside PlantUML code: most of them point to GitHub, and GitHub has rate limits.
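
One way to respect that constraint would be to strip only a `@startuml` / `@enduml` pair and leave every other diagram type untouched. A minimal sketch, assuming the wrapper tokens sit on the first and last non-empty lines; `strip_uml_wrapper` is a hypothetical helper, not something mkdocs_puml provides:

```python
def strip_uml_wrapper(source: str) -> str:
    """Remove a surrounding @startuml/@enduml pair; leave other diagrams as-is."""
    stripped = source.strip()
    lines = stripped.splitlines()
    if (
        len(lines) >= 2
        and lines[0].split()[0] == "@startuml"
        and lines[-1].split()[0] == "@enduml"
    ):
        # Drop only the first and last line, keep the diagram body
        return "\n".join(lines[1:-1]).strip()
    # @startgantt, @startmindmap, etc. keep their tokens
    return stripped


print(strip_uml_wrapper("@startuml\nBob -> Alice : hello\n@enduml"))
print(strip_uml_wrapper("@startgantt\n[Prototype design] lasts 15 days\n@endgantt"))
```

The first call prints only the `Bob -> Alice : hello` line, while the Gantt diagram comes back unchanged.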

Since version 2.0.0, the plugin includes caching functionality, which helps address the problem to a certain extent. Anyway, thank you for your efforts!
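
As a rough illustration of why caching helps here, rendered output can be keyed by a hash of the diagram source, so an unchanged diagram never triggers another render or another fetch of the links it contains. This is not the plugin's actual implementation; `DiagramCache` and the `render` callback are hypothetical names used only for the sketch:

```python
import hashlib
from typing import Callable, Dict


class DiagramCache:
    """In-memory cache keyed by the SHA-256 of the diagram source."""

    def __init__(self, render: Callable[[str], str]) -> None:
        self._render = render  # the expensive call, e.g. a request to a PlantUML server
        self._store: Dict[str, str] = {}
        self.misses = 0

    def get(self, diagram: str) -> str:
        key = hashlib.sha256(diagram.encode("utf-8")).hexdigest()
        if key not in self._store:
            # Only unseen diagram sources reach the renderer
            self.misses += 1
            self._store[key] = self._render(diagram)
        return self._store[key]


# Stand-in renderer for demonstration only
cache = DiagramCache(lambda src: f"<svg>{len(src)} chars</svg>")
cache.get("@startuml\nBob -> Alice : hello\n@enduml")
cache.get("@startuml\nBob -> Alice : hello\n@enduml")
print(cache.misses)  # 1
```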

MikhailKravets commented on Oct 25 '24 15:10