go-smaz icon indicating copy to clipboard operation
go-smaz copied to clipboard

Improve search

Open klauspost opened this issue 2 years ago • 0 comments

Searching doesn't account for the cost of switching between literals and dictionary entries.

For example "1 2 3 4 5 6 7 8 9" is enlarged by 52% (17 -> 26 bytes) since it 1 as a literal (2 bytes), " " (space) as a dictionary entry (1 byte), then 2 as literal, etc. It shouldn't switch if the next entry isn't in the dictionary, since you always pay at least 1, sometimes 2 byte(s) to output a literal.

This adds a mostly effective check to see whether the next match is a literal or a dictionary entry. It should catch most cases where switching is ineffective, without doing an an exhaustive search.

Finally it checks the final output length against pure literals.

Differences compared to current (before/after):

    smaz_test.go:113: "not-a-g00d-Exampl333" enlarged by 15% (20 -> 23 bytes)
    smaz_test.go:113: "not-a-g00d-Exampl333" enlarged by 10% (20 -> 22 bytes)

    smaz_test.go:115: "1000 numbers 2000 will 10 20 30 compress very little" compressed by 10% (52 -> 47 bytes)
    smaz_test.go:115: "1000 numbers 2000 will 10 20 30 compress very little" compressed by 18% (52 -> 43 bytes)

    smaz_test.go:115: "Mi illumino di immenso" compressed by 5% (22 -> 21 bytes)
    smaz_test.go:115: "Mi illumino di immenso" compressed by 37% (22 -> 14 bytes)

    smaz_test.go:115: "L'autore di questa libreria vive in Sicilia" compressed by 5% (43 -> 41 bytes)
    smaz_test.go:115: "L'autore di questa libreria vive in Sicilia" compressed by 28% (43 -> 31 bytes)

    smaz_test.go:113: "1 2 3 4 5 6 7 8 9" enlarged by 52% (17 -> 26 bytes)
    smaz_test.go:113: "1 2 3 4 5 6 7 8 9" enlarged by 11% (17 -> 19 bytes)

    smaz_test.go:113: "11 22 33 44 55 66 77 88 99" enlarged by 69% (26 -> 44 bytes)
    smaz_test.go:113: "11 22 33 44 55 66 77 88 99" enlarged by 7% (26 -> 28 bytes)

    smaz_test.go:115: "whichwhichwhichwhichwhichwhichwhichwhichwhichwhich11 22 33 44 55 66 77 88 99" compressed by 29% (76 -> 54 bytes)
    smaz_test.go:115: "whichwhichwhichwhichwhichwhichwhichwhichwhichwhich11 22 33 44 55 66 77 88 99" compressed by 50% (76 -> 38 bytes)

    smaz_test.go:115: "whichwhichwhichwhichwhichwhichwhichwhichwhichwhich1 2 3 4 5 6 7 8 9" compressed by 47% (67 -> 36 bytes)
    smaz_test.go:115: "whichwhichwhichwhichwhichwhichwhichwhichwhichwhich1 2 3 4 5 6 7 8 9" compressed by 57% (67 -> 29 bytes)

As mentioned, this only looks a single byte ahead. It could try to resolve further forward for optimal compression, but this catches most bad cases.

klauspost avatar Sep 05 '22 16:09 klauspost