
Add http-aware deduplication mode for httpteleport

Open valyala opened this issue 7 years ago • 3 comments

Smart HTTP-aware deduplication could give a much better compression ratio (and probably better speed) than general-purpose compression algorithms such as gzip or snappy. See this blog post from Cloudflare as a real-world example. It would be great to have a CompressType for HTTP-aware deduplication in httpteleport. @klauspost, could you look into this?

valyala avatar Oct 18 '16 22:10 valyala

It should be fairly pluggable. The current deduplication only has a memory storage, so it will take up memory on the receiving side. The sender only stores hashes of blocks, so that is approx 20-30 bytes/block.

Writing:

    if writeBufferSize <= 0 {
        writeBufferSize = DefaultWriteBufferSize
    }
    if deduplication {
        // Dynamic blocks with average block size of 1KB (4KB is max).
        // Receiver can use *up to* 1GB of RAM, average will be 250MB though.
        dw, err := dedup.NewStreamWriter(w, dedup.ModeDynamic, 4*1024, 1<<30)
        // handle err
        defer dw.Close()
        // Assign to the outer w (assumed to be an io.Writer);
        // `w, err :=` here would only shadow it inside this block.
        w = dw
    }
    bw := bufio.NewWriterSize(w, writeBufferSize)

You can use the Split function to split blocks manually; this also flushes to the underlying writer.

Reading:

    if readBufferSize <= 0 {
        readBufferSize = DefaultReadBufferSize
    }
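    if deduplication {
        // no magic - but note it will block until it can read a few bytes.
        dr, err := dedup.NewStreamReader(r)
        // handle err
        defer dr.Close()
        // Assign to the outer r (assumed to be an io.Reader);
        // `r, err :=` here would only shadow it inside this block.
        r = dr
    }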
    br := bufio.NewReaderSize(r, readBufferSize)

klauspost avatar Oct 19 '16 08:10 klauspost

Thanks! Will experiment with deduplication in spare time

valyala avatar Oct 20 '16 12:10 valyala

I think the main issue is dealing with latency and flushing at the right times, so you don't get responses that are hanging in a buffer somewhere.

Also, this is straight-up deduplication. It could of course be more "content-aware", so that it stores documents on disk and only sends deltas. However, that is way more work in terms of synchronizing sender and receiver, since the receiver needs to communicate what it has and keep it in sync with the sender. The current approach is way easier, since a new connection resets the "fragment cache".

klauspost avatar Oct 20 '16 13:10 klauspost