fastcdc-py

Reduce memory copying in Cython version

Open dw opened this issue 8 months ago • 0 comments

Hey there,

Thanks for a lovely piece of code. I couldn't help but notice a tremendous amount of runtime (at least 20%) being wasted on string copying, and it seems quite straightforward to resolve. The patch below tidies up the main loop to avoid any copying, netting a 42% runtime decrease on my crappy old 2015 XPS laptop, with throughput increasing from roughly 664 MiB/sec to 1158 MiB/sec for my test file (a 12 GB VMware vmdk). At least some of this is explained by avoiding the copies; the rest is likely due to no longer thrashing the CPU cache.
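For anyone who wants to reproduce the comparison, a rough timing loop along these lines should do (a sketch, not my exact harness; the file path is made up, and the Chunk field name is assumed from the `Chunk(offset, cp, raw, h)` yield in the patch below):

```python
import time
from fastcdc.fastcdc_cy import fastcdc_cy

path = "test.vmdk"  # hypothetical large test file (mine was a 12 GB vmdk)

start = time.perf_counter()
total = 0
for chunk in fastcdc_cy(path, avg_size=8192):
    total += chunk.length  # second field of Chunk(offset, cp, raw, h)
elapsed = time.perf_counter() - start

print(f"{total / elapsed / 2**20:.0f} MiB/s over {total} bytes")
```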

I would have submitted this as a PR, but it wasn't clear what the correct semantics for the first argument of fastcdc() are, or whether you are happy to use mmap.mmap() at all. This was only tested on Linux, but similar or identical code should work fine on Windows too. It's also not clear whether there is any value in a fallback mode for situations where mmap() is not available; perhaps there are some, but none occur to me just now. There is also a fixed cost to setting up an mmap(), which means that for smaller files it may still make sense, for performance reasons, to fall back to regular IO.
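To illustrate the fallback idea (not part of the patch; the helper name and the size threshold are made up), something like this would mmap real files when they are large enough and fall back to a plain read otherwise:

```python
import mmap
import os

MMAP_THRESHOLD = 1024 * 1024  # below this, a plain read() is probably cheaper

def as_buffer(data):
    """Return a read-only memoryview over `data` (a path or a bytes-like object)."""
    if isinstance(data, str):
        with open(data, "rb") as fp:
            size = os.fstat(fp.fileno()).st_size
            if size >= MMAP_THRESHOLD:
                try:
                    return memoryview(mmap.mmap(fp.fileno(), 0, access=mmap.ACCESS_READ))
                except (OSError, ValueError):
                    pass  # mmap unavailable (or empty file); fall back to read()
            return memoryview(fp.read())
    return memoryview(data)
```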

diff --git a/fastcdc/fastcdc_cy.pyx b/fastcdc/fastcdc_cy.pyx
index c16ec81..acb8c1a 100644
--- a/fastcdc/fastcdc_cy.pyx
+++ b/fastcdc/fastcdc_cy.pyx
@@ -1,5 +1,6 @@
 # -*- coding: utf-8 -*-
 cimport cython
+import mmap
 from libc.stdint cimport uint32_t, uint8_t
 from libc.math cimport log2, lround
 from io import BytesIO
@@ -17,9 +18,11 @@ def fastcdc_cy(data, min_size=None, avg_size=8192, max_size=None, fat=False, hf=
 
     # Ensure we have a readable stream
     if isinstance(data, str):
-        stream = open(data, "rb")
+        with open(data, 'rb') as fp:
+            map = mmap.mmap(fp.fileno(), 0, access=mmap.ACCESS_READ)
+            stream = memoryview(map)
     elif not hasattr(data, "read"):
-        stream = BytesIO(data)
+        stream = memoryview(data)
     else:
         stream = data
     return chunk_generator(stream, min_size, avg_size, max_size, fat, hf)
@@ -32,17 +35,14 @@ def chunk_generator(stream, min_size, avg_size, max_size, fat, hf):
     mask_s = mask(bits + 1)
     mask_l = mask(bits - 1)
     read_size = max(1024 * 64, max_size)
-    blob = memoryview(stream.read(read_size))
     offset = 0
-    while blob:
-        if len(blob) <= max_size:
-            blob  = memoryview(bytes(blob) + stream.read(read_size))
+    while offset < len(stream):
+        blob = stream[offset:offset + read_size]
         cp = cdc_offset(blob, min_size, avg_size, max_size, cs, mask_s, mask_l)
         raw = bytes(blob[:cp]) if fat else b''
         h = hf(blob[:cp]).hexdigest() if hf else ''
         yield Chunk(offset, cp, raw, h)
         offset += cp
-        blob = blob[cp:]

dw · Oct 07 '23 09:10