
Allow sharing a loaded index across processes to avoid per-worker memory duplication

Open · leanhdung1994 opened this issue 1 month ago · 1 comment

I have a Python file:

from concurrent.futures import ProcessPoolExecutor
import tarfile
import indexed_gzip as igzip

def processNdjson(ndjsonName):
    # Each worker re-opens the archive and re-loads the index from disk
    with igzip.IndexedGzipFile(str(inTarDir), index_file=indexGzipDir) as myZip:
        with tarfile.open(fileobj=myZip, mode="r:*") as f:
            member = f.getmember(ndjsonName)
            dataFile = f.extractfile(member)
            for oneLine in dataFile:
                pass  # process oneLine here

if __name__ == "__main__":
    indexGzipDir = ...
    inTarDir = ...
    nCore = 5
    ndjsonNames = ["name1.ndjson", "name2.ndjson"]

    with ProcessPoolExecutor(nCore) as pool:
        results = pool.map(processNdjson, ndjsonNames)

Above,

  • inTarDir is the path to a .tar.gz file that contains multiple .ndjson files.
  • indexGzipDir is the path to the pre-built index file used by indexed_gzip.
  • Each process opens the archive with:

with igzip.IndexedGzipFile(str(inTarDir), index_file=indexGzipDir) as myZip:
    with tarfile.open(fileobj=myZip, mode="r:*") as f:
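
One partial workaround with the existing API is sketched below: the parent reads the index file into memory once and ships the raw bytes to each worker, which loads them via IndexedGzipFile.import_index(fileobj=...) on an in-memory buffer. This only avoids repeated disk reads of the index; each process still deserialises its own copy, so it does not remove the per-worker RAM cost that this issue is about. The function name processNdjsonShared, the argument packing, and the file paths are hypothetical.

import io
import tarfile
from concurrent.futures import ProcessPoolExecutor
import indexed_gzip as igzip

def processNdjsonShared(args):
    # ndjsonName: archive member to process; tarPath: path to the .tar.gz;
    # indexBytes: raw index file contents, read once in the parent
    ndjsonName, tarPath, indexBytes = args
    with igzip.IndexedGzipFile(tarPath) as myZip:
        # import_index() accepts a file-like object, so the index can be
        # loaded from an in-memory buffer instead of the on-disk file
        myZip.import_index(fileobj=io.BytesIO(indexBytes))
        with tarfile.open(fileobj=myZip, mode="r:*") as f:
            dataFile = f.extractfile(f.getmember(ndjsonName))
            for oneLine in dataFile:
                pass  # process oneLine here

if __name__ == "__main__":
    with open("data.tar.gz.gzidx", "rb") as fh:  # hypothetical index path
        indexBytes = fh.read()
    names = ["name1.ndjson", "name2.ndjson"]
    args = [(n, "data.tar.gz", indexBytes) for n in names]  # hypothetical tar path
    with ProcessPoolExecutor(5) as pool:
        list(pool.map(processNdjsonShared, args))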

My concern: every call that passes index_file=indexGzipDir loads the full index into that worker's RAM (for example, about 1.2 GB for a 20 GB .tar.gz file), so total memory use grows linearly with nCore:

[Screenshot: per-process memory usage growing linearly with the number of workers]
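
A separate mitigation may also be worth noting: the index size (and hence the per-worker RAM) scales with the number of seek points, so rebuilding the index with a larger spacing shrinks it at the cost of slower random access. A minimal sketch using the existing spacing keyword, build_full_index(), and export_index(); the 16 MiB value and file paths are illustrative:

import indexed_gzip as igzip

# Rebuild the index with fewer seek points: a larger spacing means a
# smaller index file and less RAM per worker, but coarser random access
with igzip.IndexedGzipFile("data.tar.gz", spacing=16 * 1024 * 1024) as myZip:
    myZip.build_full_index()
    myZip.export_index("data.tar.gz.gzidx")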

It would be great if we could load indexGzipDir once into some in-memory index_map object. Then every with igzip.IndexedGzipFile(str(inTarDir), index_file=indexGzipDir) as myZip in each worker could reuse that index_map object instead of loading its own copy.
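
To make the request concrete, here is a purely hypothetical sketch of what such an API could look like; neither igzip.load_index() nor the index= keyword exists in indexed_gzip today:

import indexed_gzip as igzip

# Hypothetical: parse the index file once into a shareable, read-only object
index_map = igzip.load_index(indexGzipDir)  # does not exist today

def processNdjson(ndjsonName):
    # Hypothetical: attach the already-parsed index instead of re-reading
    # and re-deserialising the index file in every worker
    with igzip.IndexedGzipFile(str(inTarDir), index=index_map) as myZip:
        ...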

Thank you for your consideration.

leanhdung1994 · Nov 29 '25 10:11