indexed_gzip
Allow sharing a loaded index across processes to avoid per-worker memory duplication
I have a Python file:

```python
import tarfile
from concurrent.futures import ProcessPoolExecutor

import indexed_gzip as igzip

def processNdjson(ndjsonName):
    with igzip.IndexedGzipFile(str(inTarDir), index_file=indexGzipDir) as myZip:
        with tarfile.open(fileobj=myZip, mode="r:*") as f:
            member = f.getmember(ndjsonName)
            dataFile = f.extractfile(member)
            for oneLine in dataFile:
                # process oneLine here
                ...

if __name__ == "__main__":
    indexGzipDir = ...
    inTarDir = ...
    nCore = 5
    ndjsonNames = ["name1.ndjson", "name2.ndjson"]
    with ProcessPoolExecutor(nCore) as pool:
        results = pool.map(processNdjson, ndjsonNames)
```
Above:

- `inTarDir` is the path to a `.tar.gz` file that contains multiple `.ndjson` files.
- `indexGzipDir` is the path to the pre-built index file to be used by `indexed_gzip`.
- Each process runs:

```python
with igzip.IndexedGzipFile(str(inTarDir), index_file=indexGzipDir) as myZip:
    with tarfile.open(fileobj=myZip, mode="r:*") as f:
```
My concern: each `index_file=indexGzipDir` load takes up a certain amount of RAM (for example, about 1.2 GB for a 20 GB `.tar.gz` file), and this grows linearly with `nCore`.
It would be great if we could load `indexGzipDir` once into some `index_map` object in memory, so that every `with igzip.IndexedGzipFile(str(inTarDir), index_file=indexGzipDir) as myZip` in each worker could reuse that shared `index_map` object instead of loading its own copy.
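As a possible interim approach (a sketch, not a tested solution for `indexed_gzip` itself): the serialized index bytes could be placed in POSIX shared memory once, with each worker attaching to that block instead of re-reading the index file from disk. If `import_index` accepts a `fileobj` argument (I believe it does, but please correct me if not), each worker could wrap the shared buffer in a `BytesIO`. The decoded in-memory index would still be duplicated per process, which is why true sharing would need library support; the names `SHM_NAME`, `load_index_into_shm`, and `worker` below are my own, stdlib-only illustration:

```python
import io
from multiprocessing import shared_memory

SHM_NAME = "igzip_index"  # hypothetical name for the shared block

def load_index_into_shm(index_path):
    """Copy the on-disk index file into a shared memory block, once, in the parent."""
    with open(index_path, "rb") as f:
        data = f.read()
    shm = shared_memory.SharedMemory(name=SHM_NAME, create=True, size=len(data))
    shm.buf[:len(data)] = data
    return shm, len(data)

def worker(size):
    """Each worker attaches to the same block instead of re-reading the index file."""
    shm = shared_memory.SharedMemory(name=SHM_NAME)
    index_bytes = io.BytesIO(bytes(shm.buf[:size]))
    # With indexed_gzip, this BytesIO could (hypothetically) be passed to
    # IndexedGzipFile.import_index(fileobj=index_bytes) instead of index_file=...
    n = len(index_bytes.getvalue())
    shm.close()
    return n

# Sketch of usage in the parent:
#   shm, size = load_index_into_shm(indexGzipDir)
#   with ProcessPoolExecutor(nCore) as pool:
#       pool.map(worker, [size] * nCore)
#   shm.close(); shm.unlink()
```

This only deduplicates the on-disk representation across workers, not the decoded index, so it is at best a partial workaround.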
Thank you for your consideration.