Chunk and Region iterators in WorldFolder

Open macfreek opened this issue 12 years ago • 1 comments

A relatively recent addition to NBT is world.py, which provides the WorldFolder class. It is intended for tools that iterate through all chunks without caring which region file each chunk lives in.

A common complaint I hear is that NBT is slow. One way to speed things up is to process each region file in a separate subprocess and then combine the results (a Map-Reduce pattern). The best way to implement this is with a callback function.

E.g.:

def count_blocks(chunk):
    """Given a chunk, return the number of blocks per block ID in this chunk"""
    chunk_block_count = [0]*256   # array of 256 integers, one for each block ID
    for block_id in chunk.get_all_blocks():
        chunk_block_count[block_id] += 1
    return chunk_block_count

def summarize_blocks(chunk_block_counts):
    """Given multiple chunk_block_count arrays, add them together."""
    total_block_count = [0]*256   # array of 256 integers, one for each block ID
    for chunk_block_count in chunk_block_counts:
        for block_id in range(256):
            total_block_count[block_id] += chunk_block_count[block_id]
    return total_block_count

world = WorldFolder(myfolder)
block_count = world.chunk_mapreduce(count_blocks, summarize_blocks)

However, I fear that the term "mapreduce" is not well known to all programmers, and I'm looking for an easier name. Would the following be easier to understand?

world = WorldFolder(myfolder)
chunk_block_counts = world.process_chunks(count_blocks)
block_count = summarize_blocks(chunk_block_counts)

The advantage is that the parallelisation can happen behind the scenes (though the multiprocessing.Pool class already makes it very easy).
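As a rough illustration of the "behind the scenes" idea, here is a minimal sketch under some assumptions. The helper `process_chunks` below is a hypothetical free function, not an existing WorldFolder method. A "chunk" is simplified to a plain list of block IDs, and the work is split per chunk rather than per region file. The optional `pool` argument can be any object offering `imap_unordered`, such as a `multiprocessing.Pool`.

```python
from multiprocessing import Pool


def count_blocks(chunk):
    """Map step: count blocks per block ID.

    Assumption: a 'chunk' here is just an iterable of block IDs (0-255),
    standing in for a real NBT chunk object.
    """
    counts = [0] * 256
    for block_id in chunk:
        counts[block_id] += 1
    return counts


def summarize_blocks(chunk_block_counts):
    """Reduce step: add the per-chunk count arrays together."""
    total = [0] * 256
    for counts in chunk_block_counts:
        for block_id, n in enumerate(counts):
            total[block_id] += n
    return total


def process_chunks(chunks, func, pool=None):
    """Hypothetical helper: lazily apply func to every chunk.

    Without a pool the work runs sequentially in this process. With a pool
    the calls are spread over its workers, and results may arrive in any
    order, which is fine when the reduce step is order-independent.
    """
    if pool is None:
        return map(func, chunks)
    return pool.imap_unordered(func, chunks)


def count_world_blocks(chunks, processes=None):
    """Run the map step in worker processes and reduce in the parent."""
    with Pool(processes) as pool:
        # The results are fully consumed before the pool is shut down.
        return summarize_blocks(process_chunks(chunks, count_blocks, pool))
```

The caller only sees `process_chunks` and `summarize_blocks`; whether a pool is used is an implementation detail. When using real worker processes, `count_world_blocks` should be called from under an `if __name__ == "__main__":` guard, and the callback must be a module-level function so it can be pickled.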

The disadvantage is that it adds a third method next to the existing get_chunks and iter_chunks methods in the WorldFolder class. In addition, there would probably also need to be process_nbt and process_regions methods next to process_chunks.

In retrospect, the difference between get_chunks (which returns a list) and iter_chunks (which returns an iterator) is so minor (iterators consume less memory, but lists can be cached) that it did not warrant having two functions.
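The trade-off can be shown with a small sketch. These are stand-in functions that only borrow the names get_chunks and iter_chunks; as an assumption for illustration, a "region" is a plain list of chunks rather than a region file.

```python
def iter_chunks(regions):
    """Yield chunks one at a time; only the current chunk is held in memory."""
    for region in regions:
        for chunk in region:
            yield chunk


def get_chunks(regions):
    """Build the full list up front; it can be cached and traversed again."""
    return list(iter_chunks(regions))


regions = [["c0", "c1"], ["c2"]]   # assumed stand-in data

lazy = iter_chunks(regions)
first_pass = list(lazy)     # consumes the iterator
second_pass = list(lazy)    # the iterator is now exhausted

cached = get_chunks(regions)
again = list(cached)        # a list can be read as often as needed
```

Since a caller who wants a cached list can always write `list(iter_chunks(...))`, the iterator version alone covers both uses.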

I'm inclined to remove the cached get_chunks (though I liked the name better than iter_chunks).

Any opinions?

macfreek, Apr 12 '12 01:04

My opinion would be to remove get_chunks and rename iter_chunks. I think the usual usage is moving through the chunks, and caching chunks for later access is rarely needed. The multiprocessing/Map-Reduce part is beyond what I know, so I can't really say anything about that.

stumpylog, Apr 13 '12 06:04