
Feature Request - n_process with multi-GPU support

Open ginward opened this issue 3 years ago • 0 comments

This is related to https://github.com/explosion/spaCy/discussions/8782

Currently spaCy supports the argument n_process, but it does not distribute the work across different GPUs. Suppose I have four GPUs on one machine: it would be nice if I could start one worker process per GPU, as in the following code (though I am not sure this is the correct way to do it):
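For context, this is roughly what n_process does today: it fans the work out across CPU worker processes only, with no notion of a per-worker device. A minimal sketch (using a blank English pipeline just so the snippet runs without downloading a model):

```python
import spacy

# n_process parallelizes nlp.pipe across CPU processes; every worker
# runs on the same device, which is what this issue asks to change.
nlp = spacy.blank("en")
texts = ["This is sentence number %d." % i for i in range(100)]
docs = list(nlp.pipe(texts, n_process=2, batch_size=25))
print(len(docs))  # one Doc per input text -> 100
```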


from joblib import Parallel, delayed
import cupy

N_GPUS = 4  # number of GPUs available on the machine

def chunker(iterable, total_length, chunksize):
    return (iterable[pos: pos + chunksize] for pos in range(0, total_length, chunksize))

def flatten(list_of_lists):
    "Flatten a list of lists into a single combined list"
    return [item for sublist in list_of_lists for item in sublist]

def lemmatize_pipe(doc):
    # Placeholder per-doc processing step
    return [token.lemma_ for token in doc]

def process_chunk(texts, rank):
    # The GPU id must be passed in explicitly: a global counter would not
    # be shared across worker processes, so every worker would see rank == 0.
    with cupy.cuda.Device(rank):
        import spacy
        from thinc.api import set_gpu_allocator, require_gpu
        set_gpu_allocator("pytorch")
        require_gpu(rank)
        nlp = spacy.load("en_core_web_trf")  # load the pipeline inside the worker
        return [lemmatize_pipe(doc) for doc in nlp.pipe(texts, batch_size=20)]

def preprocess_parallel(texts, chunksize=100):
    executor = Parallel(n_jobs=N_GPUS, backend='multiprocessing', prefer="processes")
    do = delayed(process_chunk)
    tasks = (do(chunk, i % N_GPUS)
             for i, chunk in enumerate(chunker(texts, len(texts), chunksize=chunksize)))
    result = executor(tasks)
    return flatten(result)

preprocess_parallel(texts=["His friend Nicolas J. Smith is here with Bart Simpon and Fred."] * 100, chunksize=25)
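Setting the GPU bookkeeping aside, the chunking arithmetic the sketch above relies on can be checked on plain Python lists, using the same chunker/flatten helpers and a hypothetical round-robin `i % n_gpus` device assignment:

```python
def chunker(iterable, total_length, chunksize):
    return (iterable[pos: pos + chunksize] for pos in range(0, total_length, chunksize))

def flatten(list_of_lists):
    return [item for sublist in list_of_lists for item in sublist]

texts = ["text-%d" % i for i in range(10)]
chunks = list(chunker(texts, len(texts), chunksize=3))

# 10 items in chunks of 3 -> 4 chunks, the last one short
print([len(c) for c in chunks])             # [3, 3, 3, 1]

# round-robin assignment of chunk index to 4 GPUs
print([i % 4 for i in range(len(chunks))])  # [0, 1, 2, 3]

# flattening the chunked results restores the original order
print(flatten(chunks) == texts)             # True
```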

ginward avatar Jul 21 '21 11:07 ginward