
Inconsistent NER predictions from identical inputs while using ThreadPoolExecutor

Open pege345 opened this issue 3 years ago • 1 comment

When running data through the en_core_web_trf model concurrently, I get different results between runs. I cannot find anywhere in the documentation or other GitHub issues where this behaviour is explained.

The code below reproduces the behaviour. If I don't run the data through the pipeline concurrently (e.g. by setting max_workers=1), the result is always consistent.

import spacy
from concurrent.futures import ThreadPoolExecutor

nlp = spacy.load("en_core_web_trf")

def extract_entities(sentences):
    with ThreadPoolExecutor(max_workers=4) as e:
        submitted = [e.submit(call_spacy, sent) for sent in sentences]
        resolved = [item.result() for item in submitted]

        return resolved

def call_spacy(sent):
    result = nlp(sent)
    return result.ents

input = [
    "CoCo Town also known as the Collective Commerce District or more simply as the Coco District was a dilapidated industrial area of the planet Coruscant.",
    "It was also the site of Dexs Diner a local eatery owned by Dexter Jettster during the Republic Era.",
    "Hard working laborers visited CoCo Town to congregate at the diner.",
    "During the Galactic Civil War the Galactic Empire and the New Republic fought for control of the region.",
    "Many orphans from the area formed the Anklebiter Brigade and fought alongside the rebels sabotaging the Empire wherever possible."
]

for i in range(10):
    result = extract_entities(input)
    print(sum([len(x) for x in result]))

Your Environment

  • Operating System: Amazon Linux 2 (Linux kernel 4.14.294-220.533.amzn2.x86_64)
  • Python Version Used: Python 3.7.10
  • spaCy Version Used: 3.1.3
  • Environment Information: en-core-web-trf==3.1.0

pege345 avatar Nov 25 '22 02:11 pege345

I can reproduce this, but it's probably related to torch rather than spaCy directly, and I'm not sure what in torch might be causing it. We'll take a look!

As a first alternative, we'd recommend trying our built-in multiprocessing with nlp.pipe:

import spacy
import torch

torch.set_num_threads(1)

nlp = spacy.load("en_core_web_trf")

input = [
    "CoCo Town also known as the Collective Commerce District or more simply as the Coco District was a dilapidated industrial area of the planet Coruscant.",
    "It was also the site of Dexs Diner a local eatery owned by Dexter Jettster during the Republic Era.",
    "Hard working laborers visited CoCo Town to congregate at the diner.",
    "During the Galactic Civil War the Galactic Empire and the New Republic fought for control of the region.",
    "Many orphans from the area formed the Anklebiter Brigade and fought alongside the rebels sabotaging the Empire wherever possible."
]

for i in range(10):
    print(sum(len(doc.ents) for doc in nlp.pipe(input, n_process=4)))

Notes:

  • You usually need torch.set_num_threads(1) to avoid a deadlock related to multiprocessing with torch (more details in #4667).
  • Whether or not you use nlp.pipe(n_process=) for multiprocessing, you should process texts in batches with nlp.pipe for improved speed (see the sketch after this list).
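
For reference, a minimal sketch of that batched, single-process usage, replacing the per-sentence nlp() calls from the original snippet (the batch_size value here is only illustrative, not a recommendation from this thread):

import spacy

nlp = spacy.load("en_core_web_trf")

texts = [
    "CoCo Town was a dilapidated industrial area of the planet Coruscant.",
    "Hard working laborers visited CoCo Town to congregate at the diner.",
]

# nlp.pipe processes the texts in batches within a single process and
# yields Doc objects in the same order as the input texts.
for doc in nlp.pipe(texts, batch_size=32):  # batch_size chosen arbitrarily for illustration
    print([(ent.text, ent.label_) for ent in doc.ents])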

adrianeboyd avatar Nov 25 '22 09:11 adrianeboyd