Parallel setting does not work on Windows

Open dreadatour opened this issue 10 months ago • 0 comments

Example script:

from ultralytics import YOLO

from datachain import C, DataChain, File
from datachain.model.ultralytics import YoloBBoxes


def process_bboxes(yolo: YOLO, file: File) -> YoloBBoxes:
    results = yolo(file.as_image_file().read(), verbose=False)
    return YoloBBoxes.from_results(results)


(
    DataChain.from_storage("gs://datachain-demo/openimages-v6-test-jsonpairs/")
    .filter(C("file.path").glob("*.jpg"))
    .limit(20)
    .settings(parallel=4, prefetch=4)
    .setup(yolo=lambda: YOLO("yolo11n.pt"))
    .map(boxes=process_bboxes)
    .show()
)

The script above failed to run on Windows with the error "PytorchStreamReader failed reading file data/407: file read failed" (see this CI run).

It works fine without the parallel setting (.settings(parallel=4, prefetch=4)), and it also works fine on Linux and OS X.

It looks like, on Windows, running with the parallel setting downloads the same "yolo11n.pt" file several times, and the UDF then fails to read it because a download in another process has corrupted it.
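If that hypothesis is right, one possible mitigation is to make sure no process ever reads a half-written file. The sketch below is only an illustration of that idea using the standard library: `fetch_once` and its `download` callback are hypothetical names, not part of datachain or ultralytics, and the callback stands in for whatever actually fetches the weights. It has not been tested against the failing CI setup.

```python
import os
import tempfile
from pathlib import Path
from typing import Callable


def fetch_once(target: Path, download: Callable[[Path], None]) -> Path:
    """Hypothetical helper: publish `target` only after it is fully written."""
    if target.exists():
        return target  # already published by this or another process
    target.parent.mkdir(parents=True, exist_ok=True)
    # Each process writes to its own uniquely named temp file, so concurrent
    # downloads never write into the same path.
    fd, tmp_name = tempfile.mkstemp(dir=target.parent, suffix=".part")
    os.close(fd)
    tmp = Path(tmp_name)
    try:
        download(tmp)  # caller-supplied; stands in for the real download
        try:
            os.replace(tmp, target)  # rename within one directory
        except PermissionError:
            # On Windows the rename can fail if another process already
            # published the file and has it open; accept that copy instead.
            if not target.exists():
                raise
    finally:
        tmp.unlink(missing_ok=True)  # no-op once the rename has succeeded
    return target
```

Several workers may still download the file at the same time, but each one writes to a separate temporary path, and readers only ever see a complete file at `target`. A simpler workaround that might also help is to load the model once in the main process before building the chain, so the weights are already on disk when the workers start; that is an untested assumption as well.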

dreadatour, Mar 12 '25 16:03