[BUG] Failed to bind to port with PyTorch DistributedDataParallel

Open Etoye opened this issue 2 years ago • 13 comments

System information

  • Ubuntu 18.04:
  • FiftyOne installed from pip:
  • FiftyOne version v0.13.2:
  • Python 3.6.9 (virtual env):

Commands to reproduce

python -u -m torch.distributed.launch --nproc_per_node=4 test.py

Describe the problem

FiftyOne can't bind to a port when the script is launched with PyTorch DistributedDataParallel. There are no problems with a single GPU, but with 2 or more GPUs the script terminates instantly and only very rarely runs without issues. It is probably directly related to the multi-process parallelism of DistributedDataParallel.

Other info / logs

*****************************************
Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.
*****************************************
Process 16911 (service/main.py --51-service db --multi) did not respond
{"t":{"$date":"2021-09-27T06:21:42.600Z"},"s":"I",  "c":"CONTROL",  "id":20697,   "ctx":"main","msg":"Renamed existing log file","attr":{"oldLogPath":"/home/ubuntu/.fiftyone/var/lib/mongo/log/mongo.log","newLogPath":"/home/ubuntu/.fiftyone/var/lib/mongo/log/mongo.log.2021-09-27T06-21-42"}}
Subprocess ['/home/ubuntu/OID/lib/python3.6/site-packages/fiftyone/db/bin/mongod', '--dbpath', '/home/ubuntu/.fiftyone/var/lib/mongo', '--logpath', '/home/ubuntu/.fiftyone/var/lib/mongo/log/mongo.log', '--port', '0', '--nounixsocket'] exited with error 100:
{"t":{"$date":"2021-09-27T06:21:42.609Z"},"s":"I",  "c":"CONTROL",  "id":20697,   "ctx":"main","msg":"Renamed existing log file","attr":{"oldLogPath":"/home/ubuntu/.fiftyone/var/lib/mongo/log/mongo.log","newLogPath":"/home/ubuntu/.fiftyone/var/lib/mongo/log/mongo.log.2021-09-27T06-21-42"}}
Subprocess ['/home/ubuntu/OID/lib/python3.6/site-packages/fiftyone/db/bin/mongod', '--dbpath', '/home/ubuntu/.fiftyone/var/lib/mongo', '--logpath', '/home/ubuntu/.fiftyone/var/lib/mongo/log/mongo.log', '--port', '0', '--nounixsocket'] exited with error 100:
{"t":{"$date":"2021-09-27T06:21:42.627Z"},"s":"I",  "c":"CONTROL",  "id":20697,   "ctx":"main","msg":"Renamed existing log file","attr":{"oldLogPath":"/home/ubuntu/.fiftyone/var/lib/mongo/log/mongo.log","newLogPath":"/home/ubuntu/.fiftyone/var/lib/mongo/log/mongo.log.2021-09-27T06-21-42"}}
Subprocess ['/home/ubuntu/OID/lib/python3.6/site-packages/fiftyone/db/bin/mongod', '--dbpath', '/home/ubuntu/.fiftyone/var/lib/mongo', '--logpath', '/home/ubuntu/.fiftyone/var/lib/mongo/log/mongo.log', '--port', '0', '--nounixsocket'] exited with error 100:
Uncaught exception
Traceback (most recent call last):
  File "train_test.py", line 7, in <module>
    from data_loader_fletch import create_data_loaders
  File "/home/ubuntu/LaboroTrainingOpenImages/data_loader_fletch.py", line 1, in <module>
    import fiftyone as fo
  File "/home/ubuntu/OID/lib/python3.6/site-packages/fiftyone/__init__.py", line 25, in <module>
    from fiftyone.__public__ import *
  File "/home/ubuntu/OID/lib/python3.6/site-packages/fiftyone/__public__.py", line 14, in <module>
    foo.establish_db_conn(config)
  File "/home/ubuntu/OID/lib/python3.6/site-packages/fiftyone/core/odm/database.py", line 77, in establish_db_conn
    port = _db_service.port
  File "/home/ubuntu/OID/lib/python3.6/site-packages/fiftyone/core/service.py", line 295, in port
    return self._wait_for_child_port()
  File "/home/ubuntu/OID/lib/python3.6/site-packages/fiftyone/core/service.py", line 179, in _wait_for_child_port
    return find_port()
  File "/home/ubuntu/OID/lib/python3.6/site-packages/retrying.py", line 49, in wrapped_f
    return Retrying(*dargs, **dkw).call(f, *args, **kw)
  File "/home/ubuntu/OID/lib/python3.6/site-packages/retrying.py", line 212, in call
    raise attempt.get()
  File "/home/ubuntu/OID/lib/python3.6/site-packages/retrying.py", line 247, in get
    six.reraise(self.value[0], self.value[1], self.value[2])
  File "/home/ubuntu/OID/lib/python3.6/site-packages/six.py", line 719, in reraise
    raise value
  File "/home/ubuntu/OID/lib/python3.6/site-packages/retrying.py", line 200, in call
    attempt = Attempt(fn(*args, **kwargs), attempt_number, False)
  File "/home/ubuntu/OID/lib/python3.6/site-packages/fiftyone/core/service.py", line 177, in find_port
    raise ServiceListenTimeout(etau.get_class_name(self), port)
fiftyone.core.service.ServiceListenTimeout: fiftyone.core.service.DatabaseService failed to bind to port

Willingness to contribute

The FiftyOne Community encourages bug fix contributions. Would you or another member of your organization be willing to contribute a fix for this bug to the FiftyOne codebase?

  • [ ] Yes. I can contribute a fix for this bug independently.
  • [x] Yes. I would be willing to contribute a fix for this bug with guidance from the FiftyOne community.
  • [ ] No. I cannot contribute a bug fix at this time.

Etoye avatar Sep 27 '21 06:09 Etoye

Hi @Etoye. Thanks for the issue. It seems we have some work to do to guarantee multiprocessing support.

The information you provided is fairly specific, but if you provide exact reproduction steps and code, we'll be able to resolve this more quickly, hopefully within the next couple of releases.

Thorough reproduction steps allow others to chime in more easily, as well.

benjaminpkane avatar Sep 27 '21 16:09 benjaminpkane

Hello @benjaminpkane, thank you for the prompt reply. Unfortunately, I can't provide the full code related to the issue due to an NDA, but I'll describe it in as much detail as possible.

  1. Download test subset
import fiftyone as fo
import fiftyone.zoo as foz

dataset = foz.load_zoo_dataset(
    "open-images-v6", 
    "train", 
    label_types=["classifications"], 
    classes = ["Dog", "Cat"],
    max_samples=100,
    seed=51,
    shuffle=True,
    dataset_name="open-images-test",
)

dataset.persistent = True
  2. data_loader.py: load the FiftyOne dataset and pass it into a PyTorch DataLoader through a custom dataset
def create_data_loaders(args):

    dataset = fo.load_dataset(args.dataset_name)
    train_dataset = OpenImagesDataset(dataset, transforms=tsfs)

    sampler_train = None
    if num_distrib() > 1:
        sampler_train = torch.utils.data.distributed.DistributedSampler(train_dataset)

    train_loader = torch.utils.data.DataLoader(
        dataset=train_dataset, batch_size=args.batch_size, 
        shuffle=sampler_train is None, collate_fn=collate_fn,
        pin_memory=True, sampler=sampler_train)

    return train_loader


class OpenImagesDataset(torch.utils.data.Dataset):
    def __init__(self):
        ....

    def __getitem__(self):
        ....
        return img, target
  3. train.py, where we call the create_data_loaders function and set up DistributedDataParallel
def main():
    # arguments
    args = parser.parse_args()

    # setup distributed
    if num_distrib() > 1:
        torch.cuda.set_device(args.local_rank)
        torch.distributed.init_process_group(backend='nccl', init_method='env://')

    # Data loading
    train_loader = create_data_loaders(args)

I think the closest reproduction code can be found in this repository, with one major difference in data loading: in our case FiftyOne takes care of the data, not the built-in PyTorch loaders.

Etoye avatar Sep 28 '21 03:09 Etoye

@benjaminpkane If we use torch.nn.DataParallel (single-process multi-thread parallelism) to use several GPUs, at first sight everything is all right except for a UserWarning raised every epoch during model training. I'm not sure whether it affects anything, but could you please have a look? Can we somehow hide this warning or fix it?

/home/ubuntu/OID/lib/python3.6/site-packages/pymongo/topology.py:165: UserWarning:

MongoClient opened before fork. Create MongoClient only after forking. See PyMongo's documentation for details: https://pymongo.readthedocs.io/en/stable/faq.html#is-pymongo-fork-safe

It repeats several times, and the number of repetitions depends on the number of GPUs you use. In our case it's the same UserWarning 4 times.

Etoye avatar Sep 28 '21 03:09 Etoye

Hi @Etoye 👋

Thanks for all the detailed information. While we're looking into proper distributed training support, let me mention a simple workaround:

Since FiftyOne datasets are stored in a database, calls like dataset[sample_id] involve a round trip to the database to retrieve the data. An alternative is to load the filepaths and labels that you'll need into memory in the constructor of your Torch Dataset; then the data loader never needs to communicate with FiftyOne's database, and distributed code will work fine (assuming all of the labels fit into memory, of course, which will certainly be true for classifications).

Continuing from your open images example code above, you can get the necessary data as follows:

filepaths, labels = dataset.values(["filepath", "positive_labels.classifications.label"])

Just to be safe, avoid storing a reference to the FiftyOne dataset in the constructor so that your dataset class is pickle-able.
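
A minimal sketch of this workaround, under stated assumptions: the class name InMemoryClassificationDataset and the multi-hot target encoding are hypothetical, the two lists are stand-ins for what dataset.values(...) would return (with each label entry assumed to be a list of class names or None), and the torch.utils.data.Dataset base class is left out so the sketch runs without PyTorch. In real training code you would subclass it and load the image in __getitem__.

```python
import pickle


class InMemoryClassificationDataset:
    """Map-style dataset that holds only plain Python lists (hypothetical name)."""

    def __init__(self, filepaths, labels, classes):
        # Copy into plain lists; no reference to the FiftyOne dataset is kept.
        self.filepaths = list(filepaths)
        # One multi-hot target per sample; `sample_labels` may be None
        # when a sample has no labels.
        self.targets = [
            [1.0 if c in (sample_labels or []) else 0.0 for c in classes]
            for sample_labels in labels
        ]

    def __len__(self):
        return len(self.filepaths)

    def __getitem__(self, idx):
        # Real code would open and transform the image at this path.
        return self.filepaths[idx], self.targets[idx]


# Stand-in values for the lists returned by dataset.values([...]).
filepaths = ["/data/a.jpg", "/data/b.jpg"]
labels = [["Dog"], ["Cat", "Dog"]]

train_dataset = InMemoryClassificationDataset(
    filepaths, labels, classes=["Dog", "Cat"]
)

# The instance state is only built-in lists, strings, and floats, so it
# survives a pickle round trip without touching any database connection.
state = pickle.loads(pickle.dumps(vars(train_dataset)))
```

Because the constructor receives lists rather than the dataset object, the FiftyOne call stays outside the class and can live in a separate script.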

brimoor avatar Sep 28 '21 13:09 brimoor

For example, FiftyOne is integrated with PyTorch Lightning Flash, and that's how things are implemented there: https://github.com/PyTorchLightning/lightning-flash/blob/9e4fb62e0a7b2d3f265e574172bf88cc7b84924d/flash/core/data/data_source.py#L658-L689

brimoor avatar Sep 28 '21 13:09 brimoor

Hi @brimoor,

Thank you for the hint! After the modification, the UserWarning mentioned above no longer appears and the dataset loads as intended with torch.nn.DataParallel. However, using torch.nn.parallel.DistributedDataParallel still produces the same error I described in the bug report. Full message:

Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.
*****************************************
Process 61778 (service/main.py --51-service db --multi) did not respond
{"t":{"$date":"2021-09-29T03:14:05.228Z"},"s":"F",  "c":"CONTROL",  "id":20574,   "ctx":"main","msg":"Error during global initialization","attr":{"error":{"code":37,"codeName":"FileRenameFailed","errmsg":"Could not rename preexisting log file \"/home/ubuntu/.fiftyone/var/lib/mongo/log/mongo.log\" to \"/home/ubuntu/.fiftyone/var/lib/mongo/log/mongo.log.2021-09-29T03-14-05\"; run with --logappend or manually remove file: No such file or directory"}}}
Subprocess ['/home/ubuntu/OID/lib/python3.6/site-packages/fiftyone/db/bin/mongod', '--dbpath', '/home/ubuntu/.fiftyone/var/lib/mongo', '--logpath', '/home/ubuntu/.fiftyone/var/lib/mongo/log/mongo.log', '--port', '0', '--nounixsocket'] exited with error 1:
{"t":{"$date":"2021-09-29T03:14:05.228Z"},"s":"I",  "c":"CONTROL",  "id":20697,   "ctx":"main","msg":"Renamed existing log file","attr":{"oldLogPath":"/home/ubuntu/.fiftyone/var/lib/mongo/log/mongo.log","newLogPath":"/home/ubuntu/.fiftyone/var/lib/mongo/log/mongo.log.2021-09-29T03-14-05"}}
Subprocess ['/home/ubuntu/OID/lib/python3.6/site-packages/fiftyone/db/bin/mongod', '--dbpath', '/home/ubuntu/.fiftyone/var/lib/mongo', '--logpath', '/home/ubuntu/.fiftyone/var/lib/mongo/log/mongo.log', '--port', '0', '--nounixsocket'] exited with error 100:
{"t":{"$date":"2021-09-29T03:14:05.256Z"},"s":"I",  "c":"CONTROL",  "id":20697,   "ctx":"main","msg":"Renamed existing log file","attr":{"oldLogPath":"/home/ubuntu/.fiftyone/var/lib/mongo/log/mongo.log","newLogPath":"/home/ubuntu/.fiftyone/var/lib/mongo/log/mongo.log.2021-09-29T03-14-05"}}
Subprocess ['/home/ubuntu/OID/lib/python3.6/site-packages/fiftyone/db/bin/mongod', '--dbpath', '/home/ubuntu/.fiftyone/var/lib/mongo', '--logpath', '/home/ubuntu/.fiftyone/var/lib/mongo/log/mongo.log', '--port', '0', '--nounixsocket'] exited with error 100:
Uncaught exception
Traceback (most recent call last):
  File "train.py", line 10, in <module>
    import fiftyone as fo
  File "/home/ubuntu/OID/lib/python3.6/site-packages/fiftyone/__init__.py", line 25, in <module>
    from fiftyone.__public__ import *
  File "/home/ubuntu/OID/lib/python3.6/site-packages/fiftyone/__public__.py", line 14, in <module>
    foo.establish_db_conn(config)
  File "/home/ubuntu/OID/lib/python3.6/site-packages/fiftyone/core/odm/database.py", line 77, in establish_db_conn
    port = _db_service.port
  File "/home/ubuntu/OID/lib/python3.6/site-packages/fiftyone/core/service.py", line 295, in port
    return self._wait_for_child_port()
  File "/home/ubuntu/OID/lib/python3.6/site-packages/fiftyone/core/service.py", line 179, in _wait_for_child_port
    return find_port()
  File "/home/ubuntu/OID/lib/python3.6/site-packages/retrying.py", line 49, in wrapped_f
    return Retrying(*dargs, **dkw).call(f, *args, **kw)
  File "/home/ubuntu/OID/lib/python3.6/site-packages/retrying.py", line 212, in call
    raise attempt.get()
  File "/home/ubuntu/OID/lib/python3.6/site-packages/retrying.py", line 247, in get
    six.reraise(self.value[0], self.value[1], self.value[2])
  File "/home/ubuntu/OID/lib/python3.6/site-packages/six.py", line 719, in reraise
    raise value
  File "/home/ubuntu/OID/lib/python3.6/site-packages/retrying.py", line 200, in call
    attempt = Attempt(fn(*args, **kwargs), attempt_number, False)
  File "/home/ubuntu/OID/lib/python3.6/site-packages/fiftyone/core/service.py", line 177, in find_port
    raise ServiceListenTimeout(etau.get_class_name(self), port)
fiftyone.core.service.ServiceListenTimeout: fiftyone.core.service.DatabaseService failed to bind to port
Uncaught exception
Traceback (most recent call last):
  File "train.py", line 10, in <module>
    import fiftyone as fo
  File "/home/ubuntu/OID/lib/python3.6/site-packages/fiftyone/__init__.py", line 25, in <module>
    from fiftyone.__public__ import *
  File "/home/ubuntu/OID/lib/python3.6/site-packages/fiftyone/__public__.py", line 14, in <module>
    foo.establish_db_conn(config)
  File "/home/ubuntu/OID/lib/python3.6/site-packages/fiftyone/core/odm/database.py", line 77, in establish_db_conn
    port = _db_service.port
  File "/home/ubuntu/OID/lib/python3.6/site-packages/fiftyone/core/service.py", line 295, in port
    return self._wait_for_child_port()
  File "/home/ubuntu/OID/lib/python3.6/site-packages/fiftyone/core/service.py", line 179, in _wait_for_child_port
    return find_port()
  File "/home/ubuntu/OID/lib/python3.6/site-packages/retrying.py", line 49, in wrapped_f
    return Retrying(*dargs, **dkw).call(f, *args, **kw)
  File "/home/ubuntu/OID/lib/python3.6/site-packages/retrying.py", line 212, in call
    raise attempt.get()
  File "/home/ubuntu/OID/lib/python3.6/site-packages/retrying.py", line 247, in get
    six.reraise(self.value[0], self.value[1], self.value[2])
  File "/home/ubuntu/OID/lib/python3.6/site-packages/six.py", line 719, in reraise
    raise value
  File "/home/ubuntu/OID/lib/python3.6/site-packages/retrying.py", line 200, in call
    attempt = Attempt(fn(*args, **kwargs), attempt_number, False)
  File "/home/ubuntu/OID/lib/python3.6/site-packages/fiftyone/core/service.py", line 177, in find_port
    raise ServiceListenTimeout(etau.get_class_name(self), port)
fiftyone.core.service.ServiceListenTimeout: fiftyone.core.service.DatabaseService failed to bind to port
Uncaught exception
Traceback (most recent call last):
  File "train.py", line 10, in <module>
    import fiftyone as fo
  File "/home/ubuntu/OID/lib/python3.6/site-packages/fiftyone/__init__.py", line 25, in <module>
    from fiftyone.__public__ import *
  File "/home/ubuntu/OID/lib/python3.6/site-packages/fiftyone/__public__.py", line 14, in <module>
    foo.establish_db_conn(config)
  File "/home/ubuntu/OID/lib/python3.6/site-packages/fiftyone/core/odm/database.py", line 77, in establish_db_conn
    port = _db_service.port
  File "/home/ubuntu/OID/lib/python3.6/site-packages/fiftyone/core/service.py", line 295, in port
    return self._wait_for_child_port()
  File "/home/ubuntu/OID/lib/python3.6/site-packages/fiftyone/core/service.py", line 179, in _wait_for_child_port
    return find_port()
  File "/home/ubuntu/OID/lib/python3.6/site-packages/retrying.py", line 49, in wrapped_f
    return Retrying(*dargs, **dkw).call(f, *args, **kw)
  File "/home/ubuntu/OID/lib/python3.6/site-packages/retrying.py", line 212, in call
    raise attempt.get()
  File "/home/ubuntu/OID/lib/python3.6/site-packages/retrying.py", line 247, in get
    six.reraise(self.value[0], self.value[1], self.value[2])
  File "/home/ubuntu/OID/lib/python3.6/site-packages/six.py", line 719, in reraise
    raise value
  File "/home/ubuntu/OID/lib/python3.6/site-packages/retrying.py", line 200, in call
    attempt = Attempt(fn(*args, **kwargs), attempt_number, False)
  File "/home/ubuntu/OID/lib/python3.6/site-packages/fiftyone/core/service.py", line 177, in find_port
    raise ServiceListenTimeout(etau.get_class_name(self), port)
fiftyone.core.service.ServiceListenTimeout: fiftyone.core.service.DatabaseService failed to bind to port
Killing subprocess 61762
Killing subprocess 61763
Killing subprocess 61764
Killing subprocess 61765
Traceback (most recent call last):
  File "/usr/lib/python3.6/runpy.py", line 193, in _run_module_as_main
    "__main__", mod_spec)
  File "/usr/lib/python3.6/runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "/home/ubuntu/OID/lib/python3.6/site-packages/torch/distributed/launch.py", line 340, in <module>
    main()
  File "/home/ubuntu/OID/lib/python3.6/site-packages/torch/distributed/launch.py", line 326, in main
    sigkill_handler(signal.SIGTERM, None)  # not coming back
  File "/home/ubuntu/OID/lib/python3.6/site-packages/torch/distributed/launch.py", line 301, in sigkill_handler
    raise subprocess.CalledProcessError(returncode=last_return_code, cmd=cmd)
subprocess.CalledProcessError: Command '['/home/ubuntu/OID/bin/python', '-u', 'train.py', '--local_rank=3', '--dataset_name', 'open-images-test', '--epochs', '3']' returned non-zero exit status 1.

Etoye avatar Sep 29 '21 03:09 Etoye

Right, this is because (as the stack trace shows) import fiftyone itself tries to establish a database connection.

So until we're able to properly support distributed training workflows like this, you'll need to refactor your code to avoid importing FiftyOne at all in any script that is being run via DistributedDataParallel. For example, you could write the requisite filepaths and labels to disk in, say, JSON format and load from there in your training script.
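
One way this JSON hand-off could look, as a sketch rather than a recommended implementation: the function names export_manifest and load_manifest, the record layout, and the file name are all hypothetical, and the lists passed to the export step are stand-ins for values read from FiftyOne in a separate, single-process script.

```python
import json
import os
import tempfile


def export_manifest(filepaths, labels, path):
    """Write parallel lists of file paths and labels to a JSON file."""
    records = [
        {"filepath": filepath, "labels": sample_labels}
        for filepath, sample_labels in zip(filepaths, labels)
    ]
    with open(path, "w") as fh:
        json.dump(records, fh)


def load_manifest(path):
    """Read the manifest back using only the standard library."""
    with open(path) as fh:
        records = json.load(fh)
    return [r["filepath"] for r in records], [r["labels"] for r in records]


# Export side: run once in a single process that imports fiftyone and passes
# in the lists it read from the dataset. Stand-in values are used here.
manifest_path = os.path.join(tempfile.mkdtemp(), "open-images-test.json")
export_manifest(["/data/a.jpg", "/data/b.jpg"], [["Dog"], ["Cat"]], manifest_path)

# Training side: every distributed worker runs only this part, so the
# training script never imports fiftyone.
train_filepaths, train_labels = load_manifest(manifest_path)
```

Only load_manifest needs to exist in the script started by the distributed launcher; the export step can be rerun whenever the dataset changes.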

brimoor avatar Sep 29 '21 03:09 brimoor

Sounds like a plan, thank you for the prompt support. Should we close this issue until you add support for distributed training workflow or keep it open?

Etoye avatar Sep 29 '21 03:09 Etoye

Let's leave it open to remind us to fix it properly :)

brimoor avatar Sep 29 '21 03:09 brimoor

Hi, is there any progress on the MongoDB fork warning? I am trying to load a dataset hosted on MongoDB with multiple workers and keep getting this warning. I would appreciate any update!

EilsonH avatar Aug 18 '22 04:08 EilsonH

No update yet. import fiftyone currently always opens a database connection in the main process, and those connections can't be serialized. It seems we'd have to find a way to omit the connection from the serialization.
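
For illustration only, here is a generic Python pattern for leaving a connection out of serialization. It is not FiftyOne's code: FakeConnection, DatasetHandle, and the URI are hypothetical, and a threading lock stands in for the parts of a real client that cannot be pickled.

```python
import pickle
import threading


class FakeConnection:
    """Stand-in for a database client; the lock makes it unpicklable."""

    def __init__(self, uri):
        self.uri = uri
        self._lock = threading.Lock()


class DatasetHandle:
    """Hypothetical wrapper that reconnects lazily after being unpickled."""

    def __init__(self, uri, name):
        self.uri = uri
        self.name = name
        self._conn = None  # opened on first use, not at construction

    @property
    def conn(self):
        if self._conn is None:
            self._conn = FakeConnection(self.uri)
        return self._conn

    def __getstate__(self):
        # Drop the live connection; keep only what is needed to reconnect.
        state = self.__dict__.copy()
        state["_conn"] = None
        return state

    def __setstate__(self, state):
        self.__dict__.update(state)


# The URI below is a placeholder, not a value taken from this thread.
handle = DatasetHandle("mongodb://localhost:27017", "open-images-test")
parent_conn = handle.conn  # the parent process opens its own connection

# Pickling succeeds because the payload carries no connection object.
clone = pickle.loads(pickle.dumps(handle))
child_conn = clone.conn  # the receiving process opens a fresh connection
```

The open question raised above is a different one: import fiftyone opens its connection at import time, so a pattern like this would only help if that connection were also created lazily.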

brimoor avatar Aug 18 '22 13:08 brimoor

Thanks for the very useful information in this thread. Having been on this journey myself, I'd like to ask whether this detail could be made more prominent in the documentation, or at least highlighted in the PyTorch example here: https://github.com/voxel51/fiftyone-examples/blob/master/examples/pytorch_detection_training.ipynb. The UserWarnings are present in the quoted program outputs there but are not referred to at all.

tom-robo avatar Nov 16 '22 11:11 tom-robo

Hi, is there any progress on concurrent executions? I am trying to run Ray Tune with FiftyOne, but I get the same error when there are two or more workers at the same time. Thanks!

davidpob99 avatar Jun 06 '24 08:06 davidpob99