fiftyone
[BUG] Failed to bind to port with PyTorch DistributedDataParallel
System information
- Ubuntu 18.04:
- FiftyOne installed from pip:
- FiftyOne version v0.13.2:
- Python 3.6.9 (virtual env):
Commands to reproduce
python -u -m torch.distributed.launch --nproc_per_node=4 test.py
Describe the problem
FiftyOne can't bind to a port when the script is launched with PyTorch DistributedDataParallel. There are no problems with a single GPU, but with 2 or more GPUs the script terminates almost instantly and only very rarely runs without issues. It is probably directly related to the multi-process parallelism of DistributedDataParallel.
Other info / logs
*****************************************
Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.
*****************************************
Process 16911 (service/main.py --51-service db --multi) did not respond
{"t":{"$date":"2021-09-27T06:21:42.600Z"},"s":"I", "c":"CONTROL", "id":20697, "ctx":"main","msg":"Renamed existing log file","attr":{"oldLogPath":"/home/ubuntu/.fiftyone/var/lib/mongo/log/mongo.log","newLogPath":"/home/ubuntu/.fiftyone/var/lib/mongo/log/mongo.log.2021-09-27T06-21-42"}}
Subprocess ['/home/ubuntu/OID/lib/python3.6/site-packages/fiftyone/db/bin/mongod', '--dbpath', '/home/ubuntu/.fiftyone/var/lib/mongo', '--logpath', '/home/ubuntu/.fiftyone/var/lib/mongo/log/mongo.log', '--port', '0', '--nounixsocket'] exited with error 100:
{"t":{"$date":"2021-09-27T06:21:42.609Z"},"s":"I", "c":"CONTROL", "id":20697, "ctx":"main","msg":"Renamed existing log file","attr":{"oldLogPath":"/home/ubuntu/.fiftyone/var/lib/mongo/log/mongo.log","newLogPath":"/home/ubuntu/.fiftyone/var/lib/mongo/log/mongo.log.2021-09-27T06-21-42"}}
Subprocess ['/home/ubuntu/OID/lib/python3.6/site-packages/fiftyone/db/bin/mongod', '--dbpath', '/home/ubuntu/.fiftyone/var/lib/mongo', '--logpath', '/home/ubuntu/.fiftyone/var/lib/mongo/log/mongo.log', '--port', '0', '--nounixsocket'] exited with error 100:
{"t":{"$date":"2021-09-27T06:21:42.627Z"},"s":"I", "c":"CONTROL", "id":20697, "ctx":"main","msg":"Renamed existing log file","attr":{"oldLogPath":"/home/ubuntu/.fiftyone/var/lib/mongo/log/mongo.log","newLogPath":"/home/ubuntu/.fiftyone/var/lib/mongo/log/mongo.log.2021-09-27T06-21-42"}}
Subprocess ['/home/ubuntu/OID/lib/python3.6/site-packages/fiftyone/db/bin/mongod', '--dbpath', '/home/ubuntu/.fiftyone/var/lib/mongo', '--logpath', '/home/ubuntu/.fiftyone/var/lib/mongo/log/mongo.log', '--port', '0', '--nounixsocket'] exited with error 100:
Uncaught exception
Traceback (most recent call last):
File "train_test.py", line 7, in <module>
from data_loader_fletch import create_data_loaders
File "/home/ubuntu/LaboroTrainingOpenImages/data_loader_fletch.py", line 1, in <module>
import fiftyone as fo
File "/home/ubuntu/OID/lib/python3.6/site-packages/fiftyone/__init__.py", line 25, in <module>
from fiftyone.__public__ import *
File "/home/ubuntu/OID/lib/python3.6/site-packages/fiftyone/__public__.py", line 14, in <module>
foo.establish_db_conn(config)
File "/home/ubuntu/OID/lib/python3.6/site-packages/fiftyone/core/odm/database.py", line 77, in establish_db_conn
port = _db_service.port
File "/home/ubuntu/OID/lib/python3.6/site-packages/fiftyone/core/service.py", line 295, in port
return self._wait_for_child_port()
File "/home/ubuntu/OID/lib/python3.6/site-packages/fiftyone/core/service.py", line 179, in _wait_for_child_port
return find_port()
File "/home/ubuntu/OID/lib/python3.6/site-packages/retrying.py", line 49, in wrapped_f
return Retrying(*dargs, **dkw).call(f, *args, **kw)
File "/home/ubuntu/OID/lib/python3.6/site-packages/retrying.py", line 212, in call
raise attempt.get()
File "/home/ubuntu/OID/lib/python3.6/site-packages/retrying.py", line 247, in get
six.reraise(self.value[0], self.value[1], self.value[2])
File "/home/ubuntu/OID/lib/python3.6/site-packages/six.py", line 719, in reraise
raise value
File "/home/ubuntu/OID/lib/python3.6/site-packages/retrying.py", line 200, in call
attempt = Attempt(fn(*args, **kwargs), attempt_number, False)
File "/home/ubuntu/OID/lib/python3.6/site-packages/fiftyone/core/service.py", line 177, in find_port
raise ServiceListenTimeout(etau.get_class_name(self), port)
fiftyone.core.service.ServiceListenTimeout: fiftyone.core.service.DatabaseService failed to bind to port
Willingness to contribute
The FiftyOne Community encourages bug fix contributions. Would you or another member of your organization be willing to contribute a fix for this bug to the FiftyOne codebase?
- [ ] Yes. I can contribute a fix for this bug independently.
- [x] Yes. I would be willing to contribute a fix for this bug with guidance from the FiftyOne community.
- [ ] No. I cannot contribute a bug fix at this time.
Hi @Etoye. Thanks for the issue. It seems we have some work to do to guarantee multiprocessing support.
The information you provided is fairly specific, but if you provide exact reproduction steps and code, we'll be able to resolve this more quickly, hopefully within the next couple of releases.
Thorough reproduction steps also allow others to chime in more easily.
Hello @benjaminpkane, thank you for the prompt reply. Unfortunately, I can't share the full code related to the issue because of an NDA, but I'll describe it in as much detail as possible.
- Download test subset
import fiftyone as fo
import fiftyone.zoo as foz

dataset = foz.load_zoo_dataset(
    "open-images-v6",
    "train",
    label_types=["classifications"],
    classes=["Dog", "Cat"],
    max_samples=100,
    seed=51,
    shuffle=True,
    dataset_name="open-images-test",
)
dataset.persistent = True
- data_loader.py: load the FiftyOne dataset and pass it to a PyTorch DataLoader through a custom dataset
import torch
import fiftyone as fo

# num_distrib(), tsfs, and collate_fn are not shown in this snippet
def create_data_loaders(args):
    dataset = fo.load_dataset(args.dataset_name)
    train_dataset = OpenImagesDataset(dataset, transforms=tsfs)
    sampler_train = None
    if num_distrib() > 1:
        # Shard the dataset across processes when running distributed
        sampler_train = torch.utils.data.distributed.DistributedSampler(train_dataset)
    train_loader = torch.utils.data.DataLoader(
        dataset=train_dataset, batch_size=args.batch_size,
        shuffle=sampler_train is None, collate_fn=collate_fn,
        pin_memory=True, sampler=sampler_train)
    return train_loader

class OpenImagesDataset(torch.utils.data.Dataset):
    def __init__(self, dataset, transforms=None):
        ....
    def __getitem__(self, idx):
        ....
        return img, target
- train.py, where we call the create_data_loaders function and set up DistributedDataParallel
def main():
    # arguments
    args = parser.parse_args()
    # setup distributed
    if num_distrib() > 1:
        torch.cuda.set_device(args.local_rank)
        torch.distributed.init_process_group(backend='nccl', init_method='env://')
    # Data loading
    train_loader = create_data_loaders(args)
I think the closest reproduction code can be found in this repository; the one major difference is in data loading, where in our case FiftyOne takes care of the data rather than the built-in PyTorch loaders.
@benjaminpkane If we use torch.nn.DataParallel (single-process multi-thread parallelism) to use several GPUs, at first sight everything looks fine except for a UserWarning raised every epoch during model training. I'm not sure it affects anything, but could you please have a look? Can we somehow hide or fix this warning?
/home/ubuntu/OID/lib/python3.6/site-packages/pymongo/topology.py:165: UserWarning:
MongoClient opened before fork. Create MongoClient only after forking. See PyMongo's documentation for details: https://pymongo.readthedocs.io/en/stable/faq.html#is-pymongo-fork-safe
It repeats several times, depending on the number of GPUs you use. In our case it's the same UserWarning 4 times.
Hi @Etoye 👋
Thanks for all the detailed information. While we're looking into proper distributed training support, let me mention a simple workaround:
Since FiftyOne datasets are stored in a database, calls like dataset[sample_id] involve a round trip to the database to retrieve the data. An alternative is to load the filepaths and labels that you'll need into memory in the constructor of your Torch Dataset; then the data loader never needs to communicate with FiftyOne's database and distributed code will work fine (assuming all of the labels fit into memory, of course, which will certainly be true for classifications).
Continuing from your open images example code above, you can get the necessary data as follows:
filepaths, labels = dataset.values(["filepath", "positive_labels.classifications.label"])
Just to be safe, avoid storing a reference to the FiftyOne dataset in the constructor so that your dataset class is pickle-able.
For example, FiftyOne is integrated with PyTorch Lightning Flash, and that's how things are implemented there: https://github.com/PyTorchLightning/lightning-flash/blob/9e4fb62e0a7b2d3f265e574172bf88cc7b84924d/flash/core/data/data_source.py#L658-L689
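As a rough illustration of that workaround, here is a minimal sketch of a map-style dataset that keeps only plain Python lists. The class name InMemoryClassificationDataset and its arguments are hypothetical, not part of FiftyOne or PyTorch; it assumes labels is a list with one list of label strings (or None) per sample, as returned by the dataset.values() call above. It is written as a plain class so the snippet runs without torch installed; in a real training script you would likely subclass torch.utils.data.Dataset and load and transform the image in __getitem__.

```python
import pickle


class InMemoryClassificationDataset:
    """Hypothetical map-style dataset holding only plain Python data."""

    def __init__(self, filepaths, labels, classes):
        # Copy into plain lists so no reference to the FiftyOne dataset is kept
        self.filepaths = list(filepaths)
        self.class_to_idx = {name: i for i, name in enumerate(classes)}
        # A sample may have several positive labels (or None);
        # keep only the ones that belong to `classes`
        self.targets = [
            sorted(
                self.class_to_idx[label]
                for label in (sample_labels or [])
                if label in self.class_to_idx
            )
            for sample_labels in labels
        ]

    def __len__(self):
        return len(self.filepaths)

    def __getitem__(self, idx):
        # A real implementation would open the image here and apply transforms
        return self.filepaths[idx], self.targets[idx]


# Toy values standing in for the result of dataset.values(...)
example = InMemoryClassificationDataset(
    ["a.jpg", "b.jpg"], [["Dog", "Person"], None], classes=["Dog", "Cat"]
)
# The instance pickles cleanly because it holds no database handles
restored = pickle.loads(pickle.dumps(example))
```

Because the instance contains only strings, integers, lists, and a dict, it can be pickled and sent to worker processes without touching the database.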
Hi @brimoor,
Thank you for the hint! After the modification, the UserWarning mentioned above is gone and the dataset loads as intended with torch.nn.DataParallel. However, torch.nn.parallel.DistributedDataParallel still produces the same error I described in the bug report. Full message:
Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.
*****************************************
Process 61778 (service/main.py --51-service db --multi) did not respond
{"t":{"$date":"2021-09-29T03:14:05.228Z"},"s":"F", "c":"CONTROL", "id":20574, "ctx":"main","msg":"Error during global initialization","attr":{"error":{"code":37,"codeName":"FileRenameFailed","errmsg":"Could not rename preexisting log file \"/home/ubuntu/.fiftyone/var/lib/mongo/log/mongo.log\" to \"/home/ubuntu/.fiftyone/var/lib/mongo/log/mongo.log.2021-09-29T03-14-05\"; run with --logappend or manually remove file: No such file or directory"}}}
Subprocess ['/home/ubuntu/OID/lib/python3.6/site-packages/fiftyone/db/bin/mongod', '--dbpath', '/home/ubuntu/.fiftyone/var/lib/mongo', '--logpath', '/home/ubuntu/.fiftyone/var/lib/mongo/log/mongo.log', '--port', '0', '--nounixsocket'] exited with error 1:
{"t":{"$date":"2021-09-29T03:14:05.228Z"},"s":"I", "c":"CONTROL", "id":20697, "ctx":"main","msg":"Renamed existing log file","attr":{"oldLogPath":"/home/ubuntu/.fiftyone/var/lib/mongo/log/mongo.log","newLogPath":"/home/ubuntu/.fiftyone/var/lib/mongo/log/mongo.log.2021-09-29T03-14-05"}}
Subprocess ['/home/ubuntu/OID/lib/python3.6/site-packages/fiftyone/db/bin/mongod', '--dbpath', '/home/ubuntu/.fiftyone/var/lib/mongo', '--logpath', '/home/ubuntu/.fiftyone/var/lib/mongo/log/mongo.log', '--port', '0', '--nounixsocket'] exited with error 100:
{"t":{"$date":"2021-09-29T03:14:05.256Z"},"s":"I", "c":"CONTROL", "id":20697, "ctx":"main","msg":"Renamed existing log file","attr":{"oldLogPath":"/home/ubuntu/.fiftyone/var/lib/mongo/log/mongo.log","newLogPath":"/home/ubuntu/.fiftyone/var/lib/mongo/log/mongo.log.2021-09-29T03-14-05"}}
Subprocess ['/home/ubuntu/OID/lib/python3.6/site-packages/fiftyone/db/bin/mongod', '--dbpath', '/home/ubuntu/.fiftyone/var/lib/mongo', '--logpath', '/home/ubuntu/.fiftyone/var/lib/mongo/log/mongo.log', '--port', '0', '--nounixsocket'] exited with error 100:
Uncaught exception
Traceback (most recent call last):
File "train.py", line 10, in <module>
import fiftyone as fo
File "/home/ubuntu/OID/lib/python3.6/site-packages/fiftyone/__init__.py", line 25, in <module>
from fiftyone.__public__ import *
File "/home/ubuntu/OID/lib/python3.6/site-packages/fiftyone/__public__.py", line 14, in <module>
foo.establish_db_conn(config)
File "/home/ubuntu/OID/lib/python3.6/site-packages/fiftyone/core/odm/database.py", line 77, in establish_db_conn
port = _db_service.port
File "/home/ubuntu/OID/lib/python3.6/site-packages/fiftyone/core/service.py", line 295, in port
return self._wait_for_child_port()
File "/home/ubuntu/OID/lib/python3.6/site-packages/fiftyone/core/service.py", line 179, in _wait_for_child_port
return find_port()
File "/home/ubuntu/OID/lib/python3.6/site-packages/retrying.py", line 49, in wrapped_f
return Retrying(*dargs, **dkw).call(f, *args, **kw)
File "/home/ubuntu/OID/lib/python3.6/site-packages/retrying.py", line 212, in call
raise attempt.get()
File "/home/ubuntu/OID/lib/python3.6/site-packages/retrying.py", line 247, in get
six.reraise(self.value[0], self.value[1], self.value[2])
File "/home/ubuntu/OID/lib/python3.6/site-packages/six.py", line 719, in reraise
raise value
File "/home/ubuntu/OID/lib/python3.6/site-packages/retrying.py", line 200, in call
attempt = Attempt(fn(*args, **kwargs), attempt_number, False)
File "/home/ubuntu/OID/lib/python3.6/site-packages/fiftyone/core/service.py", line 177, in find_port
raise ServiceListenTimeout(etau.get_class_name(self), port)
fiftyone.core.service.ServiceListenTimeout: fiftyone.core.service.DatabaseService failed to bind to port
Uncaught exception
Traceback (most recent call last):
File "train.py", line 10, in <module>
import fiftyone as fo
File "/home/ubuntu/OID/lib/python3.6/site-packages/fiftyone/__init__.py", line 25, in <module>
from fiftyone.__public__ import *
File "/home/ubuntu/OID/lib/python3.6/site-packages/fiftyone/__public__.py", line 14, in <module>
foo.establish_db_conn(config)
File "/home/ubuntu/OID/lib/python3.6/site-packages/fiftyone/core/odm/database.py", line 77, in establish_db_conn
port = _db_service.port
File "/home/ubuntu/OID/lib/python3.6/site-packages/fiftyone/core/service.py", line 295, in port
return self._wait_for_child_port()
File "/home/ubuntu/OID/lib/python3.6/site-packages/fiftyone/core/service.py", line 179, in _wait_for_child_port
return find_port()
File "/home/ubuntu/OID/lib/python3.6/site-packages/retrying.py", line 49, in wrapped_f
return Retrying(*dargs, **dkw).call(f, *args, **kw)
File "/home/ubuntu/OID/lib/python3.6/site-packages/retrying.py", line 212, in call
raise attempt.get()
File "/home/ubuntu/OID/lib/python3.6/site-packages/retrying.py", line 247, in get
six.reraise(self.value[0], self.value[1], self.value[2])
File "/home/ubuntu/OID/lib/python3.6/site-packages/six.py", line 719, in reraise
raise value
File "/home/ubuntu/OID/lib/python3.6/site-packages/retrying.py", line 200, in call
attempt = Attempt(fn(*args, **kwargs), attempt_number, False)
File "/home/ubuntu/OID/lib/python3.6/site-packages/fiftyone/core/service.py", line 177, in find_port
raise ServiceListenTimeout(etau.get_class_name(self), port)
fiftyone.core.service.ServiceListenTimeout: fiftyone.core.service.DatabaseService failed to bind to port
Uncaught exception
Traceback (most recent call last):
File "train.py", line 10, in <module>
import fiftyone as fo
File "/home/ubuntu/OID/lib/python3.6/site-packages/fiftyone/__init__.py", line 25, in <module>
from fiftyone.__public__ import *
File "/home/ubuntu/OID/lib/python3.6/site-packages/fiftyone/__public__.py", line 14, in <module>
foo.establish_db_conn(config)
File "/home/ubuntu/OID/lib/python3.6/site-packages/fiftyone/core/odm/database.py", line 77, in establish_db_conn
port = _db_service.port
File "/home/ubuntu/OID/lib/python3.6/site-packages/fiftyone/core/service.py", line 295, in port
return self._wait_for_child_port()
File "/home/ubuntu/OID/lib/python3.6/site-packages/fiftyone/core/service.py", line 179, in _wait_for_child_port
return find_port()
File "/home/ubuntu/OID/lib/python3.6/site-packages/retrying.py", line 49, in wrapped_f
return Retrying(*dargs, **dkw).call(f, *args, **kw)
File "/home/ubuntu/OID/lib/python3.6/site-packages/retrying.py", line 212, in call
raise attempt.get()
File "/home/ubuntu/OID/lib/python3.6/site-packages/retrying.py", line 247, in get
six.reraise(self.value[0], self.value[1], self.value[2])
File "/home/ubuntu/OID/lib/python3.6/site-packages/six.py", line 719, in reraise
raise value
File "/home/ubuntu/OID/lib/python3.6/site-packages/retrying.py", line 200, in call
attempt = Attempt(fn(*args, **kwargs), attempt_number, False)
File "/home/ubuntu/OID/lib/python3.6/site-packages/fiftyone/core/service.py", line 177, in find_port
raise ServiceListenTimeout(etau.get_class_name(self), port)
fiftyone.core.service.ServiceListenTimeout: fiftyone.core.service.DatabaseService failed to bind to port
Killing subprocess 61762
Killing subprocess 61763
Killing subprocess 61764
Killing subprocess 61765
Traceback (most recent call last):
File "/usr/lib/python3.6/runpy.py", line 193, in _run_module_as_main
"__main__", mod_spec)
File "/usr/lib/python3.6/runpy.py", line 85, in _run_code
exec(code, run_globals)
File "/home/ubuntu/OID/lib/python3.6/site-packages/torch/distributed/launch.py", line 340, in <module>
main()
File "/home/ubuntu/OID/lib/python3.6/site-packages/torch/distributed/launch.py", line 326, in main
sigkill_handler(signal.SIGTERM, None) # not coming back
File "/home/ubuntu/OID/lib/python3.6/site-packages/torch/distributed/launch.py", line 301, in sigkill_handler
raise subprocess.CalledProcessError(returncode=last_return_code, cmd=cmd)
subprocess.CalledProcessError: Command '['/home/ubuntu/OID/bin/python', '-u', 'train.py', '--local_rank=3', '--dataset_name', 'open-images-test', '--epochs', '3']' returned non-zero exit status 1.
Right, this is because (as the stack trace shows) import fiftyone itself tries to establish a database connection.
So until we're able to properly support distributed training workflows like this, you'll need to refactor your code to avoid importing FiftyOne at all in any script that is being run via DistributedDataParallel. For example, you could write the requisite filepaths and labels to disk in, say, JSON format and load from there in your training script.
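One possible shape for that JSON hand-off is sketched below. The helper names save_manifest and load_manifest and the record layout are assumptions made for illustration, not FiftyOne APIs. The idea is that a separate, single-process export script imports FiftyOne, calls dataset.values(...) as shown earlier, and writes the manifest, while the training script launched via DistributedDataParallel only reads the JSON file.

```python
import json
from pathlib import Path


def save_manifest(path, filepaths, labels):
    """Write one record per sample; run this in the export script."""
    records = [
        # Samples without labels may come back as None; store an empty list
        {"filepath": filepath, "labels": list(sample_labels or [])}
        for filepath, sample_labels in zip(filepaths, labels)
    ]
    Path(path).write_text(json.dumps(records, indent=2))


def load_manifest(path):
    """Read the records back; the training script needs only this function."""
    records = json.loads(Path(path).read_text())
    filepaths = [record["filepath"] for record in records]
    labels = [record["labels"] for record in records]
    return filepaths, labels
```

With this split, the training script never executes import fiftyone, so no process started by the launcher tries to start or connect to the database service.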
Sounds like a plan, thank you for the prompt support. Should we close this issue until you add support for distributed training workflows, or keep it open?
Let's leave it open to remind us to fix it properly :)
Hi, is there any progress on the MongoDB fork warning? I am trying to load a dataset hosted on MongoDB with multiple workers and have been getting this warning. Would appreciate any update!
No update yet. import fiftyone currently always opens a database connection in the main process, and those connections can't be serialized. It seems we'd have to find a way to omit the connection from the serialization.
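To illustrate the general idea of omitting a connection from serialization (this is a generic Python pattern, not how FiftyOne is implemented), here is a minimal sketch. LazyClient and fake_connect are hypothetical names; fake_connect returns a threading.Lock purely as a stand-in for an object that cannot be pickled, the way a live database connection cannot.

```python
import pickle
import threading


def fake_connect():
    # Stand-in for opening a real connection; a Lock is likewise unpicklable
    return threading.Lock()


class LazyClient:
    """Hypothetical wrapper that reconnects lazily after being unpickled."""

    def __init__(self, connect):
        self._connect = connect  # must itself be picklable, e.g. a module-level function
        self._conn = None

    @property
    def conn(self):
        # Open the connection on first use in the current process
        if self._conn is None:
            self._conn = self._connect()
        return self._conn

    def __getstate__(self):
        # Drop the live connection so only picklable state is serialized
        state = self.__dict__.copy()
        state["_conn"] = None
        return state


client = LazyClient(fake_connect)
original_conn = client.conn  # connection opened in this process
clone = pickle.loads(pickle.dumps(client))  # succeeds; the connection is not included
```

Each process that receives the unpickled copy opens its own connection on first access, which is also in line with the PyMongo warning's advice to create the client only after forking.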
Thanks for the very useful information in this thread. Having been on this journey myself, could this detail be made more prominent in the documentation, or at least highlighted in the PyTorch example here: https://github.com/voxel51/fiftyone-examples/blob/master/examples/pytorch_detection_training.ipynb, where the UserWarnings appear in the quoted program output but are not mentioned at all?
Hi, is there any progress on concurrent executions? I am trying to run Ray Tune with FiftyOne, but I get the same error when there are two or more workers at the same time. Thanks!