
[TF pin update] The same RunGraph (Worker) request was received twice

Status: Open. Opened by ymwangg on Oct 10 '22.

🐛 Bug

This is a heads-up about the next TF pin update. I found this issue while using a newer version of TensorFlow to debug the torch_xla GPU DDP issue (https://github.com/tensorflow/tensorflow/pull/58022).

I got the following error while running the test_train_mp_imagenet.py test.

| Training Device=xla:0/6 Epoch=1 Step=620 Loss=0.00178 Rate=394.72 GlobalRate=329.85 Time=20:56:35
| Training Device=xla:0/2 Epoch=1 Step=620 Loss=0.00178 Rate=394.71 GlobalRate=329.47 Time=20:56:35
2022-10-10 20:56:40.089342: F tensorflow/compiler/xla/xla_client/xrt_computation_client.cc:1241] Non-OK-status: session->session()->Run( feed_inputs, {}, {cached_node.operations[0]}, &outputs) status: ABORTED: From /job:localservice/replica:0/task:7:
The same RunGraph (Worker) request was received twice. graph_handle: "0000000000000003" step_id: 100386426315014832 send { name: "/job:localservice/replica:0/task:7/device:CPU:0;2845c575404f86b6;/job:localservice/replica:0/task:7/device:CPU:0;Placeholder_112:0;0:0" tensor { dtype: DT_INT64 tensor_shape { dim { size: 24 } } tensor_content: "\342\035\336S\327\014\010\000\177\314\300\266b\014\017\000\325\242\260\374\352\220\004\000\363>\373\332\256,\000\000\206\351\207\"uq\001\000\204\346\026\264 \372\004\000x\031\357\177\377\316\r\000\307P++\317\240\t\000_\251n@\235\273\016\000\220\266\262Gr\371\010\000\003\254nYD\247\001\000\031J\257o\227B\n\000\264\031d\200aG\017\000\257\221\234\357j\307\006\000M\232 \307k\263\006\000T*\251\360\317L\003\000\021\240\2122\010\234\016\000w\275.m\233/\005\000\363u\237\244\334\200\t\000\232\353un=m\r\000H\007\242\225\333\200\016\000L)\371A\336\244\000\000@k\331\033\211\007\017\000\326\rI=\205\031\017\000" } } exec_opts { } session_handle: "aed9ceb6d61f0fa7" store_errors_in_response_body: true create_worker_session_called: true request_id: -3669225667362124901
*** Begin stack trace ***
	tsl::CurrentStackTrace[abi:cxx11]()
	xla::XrtComputationClient::ReleaseHandles(std::vector<xla::XrtComputationClient::DeviceHandle, std::allocator<xla::XrtComputationClient::DeviceHandle> >*, std::function<xla::XrtSession::CachedNode const& (xla::XrtSession*, tensorflow::Scope const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&)> const&, xla::metrics::Metric*, xla::metrics::Counter*)
	xla::XrtComputationClient::HandleReleaser()
	xla::util::TriggeredTask::Runner()
	
	
	clone
*** End stack trace ***

Traceback (most recent call last):
  File "test_train_mp_imagenet.py", line 272, in <module>
    xmp.spawn(_mp_fn, args=(FLAGS,), nprocs=FLAGS.num_cores)
  File "/home/ubuntu/src/pytorch/xla/torch_xla/distributed/xla_multiprocessing.py", line 393, in spawn
    return torch.multiprocessing.start_processes(
  File "/home/ubuntu/src/pytorch/torch/multiprocessing/spawn.py", line 198, in start_processes
    while not context.join():
  File "/home/ubuntu/src/pytorch/torch/multiprocessing/spawn.py", line 140, in join
    raise ProcessExitedException(
torch.multiprocessing.spawn.ProcessExitedException: process 7 terminated with signal SIGABRT

To Reproduce

Update TF pin to 4c4ef6d780f4d6669c4cf156edd3ffb6e0787d5a.

Run the following script; the error should occur within the first epoch. I tested it on GPU, but it's unlikely that this is a GPU-specific issue.

GPU_NUM_DEVICES=8 python test_train_mp_imagenet.py --fake_data

Additional context

After some debugging, I found the following offending TensorFlow commit:

commit 0de6ecda97e261528b51709c11a4e7e22a39ca33
Author: A. Unique TensorFlower <[email protected]>
Date:   Wed Sep 28 14:48:25 2022 -0700

    Use thread local random generator to generate request id for RpcRecvTensorCall since using a singleton random generator can cause lock contention in large scale distributed training.
    
    PiperOrigin-RevId: 477557876

It looks like this error is due to duplicate request IDs being generated from multiple threads after this TF change.
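For illustration only, here is a minimal, self-contained C++ sketch (not TensorFlow code; the names and the constant seed are made up) of how per-thread generators with too little per-thread entropy can hand out the same request ID on two different threads, which is exactly the kind of collision the worker then rejects as a duplicate RunGraph request:

```cpp
#include <cstdint>
#include <iostream>
#include <random>
#include <thread>
#include <vector>

int64_t NextRequestId() {
  // Seeding with a constant exaggerates the failure mode; a real
  // implementation would mix in per-thread entropy, but any scheme with
  // too little entropy per thread raises the collision probability.
  thread_local std::mt19937_64 gen(42);
  return static_cast<int64_t>(gen());
}

int main() {
  std::vector<int64_t> ids(2);
  std::thread t1([&] { ids[0] = NextRequestId(); });
  std::thread t2([&] { ids[1] = NextRequestId(); });
  t1.join();
  t2.join();
  // Both threads drew the first value of an identically seeded stream,
  // so the two "independent" request ids collide.
  std::cout << ids[0] << " == " << ids[1] << std::endl;
  return 0;
}
```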

ymwangg commented Oct 10 '22 21:10

Thanks for the heads-up. I did a quick check and it seems like our latest pin does not include the problematic PR (correct me if I am wrong). The remaining action item is to make sure the next update includes your fix. Let me know if you need help finding xla:gpu reviewers.

JackCaoG commented Oct 10 '22 23:10

We ran into the same issue in https://github.com/pytorch/xla/pull/4101, though I am not sure why the CPU CI also failed.

JackCaoG commented Oct 21 '22 00:10

@ymwangg do you know if we need to do anything on the CPU side to fix this error?

JackCaoG commented Oct 21 '22 00:10

To fix this error, I simply reverted commit 0de6ecda97e261528b51709c11a4e7e22a39ca33. I suspect this is an XRT-related issue and is not specific to any particular platform.
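For context, here is a minimal sketch of the shape of the design the revert goes back to, assuming (based only on the commit message) a single process-wide generator guarded by a lock. This is an illustration, not the actual TensorFlow implementation:

```cpp
#include <cstdint>
#include <mutex>
#include <random>

// One process-wide generator: all threads draw from a single stream,
// so request IDs do not repeat the way identically seeded per-thread
// streams can, at the cost of the lock contention the reverted commit
// was trying to avoid.
int64_t GetUniqueRequestId() {
  static std::mutex mu;
  static std::mt19937_64 gen(std::random_device{}());
  std::lock_guard<std::mutex> lock(mu);
  return static_cast<int64_t>(gen());
}
```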

ymwangg commented Oct 21 '22 01:10