Alif Munim comments

Results: 8 comments by Alif Munim

Intermittent multiprocessing error on google cloud TPU

**Error 1 (Full):**

```
Exception in device=TPU:0: tensorflow/compiler/xla/xla_client/xrt_local_service.cc:56 : Check failed: tensorflow::NewServer(server_def, &server_) == ::tensorflow::Status::OK() (PERMISSION_DENIED: open(/dev/accel0): Operation not permitted: Operation not permitted; Couldn't open device: /dev/accel0; Unable to create...
```

Intermittent multiprocessing error on google cloud TPU

**Error 2 (Full):**

```
Exception in device=TPU:0: Cannot replicate if number of devices (1) is different from 8
Exception in device=TPU:1: Cannot replicate if number of devices (1) is different from...
```

Intermittent multiprocessing error on google cloud TPU

Found some additional information in the PyTorch Lightning docs [section on TPUs](https://pytorch-lightning.readthedocs.io/en/latest/accelerators/tpu_faq.html#how-to-resolve-the-replication-issue), which says that you should not call `xm.xla_device()` outside of the spawned process. I've removed that line, and...

Intermittent multiprocessing error on google cloud TPU

**Error 3 (Full):**

```
Exception in device=TPU:1: tensorflow/compiler/xla/xla_client/mesh_service.cc:329 : Check failed: impl_->channel->WaitForConnected( std::chrono::system_clock::now() + std::chrono::seconds(connect_wait_seconds))
*** Begin stack trace ***
    tensorflow::CurrentStackTrace()
    xla::service::MeshClient::MeshClient(std::string const&)
    xla::service::MeshClient::Get()
    xla::ComputationClient::Create()
    xla::ComputationClient::Get()
    PyCFunction_Call
    _PyObject_MakeTpCall
    _PyEval_EvalFrameDefault
    _PyFunction_Vectorcall...
```

Intermittent multiprocessing error on google cloud TPU

**Error 4 (Full):**

```
2022-08-29 20:10:46.436423: W tensorflow/core/distributed_runtime/rpc/grpc_remote_master.cc:157] RPC failed with status = "UNAVAILABLE: Socket closed" and grpc_error_string = "{"created":"@1661803846.436277160","description":"Error received from peer ipv4:127.0.0.1:51011","file":"external/com_github_grpc_grpc/src/core/lib/surface/call.cc","file_line":1056,"grpc_message":"Socket closed","grpc_status":14}", maybe retrying the RPC
```

...

Noise on all image for training
