
TPU Distributed Computing

Open huan opened this issue 5 years ago • 6 comments

The TPU chapter is planned to cover the following:

  • Cloud TPU
    • v2
    • v3
    • Pod
  • Edge TPU (Coral)

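~~At this point it looks like there will not be enough time to cover this in the first edition, so the plan is to leave the TPU content out of the first edition. (If there is still time before the book is published, we can add the most basic Google Cloud TPU setup instructions.)~~

~~Does this work for everyone? @snowkylin @dpinthinker~~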


  1. UPDATE(29 Aug 2019): TensorFlow 2.0/2.1 TPU Support Track Issue: https://github.com/tensorflow/tensorflow/issues/24412#issuecomment-525960626
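  2. UPDATE(17 Mar 2019): After discussing with Xihan, there is still some time before TF 2.0 is officially released, so we decided to go ahead and add a minimal version of 5-10 pages.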

huan avatar Mar 17 '19 16:03 huan

Will start writing this chapter this week.

huan avatar Aug 25 '19 20:08 huan

Reviews from @snowkylin

TPU

  • [ ] Move minor content into a tips box
    • [ ] Confirm the environment text is moved to the tips box
  • [ ] Use an existing TF model in the example code
  • [ ] Add a benchmark comparison with other strategies: GPU, multiple GPUs, and multiple servers
  • [ ] Study Xihan's distributed chapter and align with it
  • [ ] Add source link to each image

huan avatar Sep 09 '19 06:09 huan

Hello, the example Colab notebook for the chapter "Training TensorFlow Models with TPU (Huan)" (https://colab.research.google.com/github/huan/tensorflow-handbook-tpu/blob/master/tensorflow-handbook-tpu-example.ipynb) fails to run and shows:

InternalError: Failed copying input tensor from /job:localhost/replica:0/task:0/device:CPU:0 to /job:worker/replica:0/task:0/device:CPU:0 in order to run AutoShardDataset: Unable to parse tensor proto
Additional GRPC error information:
{"created":"@1571137943.518656507","description":"Error received from peer","file":"external/grpc/src/core/lib/surface/call.cc","file_line":1039,"grpc_message":"Unable to parse tensor proto","grpc_status":3} [Op:AutoShardDataset]

Could you help with this? Thanks.

JimXiongGM avatar Oct 15 '19 11:10 JimXiongGM

@JimXiongGM Hi, thanks for trying TF 2.0 with Colab & TPU!

TensorFlow 2.0 does not yet have complete TPU support in Colab. I got an update from Googlers, who said it will be fully supported in TensorFlow 2.1.

This is a known issue; you can learn more from https://github.com/tensorflow/tensorflow/issues/33045#issuecomment-539148033 and https://github.com/huan/tensorflow-handbook-tpu/issues/1

The Workaround

Until TF 2.1 is released, you can use the latest TF 1.x with eager execution enabled; its API is quite similar to that of TF 2.0.

Once 2.1 is released, you can switch to it with very few code modifications.
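As a rough illustration of the workaround above, here is a minimal sketch that picks a path from the TensorFlow version string. The helper names (`parse_version`, `tpu_path`, `enable_eager_if_tf1`) and the returned labels are hypothetical and made up for this example. The only TensorFlow calls used are `tf.compat.v1.enable_eager_execution()` and `tf.executing_eagerly()`, and the sketch assumes a TF 1.x release recent enough to ship the `tf.compat.v1` namespace. It does not connect to a TPU.

```python
def parse_version(version):
    """Return (major, minor) from a string such as '1.15.0' or '2.1.0-rc1'."""
    major, minor = version.split(".")[:2]
    return int(major), int(minor)


def tpu_path(version):
    """Map a TensorFlow version string to the path suggested in this thread.

    The labels are hypothetical names used only in this sketch.
    """
    major, minor = parse_version(version)
    if major == 1:
        return "tf1-eager"      # workaround: TF 1.x with eager execution
    if (major, minor) == (2, 0):
        return "switch-to-tf1"  # TF 2.0: Colab TPU support reported incomplete
    return "native"             # TF 2.1+: expected to work, per this thread


def enable_eager_if_tf1():
    """Turn on eager execution when the installed TensorFlow is 1.x."""
    import tensorflow as tf  # imported lazily so the helpers above work without it

    if tpu_path(tf.__version__) == "tf1-eager":
        # Must run at program startup, before any graph or op is created.
        tf.compat.v1.enable_eager_execution()
    return tf.executing_eagerly()
```

Calling `enable_eager_if_tf1()` as the first statement of a notebook keeps the rest of the code close to TF 2.0 style, which is what should make the later switch to 2.1 a small change.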

P.S. I will update the chapter to describe this problem in detail today.

huan avatar Oct 15 '19 12:10 huan

thanks a lot ;-)

JimXiongGM avatar Oct 15 '19 12:10 JimXiongGM

You are welcome. :)

huan avatar Oct 15 '19 12:10 huan