
Waymo Dataset Format

Open yinjunbo opened this issue 4 years ago • 21 comments

I used generate_waymo_tf.py to generate the tf.Example files for training the StarNet model, but I'm not sure about the output format. As far as I understand, each segment in Waymo will result in an individual *0000-of-00001 (tf.Example) file, and 1150 segments will output 1150 files. So should I add a loop in generate_waymo_tf.py to process all the segments? Or should all the segments be processed into only one tf.Example file?

yinjunbo avatar May 22 '20 05:05 yinjunbo

@vrv

yinjunbo avatar May 22 '20 05:05 yinjunbo

Besides, when I trained StarNet on KITTI with bazel-bin/lingvo/trainer, it failed with

ValueError: Failed to create a one-shot iterator for a dataset. 
`Dataset.make_one_shot_iterator()` does not support datasets that capture stateful objects, such as a `Variable` or `LookupTable`. 
In these cases, use `Dataset.make_initializable_iterator()`. 
(Original error: Cannot capture a stateful node 
(name:bbox_aug/GroundTruthAugmentor/list_files/AnonymousRandomSeedGenerator, type:AnonymousRandomSeedGenerator) by value.)

It seems that the GT data augmentation is not initialized successfully. Is there an instruction on how to configure the GT data augmentation module?

yinjunbo avatar May 24 '20 07:05 yinjunbo

The info for p.groundtruth_database:

Loading groundtruth database at {
  allow_implicit_capture: None
  cls: <class 'lingvo.core.datasource.PrefixedDataSource'>
  dtype: <dtype: 'float32'>
  file_pattern: "kitti_train_object_cls.tfrecord-00000-of-00100"
  file_pattern_prefix: "/home/junbo/datasets/KITTI/kitti_object/starnet-tfr/"
  file_type: ""
  fprop_dtype: None
  inference_driver_name: None
  is_inference: None
  name: "datasource"
  params_init: {
    method: "xavier"
    scale: 1.000001
    seed: None
  }
  random_seed: None
  skip_lp_regularization: None
  vn: {
    global_vn: False
    per_step_vn: False
    scale: None
    seed: None
  }
}

yinjunbo avatar May 24 '20 07:05 yinjunbo

We have never seen that before, unfortunately, but I think the description of the error message suggests what the problem is.

The ground truth database “list_files” operation uses a stateful random seed, which means we can’t initialize the pipeline using one_shot_iterator().

Two possible solutions: provide a seed to list_files() so that it is deterministic in how it selects the files (which probably doesn't matter for accuracy, I suspect), or change the input pipeline code to use make_initializable_iterator(). I'm not too familiar with how the second approach would need to be integrated, so setting the random seed to a specific value may be the easiest thing to do.
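
Roughly, the first option looks like this (a sketch, not the actual lingvo datasource code; the file pattern is a placeholder). With an explicit seed, list_files no longer captures a stateful AnonymousRandomSeedGenerator node, so a one-shot iterator can be created:

import tensorflow as tf

# Sketch only: an explicit seed keeps the file shuffling deterministic and
# avoids the stateful random-seed op that make_one_shot_iterator() rejects.
files = tf.data.Dataset.list_files(
    "/path/to/kitti_train_object_cls.tfrecord-*",  # placeholder pattern
    shuffle=True,
    seed=1234)
dataset = files.interleave(tf.data.TFRecordDataset, cycle_length=4)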

vrv avatar May 24 '20 18:05 vrv

As to your other comment about the dataset: the pipeline is an Apache Beam pipeline that should run over every example in the input file pattern, so if you specify a wildcard path pointing to all of the input files, it will run on all of the segments.

If you run on a cloud instance, typically they are all run in parallel, but running locally it should run one at a time.

Note that the time it takes to generate the data is quite high — I’m trying to get the Waymo team to host the latest data in a public bucket but it’s taking longer than I expected to get this done. Will update you if this ever happens.

vrv avatar May 24 '20 18:05 vrv

@vrv, I have specified a root path containing all of the input files, and the output file train.tfr-00000-of-00001 seems to include all 798 segments, taking about 5.5TB of disk space in total (~5 days to generate). I'm not sure whether this is all right?
Besides, have you tried to train StarNet on a local machine? How long would it take? Or what's the difference between training on GCP and on a local machine?

yinjunbo avatar May 25 '20 06:05 yinjunbo

@yinjunbo Ah, if you run the pipeline specifying a sharded output (e.g., --output=/path/to/waymo.tfr@1000) then it would create 1000 files instead of one gigantic file.

That being said, the 5.5TiB output size does sound roughly correct, so it does look like you generated it correctly.

I haven't tried training StarNet locally since I personally don't have any local accelerators. My guess is that the model as currently configured does take a long time to train unless you have sufficient accelerators available (e.g., 8 GPUs). As mentioned in the other issues, you can still train a model by reducing the number of centers and/or the number of points per center, and that should train much faster. In addition, you can use a cheaper featurizer (fewer hidden units, depth, etc.), but I'm not sure whether the result will be as good.
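
If it helps, here is a rough sketch of how a lighter variant could be registered; the base-class and parameter names below are placeholders, not the verified StarNet fields, so check params/waymo.py for the real ones:

from lingvo import model_registry
from lingvo.tasks.car.params import waymo  # assumed location of the StarNet configs


@model_registry.RegisterSingleTaskModel
class StarNetWaymoSmall(waymo.StarNetBase):  # placeholder base-class name
  """Cheaper StarNet config for single-machine experiments (sketch)."""

  def Task(self):
    p = super().Task()
    # Placeholder knobs: fewer centers and fewer points per center than the
    # default config; substitute the actual field names from params/waymo.py.
    p.num_cell_centers = 256
    p.num_points_per_cell = 64
    return p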

vrv avatar May 26 '20 20:05 vrv

@vrv Thanks for your advice on training locally. As for the dataset format, in my case, I ran it with

python generate_waymo_tf.py \
  --input_file_pattern=/path_to_waymo_data/segment-*_with_camera_labels.tfrecord \
  --output_filebase=/path_to_preprocessed_data/train.tfr@1000

Then, it just outputs a single gigantic file named train.tfr@1000-00000-of-00001, instead of 1000 files. I think the random sampling behavior of the data loader differs between 1 file and 1000 files. BTW, I'm not able to add

  --project=$PROJECT \
  --temp_location=$TEMP_DIR \
  --runner=DataflowRunner

into the python command, because it failed with FATAL Flags parsing error: Unknown command line flag 'runner'. Does this cause the difference?

yinjunbo avatar May 27 '20 08:05 yinjunbo

Hm, maybe that @1000 syntax doesn't work as I thought it did. You may be able to pass num_shards=1000 to the GetWriter() function in that file to do the right thing.
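
For what it's worth, sharded output with Beam's TFRecord sink looks roughly like this (a sketch; the actual generate_waymo_tf.py wires this up through its GetWriter() helper, so the details there may differ):

import apache_beam as beam

with beam.Pipeline() as p:
  _ = (
      p
      | 'CreateExamples' >> beam.Create([b'serialized-example'])  # stand-in for the real source
      | 'WriteTFRecords' >> beam.io.WriteToTFRecord(
          '/path_to_preprocessed_data/train.tfr',
          num_shards=1000))  # writes train.tfr-00000-of-01000, train.tfr-00001-of-01000, ...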

As for the lack of the --runner flag, I suspect you need the Cloud Dataflow SDK (https://beam.apache.org/documentation/runners/dataflow/) installed in order for this to work, but that's only if you want to run the pipeline in the cloud, in parallel.

vrv avatar May 28 '20 00:05 vrv

@vrv Following your advice of adding num_shards=1000, I can now obtain the correct number of files. However, it will take a long time to process them on a local machine, and it seems that only a single GPU is active. I'm now turning to GCP. Will the Cloud Dataflow SDK help to process the segments in parallel with multiple GPUs? How long will it take to generate the 1000 files on GCP?

yinjunbo avatar May 28 '20 06:05 yinjunbo

Cool, yeah it does take a long time to process them if running locally. As for GCP, it is possible to run but I never ran the full pipeline there, so I don't know how long it takes / how much it costs.

We finally were able to upload the latest data to GCS here, so you don't have to do the data processing yourself:

gsutil ls gs://waymo_open_dataset_v_1_0_0_tf_example_lingvo/v.1.2.0
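
As a quick sanity check, something like this should be able to read the hosted shards directly (a sketch; the shard naming under the bucket is an assumption, so list the files with the gsutil command above first):

import tensorflow as tf

# Assumed shard naming; verify with `gsutil ls` above.
files = tf.io.gfile.glob(
    'gs://waymo_open_dataset_v_1_0_0_tf_example_lingvo/v.1.2.0/train*')
ds = tf.data.TFRecordDataset(files[:1])
for raw in ds.take(1):
  ex = tf.train.Example.FromString(raw.numpy())
  print(sorted(ex.features.feature.keys()))  # see which features are present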

Let us know if that works for you!

vrv avatar May 28 '20 16:05 vrv

@vrv, Thanks a lot for providing the preprocessed data! BTW, I find the test tf.Example file is just 100MB, which is much smaller than train (~5G). Is that all right?

yinjunbo avatar May 30 '20 12:05 yinjunbo

I'm now trying to train with a TPU cluster, but there is an issue:

Traceback (most recent call last):
  File "/usr/lib/python3.7/runpy.py", line 193, in _run_module_as_main
    "__main__", mod_spec)
  File "/usr/lib/python3.7/runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "/tmp/lingvo/lingvo/trainer.py", line 1859, in <module>
    tf.app.run(main)
  File "/usr/local/lib/python3.7/dist-packages/tensorflow/python/platform/app.py", line 40, in run
    _run(main=main, argv=argv, flags_parser=_parse_flags_tolerate_undef)
  File "/usr/local/lib/python3.7/dist-packages/absl/app.py", line 299, in run
    _run_main(main, args)
  File "/usr/local/lib/python3.7/dist-packages/absl/app.py", line 250, in _run_main
    sys.exit(main(argv))
  File "/tmp/lingvo/lingvo/trainer.py", line 1850, in main
    RunnerManager(FLAGS.model).Start()
  File "/tmp/lingvo/lingvo/trainer.py", line 1846, in Start
    self.StartRunners(self.CreateRunners(FLAGS.job.split(','), FLAGS.logdir))
  File "/tmp/lingvo/lingvo/trainer.py", line 1590, in CreateRunners
    trial)
  File "/tmp/lingvo/lingvo/trainer.py", line 1547, in _CreateRunner
    return self.TrainerTpu(cfg, *common_args)
  File "/tmp/lingvo/lingvo/trainer.py", line 570, in __init__
    _WaitUntilInitTpu()
 File "/tmp/lingvo/lingvo/core/retry.py", line 53, in Wrapper
    return func(*args, **kwargs)
  File "/tmp/lingvo/lingvo/trainer.py", line 559, in _WaitUntilInitTpu
    num_replicas=data_parallelism)
  File "/usr/local/lib/python3.7/dist-packages/tensorflow/python/tpu/device_assignment.py", line 258, in device_assignment
    topology = Topology(serialized=topology)
  File "/usr/local/lib/python3.7/dist-packages/tensorflow/python/tpu/topology.py", line 78, in __init__
    self._parse_topology(serialized)
  File "/usr/local/lib/python3.7/dist-packages/tensorflow/python/tpu/topology.py", line 104, in _parse_topology
    "entries; got {}".format(self._mesh_shape))
ValueError: `mesh_shape` must be a vector of size 3 with positive entries; got [4 4 1 2]

Besides, where is the code to activate the directional head in StarNet?

yinjunbo avatar May 30 '20 16:05 yinjunbo

Hi @yinjunbo, to answer your questions to the best of my ability:

  • There are fewer test examples than training examples; you can look for the ‘num_samples’ fields in params/waymo.py I believe.

  • The directional loss is a parameter of the StarNet Params() configuration (use_directional_aware_loss).

  • That last bug I think is caused by an incorrect config in lingvo/trainer.py, I was hoping it was fixed already :(. @jonathanasdf @benlee ?

I believe the answer is to remove the third ‘1’ from every list here: https://github.com/tensorflow/lingvo/blob/5a0b6b97f6aa27923e9d2976fa921fc47b55d665/lingvo/core/py_utils.py#L4320
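
Concretely, the change looks roughly like this (illustrative only; the exact lists live at the linked py_utils.py line, and the point is just that each returned shape should have 3 entries to match the released TF's 3-entry mesh_shape):

def ComputationShape(split_size):
  """Decides the computation shape per TPU replica (illustrative sketch)."""
  computation_shape = None
  if split_size == 1:
    computation_shape = [1, 1, 1]   # was [1, 1, 1, 1]: drop the third '1'
  elif split_size == 2:
    computation_shape = [1, 1, 2]   # was [1, 1, 1, 2]
  elif split_size == 8:
    computation_shape = [2, 2, 2]   # was [2, 2, 1, 2]
  else:
    raise ValueError('Model parallelism with %d devices is unsupported.' %
                     split_size)
  return computation_shape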

vrv avatar May 31 '20 05:05 vrv

@vrv, Could you run your code on GKE successfully? What's your TensorFlow version for running on TPU? I find it's not an issue caused by ComputationShape. In fact, the topology is not right at Line 258 in ~/tensorflow/tensorflow/python/tpu/device_assignment.py. Please double-check your code. I have been working on it for 2 days, but could not figure out the right topology because I have never used TPU or GKE before.

yinjunbo avatar May 31 '20 16:05 yinjunbo

Internally the code works today, and I wrote the gke_launch.py last year and it was working, but things have changed in the underlying libraries and there isn’t a lot of testing support on the lingvo side for TPU issues in GCP.

I’d recommend trying to run training on GPUs for now until the TPU setup is addressed, unless you want to invest more time trying things.

I do believe that ComputationShape is wrong for GCP (the length of the topology array needs to be 3 for GCP, not 4), but I probably won’t have a lot of time to investigate this in the upcoming week to try myself.

vrv avatar May 31 '20 17:05 vrv

If you look at the current HEAD of tensorflow/python/tpu/topology.py, you’ll see that mesh_shape is expected to be a length 4 array, so our code is correct if you’re using a HEAD version of tensorflow, but I don’t know if that change is in a released version of TensorFlow that could be upgraded to.

(Internally we always run at HEAD, which is why our code is newer than what is officially released). My short term suggestion is just to change ComputationShape() to length 3 arrays and if that doesn’t work, try using GPUs until this gets fixed.

Cc @benlee, @jonathanasdf, @bignamehyp

vrv avatar May 31 '20 17:05 vrv

@vrv, I can train locally with 8*32G GPUs, but the batch size has to be 1, which leads to a very long training time. I have changed ComputationShape() to length 3, but it doesn't help. In ~/tensorflow/tensorflow/python/tpu/topology.py of the released version 2.2, it requires a 3-d _mesh_shape:

    self._serialized = serialized
    if serialized:
      self._parse_topology(serialized)
    else:
      self._mesh_shape = np.asarray(mesh_shape, dtype=np.int32)
      self._device_coordinates = np.asarray(device_coordinates, np.int32)
      if len(self._mesh_shape) != 3 or any(self._mesh_shape < 1):
        raise ValueError("`mesh_shape` must be a sequence of 3 positive "
                         "entries; got {}".format(self._mesh_shape))
      if (len(self._device_coordinates.shape) != 3 or
          self._device_coordinates.shape[2] != len(self._mesh_shape)):
        raise ValueError("`device_coordinates` must be a rank 3 int32 array "
                         "with minor dimension equal to the mesh shape rank")
    self._topology_tasks, self._topology_devices = self._invert_topology()
    # Coordinates of devices that are missing
    self._missing_devices = np.argwhere(self._topology_tasks < 0)

I finally fixed this by upgrading the tf version to tf-nightly-2.3.

yinjunbo avatar Jun 01 '20 06:06 yinjunbo

However, I cannot compile the trainer (bazel build -c opt //lingvo:trainer) with tf-nightly-2.3.

ERROR: /home/junbo/repository/starnet/lingvo/lingvo/core/ops/BUILD:474:1: C++ compilation of rule '//lingvo/core/ops:hyps_proto' failed (Exit 1) gcc failed: error executing command /usr/bin/gcc -U_FORTIFY_SOURCE -fstack-protector -Wall -Wunused-but-set-parameter -Wno-free-nonheap-object -fno-omit-frame-pointer -g0 -O2 '-D_FORTIFY_SOURCE=1' -DNDEBUG -ffunction-sections ... (remaining 63 argument(s) skipped)

Use --sandbox_debug to see verbose messages from the sandbox
In file included from bazel-out/k8-opt/bin/lingvo/core/ops/hyps.pb.cc:4:0:
bazel-out/k8-opt/bin/lingvo/core/ops/hyps.pb.h:17:2: error: #error This file was generated by an older version of protoc which is
 #error This file was generated by an older version of protoc which is
  ^
bazel-out/k8-opt/bin/lingvo/core/ops/hyps.pb.h:18:2: error: #error incompatible with your Protocol Buffer headers. Please
 #error incompatible with your Protocol Buffer headers. Please
  ^
bazel-out/k8-opt/bin/lingvo/core/ops/hyps.pb.h:19:2: error: #error regenerate this file with a newer version of protoc.
 #error regenerate this file with a newer version of protoc.
  ^
Target //lingvo:trainer failed to build
Use --verbose_failures to see the command lines of failed build steps.
INFO: Elapsed time: 22.080s, Critical Path: 6.12s
INFO: 207 processes: 207 linux-sandbox.
FAILED: Build did NOT complete successfully

Is there an available TensorFlow version that simultaneously supports compiling bazel build -c opt //lingvo:trainer and has the correct topology.py?

yinjunbo avatar Jun 02 '20 07:06 yinjunbo

I don't have a good answer for you; every time we upgrade tf we run into issues like these too :(. I still think the right answer is to effectively undo https://github.com/tensorflow/lingvo/commit/9ecb46dea2a2bd42be0baf63811b82f625bcf045#diff-54268ff347c5ce09f80ddf7b19889c03 in the current version of lingvo. That's the best suggestion I have until lingvo upgrades to the next release of tf.

vrv avatar Jun 02 '20 23:06 vrv

I changed the mesh_shape to [4,4,2] at Line 101 of https://github.com/tensorflow/tensorflow/blob/3ffdb91f122f556a74a6e1efd2469bfe1063cb5c/tensorflow/python/tpu/topology.py (just an example, since I cannot find the tensorflow-2.2 code installed by pip3). Then it turns out that the device_coordinates is not right. There seems to be a lot of work to do to run the TPU code ...

  File "/usr/lib/python3.7/runpy.py", line 193, in _run_module_as_main
    "__main__", mod_spec)
  File "/usr/lib/python3.7/runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "/tmp/lingvo/lingvo/trainer.py", line 1866, in <module>
    tf.app.run(main)
  File "/usr/local/lib/python3.7/dist-packages/tensorflow/python/platform/app.py", line 40, in run
    _run(main=main, argv=argv, flags_parser=_parse_flags_tolerate_undef)
  File "/usr/local/lib/python3.7/dist-packages/absl/app.py", line 299, in run
    _run_main(main, args)
  File "/usr/local/lib/python3.7/dist-packages/absl/app.py", line 250, in _run_main
    sys.exit(main(argv))
  File "/tmp/lingvo/lingvo/trainer.py", line 1857, in main
    RunnerManager(FLAGS.model).Start()
  File "/tmp/lingvo/lingvo/trainer.py", line 1853, in Start
    self.StartRunners(self.CreateRunners(FLAGS.job.split(','), FLAGS.logdir))
  File "/tmp/lingvo/lingvo/trainer.py", line 1597, in CreateRunners
    trial)
  File "/tmp/lingvo/lingvo/trainer.py", line 1554, in _CreateRunner
    return self.TrainerTpu(cfg, *common_args)
  File "/tmp/lingvo/lingvo/trainer.py", line 577, in __init__
    _WaitUntilInitTpu()
  File "/tmp/lingvo/lingvo/core/retry.py", line 53, in Wrapper
    return func(*args, **kwargs)
  File "/tmp/lingvo/lingvo/trainer.py", line 566, in _WaitUntilInitTpu
    num_replicas=data_parallelism)
  File "/usr/local/lib/python3.7/dist-packages/tensorflow/python/tpu/device_assignment.py", line 258, in device_assignment
    topology = Topology(serialized=topology)
  File "/usr/local/lib/python3.7/dist-packages/tensorflow/python/tpu/topology.py", line 91, in __init__
    self._topology_tasks, self._topology_devices = self._invert_topology()
  File "/usr/local/lib/python3.7/dist-packages/tensorflow/python/tpu/topology.py", line 137, in _invert_topology
    x, y, z = self.device_coordinates[task, device, :]
ValueError: too many values to unpack (expected 3)

yinjunbo avatar Jun 03 '20 06:06 yinjunbo