
Training on Waymo seems not to lead to convergence

Open zppppppx opened this issue 1 year ago • 0 comments

Great work! The model is fast at inference and lightweight.

When I train the model on Waymo (after preprocessing the dataset with your code), the loss does not seem to decrease. I am not sure what could be going wrong.

I made two major modifications:

  1. When processing the Waymo dataset, the original code maps the sequence file path (see the commented-out line), but that raised errors, so I changed it as follows:
def process_single_sequence(sequence_file, save_path, sampled_interval, client, has_label=True, use_two_returns=True):
    sequence_name = os.path.splitext(os.path.basename(sequence_file))[0]

    # print('Load record (sampled_interval=%d): %s' % (sampled_interval, sequence_name))
    if not client.exists(sequence_file):
        print('NotFoundError: %s' % sequence_file)
        return []

    # dataset = tf.data.TFRecordDataset(client._map_path(sequence_file), compression_type='')
    dataset = tf.data.TFRecordDataset(str(sequence_file), compression_type='')
    cur_save_dir = save_path / sequence_name
    cur_save_dir.mkdir(parents=True, exist_ok=True)
  2. For dist_train.sh, I changed it to the same format as OpenPCDet's:
#!/usr/bin/env bash
set -x
NGPUS=$1
PY_ARGS=${@:2}

echo "#######################################" $PY_ARGS

while true
do
    PORT=$(( ((RANDOM<<15)|RANDOM) % 49152 + 10000 ))
    status="$(nc -z 127.0.0.1 $PORT < /dev/null &>/dev/null; echo $?)"
    if [ "${status}" != "0" ]; then
        break;
    fi
done
echo $PORT

python3 -m torch.distributed.launch --nproc_per_node=${NGPUS} --master_port $PORT train.py --launcher pytorch ${PY_ARGS}
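
For reference, the port-selection loop in the script above can be sketched in Python as follows. This is only an illustration of the same idea (draw a random port in 10000..59151 and keep it if nothing accepts connections on it); the function names `port_in_use` and `pick_free_port` are hypothetical and are not part of OpenPCDet or this repository.

```python
import random
import socket


def port_in_use(port, host="127.0.0.1"):
    """Return True if something accepts TCP connections on host:port.

    Mirrors `nc -z 127.0.0.1 $PORT`: a successful connect means the port is taken.
    """
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        sock.settimeout(0.5)
        return sock.connect_ex((host, port)) == 0


def pick_free_port(low=10000, high=59151, max_tries=100):
    """Draw random ports in [low, high] until one is not in use.

    The default range matches ((RANDOM<<15)|RANDOM) % 49152 + 10000 in the script.
    """
    for _ in range(max_tries):
        port = random.randint(low, high)
        if not port_in_use(port):
            return port
    raise RuntimeError("no free port found after %d tries" % max_tries)


if __name__ == "__main__":
    print(pick_free_port())
```

As in the shell version, another process could still grab the port between the check and the launch, so this only lowers the chance of a collision.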

The training log is attached.

I am now training again with the original dist_train.sh, but the loss still shows no sign of converging. Log: train-waymo-pvt-ssd.log
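
To make "no sign of converging" more concrete, here is a minimal sketch of how the trend could be checked. It assumes the per-iteration loss values have already been extracted from the log into a plain list of floats (the log format is not shown here, so no parsing is attempted). The function names, the window size, and the 5% threshold are all hypothetical choices for illustration.

```python
def moving_average(values, window):
    """Simple moving average; returns len(values) - window + 1 points."""
    if window <= 0:
        raise ValueError("window must be positive")
    smoothed = []
    running = 0.0
    for i, value in enumerate(values):
        running += value
        if i >= window:
            running -= values[i - window]  # drop the value leaving the window
        if i >= window - 1:
            smoothed.append(running / window)
    return smoothed


def loss_is_decreasing(losses, window=50, min_rel_drop=0.05):
    """Compare the first and last smoothed loss values.

    Returns True if the smoothed loss dropped by at least `min_rel_drop`
    (relative to its starting value). Both defaults are arbitrary assumptions.
    """
    smoothed = moving_average(losses, window)
    if len(smoothed) < 2:
        raise ValueError("need more than `window` loss values")
    first, last = smoothed[0], smoothed[-1]
    if first == 0:
        return last < 0
    return (first - last) / abs(first) >= min_rel_drop
```

Smoothing matters here because per-iteration detection losses are noisy, so comparing two raw values from the log can be misleading.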

zppppppx · Jan 25 '24 23:01