RVT
Reproducing training procedure
Hello! I'm reproducing your training procedure and I'm trying to resolve a discrepancy between the paper and the code. In the config at https://github.com/NVlabs/RVT/blob/master/rvt/configs/rvt2.yaml there are 15 epochs and an lr of 1.25e-5, whereas in the paper you used 10 epochs and an lr of 2.4e-3. Also, the batch size is 24 in the configuration file and 192 in the paper. Finally, the model you provide has a 99 postfix in its name, which suggests that you trained for 100 epochs. When I tried to train for 100 epochs and kept the lr and batch size intact, I got NaN at the 54th epoch.
I wonder what I should change in the settings to reproduce your results. I really appreciate any help you can provide!
Hi,
Thanks for your interest in our work.
The configuration file here contains the official settings, except for a difference in the number of epochs.
Here’s how it aligns with the paper:
- As stated in the paper, we train for approximately 80K steps, translating to ~100 epochs. So you are correct in updating to 100 epochs.
- The paper mentions training on 8 V100 GPUs with a total batch size of 192. The config specifies the per-GPU batch size, so a value of 24 results in an effective batch size of 192 (24 × 8).
- Similarly, the learning rate (lr) in the config is specified per sample in the batch. Hence, for a batch size of 192, the lr of 1.25e-5 in the config is the same as the effective lr of 2.4e-3 (1.25e-5 × 192) mentioned in the paper; see the short sketch below.
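To make the mapping explicit, here is a minimal sketch of the arithmetic (illustrative only, not code from the repo; the variable names are made up for this example):

```python
# Mapping the per-GPU config values to the "effective" values in the paper.
# Illustrative sketch only; variable names are not taken from rvt2.yaml.

per_gpu_batch_size = 24      # bs in rvt/configs/rvt2.yaml (per GPU)
num_gpus = 8                 # 8 x V100, as in the paper
per_sample_lr = 1.25e-5      # lr in the config, specified per sample

effective_batch_size = per_gpu_batch_size * num_gpus   # 24 * 8 = 192
effective_lr = per_sample_lr * effective_batch_size    # 1.25e-5 * 192 = 2.4e-3

print(effective_batch_size, effective_lr)              # 192 0.0024
```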
Hope this helps!
Best, Ankit
Hi Ankit,
Thanks for your valuable work.
I have a question about the `steps`: why are 80k steps equivalent to ~100 epochs, and how is this conversion calculated?
80k steps, ohhh. Each GPU iterates 80 steps, is that right?
Hi,
Apologies for the delayed response. Here are the relevant concepts:
- training_iterations: this is defined to match peract; loosely, it is the number of iterations in each epoch.
https://github.com/NVlabs/RVT/blob/367995a1a2169b6352bf4e8b0ed405890462a3a0/rvt/train.py#L173
So, for us, training_iterations is (16000 / 192) ~ 80.
- total steps: training_iterations * epochs
https://github.com/NVlabs/RVT/blob/367995a1a2169b6352bf4e8b0ed405890462a3a0/rvt/train.py#L265-L271
So, for us, the total steps for 100 epochs is ~80k.
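Putting these together as a minimal sketch (not the actual train.py code; the 16000 samples-per-epoch figure and the variable names are assumptions taken from this thread):

```python
# Illustrative sketch of the two quantities above; not the actual train.py code.

samples_per_epoch = 16000        # chosen to match peract (per this thread)
per_gpu_batch_size = 24
num_gpus = 8
effective_batch_size = per_gpu_batch_size * num_gpus               # 24 * 8 = 192

# iterations per epoch (the thread's training_iterations)
training_iterations = samples_per_epoch // effective_batch_size    # 16000 // 192 ~ 80

def total_steps(epochs: int) -> int:
    """Total optimizer steps over a run: training_iterations * epochs."""
    return training_iterations * epochs

print(training_iterations)   # iterations per epoch, ~80
```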
Hope this helps.
Best, Ankit