Default num_canonical_nodes to an even multiple of num_physical_nodes
I'm not sure exactly which math is the problem, but get_partitions will error out if num_canonical_nodes / num_physical_nodes is not a whole number. This could be resolved by making the default conditional, i.e.
pn = num_physical_nodes
# Round 120 down to a multiple of pn, then add pn, so the default is always a multiple of pn.
num_canonical_nodes = num_canonical_nodes or (120 // pn * pn + pn)
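As a quick sanity check of the proposed default, here is a minimal sketch. The helper name `default_canonical_nodes` is hypothetical and not part of streaming; it only wraps the expression above so it can be tried for a few node counts.

```python
def default_canonical_nodes(num_physical_nodes, num_canonical_nodes=None):
    """Hypothetical helper: keep an explicit value, else pick a multiple of pn above 120."""
    pn = num_physical_nodes
    # 120 // pn * pn is the largest multiple of pn that is <= 120;
    # adding pn gives the smallest multiple of pn strictly greater than 120.
    return num_canonical_nodes or (120 // pn * pn + pn)


for pn in (1, 6, 7, 8, 64):
    n = default_canonical_nodes(pn)
    assert n % pn == 0  # the default always divides evenly across physical nodes
    print(f'{pn} physical nodes -> {n} canonical nodes')
```

For 8 or 64 physical nodes this reproduces the current 128, and for 6 nodes it yields 126 instead.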
Here is an example I hit when attempting to train the 350M GPT example on 6 nodes:
get_partitions(
    num_samples=364672,
    num_canonical_nodes=128,
    num_physical_nodes=6,
    ranks_per_node=4,
    workers_per_rank=1,
    batch_size=6,
)
# => ValueError: cannot reshape array of size 364672 into shape (6)
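The failing call has 128 % 6 == 2, which the reshape message does not mention. Below is a minimal sketch of a pre-flight check that reports this directly. It assumes, as described in this issue, that num_canonical_nodes must be divisible by num_physical_nodes; `check_node_counts` is a hypothetical helper, not part of streaming.

```python
def check_node_counts(num_canonical_nodes: int, num_physical_nodes: int) -> None:
    """Hypothetical pre-check; not part of the streaming library."""
    remainder = num_canonical_nodes % num_physical_nodes
    if remainder != 0:
        raise ValueError(
            f'num_canonical_nodes ({num_canonical_nodes}) must be a multiple of '
            f'num_physical_nodes ({num_physical_nodes}); remainder is {remainder}.'
        )


check_node_counts(126, 6)  # passes silently: 126 == 21 * 6

try:
    check_node_counts(128, 6)  # the values from the failing call
except ValueError as err:
    print(err)
```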
Thanks @micimize for raising this. The error message originates from the streaming repository, and it's not descriptive enough to tell the user what the actual issue is. The streaming repository will fix this with a better error message in the upcoming release.
@karan6181 that would improve things, but I'm also wondering whether my approach for defaulting canonical nodes would be better than the hardcoded 128. It's not really clear to me what the parameter does beyond that a high number is important for the improved algorithm for some reason
https://github.com/mosaicml/examples/blob/132ec02cc2c75e66a410f869b75697a6219fedae/examples/common/text_data.py#L61
It's not really clear to me what the parameter does beyond that a high number is important for the improved algorithm for some reason
Canonical nodes is the number of nodes you partition the sample space over. It stays the same even if the number of physical nodes changes, and it is used to create an elastically deterministic sample order.
Your samples get laid out according to canonical nodes and then folded over onto physical nodes, so one count has to divide evenly into the other. Otherwise you would get weird interleaving/striping of shards across nodes, which would result in all shards being downloaded to all nodes. That is very bad and non-obvious.
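To make the folding idea concrete, here is a toy sketch. It is not streaming's implementation: the function name `fold` is hypothetical, and assigning canonical nodes to physical nodes in contiguous groups is an assumption chosen for illustration. The point is only that the canonical layout stays fixed while the grouping changes with the physical node count.

```python
def fold(num_canonical_nodes: int, num_physical_nodes: int) -> dict:
    """Toy model (assumption): give each physical node a contiguous group of canonical nodes."""
    if num_canonical_nodes % num_physical_nodes != 0:
        raise ValueError('canonical nodes do not fold evenly onto physical nodes')
    ratio = num_canonical_nodes // num_physical_nodes
    return {
        p: list(range(p * ratio, (p + 1) * ratio))
        for p in range(num_physical_nodes)
    }


# Same 8 canonical nodes, folded onto 2 and then 4 physical nodes.
print(fold(8, 2))
print(fold(8, 4))
```

In this toy model each canonical node lands on exactly one physical node, so a shard owned by a canonical node only needs to be present on one machine. With 8 canonical and 3 physical nodes, no such even grouping exists.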
To see the impact of changing the parameters to get_partitions, you can visualize it using this script:
git clone https://github.com/mosaicml/streaming/
cd streaming/
pip3 install --user -e ".[dev]"
make web &
open http://localhost:1337/