
Default num_canonical_nodes to an even multiple of num_physical_nodes

Open micimize opened this issue 1 year ago • 3 comments

I'm not sure exactly which part of the math is at fault, but get_partitions errors out if num_canonical_nodes / num_physical_nodes is not a whole number. This could be resolved by making the default depend on the number of physical nodes, e.g.:

pn = num_physical_nodes
num_canonical_nodes = num_canonical_nodes or (120 // pn * pn + pn)

Example I saw when attempting to train a 350M gpt example on 6 nodes:

get_partitions(
    num_samples=364672,
    num_canonical_nodes=128,
    num_physical_nodes=6,
    ranks_per_node=4,
    workers_per_rank=1,
    batch_size=6
)
# =>ValueError: cannot reshape array of size 364672 into shape (6)
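For illustration, the proposed default can be written as a small helper. This is only a sketch of the suggestion in this comment, not code from the examples or streaming repositories; the function name `default_canonical_nodes` and the `target` parameter are hypothetical.

```python
from typing import Optional


def default_canonical_nodes(num_physical_nodes: int,
                            num_canonical_nodes: Optional[int] = None,
                            target: int = 120) -> int:
    """Hypothetical helper sketching the default proposed in this issue.

    If the caller passes num_canonical_nodes explicitly, it is returned
    unchanged. Otherwise the result is the smallest multiple of
    num_physical_nodes that is strictly greater than `target`.
    """
    if num_physical_nodes < 1:
        raise ValueError("num_physical_nodes must be a positive integer")
    if num_canonical_nodes:
        return num_canonical_nodes
    pn = num_physical_nodes
    # Round target down to a multiple of pn, then add one more pn.
    return target // pn * pn + pn


# The 6-node case from this issue gets 126 canonical nodes instead of 128.
print(default_canonical_nodes(6))   # 126
# With 8 physical nodes the formula reproduces the current default of 128.
print(default_canonical_nodes(8))   # 128
```

With this default, the result is always divisible by the number of physical nodes, so the 6-node configuration above would no longer hit the reshape error for that reason.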

micimize avatar Mar 08 '23 05:03 micimize

Thanks @micimize for raising this. The error message originates from the streaming repository, and it is not descriptive enough to tell the user what the actual issue is. The streaming repository will fix this with a better error message in the upcoming release.

karan6181 avatar Mar 09 '23 00:03 karan6181

@karan6181 that would improve things, but I'm also wondering whether my approach for defaulting canonical nodes would be better than the hardcoded 128. It's not really clear to me what the parameter does, beyond that a high number is important for the improved algorithm for some reason.

https://github.com/mosaicml/examples/blob/132ec02cc2c75e66a410f869b75697a6219fedae/examples/common/text_data.py#L61

micimize avatar Mar 09 '23 01:03 micimize

> It's not really clear to me what the parameter does beyond that a high number is important for the improved algorithm for some reason

The number of canonical nodes is how many nodes you partition the sample space over. It stays the same even if the number of physical nodes changes, and it is used to create an elastically deterministic sample order.

Your samples get laid out according to canonical nodes and then folded over onto physical nodes, so one count has to be a whole multiple of the other. Otherwise you would get weird interleaving/striping of shards across nodes, which would result in all shards being downloaded to all nodes. That is very bad and non-obvious.
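A toy model of that folding constraint, for intuition only. This is not the streaming library's implementation: the function names `is_foldable` and `toy_fold`, and the choice to assign contiguous groups of canonical nodes to each physical node, are assumptions made for the sketch.

```python
from typing import List


def is_foldable(num_canonical_nodes: int, num_physical_nodes: int) -> bool:
    """True if one node count is a whole multiple of the other."""
    big = max(num_canonical_nodes, num_physical_nodes)
    small = min(num_canonical_nodes, num_physical_nodes)
    return small > 0 and big % small == 0


def toy_fold(num_canonical_nodes: int, num_physical_nodes: int) -> List[List[int]]:
    """Assign canonical node ids to physical nodes in equal contiguous groups.

    Assumed layout for illustration only. Handles just the case where
    there are at least as many canonical nodes as physical nodes.
    """
    if num_canonical_nodes % num_physical_nodes:
        raise ValueError(
            f"{num_canonical_nodes} canonical nodes cannot be folded evenly "
            f"onto {num_physical_nodes} physical nodes")
    per_node = num_canonical_nodes // num_physical_nodes
    return [list(range(p * per_node, (p + 1) * per_node))
            for p in range(num_physical_nodes)]


print(is_foldable(128, 6))  # False: the configuration from this issue
print(is_foldable(126, 6))  # True
print(toy_fold(4, 2))       # [[0, 1], [2, 3]]
```

In this model every physical node owns a fixed set of canonical nodes, so it only needs the shards that belong to those canonical nodes. With 128 canonical nodes and 6 physical nodes no such even assignment exists.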

To see the impact of various changes in parameters to get_partitions, you can visualize it using this script:

git clone https://github.com/mosaicml/streaming/
cd streaming/
pip3 install --user -e ".[dev]"
make web &
open http://localhost:1337/

knighton avatar Mar 09 '23 01:03 knighton