litgpt
[TPU] added numbers for TPU FLOPS
PR to add FLOPS numbers for benchmarking when using TPUs.
Follow up to https://github.com/Lightning-AI/lit-gpt/pull/147#discussion_r1230009025
cc @jekbradbury
@carmocca please take a look at the code snippet below:
>>> from torch_xla.experimental import tpu
>>> tpu.get_tpu_env()
{'ACCELERATOR_TYPE': 'v4-8', 'AGENT_BOOTSTRAP_IMAGE': 'REDACTED', 'ALT': 'false', 'CHIPS_PER_HOST_BOUNDS': '2,2,1', 'COLLECTD_DOCKER_URL': 'REDACTED', 'CONSUMER_PROJECT_ID': 'mlperf-high-priority-project', 'CONSUMER_PROJECT_NUMBER': '903354779218', 'CONTROL_MESSAGE_SOURCE': 'pubsub', 'ENABLE_ICI_RESILIENCY': 'false', 'ENABLE_IMPROVED_REROUTE_ALLREDUCE_STRATEGY': 'false', 'ENABLE_MEMCACHED': 'false', 'FLUENTD_DOCKER_URL': 'REDACTED', 'HEALTH_AGENT_DOCKER_URL': 'REDACTED', 'HOST_BOUNDS': '1,1,1', 'INFERENCE_MODE': 'false', 'INJECT_SLICE_BUILDER_FAULT': '', 'INTERNAL': 'true', 'MAINTENANCE_ACTION_FLAG': 'unhealthy-maintenance', 'MEMCACHED_DOCKER_URL': '', 'MONITORING_AGENT_DOCKER_URL': 'REDACTED', 'NODE_ID': 'REDACTED', 'PREEMPTIBLE': 'false', 'REPORTING_MODE': 'pubsub-and-metadata', 'RUNTIME_MONITOR_DOCKER_URL': 'REDACTED', 'RUNTIME_VERSION': 'REDACTED', 'RUNTIME_VERSION_CHANGER_DOCKER_URL': 'REDACTED', 'SERVICE_NAME': 'tpu.googleapis.com', 'SOURCE': '', 'TOPOLOGY': '2x2x1', 'TPU_CHIPS_PER_PROCESS_BOUNDS': '2,2,1', 'TPU_PROCESS_BOUNDS': '1,1,1', 'TPU_TOPOLOGY_ALT': 'false', 'TPU_TOPOLOGY_WRAP': 'false,false,false', 'TYPE': 'V4', 'UID': 'REDACTED', 'USE_DIRECT_PATH': 'false', 'WORKER_ID': '0', 'WRAP': 'false,false,false', 'ZONE': 'us-central2-b'}
The value for the ACCELERATOR_TYPE key is v4-8, while the value for the TYPE key is V4. Do we also want the number of cores (i.e., the 8 in v4-8)? If not, I think we can just stick with TYPE.
Oh yes, you're right. We multiply this number by the world size, so we don't want the number of cores: https://github.com/Lightning-AI/lit-gpt/blob/main/lit_parrot/speed_monitor.py#L223
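For reference, a minimal sketch of the lookup being discussed: read the generation from the `TYPE` key, map it to a per-device peak, and scale by world size. The table name, the helper names, and the FLOPS figures are assumptions for illustration (ballpark published bf16 peaks), not necessarily what `speed_monitor.py` ends up using, and the world size of 4 in the usage line is likewise assumed.

```python
# Hypothetical lookup table: approximate peak bf16 FLOPS per device, keyed by
# TPU generation. These figures are illustrative assumptions; verify them
# against the official TPU specifications before relying on them.
TPU_PEAK_FLOPS = {"v2": 45e12, "v3": 123e12, "v4": 275e12}


def tpu_generation(env: dict) -> str:
    """Return the generation (e.g. "v4") from a dict shaped like get_tpu_env()."""
    # TYPE is "V4"; ACCELERATOR_TYPE ("v4-8") also carries the core count,
    # which we do not want because world size is applied separately.
    return env["TYPE"].lower()


def available_flops(env: dict, world_size: int) -> float:
    """Total theoretical FLOPS: per-device peak multiplied by world size."""
    generation = tpu_generation(env)
    if generation not in TPU_PEAK_FLOPS:
        raise KeyError(f"No FLOPS entry for TPU generation {generation!r}")
    return TPU_PEAK_FLOPS[generation] * world_size


# Usage with the two relevant keys from the output above (world size assumed).
env = {"ACCELERATOR_TYPE": "v4-8", "TYPE": "V4"}
total = available_flops(env, world_size=4)
```

Keying on `TYPE` keeps the core count out of the table, so the multiplication by world size is the only place where the device count enters.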