feat: implement profiler in Core API [MD-10] [MD-302]
Description
Implement the system metric profiling functionality in Core API.
This is a complete rewrite of the old ProfilerAgent. Timing metrics functionality was removed, and system metrics are now reported to the generic metrics backend.
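As a rough mental model only, the on/off behavior described above can be pictured as a background sampler that pushes readings to a reporting callback. The sketch below is not the code in this PR: `SamplingProfiler`, `sample_fn`, `report_fn`, and the `"system"` group name are all hypothetical, and `report_fn` merely stands in for whatever call sends data to the generic metrics backend.

```python
import threading
import time


class SamplingProfiler:
    """Hypothetical on/off sampler; an illustration, not the Core API implementation."""

    def __init__(self, sample_fn, report_fn, interval=0.01):
        self._sample_fn = sample_fn    # returns a dict of system metrics (assumed shape)
        self._report_fn = report_fn    # stand-in for the generic metrics backend
        self._interval = interval      # seconds between samples
        self._stop = threading.Event()
        self._thread = None

    def on(self):
        if self._thread is not None:
            return  # already running
        self._stop.clear()
        self._thread = threading.Thread(target=self._run, daemon=True)
        self._thread.start()

    def off(self):
        if self._thread is None:
            return  # not running
        self._stop.set()
        self._thread.join()
        self._thread = None

    def _run(self):
        # Event.wait returns False on timeout and True once off() sets the event.
        while not self._stop.wait(self._interval):
            self._report_fn("system", self._sample_fn())


if __name__ == "__main__":
    profiler = SamplingProfiler(
        sample_fn=lambda: {"cpu_util": 0.0},  # made-up reading
        report_fn=lambda group, metrics: print(group, metrics),
    )
    profiler.on()
    time.sleep(0.05)
    profiler.off()
```

Using a `threading.Event` for the sleep means `off()` returns promptly instead of waiting out a full sampling interval.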
Test Plan
Because this PR contains only the harness/python changes, testing must be done manually and requires direct access to the database. After exercising each of the profiler entrypoints listed in this section, query the database and confirm that the new system metrics data is present:
select * from metrics where trial_id=TRIALID and partition_type='PROFILING';
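For repeated checks, the query above could be wrapped in a small helper. This is a minimal sketch under stated assumptions: it uses an in-memory `sqlite3` database as a stand-in for the real database, the demo table contains only the two columns the query relies on, and the rows are made up. `count_profiling_rows` and `make_demo_db` are hypothetical names, not part of Determined.

```python
import sqlite3


def count_profiling_rows(conn, trial_id):
    """Return how many PROFILING-partition metric rows exist for a trial."""
    cur = conn.execute(
        "SELECT COUNT(*) FROM metrics "
        "WHERE trial_id = ? AND partition_type = 'PROFILING'",
        (trial_id,),
    )
    return cur.fetchone()[0]


def make_demo_db():
    """Build a stand-in metrics table (assumed, simplified schema)."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE metrics (trial_id INTEGER, partition_type TEXT)")
    conn.executemany(
        "INSERT INTO metrics VALUES (?, ?)",
        [(1, "PROFILING"), (1, "PROFILING"), (1, "TRAINING"), (2, "VALIDATION")],
    )
    return conn


if __name__ == "__main__":
    conn = make_demo_db()
    # Fail loudly if the trial produced no system metrics.
    assert count_profiling_rows(conn, 1) > 0, "no profiling metrics recorded"
    print("profiling rows for trial 1:", count_profiling_rows(conn, 1))
```

Against a real PostgreSQL connection the placeholder style differs (for example, `psycopg2` uses `%s` rather than `?`), so the query string would need adjusting.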
Core API
import logging
import time

import determined as det
from determined import core


def main(core_context):
    # Start collecting system metrics for the duration of the loop.
    core_context.profiler.on()
    for batch in range(100):
        steps_completed = batch + 1
        # Report training metrics every 5 steps.
        if steps_completed % 5 == 0:
            core_context.train.report_training_metrics(
                steps_completed=steps_completed, metrics={"x": batch}
            )
        # Report validation metrics every 10 steps.
        if steps_completed % 10 == 0:
            core_context.train.report_validation_metrics(
                steps_completed=steps_completed, metrics={"x": batch}
            )
        time.sleep(1)
    # Stop collecting system metrics.
    core_context.profiler.off()


if __name__ == "__main__":
    logging.basicConfig(level=logging.DEBUG, format=det.LOG_FORMAT)
    with core.init() as core_context:
        main(core_context=core_context)
Trainer API (PyTorch)
Run the MNIST example in examples/tutorials/mnist_pytorch with trainer.fit(...profiling_enabled=True)
TFKeras (harness)
Submit a TFKeras experiment with profiling enabled in the experiment config:
profiling:
  enabled: true
Commentary (optional)
Checklist
- [ ] Changes have been manually QA'd
- [ ] User-facing API changes need the "User-facing API Change" label.
- [ ] Release notes should be added as a separate file under docs/release-notes/. See Release Note for details.
- [ ] Licenses should be included for new code which was copied and/or modified from any external code.