
Feature Request : Printing information about recent runs via CLI

Open valayDave opened this issue 4 years ago • 1 comments

Context :

The resume functionality in Metaflow has a few nuances that become noticeable once you use it often. When resuming a flow, Metaflow clones the artifacts of the flow being resumed. If the flow is really large and I cancel the resume before it has cloned everything, then the very next python myflow.py resume will resume from that partially "resumed" run.

User Story :

Sometimes when I am resuming a flow, I suddenly have a new idea or realize I made a mistake. At such a moment I cancel that resume without paying any attention to the run-id (and many times I even clear my shell before I remember that I need the right run-id to resume again). When I then immediately resume the flow, it isn't the run I "intended" to resume. This has sometimes caused a cascade of wrong resumes. At that point, I have to go to my notebook/Python and run some custom code to check which run-id I should resume from.

This is a pain point I have faced many times because I don't pay attention to the run-id. A CLI command that shows all recent runs would make this problem go away :)

valayDave avatar Jul 03 '21 06:07 valayDave

Notes

From Savin:

we can definitely introduce python flow.py list runs

a bigger question would be the format of the output. You can take a look at python flow.py batch list (when running tasks on AWS Batch) for inspiration.

One clue: when I run this command and there are no runs, I get the message No running AWS Batch jobs found., which at least gives me a string to search for so I can study that bit of code as an example.

The output of python flow.py batch list looks like this

Metaflow 2.6.0.post3+git3acd855 executing MyFlow for user:hamel
hamel-MyFlow-19901-start-240953-0 [450099f2-01c1-4b75-9efb-74df86790009] (SUBMITTED)

What is the code path for python flow.py batch list? How does it get the list of jobs that are currently running in Batch?

To figure this out, I run viztracer --vdb --max_stack_depth 15 --include_files ~/github/metaflow -- flow.py batch list

This gives me the following call stack:


1. invoke (metaflow/_vendor/click/core.py:1221)
2. builtins.exec
3. <module> (flow.py:1)
4. __init__ (metaflow/flowspec.py:81)
5. main (metaflow/cli.py:1087)
6. __call__ (metaflow/_vendor/click/core.py:827)
7. main (metaflow/_vendor/click/core.py:716)
8. invoke (metaflow/_vendor/click/core.py:1060)
9. invoke (metaflow/_vendor/click/core.py:572)
10. new_func (metaflow/_vendor/click/decorators.py:20)
11. list (metaflow/plugins/aws/batch/batch_cli.py:52)
12. _execute_cmd (metaflow/plugins/aws/batch/batch_cli.py:29)
13. list_jobs (metaflow/plugins/aws/batch/batch.py:123)
14. _search_jobs (metaflow/plugins/aws/batch/batch.py:92)
15. <genexpr> (metaflow/plugins/aws/batch/batch_client.py:37)
16. <genexpr> (metaflow/plugins/aws/batch/batch_client.py:28)

To double-check my understanding, I searched the codebase for the error message No running AWS Batch jobs found. and indeed found it in two places, one of them in batch.batch.list_jobs (which is #13 in the call stack above).

All this information is probably overkill for what I'm trying to do, but I wanted to get some idea of the flow of at least one feature before I started to add more code somewhere.


In order to list the recent runs, I need to figure out how to get the name of the flow. By chance, I spot something in the CLI called ctx.obj.flow.name which seems like what I want.

Next step is to add list runs somehow to the CLI

One important thing to understand before adding a feature to Metaflow is its CLI. Metaflow uses advanced features of click, such as Custom Multi Commands, so it's worth reading the click documentation on those.

One potential source of confusion is where exactly to add this new feature: should I create a separate module within the Metaflow codebase or add it to an existing one? I decided to make it a separate module for now in order to make forward progress, and to create a CLI that responds with Hello World! when calling python flow.py list runs.

Here is a minimal example implementation of adding a new CLI command.

To print something, I noticed from studying other interfaces that there is an echo method on the obj that ctx passes around, so I am using it to conform to the design pattern used elsewhere in the code:

  1. In metaflow/plugins add a file list_cli.py:
from metaflow._vendor import click

@click.group()
def cli():
    pass


@cli.group(help="List recent entities pertaining to your flow.")
def list():
    pass


def _execute_cmd(echo):
    echo("Hello World!")


@list.command(help="List recent runs for your flow.")
@click.pass_context
def runs(ctx):
    _execute_cmd(ctx.obj.echo)
  2. Register the new CLI in metaflow/plugins/__init__.py, in a function called get_plugin_cli:
def get_plugin_cli():
    # it is important that CLIs are not imported when
    # __init__ is imported. CLIs may use e.g.
    # parameters.add_custom_parameters which requires
    # that the flow is imported first

    # Add new CLI commands in this list
    from . import package_cli
    from .aws.batch import batch_cli
    from .kubernetes import kubernetes_cli
    from .aws.step_functions import step_functions_cli
    from .argo import argo_workflows_cli
    from .cards import card_cli
+    from . import list_cli

    return _ext_plugins["get_plugin_cli"]() + [
        package_cli.cli,
        batch_cli.cli,
        card_cli.cli,
        kubernetes_cli.cli,
        step_functions_cli.cli,
        argo_workflows_cli.cli,
+        list_cli.cli
    ]


The next step is to list the runs of the current flow instead of printing Hello World! :) So how do we do this? I could use the Metaflow client API, but should I import the user-facing API or use something internal? After some digging, I couldn't find a private API for this, nor could I think of an existing CLI feature that needs something like the client API, so I'll proceed with the public one. (I could be wrong on this.)


To list runs, here is a rough first sketch. Most of the work is done inside _execute_cmd, which tries to follow design patterns I have found elsewhere in the code. I also added an option to specify --num-runs.

  • I don't think we are using f-strings, for backward Python compatibility, so I avoided those.
  • For printing information about runs, I followed the (...) and [...] conventions I gleaned from elsewhere in the code as closely as I could.
  • I wanted to fail gracefully if no run was found, which is why I am catching the MetaflowNotFound exception. I am not sure this is the right choice, but it is probably better discussed in a pull request.
from metaflow._vendor import click
from metaflow import Flow
from metaflow.exception import MetaflowNotFound


@click.group()
def cli():
    pass


@cli.group(help="List objects pertaining to your flow.")
def list():
    pass


def _execute_cmd(echo, flow_name, num_runs):
    found = False
    counter = 1
    try:
        flow = Flow(flow_name)
    except MetaflowNotFound:
        flow = None

    if flow is not None:
        # Reuse the Flow object created above instead of constructing it again.
        for run in flow.runs():
            found = True
            if counter > num_runs:
                break
            counter += 1
            echo(
                "{created} {name} [{id}] (Successful:{status} Finished:{finished})".format(
                    created=run.created_at,
                    name=flow_name,
                    id=run.id,
                    status=run.successful,
                    finished=run.finished,
                )
            )

    if not found:
        echo("No runs found for flow: {name}".format(name=flow_name))


@list.command(help="List recent runs for your flow.")
@click.option(
    "--num-runs",
    default=10,
    help="Number of runs to show.",
)
@click.pass_context
def runs(ctx, num_runs):
    _execute_cmd(ctx.obj.echo, ctx.obj.flow.name, num_runs)
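
The counter bookkeeping in _execute_cmd could also be expressed with itertools.islice. The following is only a sketch of that variation, not the code proposed for Metaflow: FakeRun is a hypothetical stand-in for the client API's run objects so the snippet runs without a Metaflow datastore, format_recent_runs is an illustrative name, and the sample runs are made up.

```python
from collections import namedtuple
from itertools import islice

# Hypothetical stand-in for a Metaflow run, exposing only the attributes
# the listing uses. With Metaflow available, Flow(flow_name).runs() would
# be passed in instead of the sample data below.
FakeRun = namedtuple("FakeRun", ["id", "created_at", "successful", "finished"])


def format_recent_runs(flow_name, runs, num_runs):
    """Return one formatted line per run, for at most num_runs runs."""
    lines = []
    # islice stops after num_runs items without consuming the rest of the iterator.
    for run in islice(runs, num_runs):
        lines.append(
            "{created} {name} [{id}] (Successful:{status} Finished:{finished})".format(
                created=run.created_at,
                name=flow_name,
                id=run.id,
                status=run.successful,
                finished=run.finished,
            )
        )
    if not lines:
        lines.append("No runs found for flow: {name}".format(name=flow_name))
    return lines


# Made-up sample data, newest run first.
sample_runs = [
    FakeRun("3", "2022-05-09 13:05:00", True, True),
    FakeRun("2", "2022-05-08 09:30:00", False, True),
    FakeRun("1", "2022-05-07 17:45:00", True, True),
]
for line in format_recent_runs("MyFlow", iter(sample_runs), 2):
    print(line)
```

Returning the lines instead of calling echo directly keeps the formatting easy to check in isolation; the CLI command would then just echo each line.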

The next step is to figure out how to add tests.


@valayDave helped me figure out how tests work. Some useful pointers: you have to subclass a special kind of FlowSpec called MetaflowTest to create a flow that will test Metaflow. Yes, Metaflow to test Metaflow 🚀. At a high level, the testing framework takes a Cartesian product of many features, so that it tests all combinations of, for example, different types of flows and decorators.
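
To make the Cartesian combination concrete, here is a minimal sketch using itertools.product. The graph names, test names, and contexts are illustrative placeholders I picked for the example, not the actual lists used by Metaflow's test suite.

```python
from itertools import product

# Illustrative placeholders, not the real lists used by Metaflow's tests.
graphs = ["linear", "branch", "foreach-nested"]
tests = ["ArtifactTest", "ListRunsTest"]
contexts = ["local", "batch"]

# Every (graph, test, context) triple corresponds to one generated flow to run.
combinations = list(product(graphs, tests, contexts))

print(len(combinations))
for graph, test, context in combinations[:3]:
    print("{0} on graph {1} in context {2}".format(test, graph, context))
```

The point is that a single new test class gets exercised against every graph shape and execution context, which is why the number of generated flows grows quickly.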

When you see a step with the decorator @steps(0, ["foreach-nested-inner"]), its body is something that happens in each matching step for all flows of type foreach-nested-inner. It happens during the execution of the flow and is added onto the steps.
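
As a rough mental model of how such a decorator can attach a body to matching steps, here is a simplified stand-in. It is not Metaflow's test harness: this steps function, ExampleTest, and pick_body are hypothetical, and the rules that "all" matches every step and that the lowest number wins are assumptions made for the sketch.

```python
def steps(prio, quals):
    """Simplified stand-in: record a priority and qualifiers on the function."""
    def wrapper(func):
        func.prio = prio
        func.quals = set(quals)
        return func
    return wrapper


class ExampleTest(object):
    # Hypothetical test class; a real one would subclass MetaflowTest.
    @steps(0, ["foreach-nested-inner"])
    def inner(self):
        return "inner body"

    @steps(1, ["all"])
    def step_all(self):
        return "default body"


def pick_body(test_cls, step_quals):
    """Pick the matching body with the lowest priority number (assumed rule)."""
    candidates = [
        func
        for func in vars(test_cls).values()
        if callable(func)
        and hasattr(func, "quals")
        and ("all" in func.quals or func.quals & set(step_quals))
    ]
    return min(candidates, key=lambda func: func.prio)


# A step qualified as foreach-nested-inner gets the specific body;
# any other step falls back to the catch-all body.
print(pick_body(ExampleTest, ["foreach-nested-inner"]).__name__)
print(pick_body(ExampleTest, ["linear"]).__name__)
```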

The check_results method of a MetaflowTest is executed after the various flows have run, so you can use it to inspect results once the flows are complete, which is what I need in this case.

hamelsmu avatar May 09 '22 13:05 hamelsmu