
Python: Add a CLI to go through the catalog

Open Fokko opened this issue 3 years ago • 3 comments

A CLI using Click for argument parsing, and Rich for nice formatting. When installing pyiceberg, the pyiceberg executable will be available on the PATH to go through the catalog.

➜  python git:(fd-python-cli) ✗ pyiceberg list-namespace                   
 Namespaces  
┏━━━━━━━━━━━┓
┃ Namespace ┃
┡━━━━━━━━━━━┩
│ default   │
│ examples  │
│ fokko     │
│ system    │
└───────────┘
➜  python git:(fd-python-cli) ✗ pyiceberg list-tables fokko
     Tables     
┏━━━━━━━━━━━━━━┓
┃ Table name   ┃
┡━━━━━━━━━━━━━━┩
│ fokko.fokko  │
│ fokko.fokko2 │
│ fokko.fokko3 │
└──────────────┘
➜  python git:(fd-python-cli) ✗ pyiceberg load-namespace fokko                  
    fokko     
  properties  
┌──────┬─────┐
│ prop │ yes │
└──────┴─────┘
➜  python git:(fd-python-cli) ✗ pyiceberg list-tables examples 
       Tables        
┏━━━━━━━━━━━━━━━━━━━┓
┃ Table name        ┃
┡━━━━━━━━━━━━━━━━━━━┩
│ examples.fooshare │
└───────────────────┘
➜  python git:(fd-python-cli) ✗ pyiceberg load-table examples.fooshare 
                                                                                  examples.fooshare                                                                                  
┌──────────────────────┬────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┐
│ Table format version │ 1                                                                                                                                                          │
│ Metadata location    │ s3://tabular-public-us-west-2-dev/bb30733e-8769-4dab-aa1b-e76245bb2bd4/b55d9dda-6561-423a-8bfc-787980ce421f/metadata/00001-5f2f8166-244c-4eae-ac36-384ecd… │
│ Table UUID           │ b55d9dda-6561-423a-8bfc-787980ce421f                                                                                                                       │
│ Last Updated         │ 1646787054459                                                                                                                                              │
│ Partition spec       │ []                                                                                                                                                         │
│ Sort order           │ order_id=0 fields=[]                                                                                                                                       │
│ Schema               │ Schema                                                                                                                                                     │
│                      │ ├── 1: id: optional int                                                                                                                                    │
│                      │ └── 2: data: optional string                                                                                                                               │
│ Snapshots            │ Snapshots                                                                                                                                                  │
│                      │ └── Snapshot 0:                                                                                                                                            │
│                      │     s3://tabular-public-us-west-2-dev/bb30733e-8769-4dab-aa1b-e76245bb2bd4/b55d9dda-6561-423a-8bfc-787980ce421f/metadata/snap-3497810964824022504-1-c4f68… │
│ Properties           │ ┌──────────────────────────────────┬───────┐                                                                                                               │
│                      │ │ owner                            │ bryan │                                                                                                               │
│                      │ │ write.metadata.compression-codec │ gzip  │                                                                                                               │
│                      │ └──────────────────────────────────┴───────┘                                                                                                               │
└──────────────────────┴────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┘
➜  python git:(fd-python-cli) ✗ pyiceberg rename-table fokko.fokko fokko.renamed
Table fokko.fokko has been renamed to fokko.renamed
➜  python git:(fd-python-cli) ✗ pyiceberg list-tables fokko   
     Tables      
┏━━━━━━━━━━━━━━━┓
┃ Table name    ┃
┡━━━━━━━━━━━━━━━┩
│ fokko.fokko2  │
│ fokko.fokko3  │
│ fokko.renamed │
└───────────────┘
➜  python git:(fd-python-cli) ✗ pyiceberg drop-table fokko.renamed
Table fokko.renamed has been dropped

Fokko avatar Aug 02 '22 10:08 Fokko

I like this a lot, but I think we should make some adjustments so that the output is usable with other CLI tools. It feels weird to work with a CLI that pretty-prints and can't be used in combination with awk and grep easily. I made some comments on the implementation for this. I think just using Table.grid in most places is a good compromise between the trade-offs. And this should route error text to stderr as well.
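To make the point about awk/grep-friendly output and stderr concrete, here is a minimal sketch. It uses only the standard library rather than Rich's Table.grid, and the helper names `print_rows` and `print_error` are hypothetical, not part of pyiceberg:

```python
import sys


def print_rows(rows, out=None):
    """Write one record per line, tab-separated, with no borders or colors."""
    out = sys.stdout if out is None else out
    for row in rows:
        out.write("\t".join(str(cell) for cell in row) + "\n")


def print_error(message, err=None):
    """Send diagnostics to stderr so they never mix with piped data."""
    err = sys.stderr if err is None else err
    err.write(f"error: {message}\n")


if __name__ == "__main__":
    # Data goes to stdout, so `... | grep fokko3` only ever sees table names.
    print_rows([("fokko.fokko2",), ("fokko.fokko3",)])
    # The error goes to stderr and stays out of the pipe.
    print_error("Table does not exist: fokko.missing")
```

With output shaped like this, a pipeline only sees the data lines, while failures still show up in the terminal.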

I also thought about the command structure for a while. I definitely prefer not to mirror the API directly, unless we have an API where that is the expectation (like, the aws s3api functionality). Most people will use this CLI to do something, like explore a catalog or find information on a table. I think the CLI should be written to make those use cases effective and easy.

For example, currently there's a load-table command that corresponds to load_table in the catalog API. But loading a table is just a first step in a series of actions to do something useful. It doesn't make sense to me to load a table as a CLI command because the caller is coming with a purpose beyond that first step, like looking at the columns of a table and their documentation. It may be that the user wants to see the schema, the partitioning, table properties, or maybe a summary of everything. All of those cases load the table, but showing the same information each time is distracting.

I think that the CLI should have distinct commands that are more focused on a purpose, like schema to show a table schema, or properties to show table or namespace properties. Like these, for example:

pyiceberg schema db.table
pyiceberg spec db.table
pyiceberg order db.table
pyiceberg uuid db.table
pyiceberg location db.table
pyiceberg properties db.table

We probably do want a summary or describe command that shows more information.
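A rough sketch of how purpose-focused commands like the ones listed above could be wired up. To stay dependency-free it uses argparse from the standard library instead of Click, and `FAKE_TABLES`, `build_parser`, `run`, and the bucket name are all hypothetical stand-ins for a real catalog lookup:

```python
import argparse

# Hypothetical in-memory stand-in for tables; a real CLI would load them from the catalog.
FAKE_TABLES = {
    "examples.fooshare": {
        "schema": "1: id: optional int\n2: data: optional string",
        "uuid": "b55d9dda-6561-423a-8bfc-787980ce421f",
        "location": "s3://example-bucket/examples/fooshare",  # made-up location
    }
}

# Each entry becomes its own subcommand: `schema db.table`, `uuid db.table`, ...
FIELDS = ("schema", "uuid", "location")


def build_parser():
    parser = argparse.ArgumentParser(prog="pyiceberg-sketch")
    commands = parser.add_subparsers(dest="command", required=True)
    for field in FIELDS:
        sub = commands.add_parser(field, help=f"Show the table {field}")
        sub.add_argument("identifier", help="Table identifier, e.g. db.table")
    return parser


def run(argv):
    """Parse the arguments, load the table, and return only the requested part."""
    args = build_parser().parse_args(argv)
    table = FAKE_TABLES[args.identifier]  # every command loads the table first
    return table[args.command]


if __name__ == "__main__":
    print(run(["schema", "examples.fooshare"]))
```

Every command still loads the table, but the user only sees the piece they asked for.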

It's also strange for the caller to tell the API what type of object is being operated on or expected. That's another artifact of mirroring the catalog API. When I use ls, I don't need to tell the command that I want to list a directory or a symlinked directory. I also don't need to tell ls what I'm looking for; it just lists everything. I think pyiceberg should work the same way. Rather than having list-tables and list-namespaces, I think we should have a single list command that shows the output of both API calls. We can use [bold blue] to format namespaces and use filters, like --tables, to restrict the type of objects shown. Commands that I think would work with both namespaces and tables are describe, properties, set, and remove:

pyiceberg describe db
pyiceberg describe db.table
pyiceberg properties db
pyiceberg properties db.table
pyiceberg set db properties a=b
pyiceberg set db.t1 properties a=b
pyiceberg remove db properties a b
pyiceberg remove db.t1 properties a b

The exceptions to that are the create and drop commands, which should probably be explicit about what you're dropping (like rm vs rmdir): pyiceberg drop table db.table.
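One way the "don't make the user name the object type" idea could look, as a sketch only. The `CATALOG` dict and every function name below are hypothetical and stand in for real catalog calls:

```python
# Hypothetical in-memory catalog: namespaces and tables each carry a properties dict.
CATALOG = {
    "namespaces": {"db": {"owner": "bryan"}, "db.staging": {}},
    "tables": {"db.t1": {}, "db.t2": {}},
}


def kind_of(identifier):
    """Work out what an identifier refers to, so the user never has to say."""
    if identifier in CATALOG["tables"]:
        return "table"
    if identifier in CATALOG["namespaces"]:
        return "namespace"
    raise KeyError(f"No such namespace or table: {identifier}")


def properties(identifier):
    """Return the properties of a namespace or a table, whichever it is."""
    return CATALOG[kind_of(identifier) + "s"][identifier]


def set_property(identifier, key, value):
    properties(identifier)[key] = value


def remove_properties(identifier, *keys):
    props = properties(identifier)
    for key in keys:
        props.pop(key, None)  # ignore keys that are not set


def _children(names, namespace):
    # Direct children only: everything before the last dot must equal the parent.
    return sorted(name for name in names if name.rpartition(".")[0] == namespace)


def list_all(namespace, tables_only=False):
    """Single `list`: (kind, name) pairs for both namespaces and tables."""
    entries = []
    if not tables_only:
        entries += [("namespace", n) for n in _children(CATALOG["namespaces"], namespace)]
    entries += [("table", n) for n in _children(CATALOG["tables"], namespace)]
    return entries
```

Here `list_all("db", tables_only=True)` plays the role of a `--tables` filter, and the kind in each pair is what a formatter could use to render namespaces differently from tables.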

rdblue avatar Aug 07 '22 21:08 rdblue

I think we're almost there. I marked it as a draft because it is still a bit dirty. For example, setting properties is still missing, and it had some rough edges. Would be great if we can merge in some of the other PRs that also allow us to clean this one up:

Also, have to add some tests because we're below the required test coverage. Thinking of injecting a mock catalog.
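A minimal sketch of what injecting a mock catalog could look like, using `unittest.mock` from the standard library. The `list_tables_command` helper is hypothetical, and it assumes the catalog exposes a `list_tables(namespace)` method that returns identifier tuples:

```python
from unittest.mock import MagicMock


def list_tables_command(catalog, namespace):
    """Format the tables in a namespace as dotted names, one per line."""
    identifiers = catalog.list_tables(namespace)  # assumed catalog method
    return "\n".join(".".join(identifier) for identifier in identifiers)


def test_list_tables_command():
    # The mock replaces a real catalog, so no network or credentials are needed.
    catalog = MagicMock()
    catalog.list_tables.return_value = [("fokko", "fokko2"), ("fokko", "fokko3")]

    assert list_tables_command(catalog, "fokko") == "fokko.fokko2\nfokko.fokko3"
    catalog.list_tables.assert_called_once_with("fokko")


if __name__ == "__main__":
    test_list_tables_command()
```

Because the command takes the catalog as an argument, the test can pass in the mock directly instead of patching a global.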

Fokko avatar Aug 10 '22 15:08 Fokko

Okay, I think I've reviewed all of the PRs. Lots of stuff going in!

rdblue avatar Aug 10 '22 18:08 rdblue

Added a gazillion tests; hopefully this will bump the coverage up to 90%+.

For the JSON API, we could also return only pydantic models, and we get an OpenAPI spec for free from that :) Might be a bit too much for this stage, but I would like to share the idea to see what others think.

Fokko avatar Aug 13 '22 21:08 Fokko

Looks great! I know we're still waiting on config, but this is usable right now so I merged it.

rdblue avatar Aug 15 '22 22:08 rdblue

@rdblue Fine by me, thanks for merging 👍🏻

Fokko avatar Aug 16 '22 06:08 Fokko