lakeview
lakeview is a visibility tool for AWS S3 based data lakes.
Think of it as ncdu, but for Petabyte-scale data, on S3.
Instead of scanning billions of objects using the S3 API (which would require millions of API calls), lakeview uses Athena to query S3 Inventory Reports.
What can it do?
- Aggregate the sizes of directories* in S3, allowing you to drill down and find what is taking up space.
- Compare sizes between different dates: see how directory sizes change over time between different inventory reports.
- _Planned but not yet implemented:_ find the largest duplicates in your directories.
* S3, being an object store and not a filesystem, doesn't really have a notion of directories, but its API supports so-called "common prefixes".
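
To make the "common prefix" idea concrete, here is a minimal Python sketch of rolling object sizes up to the next delimiter. It is an illustration only, not lakeview's implementation (lakeview runs its aggregation in Athena); the function name `aggregate_by_common_prefix` and the sample keys and sizes are made up for the example.

```python
from collections import defaultdict


def aggregate_by_common_prefix(objects, prefix="", delimiter="/"):
    """Sum object sizes under each common prefix directly below `prefix`.

    `objects` is an iterable of (key, size_in_bytes) pairs. This is a
    hypothetical helper for illustration, not part of lakeview.
    """
    totals = defaultdict(int)
    for key, size in objects:
        if not key.startswith(prefix):
            continue
        rest = key[len(prefix):]
        head, sep, _ = rest.partition(delimiter)
        # Keys that contain the delimiter roll up into "<prefix><head><delimiter>";
        # keys without it are plain objects at this level and keep their full name.
        totals[prefix + head + sep] += size
    return dict(totals)


# Made-up inventory rows: (object key, size in bytes).
inventory = [
    ("users/alice/a.parquet", 100),
    ("users/bob/b.parquet", 50),
    ("production/events/1.orc", 700),
    ("README.md", 3),
]

top_level = aggregate_by_common_prefix(inventory)
users_level = aggregate_by_common_prefix(inventory, prefix="users/")
```

Calling the function again with a longer `prefix` is the "drill down" step: each call only reports the level directly below the prefix you pass in.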
All capabilities are available through both a human-readable web interface and a machine-readable JSON report, so feel free to plug them into your favorite monitoring tool.
What does it look like?
Size report:
Size diff:
Quickstart
1. Ensure you have an S3 inventory set up (preferably as Parquet or ORC).
2. Verify the table is registered in Athena.
3. Run lakeview as a standalone Docker container:

       docker run -it -p 5000:5000 \
         -v $HOME/.aws:/home/lakeview/.aws \
         treeverse/lakeview \
         --table <athena table name> \
         --output-location <s3 uri>

   Note: <athena table name> is the name you gave in step 2, and <s3 uri> is a location in S3 where Athena can store its results (e.g. s3://my-bucket/athena/).
4. Open http://localhost:5000/ and start exploring.
Using lakeview as an API
API endpoint: /du
To get results as JSON, add Accept: application/json to your request headers, or pass json as a query string parameter.
Query Parameters:
- prefix (default: ""): return objects and directories* starting with the given prefix
- delimiter (default: "/"): use this character as the delimiter to group objects under a common prefix
- date: date string corresponding to the inventory you'd like to query (YYYY-MM-DD-00-00 is S3's default structure)
- compare (optional): another date string. If present, lakeview calculates a diff between the two reports for every common prefix and sorts the results by the largest absolute diff
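
The compare behaviour can be sketched in a few lines of Python. This is an assumption-laden illustration rather than lakeview's actual code: the function name `diff_reports` is hypothetical, the sign convention (`size_left - size_right`) is a guess, and treating a prefix that is missing from one report as size 0 is also an assumption.

```python
def diff_reports(left, right):
    """Diff two {common_prefix: size} mappings, largest absolute change first.

    Hypothetical helper: `left` stands for the report at `date`, `right` for
    the report at `compare`. A prefix missing from one side counts as 0 bytes
    (an assumption made for this sketch).
    """
    rows = []
    for common_prefix in set(left) | set(right):
        size_left = left.get(common_prefix, 0)
        size_right = right.get(common_prefix, 0)
        rows.append({
            "common_prefix": common_prefix,
            "size_left": size_left,
            "size_right": size_right,
            "diff": size_left - size_right,  # assumed sign convention
        })
    # Sort by absolute diff, descending; break ties by prefix for stable output.
    rows.sort(key=lambda row: (-abs(row["diff"]), row["common_prefix"]))
    return rows


# Made-up sizes for two inventory dates.
newer = {"users/": 120, "staging/": 40, "tmp/": 5}
older = {"users/": 100, "staging/": 90}

changes = diff_reports(newer, older)
```

Sorting on the absolute value means a directory that shrank by a lot ranks above one that grew by a little, which is usually what you want when looking for the biggest movers.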
Example
Request:
http://localhost:5000/du?prefix=&delimiter=%2F&date=2020-08-23-00-00&compare=2020-08-22-00-00&json
Response:
{
  "compare": "2020-08-22-00-00",
  "date": "2020-08-23-00-00",
  "delimiter": "/",
  "prefix": "",
  "response": [
    {
      "common_prefix": "users/",
      "diff": 3363690400953,
      "size_left": 231203538669496,
      "size_right": 231203538669496
    },
    {
      "common_prefix": "production/",
      "diff": 2737293183914,
      "size_left": 6238586023266733,
      "size_right": 6238586023266733
    },
    {
      "common_prefix": "staging/",
      "diff": 281953288549,
      "size_left": 367219795944457,
      "size_right": 367219795944457
    },
    ...
  ]
}
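
For scripting against the API, a small client using only the Python standard library might look like the sketch below. The helper names (`build_du_request`, `fetch_du`, `largest_changes`) are hypothetical, and the base URL assumes the Quickstart setup on localhost:5000. The request uses the Accept header rather than the json query parameter.

```python
import json
from urllib.parse import urlencode
from urllib.request import Request, urlopen


def build_du_request(base_url, date, prefix="", delimiter="/", compare=None):
    """Build a GET request for /du that asks for a JSON response."""
    params = {"prefix": prefix, "delimiter": delimiter, "date": date}
    if compare is not None:
        params["compare"] = compare
    url = f"{base_url}/du?{urlencode(params)}"
    return Request(url, headers={"Accept": "application/json"})


def fetch_du(request):
    """Send the request and decode the JSON body (needs a running server)."""
    with urlopen(request) as response:
        return json.load(response)


def largest_changes(payload, limit=3):
    """Return (common_prefix, diff) pairs with the largest absolute diff."""
    rows = sorted(payload["response"], key=lambda row: abs(row["diff"]), reverse=True)
    return [(row["common_prefix"], row["diff"]) for row in rows[:limit]]


request = build_du_request(
    "http://localhost:5000", "2020-08-23-00-00", compare="2020-08-22-00-00"
)

# Offline sample shaped like the example response, so the parsing step can be
# tried without a server; swap in fetch_du(request) against a live instance.
sample = json.loads("""
{"response": [
  {"common_prefix": "staging/", "diff": 281953288549},
  {"common_prefix": "users/", "diff": 3363690400953},
  {"common_prefix": "production/", "diff": 2737293183914}
]}
""")
top_two = largest_changes(sample, limit=2)
```

Keeping request construction separate from the network call makes the URL easy to inspect or log before anything is sent to Athena.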
Building and running locally
Clone the repo, and from the root directory run:
$ pip install -r requirements.txt
Then start the server:
$ python server.py \
--table <athena table name> \
--output-location <s3 uri>
For a complete reference, run:
$ python server.py --help
License
lakeview is distributed under the Apache 2.0 license. See the included LICENSE file.
More information
lakeview was originally built (with <3) by Treeverse.
We're actively developing lakeFS as an open source tool that delivers resilience and manageability to object-storage based data lakes.