
Speedup list_datasets() by 2.5x

Open daverigby opened this issue 1 year ago • 0 comments

Problem

Construction of the Catalog object currently takes ~7.1s to complete. This is significant because both list_datasets() and load_dataset() require the construction of a Catalog object, so essentially any operation with pinecone_datasets has a startup overhead of over 7s.

Looking at where this time is spent, we see that the underlying gcsfs RPC library is issuing a large number of HTTP requests, some of them repeatedly to the same URL. Specifically, we are issuing two GCS GET requests per dataset bucket. For example, to access ANN_DEEP1B_d96_angular we observe the following calls (displayed by setting the GCSFS_DEBUG=DEBUG env var):

2024-02-09 11:54:35,635 - gcsfs - DEBUG - _call -- GET: b/{}/o/{}, ('pinecone-datasets-dev', 'ANN_DEEP1B_d96_angular/metadata.json'), None
2024-02-09 11:54:35,749 - gcsfs - DEBUG - _call -- GET: https://storage.googleapis.com/download/storage/v1/b/pinecone-datasets-dev/o/ANN_DEEP1B_d96_angular%2Fmetadata.json?alt=media, (), {'Range': 'bytes=0-440'}

We also end up issuing multiple calls to list the bucket contents - e.g. there are 11 calls of the form:

2024-02-09 11:54:35,433 - gcsfs - DEBUG - _call -- GET: b/{}/o, ('pinecone-datasets-dev',), None

In total we see 81 HTTP calls to construct a Catalog object comprising 25 datasets.
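One way to tally such calls from a captured debug log is sketched below. It assumes the log lines follow the `_call -- GET: <target>, (<args>), <kwargs>` shape shown above; the helper name `count_calls` is hypothetical and not part of gcsfs or pinecone_datasets.

```python
import re
from collections import Counter

# Assumption: lines look like the gcsfs debug output quoted above, i.e.
# "... - gcsfs - DEBUG - _call -- GET: <target>, (<args>), <kwargs>".
CALL_RE = re.compile(
    r"gcsfs - DEBUG - _call -- [A-Z]+: (?P<target>[^,]+), (?P<args>\(.*?\)),"
)


def count_calls(log_lines):
    """Count HTTP calls per (target, args) pair; non-matching lines are ignored."""
    counts = Counter()
    for line in log_lines:
        match = CALL_RE.search(line)
        if match:
            counts[(match["target"], match["args"])] += 1
    return counts
```

Summing the counter's values gives the total number of calls, and any entry with a count above 1 is a request repeated against the same target.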

Solution

Improve this by using gcsfs' higher-level fs.cat() method to directly read all metadata.json files without manually iterating over the bucket objects. This results in a much simpler set of calls - two calls to list the bucket content, then one call per dataset:

2024-02-09 11:54:00,715 - gcsfs - DEBUG - _call -- GET: b/{}/o, ('pinecone-datasets-dev',), None
2024-02-09 11:54:03,139 - gcsfs - DEBUG - _call -- GET: b/{}/o, ('pinecone-datasets-dev',), None
2024-02-09 11:54:04,337 - gcsfs - DEBUG - _call -- GET: https://storage.googleapis.com/download/storage/v1/b/pinecone-datasets-dev/o/ANN_DEEP1B_d96_angular%2Fmetadata.json?alt=media, (), {}
2024-02-09 11:54:04,338 - gcsfs - DEBUG - _call -- GET: https://storage.googleapis.com/download/storage/v1/b/pinecone-datasets-dev/o/ANN_Fashion-MNIST_d784_euclidean%2Fmetadata.json?alt=media, (), {}
...
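A minimal sketch of the batched read, not the actual change in this PR: it assumes `fs` is an fsspec-style filesystem (such as `gcsfs.GCSFileSystem`) exposing `glob()` and `cat()`, and that each dataset lives at `<base_path>/<dataset>/metadata.json`. The function name `load_all_metadata` is hypothetical.

```python
import json


def load_all_metadata(fs, base_path):
    """Read every <base_path>/<dataset>/metadata.json via one batched fs.cat()."""
    # Assumption: datasets are direct children of base_path.
    paths = fs.glob(f"{base_path}/*/metadata.json")
    if not paths:
        return {}
    # Passing a list of paths to cat() returns a {path: bytes} mapping.
    raw = fs.cat(paths)
    # Key each parsed document by its dataset directory name.
    return {path.split("/")[-2]: json.loads(data) for path, data in raw.items()}
```

With gcsfs installed, this could be called as `load_all_metadata(gcsfs.GCSFileSystem(), "pinecone-datasets-dev")`, assuming credentials that can read the bucket.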

The total number of HTTP calls is reduced to 26, with a corresponding reduction in the wall-clock time to construct the Catalog to 3.1s.

Type of Change

  • [x] New feature (non-breaking change which adds functionality)

Test Plan

Regression test using existing unit tests. Performance impact measured by running list_datasets() before and after.
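The before/after measurement could be reproduced with a small timing helper along these lines. This is a sketch, not the procedure used for the numbers above; `time_call` is a hypothetical name.

```python
import time


def time_call(fn, repeats=3):
    """Return the best wall-clock time in seconds over `repeats` calls of fn()."""
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        fn()
        best = min(best, time.perf_counter() - start)
    return best
```

For example, `time_call(pinecone_datasets.list_datasets)` run on each branch gives comparable figures. If any caching happens between calls, `repeats=1` in a fresh process is the closer match to the startup overhead described in the Problem section.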

daverigby, Feb 09 '24 12:02