
Try dataset pull on large dataset and measure performance

Open: ilongin opened this issue 1 year ago • 1 comment

We need to make sure `datachain pull` works in real user scenarios. We also need to measure performance, i.e. how much overhead we add compared to a user doing a plain export and import on their own. Datasets of 10M and larger should be tested.
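One way to quantify that overhead is to time both paths and compare them. The snippet below is only a minimal sketch: `timed` and `relative_overhead` are hypothetical helper names, and the callables passed to `timed` stand in for whatever actually runs the pull and the manual export/import, which this issue does not specify.

```python
import time
from typing import Callable


def timed(fn: Callable[[], object]) -> float:
    """Return the wall-clock seconds taken by a single call to fn."""
    start = time.perf_counter()
    fn()
    return time.perf_counter() - start


def relative_overhead(pull_seconds: float, baseline_seconds: float) -> float:
    """Extra time spent by pull, as a fraction of the baseline time.

    The baseline is assumed to be the manual export + import duration.
    """
    if baseline_seconds <= 0:
        raise ValueError("baseline_seconds must be positive")
    return (pull_seconds - baseline_seconds) / baseline_seconds
```

For example, if a pull took 150 s and the manual export/import took 100 s, `relative_overhead(150.0, 100.0)` gives `0.5`, i.e. 50% overhead.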

ilongin • Nov 13 '24 05:11

Some timings for pulling from Studio production without instantiating (team name: demo-1):

  1. ds://laion_wds_1m (1M objects, 14 custom signals): ~6k rows/sec
  2. ds://laion_wds (11.5M objects, 14 custom signals): ~6.7k rows/sec
  3. ds://laion (11.5M objects, 48 custom signals): ? rows/sec
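
To put those rates in wall-clock terms, here is a small sketch that converts rows/sec into an estimated pull duration. `estimated_minutes` is a hypothetical helper, and it assumes one row per object and a constant rate for the whole pull, which may not hold in practice.

```python
def estimated_minutes(rows: int, rows_per_sec: float) -> float:
    """Estimate pull duration in minutes, assuming a constant throughput."""
    if rows_per_sec <= 0:
        raise ValueError("rows_per_sec must be positive")
    return rows / rows_per_sec / 60


# Approximate figures from the list above (one row per object is an assumption).
print(round(estimated_minutes(1_000_000, 6_000), 1))   # ds://laion_wds_1m -> 2.8
print(round(estimated_minutes(11_500_000, 6_700), 1))  # ds://laion_wds -> 28.6
```

So the 1M dataset corresponds to roughly 3 minutes and the 11.5M dataset to roughly half an hour at the measured rates.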

--- IN PROGRESS ---

ilongin • Nov 14 '24 13:11