
Dataset size limitations?

Open EgorKraevTransferwise opened this issue 1 year ago • 1 comment

Hi, do I understand correctly that evidently uses Pandas under the hood? If so, does this constrain the size of the datasets that it can effectively work with? What's the largest dataset size you're aware of that's been successfully analyzed with evidently in a production context? Thanks a lot

EgorKraevTransferwise, Aug 03 '23 08:08

Hi @EgorKraevTransferwise,

Yes - currently, Evidently uses pandas. We know of users running Evidently on hundreds of thousands to low millions of rows per batch. Since the computation happens in memory, the exact limit and computation time depend on your infrastructure and on the specific Evidently metric(s) being used.
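Since everything runs in memory, one rough way to see whether a batch is likely to fit is to measure the DataFrame's footprint with pandas and downsample when it exceeds a budget you choose. The sketch below uses only pandas and NumPy. `sample_to_budget` and the byte budget are hypothetical names for illustration, not part of Evidently, and the result only approximates the budget because the sampled frame carries its own index.

```python
import numpy as np
import pandas as pd


def sample_to_budget(df: pd.DataFrame, max_bytes: int, seed: int = 0) -> pd.DataFrame:
    """Return df unchanged if it fits in max_bytes, else a uniform random sample.

    Hypothetical helper (not an Evidently API). The sample size is chosen
    proportionally to the budget, so the resulting footprint is approximate.
    """
    # deep=True also counts the payload of object (e.g. string) columns.
    used = int(df.memory_usage(deep=True).sum())
    if used <= max_bytes:
        return df
    frac = max_bytes / used
    # Uniform sampling without replacement; fixed seed for reproducibility.
    return df.sample(frac=frac, random_state=seed)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    batch = pd.DataFrame(
        {
            "feature": rng.normal(size=200_000),
            "label": rng.integers(0, 2, size=200_000),
        }
    )
    budget = 1_000_000  # assumed budget in bytes; tune to your infrastructure
    reduced = sample_to_budget(batch, budget)
    print(len(batch), "rows ->", len(reduced), "rows")
```

Uniform sampling keeps the example short. Whether a sample is acceptable depends on the metric: some data quality checks need every row, while drift estimates often tolerate sampling.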

We are currently working on Spark support, which will be better suited to larger datasets. However, it requires re-implementing the underlying metrics, so we will likely release Spark support only for some metrics initially.

If this is relevant to your use case and you are considering using Evidently with Spark, let us know which metrics are most important to you (e.g., data quality, data drift, etc.).

elenasamuylova, Aug 03 '23 10:08