
Why is uploading a dataset to Weave being counted towards the ingestion usage?

oekekezie opened this issue 8 months ago · 3 comments

The pricing page seems to suggest that ingestion usage is supposed to reflect how much data is generated by tracing activity: https://wandb.ai/site/pricing/

However, my account is clearly counting dataset uploads toward my ingestion usage. The datasets I uploaded yesterday (5/1) totaled about 800 MB, while the three runs of the same eval from that day should have produced only around 2 MB of traces. See screenshots below.

Is this the intended behavior? That would be surprising, given that each additional GB over 1.5 GB on the Pro plan costs $100, yet the plan is supposed to come with 100 GB of "storage" per month... If you actually uploaded 100 GB worth of datasets to storage, you would have blown through your Weave ingestion limit and been billed roughly $10,000 in overages. This seems like a bug, no?

At the very least, if this is the intended behavior and not a bug, shouldn't the documentation explain it clearly? https://weave-docs.wandb.ai/guides/core-types/datasets

(Screenshots of the ingestion usage dashboard omitted.)

oekekezie commented on May 2, 2025

@oekekezie Hey, great question. I'll get back to you ASAP with clarification.

gtarpenning commented on May 2, 2025

@oekekezie

I got to the bottom of this; one immediate note I'm making is that this could be made much clearer. The storage costs referenced on the pricing page (https://wandb.ai/site/pricing/) only apply to the W&B Models product.

However, for Weave an ingested byte is an ingested byte, including datasets and other media. As of April 18th, overage charges for the Pro edition are (by default) 10 cents per MB. For example, if you upload a 2 GB dataset, it will be billed at (2 GB - 1.5 GB free) * $0.10/MB = $51.20.
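For concreteness, here is that arithmetic as a quick Python sketch (assuming binary GB, i.e. 1 GB = 1024 MB, and the default Pro overage rate quoted above; the numbers are illustrative, not a billing API):

```python
# Rough overage estimate for Weave ingestion on the Pro plan.
# Assumes the figures quoted above: 1.5 GB (1536 MB) included, $0.10 per MB beyond that.
FREE_MB = 1.5 * 1024          # 1536 MB included per month
OVERAGE_PER_MB = 0.10         # USD per ingested MB over the free tier

def estimated_overage(ingested_mb: float) -> float:
    """Return the estimated monthly overage charge in USD."""
    return max(ingested_mb - FREE_MB, 0) * OVERAGE_PER_MB

print(estimated_overage(2 * 1024))    # 2 GB dataset  -> 51.2  (i.e. ~$51.20)
print(estimated_overage(100 * 1024))  # 100 GB        -> 10086.4 (roughly $10,000)
```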

I realize this makes operating on large datasets much less attractive; if you plan on running big evaluations or uploading more datasets, please reach out to this email and we can come up with a solution for your specific use case.

gtarpenning commented on May 2, 2025

Thanks for the response @gtarpenning. It does indeed make using Weave to store large datasets much less attractive, so I think I'll rely on Git LFS for now. For context, my use case involves splitting a very large synthetic dataset into different splits. As a hack, I think I'll store only the IDs for each split in Weave rather than the complete synthetic examples, and use the description field to keep track of the corresponding dataset, if that makes sense (rough sketch below). That said, should I be using a combination of W&B Core and Weave? This (https://docs.wandb.ai/guides/artifacts/) seems like it could be relevant, but the Weave documentation makes it seem as though Weave should be a standalone product.
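Something along these lines, as a minimal sketch (the project name, IDs, and description text are placeholders, and I'm assuming weave.Dataset accepts a description field alongside rows):

```python
import weave

weave.init("my-project")  # placeholder project name

# Publish only the example IDs for this split, not the full synthetic examples,
# and record in the description where the complete dataset actually lives.
train_split = weave.Dataset(
    name="synthetic-train-ids",
    description="IDs only; full examples tracked separately via Git LFS (synthetic-v1)",
    rows=[{"id": "ex-001"}, {"id": "ex-002"}, {"id": "ex-003"}],
)
weave.publish(train_split)
```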

Regardless, I'd suggest making the pricing information and Weave documentation much clearer regarding what counts towards ingestion:

From https://wandb.ai/site/pricing/, under "How is Weave data ingestion calculated?":

We define ingested bytes as bytes that we receive, process, and store on your behalf. This includes trace metadata, LLM inputs/outputs, and any other information you explicitly log to Weave, but does not include communication overhead (e.g., HTTP headers) or any other data that is not placed in long term storage. We count bytes as "ingested" only once at the time they are received and stored.

There's no mention here that uploading a dataset would count towards ingestion usage, so I would add it here.

I would also add it to the documentation here: https://weave-docs.wandb.ai/guides/core-types/datasets

oekekezie commented on May 2, 2025