Support CSV as Metrics Data Format

Open mnrozhkov opened this issue 1 year ago • 7 comments

Summary:

We propose adding CSV support as a metrics file format. This feature will let users leverage the flexibility and familiarity of CSV and of the DataFrame libraries widely used to calculate metrics.

Current Limitations:

Presently, DVC supports JSON, TOML 1.0, and YAML 1.2 metrics files. The absence of CSV support restricts its compatibility and integration with common data workflows, and converting tabular data into nested JSON is tedious.

Proposed Solution:

  • CSV File Processing: Enable DVC to read and convert CSV data into an internal format.
  • Grouping Keys: Support nested keys to configure the grouping of metrics via the CLI and dvc.yaml:metrics
  • Compatibility and Variation Handling: Ensure the tool can handle different CSV structures, including varying delimiters, missing values, and missing headers.
  • Error Handling: Provide clear messages for errors related to CSV formatting.
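
As a rough illustration of the delimiter and error-handling points above, here is a minimal sketch using only Python's standard csv module. The names read_rows and CSVMetricsError are hypothetical and are not part of DVC; the candidate delimiter set is also an assumption.

```python
import csv
import io


class CSVMetricsError(ValueError):
    """Hypothetical error type for malformed metrics CSV files."""


def read_rows(text, delimiters=",;\t|"):
    """Parse CSV text into a list of dicts, guessing the delimiter."""
    try:
        # csv.Sniffer is heuristic: it picks the candidate character
        # that occurs a consistent number of times on each line.
        dialect = csv.Sniffer().sniff(text, delimiters=delimiters)
    except csv.Error as exc:
        raise CSVMetricsError(f"could not detect a delimiter: {exc}") from exc

    reader = csv.DictReader(
        io.StringIO(text), dialect=dialect, skipinitialspace=True
    )
    rows = []
    for row in reader:
        # DictReader stores surplus cells under the None key and fills
        # missing cells with None; treat both as formatting errors.
        if None in row or None in row.values():
            raise CSVMetricsError(
                f"line {reader.line_num}: expected "
                f"{len(reader.fieldnames)} fields"
            )
        rows.append(row)
    return rows
```

Because the sniffer is only a heuristic and can fail on short or ragged files, an explicit delimiter option would probably be more predictable than guessing.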

Benefits:

  • User-Friendly: CSV is a familiar format for many, making the tool more accessible.
  • Flexibility: CSV support allows for a broader range of data import and export options, accommodating diverse user workflows.

Use Case Example:

A data scientist needs to log metrics for a CV model (e.g. vehicle inspection) that are stored in a CSV file.

  • CSV Example: Multi-indexed CSV with vehicle types and parts.
    Vehicle, Part, Accuracy, Count
    Car, Wheel, 0.3, 150
    Car, Bumper, 0.5, 200
    Truck, Glass, 0.5, 250
    Truck, Wheel, 0.1, 110
    
  • JSON Structure: The data is structured in JSON to reflect the vehicle-part relationship and metrics.
{
    "Car": {
        "Wheel": {"Accuracy": 0.3, "Count": 150},
        "Bumper": {"Accuracy": 0.5, "Count": 200}
        ...
    },
    "Truck": {
        "Glass": {"Accuracy": 0.5, "Count": 250},
        "Wheel": {"Accuracy": 0.1, "Count": 110}
        ...
    }
}
  • dvc.yaml: Metrics configuration
metrics:
  - metrics.csv:
      keys: ["Vehicle", "Part"]
      metrics: ["Accuracy", "Count"]
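
To make the intended mapping concrete, the conversion from the CSV above to the nested JSON structure could be sketched as follows. This is only an illustration of the proposal, not existing DVC behavior; csv_to_nested and to_number are hypothetical names.

```python
import csv
import io


def to_number(value):
    """Parse a cell as int when possible, otherwise as float."""
    try:
        return int(value)
    except ValueError:
        return float(value)


def csv_to_nested(text, keys, metrics):
    """Group CSV rows into nested dicts, one level per key column."""
    result = {}
    reader = csv.DictReader(io.StringIO(text), skipinitialspace=True)
    for row in reader:
        node = result
        for key in keys:
            # Key cells stay strings; the order of `keys` sets the nesting.
            node = node.setdefault(row[key].strip(), {})
        for name in metrics:
            node[name] = to_number(row[name].strip())
    return result


sample = """Vehicle, Part, Accuracy, Count
Car, Wheel, 0.3, 150
Car, Bumper, 0.5, 200
Truck, Glass, 0.5, 250
Truck, Wheel, 0.1, 110
"""

nested = csv_to_nested(
    sample, keys=["Vehicle", "Part"], metrics=["Accuracy", "Count"]
)
```

With this sample, nested["Car"]["Wheel"] is {"Accuracy": 0.3, "Count": 150}, which matches the JSON structure shown above.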

mnrozhkov avatar Dec 19 '23 11:12 mnrozhkov

Should params support CSV too?

skshetry avatar Dec 19 '23 15:12 skshetry

Should params support CSV too?

I don't think so. The only case I can imagine where one might have a table of parameters is hyperparameter tuning, but that is out of scope. I have not heard of such requests.

mnrozhkov avatar Dec 20 '23 09:12 mnrozhkov

Thanks @mnrozhkov! Note that we had an issue open until a couple weeks ago for this: https://github.com/iterative/dvc/issues/5409. It was open for almost 3 years, but there was no discussion or thumbs there, so let's keep in mind that general demand for this feature may be limited.

How should DVC know how to treat each column?

  • Should it try to infer the data type of each column and assume numeric columns are values and the others are keys? Unfortunately, CSV has no defined types, so not sure how we will do this without a heavier package like pandas.
  • What if the structure is not as simple as text columns on the left and numeric columns on the right?
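
For reference, a naive version of this inference is possible with the standard library alone. The sketch below (infer_columns and is_number are hypothetical names) treats a column as a metric only if every cell parses as a float, and treats everything else as a key.

```python
import csv
import io


def is_number(value):
    """Return True if the cell parses as a float."""
    try:
        float(value)
    except (TypeError, ValueError):
        return False
    return True


def infer_columns(text):
    """Split column names into (keys, metrics) by inspecting the cells."""
    reader = csv.DictReader(io.StringIO(text), skipinitialspace=True)
    rows = list(reader)
    keys, metrics = [], []
    for name in reader.fieldnames or []:
        # A column counts as a metric only if all of its cells are numeric.
        if rows and all(is_number(row[name]) for row in rows):
            metrics.append(name)
        else:
            keys.append(name)
    return keys, metrics
```

The weakness is visible immediately: a numeric identifier such as a Year column would be classified as a metric, so this kind of guessing cannot replace an explicit schema.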

dberenbaum avatar Dec 20 '23 20:12 dberenbaum

Good points, @dberenbaum!

Should it try to infer the data type of each column and assume numeric columns are values and the others are keys? Unfortunately, CSV has no defined types, so not sure how we will do this without a heavier package like pandas.

  • I think we may assume that "metrics" fields contain numeric values and "keys" are converted to text

In my mind, the following dvc.yaml config should be sufficient:

metrics:
  - metrics.csv:
      keys: ["Vehicle", "Part"]
      metrics: ["Accuracy", "Count"]
  • columns specified in keys can be of any type and are converted to strings; their order defines the structure (grouping order)
  • columns in metrics are expected to be numeric

What if the structure is not as simple as text columns on the left and numeric columns on the right?

Do you have any specific example of such a complex structure?

mnrozhkov avatar Dec 22 '23 14:12 mnrozhkov

In my mind, the following dvc.yaml config should be sufficient

Sorry, I misunderstood and thought you were suggesting that DVC infer the keys and metrics columns. If we specify those in the dvc.yaml, it makes sense. My only question would be the level of effort it would take (cc @skshetry).

dberenbaum avatar Dec 22 '23 15:12 dberenbaum

Just wanted to chime in and say that my company is running into this use case decently often across many repositories... mainly because we monitor the same metrics over many slices of the dataset.

uditrana avatar Dec 27 '23 01:12 uditrana

Folks, I suggest two simple steps first:

  • start raising an exception if file type is unsupported (vs silently treating files as YAML as we do now - which seems to be quite wrong)
  • for CSV/TSV:
    • always parse CSV/TSV with headers
    • consider the whole column as a value - single value is scalar, multiple values - array
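
A minimal sketch of this default, assuming the header row is always present and cell values are left as strings (columns_as_values is a hypothetical name, not a DVC function):

```python
import csv
import io


def columns_as_values(text, delimiter=","):
    """Map each header to its column: a scalar for one row, else a list."""
    reader = csv.DictReader(
        io.StringIO(text), delimiter=delimiter, skipinitialspace=True
    )
    columns = {name: [] for name in reader.fieldnames or []}
    for row in reader:
        for name in columns:
            columns[name].append(row[name])
    # A single data row collapses to a scalar; several rows stay a list.
    return {
        name: values[0] if len(values) == 1 else values
        for name, values in columns.items()
    }
```

Passing delimiter="\t" covers the TSV case. Type conversion is deliberately left out here, so numeric cells would still need to be parsed before being shown as metrics.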

I feel that if people are ready to do something like:

metrics:
  - metrics.csv:
      keys: ["Vehicle", "Part"]
      metrics: ["Accuracy", "Count"]

(learning it is itself a lot of time), they should be fine dumping to JSON; it’s +/- two (?) lines of code

Also, we would still need some default behavior in case people don't provide this schema. Raise an exception? Treat the file as I described?

shcheklein avatar May 27 '24 16:05 shcheklein