zsv Lack of documentation and unclear API is a real drawback.

Hello,

I'm very interested in using zsvlib in a C project.

I'm looking for a way to use the library in order to achieve the same result as the command zsv select file.csv -- column_title to convert a column into a char**.

However, I find the API unclear, and the applications' source code seems quite convoluted, not to say obscure. I'm starting to understand that a zsvlib “Hello World” should start with:

struct zsv_opts* opts = zsv_get_default_opts();
zsv_parser parser = zsv_new(opts);

But, again, I feel like I'm at the beginning of a long journey before I finally feel comfortable with the API, and start producing the result I want to achieve.

I'm hoping to use this library in the future, so I wanted to flag this difficulty.

Best regards.

Sep 13 '22 00:09 malespiaut

@malespiaut -- thanks for the feedback. Completely understand and agree, which is one of the reasons this repo is still in alpha.

Will at least look to add a basic example and more documentation in zsv/api.h in the next commit, which likely is within a week. Appreciate any further suggestions as to which of the many documentation gaps would be most useful to you for us to fill in first.

If you are using the API, you do not need zsv_get_default_opts(), which is primarily a convenience function used for fine-tuning default behavior such as setting a default callback that is invoked upon parse finalization.

In the meantime, does the below simple example help?

/**
 * Example program: reads CSV rows, outputs number of non-blank cells
 * compile with e.g. gcc -O3 -o simple simple.c -lzsv -DZSV_EXTRAS
 *     Note: remove -DZSV_EXTRAS if you configured/compiled/installed libzsv using --minimal=yes
 * Example:
 *   `printf 'abc,def,,\n,ghi,,,,\n' | ./simple`
 * outputs:
 *   Row 0 has 4 columns of which 2 are non-blank
 *   Row 1 has 6 columns of which 1 are non-blank
 */

#include <stdio.h>
#include <zsv.h>

struct my_data {
  zsv_parser parser;
  size_t rows; // number of rows handled so far
};

// Called by the parser once per row; ctx is the pointer assigned to opts.ctx
void my_row_handler(void *ctx) {
  struct my_data *data = ctx;
  size_t column_count = zsv_column_count(data->parser);
  size_t nonblank = 0;
  for(size_t i = 0; i < column_count; i++)
    if(zsv_get_cell(data->parser, i).len > 0)
      nonblank++;
  printf("Row %zu has %zu columns of which %zu are non-blank\n", data->rows, column_count, nonblank);
  data->rows++; // without this, every row would be reported as row 0
}

int main() {
  struct my_data data = { 0 };
  struct zsv_opts opts = { 0 };
  opts.row = my_row_handler;
  opts.ctx = &data;
  data.parser = zsv_new(&opts);
  while(zsv_parse_more(data.parser) == zsv_status_ok)
    ;
  zsv_finish(data.parser);
  zsv_delete(data.parser);
  return 0;
}

Sep 13 '22 01:09 liquidaty

@malespiaut Or, here's another example that is closer to what you indicated you are trying to do:

/**
 * Example program: reads CSV rows, outputs the values of the column
 * whose header matches the name given on the command line
 * Example compilation command:
 *   gcc -O3 -o print_my_column print_my_column.c -lzsv -DZSV_EXTRAS
 *   (Note: remove -DZSV_EXTRAS if you configured/compiled/installed libzsv using --minimal=yes)
 * Example:
 *   `printf 'hi,there,you\na,b,c\nd,e,f\n' | ./print_my_column there`
 * Outputs:
 *   there
 *   b
 *   e
 */

#include <stdio.h>
#include <zsv.h>
#include <string.h>

struct my_data {
  zsv_parser parser;
  size_t column_to_find_len;
  const char *column_to_find;
  size_t column_to_find_position; // 1-based
  char aborted;
};

/**
 * Output data from the selected column
 */
void print_my_column(void *ctx) {
  struct my_data *data = ctx;
  struct zsv_cell c = zsv_get_cell(data->parser, data->column_to_find_position - 1);
  printf("%.*s\n", (int)c.len, c.str);
}

/**
 * In the first row, find the position of the column I want to output
 * Stop with an error message if not found
 */
void find_my_column(void *ctx) {
  struct my_data *data = ctx;
  size_t column_count = zsv_column_count(data->parser);
  for(size_t i = 0; i < column_count; i++) {
    struct zsv_cell c = zsv_get_cell(data->parser, i);
    if(c.len == data->column_to_find_len && !memcmp(data->column_to_find, c.str, c.len)) {
      data->column_to_find_position = i + 1;
      break;
    }
  }
  if(!data->column_to_find_position) {
    fprintf(stderr, "Could not find column %.*s\n", (int)data->column_to_find_len, data->column_to_find);
    zsv_abort(data->parser);
    data->aborted = 1;
  } else {
    printf("%s\n", data->column_to_find);
    zsv_set_row_handler(data->parser, print_my_column);
  }
}

int main(int argc, const char *argv[]) {
  struct my_data data = { 0 };
  struct zsv_opts opts = { 0 };
  opts.row = find_my_column;
  opts.ctx = &data;

  if(argc < 2) {
    // return a fixed non-zero exit code rather than fprintf's character count
    fprintf(stderr, "Usage: print_my_column column_name < input.csv\n");
    return 1;
  }

  data.column_to_find = argv[1];
  data.column_to_find_len = strlen(data.column_to_find);

  data.parser = zsv_new(&opts);
  while(zsv_parse_more(data.parser) == zsv_status_ok)
    ;
  zsv_finish(data.parser);
  zsv_delete(data.parser);

  return data.aborted;
}
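
Since the original question was about ending up with a char**, here is a rough sketch of the collecting side only. It does not call zsv at all. It assumes, based on the `printf("%.*s\n", (int)c.len, c.str)` call above, that a cell arrives as a pointer plus a length without a terminating NUL, and (as a further assumption) that the cell's memory should not be kept after the row handler returns, so each value is copied. The names `string_list`, `string_list_append` and `string_list_free` are hypothetical helpers, not part of the zsv API.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical growable list of heap-allocated, NUL-terminated strings. */
struct string_list {
  char **items;    /* the char ** the caller ends up with */
  size_t count;    /* number of strings stored */
  size_t capacity; /* number of slots allocated in items */
};

/* Copy len bytes starting at str (not NUL-terminated) into the list.
 * Returns 0 on success, -1 if an allocation fails. */
int string_list_append(struct string_list *list, const unsigned char *str, size_t len) {
  assert(list != NULL);
  if(list->count == list->capacity) {
    size_t new_capacity = list->capacity ? list->capacity * 2 : 16;
    char **grown = realloc(list->items, new_capacity * sizeof(char *));
    if(!grown)
      return -1;
    list->items = grown;
    list->capacity = new_capacity;
  }
  char *copy = malloc(len + 1);
  if(!copy)
    return -1;
  if(len)
    memcpy(copy, str, len);
  copy[len] = '\0';
  list->items[list->count++] = copy;
  return 0;
}

/* Free every stored string and reset the list to its empty state. */
void string_list_free(struct string_list *list) {
  assert(list != NULL);
  for(size_t i = 0; i < list->count; i++)
    free(list->items[i]);
  free(list->items);
  list->items = NULL;
  list->count = list->capacity = 0;
}
```

To use it with the example above, one option would be to add a `struct string_list` field to `struct my_data` and, in `print_my_column`, call `string_list_append` with `c.str` and `c.len` instead of `printf`. After parsing completes, `items` and `count` hold the column.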

Sep 13 '22 01:09 liquidaty

Hello,

Thank you infinitely for your answer.

Before anything else, I would like to point out that zsv has been extremely helpful. I wanted to speed up a shell script I wrote that used csvcut (written in Python). xsv turned out to malfunction, outputting incorrect results for some reason, while zsv did the job perfectly. In the end, replacing csvcut with zsv cut my script's run time from more than 15 minutes to less than 14 seconds. A huge thank you for that relief!

The minimalist source code you have posted highlights, in my humble opinion, how some things are not intuitive.

The role and content of zsv_opts must be documented;
The row field could possibly have another name, like row_fn_ptr for “Row function pointer”;
The same goes for ctx, which is too short of a name;
zsv_parse_more() isn't a clear name, again, in my humble opinion. Why parse more? For what?

I may not have the best advice for function and variable naming, but these are questions I asked myself upon reading the code of the first example.

Sep 13 '22 16:09 malespiaut

replacing csvcut with zsv made my script complete in more than 15 minutes, to less than 14 seconds. A huge thank you for that relief!

So glad to hear that. If you feel like it, please give the repo a star and/or help spread the word! If you're working on Mac, then once we hit 75 stars we will be able to make zsv available via brew without a custom tap.

The role and content of zsv_opts must be documented

Understood. FYI the current place where we have begun to add this documentation is at https://github.com/liquidaty/zsv/blob/main/include/zsv/common.h and https://github.com/liquidaty/zsv/blob/main/include/zsv/api.h , but we still have lots of wood to chop in terms of adding content and then generating proper documentation from it.

The row field could possibly have another name, like row_fn_ptr for “Row function pointer”; the same goes for ctx; zsv_parse_more() isn't a clear name

Noted, will give all of these some thought.
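
For anyone reading along with the same questions about `row`, `ctx` and "parse more", the following toy sketch may help. It is not zsv's code and does not use zsv. Every `toy_` name is hypothetical, and the behavior is only an assumption inferred from how the examples above call the library: the parser is handed a callback plus an opaque context pointer, each "parse more" call consumes one more chunk of input and fires the callback for every row completed in that chunk, and a final "finish" call flushes a last row that has no trailing newline.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Toy illustration only; all names here are hypothetical. */
enum toy_status { toy_status_ok, toy_status_done };

struct toy_opts {
  void (*row)(void *ctx); /* called once per completed row */
  void *ctx;              /* handed back to row() untouched */
};

struct toy_parser {
  struct toy_opts opts;
  const char *input; /* whole input held in memory, for simplicity */
  size_t input_len;
  size_t pos;        /* next unread byte */
  size_t chunk_size; /* bytes consumed per toy_parse_more() call */
  size_t line_len;   /* bytes seen in the row currently being read */
};

void toy_init(struct toy_parser *p, const struct toy_opts *opts,
              const char *input, size_t chunk_size) {
  assert(p && opts && input && chunk_size > 0);
  p->opts = *opts;
  p->input = input;
  p->input_len = strlen(input);
  p->pos = 0;
  p->chunk_size = chunk_size;
  p->line_len = 0;
}

/* Consume at most one more chunk; call row() for each newline found in it. */
enum toy_status toy_parse_more(struct toy_parser *p) {
  if(p->pos >= p->input_len)
    return toy_status_done;
  size_t end = p->pos + p->chunk_size;
  if(end > p->input_len)
    end = p->input_len;
  for(; p->pos < end; p->pos++) {
    if(p->input[p->pos] == '\n') {
      if(p->opts.row)
        p->opts.row(p->opts.ctx);
      p->line_len = 0;
    } else
      p->line_len++;
  }
  return toy_status_ok;
}

/* Flush a final row that was not terminated by a newline. */
void toy_finish(struct toy_parser *p) {
  if(p->line_len > 0 && p->opts.row) {
    p->opts.row(p->opts.ctx);
    p->line_len = 0;
  }
}

/* Example callback: ctx points at a counter owned by the caller. */
void count_row(void *ctx) {
  size_t *rows = ctx;
  (*rows)++;
}
```

The context pointer is what lets one callback signature serve any caller: the parser never looks inside it, so the caller can pass a counter, a struct, or anything else the handler needs.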

Thank you for your feedback! Please feel free to send any further questions or feedback any time. I will update this issue when we have followed up on the above.

Sep 13 '22 17:09 liquidaty

Posted additional examples and code at examples/lib/README.md, with links to full examples. Given these changes, closing this ticket for now.

Auto-generated documentation will be forthcoming in a few weeks. Please feel free to re-open with further comments once that is available, or before then if you see any apparent gap in these completed or planned changes.

Nov 01 '22 06:11 liquidaty

@malespiaut just added a pull parsing API, which, if I'm guessing correctly, you'll appreciate. No need for callbacks:

zsv_parser parser = zsv_new(...);
while(zsv_next_row(parser) == zsv_status_row) { /* for each row */
    // do something
}

See examples/lib/README.md for snippets and full code examples
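
To make the difference from the callback style concrete without depending on zsv itself, here is a toy pull-style row iterator. It is a sketch of the general pattern only, not of how zsv implements `zsv_next_row`. All `toy_` names are hypothetical, and it splits on newlines with no quoting or cell handling.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Toy illustration only; all names here are hypothetical. */
enum toy_pull_status { toy_pull_row, toy_pull_done };

struct toy_puller {
  const char *input; /* whole input held in memory, for simplicity */
  size_t len;
  size_t pos;        /* next unread byte */
  const char *row;   /* start of the current row (not NUL-terminated) */
  size_t row_len;    /* length of the current row, excluding the newline */
};

void toy_puller_init(struct toy_puller *p, const char *input) {
  assert(p && input);
  p->input = input;
  p->len = strlen(input);
  p->pos = 0;
  p->row = NULL;
  p->row_len = 0;
}

/* Advance to the next row. The caller decides when to call this,
 * so per-row state can live in ordinary local variables. */
enum toy_pull_status toy_next_row(struct toy_puller *p) {
  assert(p && p->input);
  if(p->pos >= p->len)
    return toy_pull_done;
  size_t start = p->pos;
  while(p->pos < p->len && p->input[p->pos] != '\n')
    p->pos++;
  p->row = p->input + start;
  p->row_len = p->pos - start;
  if(p->pos < p->len)
    p->pos++; /* step over the newline */
  return toy_pull_row;
}
```

With this shape, the loop body replaces the row handler, and anything the handler previously reached through a context pointer can simply be a local variable in the calling function.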

Nov 12 '22 08:11 liquidaty