duckdb-web icon indicating copy to clipboard operation
duckdb-web copied to clipboard

Issue found on page 'CSV Import', section "Parameters": `force_not_null` could benefit from clearer wording and crosslink to `nullstr`-parameter

Open rpbouman opened this issue 6 months ago • 0 comments

I found the documentation for the force_not_null parameter to read_csv hard to understand. The parameter name suggests a BOOLEAN value, but it's actually a list of VARCHARs. The description reads:

Do not match the specified columns' values against the NULL string. In the default case where the NULL string is empty, this means that empty values will be read as zero-length strings rather than NULLs.

From description it is not immediately clear to what kind of thing is specified through this parameter. The construct "the specified columns' values" is ambiguous - it could either mean "columns" or "values". Because the rest of the description also focuses on values and value matching, I thought it was about values.

Now I believe that the force_not_null parameter allows one to specify a list of column names, and that its effect is that for those specified columns, the reader will not attempt to match the column's values to those specified by the nullstr parameter (which would extract those literal values as SQL NULL values)

I suggest a rewording of the description that explicitly says the force_not_null parameter takes a list of column names as value, and to also literally name the nullstr parameter to which it corresponds. Something like this:

By default, the CSV reader compares each column's raw values against the NULL-strings to substitute them with SQL NULL-values. The force_not_null parameter can be used to specify a list of column names for which the reader should not apply any NULL-string substitution. The set of NULL-strings can be specified by the nullstr-parameter. By default, the nullstr-parameter is set to the empty string, thus substituting empty raw values with SQL NULL-values. Columns that need to retain the literal empty-string values should be passed in the force_not_null parameter.

And for completeness, the nullstr-parameter should also get some text to establish the relationship with the force_not_null parameter. I think this could be done by adding this to its description:

The substitution of nullstr-parameter values with SQL NULL-values can be suprressed by passing the names of those columns in the force_not_null-parameter. If the nullstr-parameter includes the empty string (default), then the columns passed to the force_not_null-parameter will retain any empty string values instead of substituting them with SQL NULL-values.

Page URL: https://duckdb.org/docs/data/csv/overview#parameters

rpbouman avatar Aug 14 '24 14:08 rpbouman