sqlite-utils icon indicating copy to clipboard operation
sqlite-utils copied to clipboard

CSV `extras_key=` and `ignore_extras=` equivalents for CLI tool

Open simonw opened this issue 2 years ago • 5 comments

I forgot to add equivalents of extras_key= and ignore_extras= to the CLI tool - will do that in a separate issue.

Originally posted by @simonw in https://github.com/simonw/sqlite-utils/issues/440#issuecomment-1155767915

simonw avatar Jun 14 '22 22:06 simonw

I think these become part of the _import_options list which is used in a few places:

https://github.com/simonw/sqlite-utils/blob/b8af3b96f5c72317cc8783dc296a94f6719987d9/sqlite_utils/cli.py#L765-L800

simonw avatar Jun 14 '22 23:06 simonw

I'm going with --extras-key and --ignore-extras as the two new options.

simonw avatar Jun 14 '22 23:06 simonw

This is tricky to implement because of this code: https://github.com/simonw/sqlite-utils/blob/b8af3b96f5c72317cc8783dc296a94f6719987d9/sqlite_utils/cli.py#L938-L945

It's reconstructing each document using the known headers here:

docs = (dict(zip(headers, row)) for row in reader)

So my first attempt at this - the diff here - did not have the desired result:

diff --git a/sqlite_utils/cli.py b/sqlite_utils/cli.py
index 86eddfb..00b920b 100644
--- a/sqlite_utils/cli.py
+++ b/sqlite_utils/cli.py
@@ -6,7 +6,7 @@ import hashlib
 import pathlib
 import sqlite_utils
 from sqlite_utils.db import AlterError, BadMultiValues, DescIndex
-from sqlite_utils.utils import maximize_csv_field_size_limit
+from sqlite_utils.utils import maximize_csv_field_size_limit, _extra_key_strategy
 from sqlite_utils import recipes
 import textwrap
 import inspect
@@ -797,6 +797,15 @@ _import_options = (
         "--encoding",
         help="Character encoding for input, defaults to utf-8",
     ),
+    click.option(
+        "--ignore-extras",
+        is_flag=True,
+        help="If a CSV line has more than the expected number of values, ignore the extras",
+    ),
+    click.option(
+        "--extras-key",
+        help="If a CSV line has more than the expected number of values put them in a list in this column",
+    ),
 )
 
 
@@ -885,6 +894,8 @@ def insert_upsert_implementation(
     sniff,
     no_headers,
     encoding,
+    ignore_extras,
+    extras_key,
     batch_size,
     alter,
     upsert,
@@ -909,6 +920,10 @@ def insert_upsert_implementation(
         raise click.ClickException("--flatten cannot be used with --csv or --tsv")
     if encoding and not (csv or tsv):
         raise click.ClickException("--encoding must be used with --csv or --tsv")
+    if ignore_extras and extras_key:
+        raise click.ClickException(
+            "--ignore-extras and --extras-key cannot be used together"
+        )
     if pk and len(pk) == 1:
         pk = pk[0]
     encoding = encoding or "utf-8-sig"
@@ -935,7 +950,9 @@ def insert_upsert_implementation(
                 csv_reader_args["delimiter"] = delimiter
             if quotechar:
                 csv_reader_args["quotechar"] = quotechar
-            reader = csv_std.reader(decoded, **csv_reader_args)
+            reader = _extra_key_strategy(
+                csv_std.reader(decoded, **csv_reader_args), ignore_extras, extras_key
+            )
             first_row = next(reader)
             if no_headers:
                 headers = ["untitled_{}".format(i + 1) for i in range(len(first_row))]
@@ -1101,6 +1118,8 @@ def insert(
     sniff,
     no_headers,
     encoding,
+    ignore_extras,
+    extras_key,
     batch_size,
     alter,
     detect_types,
@@ -1176,6 +1195,8 @@ def insert(
             sniff,
             no_headers,
             encoding,
+            ignore_extras,
+            extras_key,
             batch_size,
             alter=alter,
             upsert=False,
@@ -1214,6 +1235,8 @@ def upsert(
     sniff,
     no_headers,
     encoding,
+    ignore_extras,
+    extras_key,
     alter,
     not_null,
     default,
@@ -1254,6 +1277,8 @@ def upsert(
             sniff,
             no_headers,
             encoding,
+            ignore_extras,
+            extras_key,
             batch_size,
             alter=alter,
             upsert=True,
@@ -1297,6 +1322,8 @@ def bulk(
     sniff,
     no_headers,
     encoding,
+    ignore_extras,
+    extras_key,
     load_extension,
 ):
     """
@@ -1331,6 +1358,8 @@ def bulk(
             sniff=sniff,
             no_headers=no_headers,
             encoding=encoding,
+            ignore_extras=ignore_extras,
+            extras_key=extras_key,
             batch_size=batch_size,
             alter=False,
             upsert=False,

simonw avatar Jun 14 '22 23:06 simonw

Yeah my initial implementation there makes no sense:

            csv_reader_args = {"dialect": dialect}
            if delimiter:
                csv_reader_args["delimiter"] = delimiter
            if quotechar:
                csv_reader_args["quotechar"] = quotechar
            reader = _extra_key_strategy(
                csv_std.reader(decoded, **csv_reader_args), ignore_extras, extras_key
            )
            first_row = next(reader)
            if no_headers:
                headers = ["untitled_{}".format(i + 1) for i in range(len(first_row))]
                reader = itertools.chain([first_row], reader)
            else:
                headers = first_row
            docs = (dict(zip(headers, row)) for row in reader)

Because my _extra_key_strategy() helper function is designed to work against csv.DictReader - not against csv.reader() which returns a sequence of lists, not a sequence of dictionaries.

In fact, what's happening here is that dict(zip(headers, row)) is ignoring anything in the row that doesn't correspond to a header:

>>> list(zip(["a", "b"], [1, 2, 3]))
[('a', 1), ('b', 2)]

simonw avatar Jun 14 '22 23:06 simonw

I'm going to push a branch with my not-yet-working code (which does at least include a test).

simonw avatar Jun 15 '22 04:06 simonw