
v.edit: Implement the batch option for batch editing

HuidaeCho opened this pull request 1 year ago

This PR implements the new batch option for batch editing of many features. For example, without this option, we need this script to extract the upper halves of roadsmajor from the NC sample dataset:

# in nc_spm_08_grass7/user1

# copy roadsmajor
g.copy vect=roadsmajor,roadsmajor_upper_half_wo_batch

# find road centers
v.db.select map=roadsmajor -c col=cat |
        awk '{printf "P %d %d 50%%\n", $1, $1}' |
        v.segment input=roadsmajor output=roadsmajor_center

# break roads at their center
v.to.db -p map=roadsmajor_center option=coor sep=comma |
        sed '/^cat/d' |
        while IFS=, read cat x y; do
                v.edit map=roadsmajor_upper_half_wo_batch tool=break cats=$cat coords=$x,$y
        done

# delete lower halves
v.to.db -p map=roadsmajor option=end sep=comma |
        sed '/^cat/d' |
        while IFS=, read cat x y; do
                v.edit map=roadsmajor_upper_half_wo_batch tool=delete cats=$cat coords=$x,$y
        done

It took 11.938 seconds on an i9-12900. On the same CPU, with the new option, it took 2.180 seconds:

# in nc_spm_08_grass7/user1

# copy roadsmajor
g.copy vect=roadsmajor,roadsmajor_upper_half_w_batch

# find road centers
v.db.select map=roadsmajor -c col=cat |
        awk '{printf "P %d %d 50%%\n", $1, $1}' |
        v.segment input=roadsmajor output=roadsmajor_center --o

# create a batch table and pipe it to v.edit
(
# use these batch columns
echo "tool|cats|coords"

# break roads at their center
v.to.db -p map=roadsmajor_center option=coor sep=space |
        awk '!/^cat/{printf "break|%d|%s,%s\n", $1, $2, $3}'

# delete lower halves
v.to.db -p map=roadsmajor option=end sep=space |
        awk '!/^cat/{printf "delete|%d|%s,%s\n", $1, $2, $3}'

# the above three commands produce the following table:
# tool|cats|coords
# break|1|637108.11963224,257241.783755701
# break|2|636792.27198317,254339.365766968
# break|3|638449.758677739,248299.357582029
# ...
# delete|1|637209.083058167,257970.129540226
# delete|2|636981.336042672,256517.602235172
# delete|3|637960.264160529,248293.868427705
# ...
) | v.edit map=roadsmajor_upper_half_w_batch batch=-

The upper halves are shown in red. [image]
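
For Python users, the same batch table can be built and fed to v.edit with grass.script. This is a rough, untested sketch (parameter names follow the shell commands above; it assumes an active GRASS session in the same mapset):

# Rough sketch of the batch workflow from Python (untested); assumes an
# active GRASS session in nc_spm_08_grass7/user1.
import grass.script as gs

gs.run_command("g.copy", vector="roadsmajor,roadsmajor_upper_half_w_batch")

# find road centers: one "P <id> <cat> 50%" rule per category
cats = gs.read_command("v.db.select", map="roadsmajor", flags="c",
                       columns="cat").split()
rules = "".join(f"P {cat} {cat} 50%\n" for cat in cats)
gs.write_command("v.segment", input="roadsmajor",
                 output="roadsmajor_center", rules="-", stdin=rules,
                 overwrite=True)

def batch_rows(tool, vmap, option):
    """Turn v.to.db -p output into batch-table rows for the given tool."""
    out = gs.read_command("v.to.db", map=vmap, option=option,
                          separator="pipe", flags="p")
    for row in out.splitlines():
        if row.startswith("cat"):
            continue  # skip the header line
        cat, x, y = row.split("|")[:3]
        yield f"{tool}|{cat}|{x},{y}"

table = ["tool|cats|coords"]
table += batch_rows("break", "roadsmajor_center", "coor")  # break at centers
table += batch_rows("delete", "roadsmajor", "end")         # drop lower halves

# feed the whole table to v.edit in one call via batch=-
gs.write_command("v.edit", map="roadsmajor_upper_half_w_batch",
                 batch="-", stdin="\n".join(table) + "\n")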

HuidaeCho, Feb 24 '24

This is for a separate PR?

https://github.com/OSGeo/grass/blob/87d2d42d3b26c01d6de08ae8845618c9918a20cc/vector/v.edit/main.c#L473-L478

I think we need an && ret > 0 check to avoid building topology unnecessarily when there were no edits.

HuidaeCho, Feb 25 '24

First of all, this is pretty cool and the benchmark makes it quite convincing that this feature is needed.

About format: Have you tried how the input format works from Python? I see it is basically CSV with a pipe, standard in GRASS GIS but rare outside of it. In Bash/sed/awk, any custom format will usually do just as well as anything standardized, but in Python, while one can do the same series of text-manipulation hacks as with the Bash-sed-awk combo, learning and implementing a writer for each new tool creates more code for the user to maintain; it is more error-prone and thus makes the API harder to use[^1]. With the outputs of other tools being standardized, like v.db.select which now outputs JSON and true CSV, does it make sense to you to consider CSV with comma as the separator or, considering #3028, JSON input?

[^1]: In general, every time you see CSV or JSON input as a Python coder, you can just use the csv or json packages to generate it (getting the CSV dialect right and possibly checking the JSON special cases). If you go the route of generating the text yourself, it may be very simple in some cases, e.g., ",".join(my_list_of_numbers), but you need to consider every time whether any of the possible edge cases applies to you (newlines, separators in values, ...). I don't claim you can't make this case work with a custom reader-writer, but this will be used with other tools, so it seems easier to me to tell people "this is JSON with such and such structure" than "here is a format somewhat similar to CSV in some ways but different in others, which you need to figure out from examples, judging for yourself whether you can use a standard writer or have to implement something custom".
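
To illustrate with hypothetical values:

# Hypothetical illustration: manual joining vs. the csv package.
import csv
import io

row = ["delete", "2", "a value with | pipes"]

# Manual joining silently corrupts the record when a value contains
# the separator:
print("|".join(row))   # delete|2|a value with | pipes  <- four fields now

# csv.writer quotes such fields for you:
buf = io.StringIO()
csv.writer(buf, delimiter="|", lineterminator="\n").writerow(row)
print(buf.getvalue())  # delete|2|"a value with | pipes"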

wenzeslaus, Feb 25 '24

About format: Have you tried how the input format works from Python? I see it is basically CSV with a pipe, standard in GRASS GIS but rare outside of it. In Bash/sed/awk, any custom format will usually do just as well as anything standardized, but in Python, while one can do the same series of text-manipulation hacks as with the Bash-sed-awk combo, learning and implementing a writer for each new tool creates more code for the user to maintain; it is more error-prone and thus makes the API harder to use[^1]. With the outputs of other tools being standardized, like v.db.select which now outputs JSON and true CSV, does it make sense to you to consider CSV with comma as the separator or, considering #3028, JSON input?

I thought about different formats (e.g., CSV). The input uses commas as separators for ids, cats, and coords, so I chose the pipe for its easier implementation (and because it is the GRASS standard); the pipe also seems rare in other strings such as where (except that | is used for bitwise OR, so it actually needs to be escaped somehow). Also, it would be nice to have the standard separator option, which can be longer than one character; however, we would then have to use the slower strstr() instead of strchr() even for a single-character separator. The focus of this implementation was performance, because one of my vector editing tasks involving 38k features was expected to take 12 days. With this option, that extrapolates to about 4 days (still too slow :-(, but at least automated).

I think support for different formats should be handled outside individual modules so that it does not affect their analysis performance. For example,

module_outputs_json | g.format input=json output=pipe | module_takes_pipe

or similar in Python. This way, we don't have to worry about supporting different formats separately in each module, and each module is freed from that burden. Even just handling different formats can slow down a module's actual task.
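
Such a hypothetical g.format step could be as small as this sketch (it assumes one JSON edit object per input line, with the field names used in the example below):

# Sketch of a hypothetical "g.format input=json output=pipe" step:
# read one JSON edit per line on stdin, write the pipe table on stdout.
import json
import sys

print("tool|cats|coords")
for line in sys.stdin:
    if not line.strip():
        continue
    rec = json.loads(line)
    cats = ",".join(str(c) for c in rec["cats"])
    coords = ",".join(f"{p['x']},{p['y']}" for p in rec["coords"])
    print(f"{rec['tool']}|{cats}|{coords}")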

Also, JSON streaming is something we need to consider for streaming large data, e.g., one edit (object) per line. Does JSON have a standard format for this? If it is not supported somehow, the JSON data will need to be generated completely before we can start editing any features. Another consideration is data size (almost twice as large: 556 bytes vs. 270 bytes for the pipe format in the following example).

[
{"tool": "break", "cats": [1], "coords": [{"x": 637108.11963224, "y": 257241.783755701}]},
{"tool": "break", "cats": [2], "coords": [{"x": 636792.27198317, "y": 254339.365766968}]},
{"tool": "break", "cats": [3], "coords": [{"x": 638449.758677739, "y": 248299.357582029}]},
{"tool": "delete", "cats": [1], "coords": [{"x": 637209.083058167, "y": 257970.129540226}]},
{"tool": "delete", "cats": [2], "coords": [{"x": 636981.336042672, "y": 256517.602235172}]},
{"tool": "delete", "cats": [2], "coords": [{"x": 637960.264160529, "y": 248293.868427705}]}
]

NOT

[
    {
        "tool": "break",
        "cats": [ 
            1
        ],
        "coords": [
            {
                "x": 637108.11963224,
                "y": 257241.783755701
            }
        ]
    },
    {
        "tool": "break",
        "cats": [
            2
        ],
        "coords": [
            {
                "x": 636792.27198317,
                "y": 254339.365766968
            }
        ]
    },
    {
        "tool": "break",
        "cats": [
            3
        ],
        "coords": [
            {
                "x": 638449.758677739,
                "y": 248299.357582029
            }
        ]
    },
    {
        "tool": "delete",
        "cats": [
            1
        ],
        "coords": [
            {
                "x": 637209.083058167,
                "y": 257970.129540226
            }
        ]
    },
    {
        "tool": "delete",
        "cats": [
            2
        ],
        "coords": [
            {
                "x": 636981.336042672,
                "y": 256517.602235172
            }
        ]
    },
    {
        "tool": "delete",
        "cats": [
3
        ],
        "coords": [
            {
                "x": 637960.264160529,
                "y": 248293.868427705
            }
        ]
    }
]

The pipe format is

tool|cats|coords
break|1|637108.11963224,257241.783755701
break|2|636792.27198317,254339.365766968
break|3|638449.758677739,248299.357582029
delete|1|637209.083058167,257970.129540226
delete|2|636981.336042672,256517.602235172
delete|3|637960.264160529,248293.868427705

HuidaeCho, Feb 25 '24

Also, JSON streaming is something we need to consider for streaming large data, e.g., one edit (object) per line. Does JSON have a standard format for this? If it is not supported somehow, the JSON data will need to be generated completely before we can start editing any features. Another consideration is data size (almost twice as large: 556 bytes vs. 270 bytes for the pipe format in the following example).

Go read about .jsonl, JSON Lines. It's quite useful and I think it's what you're looking for. It's like CSV in that there is one record per line, but each line is a separate JSON document, so you can append to the file. I used it a couple of years ago and was able to analyze it with Power BI / Excel Get & Transform. Some other software already works with it, too. Other JSON format extensions I've heard of are JSON5 and JSONC, which you may have interacted with if you have used VS Code and its settings.
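
In Python terms, a sketch with the field names from the example above:

# JSON Lines: one compact JSON object per line; records can be appended
# and parsed one at a time, without loading the whole document.
import json

edits = [
    {"tool": "break", "cats": [1],
     "coords": [{"x": 637108.11963224, "y": 257241.783755701}]},
    {"tool": "delete", "cats": [1],
     "coords": [{"x": 637209.083058167, "y": 257970.129540226}]},
]

with open("edits.jsonl", "w") as f:
    for edit in edits:
        f.write(json.dumps(edit) + "\n")

with open("edits.jsonl") as f:
    for line in f:
        record = json.loads(line)  # each line parses on its own
        print(record["tool"], record["cats"])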

echoix, Feb 25 '24

...The input uses commas as separators for ids, cats, and coords, so I chose the pipe for its easier implementation (and because it is the GRASS standard); the pipe also seems rare in other strings such as where (except that | is used for bitwise OR, so it actually needs to be escaped somehow).

That's kind of the number one issue with the pipe-separated CSV format: it is rare enough, except when it is not. It seems rare in text, except when someone thinks it will make a nice table-like separator, and of course it is simply present in many different languages. Things like v.db.select caused a lot of trouble by trying to work around the need for a proper format with "smart" separators. There, I think, it turned out that neither rare-character separators nor multi-character separators actually deliver in general cases, and proper CSV with quoting (whatever the dialect) is the way to go.

I'm not necessarily arguing for JSON over CSV, but I'm definitely for a well-specified format. A (stricter) subset of CSV sounds good. I just want to avoid the mistake of the original v.db.select, where the idea was that pipe as a separator would avoid the need for a well-defined format.

The original v.db.select tried to work around the limits with some escapes, but that did not scale and was hard to predict for the (API) user. The nesting of the lists sort of sounds like that, so I want to be sure we don't cause unnecessary trouble for wider adoption down the line.

wenzeslaus, Feb 25 '24

I'm not necessarily arguing for JSON over CSV, but I'm definitely for a well-specified format. A (stricter) subset of CSV sounds good. I just want to avoid the mistake of the original v.db.select, where the idea was that pipe as a separator would avoid the need for a well-defined format.

The original v.db.select tried to work around the limits with some escapes, but that did not scale and was hard to predict for the (API) user. The nesting of the lists sort of sounds like that, so I want to be sure we don't cause unnecessary trouble for wider adoption down the line.

I think we want to implement https://www.rfc-editor.org/rfc/rfc4180, with other single special characters also allowed as the separator.

HuidaeCho, Feb 26 '24

I think we want to implement https://www.rfc-editor.org/rfc/rfc4180, with other single special characters also allowed as the separator.

Done. It should support:

tool,cats,coords
break,1,"637108.11963224,257241.783755701"
break,2,"636792.27198317,254339.365766968"
break,3,"638449.758677739,248299.357582029"
delete,1,"637209.083058167,257970.129540226"
delete,2,"636981.336042672,256517.602235172"
delete,3,"637960.264160529,248293.868427705"

and

tool|cats|coords|where
break|1|637108.11963224,257241.783755701|"col='some ""crazy"" | text'"
break|2|636792.27198317,254339.365766968|"col='some ""crazy"" | text'"
break|3|638449.758677739,248299.357582029|"col='some ""crazy"" | text'"
delete|1|637209.083058167,257970.129540226|"col='some ""crazy"" | text'"
delete|2|636981.336042672,256517.602235172|"col='some ""crazy"" | text'"
delete|3|637960.264160529,248293.868427705|"col='some ""crazy"" | text'"
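
As a side note, Python's standard csv module can produce this dialect directly; a small sketch (the where value is taken from the second table above):

# Sketch: the stdlib csv module reproduces the pipe dialect; quotes are
# doubled and pipe-containing fields are quoted per RFC 4180 rules.
import csv
import sys

w = csv.writer(sys.stdout, delimiter="|", quotechar='"',
               quoting=csv.QUOTE_MINIMAL, lineterminator="\n")
w.writerow(["tool", "cats", "coords", "where"])
w.writerow(["break", 1, "637108.11963224,257241.783755701",
            "col='some \"crazy\" | text'"])
# Output:
# tool|cats|coords|where
# break|1|637108.11963224,257241.783755701|"col='some ""crazy"" | text'"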

HuidaeCho, Feb 26 '24

+1 Thanks for implementing that. https://www.rfc-editor.org/rfc/rfc4180 or https://www.rfc-editor.org/rfc/rfc7111 is what I used as a reference for CSV in v.db.select (while ignoring the CRLF-LF distinction).

wenzeslaus, Feb 26 '24

@echoix I saw JSONL at some point and I think it is quite relevant, but I haven't used it yet.

wenzeslaus, Feb 26 '24

@HuidaeCho I tested the new separator parameter; everything works as expected. I just have a small comment on running v.edit on a file with a different separator than the one used in the input file:

head -n2 /tmp/batch 
tool,cats,coords
break,1,"637108.11963224,257241.783755701"

v.edit map=roadsmajor_upper_half batch=/tmp/batch 
ERROR: Unknown batch column 'tool,cats,coords' in line 1

v.edit map=roadsmajor_upper_half batch=/tmp/batch sep=,
ERROR: Unable to open vector map <roadsmajor_upper_half> on topological
       level. Try to rebuild vector topology by v.build.

landam, Feb 26 '24

@HuidaeCho I tested the new separator parameter; everything works as expected. I just have a small comment on running v.edit on a file with a different separator than the one used in the input file:

head -n2 /tmp/batch 
tool,cats,coords
break,1,"637108.11963224,257241.783755701"

v.edit map=roadsmajor_upper_half batch=/tmp/batch 
ERROR: Unknown batch column 'tool,cats,coords' in line 1

v.edit map=roadsmajor_upper_half batch=/tmp/batch sep=,
ERROR: Unable to open vector map <roadsmajor_upper_half> on topological
       level. Try to rebuild vector topology by v.build.

Please try it again.

HuidaeCho, Feb 26 '24

This is for a separate PR?

https://github.com/OSGeo/grass/blob/87d2d42d3b26c01d6de08ae8845618c9918a20cc/vector/v.edit/main.c#L473-L478

I think we need an && ret > 0 check to avoid building topology unnecessarily when there were no edits.

I was wrong. A level 2 vector always requires topology building to remain at level 2, even if there are no changes. I just tried this:

Vect_open_update2();
/* do nothing */
Vect_close();

The level 2 vector became level 1. I'm not sure why and how...

HuidaeCho, Feb 26 '24

I was wrong. A level 2 vector always requires topology building to remain at level 2, even if there are no changes. I just tried this:

Vect_open_update2();
/* do nothing */
Vect_close();

The level 2 vector became level 1. I'm not sure why and how...

Please try #3459. If it works for you, you are welcome to include this single line in your PR and I'll close mine without merging.

metzm, Feb 26 '24

Also, JSON streaming is something we need to consider for streaming large data, e.g., one edit (object) per line. Does JSON have a standard format for this? If it is not supported somehow, the JSON data will need to be generated completely before we can start editing any features. Another consideration is data size (almost twice as large: 556 bytes vs. 270 bytes for the pipe format in the following example).

Go read about .jsonl, JSON Lines. It's quite useful and I think it's what you're looking for. It's like CSV in that there is one record per line, but each line is a separate JSON document, so you can append to the file. I used it a couple of years ago and was able to analyze it with Power BI / Excel Get & Transform. Some other software already works with it, too. Other JSON format extensions I've heard of are JSON5 and JSONC, which you may have interacted with if you have used VS Code and its settings.

Now, with support for multiple batch tables in one input, I don't see a strong need for a JSON implementation in the module. I still think multiple formats should be supported more generally, outside individual modules. The new option at least supports a subset[^1] of a well-documented version of CSV, and Python users should be able to use the csv module.

[^1]: Multiline fields are not supported because of their unpredictable length. They might be useful for the add tool to embed ASCII input, but I think one edit per line is better.

HuidaeCho, Feb 26 '24

I was wrong. A level 2 vector always requires topology building to remain at level 2, even if there are no changes. I just tried this:

Vect_open_update2();
/* do nothing */
Vect_close();

The level 2 vector became level 1. I'm not sure why and how...

Please try #3459. If it works for you, you are welcome to include this single line in your PR and I'll close mine without merging.

Thanks for the PR. With that change, Vect_close() did the job, but G_fatal_error() still downgrades the level to 1 without an error handler (a minor annoyance ;-).

HuidaeCho, Feb 26 '24

Thanks for the PR. With that change, Vect_close() did the job, but G_fatal_error() still downgrades the level to 1 without an error handler (a minor annoyance ;-).

Should be fixed with the latest commit to #3459. In case of a crash, the original topo info is (should be) preserved.

metzm, Feb 26 '24

Can you please add a test? Both grass.gunittest and pytest would work here. pytest is now a little easier than before because grass.script.create_location() does not require an existing session. grass.gunittest would be needed if you want to turn your example into a test. The Python documentation on CSV writing is quite good, but here is an example of a CSV writer from a Python dictionary that I recently wrote.
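
As a rough, untested sketch of what such a pytest test could look like (the session fixture and the helper are assumptions for illustration, not existing code):

# Rough pytest sketch (untested); assumes a "session" fixture providing
# an active GRASS session whose mapset contains roadsmajor.
import grass.script as gs


def line_count(vmap, env):
    # v.info -t prints topology counts as key=value pairs
    return int(gs.parse_command("v.info", flags="t", map=vmap,
                                env=env)["lines"])


def test_batch_delete_removes_one_line(session):
    gs.run_command("g.copy", vector="roadsmajor,roads_edit",
                   env=session.env)
    before = line_count("roads_edit", session.env)
    # batch table: delete the category-1 line at one of its endpoints
    batch = ("tool|cats|coords\n"
             "delete|1|637209.083058167,257970.129540226\n")
    gs.write_command("v.edit", map="roads_edit", batch="-",
                     stdin=batch, env=session.env)
    assert line_count("roads_edit", session.env) == before - 1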

wenzeslaus, Feb 26 '24

Can you please add a test? Both grass.gunittest and pytest would work here. pytest is now a little easier than before because grass.script.create_location() does not require an existing session. grass.gunittest would be needed if you want to turn your example into a test. The Python documentation on CSV writing is quite good, but here is an example of a CSV writer from a Python dictionary that I recently wrote.

I'll do that when I get a chance.

HuidaeCho, Feb 26 '24

head -n2 /tmp/batch 
tool,cats,coords
break,1,"637108.11963224,257241.783755701"

v.edit map=roadsmajor_upper_half batch=/tmp/batch 
ERROR: Unknown batch column 'tool,cats,coords' in line 1

v.edit map=roadsmajor_upper_half batch=/tmp/batch sep=,
ERROR: Unable to open vector map <roadsmajor_upper_half> on topological
       level. Try to rebuild vector topology by v.build.

Please try it again.

Tested successfully.

landam, Mar 02 '24

@HuidaeCho Can we manage this for 8.4.0, or will it be postponed to 8.5? It is a new feature, so I am not sure it's a good idea to introduce it in 8.4.1 (?)

landam, Jun 03 '24

This PR needs to wait for 8.5. It is failing existing, unrelated tests that just happen to use gs.run_command("v.edit", map=name, tool="create").

This also reveals that there are no tests for the current functionality. These can be added in a separate PR.

wenzeslaus, Jun 10 '24