visidata
Add options + code that allows a user to specify the format of expanded column names [pyobj]
Proposed Change / Requirement:
- Allow the user to specify the format string used to name expanded Python object columns of type list and dict, e.g., "%s[%s]"
- Allow the user to specify lists as being 0- or 1-based, e.g. a[0] vs a[1]
Currently the formats used to generate the name of an expanded column are hardcoded in pyobj.py.
It is proposed to allow the user to set them programmatically by means of 2 visidata options.
Further, in the case of lists, (and only affecting the name), it is proposed to allow the indexing of flattened lists/tuples to optionally start at 1, (c.f., the current value: 0). Again, with a visidata option.
Why? Good Question, Thanks For Asking! :)
This is to allow better interoperability with external programs when converting between csv-json, and vice versa.
In this particular case, with mlr, which serializes JSON/CSV using 1-based arrays and the format "%s.%s".
Proof of Concept:
Initial Notes:
Not sure if this is the way to do it; i.e., not sure if it should probably be a plugin, etc. Also, it doesn't have error checking on the format string, so it can easily throw an error, e.g., "TypeError: not enough arguments for format string"
It is however, Hideously Functional⁽™⁾ :)
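The missing error check mentioned above could be sketched roughly as follows. This is plain Python, not visidata code: check_column_format is a hypothetical helper name, and the idea is simply to try the format string against two dummy arguments (column name and key/index) before accepting it.

```python
def check_column_format(fmt: str) -> bool:
    """Return True if fmt can be applied to exactly two arguments (name, key).

    Hypothetical helper, not part of visidata: it trial-formats the string
    and treats any formatting error as "invalid".
    """
    try:
        fmt % ("col", "key")
    except (TypeError, ValueError):
        # TypeError: too few/too many placeholders, or a type mismatch like %d
        # ValueError: malformed format, e.g. a trailing lone '%'
        return False
    return True


print(check_column_format("%s[%s]"))   # True
print(check_column_format("%s%s%s"))   # False
```

A check like this could run when the option is set, so a bad format string is rejected up front rather than failing during column expansion.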
We introduce 3 new options:
- expanded_column_format_dict # str
- expanded_column_format_list # str
- expanded_column_list_1up # bool
more specifically:
vd.option('expanded_column_format_dict', '%s.%s', 'column name given to expanded columns (dictionaries)')
vd.option('expanded_column_format_list', '%s[%s]', 'column name given to expanded columns (lists and tuples)')
vd.option('expanded_column_list_1up', False, 'expanded columns become 1-based arrays for lists and tuples, when this option is True')
These default to the current hardcoded values and can, of course, be set in .visidatarc
Currently vd would expand the input: { x: [ "a", "b", "c" ] }
to:
x[0],x[1],x[2]
a,b,c
With the proposed changes (and example settings), we can produce:
x.1,x.2,x.3
a,b,c
Which allows us to take our flattened-json in csv format, and recover the structure using mlr, i.e., with column expansion in vd, the following pipeline is idempotent to its input:
echo '{ "x": [ "a", "b", "c" ] }' | vd -f json --save-filetype=csv | mlr --c2j --no-jlistwrap cat
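The naming scheme described above can be illustrated with a small standalone sketch. It is not the code from the attached diff: expanded_names is a hypothetical function, and its keyword arguments merely mirror the three proposed options (expanded_column_format_dict, expanded_column_format_list, expanded_column_list_1up).

```python
def expanded_names(name, value, fmt_dict="%s.%s", fmt_list="%s[%s]", list_1up=False):
    """Return the column names that one level of expansion would produce.

    Hypothetical illustration only; the defaults match the values the
    proposal describes as currently hardcoded.
    """
    if isinstance(value, dict):
        return [fmt_dict % (name, key) for key in value]
    if isinstance(value, (list, tuple)):
        start = 1 if list_1up else 0  # only the name changes, not the data
        return [fmt_list % (name, i) for i in range(start, start + len(value))]
    return [name]  # scalars are not expanded


row = {"x": ["a", "b", "c"]}
print(expanded_names("x", row["x"]))
# ['x[0]', 'x[1]', 'x[2]']
print(expanded_names("x", row["x"], fmt_list="%s.%s", list_1up=True))
# ['x.1', 'x.2', 'x.3']
```

The second call corresponds to the mlr-compatible settings: a "%s.%s" list format with 1-based indexing.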
I've attached the would be pull-request as combined unified diff, and
the local branch I've been working on was forked from branch: origin/develop@554ebdf5
Regardless, if this isn't possible / suitable, etc., I would be equally happy to hear there's an alternative in existence, or there's a better way to achieve the same thing, etc.
Cheers,
Joe.
Hey @joe-opensrc, thanks for this request. I'm wondering if we used '%s.%s' for both lists and dicts internally, if anyone would lament the loss of the separate "col[0]" notation, or have any desire for configurability beyond this. Do you care for configurability yourself or just want it compatible with e.g. mlr?
I think we could have a general internal option called something like array_base which could be 0 or 1 (or otherwise I guess) and would affect this and other places that an array index is exposed to the user (melt, for example).
hello @saulpw,
thank you for the reply :)
tl;dr
Personally ok with existing notation. Prefer configurability; flexible with external JSON/CSV tools. Agree with array_base; better name!
longer response:
Personally I quite like the visidata notation of a[n] and a.x; less ambiguous as to the underlying type.
I think I would lament the loss! :D
re configurability;
I think I prefer configurability.
I'm using visidata (along with a few other tools), to help wrangle CSV to/from JSON,
and for my use-case I think the end-goal would be to have the ability to collapse columns via a parser/tokeniser for those sheet types, i.e., using a method that doesn't rely on having the origCol from a previous expansion.
Something which would allow bi-directional parsing of CSV+notation to/from JSON, e.g.,:
1. open csv file
2. parse / unflatten columns in CSV+notation into JSON / pyobj (using configurable options)
3. wrangle data
4. expand / flatten columns into CSV+notation (using configurable options)
5. save csv file
I'm currently using mlr to achieve 2., and opening as JSON.
This suggestion is basically no. 4.
Although 2. is a non-trivial problem, for sure, it could perhaps lend itself to being a plugin(?)
One which might provide alternatives for the vd expand-col* commands, and maybe leverage an external library in the process, etc.
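As a rough idea of what the parse / unflatten step could look like, here is a minimal sketch in plain Python. It is not a visidata plugin and uses no visidata API: parse_path and unflatten are hypothetical names, and it assumes the a[n] / a.x notation with 0-based indexes. It does not handle the dotted-index style (a.1), nor columns that treat the same name as both a list and a dict.

```python
import re

# Either a bracketed index like "[0]", or a key optionally preceded by "."
_TOKEN = re.compile(r"\[(\d+)\]|\.?([^.\[\]]+)")


def parse_path(colname):
    """Split 'x[0]' or 'y.z' into a path: int for list indexes, str for dict keys."""
    path = []
    for match in _TOKEN.finditer(colname):
        index, key = match.groups()
        path.append(int(index) if index is not None else key)
    return path


def _slot(container, key, default):
    """Return container[key], creating it (and padding lists with None) if missing."""
    if isinstance(container, list):
        container.extend([None] * (key + 1 - len(container)))
        if container[key] is None:
            container[key] = default
        return container[key]
    return container.setdefault(key, default)


def unflatten(row):
    """Rebuild nested dicts/lists from a flat {column name: value} mapping."""
    root = {}
    for colname, value in row.items():
        path = parse_path(colname)
        node = root
        # Walk down, creating a list or dict depending on the next path element
        for key, nxt in zip(path, path[1:]):
            node = _slot(node, key, [] if isinstance(nxt, int) else {})
        last = path[-1]
        if isinstance(node, list):
            node.extend([None] * (last + 1 - len(node)))
        node[last] = value
    return root


flat = {"x[0]": "a", "x[1]": "b", "x[2]": "c", "y.z": "###"}
print(unflatten(flat))
# {'x': ['a', 'b', 'c'], 'y': {'z': '###'}}
```

Keeping the tokenizer separate from the tree-building means a different notation would only need a different parse_path.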
re array_base;
..is totally a better name :). How about: ecol_list_format ? ecol_dict_format ?
Additional:
Another example of a JSON serialization tool is gron:
echo '{ "x": [ "a", "b", "c" ], "y": { "z": "###" } }' | gron
json = {};
json.x = [];
json.x[0] = "a";
json.x[1] = "b";
json.x[2] = "c";
json.y = {};
json.y.z = "###";
Which you can see is closer to visidata notation.
To illustrate, the real-world reasoning / problem goes like this:
A department MUST use Excel/CSV to provide data to a pipeline (boo!)
The data MUST later be transformed to/from JSON
The data MIGHT later be transformed back to CSV/Excel
CSV is flat / has no structure
JSON is not-flat / is structured.
The aim is to preserve the structure of the data using the CSV column names (CSV+notation).
visidata cannot parse CSV+notation into JSON
external tools can parse CSV+notation into JSON, but use different formats, i.e., a[0] vs a.0 vs a.1, etc...
therefore make visidata capable of producing column names which conform to external tool notation
eventually make visidata parse CSV+notation internally in production of its column names; via a loader plugin.
Or something like that! :D
Sorry for the essay, hope this is useful,
Hi @joe-opensrc, I believe this is implemented more or less how you wanted. Let me know if this works for you!
Hi @saulpw,
excellent, thank you! ...and apologies for not getting back sooner.
I tested branch v.2.10dev (via pip install -U git+file ...) and it worked as intended :)
I just updated to tag:v2.10 though, and I have a suspicion that the array_base setting might have now disappeared from the repo?
Re-opened to investigate this! Thanks for letting us know.
@joe-opensrc Was array_base part of v2.10dev? I cannot find it ever existing! It may have been missed.
I can add it if it was missing. It is just important to clarify if it was forgotten, or kept out/removed for a reason. =)
@anjakefala Thanks for looking; a bit confused here also :D
Possibly it was missed?
The python-lib METADATA file contains: Download-URL: https://github.com/saulpw/visidata/tarball/2.10dev
So I must have installed from there at some point?
There is options.incr_base, but it is used for the incr commands, and not for this. We opted for going with the fmtstrs as per the original wish. No options.array_base exists.