orange3
orange3 copied to clipboard
Table: Add method get_column
Issue
There are two differences between data.get_column_view(x)[0]
and data.transform(Domain([x]).X[:, 0]
.
- The former is simple and short.
- The latter works.

If using Text to Columns used get_column_view
, Apply Domain could not apply the second transformation because the required attribute (discretized petal length) is not in the domain.
Using transform
works because it invokes compute_value
.
Description of changes
Add Table.get_column
which should be preferred over Table.get_column_view
, and must be used in places where the variable is not guaranteed to exist (in particular, in compute_value
functions).
If the column exists, it calls get_column_view
, otherwise it uses compute_value
.
Includes
- [X] Code changes
- [X] Tests
- [X] Documentation
As you say, using get_column_view
on an unknown table is wrong.
For me, this is a confusing mixture of two very different behaviors. It implies a fast operation. So I do not like the name. I do not have any better ideas though.
Would it be possible to have something similar to pandas' data["sepal length"]
? This would not require to name the method, but it would likely require too much messing with the existing Table structure.
@ajdapretnar, that would be more messing with the data structure. :) Especially in this case because the actual variable (with its compute value) is important, not the name.
Now I think get_column
is fine, especially since then programmers who do not completely understand what is going on will find it first. The function you wrote is still as fast as get_column_view
when the latter would be appropriate, so I see no loss.
I'd deprecate get_column_view
.
Slightly related: do we ever enforce what shape functions doing compute_value
return? I remember we have some code that reshapes compute_value
outputs within table transforms, but that might be something stale? Is the shape of the output array always the same?
I checked what should the compute_value
do. We never enforce what its output should look like. Instead, match_density
assures it is a (1D array for dense and 2D for sparse) after calling it in _ArrayConversion.get_columns
. So I was running tests with assert statements:
@@ -299,10 +299,14 @@ class _ArrayConversion:
if shared is None:
shared = col.compute_shared(sourceri)
_idcache_save(shared_cache, (col.compute_shared, source), shared)
+ ttt = col(sourceri, shared_data=shared)
col_array = match_density(
- col(sourceri, shared_data=shared))
+ ttt)
+ assert issparse(ttt) or ttt.ndim == 1, "Not 1D " + str(ttt.shape)
else:
- col_array = match_density(col(sourceri))
+ ttt = col(sourceri)
+ col_array = match_density(ttt)
+ assert issparse(ttt) or ttt.ndim == 1, "Not 1D " + str(ttt.shape)
elif col < 0:
For dense arrays, compute_value
usually returns 1D numpy arrays. There are exceptions. PCA tests fail because there is a column array at one spot, then there are problems with t-SNE, and for the Feature Constructor, compute_value can be also a constant or even a python list.
Slicing on sparse matrices, on the other hand, always returns something that is 2D, because they do not have 1D slices at all.
For now, the output shape of this get_columns
is thus undefined. :) Should we reshape it somehow? Also, if we want the same output shape for sparse and dense matrices, a (rows, 1) shaped vector is currently the only way to go. What do you think, @janezd?
You're right, the result of compute_value
should be reshaped. compute_value
is currently called in domain conversion, which (for dense arrays) goes through assure_column_dense
, which calls reshape(-1)
.
What about sparse data?
We could also request compute_value
to return either a vector or a sparse matrix. For a few years, returning anything else would trigger a warning.
I'd deprecate
get_column_view
.
Me too, but its everywhere and we'll never remove it. We can just recommend not using it.
- One argument is that we (in particular: I) use(d) it where we shouldn't.
- Another:
get_column_view
does not even always return a view! In '19, that @janezd guy added code that maps between different ordering of categorical values!
There may be cases when one needs a view, e.g. because you just created and unlocked a table. As current function does not really always give a view, but it has a good name that we can't reuse: the function get_column
that I propose here has an argument copy
that, if set to True
, forces a copy. We could add view=False
; if set to True
, it would return a view and fail if it can't.
(Setting both view
and copy
to True
would always return a sparse matrix, because the joker who calls it this way deserves it. 🤪)
Codecov Report
Merging #6058 (3a549a2) into master (036e96d) will increase coverage by
0.00%
. The diff coverage is94.59%
.
Additional details and impacted files
@@ Coverage Diff @@
## master #6058 +/- ##
=======================================
Coverage 86.65% 86.65%
=======================================
Files 315 315
Lines 67957 67974 +17
=======================================
+ Hits 58890 58905 +15
- Misses 9067 9069 +2
@janezd, thanks for this. All the new usages look much simpler, especially when the values needed to be set. Merging as soon as the tests pass.