Issue

There are two differences between data.get_column_view(x)[0] and data.transform(Domain([x]).X[:, 0].

The former is simple and short.
The latter works.

If using Text to Columns used get_column_view, Apply Domain could not apply the second transformation because the required attribute (discretized petal length) is not in the domain.

Using transform works because it invokes compute_value.

Description of changes

Add Table.get_column which should be preferred over Table.get_column_view, and must be used in places where the variable is not guaranteed to exist (in particular, in compute_value functions).

If the column exists, it calls get_column_view, otherwise it uses compute_value.

Includes

[X] Code changes
[X] Tests
[X] Documentation

Jul 12 '22 10:07 janezd

As you say, using get_column_view on an unknown table is wrong.

For me, this is a confusing mixture of two very different behaviors. It implies a fast operation. So I do not like the name. I do not have any better ideas though.

Jul 15 '22 09:07 markotoplak

Would it be possible to have something similar to pandas' data["sepal length"]? This would not require to name the method, but it would likely require too much messing with the existing Table structure.

Jul 15 '22 09:07 ajdapretnar

@ajdapretnar, that would be more messing with the data structure. :) Especially in this case because the actual variable (with its compute value) is important, not the name.

Now I think get_column is fine, especially since then programmers who do not completely understand what is going on will find it first. The function you wrote is still as fast as get_column_view when the latter would be appropriate, so I see no loss.

I'd deprecate get_column_view.

Slightly related: do we ever enforce what shape functions doing compute_value return? I remember we have some code that reshapes compute_value outputs within table transforms, but that might be something stale? Is the shape of the output array always the same?

Jul 15 '22 10:07 markotoplak

I checked what should the compute_value do. We never enforce what its output should look like. Instead, match_density assures it is a (1D array for dense and 2D for sparse) after calling it in _ArrayConversion.get_columns. So I was running tests with assert statements:

@@ -299,10 +299,14 @@ class _ArrayConversion:
                     if shared is None:
                         shared = col.compute_shared(sourceri)
                         _idcache_save(shared_cache, (col.compute_shared, source), shared)
+                    ttt = col(sourceri, shared_data=shared)
                     col_array = match_density(
-                        col(sourceri, shared_data=shared))
+                            ttt)
+                    assert issparse(ttt) or ttt.ndim == 1, "Not 1D " + str(ttt.shape)
                 else:
-                    col_array = match_density(col(sourceri))
+                    ttt = col(sourceri)
+                    col_array = match_density(ttt)
+                    assert issparse(ttt) or ttt.ndim == 1, "Not 1D " + str(ttt.shape)
             elif col < 0:

For dense arrays, compute_value usually returns 1D numpy arrays. There are exceptions. PCA tests fail because there is a column array at one spot, then there are problems with t-SNE, and for the Feature Constructor, compute_value can be also a constant or even a python list.

Slicing on sparse matrices, on the other hand, always returns something that is 2D, because they do not have 1D slices at all.

For now, the output shape of this get_columns is thus undefined. :) Should we reshape it somehow? Also, if we want the same output shape for sparse and dense matrices, a (rows, 1) shaped vector is currently the only way to go. What do you think, @janezd?

Jul 15 '22 12:07 markotoplak

You're right, the result of compute_value should be reshaped. compute_value is currently called in domain conversion, which (for dense arrays) goes through assure_column_dense, which calls reshape(-1).

What about sparse data?

We could also request compute_value to return either a vector or a sparse matrix. For a few years, returning anything else would trigger a warning.

I'd deprecate get_column_view.

Me too, but its everywhere and we'll never remove it. We can just recommend not using it.

One argument is that we (in particular: I) use(d) it where we shouldn't.
Another: get_column_view does not even always return a view! In '19, that @janezd guy added code that maps between different ordering of categorical values!

There may be cases when one needs a view, e.g. because you just created and unlocked a table. As current function does not really always give a view, but it has a good name that we can't reuse: the function get_column that I propose here has an argument copy that, if set to True, forces a copy. We could add view=False; if set to True, it would return a view and fail if it can't.

(Setting both view and copy to True would always return a sparse matrix, because the joker who calls it this way deserves it. 🤪)

Jul 16 '22 09:07 janezd

Codecov Report

Merging #6058 (3a549a2) into master (036e96d) will increase coverage by 0.00%. The diff coverage is 94.59%.

Additional details and impacted files

@@           Coverage Diff           @@
##           master    #6058   +/-   ##
=======================================
  Coverage   86.65%   86.65%           
=======================================
  Files         315      315           
  Lines       67957    67974   +17     
=======================================
+ Hits        58890    58905   +15     
- Misses       9067     9069    +2

Oct 20 '22 12:10 codecov[bot]

@janezd, thanks for this. All the new usages look much simpler, especially when the values needed to be set. Merging as soon as the tests pass.

Oct 24 '22 11:10 markotoplak

Table: Add method get_column

Issue

Description of changes

Includes

Codecov Report