orange3 icon indicating copy to clipboard operation
orange3 copied to clipboard

Table: Add method get_column

Open janezd opened this issue 2 years ago • 5 comments

Issue

There are two differences between data.get_column_view(x)[0] and data.transform(Domain([x]).X[:, 0].

  1. The former is simple and short.
  2. The latter works.
Screen Shot 2022-07-12 at 11 55 39

If using Text to Columns used get_column_view, Apply Domain could not apply the second transformation because the required attribute (discretized petal length) is not in the domain.

Using transform works because it invokes compute_value.

Description of changes

Add Table.get_column which should be preferred over Table.get_column_view, and must be used in places where the variable is not guaranteed to exist (in particular, in compute_value functions).

If the column exists, it calls get_column_view, otherwise it uses compute_value.

Includes
  • [X] Code changes
  • [X] Tests
  • [X] Documentation

janezd avatar Jul 12 '22 10:07 janezd

As you say, using get_column_view on an unknown table is wrong.

For me, this is a confusing mixture of two very different behaviors. It implies a fast operation. So I do not like the name. I do not have any better ideas though.

markotoplak avatar Jul 15 '22 09:07 markotoplak

Would it be possible to have something similar to pandas' data["sepal length"]? This would not require to name the method, but it would likely require too much messing with the existing Table structure.

ajdapretnar avatar Jul 15 '22 09:07 ajdapretnar

@ajdapretnar, that would be more messing with the data structure. :) Especially in this case because the actual variable (with its compute value) is important, not the name.

Now I think get_column is fine, especially since then programmers who do not completely understand what is going on will find it first. The function you wrote is still as fast as get_column_view when the latter would be appropriate, so I see no loss.

I'd deprecate get_column_view.

Slightly related: do we ever enforce what shape functions doing compute_value return? I remember we have some code that reshapes compute_value outputs within table transforms, but that might be something stale? Is the shape of the output array always the same?

markotoplak avatar Jul 15 '22 10:07 markotoplak

I checked what should the compute_value do. We never enforce what its output should look like. Instead, match_density assures it is a (1D array for dense and 2D for sparse) after calling it in _ArrayConversion.get_columns. So I was running tests with assert statements:

@@ -299,10 +299,14 @@ class _ArrayConversion:
                     if shared is None:
                         shared = col.compute_shared(sourceri)
                         _idcache_save(shared_cache, (col.compute_shared, source), shared)
+                    ttt = col(sourceri, shared_data=shared)
                     col_array = match_density(
-                        col(sourceri, shared_data=shared))
+                            ttt)
+                    assert issparse(ttt) or ttt.ndim == 1, "Not 1D " + str(ttt.shape)
                 else:
-                    col_array = match_density(col(sourceri))
+                    ttt = col(sourceri)
+                    col_array = match_density(ttt)
+                    assert issparse(ttt) or ttt.ndim == 1, "Not 1D " + str(ttt.shape)
             elif col < 0:

For dense arrays, compute_value usually returns 1D numpy arrays. There are exceptions. PCA tests fail because there is a column array at one spot, then there are problems with t-SNE, and for the Feature Constructor, compute_value can be also a constant or even a python list.

Slicing on sparse matrices, on the other hand, always returns something that is 2D, because they do not have 1D slices at all.

For now, the output shape of this get_columns is thus undefined. :) Should we reshape it somehow? Also, if we want the same output shape for sparse and dense matrices, a (rows, 1) shaped vector is currently the only way to go. What do you think, @janezd?

markotoplak avatar Jul 15 '22 12:07 markotoplak

You're right, the result of compute_value should be reshaped. compute_value is currently called in domain conversion, which (for dense arrays) goes through assure_column_dense, which calls reshape(-1).

What about sparse data?

We could also request compute_value to return either a vector or a sparse matrix. For a few years, returning anything else would trigger a warning.

I'd deprecate get_column_view.

Me too, but its everywhere and we'll never remove it. We can just recommend not using it.

  • One argument is that we (in particular: I) use(d) it where we shouldn't.
  • Another: get_column_view does not even always return a view! In '19, that @janezd guy added code that maps between different ordering of categorical values!

There may be cases when one needs a view, e.g. because you just created and unlocked a table. As current function does not really always give a view, but it has a good name that we can't reuse: the function get_column that I propose here has an argument copy that, if set to True, forces a copy. We could add view=False; if set to True, it would return a view and fail if it can't.

(Setting both view and copy to True would always return a sparse matrix, because the joker who calls it this way deserves it. 🤪)

janezd avatar Jul 16 '22 09:07 janezd

Codecov Report

Merging #6058 (3a549a2) into master (036e96d) will increase coverage by 0.00%. The diff coverage is 94.59%.

Additional details and impacted files
@@           Coverage Diff           @@
##           master    #6058   +/-   ##
=======================================
  Coverage   86.65%   86.65%           
=======================================
  Files         315      315           
  Lines       67957    67974   +17     
=======================================
+ Hits        58890    58905   +15     
- Misses       9067     9069    +2     

codecov[bot] avatar Oct 20 '22 12:10 codecov[bot]

@janezd, thanks for this. All the new usages look much simpler, especially when the values needed to be set. Merging as soon as the tests pass.

markotoplak avatar Oct 24 '22 11:10 markotoplak