scikit-learn-intelex

Defer (default) fptype/method selection until data arrives

Open fschlimb opened this issue 6 years ago • 2 comments

Analyze the underlying types of arrays passed to the compute() method. We usually pass NumPy arrays into compute() without caring about the array's underlying type. So if we read a NumPy array from CSV (where dtype=np.float64 by default) and feed it to the algorithm, we may incur significant performance degradation from the internal data conversion from double to float (if the algorithm is specified with fptype='float'). It may therefore be useful to develop automatic fptype detection based on the data the algorithm is fed. Alternatively, we could simply notify the user with a warning on stdout about the conversion to be made.
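A minimal sketch of both options (detection and warning), not daal4py's actual implementation. The helper names `detect_fptype` and `warn_on_conversion` are hypothetical, and the sketch assumes the input exposes a NumPy-style `dtype` attribute. It uses Python's `warnings` module instead of printing to stdout, which is also an assumption.

```python
import warnings

# Mapping from NumPy dtype names to the fptype strings used in this thread.
_DTYPE_TO_FPTYPE = {"float32": "float", "float64": "double"}


def detect_fptype(data, default="double"):
    """Return the fptype matching data's dtype, or `default` if unknown."""
    # str(np.dtype("float64")) is "float64", so this works for NumPy arrays
    # and for anything else exposing a compatible `dtype` attribute.
    dtype_name = str(getattr(data, "dtype", ""))
    return _DTYPE_TO_FPTYPE.get(dtype_name, default)


def warn_on_conversion(data, fptype):
    """Warn if `data` would need converting to match `fptype`.

    Returns True when a conversion would be needed, False otherwise.
    """
    detected = detect_fptype(data, default=None)
    if detected is not None and detected != fptype:
        warnings.warn(
            f"input dtype {data.dtype} does not match fptype={fptype!r}; "
            "the data will be converted before computation",
            RuntimeWarning,
            stacklevel=2,
        )
        return True
    return False
```

With this, an array loaded from CSV as float64 would resolve to `"double"`, and passing it to an algorithm fixed at `fptype="float"` would raise a visible warning rather than converting silently.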

The same issue exists for CSR input: by default the user should get the fastest method and not be required to select the CSR method manually.
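Method selection could follow the same pattern. The sketch below is illustrative only: `select_method` is a hypothetical helper, and the method names `"defaultDense"` and `"fastCSR"` are placeholders, since the real names differ from algorithm to algorithm. It relies on SciPy sparse matrices exposing a `format` attribute, so SciPy itself does not have to be imported.

```python
def select_method(data, dense_method="defaultDense", csr_method="fastCSR"):
    """Pick a method name based on whether `data` looks like a CSR matrix."""
    # SciPy sparse matrices carry a `format` string ("csr", "csc", ...);
    # dense arrays have no such attribute and fall through to the dense method.
    if getattr(data, "format", None) == "csr":
        return csr_method
    return dense_method
```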

The problem here is of course that we cannot create the algorithm until we know the input data. In general it should be possible to defer the creation until we have the data. There are some technical details in daal4py to work out. More importantly this raises a few user-visible issues, like

  • DAAL’s parameter checking will be deferred as well (and so the user will get a message triggered by an ‘unrelated’ line of code)
  • what should happen if a kernel is set up by the user with partial input-data and DAAL uses it internally by other algorithms (like optimization solver pattern)? What if the user changes the partial input?

fschlimb avatar Feb 04 '19 16:02 fschlimb

DAAL’s parameter checking will be deferred as well

As far as I know, all parameter-checking procedures in DAAL are already "deferred" and run in the compute() call. Do you have code in daal4py that checks all parameter values? If so, it may duplicate the checks in DAAL, and we might reuse DAAL's functionality for that.

It also doesn't seem to me that all parameter values depend on the algorithm being created. Most of them are probably general to the whole method and can therefore be checked without creating the algorithm. So could you check all these "general" parameter values in the algorithm's "constructor" (which would not actually construct any algorithm, only a sort of description of it) and construct the algorithm inside compute()?
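A rough sketch of that split, assuming nothing about daal4py internals. `DeferredAlgorithm`, the `factory` argument, and the `n_clusters` / `max_iterations` parameters are all hypothetical names chosen for illustration.

```python
class DeferredAlgorithm:
    """Describe an algorithm now; build the real one when data arrives."""

    def __init__(self, factory, n_clusters, max_iterations=10):
        # Data-independent ("general") checks run eagerly, so an error
        # points at the line where the user supplied the bad value.
        if n_clusters < 1:
            raise ValueError("n_clusters must be >= 1")
        if max_iterations < 1:
            raise ValueError("max_iterations must be >= 1")
        self._factory = factory
        self._params = {
            "n_clusters": n_clusters,
            "max_iterations": max_iterations,
        }

    def compute(self, data):
        # Only now is the data type known, so the concrete algorithm
        # is constructed here rather than in __init__.
        is_float32 = str(getattr(data, "dtype", "")) == "float32"
        fptype = "float" if is_float32 else "double"
        algo = self._factory(fptype=fptype, **self._params)
        return algo.compute(data)
```

Checks that depend on the data (for example, a dimension mismatch) would still surface only in compute(), which is the deferred-error concern raised in the original issue.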

what should happen if a kernel is set up by the user with partial input-data and DAAL uses it internally by other algorithms (like optimization solver pattern)?

Probably in this situation we should provide a way to disable algorithm auto-selection :) For example, if the user creates an algorithm and specifies all the parameters that describe the type of data that will be passed in, then there is no need for deferred initialization.

michael-smirnov avatar Feb 06 '19 06:02 michael-smirnov

For example, if only the fptype parameter describes the data type (a simplified case), then we might set its default value to None (which will trigger deferred initialization). The other values the user may set (float or double) force the algorithm to be created strictly for float or double arrays of data.
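In code, the rule might look like the sketch below. `resolve_fptype` is a hypothetical helper. Falling back to `"double"` when the dtype is not recognized is an assumption, as is leaving an explicit choice untouched even when the data has a different dtype.

```python
_DTYPE_TO_FPTYPE = {"float32": "float", "float64": "double"}


def resolve_fptype(requested, data):
    """Return the fptype the algorithm should be built with.

    requested=None               -> deferred: infer from the data's dtype.
    requested="float" / "double" -> explicit: honor the user's choice.
    """
    if requested is None:
        dtype_name = str(getattr(data, "dtype", ""))
        # Assumption: unrecognized dtypes fall back to double precision.
        return _DTYPE_TO_FPTYPE.get(dtype_name, "double")
    if requested not in ("float", "double"):
        raise ValueError(f"unsupported fptype: {requested!r}")
    return requested
```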

michael-smirnov avatar Feb 06 '19 07:02 michael-smirnov