SLEP012 input_feature_names_
SLEP012 suggests adding a new attribute, `input_feature_names_`, to the estimators:
https://github.com/scikit-learn/enhancement_proposals/blob/master/slep012/proposal.rst#L25
Is this in scope of the SLEP?
I kinda wanted to have it about the DataArray only at first, but through discussions we decided to include that in the SLEP.
It looks like the two motivations for the InputArray are:
- input feature names for model inspection, which pushes for defining `input_feature_names_`
- allowing estimators to treat columns differently (for fairness, or for categorical columns)
Would a vote for SLEP 012 indirectly be a vote for `input_feature_names_`? If not, the SLEP vote is kind of for "infrastructure": in other words, it would enable us to work on the two items above.
(BTW, I'm trying to see what we need to do to get SLEP 012 ready to vote on.)
Are there any other use cases you see for InputArray?
I guess I need to give a bit of a context.
The history of feature names "feature" (:D) seems something like the following to me (please correct me if I'm missing something @amueller and @jorisvandenbossche ):
- A `get_feature_names` is introduced in some of the estimators. In most estimators it's the output feature names, but if implemented in a predictor, or a pipeline which has a predictor as the last step, it's the input features to the predictor (hence kind of inconsistent).
- Propagating the feature names through a pipeline, in order to be able to inspect the feature names going into each step, can be done if we allow `get_feature_names` to accept an existing set of feature names (from the previous step in the pipeline); having the `get_feature_names` of the pipeline do that for us would allow the user to inspect the features going into the last step. This would be done once `fit` is called. Some of the drawbacks: `get_feature_names` is not clear about whether it's the input or the output, it only works in a pipeline, and estimators don't have access to the feature names at `fit` time (@amueller has an implementation of this).
- Pipeline propagates the feature names at fit time, getting the output feature names of each step and passing them to the next step's `fit`. This could either be done with `get_feature_names` or with `feature_names_in_` and `feature_names_out_`; the latter makes it clear which one's which. It also means estimators have access to the feature names during `fit`. However, it's quite a bit of change in the pipeline, and it also requires us to pass a special `feature_names` arg to `fit`, which doesn't seem the nicest design (@amueller has an implementation of this).
- Building on the previous step, we could use a data structure which includes the feature names as the output of `transform`, pass that to `fit`, and have estimators understand feature names from it. I had an implementation of this using `xarray` and we seemed pretty okay with the implementation. It also doesn't touch the pipeline process, since it's transparent to the pipeline and other meta-estimators. The main issue with this solution is the dependency on `xarray` and hence `pandas`.
- Further developing the idea, we had a few discussions about developing a new data structure to keep the feature names alongside the data, and that's how the `DataArray` or `InputArray` came to be. It has the downside of introducing yet another data structure to the ecosystem, and that we'd need to maintain it. We also contemplated the idea of ideally pushing it out of sklearn in the future to make it a standalone package. The upside is that we'd avoid the `xarray` and/or `pandas` dependency. It's also a more backward compatible implementation, since any operation on our object would result in a `numpy` array. Understanding this history makes it kinda clear (I hope) why the SLEP includes `feature_names_in_`. If you look at the SLEP's history, you'll see that I started it as the infrastructure SLEP as you mention, but the reviewers were not happy with it since it didn't have enough motivation, and that motivation would include the introduction of feature names.
- The latest development of the idea is to have feature names implemented using `xarray` and/or `pandas`, but with a soft dependency, i.e. users would be able to have/enable the feature only if they have the dependency installed. From my conversations with a few of the core devs, it seems it's a solution which would rather easily get a consensus.
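The soft-dependency idea boils down to an optional import. A minimal sketch of the usual pattern, assuming a hypothetical `soft_import` helper (this is not sklearn code):

```python
import importlib


def soft_import(name):
    """Return the named module if it is installed, else None.

    This is the standard "soft dependency" pattern: the feature-names
    machinery would only activate when the optional backend (pandas or
    xarray) is importable, and otherwise fall back to plain numpy output.
    """
    try:
        return importlib.import_module(name)
    except ImportError:
        return None


# Hypothetical check at import time: feature-name output is available
# only if at least one optional backend is installed.
backend = soft_import("pandas") or soft_import("xarray")
feature_names_supported = backend is not None
```

The point of the pattern is that users without the dependency lose only the optional feature, not the library.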
With the above background in mind, I'm hoping that the last solution would gain consensus, and I hope the implementation is not too complicated. There are a few bits which need to be discussed, such as:
- `xarray` or `pandas`? I very much prefer `xarray`, Andy prefers `pandas`, and there are pros and cons on either side.
- a global config or a local one, to enable feature names
- whether it's enabled by default if the user passes a data structure containing feature names
- what the default values of those configs are: probably off by default (on both accounts) for at least a few releases, since returning the new data structure is not backward compatible with people's existing code
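The global-vs-local config question can be sketched as follows (names are hypothetical, loosely modeled on sklearn's existing `set_config` / `config_context` pair; the `feature_names` key is made up for illustration):

```python
import contextlib

# Hypothetical global configuration; feature names off by default,
# mirroring the backward-compatibility argument above.
_config = {"feature_names": False}


def set_config(**kwargs):
    """Globally toggle options, e.g. set_config(feature_names=True)."""
    _config.update(kwargs)


def get_config():
    """Return a copy of the current configuration."""
    return dict(_config)


@contextlib.contextmanager
def config_context(**kwargs):
    """Locally toggle options for a block of code, then restore."""
    old = dict(_config)
    _config.update(kwargs)
    try:
        yield
    finally:
        _config.clear()
        _config.update(old)
```

With both in place, a user could opt in globally once, or wrap just the inspection code in `with config_context(feature_names=True): ...` and leave the rest of their pipeline untouched.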
The parts where we seem to already have consensus (I think):
- having both `feature_names_in_` and `feature_names_out_`
- deprecating `get_feature_names`
- transformers returning a data structure which includes the feature names, if feature names are enabled
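The first consensus point can be illustrated with a toy transformer (a sketch, not sklearn code; the `scaled_` naming and the `feature_names` argument to `fit` are made up for illustration):

```python
class ToyScaler:
    """Toy transformer illustrating the proposed attributes.

    Records `feature_names_in_` at fit time and derives
    `feature_names_out_` from it.
    """

    def fit(self, X, feature_names=None):
        n_features = len(X[0])
        if feature_names is None:
            # Fall back to generated names when none are provided.
            feature_names = [f"x{i}" for i in range(n_features)]
        self.feature_names_in_ = list(feature_names)
        # A scaler keeps one output column per input column, so the
        # output names map one-to-one onto the input names.
        self.feature_names_out_ = [f"scaled_{n}" for n in self.feature_names_in_]
        return self
```

The attribute pair makes the input/output distinction explicit, which is exactly what `get_feature_names` failed to do.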
If we move towards the latest solution (which I support), then I'd withdraw SLEP012 in favor of the new solution.
Some additions:
- It's right now always the output feature names, so it's consistent, though potentially unexpected.
- This got a bit better since we can now slice pipelines, so we could do `pipe[:-1].get_feature_names()`. That doesn't solve the issue for other meta-estimators (if we want feature names there), though.
- If we do this with `get_feature_names()` it has similar limitations w.r.t. other meta-estimators as 2). It's true that I mixed the fit-time aspect with the `feature_names_in_` and `feature_names_out_` parts. These are somewhat orthogonal, but together they help consistently address all meta-estimators.
- It's not entirely clear whether this requires adding `feature_names_in_`. Probably it does.
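The pipeline-slicing idea can be made concrete with a toy mock (this is not sklearn's actual `Pipeline` API; the class and step names are invented):

```python
class NamedStep:
    """Toy step that derives its output feature names from its inputs."""

    def __init__(self, suffix):
        self.suffix = suffix

    def get_feature_names(self, input_features):
        return [f"{name}_{self.suffix}" for name in input_features]


class ToyPipeline:
    """Toy pipeline supporting slicing, so pipe[:-1] drops the last step."""

    def __init__(self, steps):
        self.steps = list(steps)

    def __getitem__(self, index):
        # Slicing returns a sub-pipeline over the selected steps.
        return ToyPipeline(self.steps[index])

    def get_feature_names(self, input_features):
        # Chain each step's output names into the next step.
        names = list(input_features)
        for step in self.steps:
            names = step.get_feature_names(names)
        return names
```

With a predictor as the last step, `pipe[:-1].get_feature_names([...])` yields the names entering that final step, which is the trick referenced above; the limitation is that no such slicing exists for arbitrary meta-estimators.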
I agree with the summary. Though I think one aspect you're missing is that @jnothman (and also @thomasjpfan) have concerns that having `feature_names_in_` and `feature_names_out_` will mean potentially generating and storing a lot of strings.
Will the `HashingVectorizer` create 1M strings at fit time, independent of the input? Will they be copied at every step in a pipeline? We could choose not to have feature names for sparse data, but then we can't really deprecate `get_feature_names` in `CountVectorizer`.
Having it be opt-in might solve some of these issues, but that means you need to decide ahead of time whether you might want to look at the feature names or not.
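To make the memory concern concrete, a rough back-of-envelope for `HashingVectorizer`'s default 2**20 features (the `hashed_` naming is made up; the sizes assume CPython's string representation):

```python
import sys

n_features = 2 ** 20  # HashingVectorizer's default n_features
# Hypothetical generated names: hashed features have no real names,
# so any feature_names_out_ would have to be synthesized like this.
names = [f"hashed_{i}" for i in range(n_features)]

per_name = sys.getsizeof(names[0])   # object header + chars, ~57 bytes
list_overhead = sys.getsizeof(names)  # the pointer array itself
total_mb = (sum(sys.getsizeof(s) for s in names) + list_overhead) / 1e6
# Roughly tens of megabytes for a single estimator -- and a pipeline
# that copies the names at every step multiplies this cost.
```

This is the scenario behind the concern: the strings exist regardless of whether anyone ever inspects them, unless the feature is opt-in.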