company-statistics Make score calc features customizable

Would you consider making the features (e.g., :keyword, :symbol and :file) used by the scoring functions customizable?

I created my own context/calc/change functions that duplicate a lot of stuff from the default versions but primarily introduces an additional customizable variable where I tweak the features used to score candidates. It looks something like this:

;; :total, :mode, :lookup-prefix and :file are the features I use (in place of :keyword,
;; :symbol and :file).
;; The keywords following each of them define how they are captured, incremented and
;; summed to calculate the score.
(setq dg-company-statistics-default-features
      `(
        (total
         (:get-context . t)
         (:incr . 1))

        (mode
         (:get-context . (dg-company-statistics-get-effective-major-mode))
         (:incr . 1))

        (lookup-prefix
         (:get-context . (dg-company-statistics-lookup-prefix))
         (:weight . 5)
         (:incr . 1))

        (file
         (:get-context . buffer-file-name)
         (:incr . 1))
        ))

I think the statistics could be much smarter but requires making use of more features, many of which would be specific to certain major modes. In order to make it worthwhile developing a statistically robust candidate scoring function utilising this, it needs to be general enough to share among multiple modes. Therefore, the context/change/score function implementations ought to be decoupled from the features.

If you would consider this, I could look at adapting my configuration so that company-statistics would still work the same way by default but the :keyword, :file and :symbol features would be pulled out of the default functions and placed into a new customizable alist similar to the one above.

Oct 18 '17 17:10 dgillis

Interesting thought! But couldn't you just create the context/change/score functions from the feature list using a macro? The existing "features", as you call them, already work like this (in the "heavy" variant): they may return a nil context, and this is discarded. So are you talking about only the change and calc part?

Excuse my Qs, I have not worked on that code for a long time (while using it constantly). Could you lay out briefly a suggestion what would need to change in company-statistics for your idea to work?

Oct 18 '17 18:10 ilohmar

Interesting thought! But couldn't you just create the context/change/score functions from the feature list using a macro?

It would be possible to generate the functions from the feature list; but in that case, you would need to generate different functions for each version of the feature list (note that the feature list would likely differ between major modes). Thus it would become error prone, requiring some hook to locally set the functions on changing modes, and hard to debug since you'd be working with dynamically generated functions.

The existing "features", as you call them, already work like this (in the "heavy" variant): they may return a nil context, and this is discarded. So are you talking about only the change and calc part?

But the existing "features" are hard-coded right into the the context/change/calc functions (the heavy versions). For example, if I wanted everything to work the same, but introduce one more feature -- say, the major mode as :mode -- there appears to be no easy way of doing this aside from copying the default functions and adding a single-line to each.

Excuse my Qs, I have not worked on that code for a long time (while using it constantly). Could you lay out briefly a suggestion what would need to change in company-statistics for your idea to work?

My suggestion would be implementing the context/change/calc functions so that they work the same but the list of features is kept as an external customizable variable, as in the code snippet in my previous message.

For example, here is my implementation of the capture context function that utilizes configurable features:

(defun dg-company-statistics-capture-context (&optional _manual)
  (save-excursion
    (backward-char (length company-prefix))
    (let* ((ctx (mapcar (lambda (feat)
                          (let* ((key (car feat))
                                 (cfg (cdr feat))
                                 (get-ctx-form (cdr (assoc :get-context feat)))
                                 (value (eval get-ctx-form)))
                            (when value
                              (list key value))))
                        dg-company-statistics-features))
          (ctx-no-nils (delq nil ctx)))
      (setq company-statistics--context ctx-no-nils)
      ctx-no-nils)))

Now I can configure dg-company-statistics-feature so that this function returns the same thing as company-statistics-capture-context-heavy. And if I want to add/remove/change features, I only need to configuration alist rather than making a whole new function.

I have similarily implemented change and calc functions that are based on the current ones, but utilize the configuration alist as well.

The motivation for making all this stuff configurable is that I think that will make it possible to do much better candidate selection. But to do this, the ability to customize the features is essential.

One last thing, here's an example benefit of doing all this: my Javascript completion works vastly better since it takes into account the Javascript node types around the completion point; this prevents non-sensical completions such as return (a restricted keyword) occuring on the righthand side of an assignment, as the completion can detect that something called "return" is unlikely to occur following "=". Making this change just required adding a couple entries to the configuration:

;; Take into account the previous two JS syntax node types when deciding on a completion
;; candidate.
(setq dg-js2-mode-company-statistics-features
      (append
       '(
         (previous-node-type
          (:get-context . (dg-js2-previous-statement-node-type 1))
          (:weight . 1))
         (previous-previous-node-type
          (:get-context . (dg-js2-previous-statement-node-type 2))
          (:weight . 1))
         )
       dg-company-statistics-features))

Oct 18 '17 19:10 dgillis

After letting this sink in a bit, I'm torn: It may not be completely obvious, but the function triples in the package were only meant to be examples (that happen to work reasonably well for me).

So there are two ways to proceed.

EITHER we move forward with your approach. This will introduce at least three new functions, and one new variable for the feature list (I would prefer using buffer-local values of a single variable, with a value that is set in a mode's respective hook), and this whole mechanism will look even more like it's part of the package's "core" idea (which it isn't).

Also, we would need to prescribe a simple flexible way to describe the features, as you have laid out in the snippets above. I am honestly not convinced this is worth the complexity in general: js2-mode is kind of an oddball, as it has the whole AST (most modes don't).

OR I just tell you that this is fine to customize with your own versions, but that I will not include it in the package, which seems a shame.

Do you think it is sensible to turn this into another package? It could be very small, just implementing the functions, the variable, and maybe a wrapper minor-mode around company-statistics.

Oct 20 '17 20:10 ilohmar

I forked C-S and integrated my changes here (I added "dg-" prefixes to all the functions so I can test it against C-S side by side).

I think how to proceed depends on if you like the feature-configuration method of customizing the package more than custom functions. In my opinion, customizable "features" is a big win here: a user can very quickly add a new feature so that statistics takes into account something of importance to their use case that it is currently missing. In contrast, having to grok the interplay between the context/change/calc functions and then implement your own equivalents is far more time consuming and something I suspect most users which just decide is not worth it.

However, as you noted, this does add a bit of complexity to the code and make the package more abstract. I think this is ultimately worthwhile because it facilitates many refinements to the scoring that would not be feasible under the current setup; for example, note that my score calc function can use totals and percentages across a particular feature (and I think using percentages makes a lot of sense for rare but highly informative features that will otherwise count for very little using a simple sum). Eventually I would like to see this use some sort of Bayesian prediction to further improve candidate selection; again, this is not feasible/worthwhile without the hard coded features being abstracted out of the scoring functions.

Oct 23 '17 17:10 dgillis

Thanks for letting me know! Well, for the time being I think this is the best approach. Maybe work on your fork will quickly lead to something you're happy with (or maybe it already has), and I would love to have a closer look when I find the time.

Oct 26 '17 21:10 ilohmar

company-statistics company-statistics copied to clipboard

Make score calc features customizable

company-statistics
company-statistics copied to clipboard