Simple Instrumentation of Enry predictions in dev mode

Open bzz opened this issue 7 years ago • 1 comments

To assist debugging in dev mode, it would be nice to have some visibility into the decision-making logic that Enry uses at runtime.

Problem: after getting a final prediction, e.g. through enry.GetLanguage(), it's very hard to tell:

what strategies were used
what suggestions each strategy made
what was the winning strategy

Such introspection would simplify maintenance and reduce the time needed to debug mispredictions, for example after sync-ups with Linguist.

Linguist does have a simple protocol, Linguist.instrumenter, that serves this need and is very generic, with the ability to be deployed and enabled in production, etc.

Something simpler, similar to the LocalInstrumenter shown below and propagated to every Strategy, would work for Enry in Go development mode and is the subject of this issue.

class LocalInstrumenter
  Event = Struct.new(:name, :args)

  attr_reader :events

  def initialize
    @events = []
  end

  def instrument(name, *args)
    @events << Event.new(name, args)
    yield if block_given?
  end
end

Linguist.instrumenter = LocalInstrumenter.new

would produce

   #<struct LocalInstrumenter::Event
    name="linguist.strategy",
    args=
     [{:blob=>
           ...full file content...
         @detect_encoding=
          {:type=>:text,
           :encoding=>"UTF-8",
           :ruby_encoding=>"UTF-8",
           :confidence=>80},
         @encoded_newlines_re=/\r\n|\r|\n/,
         @fullpath=".linguist/samples/C++/Types.h",
         @path=".linguist/samples/C++/Types.h",
         @size=1484,
         @symlink=false>,
       :strategy=>Linguist::Heuristics,
       :candidates=>
        [#<Linguist::Language name=C>,
         #<Linguist::Language name=C++>,
         #<Linguist::Language name=Objective-C>]}]>,

  #<struct LocalInstrumenter::Event
    name="linguist.detected",
    args=
     [{:blob=>
           ...full file content...
         @detect_encoding=
          {:type=>:text,
           :encoding=>"UTF-8",
           :ruby_encoding=>"UTF-8",
           :confidence=>80},
         @encoded_newlines_re=/\r\n|\r|\n/,
         @fullpath=".linguist/samples/C++/Types.h",
         @path=".linguist/samples/C++/Types.h",
         @size=1484,
         @symlink=false>,
       :strategy=>Linguist::Heuristics,
       :language=>#<Linguist::Language name=C++>}]>]>

Jan 09 '19 17:01 bzz

In the current state, a simplistic version of this is possible by hard-coding log statements:

diff --git a/common.go b/common.go
index 949db71..d4a6c57 100644
--- a/common.go
+++ b/common.go
@@ -3,11 +3,14 @@ package enry
 import (
 	"bufio"
 	"bytes"
+	"log"
 	"path/filepath"
 	"strings"
 
 	"gopkg.in/src-d/enry.v1/data"
 	"gopkg.in/src-d/enry.v1/regex"
+
+	"github.com/sanity-io/litter"
 )
 
 // OtherLanguage is used as a zero value when a function can not return a specific language.
@@ -118,6 +121,7 @@ func GetLanguageBySpecificClassifier(content []byte, candidates []string, classi
 // At least one of arguments should be set. If content is missing, language detection will be based on the filename.
 // The function won't read the file, given an empty content.
 func GetLanguages(filename string, content []byte) []string {
+	log.Printf("file:%s\n", filename)
 	if IsBinary(content) {
 		return nil
 	}
@@ -126,6 +130,8 @@ func GetLanguages(filename string, content []byte) []string {
 	candidates := []string{}
 	for _, strategy := range DefaultStrategies {
 		languages = strategy(filename, content, candidates)
+		log.Printf("\tstrategy:%s, langs:%q\n", litter.Sdump(strategy), languages)
+
 		if len(languages) == 1 {
 			return languages
 		}
diff --git a/data/heuristics.go b/data/heuristics.go
index dc3663d..c894985 100644
--- a/data/heuristics.go
+++ b/data/heuristics.go
@@ -1,6 +1,11 @@
 package data
 
-import "regexp"
+import (
+	"log"
+	"regexp"
+
+	"github.com/sanity-io/litter"
+)
 
 type (
 	Heuristics []Matcher
@@ -20,7 +25,10 @@ type (
 
 func (h *Heuristics) Match(data []byte) []string {
 	var matchedLangs []string
+	litter.Config.Compact = true
+
 	for _, matcher := range *h {
+		log.Printf("matcher:%s\n", litter.Sdump(matcher))
 		if matcher.Match(data) {
 			for _, langOrAlias := range matcher.(Rule).GetLanguages() {
 				lang, ok := LanguagesByAlias(langOrAlias)
@@ -31,6 +39,7 @@ func (h *Heuristics) Match(data []byte) []string {
 				}
 				matchedLangs = append(matchedLangs, lang)
 			}
+			log.Printf("\t\tlangs:%q\n", matchedLangs)
 			break
 		}
 	}

but the idea is instead to provide an API with simple instrumentation for all strategies, which can be used in tests to achieve similar results.
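A minimal sketch of what such an API could look like, under stated assumptions. The names Event, LocalInstrumenter, Wrap and detect are hypothetical and not part of Enry. Strategy is declared locally with the same call shape used in GetLanguages (filename, content, candidates), and the toy strategies only stand in for the real ones.

```go
package main

import "fmt"

// Strategy mirrors the call shape used in GetLanguages:
// strategy(filename, content, candidates) returns suggested languages.
type Strategy func(filename string, content []byte, candidates []string) []string

// Event is a hypothetical record of a single strategy invocation.
type Event struct {
	Strategy  string   // label supplied when the strategy was wrapped
	Filename  string   // file the strategy was asked about
	Languages []string // languages the strategy suggested
}

// LocalInstrumenter is a hypothetical in-memory collector,
// named after the Ruby example in the issue description.
type LocalInstrumenter struct {
	Events []Event
}

// Wrap returns a Strategy that calls s and records its result as an Event.
func (i *LocalInstrumenter) Wrap(name string, s Strategy) Strategy {
	return func(filename string, content []byte, candidates []string) []string {
		languages := s(filename, content, candidates)
		i.Events = append(i.Events, Event{Strategy: name, Filename: filename, Languages: languages})
		return languages
	}
}

// detect is a simplified stand-in for the loop in GetLanguages: it stops at
// the first strategy that returns exactly one language. Passing earlier
// results on as candidates is an assumption made for this sketch.
func detect(strategies []Strategy, filename string, content []byte) []string {
	var languages []string
	candidates := []string{}
	for _, strategy := range strategies {
		languages = strategy(filename, content, candidates)
		if len(languages) == 1 {
			return languages
		}
		if len(languages) > 0 {
			candidates = languages
		}
	}
	return languages
}

func main() {
	inst := &LocalInstrumenter{}

	// Toy strategies, for illustration only.
	byExtension := func(string, []byte, []string) []string {
		return []string{"C", "C++", "Objective-C"}
	}
	byContent := func(string, []byte, []string) []string {
		return []string{"C++"}
	}

	strategies := []Strategy{
		inst.Wrap("extension", byExtension),
		inst.Wrap("content", byContent),
	}
	detect(strategies, "Types.h", []byte("class Foo {};"))

	for _, e := range inst.Events {
		fmt.Printf("strategy:%s file:%s langs:%q\n", e.Strategy, e.Filename, e.Languages)
	}
}
```

With this shape, a test could wrap the strategies it cares about, run detection, and assert on the recorded events. The events list the strategies that ran and what each one suggested, and when detection ends with a single language, the last event names the winning strategy.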

Jan 10 '19 10:01 bzz