Simple Instrumentation of Enry predictions in dev mode
To assist debugging in dev mode, it would be nice to have some visibility into the decision-making logic that Enry uses at runtime.
Problem: after getting a final prediction, e.g. through enry.GetLanguage(), it's very hard to tell:
- what strategies were used
- what suggestions each strategy made
- what was the winning strategy
Such introspection would simplify maintenance and reduce the time needed to debug mispredictions, e.g. after sync-ups with Linguist.
Linguist does have a simple protocol, Linguist.instrumenter, that serves this need and is very generic, with the ability to be deployed and enabled in production, etc.
Something simpler, similar to the LocalInstrumenter (detailed below) and propagated to every Strategy, would work for Enry in Go development mode and is the subject of this issue.
class LocalInstrumenter
Event = Struct.new(:name, :args)
attr_reader :events
def initialize
@events = []
end
def instrument(name, *args)
@events << Event.new(name, args)
yield if block_given?
end
end
Linguist.instrumenter = LocalInstrumenter.new
would produce
#<struct LocalInstrumenter::Event
name="linguist.strategy",
args=
[{:blob=>
...full file content...
@detect_encoding=
{:type=>:text,
:encoding=>"UTF-8",
:ruby_encoding=>"UTF-8",
:confidence=>80},
@encoded_newlines_re=/\r\n|\r|\n/,
@fullpath=".linguist/samples/C++/Types.h",
@path=".linguist/samples/C++/Types.h",
@size=1484,
@symlink=false>,
:strategy=>Linguist::Heuristics,
:candidates=>
[#<Linguist::Language name=C>,
#<Linguist::Language name=C++>,
#<Linguist::Language name=Objective-C>]}]>,
#<struct LocalInstrumenter::Event
name="linguist.detected",
args=
[{:blob=>
...full file content...
@detect_encoding=
{:type=>:text,
:encoding=>"UTF-8",
:ruby_encoding=>"UTF-8",
:confidence=>80},
@encoded_newlines_re=/\r\n|\r|\n/,
@fullpath=".linguist/samples/C++/Types.h",
@path=".linguist/samples/C++/Types.h",
@size=1484,
@symlink=false>,
:strategy=>Linguist::Heuristics,
:language=>#<Linguist::Language name=C++>}]>]>
Currently, a simplistic version of this is possible by hard-coding log statements:
diff --git a/common.go b/common.go
index 949db71..d4a6c57 100644
--- a/common.go
+++ b/common.go
@@ -3,11 +3,14 @@ package enry
import (
"bufio"
"bytes"
+ "log"
"path/filepath"
"strings"
"gopkg.in/src-d/enry.v1/data"
"gopkg.in/src-d/enry.v1/regex"
+
+ "github.com/sanity-io/litter"
)
// OtherLanguage is used as a zero value when a function can not return a specific language.
@@ -118,6 +121,7 @@ func GetLanguageBySpecificClassifier(content []byte, candidates []string, classi
// At least one of arguments should be set. If content is missing, language detection will be based on the filename.
// The function won't read the file, given an empty content.
func GetLanguages(filename string, content []byte) []string {
+ log.Printf("file:%s\n", filename)
if IsBinary(content) {
return nil
}
@@ -126,6 +130,8 @@ func GetLanguages(filename string, content []byte) []string {
candidates := []string{}
for _, strategy := range DefaultStrategies {
languages = strategy(filename, content, candidates)
+ log.Printf("\tstrategy:%s, langs:%q\n", litter.Sdump(strategy), languages)
+
if len(languages) == 1 {
return languages
}
diff --git a/data/heuristics.go b/data/heuristics.go
index dc3663d..c894985 100644
--- a/data/heuristics.go
+++ b/data/heuristics.go
@@ -1,6 +1,11 @@
package data
-import "regexp"
+import (
+ "log"
+ "regexp"
+
+ "github.com/sanity-io/litter"
+)
type (
Heuristics []Matcher
@@ -20,7 +25,10 @@ type (
func (h *Heuristics) Match(data []byte) []string {
var matchedLangs []string
+ litter.Config.Compact = true
+
for _, matcher := range *h {
+ log.Printf("matcher:%s\n", litter.Sdump(matcher))
if matcher.Match(data) {
for _, langOrAlias := range matcher.(Rule).GetLanguages() {
lang, ok := LanguagesByAlias(langOrAlias)
@@ -31,6 +39,7 @@ func (h *Heuristics) Match(data []byte) []string {
}
matchedLangs = append(matchedLangs, lang)
}
+ log.Printf("\t\tlangs:%q\n", matchedLangs)
break
}
}
but the idea is to provide an API with simple instrumentation for all strategies instead, which can be used in tests to achieve similar results.
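A minimal sketch of what such an API could look like in Go, loosely mirroring the Ruby LocalInstrumenter. Everything here is hypothetical and not part of Enry: the `Event`, `Instrumenter`, `LocalInstrumenter` and `NamedStrategy` types, the `detect` function, and the two toy strategies are made up for illustration. The `Strategy` signature follows the calls in the diff above; how candidates are narrowed between strategies is an assumption.

```go
package main

import (
	"fmt"
	"strings"
)

// Strategy has the same shape as the strategies called in GetLanguages.
type Strategy func(filename string, content []byte, candidates []string) []string

// Event records what one strategy was given and what it suggested (hypothetical type).
type Event struct {
	Strategy   string
	Filename   string
	Candidates []string
	Languages  []string
}

// String renders an event roughly like the hard-coded log lines.
func (e Event) String() string {
	return fmt.Sprintf("file:%s strategy:%s candidates:%q langs:%q",
		e.Filename, e.Strategy, e.Candidates, e.Languages)
}

// Instrumenter is a hypothetical hook invoked once per strategy run.
type Instrumenter interface {
	Instrument(e Event)
}

// LocalInstrumenter collects events in memory, for use in tests.
type LocalInstrumenter struct {
	Events []Event
}

// Instrument appends the event to the in-memory list.
func (l *LocalInstrumenter) Instrument(e Event) {
	l.Events = append(l.Events, e)
}

// NamedStrategy pairs a strategy with a readable name,
// because a Go func value does not carry one.
type NamedStrategy struct {
	Name string
	Run  Strategy
}

// detect runs strategies in order and reports each step to the instrumenter.
// Assumption: a non-empty result becomes the candidate set for the next strategy.
func detect(in Instrumenter, strategies []NamedStrategy, filename string, content []byte) []string {
	var languages []string
	candidates := []string{}
	for _, s := range strategies {
		languages = s.Run(filename, content, candidates)
		if in != nil {
			in.Instrument(Event{
				Strategy:   s.Name,
				Filename:   filename,
				Candidates: candidates,
				Languages:  languages,
			})
		}
		if len(languages) == 1 {
			return languages // this strategy is the winner
		}
		if len(languages) > 0 {
			candidates = languages
		}
	}
	return languages
}

// byExtension is a toy strategy: ".h" files are ambiguous.
func byExtension(filename string, _ []byte, _ []string) []string {
	if strings.HasSuffix(filename, ".h") {
		return []string{"C", "C++", "Objective-C"}
	}
	return nil
}

// byContent is a toy strategy: a "class " keyword picks C++ if it is a candidate.
func byContent(_ string, content []byte, candidates []string) []string {
	if !strings.Contains(string(content), "class ") {
		return nil
	}
	for _, c := range candidates {
		if c == "C++" {
			return []string{"C++"}
		}
	}
	return nil
}
```

In a test, one would pass a `*LocalInstrumenter` to `detect` and then inspect `Events`: the list shows which strategies ran and what each suggested, and the last event whose `Languages` has exactly one entry names the winning strategy. Passing `nil` disables instrumentation.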