
Consider taking shebang line into account when identifying files

Open sschuberth opened this issue 7 years ago • 6 comments

Currently, searchcode does not seem to take shebang lines into account when identifying the language. For example, a file starting with #! /usr/bin/env python3 should be identified as Python even if the .py file extension is missing.

sschuberth avatar Jul 03 '18 07:07 sschuberth

Or, for an even more sophisticated solution, maybe something like https://github.com/github/linguist could be used.

sschuberth avatar Jul 03 '18 14:07 sschuberth

I just moved this over to use the same list that http://github.com/boyter/scc/ uses. However, it is based entirely on file extensions.

There used to be some logic in there to guess the language, but only in the case of duplicate extensions. It was very slow and inaccurate, hence its removal.

This looks like a reasonable compromise.

boyter avatar Jul 03 '18 21:07 boyter

Have updated based on scc to now work with duplicate extensions.

As for dealing with shebang... that might be better as a pure searchcode implementation as it would needlessly slow down scc.

@sschuberth I don't suppose you know of some sort of list of these? If I can get them all in one go it would save some time.

boyter avatar Mar 10 '19 22:03 boyter

> I don't suppose you know of some sort of list of these?

No, and I don't believe there can be such an official / complete list, because you can use the path to any arbitrary interpreter after #!. My suggestion is to simply hard-code a few common cases in the form of (pseudocode):

if first line starts with "#!" then
    if first line contains case insensitive "python" then
        language = Python
    else if first line contains case insensitive "ruby" then
        language = Ruby
    end
    // ...
end

And only do the above as a fallback if the language hasn't been identified yet by other means, like the file extension.
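For illustration, the pseudocode and the fallback rule could be sketched in Java roughly as follows. This is only a sketch under assumptions: the class name ShebangDetector, the keyword table, and the identify helper are hypothetical and not part of searchcode or scc.

```java
import java.util.LinkedHashMap;
import java.util.Locale;
import java.util.Map;
import java.util.Optional;

// Hypothetical helper, not part of searchcode: guesses a language from a shebang line.
final class ShebangDetector {
    // Assumed keyword -> language table; extend with other common interpreters as needed.
    private static final Map<String, String> INTERPRETERS = new LinkedHashMap<>();
    static {
        INTERPRETERS.put("python", "Python");
        INTERPRETERS.put("ruby", "Ruby");
    }

    // Returns the language if the first line is a shebang naming a known interpreter.
    static Optional<String> detect(String firstLine) {
        if (firstLine == null || !firstLine.startsWith("#!")) {
            return Optional.empty();
        }
        String lower = firstLine.toLowerCase(Locale.ROOT);
        for (Map.Entry<String, String> entry : INTERPRETERS.entrySet()) {
            if (lower.contains(entry.getKey())) {
                return Optional.of(entry.getValue());
            }
        }
        return Optional.empty();
    }

    // Fallback rule: only consult the shebang when the extension gave no answer.
    static Optional<String> identify(String languageFromExtension, String firstLine) {
        if (languageFromExtension != null) {
            return Optional.of(languageFromExtension);
        }
        return detect(firstLine);
    }
}
```

Because this is a plain substring match, a path such as /home/pythonista/bin/ruby would be reported as Python; matching only the interpreter's base name would be stricter, at the cost of a little more parsing.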

The above could probably be implemented as a FileTypeDetector to power probeContentType.
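As a rough sketch of that idea, a java.nio.file.spi.FileTypeDetector subclass could look like the following. The class name and the MIME strings text/x-python and text/x-ruby are assumptions chosen for illustration; probeContentType deals in content types, so mapping them back to searchcode's language names would be a separate step.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.spi.FileTypeDetector;
import java.util.Locale;

// Hypothetical detector: maps a few common shebang interpreters to assumed MIME types.
public class ShebangFileTypeDetector extends FileTypeDetector {
    @Override
    public String probeContentType(Path path) throws IOException {
        // Read a bounded prefix so large or binary files are never loaded in full.
        byte[] buffer = new byte[256];
        int read;
        try (InputStream in = Files.newInputStream(path)) {
            read = in.readNBytes(buffer, 0, buffer.length); // Java 9+
        }
        // ISO-8859-1 decodes any byte sequence, so binary content cannot cause a decoding error.
        String head = new String(buffer, 0, read, StandardCharsets.ISO_8859_1);
        if (!head.startsWith("#!")) {
            return null; // null means "not recognized" per the FileTypeDetector contract
        }
        int newline = head.indexOf('\n');
        String firstLine = (newline >= 0 ? head.substring(0, newline) : head)
                .toLowerCase(Locale.ROOT);
        if (firstLine.contains("python")) {
            return "text/x-python"; // assumed MIME string
        }
        if (firstLine.contains("ruby")) {
            return "text/x-ruby"; // assumed MIME string
        }
        return null;
    }
}
```

For Files.probeContentType to consult such a detector, its fully qualified class name would need to be listed in a META-INF/services/java.nio.file.spi.FileTypeDetector file on the classpath. Installed detectors are consulted before the system default, so this variant would not by itself act as an extension-first fallback.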

sschuberth avatar Mar 11 '19 09:03 sschuberth

Alternatively, maybe you can find a way to use the GNU file command's "database" at https://github.com/file/file/tree/master/magic/Magdir from Java (e.g. via https://github.com/j256/simplemagic), as the file command already seems to recognize most shebang lines.

sschuberth avatar Mar 11 '19 09:03 sschuberth

I had a feeling that was the case, but was hoping it wasn't.

There are some pretty neat ideas in file. I shamelessly steal ideas from the GNU tools so I might have a look in there as well. Thanks for the pointers.

boyter avatar Mar 11 '19 21:03 boyter