Consider taking shebang line into account when identifying files
Currently, searchcode does not seem to take shebang lines into account when identifying the language. For example, a file starting with #! /usr/bin/env python3 should be identified as Python even if the .py file extension is missing.
Or, for an even more sophisticated solution, maybe something like https://github.com/github/linguist could be used.
I actually just moved this over to use the same list that http://github.com/boyter/scc/ uses. However, it is based entirely on file extensions.
There used to be some logic in there to guess the file type, but it only applied to duplicate extensions. It was very slow and inaccurate, hence its removal.
This looks like a reasonable compromise.
Have updated based on scc to now work with duplicate extensions.
As for dealing with shebang... that might be better as a pure searchcode implementation as it would needlessly slow down scc.
@sschuberth I don't suppose you know of some sort of list of these? If I can get them all in one go it would save some time.
I don't suppose you know of some sort of list of these?
No, and I don't believe there can be such an official / complete list, because you can use the path to any arbitrary interpreter after #!. My suggestion is to simply hard-code a few common cases in the form of (pseudocode):
if first line starts with "#!" then
    if first line contains case insensitive "python" then
        language = Python
    else if first line contains case insensitive "ruby" then
        language = Ruby
    end
    // ...
end
And only do the above as a fallback if the language hasn't been identified yet by other means, like the file extension.
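For illustration, the fallback described above could be sketched in Java roughly as follows. This is only a minimal sketch, not searchcode's actual code: the class name ShebangGuesser, its method names, and the keyword table (including the Perl, JavaScript and Shell entries) are assumptions made up for the example.

```java
import java.util.LinkedHashMap;
import java.util.Locale;
import java.util.Map;
import java.util.Optional;

public class ShebangGuesser {
    // Hypothetical keyword-to-language table; extend it with whatever interpreters matter.
    private static final Map<String, String> INTERPRETERS = new LinkedHashMap<>();
    static {
        INTERPRETERS.put("python", "Python");
        INTERPRETERS.put("ruby", "Ruby");
        INTERPRETERS.put("perl", "Perl");
        INTERPRETERS.put("node", "JavaScript");
        INTERPRETERS.put("bash", "Shell");
    }

    /** Guesses a language from a file's first line; empty if there is no known shebang. */
    public static Optional<String> fromFirstLine(String firstLine) {
        if (firstLine == null || !firstLine.startsWith("#!")) {
            return Optional.empty();
        }
        // Case-insensitive substring match, as in the pseudocode.
        String lower = firstLine.toLowerCase(Locale.ROOT);
        for (Map.Entry<String, String> entry : INTERPRETERS.entrySet()) {
            if (lower.contains(entry.getKey())) {
                return Optional.of(entry.getValue());
            }
        }
        return Optional.empty();
    }

    /**
     * Prefers the language already found by other means (e.g. the file extension,
     * passed as null when unknown) and only falls back to the shebang line.
     */
    public static Optional<String> identify(String languageFromExtension, String firstLine) {
        if (languageFromExtension != null) {
            return Optional.of(languageFromExtension);
        }
        return fromFirstLine(firstLine);
    }
}
```

Because the shebang is only consulted when the extension lookup came back empty, files with a known extension never pay for the extra check.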
The above could probably be implemented as a java.nio.file.spi.FileTypeDetector to power Files.probeContentType.
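A rough sketch of what such a detector might look like, assuming only the first 256 bytes need to be inspected. The class name ShebangFileTypeDetector and the MIME strings text/x-python and text/x-ruby are assumptions for the example (they are commonly seen but unofficial types), not something searchcode defines.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.spi.FileTypeDetector;
import java.util.Locale;

public class ShebangFileTypeDetector extends FileTypeDetector {
    @Override
    public String probeContentType(Path path) throws IOException {
        if (!Files.isRegularFile(path)) {
            return null; // null means "not recognised"
        }
        // Read only a small prefix so large or binary files stay cheap to probe.
        byte[] buffer = new byte[256];
        int read;
        try (InputStream in = Files.newInputStream(path)) {
            read = in.read(buffer);
        }
        if (read < 2) {
            return null; // empty file or too short to hold "#!"
        }
        // ISO-8859-1 maps every byte to a character, so binary data cannot fail to decode.
        String head = new String(buffer, 0, read, StandardCharsets.ISO_8859_1);
        if (!head.startsWith("#!")) {
            return null;
        }
        int endOfLine = head.indexOf('\n');
        String firstLine = (endOfLine >= 0 ? head.substring(0, endOfLine) : head)
                .toLowerCase(Locale.ROOT);
        if (firstLine.contains("python")) {
            return "text/x-python"; // assumed MIME type
        }
        if (firstLine.contains("ruby")) {
            return "text/x-ruby"; // assumed MIME type
        }
        return null;
    }
}
```

For Files.probeContentType to pick the detector up, its fully qualified class name would need to be listed in a META-INF/services/java.nio.file.spi.FileTypeDetector file on the class path. Installed detectors are consulted before the platform default, so a real implementation would have to check the extension itself first to keep the shebang check a pure fallback.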
Alternatively, maybe you can find a way to use the GNU file command's "database" at https://github.com/file/file/tree/master/magic/Magdir from Java (e.g. via https://github.com/j256/simplemagic), as the file command already seems to recognize most shebang lines.
I had a feeling that was the case, but was hoping it wasn't.
There are some pretty neat ideas in file. I shamelessly steal ideas from the GNU tools so I might have a look in there as well. Thanks for the pointers.