discrepancy reported between loc and cloc on the exact same repo
I just cloned loc and built it with cargo from the repo. Here are the results from running both tools on the exact same test code base.
As you can see, the two programs report quite significant discrepancies. If loc were a re-implementation of cloc, I would expect any discrepancies to be small.
$ loc -V
count 0.1
$ loc codebase
---------------------------------------------------------------------------------
Language Files Lines Blank Comment Code
---------------------------------------------------------------------------------
JavaScript 9080 1032352 131646 225231 675475
JSON 1139 133076 369 0 132707
Markdown 1115 159295 46234 0 113061
Python 207 70457 12095 4561 53801
C++ 64 26719 3483 3171 20065
HTML 211 21543 2607 1867 17069
Sass 112 18359 1665 1497 15197
C/C++ Header 97 17423 2551 1711 13161
XML 21 9826 475 22 9329
YAML 247 6611 260 70 6281
CSS 41 7625 1157 529 5939
Plain Text 75 1933 330 0 1603
Makefile 49 2624 438 738 1448
SQL 2 1325 238 0 1087
Lua 6 1209 225 36 948
TypeScript 2 1038 141 104 793
Less 3 797 94 11 692
Bourne Shell 21 840 142 123 575
Autoconf 4 799 74 263 462
Lisp 4 350 42 38 270
ASP.NET 6 265 0 0 265
Handlebars 4 200 18 0 182
C 5 258 45 37 176
CoffeeScript 11 112 23 9 80
Ruby 3 26 4 2 20
Batch 2 10 2 0 8
Z Shell 1 25 4 15 6
---------------------------------------------------------------------------------
Total 12532 1515097 204362 240035 1070700
---------------------------------------------------------------------------------
$ cloc codebase
13950 text files.
9763 unique files.
5923 files ignored.
https://github.com/AlDanial/cloc v 1.66 T=58.68 s (138.1 files/s, 20273.4 lines/s)
-----------------------------------------------------------------------------------
Language files blank comment code
-----------------------------------------------------------------------------------
JavaScript 6082 110844 185893 584265
JSON 1048 331 0 123278
Python 189 12106 8513 49919
C++ 64 3483 3174 20062
HTML 204 2599 165 18683
SASS 111 1665 1078 15616
C/C++ Header 97 2550 1711 13162
XML 19 242 11 7381
CSS 39 1157 528 5940
YAML 156 241 66 5684
Bourne Shell 24 474 454 2136
SQL 2 238 0 1087
TypeScript 2 141 104 793
Lua 5 168 27 686
LESS 2 82 10 606
make 25 178 40 575
m4 2 40 2 266
Lisp 3 42 38 264
Bourne Again Shell 10 54 27 184
C 3 31 29 130
Smarty 6 17 30 91
CoffeeScript 5 16 8 65
Handlebars 2 8 0 42
Windows Resource File 1 1 1 33
Ruby 2 2 2 12
DOS Batch 2 2 0 8
zsh 1 4 14 7
-----------------------------------------------------------------------------------
SUM: 8106 136716 201925 850975
-----------------------------------------------------------------------------------
PS: loc took around 2-3 seconds to finish; it would be nice to have the elapsed time reported in the output as well. cloc took almost a minute, so that's about a 20-30x improvement, not the 100x claimed.
Hmm, I'll look into these later, but at the moment I'm inclined to trust mine, since I believe loc is very slightly more accurate on C++, and JavaScript comments should be handled the same way as C++. I'm hoping to put together a script soon to identify the files with the largest discrepancies for manual testing. Will get back to you when I do.
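A minimal sketch of what such a comparison script could look like, assuming each tool's per-file code counts have already been exported into a mapping from file path to code lines. The function name and the input format are hypothetical and not part of loc or cloc:

```python
def largest_discrepancies(counts_a, counts_b, top=10):
    """Rank files by the absolute difference in code-line counts.

    counts_a / counts_b: dicts mapping file path -> code lines, one per tool
    (hypothetical input format). A file that one tool skipped counts as 0.
    Returns up to `top` tuples of (path, count_a, count_b, difference).
    """
    rows = []
    for path in set(counts_a) | set(counts_b):
        a = counts_a.get(path, 0)
        b = counts_b.get(path, 0)
        if a != b:
            rows.append((path, a, b, abs(a - b)))
    # Largest difference first; break ties by path for stable output.
    rows.sort(key=lambda row: (-row[3], row[0]))
    return rows[:top]
```

Files that one tool reports and the other ignores show up with a 0 on one side, which also helps separate file-detection differences from line-counting differences.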
Re timing: was that a cold cache for loc and a warm one for cloc? Can you try running loc twice? I just got 160x faster testing them both against a large code base (OpenBSD). If not, let me know.
@cgag re: discrepancies, yes. A set of baseline comparison tests would be great, and it would also be very helpful for finding bugs, since cloc is battle-tested and proven for the most part.
Re: speed improvement stats, I think you are right. Unfortunately I didn't time the first run, but subsequent runs were much faster (~0.25s to ~0.28s)! Now I wonder what exactly got cached, and where the cached data is stored?
For the timing, the operating system caches files it accesses in memory by default, so the first time you read it, it has to get it from disk, but the second time you try to read a file, it should be read from memory, which is much much faster. The OS will use any free memory to cache files until an application needs it.
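To see the cache effect, one option is to time the same command several times in a row. This is only a sketch: `time_runs` is a hypothetical helper, and whether the first run is actually "cold" depends on what the OS already has in memory.

```python
import subprocess
import time


def time_runs(cmd, runs=3):
    """Run `cmd` `runs` times and return the wall-clock seconds of each run."""
    timings = []
    for _ in range(runs):
        start = time.perf_counter()
        # Discard the tool's output; only the elapsed time matters here.
        subprocess.run(cmd, check=True,
                       stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL)
        timings.append(time.perf_counter() - start)
    return timings
```

For example, `time_runs(["loc", "codebase"])` would be expected to show a slower first entry followed by faster ones if the directory was not in the cache beforehand.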
If so, I wonder why cloc is timed in the same ballpark (~1 min) on repeated runs?
That's because cloc is CPU bound. It counts more slowly than it reads off disk, so the CPU is the bottleneck, and caching files in memory doesn't help. For loc, reading off of disk is the slowest portion, so making it faster through caching provides a huge speedup.
I ran this on the valgrind repository (checkout r16117). cloc and loc reported different results.
cloc:
5146 text files.
4437 unique files.
7445 files ignored.
http://cloc.sourceforge.net v 1.60 T=11.85 s (243.7 files/s, 126250.3 lines/s)
--------------------------------------------------------------------------------
Language files blank comment code
--------------------------------------------------------------------------------
C 1185 97930 107742 635880
Expect 921 24146 6947 451965
C/C++ Header 324 13612 24090 54026
XML 136 3870 733 21655
Assembly 57 2573 3271 8428
make 79 983 470 7989
C++ 28 1376 1377 7138
Teamcenter def 13 0 213 4658
m4 1 531 4 3913
Perl 17 729 518 3290
Bourne Shell 107 538 634 2160
XSLT 6 189 125 1152
Bourne Again Shell 8 95 131 377
Haskell 4 109 70 250
XSD 1 17 10 211
Korn Shell 1 31 24 150
CSS 1 10 4 53
--------------------------------------------------------------------------------
SUM: 2889 146739 146363 1203295
--------------------------------------------------------------------------------
loc:
--------------------------------------------------------------------------------
Language Files Lines Blank Comment Code
--------------------------------------------------------------------------------
C 1192 839672 97549 106976 635147
C/C++ Header 324 91728 13613 24072 54043
XML 136 26258 3870 732 21656
Makefile 83 9991 1071 505 8415
Plain Text 49 10557 2593 0 7964
Assembly 57 14272 2573 4524 7175
C++ 28 9891 1377 1377 7137
Autoconf 14 5751 800 1361 3590
Perl 4 2694 419 129 2146
Bourne Shell 3 621 14 4 603
Haskell 4 429 109 70 250
CSS 1 67 10 4 53
--------------------------------------------------------------------------------
Total 1895 1011931 123998 139754 748179
--------------------------------------------------------------------------------
I've run a file-by-file diff (only on .c files) and here's the result. Lines prefixed with ">" are from loc, "<" are from cloc. There are a lot of off-by-1 or off-by-2 differences; more interestingly, there are certain files where loc reports 0 lines.
I can't say that cloc is 100% correct, but on the few files I inspected loc had indeed miscounted the number of lines.
Also an example of something loc would report wrongly: (C file)
Bool h_clo_partial_loads_ok = True; /* user visible */
/* Bool h_clo_lossage_check = False; */ /* dev flag only */
loc would return 2. cloc returns 1. Edit: fix for this
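To make the expected count concrete, here is a simplified sketch of per-line classification for C block and line comments. It is not loc's or cloc's algorithm; `classify_c_lines` is a hypothetical name, and string literals containing comment markers are deliberately ignored.

```python
def classify_c_lines(source):
    """Count blank, comment-only, and code lines in C source text.

    Simplifications (assumptions of this sketch): comment markers inside
    string literals are not recognised, and an empty line inside an open
    block comment is counted as a comment line.
    """
    counts = {"blank": 0, "comment": 0, "code": 0}
    in_block = False
    for line in source.splitlines():
        has_code = False
        has_comment = in_block
        i = 0
        while i < len(line):
            if in_block:
                end = line.find("*/", i)
                if end == -1:
                    break  # the comment continues on the next line
                in_block = False
                i = end + 2
            elif line.startswith("/*", i):
                in_block = True
                has_comment = True
                i += 2
            elif line.startswith("//", i):
                has_comment = True
                break  # the rest of the line is a comment
            else:
                # Whitespace between or around comments is not code.
                if not line[i].isspace():
                    has_code = True
                i += 1
        if has_code:
            counts["code"] += 1
        elif has_comment:
            counts["comment"] += 1
        else:
            counts["blank"] += 1
    return counts
```

On the two lines above this gives one code line and one comment line, because the only thing between the two comments on the second line is whitespace.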
Edit 2: On this file it fails because it's ISO_8859_1 encoded and the code assumes every file is utf8
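One common way to tolerate such files is to try UTF-8 first and fall back to ISO-8859-1, which maps every byte to a character and therefore never fails. The sketch below only illustrates that idea; `decode_source` is a hypothetical helper and says nothing about how loc handles or will handle encodings.

```python
def decode_source(data):
    """Decode raw file bytes, preferring UTF-8 and falling back to Latin-1.

    Latin-1 (ISO-8859-1) accepts any byte sequence, so the fallback cannot
    raise, but it may produce wrong characters for other legacy encodings.
    """
    try:
        return data.decode("utf-8")
    except UnicodeDecodeError:
        return data.decode("latin-1")
```

Since line counting mostly cares about newlines and comment delimiters, which are plain ASCII in both encodings, a fallback like this is usually enough to avoid reporting 0 lines for a file.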
Good catch on the whitespace, I guess I should have bothered with it. I looked at the PR and it looks correct but chars.nth(pos) is probably less than ideal since nth is O(n) and we should be safe to just index into trimmed due to the is_character_boundary checks at the top of the loop.
Thanks for the valgrind diff. I'll test out the whitespace change and start digging into any differences. In my experience cloc is off by one fairly often, but any larger differences should be interesting.
Are you going to add support for other forms of encoding, or is the tool going to be limited to utf8 files only?
ISO-8859-1 support is offered by pvdb/gloc :wink:
Chiming in on this: I have a repo where both cloc and loc vastly miscount the number of lines of Python code (60k for cloc, 8.8k for loc, 93k in reality (!)). Let me know if you want some data regarding this, @cgag!
Edit: tokei seems to get the numbers 100% correct