Add code_count to JSON output

Open SeanTConrad opened this issue 2 years ago • 4 comments

Story

As a user of the JSON format I want...

  • ...to have access to the various code counts so that I can decide myself which one fits my purpose best
  • ...to understand the differences between the various code counts so that I can make an educated choice about which one fits my purpose best.

Goals

  • [ ] The JSON output includes code_count.
  • [ ] The documentation section on How pygount counts code includes an explanation of the different counts, especially the difference between code_count and source_count.

Original request: Format=json summary "totalSourceCount" doesn't match format=summary Code Sum

I apologize if I am missing an option or documentation on this.

I was testing Pygount by running on the public Rails repo: git clone [email protected]:rails/rails.git

First, I test the summary format with: pygount -F=.git,node_modules --format=summary rails

The resulting "Sum" for the "Code" column is 382801.

Then, I test the JSON output with: pygount -F=.git,node_modules --format=json -o pygount_test.json rails

The resulting summary at the bottom of the JSON has 410575 for totalSourceCount:

    {
      "summary": {
        "totalDocumentationCount": 62533,
        "totalDocumentationPercentage": 10.743837150966607,
        "totalEmptyCount": 108928,
        "totalEmptyPercentage": 18.71499357428063,
        "totalFileCount": 4546,
        "totalSourceCount": 410575,
        "totalSourcePercentage": 70.54116927475276
      }
    }
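The percentages in this summary can be cross-checked against the counts. The following is a minimal sketch in Python that uses only the key names shown in the JSON above; the values are pasted inline instead of being read from pygount_test.json, and the assumption that the percentages are relative to the sum of documentation, empty and source lines is mine, not taken from the pygount documentation.

```python
import json

# Summary block copied from the JSON output quoted above.
summary_json = """
{
  "summary": {
    "totalDocumentationCount": 62533,
    "totalEmptyCount": 108928,
    "totalFileCount": 4546,
    "totalSourceCount": 410575,
    "totalSourcePercentage": 70.54116927475276
  }
}
"""

summary = json.loads(summary_json)["summary"]

# Assumption: percentages are relative to documentation + empty + source lines.
total_lines = (
    summary["totalDocumentationCount"]
    + summary["totalEmptyCount"]
    + summary["totalSourceCount"]
)
source_percentage = 100 * summary["totalSourceCount"] / total_lines

print(total_lines)  # 582036
print(round(source_percentage, 2))  # 70.54
```

Under that assumption the recomputed percentage agrees with totalSourcePercentage, so the JSON summary is internally consistent; the mismatch is only with the "Code" sum of the summary format.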

Am I incorrect in expecting the summary format Code Sum and the json format totalSourceCount to match?

Thank you

SeanTConrad avatar Jul 14 '23 14:07 SeanTConrad

True, this is an unfortunate inconsistency and lapse in the JSON output.

Background: I always disliked how SLOC tools count even the most trivial pieces of code that add 0 code complexity. That's why pygount never counts lines that only contain "{" in C or pass in Python.

For the same reason, it only reluctantly counts lines that contain nothing but strings. Internally, it collects the following counts (see summary.LanguageSummary). Each line adds to exactly one of these counts:

  • code_count: contains actual, meaningful code that takes some effort to understand, like variables, function calls, math operations, ...
  • string_count: contains only strings and typical characters to separate them, like a comma (,). Technically, such lines are code, but most of the time they are easy to comprehend because they are just text targeted at the end user. In practice that might not always be true, e.g. when cramming complex SQL statements as strings into Java code. But I decided this is rare enough to warrant erring on the lower side of complexity from time to time.
  • documentation_count: contains only comments
  • empty_count: contains only white space or language dependent "no operation" code like curly braces, pass, nop, ...

The --format=summary shows the code_count.

Because I figured that "my" code count might not be popular with everyone, internally there also is:

  • source_count = code_count + string_count: This is pretty close to how e.g. SLOCCount and cloc count code.

The --format=json includes the source_count.

For the record, there also is:

  • line_count = code_count + documentation_count + empty_count + string_count: This is essentially the number of lines the code would show in a text editor or wc -l but always counting the last line, even if it does not end with a new line / carriage return.
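The relationships between these counts can be summarized in a small Python sketch. The LineCounts class is hypothetical and not pygount's actual API; it only mirrors the formulas given above. The example numbers assume that the 382801 reported by --format=summary is the code_count and the 410575 reported by --format=json is the source_count of the same run, so that the string_count can be derived as their difference.

```python
from dataclasses import dataclass


@dataclass
class LineCounts:
    """Hypothetical container; each line adds to exactly one of these counts."""

    code_count: int
    string_count: int
    documentation_count: int
    empty_count: int

    @property
    def source_count(self) -> int:
        # Close to how SLOCCount and cloc count code.
        return self.code_count + self.string_count

    @property
    def line_count(self) -> int:
        # Every line of the file, whatever its kind.
        return (
            self.code_count
            + self.documentation_count
            + self.empty_count
            + self.string_count
        )


# Numbers from the Rails run quoted in this issue.
rails = LineCounts(
    code_count=382801,  # "Sum" of the "Code" column in --format=summary
    string_count=410575 - 382801,  # derived: totalSourceCount - code_count
    documentation_count=62533,
    empty_count=108928,
)

print(rails.string_count)  # 27774
print(rails.source_count)  # 410575
print(rails.line_count)  # 582036
```

Seen this way, the two formats do not contradict each other: they differ by the 27774 lines that contain nothing but strings.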

To come to a conclusion: Would these two changes resolve your issue?

  • [ ] The JSON output includes code_count.
  • [ ] The documentation section on How pygount counts code includes an explanation of the different counts, especially the difference between code_count and source_count.

roskakori avatar Jul 14 '23 17:07 roskakori

@SeanTConrad If I understand correctly from looking at https://github.com/StartupOS/verinfast/pull/8 you are striving for compatibility with cloc. In that case, the source_count number from the JSON is already the one you are looking for.

Regardless, pygount's inconsistency with --format=summary and the lack of documentation still need to be addressed.

roskakori avatar Jul 14 '23 18:07 roskakori

@roskakori Thank you.

SeanTConrad avatar Jul 17 '23 12:07 SeanTConrad

Sorry @roskakori. I just realized I didn't respond to your earlier message.

If it's the same cost, you could show 3 values for "loc" in both, as you described above. They can be called whatever, as long as it's documented as you said. If I understand you correctly, these are the outputs for "LOC":

  1. code_count - Count of meaningful lines of code
  2. source_count - Count of any lines of code, including single character lines. Similar to "CLOC"
  3. line_count - Count of all lines, akin to wc -l (nice to have for QA, but not necessary for our needs)

For our needs, we are comparing two or more repos for size, complexity, amount of work, etc. I would use "your" measure of code, with it being very important that we are consistent across repos.

Thank you!

tldr: I agree with you and adding code_count to the json would be great.

SeanTConrad avatar Jul 18 '23 18:07 SeanTConrad

@SeanTConrad The JSON format finally contains all available counts, and the documentation describes them.

I also added #152 for the counts to eventually be consistent across formats. This will be a breaking change and thus only be part of a future version 2.0.

roskakori avatar May 13 '24 18:05 roskakori