cloc Count File Size (MB)

Some commercial security scanning tools now charge by file volume in MB. I am wondering if that measurement could be taken and reported. I am thinking just raw file size - not any attempt to estimate actual code versus comments (unless it is super easy).

Maybe a further version could rewrite files without comments and get an estimate of true code MB - but I'd be happy with just a raw number for a "minimum viable feature" release :)

Jul 19 '20 12:07 DarwinJS

Sure, that measurement can be taken and recorded, but I don't think cloc needs to do that as a new feature. Instead, use cloc to collect the names of the source files, then run a trivial add-up-the-file-sizes script. For example:

Step 1: cloc --by-file --csv --out counts.csv directory
Step 2: count_bytes counts.csv

where count_bytes is something like

#!/usr/bin/env perl
use warnings;
use strict;
my $bytes = 0;
while (<>) {
    my $file = (split(','))[1];
    next unless $file;
    next if $file eq "filename";
    if (!-e $file) {
        print "can't read $file, skipping\n";
        next;
    }
    $bytes += -s "$file";
}
print "$bytes total bytes\n";

A drawback to this method is that it won't work on archive (.tar, .zip, etc) files; you'll need to expand these out first.
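
Since the original request was for a figure in MB, the byte total printed by the script above can be converted before reporting. This is a minimal sketch; `bytes_to_mb` is a hypothetical helper name, and it assumes 1 MB = 1024 * 1024 bytes, which may not match how a given vendor bills.

```perl
#!/usr/bin/env perl
use warnings;
use strict;

# Convert a byte count to megabytes.
# Assumption: 1 MB = 1024 * 1024 bytes. Use 1_000_000 instead if the
# tool you are pricing against counts decimal megabytes.
sub bytes_to_mb {
    my ($bytes) = @_;
    return $bytes / (1024 * 1024);
}

my $bytes = 5_242_880;    # example total, as produced by count_bytes
printf "%d total bytes (%.2f MB)\n", $bytes, bytes_to_mb($bytes);
# prints "5242880 total bytes (5.00 MB)"
```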

The solution can easily be adapted to count bytes in files after comments are removed:

Step 1: cloc --strip-comments No_Comments --original-dir --by-file --csv --out counts.csv directory
Step 2: count_bytes_no_comments counts.csv

where count_bytes_no_comments is

#!/usr/bin/env perl
use warnings;
use strict;
my $bytes = 0;
while (<>) {
    my $file = (split(','))[1];
    next unless $file;
    next if $file eq "filename";
    $file .= ".No_Comments";
    if (!-e $file) {
        print "can't read $file, skipping\n";
        next;
    }
    $bytes += -s "$file";
}
print "$bytes total bytes\n";

Jul 19 '20 18:07 AlDanial

There are several benefits to having it integrated:

It puts MB in the same report or data output format, so the data can be consumed in the same ways. With completely separate reports, a lot of folks would end up merging them so they can look at both code metrics at the same granularity: per repository and per language.
It handles all the files the same way your base code does (so archive files are handled automatically, as you are already doing).
It allows your code for aggregating reports to be used for MB as well.
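
Short of an integrated feature, the per-language breakdown mentioned above can be approximated from the same CSV, because `cloc --by-file --csv` writes the language in the first column and the file name in the second. The following is only a sketch of that workaround: `bytes_by_language` and its `$size_of` hook are hypothetical names, and it assumes file names contain no commas.

```perl
#!/usr/bin/env perl
use warnings;
use strict;

# Sum file sizes per language from the lines of a cloc --by-file --csv report.
# Assumptions: column order is language,filename,... and file names
# contain no commas. $size_of is a hypothetical hook so sizes can be
# stubbed out; by default it uses the -s file test.
sub bytes_by_language {
    my ($lines, $size_of) = @_;
    $size_of ||= sub { -s $_[0] };
    my %bytes;
    for my $line (@$lines) {
        my ($lang, $file) = split /,/, $line;
        next unless defined $file && length $file;   # blank and SUM rows
        next if $file eq "filename";                 # header row
        my $size = $size_of->($file);
        next unless defined $size;                   # file does not exist
        $bytes{$lang} += $size;
    }
    return \%bytes;
}

# Usage: count_bytes_by_language counts.csv
if (@ARGV) {
    my @lines  = <>;
    my $totals = bytes_by_language(\@lines);
    printf "%-20s %12d bytes\n", $_, $totals->{$_} for sort keys %$totals;
}
```

This still produces a separate report from cloc's own, so it does not remove the merging step described above; it only gets the sizes to the same per-language granularity.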

I was also thinking of an implementation detail that might make this super-efficient. If you are already creating storage (like a variable) that contains the code with comments stripped - maybe a size could be taken at that point and then just add an overhead value to create a "file size" estimate. Maybe PerFileOverheadBytes could be a built-in default variable and overrideable by users with a parameter - so they could tune it to their liking.
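
To make the proposal concrete, the estimate amounts to the length of the comment-stripped content plus a fixed per-file overhead. This sketch does not reflect how cloc works internally. `estimate_file_bytes` is a hypothetical name, the default overhead of 0 is a placeholder, and the input string is assumed to hold raw bytes (read without an encoding layer) so that `length` counts bytes rather than characters.

```perl
#!/usr/bin/env perl
use warnings;
use strict;

# Estimate a file's size from its comment-stripped content plus a fixed
# per-file overhead (the PerFileOverheadBytes idea described above).
# Assumptions: $stripped_code holds raw bytes, and the default overhead
# of 0 is a placeholder that users would tune.
sub estimate_file_bytes {
    my ($stripped_code, $per_file_overhead_bytes) = @_;
    $per_file_overhead_bytes //= 0;
    return length($stripped_code) + $per_file_overhead_bytes;
}

# Example: two stripped files, each charged a 512-byte overhead.
my @stripped = ("print 1;\n", "my \$x = 2;\nprint \$x;\n");
my $total = 0;
$total += estimate_file_bytes($_, 512) for @stripped;
print "$total estimated bytes\n";
```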

I was also thinking of building a CI plugin around this similar to these: https://gitlab.com/guided-explorations/ci-cd-plugin-extensions.

Jul 19 '20 19:07 DarwinJS
