Count File Size (MB)

Open DarwinJS opened this issue 5 years ago • 4 comments

Some commercial security scanning tools now charge by file volume in MB. I am wondering if that measurement could be taken and reported. I am thinking just raw file size - not any attempt to estimate actual code versus comments (unless it is super easy).

Maybe a further version could rewrite files without comments and get an estimate of true code MB - but I'd be happy with just a raw number for a "minimum viable feature" release :)

DarwinJS avatar Jul 19 '20 12:07 DarwinJS

Sure, that measurement can be taken and recorded, but I don't think cloc needs to do that as a new feature. Instead, use cloc to collect the names of the source files, then run a trivial add-up-the-file-sizes script. For example:

Step 1

    cloc --by-file --csv --out counts.csv directory

Step 2

    count_bytes counts.csv

where count_bytes is something like

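#!/usr/bin/env perl
# count_bytes: sum the on-disk sizes of the files listed in a
# "cloc --by-file --csv" report.
use warnings;
use strict;
my $bytes = 0;
while (<>) {                        # read the CSV named on the command line (or STDIN)
    my $file = (split(','))[1];     # the second column holds the file name
    next unless $file;              # skip rows without a file name
    next if $file eq "filename";    # skip the header row
    if (!-e $file) {
        print "can't read $file, skipping\n";
        next;
    }
    $bytes += -s "$file";           # -s gives the file size in bytes
}
print "$bytes total bytes\n";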

A drawback to this method is that it won't work on archive (.tar, .zip, etc) files; you'll need to expand these out first.

The solution can easily be adapted to count bytes in files after comments are removed.

Step 1

    cloc --strip-comments No_Comments --original-dir --by-file --csv --out counts.csv directory

Step 2

    count_bytes_no_comments counts.csv

where count_bytes_no_comments is

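#!/usr/bin/env perl
# count_bytes_no_comments: sum the sizes of the comment-stripped copies
# that "cloc --strip-comments No_Comments --original-dir" writes next to
# each original file.
use warnings;
use strict;
my $bytes = 0;
while (<>) {                        # read the CSV named on the command line (or STDIN)
    my $file = (split(','))[1];     # the second column holds the file name
    next unless $file;              # skip rows without a file name
    next if $file eq "filename";    # skip the header row
    $file .= ".No_Comments";        # name of the stripped copy
    if (!-e $file) {
        print "can't read $file, skipping\n";
        next;
    }
    $bytes += -s "$file";           # -s gives the file size in bytes
}
print "$bytes total bytes\n";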

AlDanial avatar Jul 19 '20 18:07 AlDanial

There are several benefits to having it integrated:

  • Integrates MBs on the same report or data output format - so that data can be consumed in the same ways. Having completely separate reports would mean a lot of folks would want to try to merge the reports so they can estimate both of these code metrics in the same way - per repository, per language.
  • Handles all the files in the same way your base code does (so automatically handling archived files like you are doing)
  • Allows your code for aggregating reports to be used for MBs as well
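
Until something like this is built in, the per-language view mentioned in the first bullet can be approximated outside cloc. The following is only a sketch, not a cloc feature: it assumes the `--by-file --csv` report puts the language in the first column and the file name in the second (the scripts earlier in this thread already rely on the second), and that file names contain no commas. The sub names `bytes_by_language` and `print_report`, and the `$size_of` hook, are hypothetical helpers introduced for illustration.

```perl
#!/usr/bin/env perl
use warnings;
use strict;

# Sum file sizes per language from "cloc --by-file --csv" rows.
# $size_of is a hypothetical hook returning a file's size in bytes;
# it defaults to Perl's -s file test on the real file.
sub bytes_by_language {
    my ($lines, $size_of) = @_;
    $size_of ||= sub { -s $_[0] };
    my %bytes;
    for my $line (@$lines) {
        my ($language, $file) = split /,/, $line;
        next unless defined $file && length $file;   # rows without a file name
        next if $file eq "filename";                 # header row
        my $size = $size_of->($file);
        next unless defined $size;                   # missing or unreadable file
        $bytes{$language} += $size;
    }
    return \%bytes;
}

# Print one line per language, largest first.
# MB is taken as 1,000,000 bytes here; that is an assumption.
sub print_report {
    my ($bytes) = @_;
    my @languages = sort { $bytes->{$b} <=> $bytes->{$a} || $a cmp $b } keys %$bytes;
    for my $language (@languages) {
        printf "%-12s %10.3f MB\n", $language, $bytes->{$language} / 1_000_000;
    }
}

# Usage: per_language_mb counts.csv
if (@ARGV) {
    open my $fh, '<', $ARGV[0] or die "can't open $ARGV[0]: $!\n";
    chomp(my @lines = <$fh>);
    close $fh;
    print_report(bytes_by_language(\@lines));
}
```

This still produces a separate report rather than extra columns in cloc's own output, so it does not address the merging or archive-handling points above.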

I was also thinking of an implementation detail that might make this super-efficient. If you are already creating storage (like a variable) that contains the code with comments stripped - maybe a size could be taken at that point and then just add an overhead value to create a "file size" estimate. Maybe PerFileOverheadBytes could be a built-in default variable and overrideable by users with a parameter - so they could tune it to their liking.
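
The arithmetic behind that idea can be sketched as follows. This is an illustration of the proposal only, not how cloc works internally: the sub name `estimated_bytes` is hypothetical, the default overhead of 0 is an arbitrary placeholder, and the strings are assumed to hold raw, undecoded bytes so that `length` counts bytes rather than characters.

```perl
#!/usr/bin/env perl
use warnings;
use strict;

# Estimate total "file size" as the byte length of each comment-stripped
# source plus a fixed per-file overhead (the PerFileOverheadBytes idea).
# Assumes each string holds raw bytes, so length() is a byte count.
sub estimated_bytes {
    my ($stripped_sources, $per_file_overhead_bytes) = @_;
    $per_file_overhead_bytes //= 0;    # placeholder default, tunable by the caller
    my $total = 0;
    for my $source (@$stripped_sources) {
        $total += length($source) + $per_file_overhead_bytes;
    }
    return $total;
}

# Example: two stripped sources with a 64-byte overhead per file.
my @stripped = ("print 1;\n", "int main(void) { return 0; }\n");
printf "%d bytes estimated\n", estimated_bytes(\@stripped, 64);
```

Taking the size from an in-memory buffer would avoid writing the stripped copies to disk, at the cost of the estimate depending on whatever overhead value the user picks.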

I was also thinking of building a CI plugin around this similar to these: https://gitlab.com/guided-explorations/ci-cd-plugin-extensions.

DarwinJS avatar Jul 19 '20 19:07 DarwinJS