cloc
Count File Size (MB)
Some commercial security scanning tools now charge by file volume in MB. I am wondering if that measurement could be taken and reported. I am thinking just raw file size - not any attempt to estimate actual code versus comments (unless it is super easy).
Maybe a further version could rewrite files without comments and get an estimate of true code MB - but I'd be happy with just a raw number for a "minimum viable feature" release :)
Sure, that measurement can be taken and recorded--but I don't think cloc needs to do that as a new feature. Instead, use cloc to collect the names of the source files, then run a trivial add-up-the-file-sizes script. For example:
Step 1: cloc --by-file --csv --out counts.csv directory
Step 2: count_bytes counts.csv
where count_bytes is something like
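#!/usr/bin/env perl
# Sum the on-disk sizes of the files listed in cloc's --by-file CSV output.
use warnings;
use strict;
my $bytes = 0;
while (<>) {
    my $file = (split(','))[1];     # column 1 holds the file name
    next unless $file;              # skip rows with no file name
    next if $file eq "filename";    # skip the header row
    if (!-e $file) {
        print "can't read $file, skipping\n";
        next;
    }
    $bytes += -s "$file";
}
print "$bytes total bytes\n";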
A drawback to this method is that it won't work on archive files (.tar, .zip, etc.); you'll need to expand these first.
The solution can easily be adapted to count bytes in files after comments are removed.
Step 1: cloc --strip-comments No_Comments --original-dir --by-file --csv --out counts.csv directory
Step 2: count_bytes_no_comments counts.csv
where count_bytes_no_comments is
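#!/usr/bin/env perl
# Sum the sizes of the comment-stripped copies written by --strip-comments.
use warnings;
use strict;
my $bytes = 0;
while (<>) {
    my $file = (split(','))[1];     # column 1 holds the file name
    next unless $file;              # skip rows with no file name
    next if $file eq "filename";    # skip the header row
    $file .= ".No_Comments";        # stripped copy sits next to the original
    if (!-e $file) {
        print "can't read $file, skipping\n";
        next;
    }
    $bytes += -s "$file";
}
print "$bytes total bytes\n";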
There are several benefits to having it integrated:
- Puts the MB figures in the same report or data output format, so the data can be consumed in the same ways. With completely separate reports, a lot of folks would end up merging them to get both code metrics in the same way: per repository, per language.
- Handles all the files in the same way your base code does (so automatically handling archived files like you are doing)
- Allows your code for aggregating reports to be used for MBs as well
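Until something like this is built in, the per-language part of the request can be approximated outside cloc. The following is only a sketch that extends the count_bytes script above. It assumes the CSV layout that script relies on (language in column 0, file name in column 1), file names without commas, and 1 MB = 1,000,000 bytes. The sub name bytes_by_language is made up for illustration and is not part of cloc.

```perl
#!/usr/bin/env perl
# Sketch: per-language size totals from `cloc --by-file --csv` output.
# Assumes column 0 is the language, column 1 is the file name, and that
# file names contain no commas (same assumption as count_bytes above).
use warnings;
use strict;

# Hypothetical helper: sums on-disk sizes per language.
# Takes an open filehandle on the CSV; returns a hash ref language => bytes.
sub bytes_by_language {
    my ($fh) = @_;
    my %bytes;
    while (my $line = <$fh>) {
        chomp $line;
        my ($language, $file) = split /,/, $line;
        next unless defined $file && length $file;   # rows with no file name
        next if $file eq "filename";                 # header row
        next unless -e $file;                        # file moved or deleted
        $bytes{$language} += (-s $file) || 0;        # empty files count as 0
    }
    return \%bytes;
}

# Runs only when a CSV path is given on the command line.
if (!caller && @ARGV) {
    my $path = shift @ARGV;
    open my $fh, '<', $path or die "can't open $path: $!\n";
    my $totals = bytes_by_language($fh);
    for my $language (sort keys %$totals) {
        printf "%-20s %10.3f MB\n", $language, $totals->{$language} / 1_000_000;
    }
}
```

Saved under a name such as count_bytes_by_language (again hypothetical), it would be run the same way as Step 2 above: count_bytes_by_language counts.csv. If a vendor bills in MiB, divide by 1_048_576 instead.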
I was also thinking of an implementation detail that might make this very efficient. If you are already creating storage (like a variable) that contains the code with comments stripped, maybe a size could be taken at that point and an overhead value added to produce a "file size" estimate. Maybe PerFileOverheadBytes could be a built-in default variable that users can override with a parameter, so they could tune it to their liking.
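To make the proposed arithmetic concrete, here is a rough sketch of the estimate: the byte length of the comment-stripped lines plus a fixed per-file overhead. It does not reflect how cloc stores stripped code internally. The sub name, the default of 0, and the assumption that lines are raw byte strings (read without an encoding layer) are all illustrative. PerFileOverheadBytes is the name suggested above, not an existing cloc option.

```perl
use warnings;
use strict;

# Hypothetical default for the proposed PerFileOverheadBytes setting.
my $PER_FILE_OVERHEAD_BYTES = 0;

# Estimate a file's size from its comment-stripped lines.
# $stripped_lines: array ref of raw byte strings, newlines included.
# $overhead:       optional override of the per-file overhead.
sub estimated_file_bytes {
    my ($stripped_lines, $overhead) = @_;
    $overhead = $PER_FILE_OVERHEAD_BYTES unless defined $overhead;
    my $bytes = 0;
    $bytes += length $_ for @$stripped_lines;
    return $bytes + $overhead;
}

# Two stripped lines of 7 and 10 bytes, default and tuned overhead.
my @lines = ("int x;\n", "return x;\n");
print estimated_file_bytes(\@lines), "\n";        # 17
print estimated_file_bytes(\@lines, 64), "\n";    # 81
```

Whether a flat overhead is a good stand-in for removed comments and blank lines would need checking against real repositories; the raw sizes from the scripts earlier in the thread give a baseline to compare against.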
I was also thinking of building a CI plugin around this similar to these: https://gitlab.com/guided-explorations/ci-cd-plugin-extensions.