
Strip comments

Open DJTB opened this issue 8 years ago • 2 comments

Hey hey, I love what you've done here!

It seems a bit ridiculous, though, that “the” is in the top 10 (edit: for JavaScript at least) when the occurrences are all(?) from comments. It would be great to see a dataset that doesn't include comments.

DJTB, Jan 21 '17 14:01

Thank you!

I think having the ability to parse actual code and categorize each line is a very powerful idea. However, that would be too expensive and time-consuming for me to do. I guess one way to do it would be to translate each file into a language-specific abstract syntax tree using user-defined functions in BigQuery, and then emit categorized lines.

Or maybe there is an easier way?

anvaka, Jan 21 '17 22:01

I'm not sure about other languages, but for web-related tech you could run everything through something like https://github.com/vitaly-t/decomment first.

DJTB, Jan 22 '17 02:01
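
For illustration only, here is a rough sketch of the "strip comments first, then count words" idea discussed above. It is not how the dataset is actually built, and it does not use decomment or BigQuery. The helper names `strip_c_style_comments` and `count_words` are hypothetical, and the scanner only assumes C-style `//` and `/* */` comments plus simple quoted strings; regex literals and template-literal interpolation are not handled, which is where a real parser or a tool like decomment would be needed.

```python
import re
from collections import Counter


def strip_c_style_comments(source: str) -> str:
    """Remove // and /* */ comments, leaving string literals untouched.

    Simplifying assumption: regex literals and template-literal
    interpolation are NOT handled; this is a sketch, not a parser.
    """
    out = []
    i, n = 0, len(source)
    while i < n:
        ch = source[i]
        if ch in "\"'`":
            # Copy the whole string literal, skipping over backslash escapes.
            j = i + 1
            while j < n and source[j] != ch:
                j += 2 if source[j] == "\\" else 1
            out.append(source[i:j + 1])
            i = j + 1
        elif source.startswith("//", i):
            # Line comment: skip up to (but not including) the newline.
            while i < n and source[i] != "\n":
                i += 1
        elif source.startswith("/*", i):
            # Block comment: skip past the closing */ (or to end of input).
            end = source.find("*/", i + 2)
            i = n if end == -1 else end + 2
        else:
            out.append(ch)
            i += 1
    return "".join(out)


def count_words(source: str) -> Counter:
    """Count identifier-like words in the comment-free source."""
    stripped = strip_c_style_comments(source)
    return Counter(re.findall(r"[A-Za-z_$][A-Za-z0-9_$]*", stripped))


# Made-up sample input: "the" appears only inside comments.
sample = '''
// the loop below sums the items
var total = 0; /* the accumulator */
var url = "http://example.com/api"; // the endpoint
'''
counts = count_words(sample)
```

With this sample, "the" no longer shows up in `counts`, while the `//` inside the URL string is left alone because the scanner copies string literals through before it looks for comment markers. Words inside string literals are still counted, so a stricter dataset would need to decide whether to drop those too.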