pkglite

Guess filetype for files without extensions

Open nanxstats opened this issue 4 years ago • 4 comments

When evaluating file specifications to create file collections, we should follow this:

  • If a file has a known extension, mark it as text or binary based on the dictionary (implemented)
  • Include files that do not have a file extension, and files with extensions not covered by the dictionary
    • Guess whether each such file is (canonically) text; if it is not, mark it as binary
      • I'd prefer zlib's algorithm: https://github.com/madler/zlib/blob/master/doc/txtvsbin.txt
      • If a file does not have any content, then mark it as binary
  • Document this flow in the specification section
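For reference, a minimal sketch of the text-vs-binary heuristic described in zlib's txtvsbin.txt, written in Python purely for illustration (this is not pkglite code, and the function name `guess_is_text` is hypothetical). A file counts as text only if it contains at least one allow-listed byte and no block-listed byte, so an empty file comes out as binary, which matches the bullet above.

```python
# Byte classes as described in zlib's doc/txtvsbin.txt.
# Allow list: TAB, LF, CR, and every byte from 32 to 255.
ALLOW = frozenset({9, 10, 13}) | frozenset(range(32, 256))
# Gray list: tolerated control bytes (BEL, BS, VT, FF, SUB, ESC).
# They neither prove nor disprove that the data is text.
GRAY = frozenset({7, 8, 11, 12, 26, 27})
# Block list: the remaining control bytes below 32.
BLOCK = (frozenset(range(0, 7)) | frozenset(range(14, 32))) - GRAY


def guess_is_text(data: bytes) -> bool:
    """Return True if `data` looks like text, False if it looks binary.

    Hypothetical helper: text means at least one allow-listed byte
    and no block-listed byte. Empty input is classified as binary.
    """
    has_allowed = False
    for byte in data:
        if byte in BLOCK:
            return False
        if byte in ALLOW:
            has_allowed = True
    return has_allowed
```

In practice one would probably read only a leading chunk of each file rather than the whole content; the chunk size is a design choice this sketch leaves open.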

nanxstats avatar Aug 09 '21 04:08 nanxstats

From Yilong: alternatively, simply classify files with unknown extensions as binary.

nanxstats avatar Nov 16 '21 23:11 nanxstats

The goal is to separate the file capture rules from the file type tagging rules and make both more universal, instead of limiting both flows to known file extensions.

Action items:

  • For file capturing: Make some file specifications (e.g., file_inst()) independent of file extensions by removing their file name pattern constraint, so that they capture arbitrary files.
  • For file type tagging: Revise the tagging strategy to use the file extension dictionary and mark everything else as binary.
  • Add file specification functions for more directories observed here: demo/, exec/, po/, build/.
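The revised tagging rule can be sketched as a dictionary lookup with a binary fallback. This is an illustrative Python sketch, not pkglite's implementation: the `EXT_TYPES` entries and the `tag_file_type` name are assumptions made up for the example, and the real dictionary lives in the package.

```python
import os

# Hypothetical, abbreviated extension dictionary (assumption for illustration).
EXT_TYPES = {
    "r": "text",
    "rd": "text",
    "md": "text",
    "png": "binary",
    "rds": "binary",
}


def tag_file_type(path: str) -> str:
    """Tag a captured file as "text" or "binary".

    Known extensions use the dictionary; files without an extension or
    with an unknown extension fall back to "binary".
    """
    ext = os.path.splitext(path)[1].lstrip(".").lower()
    return EXT_TYPES.get(ext, "binary")
```

With this split, capturing decides which files enter the collection, and tagging never rejects a file: anything the dictionary does not cover is still included, just as binary.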

nanxstats avatar Mar 21 '22 21:03 nanxstats

Shall we close the issue?

elong0527 avatar Sep 04 '22 16:09 elong0527

Not yet. This hasn't been shipped.

nanxstats avatar Sep 04 '22 20:09 nanxstats