detect-file-type
.docx file is being detected as a zip file
It IS a zip file. To detect it properly, you need to parse the zip directory and look at the entry names. Parsing zip entries means reading the end of the file, scanning backwards until a certain "magic" signature is found, reading the offsets stored there, then jumping to the entry directory and parsing it. Because this requires random access to the file, I don't think it's really suitable for this library.
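To make the random-access requirement concrete, here is a minimal Python sketch of the steps described above. It is not code from this library: the function name `list_zip_entry_names` is made up for illustration, and the sketch assumes a plain archive (no ZIP64, no data prepended before the archive).

```python
import struct

EOCD_SIG = b"PK\x05\x06"   # "end of central directory" magic
CDH_SIG = b"PK\x01\x02"    # central directory file header magic
EOCD_SIZE = 22             # fixed part of the end-of-central-directory record
MAX_COMMENT = 65535        # the trailing archive comment can be this long


def list_zip_entry_names(path):
    """Return the entry names of a zip file, or None if no directory is found.

    Hypothetical helper for illustration; assumes no ZIP64 and no prepended data.
    """
    with open(path, "rb") as f:
        # Step 1: read the tail of the file, where the end record lives.
        f.seek(0, 2)
        size = f.tell()
        tail_len = min(size, EOCD_SIZE + MAX_COMMENT)
        f.seek(size - tail_len)
        tail = f.read(tail_len)

        # Step 2: scan backwards for the magic.
        pos = tail.rfind(EOCD_SIG)
        if pos < 0 or pos + EOCD_SIZE > len(tail):
            return None

        # Step 3: pull the entry count, directory size and offset out of it.
        fields = struct.unpack("<4sHHHHIIH", tail[pos:pos + EOCD_SIZE])
        total, cd_size, cd_offset = fields[4], fields[5], fields[6]

        # Step 4: jump to the central directory and read it.
        f.seek(cd_offset)
        cd = f.read(cd_size)

    # Step 5: walk the directory entries and collect the file names.
    names = []
    p = 0
    for _ in range(total):
        if cd[p:p + 4] != CDH_SIG:
            break
        name_len, extra_len, comment_len = struct.unpack("<HHH", cd[p + 28:p + 34])
        names.append(cd[p + 46:p + 46 + name_len].decode("utf-8", "replace"))
        p += 46 + name_len + extra_len + comment_len
    return names
```

Note the two seeks: one to the end of the file and one to wherever the directory happens to be. Neither can be served from a small buffer taken from the start of the file.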
It IS possible, though, to introduce a second, optional phase where you pass the full path to the file (or a file descriptor) and detect zip-based types. Other libraries do this by parsing only the beginning of the file, hoping that a "backup" entry name appears in the small header chunk that is read. This is unreliable, though, and had more misses than hits when I tested it.
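As a rough idea of what such a second phase could look like, here is a Python sketch built on the standard `zipfile` module. None of this is an existing API of this library: `refine_zip_type` and the prefix table are hypothetical, and the rule "has `[Content_Types].xml` plus a `word/`, `xl/` or `ppt/` folder" is a simplification of how Office Open XML packages are laid out.

```python
import zipfile

# Hypothetical mapping from a top-level folder to the refined type.
OOXML_PREFIXES = {"word/": "docx", "xl/": "xlsx", "ppt/": "pptx"}


def refine_zip_type(path):
    """Second-phase check: needs the full path because zip parsing seeks.

    Returns "docx", "xlsx", "pptx", plain "zip", or None if not a zip at all.
    """
    if not zipfile.is_zipfile(path):
        return None
    with zipfile.ZipFile(path) as zf:
        names = zf.namelist()  # read from the directory at the end of the file

    # Office Open XML packages carry this manifest at the archive root.
    if "[Content_Types].xml" in names:
        for prefix, ext in OOXML_PREFIXES.items():
            if any(name.startswith(prefix) for name in names):
                return ext
    return "zip"
```

The first phase would stay as it is (a buffer-only check that reports "zip"), and a caller who cares about the difference would then call something like `refine_zip_type("report.docx")` on the path.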