python-magic

Error iterating files in directories

Open ccmn98 opened this issue 2 years ago • 13 comments

When I iterate over files to get their MIME type, the magic library provides the MIME type for some of the files and then throws an error:

  File "C:\Users\Chad\AppData\Local\Programs\Python\Python310\lib\site-packages\magic\magic.py", line 89, in from_file
    return maybe_decode(magic_file(self.cookie, filename))
  File "C:\Users\Chad\AppData\Local\Programs\Python\Python310\lib\site-packages\magic\magic.py", line 255, in magic_file
    return _magic_file(cookie, coerce_filename(filename))
  File "C:\Users\Chad\AppData\Local\Programs\Python\Python310\lib\site-packages\magic\magic.py", line 196, in errorcheck_null
    raise MagicException(err)
magic.magic.MagicException: b"line I64u: regex error 14 for `^[[:space:]]*class[[:space:]]+[[:digit:][:alpha:]:_]+[[:space:]]*\\{(.*[\n]*)*\\}(;)?$', (failed to get memory)"
PS C:\Users\Chad\Documents\GitHub\Python_practice_scripts>

ccmn98 avatar Mar 21 '23 20:03 ccmn98

Based on the error message it looks like a memory allocation error. Does this happen consistently for one file, or only after running for a while?

ahupp avatar Mar 22 '23 05:03 ahupp

The error occurs after I iterate over files in a directory for a few seconds (approx. 10-15 seconds), and it appears to happen at the same file. I've also run my script over the same directory on two different computers, one with 16GB of RAM and the other with 32GB of RAM.

If it's a RAM issue, is there a way to clear already scanned files from RAM so that not all the data is kept in memory?

ccmn98 avatar Mar 23 '23 11:03 ccmn98

Are you creating a new instance of magic.Magic() for each file, or creating one and re-using it?

ahupp avatar Mar 23 '23 12:03 ahupp

I'm using it in a for loop: I create one instance and then reuse it, passing each file path in via a variable.

ccmn98 avatar Mar 23 '23 15:03 ccmn98
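
The "create one instance and reuse it" pattern discussed above can be sketched as follows. This is a minimal illustration, not code from the thread: `make_mime_detector` and `scan_tree` are hypothetical helper names, and the detector is passed in as a plain callable so the loop can be exercised without libmagic. Failures are collected per file instead of aborting the walk, which also makes it easier to identify the input that triggers the error.

```python
import os
from typing import Callable, Dict, List, Tuple


def make_mime_detector() -> Callable[[str], str]:
    """Create ONE magic.Magic instance and return its bound from_file method."""
    import magic  # python-magic; imported lazily so scan_tree works without it

    return magic.Magic(mime=True).from_file


def scan_tree(
    root: str, detect: Callable[[str], str]
) -> Tuple[Dict[str, str], List[Tuple[str, str]]]:
    """Walk root and return ({path: mime}, [(path, error message), ...])."""
    results: Dict[str, str] = {}
    failures: List[Tuple[str, str]] = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            try:
                results[path] = detect(path)
            except Exception as exc:  # magic.MagicException derives from Exception
                # Record the failing file and keep scanning the rest of the tree.
                failures.append((path, str(exc)))
    return results, failures
```

Calling `scan_tree(ROOT, make_mime_detector())` would reuse the same instance for every file; the `failures` list then names the files that raised.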

Can you share the file that seems to trigger this? Which version of libmagic are you using?

ahupp avatar Mar 23 '23 18:03 ahupp

Here is the snippet of code that is causing the issue. I'm using version 0.4.14 with Python 3.9.

import csv
import mimetypes
import os
import os.path as osp
from datetime import datetime

import magic

# ROOT, file (the output CSV path), filehash_sha_256() and filehash_md5()
# are assumed to be defined elsewhere in the script; they are not shown here.

def file_exists():
    for root, dirs, files in os.walk(ROOT):

        for fpath in [osp.join(root, f) for f in files]:

            # File size, hashes, and creation/modification timestamps
            size = osp.getsize(fpath)
            sha_256 = filehash_sha_256(fpath)
            md5 = filehash_md5(fpath)
            CRDate = osp.getctime(fpath)
            C_Date = datetime.fromtimestamp(CRDate).strftime('%m-%d-%Y')
            C_Time = datetime.fromtimestamp(CRDate).strftime('%H:%M:%S.%f')
            MDate = osp.getmtime(fpath)
            M_Date = datetime.fromtimestamp(MDate).strftime('%m-%d-%Y')
            M_Time = datetime.fromtimestamp(MDate).strftime('%H:%M:%S.%f')

            path = osp.realpath(fpath)
            name = osp.basename(fpath)
            # mime = magic.from_buffer(open(fpath, "rb").read(2048))
            mime = magic.from_file(fpath)  # this call raises the MagicException
            mime_guess_type = mimetypes.guess_type(fpath, strict=True)

            # Write the header only when the CSV does not exist yet. The original
            # tested `if not file_exists:`, which names this function and is never true.
            write_header = not osp.isfile(file)
            with open(file, "a", newline="") as header_file:
                header = ["File_Name", "File Creation Date", "File Creation Time", "File Modified Date","File Modified Time", "Byte size", "Path", "MIME", "MIME_Guess", "SHA_256", "MD5"]
                writer = csv.DictWriter(header_file, fieldnames=header)

                if write_header:
                    writer.writeheader()
                writer.writerow(
                    {
                        "Byte size": size,
                        "MIME": mime,
                        "MIME_Guess": mime_guess_type,
                        "SHA_256": sha_256,
                        "MD5": md5,
                        "File Creation Date": C_Date,
                        "File Creation Time": C_Time,
                        "File Modified Date": M_Date,
                        "File Modified Time": M_Time,
                        "Path": path,
                        "File_Name": name,
                    }
                )

                print(fpath)

ccmn98 avatar Mar 24 '23 11:03 ccmn98

> appears to happen at the same file.

Are you able to share the input file that triggers it? What version of libmagic are you using?

ahupp avatar Mar 24 '23 13:03 ahupp
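
The version question above distinguishes the Python wrapper from the libmagic C library it loads. The 0.4.14 reported earlier matches the `python-magic-bin` wheel version seen in the later repro, which is a package version rather than a libmagic version. The sketch below shows one way to collect both; it assumes that `magic.version()` is available in the installed python-magic and that it returns libmagic's integer version (for example 539 for 5.39). `format_libmagic_version` and `describe_versions` are hypothetical helper names.

```python
from importlib import metadata


def format_libmagic_version(number: int) -> str:
    """Turn libmagic's integer version (e.g. 539) into 'major.minor' (e.g. '5.39')."""
    return f"{number // 100}.{number % 100:02d}"


def describe_versions() -> dict:
    """Collect wrapper and libmagic versions; a value is None when unavailable."""
    info = {"python-magic": None, "python-magic-bin": None, "libmagic": None}
    for dist in ("python-magic", "python-magic-bin"):
        try:
            info[dist] = metadata.version(dist)
        except metadata.PackageNotFoundError:
            pass  # that distribution is not installed
    try:
        import magic  # needs python-magic and a loadable libmagic

        # Assumption: magic.version() exists and wraps libmagic's magic_version().
        info["libmagic"] = format_libmagic_version(magic.version())
    except (ImportError, AttributeError, NotImplementedError, OSError):
        pass  # wrapper missing, libmagic not loadable, or version not exposed
    return info
```

Printing `describe_versions()` in both PowerShell and WSL would show whether the two environments load different libmagic builds.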

It appears that I did not have python lib-magic installed... I tried to install it, but the install is a bit problematic. Is there a trick to it?

ccmn98 avatar Mar 29 '23 00:03 ccmn98

If you were running into this error, it looks like you do have libmagic installed; that's what produces the error.

ahupp avatar Mar 30 '23 18:03 ahupp

I am also facing the same issue, and my implementation is similar to that of @ccmn98. @ccmn98, did you find a resolution for this issue?

stuxnet999 avatar Jul 09 '23 09:07 stuxnet999

Also, this issue comes up only when running the script via PowerShell/Cmd. I ran the same code in WSL, where it works fine and does not throw the error.

stuxnet999 avatar Jul 09 '23 09:07 stuxnet999

Same error as ahupp/python-magic#276 which was merged into ahupp/python-magic#293.

These input files trigger the issue:

  • https://github.com/ahupp/python-magic/files/9231524/memblock.txt (problematic file attached there)
  • https://github.com/ggerganov/whisper.cpp/blob/3998465/bindings/java/src/test/java/io/github/ggerganov/whispercpp/WhisperCppTest.java (the one I ran across)
  • https://github.com/twbs/bootstrap/blob/v5.2.2/js/src/util/config.js (added later)

Repro in Windows 10 Pro Sandbox:

  1. run powershell -executionpolicy remotesigned
  2. use scoop to install python #v3.11.5
    • iex "& {$(irm get.scoop.sh)} -RunAsAdmin"; scoop install --no-update-scoop git python
  3. use pip to install dependencies
    • pip install python-magic #v0.4.27
    • pip install python-magic-bin #v0.4.14
  4. get the files
    • git clone --depth 1 https://github.com/ggerganov/whisper.cpp.git
    • curl.exe -LO https://github.com/ahupp/python-magic/files/9231524/memblock.txt
  5. run python to repro the issue
    • import magic
    • magic.from_file("whisper.cpp/bindings/java/src/test/java/io/github/ggerganov/whispercpp/WhisperCppTest.java", mime=True)
    • magic.from_file("memblock.txt", mime=True)
error details
PS C:\windows\System32> pip install python-magic
Collecting python-magic
  Using cached python_magic-0.4.27-py2.py3-none-any.whl (13 kB)
Installing collected packages: python-magic
Successfully installed python-magic-0.4.27
PS C:\windows\System32> pip install python-magic-bin
Collecting python-magic-bin
  Using cached python_magic_bin-0.4.14-py2.py3-none-win_amd64.whl (409 kB)
Installing collected packages: python-magic-bin
Successfully installed python-magic-bin-0.4.14
PS C:\windows\System32> python
Python 3.11.5 (tags/v3.11.5:cce6ba9, Aug 24 2023, 14:38:34) [MSC v.1936 64 bit (AMD64)] on win32
Type "help", "copyright", "credits" or "license" for more information.
>>> import magic
>>> magic.from_file("whisper.cpp/bindings/java/src/test/java/io/github/ggerganov/whispercpp/WhisperCppTest.java", mime=True)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "C:\Users\WDAGUtilityAccount\scoop\apps\python\current\Lib\site-packages\magic\magic.py", line 135, in from_file
    return m.from_file(filename)
           ^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\WDAGUtilityAccount\scoop\apps\python\current\Lib\site-packages\magic\magic.py", line 91, in from_file
    return self._handle509Bug(e)
           ^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\WDAGUtilityAccount\scoop\apps\python\current\Lib\site-packages\magic\magic.py", line 100, in _handle509Bug
    raise e
  File "C:\Users\WDAGUtilityAccount\scoop\apps\python\current\Lib\site-packages\magic\magic.py", line 89, in from_file
    return maybe_decode(magic_file(self.cookie, filename))
                        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\WDAGUtilityAccount\scoop\apps\python\current\Lib\site-packages\magic\magic.py", line 255, in magic_file
    return _magic_file(cookie, coerce_filename(filename))
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\WDAGUtilityAccount\scoop\apps\python\current\Lib\site-packages\magic\magic.py", line 196, in errorcheck_null
    raise MagicException(err)
magic.magic.MagicException: b"line I64u: regex error 14 for `^[[:space:]]*class[[:space:]]+[[:digit:][:alpha:]:_]+[[:space:]]*\\{(.*[\n]*)*\\}(;)?$', (failed to get memory)"
>>>


See also:

  • https://github.com/trailofbits/polyfile - a pure Python re-implementation of libmagic with a truckload of dependencies, which seems to also fail to process this input.
  • https://github.com/microsoft/vcpkg/issues/11832 - vcpkg may be able to build libmagic for Windows

jspraul avatar Sep 05 '23 02:09 jspraul

Thanks for the repro; this is definitely due to the older version of libmagic shipped with python-magic-bin. Just another case where the binaries situation causes trouble.

ahupp avatar Sep 28 '23 17:09 ahupp
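
Until a newer libmagic is available on Windows, a scan can at least degrade gracefully on the affected files. The sketch below is a stopgap that the thread does not propose: it falls back to the extension-based `mimetypes.guess_type` (which the original script already calls) when content-based detection raises. `detect_with_fallback` is a hypothetical helper, and `detect` stands for any callable such as `magic.Magic(mime=True).from_file`. An extension guess is weaker than content sniffing, so the source of each answer is returned alongside it.

```python
import mimetypes
from typing import Callable, Optional, Tuple


def detect_with_fallback(
    path: str, detect: Callable[[str], str]
) -> Tuple[Optional[str], str]:
    """Return (mime, source); source is 'content', 'extension' or 'unknown'."""
    try:
        return detect(path), "content"
    except Exception:  # e.g. magic.MagicException raised by libmagic
        guessed, _encoding = mimetypes.guess_type(path, strict=True)
        if guessed is not None:
            return guessed, "extension"
        return None, "unknown"
```

Recording the source makes it possible to revisit the extension-based rows later, once content detection works for those files.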