gridfsmigrate

Moving to filesystem breaks images due to extensions

Open wreiske opened this issue 5 years ago • 7 comments

Cool tool! I just ran it to move files out of GridFS to the local file system. Some images worked, some didn't. I spent about half an hour comparing the working and the broken images, both in the rocketchat_uploads collection and on disk.

After uploading a fresh file, I realized Rocket.Chat stores the file without an extension in the uploads folder. I renamed an image I knew was broken from image.png to just image, and it started working!

Here's a one-liner to fix it (run it inside the uploads folder):

find . -type f -name '*.*' | while IFS= read -r f; do mv -- "${f}" "${f%.*}"; done
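
For anyone who would rather preview the renames first, the same idea can be sketched in Python. This is a minimal sketch, not part of gridfsmigrate: the function name `strip_extensions` and its `dry_run` flag are made up for illustration, and it assumes every file under the uploads folder should lose its last extension.

```python
import os


def strip_extensions(basedir, dry_run=True):
    """Plan (and optionally perform) renames that drop the last extension.

    Hypothetical helper: returns a list of (source, destination) pairs.
    """
    renames = []
    planned = set()
    for root, _dirs, files in os.walk(basedir):
        for name in files:
            stem, ext = os.path.splitext(name)
            if not ext:
                continue  # no extension (this also skips dotfiles such as ".keep")
            src = os.path.join(root, name)
            dst = os.path.join(root, stem)
            if os.path.exists(dst) or dst in planned:
                continue  # never overwrite an existing or already planned file
            planned.add(dst)
            renames.append((src, dst))
            if not dry_run:
                os.rename(src, dst)
    return renames
```

Calling `strip_extensions("/path/to/uploads")` only returns the planned renames; passing `dry_run=False` performs them. Unlike the shell one-liner, it skips a file when the extension-free name is already taken.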

wreiske avatar Apr 13 '19 18:04 wreiske

Images with .png, .jpg, etc. in the upload name seem to be broken now. Images that didn't have an extension in the name work.

I spotted a difference between the broken and the working ones: [image]

The following fixed it:

db.rocketchat_uploads.update({}, {$set: {extension: ""}}, {multi: true});
WriteResult({ "nMatched" : 4675, "nUpserted" : 0, "nModified" : 1062 })
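
The same update can be issued from Python, which may be convenient since the migration script is already Python. This is a sketch under assumptions: `blank_extensions` is a hypothetical helper name, and it expects a pymongo collection object such as the one the script obtains with `db["rocketchat_uploads"]`.

```python
def blank_extensions(uploads):
    """Set `extension` to "" on every document of the given uploads collection.

    `uploads` is assumed to be a pymongo Collection. Returns the pair
    (matched_count, modified_count) reported by the server.
    """
    result = uploads.update_many({}, {"$set": {"extension": ""}})
    return result.matched_count, result.modified_count
```

As with the shell command above, this blanks the field for all uploads, so it only makes sense once every file on disk has had its extension removed.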

wreiske avatar Apr 13 '19 18:04 wreiske

Is there an estimate when this will be fixed? I'm going to use this script but I will try to wait until this gets sorted. Thank you.

jcatarino avatar Apr 24 '19 12:04 jcatarino

@jcatarino, I haven't had time to investigate/reproduce the issue yet, but it should be easy to fix: we could either remove the MIME detection from the script or set "extension" to "" during the dbupdate. If it is urgent, you can email me to discuss the matter.

arminfelder avatar Apr 24 '19 12:04 arminfelder

Thanks. I will postpone this migration until I can test and change the script myself, as I'm prioritising other tasks at the moment. If I eventually do it, I will open a PR with it.

jcatarino avatar Apr 24 '19 13:04 jcatarino

My migration went OK. I commented out these lines:

else:
    fileext = mime.guess_extension(res.content_type)

jcatarino avatar May 05 '19 13:05 jcatarino

I hacked this together really quickly to fix some of the issues we were having after the migration. It may be a bit overkill, but it solved all the cases we were having.

# Fragment from inside a Migrator method (it uses self.getdb()).
# Requires: import os, import magic (python-magic), from mimetypes import MimeTypes
mimet = MimeTypes()
mime = magic.Magic(mime=True)  # create the MIME sniffer once, not per upload
db = self.getdb()
uploadsCollection = db["rocketchat_uploads"]

uploads = uploadsCollection.find({}, no_cursor_timeout=True)
i = 0
for upload in uploads:
    fileext = ""
    filename = upload['_id']
    path = os.path.join("/app/uploads", upload["_id"])
    split = upload["name"].rsplit('.', 1)
    if len(split) >= 2 and ' ' not in split[1]:
        # The upload name already carries an extension: reuse it.
        print("Got split: %s" % split[1])
        fileext = split[1]
    elif (upload.get("identify") or {}).get("format"):
        # Otherwise prefer the image format stored with the upload.
        fileext = upload["identify"]["format"]
    elif os.path.isfile(path):
        # Last resort: sniff the MIME type and map it to an extension.
        fileext = mimet.guess_extension(mime.from_file(path)) or ""

    # guess_extension() returns values like ".png"; keep the bare extension
    # so the file name and the database field agree.
    fileext = fileext.lstrip(".")
    if fileext == "jfif":
        fileext = "jpg"
    if fileext:
        filename = filename + "." + fileext

    i += 1
    print("%i. Renaming %s to %s (%s)" % (i, upload['_id'], filename, upload['name']))
    uploadsCollection.update_one({"_id": upload["_id"]}, {"$set": {"extension": fileext}})
    if os.path.isfile(path):
        os.rename(path, os.path.join("/app/uploads", filename))

wreiske avatar May 06 '19 00:05 wreiske

I have also experienced problems using this tool to migrate from GridFS to FileStore. Some of the images simply don't appear because the download URL doesn't work. I think I have worked out the issue.

The documents in the rocketchat_uploads collection have an extension field. It turns out that the name of the file stored on disk has to match this field; otherwise the file cannot be downloaded. I've forked the repository and created a feature branch where I have fixed it. When I run this, all my pictures and attachments are available after converting. Please see PR #17.
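
To check a store for this kind of mismatch before changing anything, something like the following could be used. It is only a sketch: the helper names `expected_filename` and `find_mismatches` are hypothetical, and it assumes the expected file name is the `_id`, followed by a dot and the `extension` field when that field is non-empty.

```python
import os


def expected_filename(upload):
    """File name assumed to be looked up on disk for an upload document."""
    ext = upload.get("extension") or ""
    return f"{upload['_id']}.{ext}" if ext else upload["_id"]


def find_mismatches(uploads, basedir):
    """Return (_id, expected_name) for uploads whose expected file is missing."""
    missing = []
    for upload in uploads:
        name = expected_filename(upload)
        if not os.path.isfile(os.path.join(basedir, name)):
            missing.append((upload["_id"], name))
    return missing
```

`uploads` can be any iterable of documents, for example the cursor returned by `find({})` on the rocketchat_uploads collection; the function only reads, so it is safe to run against a live store.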

Doing this https://github.com/arminfelder/gridfsmigrate/blob/master/migrate.py#L111 does not seem to work, because it might set a suffix other than what the db expects to find.

As a side note, it seems (but I haven't confirmed it) that newer Rocket.Chat versions do not use any file suffix any more and only use the _id field for the filename. My file store has a lot of files with and without a suffix. To satisfy my desire for a clean structure, I wrote this extra function, which renames all files with a suffix to the new no-suffix filename scheme and blanks the extension field in the database. After running it on my site, it seems to work too.

    # Put inside class Migrator
    def fixFilenames(self, collection, basedir):

        db = self.getdb()
        uploadsCollection = db[collection]

        uploads = uploadsCollection.find({}, no_cursor_timeout=True)
        i = 0
        for upload in uploads:

            fileext = upload.get('extension') or None
            fileid = upload['_id']

            if not upload['complete']:
                continue

            # Get the real filename by looking for the last path element in 'path',
            # if that file cannot be found, then we're using the ID.
            fnames = upload['path'].split('/')
            filename = fnames[-1]
            if not os.path.isfile(os.path.join(basedir, filename)):
                filename = fileid + '.' + fileext if fileext else fileid

            # Ensure the file is present
            fullfilename = os.path.join(basedir, filename)
            if not os.path.isfile(fullfilename):
                print(f"{filename}: No such file")
                #print(upload)
                continue

            # Does the file have a suffix?
            fsplit = filename.split('.')
            if len(fsplit) > 1 and fsplit[-1]:
                suffix = fsplit[-1]
            else:
                suffix = None

            if suffix != fileext:
                print(f"{filename}: Suffix mismatches database (expected {fileext})")
                #print(upload)
                continue

            # Skip files without suffix
            if not suffix:
                continue

            i += 1
            print(f"{i}. Renaming {filename} to {fileid}")

            # Rename file
            os.rename(fullfilename, os.path.join(basedir, fileid))

            # Update database
            uploadsCollection.update_one({"_id": upload["_id"]}, { "$set": { "extension": '' } })

    # Add this to the bottom of the file
    if args.command == "renamefiles":
        obj.fixFilenames("rocketchat_uploads", args.destination)

sveinse avatar Mar 08 '21 00:03 sveinse