gridfsmigrate
Moving to filesystem breaks images due to extensions
Cool tool! I just ran it to move files out of GridFS to the local file system. Some images worked, some images didn't work... I spent about a half hour looking at the differences from the working images and the broken images in the rocketchat_uploads collection, as well as on disk.
After uploading a fresh new file, I realized rocket.chat stored the file without an extension in the uploads folder. I tried moving an image I know was broken from image.png to just image, and it started working!
Here's a one-liner to fix it (run it from inside the uploads folder):
find . -type f -name '*.*' | while IFS= read -r f; do mv -- "${f}" "${f%.*}"; done
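For anyone who would rather preview the renames first, here is a rough Python sketch of the same idea. The function name strip_extensions and the dry_run flag are made up for this example and are not part of gridfsmigrate. It only strips the last extension, like the shell version, and it skips a file if the target name already exists.

```python
import os

def strip_extensions(upload_dir, dry_run=True):
    """Rename files that have an extension to their bare name.

    Returns the (old_path, new_path) pairs that were, or would be, renamed.
    """
    renamed = []
    for root, _dirs, files in os.walk(upload_dir):
        for name in files:
            base, ext = os.path.splitext(name)
            if not ext:
                continue  # no extension (or a dotfile), nothing to strip
            src = os.path.join(root, name)
            dst = os.path.join(root, base)
            if os.path.exists(dst):
                continue  # never overwrite an existing file
            if not dry_run:
                os.rename(src, dst)
            renamed.append((src, dst))
    return renamed
```

Calling strip_extensions("/path/to/uploads") only reports what would change; pass dry_run=False to actually rename.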
Images with .png .jpg etc in the upload name seem to be broken now. Images that didn't have an extension in the name work....
Saw a difference between the broken uploads and the working ones:
The following fixed it:
db.rocketchat_uploads.update({}, {$set: {extension: ""}}, {multi: true});
WriteResult({ "nMatched" : 4675, "nUpserted" : 0, "nModified" : 1062 })
Is there an estimate when this will be fixed? I'm going to use this script but I will try to wait until this gets sorted. Thank you.
@jcatarino, I haven't had time to investigate or reproduce the issue yet, but it should be easy to fix: we could either remove the MIME detection from the script, or set "extension" to "" during the dbupdate. If it is urgent, you can write me a mail to discuss the matter.
Thanks, I will postpone this migration until I can test and change the script myself, as I'm prioritising other tasks at the moment. If I eventually do it, I will open a PR with it.
My migration went OK. I commented out these lines:
    else:
        fileext = mime.guess_extension(res.content_type)
I hacked this together really quick to fix some of the issues we were having after the migration... it may be a bit overkill, but it solved all the cases we were having....
# Needs: import os, import magic (python-magic), from mimetypes import MimeTypes
mimet = MimeTypes()
mime = magic.Magic(mime=True)
db = self.getdb()
uploadsCollection = db["rocketchat_uploads"]
uploads = uploadsCollection.find({}, no_cursor_timeout=True)
i = 0
for upload in uploads:
    fileext = ""
    filename = upload['_id']
    split = upload["name"].rsplit('.', 1)
    if len(split) >= 2 and ' ' not in split[1]:
        # The upload name already carries an extension
        print("Got split: %s" % split[1])
        fileext = split[1]
    else:
        if os.path.isfile("/app/uploads/" + upload["_id"]):
            # Detect the MIME type from the file contents
            fileext = mime.from_file("/app/uploads/" + upload["_id"])
        if "identify" in upload and upload["identify"]["format"] != "":
            fileext = upload["identify"]["format"]
        else:
            # guess_extension() returns e.g. ".png" (with the dot) or None
            fileext = mimet.guess_extension(fileext)
    if fileext is not None and fileext != "":
        # Drop the leading dot so the file name and the DB field agree
        fileext = fileext.lstrip(".").replace("jfif", "jpg")
        filename = filename + "." + fileext
        i += 1
        print("%i. Renaming %s to %s (%s)" % (i, upload['_id'], filename, upload['name']))
        uploadsCollection.update_one({"_id": upload["_id"]}, {"$set": {"extension": fileext}})
        if os.path.isfile("/app/uploads/" + upload["_id"]):
            os.rename("/app/uploads/" + upload["_id"], "/app/uploads/" + filename)
I have also experienced problems using this tool to migrate from GridFS to FileStore. Some of the images simply don't appear because the download URL doesn't work. I think I have worked out the issue.
The entries/rows in the rocketchat_uploads collection have a field extension. It turns out that the file stored on disk has to match this field, otherwise the files cannot be downloaded. I've forked the repo and created a feature branch where I have fixed it. When I run this, all my pictures and attachments are available after converting. Please see PR #17
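To illustrate the rule, here is a small sketch for spotting uploads whose file on disk does not match the database. It assumes the store looks for _id plus "." plus extension when the field is non-empty, and the bare _id otherwise; that is inferred from what I observed, not confirmed against the Rocket.Chat source. The helper names expected_filename and find_mismatches are made up for this example, and uploads is just a list of documents as returned from rocketchat_uploads.

```python
import os

def expected_filename(upload):
    """File name assumed to be looked up: _id plus the optional extension."""
    ext = upload.get("extension") or ""
    return upload["_id"] + "." + ext if ext else upload["_id"]

def find_mismatches(uploads, basedir):
    """Return the _id of every upload whose expected file is missing in basedir."""
    return [
        u["_id"]
        for u in uploads
        if not os.path.isfile(os.path.join(basedir, expected_filename(u)))
    ]
```

Running find_mismatches over the collection before and after a migration gives a quick list of the attachments that would fail to download under that assumption.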
Doing this https://github.com/arminfelder/gridfsmigrate/blob/master/migrate.py#L111 does not seem to work, because it might set a suffix other than what the db expects to find.
As a side note, it seems (but I haven't confirmed it) that newer Rocket.Chat versions no longer use any file suffix. They only use the _id field for the filename. My file store has a lot of files both with and without a suffix. To satisfy my desire for a clean structure, I wrote this extra function, which renames all files with a suffix to the new no-suffix filename scheme and blanks the extension field in the database. After running it on my site, it seems to work too.
# Put inside class Migrator
    def fixFilenames(self, collection, basedir):
        db = self.getdb()
        uploadsCollection = db[collection]
        uploads = uploadsCollection.find({}, no_cursor_timeout=True)
        i = 0
        for upload in uploads:
            fileext = upload.get('extension') or None
            fileid = upload['_id']
            if not upload['complete']:
                continue
            # Get the real filename by looking for the last path element in 'path',
            # if that file cannot be found, then we're using the ID.
            fnames = upload['path'].split('/')
            filename = fnames[-1]
            if not os.path.isfile(os.path.join(basedir, filename)):
                filename = (fileid + '.' + fileext) if fileext else fileid
            # Ensure the file is present
            fullfilename = os.path.join(basedir, filename)
            if not os.path.isfile(fullfilename):
                print(f"{filename}: No such file")
                #print(upload)
                continue
            # Does the file have a suffix?
            fsplit = filename.split('.')
            if len(fsplit) > 1 and fsplit[-1]:
                suffix = fsplit[-1]
            else:
                suffix = None
            if suffix != fileext:
                print(f"{filename}: Suffix mismatches database (expected {fileext})")
                #print(upload)
                continue
            # Skip files without suffix
            if not suffix:
                continue
            i += 1
            print(f"{i}. Renaming {filename} to {fileid}")
            # Rename file
            os.rename(fullfilename, os.path.join(basedir, fileid))
            # Update database
            uploadsCollection.update_one({"_id": upload["_id"]}, {"$set": {"extension": ''}})
# Add this to the bottom of the file
if args.command == "renamefiles":
    obj.fixFilenames("rocketchat_uploads", args.destination)