matrix-archive icon indicating copy to clipboard operation
matrix-archive copied to clipboard

sanitize arbitrary room display names before creating directories from them

Open HarHarLinks opened this issue 4 years ago • 5 comments

I ran into this problem with a room name containing a /, but there may be more cases worth preventing. Stuff like simple .. is prevented by parts of the string being hardcoded.

There may be more and i'm still testing this, so WIP for now.

Feel free to modify the style of replaced characters to your liking. I like having a _ where another character would have been.

HarHarLinks avatar Feb 06 '21 23:02 HarHarLinks

This feels like a dangerous change to make because you are losing information in the process. I think it would be better to use something along the lines of base32 or base64 encoding (with some characters replaced in the case of base64).

CicadaCinema avatar Feb 07 '21 01:02 CicadaCinema

this is true, and I had considered this, but I went for this route because it seemed more important being able to browse the downloaded archive human readably. we could just write the unchanged room display name into an additional file?

HarHarLinks avatar Feb 07 '21 13:02 HarHarLinks

Whatever approach is taken, I would argue that there should be absolutely no confusion between the true name and a sanitised form of it. So if you're going to write to a new file for the room name, you shouldn't have an alternative version of the same name anywhere in your archive. It should be clear to the user where the true information lies.

CicadaCinema avatar Feb 07 '21 17:02 CicadaCinema

more notes/points to discuss:

  • Windows/NTFS(?) does not support *, ? and : (probably more, should look up a list) in file/directory names. => keeping the room ID perfectly intact is not an option either.

  • emoji should get their alias in textform (using some lookup, there are probably libs for that) added on to enable better searching

  • the export should be restructured (unless the current flat structure makes more sense for some reason?): ./[chats/]room.display_name_roomid:homeserver/{messages/,avatars/,messages.json,metadata.txt}

HarHarLinks avatar Feb 08 '21 13:02 HarHarLinks

@HarHarLinks Thanks for the contribution but I'm in agreement with @ColonisationCaptain that this would cause unwanted information loss. Rather than having to deal with filesystem peculiarities, I think a better approach is to use a content addressable storage approach. So the filename will be a hash of the file contents and the original filename will be stored in some metadata.

I don't see any advantage either of having separate directories anymore for avatars and other room media so I will likely combine them into a single directory.

russelldavies avatar Feb 10 '21 13:02 russelldavies