DiscordChatExporter
DiscordChatExporter copied to clipboard
`--reuse-media` creates an unacceptably high chance of a collision
Version
v2.42.8
Flavor
GUI (Graphical User Interface), CLI (Command-Line Interface)
Platform
Linux
Export format
HTML, TXT, JSON, CSV
Steps to reproduce
Export servers where large numbers of videos, images and especially screenshots are often posted. Observe that files named image.png or unknown.png are extremely common due to copy-paste of images directly into Discord, often leaving no file name available.
Do math to confirm the birthday problem is at play and will be causing you problems.
Details
The algorithm for making file names for downloaded media only stores 20 bytes of the hash of the URL. This is insufficient.
Despite there being 1048576 possible hashes, it only takes 1200 images for there to be a 50% chance of a collision and 500 images before there is a 10% chance of a collision due to the birthday problem. The original reasoning in the issue where this code was added (#395) is incorrect due to the commonness of unknown.png, image.png, and other filenames.
A list of all filenames I've seen that could reach that 10% chance of a collision in just my own exports of servers I'm in:
508 hqdefault.jpg
541 image0.png
681 sddefault.jpg
1269 image0.jpg
1696 tenor.png
5798 maxresdefault.jpg
6637 image.png
11314 unknown.png
As can clearly be seen, file names are not evenly distributed. Notably, maxresdefault, sddefault and hqdefault are not being posted intentionally by users at all, and are instead the result of youtube thumbnails. Similarly, tenor happens because the tenor gif button is pressed in Discord's own UI.
There is a near 100% chance that there are collisions here between unknown.png, image.png, and maxresdefault.jpg. You can check with this calculator: https://www.bdayprob.com/. Solve P(D,N) with D = 1048576 and N = 6637.
10 characters provides a better margin of safety: 15000 images with the same name before there is even a 0.01% chance, which would have successfully handled my use case.
10 characters of base-32 rather than base-16 would provide 600000 images before there is the same chance of collision. 8 characters of base-32 would provide the same margin as the 10 characters of base-16.
Checklist
- [X] I have looked through existing issues to make sure that this bug has not been reported before
- [X] I have provided a descriptive title for this issue
- [X] I have made sure that this bug is reproducible on the latest version of the application
- [X] I have provided all the information needed to reproduce this bug as efficiently as possible
- [ ] I have sponsored this project
- [ ] I have not read any of the above and just checked all the boxes to submit the issue
An alternative suggestion to lengthening the hash would be to just put the attachment ID in the filename. I feel like at a certain point, a bunch of random base-32 characters doesn't really provide much utility over 19 digits (which would be guaranteed to never collide).
Pasted from https://github.com/Tyrrrz/DiscordChatExporter/pull/1232
I have a single channel with 8774 conflicts. This command counts how many times each filename occurred:
❯ cat 98609319519453184.json | jq -r '.messages[].attachments[].url' | sort | uniq -c | sort -nr > discord-98609319519453184-dupes.txtThis command removes the lines with only 1 instance of a filename then sums up all of the file counts for files with conflicts:
❯ cat discord-98609319519453184-dupes.txt | awk '{print $1}' | grep -vE '^1$' | paste -sd+ - | bc 17453Then count the number of filenames that occur more than once:
❯ cat discord-98609319519453184-dupes.txt | awk '{print $1}' | grep -vE '^1$' | wc -l 867917453-8679 = 8774
I manually checked some of the messages that had duplicate filenames in discord and they were indeed unique images in discord but only one image was saved locally. I've checked a different channel too and the same thing happened. I can't remember if it's the default or not, but I store each channels images in its own folder
❯ ls -1U | head -10 670866322619498511-media 670866322619498511.json 704107851421057114-media 704107851421057114.json 704226060178292846-media 704226060178292846.json 862221868621758502-media 862221868621758502.json 862732083050971146-media 862732083050971146.jsonIt would be really great if you could implement an option to make each name unique. People that want to stick with the old method don't need to do anything if it's done this way.
Here's a set of images that end up with the same file names: unknown-F682B.png:
https://cdn.discordapp.com/attachments/98609319519453184/760954915362832414/unknown.png?ex=66aa987d&is=66a946fd&hm=3c3c51658e86c5f906cb4e22c78258836c4e1f3144086c5a7f1a261ced04c8e5&
https://cdn.discordapp.com/attachments/98609319519453184/1001287339462627408/unknown.png?ex=66aa3040&is=66a8dec0&hm=dfdf80af98d7fdb373cdb3591e1d8e106312200f1f93d2548f56fd7e185d3226&
unknown-A4DB4.png:
https://cdn.discordapp.com/attachments/98609319519453184/352621773524697090/unknown.png?ex=66aa385e&is=66a8e6de&hm=2eb1fd389f5522e18ec85a6a52e32fe3c2cfb364ac4daa1d4979405109c8a547&
https://cdn.discordapp.com/attachments/98609319519453184/826940484042620938/unknown.png?ex=66aab557&is=66a963d7&hm=0d7a7c51108596c84207a29b9d5b0b5f19503b7a43d92d83306c37da58825a5a&
unknown-F1191.png:
https://cdn.discordapp.com/attachments/98609319519453184/342358582458712064/unknown.png?ex=66aa74c6&is=66a92346&hm=dc2d387a9da0597b3a2668491b2b4a42032aafde351268c883d498ac562cfbaf&
https://cdn.discordapp.com/attachments/98609319519453184/532705112649629707/unknown.png?ex=66aa228c&is=66a8d10c&hm=a2e4e9cb280a8978aad52611b1f59276748815d6163e78ddf88c4cc8ae463fcd&
https://cdn.discordapp.com/attachments/98609319519453184/1024319357066682440/unknown.png?ex=66aa433d&is=66a8f1bd&hm=c51fc893d150ef8946d843c416c217816beed1cea4c7e30a7d92ceca86e7b53d&
unknown-7DBD3.png:
https://cdn.discordapp.com/attachments/98609319519453184/187127175315456002/unknown.png?ex=66aaa503&is=66a95383&hm=0dca9934bad285ecba5e5499b73399093056dffdf3ac43db3e3799c18f12250d&
https://cdn.discordapp.com/attachments/98609319519453184/418214803568328713/unknown.png?ex=66aa3923&is=66a8e7a3&hm=51ea7853501a6263624fe5830bf329b9aa9e72f5fb3ad584788e249233250552&
hmm that's concerning, considering I use this on bigger discord as well where collisions would probably be likely