urldedupe
Pass in a list of URLs with query strings, get back a unique list of URLs and query string combinations
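The idea can be sketched in a few lines. This is only an illustration, not urldedupe's actual implementation: it assumes that two URLs count as duplicates when they share host, path, and the same set of query parameter names, with values ignored. The helpers `dedupe_key` and `dedupe` are hypothetical names.

```python
from urllib.parse import urlsplit, parse_qsl

def dedupe_key(url):
    """Hypothetical key: host, path, and the sorted query parameter names."""
    parts = urlsplit(url)
    # Assumption: only parameter names matter, not their values.
    names = sorted({name for name, _ in parse_qsl(parts.query, keep_blank_values=True)})
    return (parts.netloc.lower(), parts.path, tuple(names))

def dedupe(urls):
    """Keep the first URL seen for each key, preserving input order."""
    seen = set()
    unique = []
    for url in urls:
        key = dedupe_key(url)
        if key not in seen:
            seen.add(key)
            unique.append(url)
    return unique

urls = [
    "https://example.com/api?id=1",
    "https://example.com/api?id=2",         # same path and parameter names as above
    "https://example.com/api?id=1&page=3",  # new parameter combination
    "https://example.com/home",
]
for url in dedupe(urls):
    print(url)
```

Under that assumption, the second URL is dropped and the other three are kept.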
### jsfiles.txt

```
https://www.test.com/js/0-0c6e5e47ca6a3f3f7243.js
https://www.test.com/js/0-249c4f63764b90e95f29.js
https://www.test.com/js/0-356c7b1d95f2143f6cd2.js
https://www.test.com/js/0-5adfe0ed1f01b27b5f5f.js
https://www.test.com/js/0-6553d716c12f03bb710d.js
```

```
cat jsfiles.txt | urldedupe -s
```

Expected output should be:

https://www.test.com/js/0-0c6e5e47ca6a3f3f7243.js
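One way the requested behaviour could work is to treat long hexadecimal runs in the path as interchangeable. The sketch below is an assumption about a possible approach, not a description of what `-s` does today; the 16-character threshold and the names `HEX_RUN`, `similar_key`, and `dedupe_similar` are made up for illustration.

```python
import re
from urllib.parse import urlsplit

# Assumption: a run of 16 or more hex characters is a content hash.
HEX_RUN = re.compile(r"[0-9a-f]{16,}", re.IGNORECASE)

def similar_key(url):
    """Hypothetical key: host plus path with hash-like runs replaced by a placeholder."""
    parts = urlsplit(url)
    return (parts.netloc.lower(), HEX_RUN.sub("{hash}", parts.path))

def dedupe_similar(urls):
    """Keep the first URL for each similarity key, preserving input order."""
    seen = set()
    kept = []
    for url in urls:
        key = similar_key(url)
        if key not in seen:
            seen.add(key)
            kept.append(url)
    return kept

js_files = [
    "https://www.test.com/js/0-0c6e5e47ca6a3f3f7243.js",
    "https://www.test.com/js/0-249c4f63764b90e95f29.js",
    "https://www.test.com/js/0-356c7b1d95f2143f6cd2.js",
    "https://www.test.com/js/0-5adfe0ed1f01b27b5f5f.js",
    "https://www.test.com/js/0-6553d716c12f03bb710d.js",
]
print(dedupe_similar(js_files))
```

With the list from `jsfiles.txt`, every path reduces to `/js/0-{hash}.js`, so only the first URL is kept.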
hacker@localhost:~/tools/urldedupe$ make
Scanning dependencies of target urldedupe
[ 20%] Building CXX object CMakeFiles/urldedupe.dir/Url.cpp.o
/home/hacker/tools/urldedupe/Url.cpp:4:10: fatal error: filesystem: No such file or directory
 #include <filesystem>
          ^~~~~~~~~~~~
compilation terminated.
CMakeFiles/urldedupe.dir/build.make:120: recipe for target...
Instead of producing a `string` for each URL to use as a key, it is sensible to produce a hash. I pulled in the SpookyV2 hash library, which is reasonably...
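As a rough illustration of the idea, the sketch below stores a fixed-size 64-bit hash per key instead of the key string itself. It uses BLAKE2b from Python's standard library as a stand-in; it is not SpookyV2 and not the code from this change, and `key_hash` and `dedupe_by_hash` are hypothetical names.

```python
import hashlib

def key_hash(key: str) -> int:
    """Return a 64-bit hash of the key (BLAKE2b here, standing in for SpookyV2)."""
    digest = hashlib.blake2b(key.encode("utf-8"), digest_size=8).digest()
    return int.from_bytes(digest, "big")

def dedupe_by_hash(urls):
    """Yield each URL whose key hash has not been seen yet."""
    seen = set()  # holds fixed-size integers rather than full strings
    for url in urls:
        h = key_hash(url)
        if h not in seen:
            seen.add(h)
            yield url

print(list(dedupe_by_hash(["https://a.com/x", "https://a.com/y", "https://a.com/x"])))
```

The trade-off is that two different keys could in principle collide on the same 64-bit value, in which case one of them would be dropped as a false duplicate.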
Some use cases have been brought up for deduping large files (> 10 GB). This will result in a crash if the system does not have enough RAM...
Probably makes sense to discard ports when assessing for duplication, but account for something like: ``` https://site.com:443/home https://site.com/home ```
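One possible reading of this is to drop a port only when it is the default for the scheme, so that the two URLs above compare equal while a non-default port still counts as a different site. The sketch below follows that interpretation; `DEFAULT_PORTS` and `strip_default_port` are hypothetical names, and discarding every port would be a different, more aggressive choice.

```python
from urllib.parse import urlsplit, urlunsplit

# Assumption: only these two schemes need default-port handling.
DEFAULT_PORTS = {"http": 80, "https": 443}

def strip_default_port(url):
    """Return the URL with its port removed when it matches the scheme default."""
    parts = urlsplit(url)
    # Note: hostname is lowercased and drops any user:password@ prefix.
    host = parts.hostname or ""
    port = parts.port
    if port is not None and port != DEFAULT_PORTS.get(parts.scheme):
        host = f"{host}:{port}"  # keep non-default ports
    return urlunsplit((parts.scheme, host, parts.path, parts.query, parts.fragment))

print(strip_default_port("https://site.com:443/home"))
print(strip_default_port("https://site.com/home"))
print(strip_default_port("https://site.com:8443/home"))
```

The first two calls produce the same string, so any later key comparison treats them as one URL, while the `:8443` variant stays distinct.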