ripme
Power Users Can Use Redis to Filter Already Downloaded URLs, Use HashSet for URL Matching From File
Category
This change is exactly one of the following (please change `[ ]` to `[x]` to indicate which):
- [ ] a bug fix (Fix #...)
- [ ] a new Ripper
- [ ] a refactoring
- [ ] a style change/fix
- [x] a new feature
Description
This feature adds support for using Redis as the mechanism for skipping already-downloaded URLs. If you use RipMe over a long period of time to download many, many galleries and albums, the url_history.txt file grows quite large, and doing an O(n) scan through the entire list for every URL in a job becomes very expensive. My own url_history.txt file is approaching 3 million lines and 130 MB. Using Redis speeds up the ripping process considerably AND gives power users the ability to coordinate jobs running across multiple machines on a network.
Users can optionally add the following lines to the rip.properties file:

```properties
# IP address or domain name of the Redis host
url_history.redis_cache.host = 192.168.0.123
# Redis port (optional; RipMe defaults to 6379)
url_history.redis_cache.port = 6379
# A prefix for the keys added to Redis (optional; defaults to an empty string)
url_history.redis_cache.key_prefix = RipMeURL:
```
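Settings like these can be read with the standard `java.util.Properties` API. The sketch below is illustrative only (the class name and field handling are assumptions, not RipMe's actual code); the key names match the ones described above, and the absence of the `host` key is what triggers the HashSet fallback:

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.Properties;

// Hypothetical holder for the optional Redis settings; key names follow
// the PR description, the class itself is a sketch, not RipMe's code.
public class RedisCacheConfig {
    final String host;      // null when the user did not configure Redis
    final int port;         // defaults to 6379, as described above
    final String keyPrefix; // defaults to an empty string

    RedisCacheConfig(Properties p) {
        host = p.getProperty("url_history.redis_cache.host");
        port = Integer.parseInt(p.getProperty("url_history.redis_cache.port", "6379"));
        keyPrefix = p.getProperty("url_history.redis_cache.key_prefix", "");
    }

    // If no host is configured, fall back to the in-memory HashSet path.
    boolean redisEnabled() {
        return host != null;
    }

    public static void main(String[] args) throws IOException {
        Properties p = new Properties();
        p.load(new StringReader("url_history.redis_cache.host = 192.168.0.123\n"));
        RedisCacheConfig cfg = new RedisCacheConfig(p);
        System.out.println(cfg.host + ":" + cfg.port + " enabled=" + cfg.redisEnabled());
    }
}
```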
If users do not add this configuration, the URL-matching algorithm now uses a HashSet. This is memory-intensive, but still much faster than the sequential scan.
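A minimal sketch of that fallback path (names are illustrative, not RipMe's): load url_history.txt into a HashSet once at startup, so each "already downloaded?" check is an O(1) lookup instead of an O(n) scan over the whole file:

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Illustrative sketch of the HashSet fallback; class and method names
// are hypothetical, not RipMe's actual identifiers.
public class UrlHistorySketch {
    private final Set<String> seen = new HashSet<>();

    UrlHistorySketch(List<String> historyLines) {
        // One pass over the history at startup; memory grows with history size,
        // but every subsequent lookup is O(1) on average.
        seen.addAll(historyLines);
    }

    boolean alreadyDownloaded(String url) {
        return seen.contains(url);
    }

    public static void main(String[] args) {
        UrlHistorySketch history = new UrlHistorySketch(
            Arrays.asList("https://example.com/a.jpg", "https://example.com/b.jpg"));
        System.out.println(history.alreadyDownloaded("https://example.com/a.jpg")); // true
        System.out.println(history.alreadyDownloaded("https://example.com/c.jpg")); // false
    }
}
```

With Redis configured, the same membership check would instead be served by the Redis server (e.g. via set commands such as SISMEMBER/SADD), which is what allows several machines to share one history.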
Note: RipMe will continue to append new lines to the url_history.txt file, since this operation does not seem to slow down the job (at least at the scales I have encountered).
Note 2: The easiest way to run Redis locally is to use Docker (something like `docker run --name my-redis -d -p 6379:6379 redis`). Alternatively, you can download and install Redis for your OS.
Testing
Required verification:
- [ ] I've verified that there are no regressions in `mvn test` (there are no new failures or errors).
- [x] I've verified that this change works as intended.
- [ ] Downloads all relevant content.
- [ ] Downloads content from multiple pages (as necessary or appropriate).
- [ ] Saves content at reasonable file names (e.g. page titles or content IDs) to help easily browse downloaded content.
- [ ] I've verified that this change did not break existing functionality (especially in the Ripper I modified).
Optional but recommended:
- [ ] I've added a unit test to cover my change.
What a cool pull request. Not that I'd ever need it, but the principle is a great showcase :) I tried to merge it here: https://github.com/ripmeapp2/ripme , but then wondered how I could tell within a couple of seconds, now and in the future, whether it works. Would you mind adding a tiny unit test, maybe along the lines of: https://www.baeldung.com/spring-embedded-redis
@soloturn I've added some tests for this. Please let me know if you want me to change anything, especially regarding style. I'm both new to this codebase and the Java world in general!
It seems to work; the only "downside" is that the project seems to be abandoned.
Thank you @SelfOnTheShelf! Three tiny things, if you could adjust them please:
- Use the latest versions for your dependencies.
- Add the dependencies to the build.gradle.kts file as well; ripme2 no longer has a Maven build, only Gradle.
- Reorder the HashSet import alphabetically so it merges into ripme2 without conflict.