ripme
When ripping reddit user: ripme is removing trailing "_" from user folder name
Example:
Cmd line: java -jar ripme.jar -u "https://www.reddit.com/user/XXXXXXXXXXXX_"
Save Folder: \rips\reddit_user_XXXXXXXXXXXX
If someone would like specific examples please let me know. I don't want to post someone's account on GH.
Weird, after taking a quick look at RedditRipper.java I don't see what could be causing it:
public String getGID(URL url) throws MalformedURLException {
    // User pages: group 2 captures the username, and '_' and '-' are allowed in it
    Pattern p = Pattern.compile("^https?://[a-zA-Z0-9\\.]{0,4}reddit\\.com/(user|u)/([a-zA-Z0-9_\\-]{3,}).*$");
    Matcher m = p.matcher(url.toExternalForm());
    if (m.matches()) {
        // groupCount() is 2 here, so this returns "user_" + the captured username
        return "user_" + m.group(m.groupCount());
    }
    // ... (snippet truncated)
The regex seems to allow names with '_' in them, but I don't understand it well enough to say for sure (and the code lacks comments).
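For what it's worth, here's a quick standalone check of that pattern (the username is made up); it suggests the regex itself keeps the trailing underscore, so the stripping has to be happening later, when the folder name is built:

import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Quick check of the pattern quoted above against a made-up username.
public class RedditGidCheck {
    public static void main(String[] args) {
        Pattern p = Pattern.compile("^https?://[a-zA-Z0-9\\.]{0,4}reddit\\.com/(user|u)/([a-zA-Z0-9_\\-]{3,}).*$");
        Matcher m = p.matcher("https://www.reddit.com/user/example_user_");
        if (m.matches()) {
            // Prints "user_example_user_" -- the trailing '_' survives the regex,
            // so it is not the regex that drops it.
            System.out.println("user_" + m.group(m.groupCount()));
        }
    }
}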
So I just did some more testing, and it turns out that if there is a leading "_" in the username, ripme omits that from the user folder name as well... I have tried this on both Lubuntu 16.04 and Windows 10 and see the same thing.
I think it has to do with how folder names are normalized, which includes replacing non-ASCII characters with '_' and collapsing runs of the resulting '_' into a single '_'. I'm not sure exactly where that code lives, but I'll take a look. The actual name of the output directory doesn't affect whether the rip completes successfully, but since the exact spelling of a user's name carries meaning, it would be good not to throw that information away. Unfortunately, this will mean removing the compaction of multiple '_' in a row, which is helpful when a lot of "invalid" characters get replaced by '_'.
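To illustrate what I mean (this is a guess at the behavior, not the actual ripme code -- I haven't located it yet):

// Hypothetical sketch of the kind of sanitizer that would produce the reported
// behavior: whitelist "safe" characters, replace everything else with '_',
// collapse runs of '_', and trim '_' from the ends.
public class SanitizeGuess {
    static String sanitize(String name) {
        String s = name.replaceAll("[^a-zA-Z0-9_\\-]", "_"); // illustrative whitelist
        s = s.replaceAll("_{2,}", "_");                      // collapse runs of '_'
        return s.replaceAll("^_+|_+$", "");                  // trimming would explain the lost leading/trailing '_'
    }

    public static void main(String[] args) {
        // Prints "reddit_user_XXXXXXXXXXXX" -- the trailing '_' is gone,
        // matching what was reported above.
        System.out.println("reddit_user_" + sanitize("XXXXXXXXXXXX_"));
    }
}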
I think using fewer underscores would be appropriate when a single character would otherwise turn into multiple underscores, which would be the case with languages like Japanese or Chinese.
Otherwise I wouldn't reduce the underscores unless that becomes a problem.
@rautamiekka I agree.
I'll point out that this behavior has not been changed in a long time, and since it's not a regression and there are many more fundamentally broken things, this is a bit low on the priority list. Sorry for any inconvenience!
Completely understood. Thank you!
@rautamiekka - some further thoughts: whenever you rip a site with a lot of non-ASCII characters, you end up with folders containing a ton of underscores, which is not exactly easy to read. Compressing them makes sense; however, those underscores get conflated with actual underscores from URLs and usernames. I think what we should really be doing is translating the unknown characters to some placeholder other than '_' (perhaps NUL, 0x00, since Java strings allow it and it would almost certainly never appear in a page title; even if it did, it would get converted along with everything else non-printable), compressing each run of placeholders down to one (which preserves actual underscores), and only then rendering every placeholder as '_'. That way, "\0\0\0_\0\0\0" -> "\0_\0" -> "___" (three '_'s).
The goal is to preserve real underscores (even at the beginning and end) and compress everything else to a single placeholder per run. I'm also not sure I like trimming away the placeholders at the beginning and end.
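Roughly like this (a sketch only -- class and method names are made up):

// Proposed scheme: unknown characters -> NUL placeholder, collapse runs of the
// placeholder, then render the placeholder as '_'. Real underscores are never
// touched, so they survive, including at the beginning and end of the name.
public class SanitizeProposed {
    static String sanitize(String name) {
        String s = name.replaceAll("[^a-zA-Z0-9_\\-]", "\0"); // 1. disallowed -> NUL (illustrative whitelist)
        s = s.replaceAll("\0+", "\0");                        // 2. collapse each run of NULs to one
        return s.replace('\0', '_');                          // 3. render NULs as '_'
    }

    public static void main(String[] args) {
        System.out.println(sanitize("XXXXXXXXXXXX_")); // "XXXXXXXXXXXX_" -- trailing '_' preserved
        System.out.println(sanitize("abc!!!_???def")); // "abc___def" -- the real '_' survives the collapse
    }
}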
It's important to note that a change like this would be a breaking change for re-rips, because a re-rip decides whether to download a file again based on whether a file already exists at the path it would be saved to. I'm not sure how many people strongly rely on that. In my case, since there's no way to rename files and then re-rip without getting duplicates, I already have to rely on a de-duplication script at times. However, on the occasions where I avoid renaming files so that re-rips keep working, a change to the format would break things.
I think changes to the output path formatting scheme should come along with features like smarter re-rips, e.g. inferring that something has already been ripped from the existence of a file with the image ID in its filename. That pack of functionality would be well suited to a minor version bump to 1.5 once the work is complete.
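For reference, a rough sketch of what the ID-based check could look like (hypothetical names, not current ripme code, which today only checks whether the exact save path already exists):

import java.io.File;

// Treat a file as already ripped if any existing filename in the working
// directory contains the image ID, instead of requiring an exact path match.
public class AlreadyRippedCheck {
    static boolean alreadyRipped(File workingDir, String imageId) {
        File[] existing = workingDir.listFiles();
        if (existing == null) {
            return false; // directory missing or unreadable
        }
        for (File f : existing) {
            if (f.getName().contains(imageId)) {
                return true; // a renamed or re-prefixed file still counts
            }
        }
        return false;
    }
}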
^ Yes, null internally would work well.
I'm relying on RipMe and the OS to do the deduping by ripping the full gallery link so that everything lands in a single folder. Depending on how the gallery/favs are organized, the rip could otherwise be double, triple, or even larger in size, because the same item appears in multiple folders and RipMe happily downloads every copy.
With symlinks this organization wouldn't be such a problem, but file symlinks on Window$ require Window$ >= VI$TA plus admin rights, which doesn't leave many options, and symlinking eats into how many objects you can have on the partition. Imagine someone with >=10k uploads, where description files are saved and symlinks are created at least once per upload and maybe once per description file...
Not a pretty picture.
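For reference, linking a file back to a single canonical copy with java.nio (instead of downloading it again) would look roughly like this; the paths are just examples, and on Window$ the call needs the elevated rights mentioned above:

import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

// Example only: put a symlink in the per-gallery folder that points at one
// canonical copy of the file, rather than storing a duplicate.
public class LinkInsteadOfCopy {
    public static void main(String[] args) throws IOException {
        Path canonical = Paths.get("rips/all/example_image.jpg");  // example path
        Path link = Paths.get("rips/gallery_A/example_image.jpg"); // example path
        Files.createSymbolicLink(link, canonical);
    }
}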