ripme
Have Volafile Ripping implemented, question about dependencies...
Alright, so for the past two days I've basically been teaching myself Java, specifically to implement Volafile.io ripping in Ripme.
The problem with enabling Volafile ripping is that the site uses JavaScript to dynamically populate the data in the file column. This means that Ripme's built-in HTTP grabber doesn't work for Volafile, because it can't see the JavaScript-generated content.
So, I implemented Selenium and PhantomJS to open the page, wait until the JavaScript has loaded the content, then output the page source to a variable and parse it as normal.
I now have it up and working, grabbing all the links and downloading them successfully. Here's the rub: Because this method requires PhantomJS, all of the users will require PhantomJS to be on their computers in the 'standard' location (Usually on the C drive, in Program Files).
I'm ready to put this up on my fork and submit a pull request, but how can I package PhantomJS with the project so that people don't have to follow a tutorial to get PhantomJS 'installed' and available to be used? Is there a way to have Java download and unpack the PhantomJS Executable in a location relative to the ripme.jar?
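For what it's worth, unpacking an archive into a folder next to the jar only needs the standard library. Below is a minimal sketch, not a tested RipMe patch: the class name `BundledToolUnpacker` and both method names are made up for illustration, and it assumes PhantomJS is obtained as a zip archive from some download URL that is not shown here.

```java
import java.io.IOException;
import java.io.InputStream;
import java.net.URISyntaxException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.nio.file.StandardCopyOption;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;

// Hypothetical helper, not part of RipMe.
class BundledToolUnpacker {

    /** Directory containing the running jar (or the classes directory when run unpackaged). */
    static Path jarDirectory() throws URISyntaxException {
        Path location = Paths.get(BundledToolUnpacker.class
                .getProtectionDomain().getCodeSource().getLocation().toURI());
        return Files.isDirectory(location) ? location : location.getParent();
    }

    /** Extracts every entry of a zip stream into targetDir, rejecting entries that escape it. */
    static void unzip(InputStream zipStream, Path targetDir) throws IOException {
        Path root = targetDir.toAbsolutePath().normalize();
        try (ZipInputStream zin = new ZipInputStream(zipStream)) {
            ZipEntry entry;
            while ((entry = zin.getNextEntry()) != null) {
                Path out = root.resolve(entry.getName()).normalize();
                if (!out.startsWith(root)) {
                    throw new IOException("Zip entry escapes target dir: " + entry.getName());
                }
                if (entry.isDirectory()) {
                    Files.createDirectories(out);
                } else {
                    Files.createDirectories(out.getParent());
                    // Files.copy reads up to the end of the current entry and leaves the stream open.
                    Files.copy(zin, out, StandardCopyOption.REPLACE_EXISTING);
                }
            }
        }
    }
}
```

Something like `unzip(new URL(downloadUrl).openStream(), jarDirectory().resolve("phantomjs"))` would then place the files beside ripme.jar, where `downloadUrl` stands in for whatever archive location ends up being used. On Linux and macOS the extracted binary would presumably still need its executable bit set.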
A much better option would be to integrate JavaScript into RipMe.
Perhaps so. Right now Ripme uses Jsoup for page parsing, but Jsoup does not support JavaScript. Selenium's 'HtmlUnitDriver' does support JavaScript, but it was throwing errors and I couldn't figure them out. I'm doing the best I can with what I've got...
Maybe they could be fixed together.
Is there any way to get the list of files directly via the JSON (or other) endpoints that the page calls to populate itself dynamically?
I agree that adding Selenium and PhantomJS to ripme is overkill. Volafile uses WebSockets for its data exchange. WS will be a challenge in itself to implement, but this approach would be lightweight in comparison.
A connection to the WS address below needs to be established:
wss://volafile.io/api/?rn=xxxxxxxxxx&EIO=3&transport=websocket&t=xxxxxxx
There are four query parameters in the exchange to establish a connection to the WS:
- rn - As far as I can tell, this is just a random ten-character string matching the regex /[A-Za-z0-9]/. Probably to prevent caching or bots.
- EIO - Static value.
- transport - Static value.
- t - This is definitely increasing sequentially, I'm just not sure how it's being calculated. I assume it's based on milliseconds. I'm not convinced this value is actually important either, though. I think this can be any random value as long as it's unique. I'd be curious to see if just returning the time in milliseconds would be enough.
  - Lfyns51
  - LfynsgD
  - LfyntBZ
  - LfynthI
  - Lfynu6D
  - LfynuTF
  - LfynuTF
  - LfynvFD
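As a rough illustration, the four query parameters could be assembled like this. It is only a sketch under the untested guess made above, namely that a plain millisecond timestamp is accepted for `t`; the class and method names are invented for the example, and the static values are copied from the address shown earlier.

```java
import java.security.SecureRandom;

// Hypothetical helper, not part of RipMe.
class VolafileSocketUrl {
    private static final String ALPHANUMERIC =
            "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789";
    private static final SecureRandom RANDOM = new SecureRandom();

    /** Random string of the given length drawn from [A-Za-z0-9]. */
    static String randomToken(int length) {
        StringBuilder sb = new StringBuilder(length);
        for (int i = 0; i < length; i++) {
            sb.append(ALPHANUMERIC.charAt(RANDOM.nextInt(ALPHANUMERIC.length())));
        }
        return sb.toString();
    }

    /** Builds the WS address; using a raw millisecond value for t is an unverified assumption. */
    static String build(long timestampMillis) {
        return "wss://volafile.io/api/"
                + "?rn=" + randomToken(10)
                + "&EIO=3"
                + "&transport=websocket"
                + "&t=" + timestampMillis;
    }
}
```

Calling `VolafileSocketUrl.build(System.currentTimeMillis())` gives a candidate address to try with whichever WebSocket client gets picked. If the server rejects it, the format of `t` is the first thing to revisit, since the observed values are short alphanumeric strings rather than decimal numbers.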
Once the WS session is established, we need to make a request over the WS, which will require the following (again, as far as I can tell from a pretty quick look):
- Room ID - Parsed from the URI, no problem.
- Checksum #1 - This requires Checksum #2 first. After that's parsed, a request can be made to https://volafile.io/static/js/main.js?c=<CHECKSUM #2>. In the JS file there is a {config.checksum="xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx"} value that will need to be parsed as well.
- Checksum #2 - This can easily be parsed from the HTML on the initial page load - See [A]
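Pulling Checksum #1 out of main.js could be as simple as a regex over the downloaded script. The sketch below assumes the assignment really appears as `config.checksum="..."` with a double-quoted value, as described above; the class and method names are made up for illustration, and fetching the script is left to whatever HTTP code is already in place.

```java
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of RipMe.
class MainJsChecksum {
    // Matches config.checksum="..." and captures the quoted value.
    private static final Pattern CHECKSUM =
            Pattern.compile("config\\.checksum\\s*=\\s*\"([^\"]+)\"");

    /** URL of main.js for a given Checksum #2 value. */
    static String mainJsUrl(String checksum2) {
        return "https://volafile.io/static/js/main.js?c=" + checksum2;
    }

    /** Returns Checksum #1 if the assignment is found in the script source. */
    static Optional<String> extract(String js) {
        Matcher m = CHECKSUM.matcher(js);
        return m.find() ? Optional.of(m.group(1)) : Optional.empty();
    }
}
```

Returning an `Optional` keeps the "script layout changed" case explicit, so the ripper can fail with a clear message instead of a null pointer.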
[A]
window.config={
file_max_size: 21474840000,
max_room_name_length: 27,
chat_max_alias_length: 12,
chat_max_message_length: 300,
chat_max_history: 300,
file_time_to_live: 172800,
session_lifetime: 604800,
download_cookie_lifetime: 259200,
round_up_threshold: 0.2,
max_concurrent_uploads: 1,
ui_tooltip_show_delay: 200,
ui_enable_gallery: true,
disabled: false,
private: true,
name: "xxxxxxxx",
owner: "xxxxxxxx",
motd: "xxxxxxxx",
adult: false,
checksum2: "xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx",
room_id: "xxxxxx",
title_append: " - Volafile.io ",
domain: "volafile.io",
cdn_domain: "volafile.net"
}
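Since the config block in [A] is a JavaScript object literal rather than strict JSON (the keys are unquoted), a small regex lookup is probably the least painful way to read `checksum2` and `room_id` from the page source. This is a sketch only: the class and method names are invented, and it assumes the values stay double-quoted on the live site.

```java
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of RipMe.
class RoomConfigParser {

    /** Returns the double-quoted string value of the given key in a window.config block, if present. */
    static Optional<String> stringField(String html, String key) {
        // \b stops "name" from matching inside "max_room_name_length".
        Pattern p = Pattern.compile("\\b" + Pattern.quote(key) + "\\s*:\\s*\"([^\"]*)\"");
        Matcher m = p.matcher(html);
        return m.find() ? Optional.of(m.group(1)) : Optional.empty();
    }
}
```

For example, `RoomConfigParser.stringField(pageSource, "checksum2")` and `RoomConfigParser.stringField(pageSource, "room_id")` would yield the two values needed for the later requests, with `pageSource` being the HTML already fetched by the ripper.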