
Missing data files?

Open ckingdev opened this issue 8 years ago • 3 comments

Hi, thanks for putting this together! I'm working on a project investigating how language is used in porn and its subcategories, and this is going to be quite helpful.

I'm looking at the raw data, and going by the numbers in the filenames it appears that quite a few files are missing. Is there a technical reason for them to be missing, are they merged somehow, or is it something else? It seems that images_1.json and images_2.json contain all the images from raw_data and nothing more; is that correct?

It's not a problem to re-run the crawling script if that's necessary, but I thought I'd check first. If that's the case, I'd be happy to make a PR with the updated data. Thanks for taking the time to put this together; you've saved me a lot of work.

ckingdev avatar Apr 22 '17 02:04 ckingdev

Sorry for the slow response:

Huh, you're referring to how the numbering starts at page 146, right? I'd never noticed this before; thanks for pointing it out. I don't know whether I'm actually missing data, but if you think I am and want to rerun the crawl, I'm totally fine with that! (Also, there has been more data added since I crawled this a year ago.)

Regarding images_1.json and images_2.json: GitHub caps individual file size, so I had to split the original images.json in two. You should (if I recall correctly) be able to just run `cat images_*.json > images.json` to reassemble the whole file.
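A quick sanity check on that merge, as a Python sketch. This assumes the split files are newline-delimited JSON (one record per line), which is the only layout where a plain `cat` yields a directly usable combined file; the tiny sample records here are made up for illustration.

```python
import json
import tempfile
from pathlib import Path

# Hypothetical stand-ins for the real split files, assuming
# newline-delimited JSON (one object per line).
tmp = Path(tempfile.mkdtemp())
(tmp / "images_1.json").write_text('{"id": 1}\n{"id": 2}\n')
(tmp / "images_2.json").write_text('{"id": 3}\n')

# Equivalent of `cat images_*.json > images.json`: concatenate
# the parts in filename order.
parts = sorted(tmp.glob("images_*.json"))
combined = "".join(p.read_text() for p in parts)
(tmp / "images.json").write_text(combined)

# Sanity check: every non-empty line of the merged file should
# still parse as a standalone JSON object.
records = [json.loads(line) for line in combined.splitlines() if line.strip()]
print(len(records))  # 3
```

If the split files were instead two halves of one big JSON array, a plain `cat` would produce invalid JSON and you'd need to parse and re-serialize instead.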

Regarding how images.json was constructed: I'm pretty sure I just ran `python process/images.py raw_data/*`. I probably should have documented that better, my bad 🤷‍♀️.

Also, I'd note that the crawling can be distributed (somewhat manually, using configs) if you want to speed it up; otherwise I think it'll take a while to run.

cdipaolo avatar Apr 25 '17 07:04 cdipaolo

Aha, I think I just found them! Data folder @9a4b053237.

It looks like there were two branches that, between them, had all the data (I'm going to verify this momentarily; my CLI-fu is rusty, so I'm just going to write a Python script to check), but they were stored in different folders. See how some of the data is in data/, and some in data/rawdata? I think the next commit removed the data files contained in data/, possibly on the assumption that they were duplicates.

Regardless, I'm going to see if I can put the whole dataset together. It should be easy to verify that it's all there: the number of unique albums should be roughly 36 times the number of pages (with some differences, especially on the later pages, due to rankings changing between scrapes). But it will be good enough for me if I can get the data from the first pages that are missing; I don't care much about albums with few views, and I'm guessing the distribution of views per album is very long-tailed.

I'll check back in later; if this data is what it appears to be, I'm in luck and I'll make a pull request. Really appreciate the response, and no worries about the delay; I hadn't realized this project hadn't been active in over a year.

ckingdev avatar Apr 25 '17 21:04 ckingdev

Huh interesting!

I'll run the processing scripts and update the files sometime soon (before the end of the weekend). Thanks a ton for finding this!!

cdipaolo avatar Apr 27 '17 10:04 cdipaolo