xA-Scraper
NewGrounds embeds not captured
Just now noticed: only the initial pic in a 'series' post is captured; artists will often include extra pictures in the submission's "description". (Unfortunately these pictures are lower resolution than the initial one, so artists often ALSO put HQ links to them pointing to imgur.com or files.catbox.moe ...)
Can you e-mail me an instance of this? I hadn't seen that particular gallery structure.
I'm not sure what to do about general embedded stuff (this is an issue with patreon too). Scraping things like youtube and imgur is a substantially complicated task, and it's a whole lot of work I'd like to avoid.
I've thought about trying to do something like use JDownloader as an external tool for this sort of thing. Right now, I just ignore external links.
I emailed you an example.
> I've thought about trying to do something like use JDownloader as an external tool for this sort of thing. Right now, I just ignore external links.
Are they included in the database? It would be nice to be able to comb the database for a certain type of link I know a CLI tool or JDownloader can handle.
I mean, I try to save the contents of any text description, so.... maybe?
A lot of it is hard because it's basically done with freeform text input.
It'd be a pretty easy bit of SQL to dump every description from a specific user to a csv file for further poking, if you want.
```sql
\copy (
    SELECT
        content,
        content_structured
    FROM
        art_item
    WHERE
        artist_id IN (
            SELECT
                id
            FROM
                scrape_targets
            WHERE
                artist_name = 'artist-name-here'
        )
) TO '~/Downloads/export.csv' CSV HEADER
```
CSV files are definitely workable, I just have to do regex matches, really.
I assume there's no easy way to get a separate csv file for each individual artist without doing some Python scripting?
> I assume there's no easy way to get a separate csv file for each individual artist without doing some Python scripting?
The above query is for a single artist?
You could either do Python stuff, or use a bash script to dump to multiple files. The actual query can be one line, and you can pass a query and database to psql. Realistically, unless I was doing it regularly, I'd probably just hand-munge the queries in a bash script.
Possibly relevant: https://stackoverflow.com/questions/43295406/how-to-copy-to-multiple-csv-files-in-postgresql
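Along the lines of that Stack Overflow answer, a per-artist loop could be sketched roughly like this. This is only a sketch: the `build_copy_cmd` helper and the `xa_scraper` database name are assumptions, and artist names containing single quotes would need escaping before being inlined.

```shell
# Build the one-line \copy command for a single artist.
# psql meta-commands must fit on one line, hence inlining the whole query.
# (Hypothetical helper; table/column names match the query pasted above.)
build_copy_cmd() {
    local artist="$1" outdir="$2"
    printf "\\\\copy (SELECT content, content_structured FROM art_item WHERE artist_id IN (SELECT id FROM scrape_targets WHERE artist_name = '%s')) TO '%s/%s.csv' CSV HEADER\n" "$artist" "$outdir" "$artist"
}

# The loop itself needs a live database, so it is shown commented out:
# mkdir -p "$HOME/Downloads/exports"
# psql -d xa_scraper -At -c "SELECT artist_name FROM scrape_targets" |
# while IFS= read -r artist; do
#     psql -d xa_scraper -c "$(build_copy_cmd "$artist" "$HOME/Downloads/exports")"
# done

# Prints the generated \copy line for one example artist.
build_copy_cmd 'example-artist' /tmp/exports
```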
> The above query is for a single artist?
Yes I know, but doing hundreds of artists by hand isn't exactly ideal.
> Realistically, unless I was doing it regularly...
That was the idea. I'll probably automate it somehow, but a lot less competently.
> Possibly relevant: https://stackoverflow.com/questions/43295406/how-to-copy-to-multiple-csv-files-in-postgresql
Aha, perfect, a for-loop.
Whoops, didn't mean to close the entire issue.
Also, now:
```python
print("    dump [export_path] [sitename]")
print("        Dump the database contents for users from a specific site to [export_path]")
```
Oh nice thanks. I had a syntax error with the script you pasted above that I was going to ask you about, but this is a lot better. And since it's JSON, I can more easily iterate over this with jq.
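For what it's worth, the jq step could look something like the following. This is a sketch only: the actual JSON structure of the dump is an assumption here (a flat array of objects with a `content` field), so the stub file just illustrates the shape; the pipeline extracts the imgur/catbox links the thread started from.

```shell
# Stub export so the pipeline below is runnable; the real file comes from
# the new `dump` command, whose actual JSON structure is assumed here.
cat > export.json <<'EOF'
[
  {"content": "HQ versions: https://files.catbox.moe/abc123.png and https://i.imgur.com/xyz789.jpg"},
  {"content": "no external links here"},
  {"content": null}
]
EOF

# Pull out imgur / catbox URLs, deduplicated, for feeding to a downloader.
jq -r '.[].content // empty' export.json |
  grep -Eo 'https?://(i\.)?imgur\.com/[A-Za-z0-9._/-]+|https?://files\.catbox\.moe/[A-Za-z0-9._-]+' |
  sort -u > external_links.txt

cat external_links.txt
```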
I did manage to learn how to properly run a shell script in git-bash from Git for Windows, though. Turns out I'm probably still better off using PowerShell, but it could come in handy later.