
How to clone linked pages?

Open rstmsn opened this issue 4 years ago • 15 comments

I'm running the following example. The target page downloads OK, however none of the linked pages are being downloaded. Is there a configuration flag I can set, to download hyperlinked pages?

```python
from pywebcopy import save_website

kwargs = {'project_name': 'xxxx-clone-eb'}

save_website(
    url='https://xxxxx.com/ARTICLES/xxxx.htm',
    project_folder='/Users/xxxx/Documents/Code/xxxxx/EB',
    **kwargs
)
```

rstmsn avatar Dec 23 '20 14:12 rstmsn

Are the hyperlinked pages hosted on the same site domain or outside?

rajatomar788 avatar Dec 23 '20 16:12 rajatomar788

On the same domain. Currently it doesn't seem to be following any hyperlinks. In my project_folder I'm only seeing a single .html file, even though this page links to many other pages.

rstmsn avatar Dec 23 '20 17:12 rstmsn

pywebcopy builds a hierarchical folder structure, so your hyperlinked pages might be saved in subfolders relative to the main html file.

rajatomar788 avatar Dec 24 '20 07:12 rajatomar788

No. There is only one .html file, with no other folders or files. Clicking a hyperlink within the .html file returns a 404, because the package has not followed or downloaded any of the linked pages. Why would this be?

rstmsn avatar Dec 24 '20 08:12 rstmsn

I'm facing the same issue. Any updates?

unmurphy avatar Jan 06 '21 06:01 unmurphy

It could be a server-side, site-specific issue. Maybe the hyperlinks are not resolving properly due to bad URL or HTML formatting.

rajatomar788 avatar Jan 06 '21 12:01 rajatomar788

> It could be a server-side, site-specific issue. Maybe the hyperlinks are not resolving properly due to bad URL or HTML formatting.

The server side is fine. I just successfully cloned the same site manually, using wget. For others who might benefit from this code-free solution:

```sh
wget --recursive --no-clobber --page-requisites --html-extension --convert-links --restrict-file-names=windows --domains example.org --no-parent www.example.org/directory/
```

rstmsn avatar Jan 06 '21 15:01 rstmsn

@rstmsn with your solution, I found that each HTML src attribute was still pointing at the original resource. Do you have any other ideas?

unmurphy avatar Jan 07 '21 06:01 unmurphy

> @rstmsn with your solution, I found that each HTML src attribute was still pointing at the original resource. Do you have any other ideas?

recursive find & replace using grep?

rstmsn avatar Jan 07 '21 07:01 rstmsn
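
For anyone wanting to script that post-processing step, here is a rough Python sketch of the recursive find & replace, assuming the mirrored pages sit in a local folder and the goal is to rewrite absolute links to the origin host into site-relative ones. The folder name and origin URL below are placeholders, not values from this thread.

```python
# Rough sketch of a recursive find & replace over a mirrored site.
# 'mirror' and the origin URL are placeholders, not values from the thread.
import pathlib

mirror_root = pathlib.Path('mirror')     # folder produced by wget/pywebcopy
origin = 'https://www.example.org/'      # absolute prefix to rewrite

for html_file in mirror_root.rglob('*.html'):
    text = html_file.read_text(encoding='utf-8', errors='ignore')
    # Turn absolute references to the origin host into site-relative ones.
    patched = text.replace(origin, '/')
    if patched != text:
        html_file.write_text(patched, encoding='utf-8')
```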

Any update on this? Facing the same issue.

monim67 avatar Sep 16 '22 16:09 monim67

OK, just use the `save_website` function instead of `save_webpage`.

rajatomar788 avatar Sep 16 '22 16:09 rajatomar788

> OK, just use the `save_website` function instead of `save_webpage`.

The issue is with the save_website function itself. It's downloading a single page just like save_webpage. I'm using pywebcopy 7.0.2 on macOS.

monim67 avatar Sep 16 '22 19:09 monim67

Does anyone here know how to crawl a whole subdomain? I'm currently trying to test something out.

BradKML avatar Apr 02 '23 16:04 BradKML

@BrandonKMLee you can modify the session object that is created in the `save_website` function to discard any unwanted domains. Have a look at the source code of `save_website` and then just replace the default session with a modified one.

rajatomar788 avatar Apr 03 '23 02:04 rajatomar788
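
To make the suggestion above concrete, here is a minimal sketch of such a modified session, assuming a requests-compatible session is all that is needed. The class name, the choice to override `request`, and the empty 404 returned for off-domain URLs are illustrative assumptions; how the instance actually replaces the default session inside `save_website` depends on pywebcopy's source, as noted above.

```python
# A minimal sketch of the "modified session" idea, assuming a
# requests-compatible session is enough. The class name and the empty 404
# returned for off-domain URLs are illustrative, not pywebcopy API.
from urllib.parse import urlparse

import requests


class DomainLimitedSession(requests.Session):
    def __init__(self, allowed_host):
        super().__init__()
        self.allowed_host = allowed_host

    def request(self, method, url, **kwargs):
        host = urlparse(url).hostname or ''
        if not host.endswith(self.allowed_host):
            # Discard unwanted domains by answering with an empty 404
            # instead of fetching anything.
            blocked = requests.Response()
            blocked.status_code = 404
            blocked.url = url
            blocked._content = b''
            return blocked
        return super().request(method, url, **kwargs)
```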

Let's say there are these three scenarios:

  1. You want to scrape "https://www.nateliason.com/notes*" but nothing else around "https://www.nateliason.com", assuming all child URLs of "https://www.nateliason.com/notes/{pages}" are reachable from the parent page
  2. You want to scrape the articles within "https://paulminors.com/resources/book-summaries", but the URLs there are shortened and the site also has other unrelated articles
  3. You want to scrape a whole website (any site) but not any other domain

I am trying to figure out which value to change within the Session, since I don't know how it is tied to `save_website` (see the sketch after this comment for the kind of scope check I mean).

BradKML avatar Apr 03 '23 04:04 BradKML
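
For the three scenarios above, the deciding factor is the predicate the session (or any crawler hook) applies before following a link. The helper functions below are a minimal, hypothetical sketch of those checks; none of the names come from pywebcopy, and wiring them into `save_website` still requires swapping out its session as described earlier.

```python
# Hypothetical scope checks for the three scenarios above. These names are
# not part of pywebcopy; they only illustrate what a filtering session (or
# any crawler hook) would need to test before following a link.
from urllib.parse import urlparse


def in_scope_prefix(url, prefix='https://www.nateliason.com/notes'):
    # Scenario 1: only follow URLs under a fixed path prefix.
    return url.startswith(prefix)


def in_scope_domain(url, host='paulminors.com'):
    # Scenarios 2 and 3: only follow URLs on the target domain, any path.
    # (Scenario 2 still needs shortened URLs resolved, e.g. by following
    # redirects, before this check is meaningful.)
    return (urlparse(url).hostname or '').endswith(host)
```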