
program hangs and does not exit

Open youngblood opened this issue 4 years ago • 28 comments

I'm trying Examples 1 & 2 from the "How to - Save Single Webpage" section in readme.md, as well as method 3 from examples.py, using Python 3.7, pywebcopy 6.3, and one of the example URLs from examples.py: 'https://codeburst.io/building-beautiful-command-line-interfaces-with-python-26c7e1bb54df'

Issues: Methods 1 & 2 hang every time. Method 3 appears to be deprecated. Nothing appears in my log_file with this approach, so it's difficult to troubleshoot further. And the join_timeout setting doesn't appear to have any effect.

Based on the other open issue (#35 ), I also included the thread-closing loop from examples.py.

Files are downloading, but when I try to open the main HTML file it never shows any of the images (perhaps it never got to the point of saving them?).

My code, modified from examples:

import time
import threading

import pywebcopy

preferred_clock = time.time

project_url = 'https://codeburst.io/building-beautiful-command-line-interfaces-with-python-26c7e1bb54df'
project_folder = '/Users/user/Downloads/scraped_content'
project_name = 'example_project'

pywebcopy.config.setup_config(
	project_url=project_url,
	project_folder=project_folder,
	project_name=project_name,
	over_write=True,
	bypass_robots=True,
	debug=False,
	log_file='/Users/user/Downloads/scraped_content/pwc_log.log',
	join_timeout=30
)

start = preferred_clock()

# method_1 - This one hangs every time (never finishes so I have to halt).
'''
pywebcopy.save_webpage(url=project_url,
					   project_folder=project_folder,
					   project_name=project_name)
'''

# method_2 - This one also hangs every time
wp = pywebcopy.WebPage()
wp.get(project_url)
wp.save_complete()
wp.shutdown()

# method_3_from_examples.py - this one is deprecated: 
# "Direct initialisation with url is not supported now."
'''
pywebcopy.WebPage(url=project_url,project_folder=project_folder).save_complete()
'''

for thread in threading.enumerate():
    if thread == threading.main_thread():
        continue
    else:
        thread.join()

print("Execution time : ", preferred_clock() - start)

youngblood avatar Apr 23 '20 23:04 youngblood

The value of join_timeout is applied to each thread, and you have set it to 30, so it's most likely waiting on each thread for 30 seconds and just looking like it froze.
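
A small standalone sketch (not pywebcopy code) of how a per-thread join timeout adds up; the numbers are scaled down, but join_timeout=30 behaves the same way:

import threading
import time

# Five workers that never finish in time (daemon so the demo still exits).
workers = [threading.Thread(target=time.sleep, args=(1000,), daemon=True) for _ in range(5)]
for w in workers:
    w.start()

start = time.time()
for w in workers:
    w.join(timeout=3)  # waits the full 3 s for each still-alive thread
print("Waited", round(time.time() - start), "seconds")  # roughly 15 s in total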

rajatomar788 avatar Apr 24 '20 06:04 rajatomar788

The log_file parameter is removed due to flushing errors. See #36.

rajatomar788 avatar Apr 24 '20 06:04 rajatomar788

Well, I let it run (Method 2 above) for hours overnight last night, with just that one URL to scrape, but it was still stuck at the same place this morning: [screenshot]

Then I set join_timeout to 5 and tried again, still with just the one codeburst.io URL, and got the same result.

Then I tried with two other URLs: https://owl.purdue.edu/owl/general_writing/academic_writing/establishing_arguments/rhetorical_strategies.html and http://www.history.com/topics/cold-war/hollywood-ten

So then I tried using Method 2 on a list of 3 URLs, just to see if it would get past the first one. It didn't. It still hangs on the first URL in the list:

# -*- coding: utf-8 -*-

import os
import time
import threading

import pywebcopy

preferred_clock = time.time

project_url = 'https://codeburst.io/building-beautiful-command-line-interfaces-with-python-26c7e1bb54df'
# project_url = 'https://owl.purdue.edu/owl/general_writing/academic_writing/establishing_arguments/rhetorical_strategies.html'
# project_url = 'http://www.history.com/topics/cold-war/hollywood-ten'
project_folder = '/Users/reed/Downloads/scraped_content'
project_name = 'example_project'

pywebcopy.config.setup_config(
	project_url=project_url,
	project_folder=project_folder,
	project_name=project_name,
	over_write=True,
	bypass_robots=True,
	debug=False,
	log_file='/Users/reed/Downloads/scraped_content/pwc_log.log',
	join_timeout=5
)

start = preferred_clock()

# method_1 - This one hangs every time (never finishes so I have to halt).
'''
pywebcopy.save_webpage(url=project_url,
					   project_folder=project_folder,
					   project_name=project_name)
'''

# method_2 - I made this one up based on what I pieced together.
urls = [
	'https://codeburst.io/building-beautiful-command-line-interfaces-with-python-26c7e1bb54df',
	'https://owl.purdue.edu/owl/general_writing/academic_writing/establishing_arguments/rhetorical_strategies.html',
	'http://www.history.com/topics/cold-war/hollywood-ten'
]

for url in urls:
	wp = pywebcopy.WebPage()
	wp.get(url)
	wp.save_complete()
	wp.shutdown()
	for thread in threading.enumerate():
	    if thread == threading.main_thread():
	        continue
	    else:
	        thread.join()

# method_3_from_examples.py - this one is deprecated: 
# "Direct initialisation with url is not supported now."
'''
pywebcopy.WebPage(url=project_url,project_folder=project_folder).save_complete()
'''

print("Execution time : ", preferred_clock() - start)

youngblood avatar Apr 24 '20 14:04 youngblood

The internal code has changed a lot without examples.py being updated, so I would do it this way:


from pywebcopy import WebPage
from pywebcopy import config


def scrape(url, folder, timeout=1):
  
    config.setup_config(url, folder)

    wp = WebPage()
    wp.get(url)

    # start the saving process
    wp.save_complete()

    # join the sub threads
    for t in wp._threads:
        if t.is_alive():
            t.join(timeout)

    # location of the html file written 
    return wp.file_path
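
Calling the helper would look something like this (the folder path and timeout are just examples):

saved_file = scrape(
    'https://codeburst.io/building-beautiful-command-line-interfaces-with-python-26c7e1bb54df',
    'scraped_content',
    timeout=10,
)
print('Page saved to:', saved_file)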

rajatomar788 avatar Apr 25 '20 04:04 rajatomar788

I'll try that one next - thank you for your help!

youngblood avatar Apr 27 '20 00:04 youngblood

Issue should be fixed. I am closing it.

rajatomar788 avatar May 04 '20 01:05 rajatomar788

Yeah @rajatomar788, it still hangs now and then!

I'm just looping through a list of websites and calling the scrape method as you suggested above.

Nevertheless, it usually hangs with this log: Queueing download of asset files.

[screenshot]

from pywebcopy import WebPage
from pywebcopy import config

def scrape(url, folder, timeout=1):
  
    config.setup_config(url, folder)

    wp = WebPage()
    wp.get(url)

    # start the saving process
    wp.save_complete()

    # join the sub threads
    for t in wp._threads:
        if t.is_alive():
            t.join(timeout)

    # location of the html file written 
    return wp.file_path

def main():
	output_folder = "jy_scrape"

	links = initialize_list("Location to txt file")

	for link in links:
		try:
			scrape(link, output_folder)
		except Exception:
			continue

if __name__ == '__main__':
	main()

Another question: why not use the shutdown() method after save_complete()? I don't think save_assets within save_complete is blocking, right? Is it the same?

I temporarily stopped the hanging by changing download_file to not be multi-threaded.

junyango avatar Jun 02 '20 17:06 junyango

@junyango the program hanging could be the result of many factors; a poor network connection (bad ping) could be one of them. If you have used the code as above, then you should definitely check your ping.

Another question: why not use the shutdown() method after save_complete()? I don't think save_assets within save_complete is blocking, right? Is it the same?

Yes you could use any implementation, whichever works for you.

I temporarily stopped the hanging by changing download_file to not be multi-threaded.

How did you do it?

rajatomar788 avatar Jun 04 '20 06:06 rajatomar788

@rajatomar788 did it hang in your case? Under the save_assets method in webpage.py, I just changed this:

for elem in elms:
    elem.run()
    # with POOL_LIMIT:
    #     t = threading.Thread(name=repr(elem), target=elem.run)
    #     t.start()
    #     self._threads.append(t)

Yeah, this temporarily stops the problem from occurring, since elem.run calls download_file(), which does its own session.get call.

I realized that in your implementation the joining of threads is actually done in the shutdown() method, but having tried that, it still doesn't work. It gets stuck at "Queuing download of assets", so I tried to isolate the problem and found that it was the code above that was causing it. It's running now, albeit somewhat slower, but it works pretty reliably I think.

I hope my understanding of the multi-threaded downloading portion is correct; what I just did reverts it back to single-threaded downloading.

junyango avatar Jun 04 '20 06:06 junyango

@junyango your change just undoes the entire parallel downloading capability of pywebcopy.

I hope my understanding of the multi-threaded downloading portion is correct; what I just did reverts it back to single-threaded downloading.

Of course, single threading is more error-proof. If you are not heavy-lifting image-filled pages, then it should be good for you.

rajatomar788 avatar Jun 04 '20 07:06 rajatomar788

@rajatomar788 Yup, I wanted to get a working version up and running. However, given that t.join() is called with a timeout, the threading implementation shouldn't hang in any case either. I wonder if the people who initially raised this issue are still facing it.

Anyway, would like to thank you for the prompt replies! :)

junyango avatar Jun 04 '20 07:06 junyango

@junyango

However, given that t.join() is called with a timeout, the threading implementation shouldn't hang in any case either. I wonder if the people who initially raised this issue are still facing it.

If their use case is simple, then there shouldn't be any problem. But for the special cases, this thread will help them.

Regards.

rajatomar788 avatar Jun 04 '20 07:06 rajatomar788

I'm still seeing this behavior. I've tried using the newer approach, but to no avail. I have discovered a few things, though. When using the non-threading approach, I see a number of unrecognized response errors in the debug logs.

elements   - ERROR    - URL returned an unknown response: [http://webdowntest.local/static/path/to/second/nothing.png]

However, those errors do not appear when using threading. Instead, the join() method gets called and no error messages get written.

It seems like there's a problem with how threads deal with assets that do not exist, though that's not the full story. When the head only contains non-existent references, the script completes. I don't understand threading enough to make sense of this.

I've narrowed this down to the simplest possible failure. On a local server, if I try to save_complete() on a site with a single index.html page that contains the markup below (but without any of the assets), the script does not complete.

However, if I comment out any single line in the <head> tag (or I add the missing assets), the script completes and exits.
<html>
  <head>
    <meta charset="utf-8">
    <link href="https://use.fontawesome.com/releases/v5.0.8/css/all.css" rel="stylesheet">
    <link href="https://fonts.googleapis.com/css?family=Open+Sans" rel="stylesheet">
    <link href="https://fonts.googleapis.com/css?family=Roboto+Slab:700" rel="stylesheet">
    <link name="first-static-path-to-nothing" href="/static/path/to/first/nothing.png" >
    <link name="second-static-path-to-nothing" href="/static/path/to/second/nothing.png" >
</head>
  <body>
    Here comes everybody.
  </body>
</html>
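
For anyone trying to reproduce this: one quick way to serve such an index.html locally (hostname and port here are arbitrary) is Python's built-in HTTP server.

# Serve the directory containing index.html on http://localhost:8000;
# equivalent to running `python -m http.server 8000` in that folder.
from http.server import HTTPServer, SimpleHTTPRequestHandler

HTTPServer(("localhost", 8000), SimpleHTTPRequestHandler).serve_forever()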

Here's the script I'm running, modified to show threads that get joined and all active threads when the function completes.

from pywebcopy import WebPage
from pywebcopy import config

url = "http://webdowntest.local"

def scrape(url, folder, timeout=1):

    config.setup_config(url, folder, debug=True)

    wp = WebPage()
 
    wp.get(url)

    # start the saving process
    wp.save_complete()

    # join the sub threads
    for t in wp._threads:
        if t.is_alive():
           t.join(timeout)
           print(f'AFTER JOIN: {t.name}')
    
    active_threads = []
    for i in wp._threads:
        if i.is_alive():
            active_threads.append(i)

    # location of the html file written
    return [wp.file_path, active_threads]

a = scrape(url, 'webdown2')
print(a)

deleuzer avatar Jun 13 '20 23:06 deleuzer

@deleuzer

When using the non-threading approach, see a number of unrecognized response errors in the debug logs.

elements   - ERROR    - URL returned an unknown response: [http://webdowntest.local/static/path/to/second/nothing.png]

As you mentioned, these assets do not exist, so this error message is the correct log. If you are seeing this, then it is working as expected.

However, if I comment out any single line in the <head> tag (or I add the missing assets), the script completes and exits.

You can try a few things:

  1. check your internet connection's ping
  2. check if the files you have added as assets do exist through a browser.
  3. try setting the timeout to 0 or 0.001 seconds.

Here's the script I'm running, modified to show threads that get joined and all active threads when the function completes.

Threads are marked non-active when they are joined, so in your case you won't get any active thread back because they all have been joined and become inactive. So there isn't any point in doing that.

rajatomar788 avatar Jun 14 '20 03:06 rajatomar788

@deleuzer

elements - ERROR - URL returned an unknown response: [http://webdowntest.local/static/path/to/second/nothing.png]

As you mentioned, these assets do not exist, so this error message is the correct log. If you are seeing this, then it is working as expected.

Obviously, that's why I pointed out the difference between non-threading and threading.

However, if I comment out any single line in the <head> tag (or I add the missing assets), the script completes and exits.

You can try a few things:

You can try to reproduce the problem by following the very easy steps I gave you. That's why I offered such a detailed comment with the simplest possible reproduction of the problem.

1. check your internet connection's ping

Like I said, I'm querying a local server. As for the remote files, I can wget them without a problem, use the Python requests module to get them just fine, and download them without threading with this very package. The problem arises when threading is used.

2. check if the files you have added as assets do exist through a browser.

The point is that it should not matter if the assets are accessible or not. The package should handle this correctly by ignoring anything that throws a 404 error. And it does, unless it uses threads. So clearly your implementation of threading has a bug in it.

3. try setting the timeout to 0 or 0.001 seconds.

This has no effect other than making it hang sooner.

Here's the script I'm running, modified to show threads that get joined and all active threads when the function completes.

Threads are marked non-active when they are joined, so in your case you won't get any active thread back because they all have been joined and become inactive. So there isn't any point in doing that.

Actually, that's not correct. When the function completes (which is not always the case), I get a list of threads that still register as is_alive() after being joined. Unfortunately, even when the function completes, the process does not end.
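
A quick standalone check (independent of pywebcopy) shows that join() with a timeout does not make a thread inactive if the timeout expires first:

import threading
import time

t = threading.Thread(target=time.sleep, args=(10,), daemon=True)
t.start()
t.join(1)            # returns after about 1 second; the worker is still sleeping
print(t.is_alive())  # True: the thread survived the timed-out join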

deleuzer avatar Jun 14 '20 09:06 deleuzer

If your use case is working out without threading, then you should go with that for now. I will unit-test it a little bit more, or you can contribute through a PR.

rajatomar788 avatar Jun 15 '20 15:06 rajatomar788

I eventually went the slower single-threaded route using something similar to what @junyango suggested above, and that dramatically reduced the frequency with which it hung. I still encountered a few problematic sites that wouldn't finish saving even after being left for hours/overnight - though I had some others that did eventually finish after several hours of running, so perhaps the remaining "hangs" would eventually finish if I gave them enough time.

But that made me wonder: would it be possible to implement an overall timeout at the save_assets level, in addition to the per-asset timeout that currently exists? It seems like the existence of an overall timeout (even if it resulted in a few more errors per thousand pages) would resolve a lot of the pain here. Believe me, if I knew how to do this myself I would post a PR - but all my attempts ended with segfaults.
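
To sketch what I mean (just an idea, not pywebcopy code): each join could consume whatever is left of a shared budget instead of using a fixed per-thread timeout, roughly like this:

import time

def join_with_overall_timeout(threads, overall_timeout=60.0):
    # Join each thread only for the time remaining in a shared budget,
    # so the caller never waits longer than overall_timeout seconds in total.
    deadline = time.monotonic() + overall_timeout
    for t in threads:
        remaining = deadline - time.monotonic()
        if remaining <= 0:
            break  # budget spent; give up on the rest
        t.join(remaining)

# e.g. join_with_overall_timeout(wp._threads, overall_timeout=120)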

Regardless, thanks for all of your help on this @rajatomar788 !

youngblood avatar Jul 07 '20 23:07 youngblood

@youngblood

some others that did eventually finish after several hours of running, so perhaps the remaining "hangs" would eventually finish if I gave them enough time.

Preliminary examination suggests this is caused by timeouts in the requests library or the underlying urllib3 library. I am trying to figure out the exact issue; if it is fixable, then I will fix it.
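
For context, requests applies no timeout by default, so a plain session.get() on a stalled connection can block indefinitely; passing an explicit (connect, read) timeout is the usual mitigation, roughly:

import requests

session = requests.Session()
# Without the timeout argument this call could block forever on a stalled server.
response = session.get("https://example.com/some-asset.png", timeout=(5, 30))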

would it be possible to implement an overall timeout at the save_assets level, in addition to the per-asset timeout that currently exists?

pywebcopy is designed to keep the downloads of the different resources separate, to avoid pulling everything into RAM and causing the computer to hang. So only a complete rewrite could allow such a thing, if it is possible at all.

Believe me, if I knew how to do this myself I would post a PR - but all my attempts ended with segfaults.

No worries :) Have a great day.

rajatomar788 avatar Jul 08 '20 04:07 rajatomar788

I'm also facing the same problem. The program doesn't exit. Is there anything I can do about it? @rajatomar788

GalDayan avatar Jul 21 '20 07:07 GalDayan

Did you try it using a single thread as mentioned above? @GalDayan

rajatomar788 avatar Jul 21 '20 07:07 rajatomar788

@rajatomar788

Did you try it using a single thread as mentioned above? @GalDayan

I've tried single-threading with this code:

for elem in elms:
    elem.run()
    # with POOL_LIMIT:
    #     t = threading.Thread(name=repr(elem), target=elem.run)
    #     t.start()
    #     self._threads.append(t)

But when I commented out this code, it downloaded only the HTML, without CSS, fonts, etc.

GalDayan avatar Jul 21 '20 08:07 GalDayan

@deleuzer when I try your code, something weird happens. On the first run, it gets stuck at "Queuing download of 100 assets" as @junyango mentioned, and I had to halt it. But when I ran the same code a second time, it completed and stopped on its own. To be sure, I did the same thing again, and the same thing happened: the first run hung, the second run worked.

What could be the reason for it? It's very interesting.

EDIT: When I use your code together with the correction @junyango suggested, it completes downloading a web page with 180 assets in 2-3 minutes, and it works on the first run. I guess this is the solution.

ghost avatar Oct 07 '20 10:10 ghost

Hello. It looks like the combination of these two is generating some kind of semaphore deadlock; I haven't figured out exactly what kind. https://github.com/rajatomar788/pywebcopy/blob/2852d1856783be3bc7e4725c5df850b15defa70d/pywebcopy/webpage.py#L230 https://github.com/rajatomar788/pywebcopy/blob/2852d1856783be3bc7e4725c5df850b15defa70d/pywebcopy/elements.py#L56

As a solution, I offer this:

import multiprocessing.pool

# In WebPage.save_assets(), replace the per-element thread spawning with a bounded thread pool:
with multiprocessing.pool.ThreadPool(processes=5) as tp:
    for _ in tp.imap(lambda e: e.run(), elms):
        pass

# And in elements.py, run() stays a thin wrapper around the download:
def run(self):
    self.download_file()
P.S. I also suggest removing this one and replacing it with a parameter that allows setting the number of worker processes. https://github.com/rajatomar788/pywebcopy/blob/2852d1856783be3bc7e4725c5df850b15defa70d/pywebcopy/globals.py#L97

CutePotatoDev avatar Mar 21 '21 19:03 CutePotatoDev

If anyone is facing this issue and needs pywebcopy to work, check out this version where I removed all multithreading. It did not seem that much slower, but it definitely does not hang!! Here's the commit in case we want to support a single-threaded version of pywebcopy in the future. I would be happy to make pywebcopy single-threaded by default with the optional feature of using multithreading.

From my limited experience with pywebcopy, multithreading is currently broken and made it impossible to copy websites. Supporting single threading would make this package 1000000% better in my opinion :)

Note: I did try setting

POOL_LIMIT = threading.Semaphore(1) 

but that didn't seem to work.

cc @rajatomar788

https://github.com/davidwgrossman/pywebcopy

gravelcycles avatar Jun 12 '21 21:06 gravelcycles

Hey @davidwgrossman

I appreciate you making pywebcopy single-threaded.

Originally pywebcopy was made to run on a single thread, but large graphics-heavy websites forced me to think about multithreading, and time limitations on my side have prevented a proper implementation.

I would love to see a single-threaded pywebcopy with an optional multithreading feature, if you are up for the task.

rajatomar788 avatar Jun 13 '21 03:06 rajatomar788

Hello @rajatomar788

First of all, thank you for this great library!

I added an additional implementation on top of @davidwgrossman's code to run the save_assets method multi-threaded.

It seems to fix the hanging issue, and I think we can control single- vs multi-threaded behaviour via the pool size. Here is the commit:

https://github.com/darawaleep/pywebcopy/commit/6d0af9dd3f02e4a863e009319aff46178b52a352
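
Roughly, the idea looks like this (a sketch of the approach, not the exact code from the commit; elms stands for the page's asset elements as in the snippets above):

from concurrent.futures import ThreadPoolExecutor

def save_assets_with_pool(elms, max_workers=4):
    # max_workers=1 effectively gives single-threaded downloads;
    # larger values re-enable parallel downloading with a bounded pool.
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        list(pool.map(lambda e: e.run(), elms))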

darawaleep avatar Jun 21 '21 14:06 darawaleep

@darawaleep

The concurrent.futures library is not available in Python 2. Hence it can't be the solution for the multithreading issue that we are currently facing.

rajatomar788 avatar Jun 22 '21 09:06 rajatomar788

@rajatomar788 here is a PR I put up to disable multithreading by default. Comments and suggestions are welcome :) https://github.com/rajatomar788/pywebcopy/pull/78/

gravelcycles avatar Aug 05 '21 04:08 gravelcycles