
program hangs and does not exit

Open youngblood opened this issue 4 years ago • 28 comments

I'm trying Examples 1 & 2 from the "How to - Save Single Webpage" section in readme.md, as well as method 3 from examples.py, using Python 3.7, pywebcopy 6.3, and one of the example URLs from examples.py: 'https://codeburst.io/building-beautiful-command-line-interfaces-with-python-26c7e1bb54df'

Issues: Methods 1 & 2 hang every time. Method 3 appears to be deprecated. Nothing appears in my log_file with this approach, so it's difficult to troubleshoot further. And the join_timeout setting doesn't appear to have any effect.

Based on the other open issue (#35 ), I also included the thread-closing loop from examples.py.

Files are downloading, but when I try to open the main HTML file it never shows any of the images (perhaps it never got to the point of saving them?).

My code, modified from examples:

import time
import threading

import pywebcopy

preferred_clock = time.time

project_url = 'https://codeburst.io/building-beautiful-command-line-interfaces-with-python-26c7e1bb54df'
project_folder = '/Users/user/Downloads/scraped_content'
project_name = 'example_project'

pywebcopy.config.setup_config(
	project_url=project_url,
	project_folder=project_folder,
	project_name=project_name,
	over_write=True,
	bypass_robots=True,
	debug=False,
	log_file='/Users/user/Downloads/scraped_content/pwc_log.log',
	join_timeout=30
)

start = preferred_clock()

# method_1 - This one hangs every time (never finishes so I have to halt).
'''
pywebcopy.save_webpage(url=project_url,
					   project_folder=project_folder,
					   project_name=project_name)
'''

# method_2 - This one also hangs every time
wp = pywebcopy.WebPage()
wp.get(project_url)
wp.save_complete()
wp.shutdown()

# method_3_from_examples.py - this one is deprecated: 
# "Direct initialisation with url is not supported now."
'''
pywebcopy.WebPage(url=project_url,project_folder=project_folder).save_complete()
'''

for thread in threading.enumerate():
    if thread == threading.main_thread():
        continue
    else:
        thread.join()

print("Execution time : ", preferred_clock() - start)

youngblood avatar Apr 23 '20 23:04 youngblood

The value of join_timeout is applied to each thread, and you have set it to 30, so it's most likely waiting on each thread for 30 seconds and just looking like it froze.
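
A small standalone sketch (not pywebcopy code) of how a per-thread join timeout adds up; the numbers are scaled down, but join_timeout=30 behaves the same way:

import threading
import time

# Five workers that never finish in time (daemon so the demo still exits).
workers = [threading.Thread(target=time.sleep, args=(1000,), daemon=True) for _ in range(5)]
for w in workers:
    w.start()

start = time.time()
for w in workers:
    w.join(timeout=3)  # waits the full 3 s for each still-alive thread
print("Waited", round(time.time() - start), "seconds")  # roughly 15 s in total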

rajatomar788 avatar Apr 24 '20 06:04 rajatomar788

The log_file parameter is removed due to flushing errors. See #36.

rajatomar788 avatar Apr 24 '20 06:04 rajatomar788

Well, I let it run (Method 2 above) for hours overnight last night, with just that one URL to scrape, but it was still stuck at the same place this morning: [screenshot]

Then I set join_timeout to 5 and tried again, still with just the one codeburst.io URL, and got the same result.

Then I tried with two other URLs: https://owl.purdue.edu/owl/general_writing/academic_writing/establishing_arguments/rhetorical_strategies.html and http://www.history.com/topics/cold-war/hollywood-ten

So then I tried using Method 2 on a list of 3 URLs, just to see if it would get past the first one. It didn't. It still hangs on the first URL in the list:

# -*- coding: utf-8 -*-

import os
import time
import threading

import pywebcopy

preferred_clock = time.time

project_url = 'https://codeburst.io/building-beautiful-command-line-interfaces-with-python-26c7e1bb54df'
# project_url = 'https://owl.purdue.edu/owl/general_writing/academic_writing/establishing_arguments/rhetorical_strategies.html'
# project_url = 'http://www.history.com/topics/cold-war/hollywood-ten'
project_folder = '/Users/reed/Downloads/scraped_content'
project_name = 'example_project'

pywebcopy.config.setup_config(
	project_url=project_url,
	project_folder=project_folder,
	project_name=project_name,
	over_write=True,
	bypass_robots=True,
	debug=False,
	log_file='/Users/reed/Downloads/scraped_content/pwc_log.log',
	join_timeout=5
)

start = preferred_clock()

# method_1 - This one hangs every time (never finishes so I have to halt).
'''
pywebcopy.save_webpage(url=project_url,
					   project_folder=project_folder,
					   project_name=project_name)
'''

# method_2 - I made this one up based on what I pieced together.
urls = [
	'https://codeburst.io/building-beautiful-command-line-interfaces-with-python-26c7e1bb54df',
	'https://owl.purdue.edu/owl/general_writing/academic_writing/establishing_arguments/rhetorical_strategies.html',
	'http://www.history.com/topics/cold-war/hollywood-ten'
]

for url in urls:
	wp = pywebcopy.WebPage()
	wp.get(url)
	wp.save_complete()
	wp.shutdown()
	for thread in threading.enumerate():
	    if thread == threading.main_thread():
	        continue
	    else:
	        thread.join()

# method_3_from_examples.py - this one is deprecated: 
# "Direct initialisation with url is not supported now."
'''
pywebcopy.WebPage(url=project_url,project_folder=project_folder).save_complete()
'''

print("Execution time : ", preferred_clock() - start)

youngblood avatar Apr 24 '20 14:04 youngblood

The internal code has changed a lot without examples.py being updated, so I would do it this way:


from pywebcopy import WebPage
from pywebcopy import config


def scrape(url, folder, timeout=1):
  
    config.setup_config(url, folder)

    wp = WebPage()
    wp.get(url)

    # start the saving process
    wp.save_complete()

    # join the sub threads
    for t in wp._threads:
        if t.is_alive():
            t.join(timeout)

    # location of the html file written 
    return wp.file_path
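
Calling the helper would look something like this (the folder path and timeout are just examples):

saved_file = scrape(
    'https://codeburst.io/building-beautiful-command-line-interfaces-with-python-26c7e1bb54df',
    'scraped_content',
    timeout=10,
)
print('Page saved to:', saved_file)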

rajatomar788 avatar Apr 25 '20 04:04 rajatomar788

I'll try that one next - thank you for your help!

youngblood avatar Apr 27 '20 00:04 youngblood

Issue should be fixed. I am closing it.

rajatomar788 avatar May 04 '20 01:05 rajatomar788

Yeah @rajatomar788, it still hangs now and then!

I'm just looping through a list of websites and calling the scrape method as you suggested above.

Nevertheless, it usually hangs with this log: Queueing download of asset files.

[screenshot]

from pywebcopy import WebPage
from pywebcopy import config

def scrape(url, folder, timeout=1):
  
    config.setup_config(url, folder)

    wp = WebPage()
    wp.get(url)

    # start the saving process
    wp.save_complete()

    # join the sub threads
    for t in wp._threads:
        if t.is_alive():
            t.join(timeout)

    # location of the html file written 
    return wp.file_path

def main():
	output_folder = "jy_scrape"

	links = initialize_list("Location to txt file")

	for link in links:
		try:
			scrape(link, output_folder)
		except Exception:
			continue

if __name__ == '__main__':
	main()

Another question: why not use the shutdown() method after save_complete()? I don't think save_assets within save_complete is blocking, right? Is it the same?

I temporarily stopped the hanging by changing download_file to not be multi-threaded.

junyango avatar Jun 02 '20 17:06 junyango

@junyango the program hanging could be the result of many factors; a poor network connection (bad ping) could be one of them. If you have used the code as above, then you should definitely check your ping.

Another question: why not use the shutdown() method after save_complete()? I don't think save_assets within save_complete is blocking, right? Is it the same?

Yes you could use any implementation, whichever works for you.

I temporarily stopped the hanging by changing download_file to not be multi-threaded.

How did you do it?

rajatomar788 avatar Jun 04 '20 06:06 rajatomar788

@rajatomar788 did it hang in your case? Under the save_assets method in webpage.py, I just changed this:

for elem in elms:
    elem.run()
    # with POOL_LIMIT:
    #     t = threading.Thread(name=repr(elem), target=elem.run)
    #     t.start()
    #     self._threads.append(t)

Yeah, this temporarily stops the problem from occurring, since elem.run calls download_file(), which does its own session.get call.

I realized that in your implementation the joining of threads is actually done in the shutdown() method, but having tried that, it still doesn't work. It gets stuck at "Queuing download of assets", so I tried to isolate the problem and found that it was the code above that was causing it. It's running now, albeit somewhat slower, but it works pretty reliably I think.

I hope my understanding of the multi-threaded downloading portion is correct; what I just did reverts it back to single-threaded downloading.

junyango avatar Jun 04 '20 06:06 junyango

@junyango your change just undoes the entire parallel downloading capability of pywebcopy.

I hope my understanding of the multi-threaded downloading portion is correct; what I just did reverts it back to single-threaded downloading.

Of course, single threading is more error-proof. If you are not heavy-lifting image-filled pages, then it should be good for you.

rajatomar788 avatar Jun 04 '20 07:06 rajatomar788

@rajatomar788 Yup, I wanted to get a working version up and running. However, given that t.join() is called with a timeout, the threading implementation shouldn't hang in any case either. I wonder if the people who initially raised this issue are still facing it.

Anyway, would like to thank you for the prompt replies! :)

junyango avatar Jun 04 '20 07:06 junyango

@junyango

However, given that t.join() is called with a timeout, the threading implementation shouldn't hang in any case either. I wonder if the people who initially raised this issue are still facing it.

If their use case is simple, then there shouldn't be any problem. But for the special cases, this thread will help them.

Regards.

rajatomar788 avatar Jun 04 '20 07:06 rajatomar788

I'm still seeing this behavior. I've tried using the newer approach, but to no avail. I have discovered a few things, though. When using the non-threading approach, I see a number of unrecognized response errors in the debug logs.

elements   - ERROR    - URL returned an unknown response: [http://webdowntest.local/static/path/to/second/nothing.png]

However, those errors do not appear when using threading. Instead, the join() method gets called and no error messages get written.

It seems like there's a problem with how threads deal with assets that do not exist, though that's not the full story. When the head only contains non-existent references, the script completes. I don't understand threading enough to make sense of this.

I've narrowed this down to the simplest possible failure. On a local server, if I try to save_complete() on a site with a single index.html page that contains the markup below (but without any of the assets), the script does not complete.

However, if I comment out any single line in the <head> tag (or I add the missing assets), the script completes and exits.
<html>
  <head>
    <meta charset="utf-8">
    <link href="https://use.fontawesome.com/releases/v5.0.8/css/all.css" rel="stylesheet">
    <link href="https://fonts.googleapis.com/css?family=Open+Sans" rel="stylesheet">
    <link href="https://fonts.googleapis.com/css?family=Roboto+Slab:700" rel="stylesheet">
    <link name="first-static-path-to-nothing" href="/static/path/to/first/nothing.png" >
    <link name="second-static-path-to-nothing" href="/static/path/to/second/nothing.png" >
</head>
  <body>
    Here comes everybody.
  </body>
</html>
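
For anyone trying to reproduce this: one quick way to serve such an index.html locally (hostname and port here are arbitrary) is Python's built-in HTTP server.

# Serve the directory containing index.html on http://localhost:8000;
# equivalent to running `python -m http.server 8000` in that folder.
from http.server import HTTPServer, SimpleHTTPRequestHandler

HTTPServer(("localhost", 8000), SimpleHTTPRequestHandler).serve_forever()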

Here's the script I'm running, modified to show threads that get joined and all active threads when the function completes.

from pywebcopy import WebPage
from pywebcopy import config

url = "http://webdowntest.local"

def scrape(url, folder, timeout=1):

    config.setup_config(url, folder, debug=True)

    wp = WebPage()
 
    wp.get(url)

    # start the saving process
    wp.save_complete()

    # join the sub threads
    for t in wp._threads:
        if t.is_alive():
           t.join(timeout)
           print(f'AFTER JOIN: {t.name}')
    
    active_threads = []
    for i in wp._threads:
        if i.is_alive():
            active_threads.append(i)

    # location of the html file written
    return [wp.file_path, active_threads]

a = scrape(url, 'webdown2')
print(a)

deleuzer avatar Jun 13 '20 23:06 deleuzer

@deleuzer

When using the non-threading approach, see a number of unrecognized response errors in the debug logs.

elements   - ERROR    - URL returned an unknown response: [http://webdowntest.local/static/path/to/second/nothing.png]

As you mentioned, these assets do not exist, so this error message is the correct log. If you are seeing this, then it is working as expected.

However, if I comment out any single line in the <head> tag (or I add the missing assets), the script completes and exits.

You can try a few things:

  1. check your internet connection's ping
  2. check if the files you have added as assets do exist through a browser.
  3. try setting the timeout to 0 or 0.001 seconds.

Here's the script I'm running, modified to show threads that get joined and all active threads when the function completes.

Threads are marked non-active when they are joined, so in your case you won't get any active thread back because they all have been joined and become inactive. So there isn't any point in doing that.

rajatomar788 avatar Jun 14 '20 03:06 rajatomar788

@deleuzer

elements - ERROR - URL returned an unknown response: [http://webdowntest.local/static/path/to/second/nothing.png]

As you mentioned, these assets do not exist, so this error message is the correct log. If you are seeing this, then it is working as expected.

Obviously, that's why I pointed out the difference between non-threading and threading.

However, if I comment out any single line in the <head> tag (or I add the missing assets), the script completes and exits.

You can try a few things:

You can try to reproduce the problem by following the very easy steps I gave you. That's why I offered such a detailed comment with the simplest possible reproduction of the problem.

1. check your internet connection's ping

Like I said, I'm querying a local server. As for the remote files, I can wget them without a problem, use the Python requests module to get them just fine, and download them without threading with this very package. The problem arises when threading is used.

2. check if the files you have added as assets do exist through a browser.

The point is that it should not matter if the assets are accessible or not. The package should handle this correctly by ignoring anything that throws a 404 error. And it does, unless it uses threads. So clearly your implementation of threading has a bug in it.

3. try setting the timeout to 0 or 0.001 seconds.

This has no effect other than making it hang sooner.

Here's the script I'm running, modified to show threads that get joined and all active threads when the function completes.

Threads are marked non-active when they are joined, so in your case you won't get any active thread back because they all have been joined and become inactive. So there isn't any point in doing that.

Actually, that's not correct. When the function completes (which is not always the case), I get a list of threads that still register as is_alive() after being joined. Unfortunately, even when the function completes, the process does not end.
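
A quick standalone check (independent of pywebcopy) shows that join() with a timeout does not make a thread inactive if the timeout expires first:

import threading
import time

t = threading.Thread(target=time.sleep, args=(10,), daemon=True)
t.start()
t.join(1)            # returns after about 1 second; the worker is still sleeping
print(t.is_alive())  # True: the thread survived the timed-out join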

deleuzer avatar Jun 14 '20 09:06 deleuzer

If your use case is working out without threading, then you should go with that for now. I will unit-test it a little bit more, or you can contribute through a PR.

rajatomar788 avatar Jun 15 '20 15:06 rajatomar788

I eventually went the slower single-threaded route using something similar to what @junyango suggested above, and that dramatically reduced the frequency with which it hung. I still encountered a few problematic sites that wouldn't finish saving even after being left for hours/overnight - though I had some others that did eventually finish after several hours of running, so perhaps the remaining "hangs" would eventually finish if I gave them enough time.

But that made me wonder: would it be possible to implement an overall timeout at the save_assets level, in addition to the per-asset timeout that currently exists? It seems like the existence of an overall timeout (even if it resulted in a few more errors per thousand pages) would resolve a lot of the pain here. Believe me, if I knew how to do this myself I would post a PR - but all my attempts ended with segfaults.
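
To sketch what I mean (just an idea, not pywebcopy code): each join could consume whatever is left of a shared budget instead of using a fixed per-thread timeout, roughly like this:

import time

def join_with_overall_timeout(threads, overall_timeout=60.0):
    # Join each thread only for the time remaining in a shared budget,
    # so the caller never waits longer than overall_timeout seconds in total.
    deadline = time.monotonic() + overall_timeout
    for t in threads:
        remaining = deadline - time.monotonic()
        if remaining <= 0:
            break  # budget spent; give up on the rest
        t.join(remaining)

# e.g. join_with_overall_timeout(wp._threads, overall_timeout=120)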

Regardless, thanks for all of your help on this @rajatomar788 !

youngblood avatar Jul 07 '20 23:07 youngblood

@youngblood

some others that did eventually finish after several hours of running, so perhaps the remaining "hangs" would eventually finish if I gave them enough time.

Preliminary examination suggests this is caused by timeouts in the requests library or the underlying urllib3 library. I am trying to figure out the exact issue; if it is fixable, then I will fix it.
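
For context, requests applies no timeout by default, so a plain session.get() on a stalled connection can block indefinitely; passing an explicit (connect, read) timeout is the usual mitigation, roughly:

import requests

session = requests.Session()
# Without the timeout argument this call could block forever on a stalled server.
response = session.get("https://example.com/some-asset.png", timeout=(5, 30))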

would it be possible to implement an overall timeout at the save_assets level, in addition to the per-asset timeout that currently exists?

pywebcopy is designed to keep the downloads of the different resources separate, to avoid pulling everything into RAM and causing the computer to hang. So only a complete rewrite could allow such a thing, if it is possible at all.

Believe me, if I knew how to do this myself I would post a PR - but all my attempts ended with segfaults.

No worries :) Have a great day.

rajatomar788 avatar Jul 08 '20 04:07 rajatomar788

I'm also facing the same problem. The program doesn't exit. Is there anything I can do about it? @rajatomar788

GalDayan avatar Jul 21 '20 07:07 GalDayan

Did you try it using a single thread as mentioned above? @GalDayan

rajatomar788 avatar Jul 21 '20 07:07 rajatomar788

@rajatomar788

Did you try it using a single thread as mentioned above? @GalDayan

I've tried single-threading with this code:

for elem in elms:
    elem.run()
    # with POOL_LIMIT:
    #     t = threading.Thread(name=repr(elem), target=elem.run)
    #     t.start()
    #     self._threads.append(t)

But when I commented out this code, it downloaded only the HTML, without CSS, fonts, etc.

GalDayan avatar Jul 21 '20 08:07 GalDayan

@deleuzer when I try your code, something weird happens. On the first run, it gets stuck at "Queuing download of 100 assets" as @junyango mentioned, and I had to halt it. But when I ran the same code a second time, it completed and stopped on its own. To be sure, I did the same thing again, and the same thing happened: the first run hung, the second run worked.

What could be the reason for it? It's very interesting.

EDIT: When I use your code together with the correction @junyango suggested, it completes downloading a web page with 180 assets in 2-3 minutes, and it works on the first run. I guess this is the solution.

ghost avatar Oct 07 '20 10:10 ghost

Hello. It looks like the combination of these two is generating some kind of semaphore deadlock; I haven't figured out exactly what kind. https://github.com/rajatomar788/pywebcopy/blob/2852d1856783be3bc7e4725c5df850b15defa70d/pywebcopy/webpage.py#L230 https://github.com/rajatomar788/pywebcopy/blob/2852d1856783be3bc7e4725c5df850b15defa70d/pywebcopy/elements.py#L56

As a solution, I offer this:

import multiprocessing.pool

# In WebPage.save_assets(), replace the per-element thread spawning with a bounded thread pool:
with multiprocessing.pool.ThreadPool(processes=5) as tp:
    for _ in tp.imap(lambda e: e.run(), elms):
        pass

# And in elements.py, run() stays a thin wrapper around the download:
def run(self):
    self.download_file()
P.S. I also suggest removing this one and replacing it with a parameter that allows setting the number of worker processes. https://github.com/rajatomar788/pywebcopy/blob/2852d1856783be3bc7e4725c5df850b15defa70d/pywebcopy/globals.py#L97

CutePotatoDev avatar Mar 21 '21 19:03 CutePotatoDev

If anyone is facing this issue and needs pywebcopy to work, check out this version where I removed all multithreading. It did not seem that much slower, but it definitely does not hang!! Here's the commit in case we want to support a single-threaded version of pywebcopy in the future. I would be happy to make pywebcopy single-threaded by default with the optional feature of using multithreading.

From my limited experience with pywebcopy, multithreading is currently broken and made it impossible to copy websites. Supporting single threading would make this package 1000000% better in my opinion :)

Note: I did try setting

POOL_LIMIT = threading.Semaphore(1) 

but that didn't seem to work.

cc @rajatomar788

https://github.com/davidwgrossman/pywebcopy

gravelcycles avatar Jun 12 '21 21:06 gravelcycles

Hey @davidwgrossman

I appreciate you making pywebcopy single-threaded.

Originally pywebcopy was made to run on a single thread, but large graphics-heavy websites forced me to think about multithreading, and time limitations on my side have prevented a proper implementation.

I would love to see a single-threaded pywebcopy with an optional multithreading feature, if you are up for the task.

rajatomar788 avatar Jun 13 '21 03:06 rajatomar788

Hello @rajatomar788

First of all, thank you for this great library!

I added an additional implementation on top of @davidwgrossman's code to run the save_assets method multi-threaded.

It seems to fix the hanging issue, and I think we can control single- vs multi-threaded behaviour via the pool size. Here is the commit:

https://github.com/darawaleep/pywebcopy/commit/6d0af9dd3f02e4a863e009319aff46178b52a352
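
Roughly, the idea looks like this (a sketch of the approach, not the exact code from the commit; elms stands for the page's asset elements as in the snippets above):

from concurrent.futures import ThreadPoolExecutor

def save_assets_with_pool(elms, max_workers=4):
    # max_workers=1 effectively gives single-threaded downloads;
    # larger values re-enable parallel downloading with a bounded pool.
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        list(pool.map(lambda e: e.run(), elms))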

darawaleep avatar Jun 21 '21 14:06 darawaleep

@darawaleep

The concurrent.futures library is not available in Python 2. Hence it can't be the solution for the multithreading issue that we are currently facing.

rajatomar788 avatar Jun 22 '21 09:06 rajatomar788

@rajatomar788 here is a PR I put up to disable multithreading by default. Comments and suggestions are welcome :) https://github.com/rajatomar788/pywebcopy/pull/78/

gravelcycles avatar Aug 05 '21 04:08 gravelcycles