ReadabiliPy icon indicating copy to clipboard operation
ReadabiliPy copied to clipboard

ReadabiliPy from multiple threads

Open econaxis opened this issue 3 years ago • 2 comments

Running ReadabiliPy from multiple Python processes causes the /tmp/full.html files to be overwritten when using the Node readability code.

Relevant code here: https://github.com/alan-turing-institute/ReadabiliPy/blob/554327240dd8d8178fb460cc24cb762e0a45e366/readabilipy/simple_json.py#L33-L44

Could we add just a random UUID to the HTML file name to support running this in multiple processes?

econaxis avatar Apr 25 '21 20:04 econaxis

Hi @econaxis 👋 Thanks for the suggestion. I think the cleanest way to do this might be to use the functions in the Python tempfile module. I was thinking of using tempfile.TemporaryFile() with delete=false in a with scope, maybe wrapping that in an outer with scope that creates a dedicated temporary directory. Something like:

with tempfile.TemporaryDirectory() as temp_dir:
  with tempfile.TemporaryFile(dir=temp_dir, delete=false) as f_html:
    f_html.write(html)
    f_html.close() # The docs say Windows doesn't let the file be opened again while still open
    html_path = os.join(temp_dir, f_html.name)
     
      # Call Mozilla's Readability.js Readability.parse() function via node, writing output to a temporary file
      with tempfile.TemporaryFile(dir=temp_dir, delete=false) as f_json:
        f_json.close() # The docs say Windows doesn't let the file be opened again while still open
        json_path = os.join(temp_dir, f_json.name)
        jsdir = os.path.join(os.path.dirname(__file__), 'javascript') 
        with chdir(jsdir): 
          subprocess.check_call(["node", "ExtractArticle.js", "-i", html_path, "-o", json_path]) 

        # Read output of call to Readability.parse() from JSON file and return as Python dictionary
        input_json = json.loads(json_path)

        # The docs say temporary directory and its contents are removed when the scope closes, so 
        # we shouldn't need to explciily delete the HTML and JSON temp files. If we do need to, we 
        # could call os.remove(f.name) or os.unlink(f.name)


What are your thoughts?

[Edit 1: consistent use of temp_dir] [Edit 2: include parsing JSON file in temp dir scope and use temp file for JSON output]

martintoreilly avatar Apr 27 '21 18:04 martintoreilly

Instead of communicating via file system API's, could we communicate via stdin/stdout pipes using subprocess,Popen? Since our HTML inputs and JSON outputs are just text instead of binary, we shouldn't lose any information/special characters from doing it in this manner. Plus, we don't need to wait for filesystem which could be slow (opening/writing/closing the HTML). I could write a PR to show how I think I would do this.

If we must do using file system API's:

with tempfile.NamedTemporaryFile(delete=False) as f_html:
    f_html.write(html)
    f_html.close() # The docs say Windows doesn't let the file be opened again while still open
    html_path = f_html.name # f_html.name returns absolute path
     

    # If HTML input is /tmp/random.html, then JSON output is written to /tmp/random.html.json file
    json_path = f_html.name + '.json'
    jsdir = os.path.join(os.path.dirname(__file__), 'javascript') 

    with chdir(jsdir): 
      subprocess.check_call(["node", "ExtractArticle.js", "-i", html_path, "-o", json_path]) 

    # Read output of call to Readability.parse() from JSON file and return as Python dictionary
    input_json = json.loads(json_path)

    # The docs say temporary directory and its contents are removed when the scope closes, so 
    # we shouldn't need to explciily delete the HTML and JSON temp files. If we do need to, we 
    # could call os.remove(f.name) or os.unlink(f.name)

Changes: TemporaryFile to NamedTemporaryFile Removed TemporaryDirectory Removed creating JSON temporary file (file doesn't need to exist as Node will write to that filename once its done) Removed os.join because file.name returns absolute path

econaxis avatar Apr 29 '21 17:04 econaxis

this is fixed in https://github.com/alan-turing-institute/ReadabiliPy/pull/98

nelson-liu avatar Feb 13 '23 22:02 nelson-liu