ReadabiliPy
ReadabiliPy copied to clipboard
ReadabiliPy from multiple threads
Running ReadabiliPy from multiple Python processes causes the /tmp/full.html files to be overwritten when using the Node readability code.
Relevant code here: https://github.com/alan-turing-institute/ReadabiliPy/blob/554327240dd8d8178fb460cc24cb762e0a45e366/readabilipy/simple_json.py#L33-L44
Could we add just a random UUID to the HTML file name to support running this in multiple processes?
Hi @econaxis 👋 Thanks for the suggestion. I think the cleanest way to do this might be to use the functions in the Python tempfile module. I was thinking of using tempfile.TemporaryFile()
with delete=false
in a with
scope, maybe wrapping that in an outer with
scope that creates a dedicated temporary directory. Something like:
with tempfile.TemporaryDirectory() as temp_dir:
with tempfile.TemporaryFile(dir=temp_dir, delete=false) as f_html:
f_html.write(html)
f_html.close() # The docs say Windows doesn't let the file be opened again while still open
html_path = os.join(temp_dir, f_html.name)
# Call Mozilla's Readability.js Readability.parse() function via node, writing output to a temporary file
with tempfile.TemporaryFile(dir=temp_dir, delete=false) as f_json:
f_json.close() # The docs say Windows doesn't let the file be opened again while still open
json_path = os.join(temp_dir, f_json.name)
jsdir = os.path.join(os.path.dirname(__file__), 'javascript')
with chdir(jsdir):
subprocess.check_call(["node", "ExtractArticle.js", "-i", html_path, "-o", json_path])
# Read output of call to Readability.parse() from JSON file and return as Python dictionary
input_json = json.loads(json_path)
# The docs say temporary directory and its contents are removed when the scope closes, so
# we shouldn't need to explciily delete the HTML and JSON temp files. If we do need to, we
# could call os.remove(f.name) or os.unlink(f.name)
What are your thoughts?
[Edit 1: consistent use of temp_dir
]
[Edit 2: include parsing JSON file in temp dir scope and use temp file for JSON output]
Instead of communicating via file system API's, could we communicate via stdin/stdout pipes using subprocess,Popen
? Since our HTML inputs and JSON outputs are just text instead of binary, we shouldn't lose any information/special characters from doing it in this manner. Plus, we don't need to wait for filesystem which could be slow (opening/writing/closing the HTML). I could write a PR to show how I think I would do this.
If we must do using file system API's:
with tempfile.NamedTemporaryFile(delete=False) as f_html:
f_html.write(html)
f_html.close() # The docs say Windows doesn't let the file be opened again while still open
html_path = f_html.name # f_html.name returns absolute path
# If HTML input is /tmp/random.html, then JSON output is written to /tmp/random.html.json file
json_path = f_html.name + '.json'
jsdir = os.path.join(os.path.dirname(__file__), 'javascript')
with chdir(jsdir):
subprocess.check_call(["node", "ExtractArticle.js", "-i", html_path, "-o", json_path])
# Read output of call to Readability.parse() from JSON file and return as Python dictionary
input_json = json.loads(json_path)
# The docs say temporary directory and its contents are removed when the scope closes, so
# we shouldn't need to explciily delete the HTML and JSON temp files. If we do need to, we
# could call os.remove(f.name) or os.unlink(f.name)
Changes:
TemporaryFile
to NamedTemporaryFile
Removed TemporaryDirectory
Removed creating JSON temporary file (file doesn't need to exist as Node will write to that filename once its done)
Removed os.join
because file.name returns absolute path
this is fixed in https://github.com/alan-turing-institute/ReadabiliPy/pull/98