
Improve documentation on how to handle binary and non-binary files (local/remote, up-/download)

Open do-me opened this issue 3 years ago • 12 comments

Checklist

  • [X] I added a descriptive title
  • [X] I searched for other issues and couldn't find a duplication
  • [X] I already searched in Google and didn't find any good information or help

What is the issue/comment/problem?

There are a few issues around here concerned with file handling (#588, #558, #463, #151 amongst others). It would be nice to have a dedicated section in the docs with the recommended way of doing things for binary and non-binary files. Summed up:

Local

  • Load local file to browser (covered here or here)
  • Download file from browser to local (two examples here with file picker, but non-binary data only)

Remote

  • Load remote file to browser (covered in #588)
  • ~Download file from browser from remote~ (that should hopefully be impossible)

Because binary files (e.g. Excel or, more generally, zip files) behave differently from non-binary ones, it would be very useful to document the distinction; otherwise one stumbles across missing awaits or similar.
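To illustrate the "missing await" pitfall mentioned above, here is a minimal sketch in plain CPython. `fetch_bytes` is a hypothetical stand-in for an async call such as `response.bytes()`; it is not a PyScript or Pyodide function.

```python
import asyncio


async def fetch_bytes() -> bytes:
    # Hypothetical stand-in for an async call that returns binary data.
    await asyncio.sleep(0)
    return b"PK\x03\x04"


async def main():
    # Without `await`, the call returns a coroutine object, not bytes.
    forgotten = fetch_bytes()
    is_coroutine = asyncio.iscoroutine(forgotten)
    forgotten.close()  # close it so Python does not warn "never awaited"

    # With `await`, we get the actual bytes.
    data = await fetch_bytes()
    return is_coroutine, data


is_coroutine, data = asyncio.run(main())
print(is_coroutine, data)
```

Outside the browser the coroutine has to be driven with `asyncio.run`; the snippets in this issue use top-level `await` instead.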

I think most of the above points are already described somewhere but I'm missing an example of how to conveniently access the virtual file system in order to download something locally.

Let's consider this:

from pyodide.http import pyfetch
import asyncio
import pandas as pd 
import openpyxl
from io import BytesIO

response = await pyfetch(url="/downloads/test.xlsx", method="GET")
bytes_response = await response.bytes()
df = pd.read_excel(BytesIO(bytes_response))
df

That's the (currently) easiest way of loading binary files. If I call df.to_excel("test_output.xlsx") and df.to_csv("test_output.csv") pandas will save the output to the virtual file system.
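The `BytesIO` step is what turns raw bytes into something a file-based reader accepts. A minimal sketch of the same idea using only the standard library, with a zip archive standing in for the xlsx file (an xlsx file is itself a zip container); `make_zip_bytes` and `list_zip_members` are hypothetical helper names:

```python
import zipfile
from io import BytesIO


def make_zip_bytes(files: dict) -> bytes:
    """Build a zip archive in memory and return its raw bytes."""
    buffer = BytesIO()
    with zipfile.ZipFile(buffer, "w") as archive:
        for name, text in files.items():
            archive.writestr(name, text)
    return buffer.getvalue()


def list_zip_members(raw: bytes) -> list:
    """Wrap raw bytes in BytesIO so zipfile can treat them like a file."""
    with zipfile.ZipFile(BytesIO(raw)) as archive:
        return archive.namelist()


raw = make_zip_bytes({"a.txt": "hello", "b.txt": "world"})
members = list_zip_members(raw)
print(members)
```

Here `raw` plays the role of `bytes_response`, and `zipfile.ZipFile(BytesIO(raw))` plays the role of `pd.read_excel(BytesIO(bytes_response))`.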

What's the best way of automatically starting the download from the browser to local when pandas is done saving to the virtual file system or could this even be skipped in some way? Do we need to use some js proxy, js buffer for the hooks or would you simply use some pyodide function for this?

do-me avatar Jul 12 '22 07:07 do-me

I'm not sure if this issue was discussed last week when I wasn't around. But maybe @antocuni has an opinion on this? Or should I ping Fabio here? Thanks! And thanks @do-me for opening the issue =)

marimeireles avatar Jul 20 '22 21:07 marimeireles

What's the best way of automatically starting the download from the browser to local when pandas is done saving to the virtual file system or could this even be skipped in some way?

AFAIK we don't have a PyScript specific way to do it, so currently the best way is to use pyodide. This stackoverflow answer shows a possible solution: https://stackoverflow.com/questions/64669355/how-to-copy-download-file-created-in-pyodide-in-browser

Yes, we should provide a more straightforward way of doing it. Yes, we should definitely improve the docs :).

antocuni avatar Jul 22 '22 07:07 antocuni

Thanks, later I'll look into it. Meanwhile I might have found a different cross-browser solution for downloading blobs, will test later and update here. Once I get this running, I'll document everything and set up minimal examples for every variant.

do-me avatar Jul 22 '22 07:07 do-me

Just sharing some WIP in case anyone needs it asap. Loading a remote excel file .xlsx, reading as pandas df and downloading as .csv with the file picker solution.

Requires a download HTML button on the page, e.g. <button id="download">Download</button>

from pyodide.http import pyfetch
import asyncio
import pandas as pd 
import openpyxl
from io import BytesIO
import sys
from js import alert, console, document, Object, window  # console is needed for console.log below
from pyodide import create_proxy, to_js

async def load_df():
  response = await pyfetch(url="/downloads/test.xlsx", method="GET")
  bytes_response = await response.bytes()
  df = pd.read_excel(BytesIO(bytes_response))
  content = df.to_csv() # returns string when file name missing 
  return content
  
async def file_save(event):
	try:
		options = {
			"startIn": "downloads",
			"suggestedName": "test_123456.csv"
		}

		fileHandle = await window.showSaveFilePicker(Object.fromEntries(to_js(options)))
	except Exception as e:
		console.log('Exception: ' + str(e))
		return

	content = await load_df()

	file = await fileHandle.createWritable()
	await file.write(content)
	await file.close()
	return

def setup_button():
	# Create a Python proxy for the callback function
	file_save_proxy = create_proxy(file_save)

	# Set the listener to the callback
	document.getElementById("download").addEventListener("click", file_save_proxy, False)

setup_button()

I'm working on a) cross-browser functionality as file picker isn't working in Firefox and b) blob (= e.g. xlsx files) downloads.

do-me avatar Jul 22 '22 15:07 do-me

Not sure if this is super related, but we created a WordPress plugin around Pyscript, and most of the examples work on the site. However, it always throws this error when we try to read a remote csv file in pandas or even just read a remote URL. Is this related to encoding? The weird thing is the other examples, including the matplotlib one works.

[Image: screenshot of the error]

hellozeyu avatar Jul 26 '22 16:07 hellozeyu

Hi @hellozeyu this is not related as your URL is simply wrong. You're trying to read a csv from the GitHub landing page https://github.com/. Insert the real link (raw csv file, not the repo) and it should work. E.g. this one.

do-me avatar Jul 27 '22 06:07 do-me

Ah sorry, thought this was pandas-related. You cannot work with the urllib or requests package in pyscript but need to use the pyodide alternatives. See this example.

do-me avatar Jul 27 '22 06:07 do-me

Got it. It works for me. Thanks!

hellozeyu avatar Jul 27 '22 14:07 hellozeyu

@do-me Do you think the solution linked in this issue fits your use case? https://github.com/pyscript/pyscript/issues/756 Also, I think you already found a solution? The last time we talked, you said you had something almost working. Let me know if you need help, we can sync =) I think it'd be really cool to have docs on it, and we can do it in a "how-to" style. Basically, just some code snippets that work and a short explanation of why they work the way they do would be perfect. Jeff Glass contributed something like this last week: https://docs.pyscript.net/latest/howtos/passing-objects.html The one about output could be much shorter though.

marimeireles avatar Sep 12 '22 13:09 marimeireles

Hi @marimeireles! Thanks for coming back to this issue. I'm not at home this week but I'll have a look at the new docs next week. Looks promising!

I might have found a way for the last missing piece (binary downloads like excel) via octet streams and DOM manipulation. I didn't find the time yet to test properly, but as soon as I succeed, I'll come back here! So technically the issue is not yet 100% solved I'd say.

Great idea for the snippet-style docs - I think that really suits the spirit of pyscript!

do-me avatar Sep 12 '22 18:09 do-me

Alright! :) I'm around just ping me.

marimeireles avatar Sep 15 '22 10:09 marimeireles

I finally found the time for testing binary downloads from the virtual file system. I wrote a simple function that takes care of everything and saves a pandas excel export to the local file system:

from pyodide.http import pyfetch
import asyncio
import pandas as pd 
import openpyxl
from io import BytesIO
import base64
from js import document

def pandas_excel_export(df, filename):
    # save to virtual filesystem
    df.to_excel(filename + ".xlsx")

    # binary xlsx to base64 encoded downloadable string 
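    # read back the file written above (not a hardcoded name) and close it afterwards
    with open(filename + ".xlsx", "rb") as f:
        data = f.read()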
    base64_encoded = base64.b64encode(data).decode('UTF-8')
    octet_string = "data:application/octet-stream;base64,"
    download_string = octet_string + base64_encoded

    # create new helper DOM element, click (download) and remove 
    element = document.createElement('a')
    element.setAttribute("href",download_string)
    element.setAttribute("download",filename + ".xlsx")
    element.click()
    element.remove()

# import 
response = await pyfetch("/downloads/test.xlsx", method="GET")
bytes_response = await response.bytes()

# read from bytes
df = pd.read_excel(BytesIO(bytes_response))

# manipulate
df["d"] = df["a"] + df["b"]

# export
pandas_excel_export(df,"test")

Working example here.
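The string-building step of `pandas_excel_export` can be isolated and checked outside the browser. A minimal sketch, assuming only the standard library; `to_data_url` and `from_data_url` are hypothetical helper names, not part of PyScript or Pyodide:

```python
import base64


def to_data_url(data: bytes, mime_type: str = "application/octet-stream") -> str:
    """Encode raw bytes as a base64 data URL usable as an <a href=...> target."""
    encoded = base64.b64encode(data).decode("ascii")
    return f"data:{mime_type};base64,{encoded}"


def from_data_url(url: str) -> bytes:
    """Decode a base64 data URL back to bytes (handy for round-trip checks)."""
    header, _, payload = url.partition(",")
    if not header.endswith(";base64"):
        raise ValueError("not a base64 data URL")
    return base64.b64decode(payload)


url = to_data_url(b"PK\x03\x04binary")
print(url)
```

In the browser, the returned string would be passed to `element.setAttribute("href", ...)` as in the function above. Since base64 makes the payload roughly a third larger than the raw bytes, this approach is probably best suited to small and medium-sized files.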

Coming back to the original purpose of this issue, I think we have everything we need to improve the documentation!

What do you think about a dedicated `File Handling` section in the docs under Getting Started? Or do you think it belongs in the How-to section instead?

I am preparing a dedicated blog post in the spirit of the original issue description (local/remote & import/export & non-binary/binary data) that could serve as a base for further discussion.

do-me avatar Oct 03 '22 12:10 do-me

I am closing this for the following reasons:

  • we now offer a fetch(...).bytearray() to solve the conversion issue
  • we have documented how to write, read, upload, download files via latest PyScript
  • binary VS non binary is still a matter of open(..., 'rb') VS open(..., 'r') so I hope we covered it all
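As a small illustration of the last point, here is a sketch in plain Python (the file name and sample content are made up for the example). The same file yields `bytes` in `'rb'` mode and a decoded `str` in `'r'` mode:

```python
import os
import tempfile

# One non-ASCII character, so the byte count and character count differ.
payload = "naïve,1\n".encode("utf-8")

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "sample.csv")

    # Binary write: bytes go to disk unchanged.
    with open(path, "wb") as f:
        f.write(payload)

    # Binary read: returns bytes, no decoding.
    with open(path, "rb") as f:
        as_bytes = f.read()

    # Text read: returns str, decoded with the given encoding.
    with open(path, "r", encoding="utf-8", newline="") as f:
        as_text = f.read()

print(type(as_bytes).__name__, len(as_bytes))
print(type(as_text).__name__, len(as_text))
```

Formats such as xlsx or zip should always be read in `'rb'` mode, because decoding them as text either fails or corrupts the data.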

WebReflection avatar May 06 '24 14:05 WebReflection