Consistent table-like output for PyGMT functions/methods
Originally posted by @weiji14 in https://github.com/GenericMappingTools/pygmt/pull/1284#discussion_r639581006:
Currently, the blockm*/grdtrack modules simply output pandas.DataFrame/filenames (see #1099) automatically depending on the input type (2 options). The implementation @willschlitzer made at bf58dc2 and 68fab7c returns either numpy/pandas/str depending on the user input (3 options). It is best to be consistent across PyGMT on the types of outputs users can expect.
I am opening this issue to find out what the output format/options should be with the addition of x/y/z in blockmean, blockmedian, and grdtrack, as well as for new methods such as grd2xyz.
Edit by @seisman on Oct 9, 2023:
The PyGMT team has reached an agreement that PyGMT functions/methods should have consistent behavior for table-like outputs, i.e., the output depends on the `output_type` parameter.
Valid `output_type` values are:
- `file`: output to a file directly (usually needs `outfile` to be set)
- `numpy`: return a NumPy array (not always possible, because all data in a 2-D NumPy array must share the same dtype)
- `pandas`: return a `pandas.DataFrame` (sometimes needs `newcolname` to be set)
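For illustration, the agreed behavior could be sketched roughly like this (a minimal sketch only: the helper name `to_output_type` and the `ValueError`s are placeholders, not PyGMT's actual implementation):

```python
import numpy as np
import pandas as pd


def to_output_type(df, output_type, outfile=None):
    # Dispatch a pandas.DataFrame result to one of the agreed table-like outputs.
    if output_type == "file":
        if outfile is None:
            raise ValueError("output_type='file' usually needs 'outfile' to be set")
        df.to_csv(outfile, sep="\t", index=False)
        return None
    if output_type == "numpy":
        # May not stay numeric: a 2-D NumPy array needs a single shared dtype.
        return df.to_numpy()
    if output_type == "pandas":
        return df
    raise ValueError("output_type must be 'file', 'numpy', or 'pandas'")
```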
Here is a list of functions/methods that have table-like outputs:
- [x] `blockm*` #3103
- [x] `filter1d` #3085
- [ ] `grdinfo`
- [x] `grdtrack` #3106
- [x] `triangulate` #3107
- [x] `grdvolume` #3102
- [x] `select` #3108
- [ ] `which`
- [x] `grdhisteq` #3109
- [ ] `x2sys_cross`
- [ ] `info`
- [x] `project` #3110
- [x] `grd2xyz` #3097
We need to make sure they behave consistently.
Just wanted to say, this is a similar issue to that in the RAPIDS AI `cuml` library. See https://docs.rapids.ai/api/cuml/0.19/api.html#output-data-type-configuration. Their solution is to:
- By default: 'mirror' the input data format to be the same as the output format as much as possible, i.e. `numpy` in/`numpy` out, or `pandas.DataFrame` in/`pandas.DataFrame` out.
- Override the default behaviour above by having a `using_output_type` context manager. E.g. set `using_output_type=numpy` to return `numpy` regardless of input type.
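As a rough illustration of the 'mirror' idea (the helper name `mirror_output` is made up for this sketch and is not part of cuml or PyGMT):

```python
import numpy as np
import pandas as pd


def mirror_output(result_df, input_data):
    # 'Mirror' the input container: ndarray in -> ndarray out;
    # otherwise keep the pandas.DataFrame default.
    if isinstance(input_data, np.ndarray):
        return result_df.to_numpy()
    return result_df
```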
As you can see, this involves more levels of I/O complexity going from phase 1 to phase 2. I would advocate moving to phase '1' for PyGMT v0.5.0 or v0.6.0, and keeping PyGMT's default output type for table-like outputs as `pandas.DataFrame` (or None if outfile is set) for PyGMT v0.4.0 as per #1099 (i.e. phase '0' if you will). Longer term, as PyGMT matures and integrates more with the PyData stack, we may want to consider mimicking `cuml`'s configurable I/O behaviour.
That said, I'm open to other suggestions, if someone has a good idea on how to deal with I/O in PyGMT.
Thanks for the helpful reference, @weiji14! I like the mirroring behavior and agree that it would be a good milestone for v0.5.0+.
In my opinion, a parameter to control the output type seems a bit simpler for users than a context manager (examples below). The options for `pygmt.using_output_type` may also depend on the specific function/method, which could become difficult to document. The context manager would be useful if users want to easily specify the output type for a block of code. Are there other benefits to implementing the configurable I/O behavior with a context manager rather than with function/method parameters?
Context manager example:

```python
with pygmt.using_output_type("numpy"):
    output = blockmean(table=dataframe, spacing="5m", region=[245, 255, 20, 30])
```

Parameter example:

```python
output = blockmean(table=dataframe, spacing="5m", region=[245, 255, 20, 30], output_format="numpy")
```
You're right that controlling the output type via a parameter (similar to what @willschlitzer did in bf58dc2defc3b6ff67e4769322505d63e4ad941d) would be simpler than a context manager. I don't have a preference for either style, but after doing a bit of digging, it seems like `cuml` had an `output_type` parameter at one point, but they deprecated it in favour of the context manager option in https://github.com/rapidsai/cuml/pull/3040 for some reason (though it still seems to be mentioned in places like at https://github.com/rapidsai/cuml/blob/branch-21.08/wiki/python/ESTIMATOR_GUIDE.md#specifying-the-array-output-type).
Anyways, implementation-wise, I'm wondering if we could have a reusable decorator or helper function that can convert any table-like output to the desired format (be it `pandas.DataFrame` or otherwise) instead of having to have a big block of code for each PyGMT method that returns table-like outputs. I.e. reuse the if-then block from bf58dc2defc3b6ff67e4769322505d63e4ad941d, but have it in a single standardized place.
Yes, I agree that a helper function is a good idea for the implementation of an output format choice.
I think the helper function would be our best bet (although I'm not too strong with writing decorators). Borrowing/stealing entirely from my commits mentioned above, I think something like this could be helpful:
```python
import numpy as np
import pandas as pd

from pygmt.exceptions import GMTInvalidInput


def return_table(result, data_format, format_parameter):
    """Convert whitespace-separated text output to the requested format."""
    if data_format == "s":
        return result  # return the raw string untouched
    # Parse each line of the text output into a row of floats.
    data_list = []
    for string_entry in result.strip().split("\n"):
        float_entry = [float(i) for i in string_entry.strip().split()]
        data_list.append(float_entry)
    data_array = np.array(data_list)
    if data_format == "a":
        result = data_array
    elif data_format == "d":
        result = pd.DataFrame(data_array)
    else:
        raise GMTInvalidInput(f"Must specify {format_parameter} as either a, d, or s.")
    return result
```
I know it's a little redundant accepting the `result` argument only to immediately return it if `data_format` is set as a string, but it avoids the reuse of if statements. Additionally, I put an f-string into the `GMTInvalidInput` error in the case of different format parameter names in different functions (as is the case between #1284 and #1299). I can submit it in a PR if that's the way we agree that this should be handled.
The return_table function looks to be on the right track. We can discuss this in the meeting tomorrow (https://forum.generic-mapping-tools.org/t/pygmt-community-zoom-meeting-tuesday-15-june-2021-utc-1700/1762) if there's time.
In terms of your PRs #1284 and #1299 though @willschlitzer, I'd suggest sticking to a `str` or `pandas.DataFrame` only output for PyGMT v0.4.0 if you want to get those PRs in on time (and not wait until v0.5.0). I.e. follow the code used by `grdtrack` and `blockm*`: https://github.com/GenericMappingTools/pygmt/blob/a10af5d8deb3bf3090eab4b6492bcf8cf722cb71/pygmt/src/grdtrack.py#L273-L281
You're welcome to start a proof of concept PR, but realistically speaking, I don't expect it to be ready for PyGMT v0.4.0. See this blog post by the RAPIDS AI team (https://medium.com/rapids-ai/making-rapids-cudf-io-readers-and-writers-more-reliable-with-fuzz-testing-5d595aa97389) which talks about the complexity of testing different I/O format combinations. Just want to manage expectations a bit :slightly_smiling_face:
@weiji14 @meghanrjones I know this is different from what we discussed at the PyGMT meeting last night, but after giving it some more thought, I would prefer not bringing the table output options for `grd2xyz` and `grdvolume` in line with `grdtrack` for v0.4.0 if the plan is to move towards their current implementation for v0.5.0 (along with the use of a helper function). I don't think `grdvolume` and `grd2xyz` are particularly in-demand modules, so I don't think holding off on them until v0.5.0 is especially problematic. My understanding is that the issue with the NumPy array is that it cannot accept different data types, while a pandas DataFrame and list can, but it seems like all of the table outputs should be able to be converted to floating-point numbers and used in a NumPy array. If I'm wrong on this one, please let me know and I'll make the changes, but I would prefer not to add these modules with the intention of making them backwards incompatible in the next release. I'll make a proof-of-concept pull request as discussed above.
Thanks for the detailed explanation and for submitting the proof-of-concept PR. I understand your points. I'll add some comments on https://github.com/GenericMappingTools/pygmt/pull/1336, but I agree with you that it's not likely we will reach consensus and update the other methods/functions with table output in the next few days. So, it seems https://github.com/GenericMappingTools/pygmt/issues/1318, https://github.com/GenericMappingTools/pygmt/pull/1336, https://github.com/GenericMappingTools/pygmt/pull/1299, and https://github.com/GenericMappingTools/pygmt/pull/1284 will be on track for v0.5.0.
I support Will wanting to merge in https://github.com/GenericMappingTools/pygmt/pull/1299 and https://github.com/GenericMappingTools/pygmt/pull/1284 before addressing this issue to keep things moving. At the same time, I would prefer that the argument format for the `output_type` parameter is final (i.e., will be used for all applicable functions) before merging these PRs to avoid deprecations. Here's the current section:

```
output_type : str
    Determine the format the xyz data will be returned in:
    **a**: numpy array
    **d**: pandas DataFrame [Default option]
    **s**: string
```
When testing out the PR, I found it hard to remember 'a', 'd', and 's' (I kept instinctively trying 'p' for pandas.DataFrame). I would prefer longer names like 'numpy', 'pandas', and 'str'.
Here's my question: how do we decide on the implementation when there are differences in the preferred method between developers? Can we have a vote?
@meghanrjones I'm fine with a vote, but as the developer with the least experience I'm happy to defer to your opinion on this one!
Agree with having longer names. In this case, I would recommend re-using an already existing convention in the PyData ecosystem (as hinted at https://github.com/GenericMappingTools/pygmt/issues/1318#issuecomment-856376339), namely that from `cuml` at https://github.com/rapidsai/cuml/blob/e977f3e4194313a857da96fdd810ee5ef1d742b3/python/cuml/common/memory_utils.py#L346-L374:
Specifically, they have an `output_type` parameter (similar to the `data_format` used by @willschlitzer in #1336) that can be either 'input', 'cudf', 'cupy', 'numpy' (default = 'input'). For PyGMT, I would recommend the following `output_type` names:
- 'input' (default, which mirrors the input data type whenever possible)
- 'numpy' (i.e. `numpy.array`)
- 'pandas' (i.e. `pandas.DataFrame`)
- 'file' (i.e. output to a text file, but we need to tie this with the 'outfile' parameter)

Happy to vote on whether short or long names are preferred. Let's go with :tada: for short (a/d/s) and :rocket: for long (numpy/pandas/file).
@weiji14 No fair, rockets are way cooler so people will vote that way by default!
But seriously, I think you and @meghanrjones are making better points than I am; I think long names are the way to go.
@GenericMappingTools/pygmt-maintainers Has this issue been settled yet? The only two votes are for the long names.
Yes, I think we can settle it for long names.
Yes, I also support the long names.
@willschlitzer, would you mind if I worked on a follow-up to https://github.com/GenericMappingTools/pygmt/pull/1336 with a helper function using the code that you've written for returning different output types depending on `output_type`? The helper function would be useful for me to get the grdhisteq function in good-enough shape for a review before the next release.
@meghanrjones Please do!
I have a pretty general question that relates to this issue. The standardized code for returning tabular output (this issue) or raster data (https://github.com/GenericMappingTools/pygmt/issues/1400) relies on using pandas or xarray to read temporary text or netCDF files respectively. What's the reason for handling the return through temporary files versus creating a second virtual file to hold in-memory output? Probably a question for @seisman @weiji14 or @leouieda.
What's the reason for handling the return through temporary files versus creating a second virtual file to hold in-memory output?
If there is a way to do that (store the GMT-processed output purely in-memory and access it in Python), I'm all ears. My understanding is that the GMT C API is not yet mature enough to have this capability, but I could also be wrong :slightly_smiling_face:
I don't have enough experience with the ctypes library at this point to have the answer to the 'access it in Python' part, but the GMT C API can return the output as an in-memory object through the virtual file mechanism (e.g., https://github.com/GenericMappingTools/gmt/blob/master/src/testapi_spectrum1d.c). It looks like Joaquim's code for GMT.jl uses virtual files for output combined with GMT_Read_VirtualFile, so I would be very surprised if it's not possible for Python. That said, it's clearly more complicated than writing a temporary file and reading it with an xarray function (e.g., building a Julia type grid from a GMT grid).
I think we have reached an agreement (at least partially) about the expected behavior of table-like output. I've revised the top post to better track the progress.
Just noting that Pandas 2.0 has not just NumPy-backed arrays, but also PyArrow(C++)-backed arrays (see https://pandas.pydata.org/docs/whatsnew/v2.0.0.html#argument-dtype-backend-to-return-pyarrow-backed-or-numpy-backed-nullable-dtypes). Also, according to PDEP10, Pandas 3.0 will use `ArrowDtype` instead of `object` dtype for string columns.
Haven't had time yet to work out what this would mean for our implementation of `GMT_DATASET` at #2729, but we might need some matrix tests to check that both NumPy- and PyArrow-backed `pandas.DataFrame`s work.
Now most modules already have consistent behaviors for table-like output. There are still four exceptions: `x2sys_cross` is complicated and needs more work; for `which`/`info`/`grdinfo`, the `file`/`numpy`/`pandas` output types may make no sense, so they need more discussion and will be tracked in separate issues.
Closing the issue. Great to have consistent table-like output behavior across the whole project!
Sounds great! Thanks @seisman for all your efforts on this!