
Consistent table-like output for PyGMT functions/methods

maxrjones opened this issue 3 years ago · 22 comments

Originally posted by @weiji14 in https://github.com/GenericMappingTools/pygmt/pull/1284#discussion_r639581006:

Currently, the blockm*/grdtrack modules simply output a pandas.DataFrame or filename (see #1099) automatically depending on the input type (2 options). The implementation @willschlitzer made at bf58dc2 and 68fab7c returns either numpy/pandas/str depending on the user input (3 options). Best to be consistent across PyGMT on the types of outputs users can expect.

I am opening this issue to find out what the output format/options should be with the addition of x/y/z in blockmean, blockmedian, and grdtrack, as well as for new methods such as grd2xyz.


Edit by @seisman on Oct 9, 2023:

The PyGMT team has reached an agreement that PyGMT functions/methods should have consistent behavior for table-like outputs, i.e., the output depends on the output_type parameter.

Valid output_type values are:

  • file: output to a file directly (usually need to set outfile)
  • numpy: return a NumPy array (may not always be possible, because all data must share the same dtype in a 2-D numpy array)
  • pandas: return a pandas.DataFrame (sometimes need to set newcolname)

Here is a list of functions/methods that have table-like outputs:

  • [x] blockm* #3103
  • [x] filter1d #3085
  • [ ] grdinfo
  • [x] grdtrack #3106
  • [x] triangulate #3107
  • [x] grdvolume #3102
  • [x] select #3108
  • [ ] which
  • [x] grdhisteq #3109
  • [ ] x2sys_cross
  • [ ] info
  • [x] project #3110
  • [x] grd2xyz #3097

We need to make sure they behave consistently.

maxrjones avatar Jun 04 '21 16:06 maxrjones

Just wanted to say, this is a similar issue to that in the RAPIDS AI cuml library. See https://docs.rapids.ai/api/cuml/0.19/api.html#output-data-type-configuration. Their solution is to:

  1. By default: 'mirror' the input data format, i.e. numpy in/numpy out, or pandas.DataFrame in/pandas.DataFrame out.
  2. Override the default behaviour above via a using_output_type context manager, e.g. using_output_type("numpy") to return numpy regardless of input type.

As you can see, this involves more levels of I/O complexity going from phase 1 to phase 2. I would advocate moving to phase '1' for PyGMT v0.5.0 or v0.6.0, and keeping PyGMT's default output type for table-like outputs as pandas.DataFrame (or None if outfile is set) for PyGMT v0.4.0 as per #1099 (i.e. phase '0', if you will). Longer term, as PyGMT matures and integrates more with the PyData stack, we may want to consider mimicking cuml's configurable I/O behaviour.

That said, I'm open to other suggestions, if someone has a good idea on how to deal with I/O in PyGMT.

weiji14 avatar Jun 06 '21 00:06 weiji14

Thanks for the helpful reference, @weiji14! I like the mirroring behavior and agree that it would be a good milestone for v0.5.0+.

In my opinion, a parameter to control the output type seems a bit simpler for users than a context manager (examples below). The options for pygmt.using_output_type may also depend on the specific function/method, which could become difficult to document. The context manager would be useful if users want to easily specify the output type for a block of code. Are there other benefits to implementing the configurable I/O behavior with a context manager rather than with function/method parameters?

Context manager example:

with pygmt.using_output_type("numpy"):
    output = blockmean(table=dataframe, spacing="5m", region=[245, 255, 20, 30])

Parameter example:

output = blockmean(table=dataframe, spacing="5m", region=[245, 255, 20, 30], output_format="numpy")

maxrjones avatar Jun 08 '21 01:06 maxrjones

You're right that controlling the output type via a parameter (similar to what @willschlitzer did in bf58dc2defc3b6ff67e4769322505d63e4ad941d) would be simpler than a context manager. I don't have a preference for either style, but after doing a bit of digging, it seems like cuml had an output_type parameter at one point, but they deprecated it in favour of the context manager option in https://github.com/rapidsai/cuml/pull/3040 for some reason (though it still seems to be mentioned in places like at https://github.com/rapidsai/cuml/blob/branch-21.08/wiki/python/ESTIMATOR_GUIDE.md#specifying-the-array-output-type).

Anyways, implementation-wise, I'm wondering if we could have a reusable decorator or helper function that can convert any table-like output to the desired format (be it pandas.DataFrame or otherwise) instead of having to have a big block of code for each PyGMT method that returns table-like outputs. I.e. reuse the if-then block from bf58dc2defc3b6ff67e4769322505d63e4ad941d, but have it in a single standardized place.
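A reusable decorator along those lines might look like this (a sketch under the assumption that each wrapped module produces a pandas.DataFrame; `returns_table` and `fake_module` are made-up names, not PyGMT code):

```python
import functools

import numpy as np
import pandas as pd


def returns_table(func):
    """Hypothetical decorator: convert a table-returning function's
    pandas.DataFrame result according to an output_type keyword."""

    @functools.wraps(func)
    def wrapper(*args, output_type="pandas", **kwargs):
        result = func(*args, **kwargs)  # assumed to return a DataFrame
        if output_type == "pandas":
            return result
        if output_type == "numpy":
            return result.to_numpy()
        raise ValueError(f"Invalid output_type: {output_type!r}.")

    return wrapper


@returns_table
def fake_module():
    # Stand-in for a wrapped GMT module that produces tabular output.
    return pd.DataFrame({"x": [1.0], "y": [2.0]})
```

The conversion logic then lives in one standardized place, and each module only needs the decorator.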

weiji14 avatar Jun 08 '21 01:06 weiji14

Anyways, implementation-wise, I'm wondering if we could have a reusable decorator or helper function that can convert any table-like output to the desired format (be it pandas.DataFrame or otherwise) instead of having to have a big block of code for each PyGMT method that returns table-like outputs. I.e. reuse the if-then block from bf58dc2, but have it in a single standardized place.

Yes, I agree that a helper function is a good idea for the implementation of an output format choice.

maxrjones avatar Jun 10 '21 18:06 maxrjones

I think the helper function would be our best bet (although I'm not too strong with writing decorators). Borrowing/stealing entirely from my commits mentioned above, I think something like this could be helpful:

import numpy as np
import pandas as pd

from pygmt.exceptions import GMTInvalidInput


def return_table(result, data_format, format_parameter):
    if data_format == "s":
        return result
    data_list = [
        [float(item) for item in line.split()]
        for line in result.strip().split("\n")
    ]
    data_array = np.array(data_list)
    if data_format == "a":
        return data_array
    if data_format == "d":
        return pd.DataFrame(data_array)
    raise GMTInvalidInput(
        f"Must specify {format_parameter} as either 'a', 'd', or 's'."
    )

I know it's a little redundant accepting the result argument only to immediately return it if data_format is set as a string, but it avoids the reuse of if statements. Additionally, I put an f-string into the GMTInvalidInput error in the case of different format parameter names in different functions (as is the case between #1284 and #1299). I can submit it in a PR if that's the way we agree that this should be handled.

willschlitzer avatar Jun 15 '21 10:06 willschlitzer

The return_table function looks to be on the right track. We can discuss this in the meeting tomorrow (https://forum.generic-mapping-tools.org/t/pygmt-community-zoom-meeting-tuesday-15-june-2021-utc-1700/1762) if there's time.

In terms of your PRs #1284 and #1299 though @willschlitzer, I'd suggest sticking to a str or pandas.DataFrame only output for PyGMT v0.4.0 if you want to get those PRs in on time (and not wait until v0.5.0). I.e. follow the code used by grdtrack and blockm*: https://github.com/GenericMappingTools/pygmt/blob/a10af5d8deb3bf3090eab4b6492bcf8cf722cb71/pygmt/src/grdtrack.py#L273-L281

I know it's a little redundant accepting the result argument only to immediately return it if data_format is set as a string, but it avoids the reuse of if statements. Additionally, I put an f-string into the GMTInvalidInput error in the case of different format parameter names in different functions (as is the case between #1284 and #1299). I can submit it in a PR if that's the way we agree that this should be handled.

You're welcome to start a proof of concept PR, but realistically speaking, I don't expect it to be ready for PyGMT v0.4.0. See this blog post by the RAPIDS AI team (https://medium.com/rapids-ai/making-rapids-cudf-io-readers-and-writers-more-reliable-with-fuzz-testing-5d595aa97389) which talks about the complexity of testing different I/O format combinations. Just want to manage expectations a bit :slightly_smiling_face:

weiji14 avatar Jun 15 '21 11:06 weiji14

@weiji14 @meghanrjones I know this is different than what we discussed at the PyGMT meeting last night, but after giving it some more thought, I would prefer not to bring the table output options for grd2xyz and grdvolume in line with grdtrack for v0.4.0 if the plan is to move towards their current implementation for v0.5.0 (along with the use of a helper function). I don't think grdvolume and grd2xyz are particularly in-demand modules, so I don't think holding off on them until v0.5.0 is especially problematic. My understanding is that the issue with the numpy array is that it cannot hold different data types, while a pandas DataFrame and list can, but it seems like all of the table outputs should be convertible to floating-point numbers and therefore usable in a numpy array. If I'm wrong on this one, please let me know and I'll make the changes, but I would prefer not to add these modules with the intention of making them backwards incompatible in the next release. I'll make a proof-of-concept pull request as discussed above.

willschlitzer avatar Jun 16 '21 06:06 willschlitzer

@weiji14 @meghanrjones I know this is different than what we discussed at the PyGMT meeting last night, but after giving it some more thought, I would prefer not to bring the table output options for grd2xyz and grdvolume in line with grdtrack for v0.4.0 if the plan is to move towards their current implementation for v0.5.0 (along with the use of a helper function). I don't think grdvolume and grd2xyz are particularly in-demand modules, so I don't think holding off on them until v0.5.0 is especially problematic. My understanding is that the issue with the numpy array is that it cannot hold different data types, while a pandas DataFrame and list can, but it seems like all of the table outputs should be convertible to floating-point numbers and therefore usable in a numpy array. If I'm wrong on this one, please let me know and I'll make the changes, but I would prefer not to add these modules with the intention of making them backwards incompatible in the next release. I'll make a proof-of-concept pull request as discussed above.

Thanks for the detailed explanation and for submitting the proof-of-concept PR. I understand your points. I'll add some comments on https://github.com/GenericMappingTools/pygmt/pull/1336, but I agree with you that it's not likely we will reach consensus and update the other methods/functions with table output in the next few days. So, it seems https://github.com/GenericMappingTools/pygmt/issues/1318, https://github.com/GenericMappingTools/pygmt/pull/1336, https://github.com/GenericMappingTools/pygmt/pull/1299, and https://github.com/GenericMappingTools/pygmt/pull/1284 will be on track for v0.5.0.

maxrjones avatar Jun 17 '21 15:06 maxrjones

I support Will wanting to merge in https://github.com/GenericMappingTools/pygmt/pull/1299 and https://github.com/GenericMappingTools/pygmt/pull/1284 before addressing this issue to keep things moving. At the same time, I would prefer that the argument format for the output_type parameter is final (i.e., will be used for all applicable functions) before merging these PRs to avoid deprecations. Here's the current section:

    output_type : str
        Determine the format the xyz data will be returned in:
            **a**: numpy array
            **d**: pandas DataFrame [Default option]
            **s**: string

When testing out the PR, I found it hard to remember 'a', 'd', and 's' (I kept instinctively trying 'p' for pandas.DataFrame). I would prefer longer names like 'numpy', 'pandas', and 'str'.

Here's my question - how do we decide on the implementation when there's differences in the preferred method between developers? Can we have a vote?

maxrjones avatar Aug 10 '21 14:08 maxrjones

When testing out the PR, I found it hard to remember 'a', 'd', and 's' (I kept instinctively trying 'p' for pandas.DataFrame). I would prefer longer names like 'numpy', 'pandas', and 'str'.

Here's my question - how do we decide on the implementation when there's differences in the preferred method between developers? Can we have a vote?

@meghanrjones I'm fine with a vote, but as the developer with the least experience I'm happy to defer to your opinion on this one!

willschlitzer avatar Aug 10 '21 19:08 willschlitzer

When testing out the PR, I found it hard to remember 'a', 'd', and 's' (I kept instinctively trying 'p' for pandas.DataFrame). I would prefer longer names like 'numpy', 'pandas', and 'str'. Here's my question - how do we decide on the implementation when there's differences in the preferred method between developers? Can we have a vote?

@meghanrjones I'm fine with a vote, but as the developer with the least experience I'm happy to defer to your opinion on this one!

Agree with having longer names. In this case, I would recommend re-using an already existing convention in the PyData ecosystem (as hinted at https://github.com/GenericMappingTools/pygmt/issues/1318#issuecomment-856376339), namely the one from cuml at https://github.com/rapidsai/cuml/blob/e977f3e4194313a857da96fdd810ee5ef1d742b3/python/cuml/common/memory_utils.py#L346-L374:

Specifically, they have an output_type parameter (similar to the data_format used by @willschlitzer in #1336) that can be either 'input', 'cudf', 'cupy', or 'numpy' (default = 'input'). For PyGMT, I would recommend the following output_type names:

  • 'input' (default, which mirrors the input data type whenever possible)
  • 'numpy' (i.e. numpy.array)
  • 'pandas' (i.e. pandas.DataFrame)
  • 'file' (i.e. output to a text file, but we need to tie this with the 'outfile' parameter)
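The 'input' default could be resolved with a small helper like this (a hypothetical sketch of the mirroring rule; `resolve_output_type` is an illustrative name, and the fallback for filename input is an assumption):

```python
import numpy as np
import pandas as pd


def resolve_output_type(data, output_type="input"):
    """Resolve output_type='input' by mirroring the input container type."""
    if output_type != "input":
        # An explicit choice ('numpy', 'pandas', 'file') wins outright.
        return output_type
    if isinstance(data, pd.DataFrame):
        return "pandas"
    if isinstance(data, np.ndarray):
        return "numpy"
    # e.g. a filename string: no in-memory type to mirror, so fall back
    # to the pandas.DataFrame default.
    return "pandas"
```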

Happy to vote on whether short or long names is preferred. Let's go with :tada: for short (a/d/s) and :rocket: for long (numpy/pandas/file).

weiji14 avatar Aug 10 '21 21:08 weiji14

@weiji14 No fair, rockets are way cooler so people will vote that way by default!

But seriously, I think you and @meghanrjones are making better points than I am; I think long names are the way to go.

willschlitzer avatar Aug 11 '21 05:08 willschlitzer

@GenericMappingTools/pygmt-maintainers Has this issue been settled yet? The only two votes are for the long names.

willschlitzer avatar Sep 13 '21 10:09 willschlitzer

@GenericMappingTools/pygmt-maintainers Has this issue been settled yet? The only two votes are for the long names.

Yes, I think we can settle it for long names.

maxrjones avatar Sep 13 '21 11:09 maxrjones

Yes, I also support the long names.

seisman avatar Sep 13 '21 12:09 seisman

@willschlitzer, would you mind if I worked on a follow-up to https://github.com/GenericMappingTools/pygmt/pull/1336 with a helper function using the code that you've written for returning different output types depending on output_type? The helper function would be useful for me to get the grdhisteq function in good-enough shape for a review before the next release.

maxrjones avatar Dec 28 '21 17:12 maxrjones

@meghanrjones Please do!

willschlitzer avatar Dec 28 '21 18:12 willschlitzer

I have a pretty general question that relates to this issue. The standardized code for returning tabular output (this issue) or raster data (https://github.com/GenericMappingTools/pygmt/issues/1400) relies on using pandas or xarray to read temporary text or netCDF files respectively. What's the reason for handling the return through temporary files versus creating a second virtual file to hold in-memory output? Probably a question for @seisman @weiji14 or @leouieda.

maxrjones avatar Jan 07 '22 18:01 maxrjones

What's the reason for handling the return through temporary files versus creating a second virtual file to hold in-memory output?

If there is a way to do that (store the GMT-processed output purely in-memory and access it in Python), I'm all ears. My understanding is that the GMT C API is not yet mature enough to have this capability, but I could also be wrong :slightly_smiling_face:

weiji14 avatar Jan 08 '22 02:01 weiji14

What's the reason for handling the return through temporary files versus creating a second virtual file to hold in-memory output?

If there is a way to do that (store the GMT-processed output purely in-memory and access it in Python), I'm all ears. My understanding is that the GMT C API is not yet mature enough to have this capability, but I could also be wrong 🙂

I don't have enough experience with the ctypes library at this point to answer the 'access it in Python' part, but the GMT C API can return the output as an in-memory object through the virtual file mechanism (e.g., https://github.com/GenericMappingTools/gmt/blob/master/src/testapi_spectrum1d.c). It looks like Joaquim's code for GMT.jl uses virtual files for output combined with GMT_Read_VirtualFile, so I would be very surprised if it's not possible for Python. That said, it's clearly more complicated than writing a temporary file and reading it with an xarray function (e.g., building a Julia-type grid from a GMT grid).

maxrjones avatar Jan 10 '22 16:01 maxrjones

I think we have reached an agreement (at least partially) about the expected behavior of table-like output. I've revised the top post to better track the progress.

seisman avatar Oct 09 '23 15:10 seisman

Just noting that Pandas 2.0 has not just NumPy-backed arrays, but also PyArrow(C++)-backed arrays (see https://pandas.pydata.org/docs/whatsnew/v2.0.0.html#argument-dtype-backend-to-return-pyarrow-backed-or-numpy-backed-nullable-dtypes). Also, according to PDEP-10, Pandas 3.0 will use ArrowDtype instead of object dtype for string columns.

Haven't had time yet to work out what this would mean for our implementation of GMT_DATASET at #2729, but we might need some matrix tests to check that both NumPy and PyArrow-backed pandas.DataFrames work.

weiji14 avatar Oct 10 '23 22:10 weiji14

Now most modules already have consistent behavior for table-like output. There are still four exceptions: x2sys_cross is complicated and needs more work; for which/info/grdinfo, the file/numpy/pandas output types may make no sense, so they need more discussion and will be tracked in separate issues.

Closing the issue. Great to have consistent table-like output behavior across the whole project!

seisman avatar Apr 08 '24 05:04 seisman

Closing the issue. Great to have consistent table-like output behavior across the whole project!

Sounds great! Thanks @seisman for all your efforts on this!

yvonnefroehlich avatar Apr 09 '24 16:04 yvonnefroehlich