
Tracking issue: dataframe protocol implementation

Open rgommers opened this issue 4 years ago • 0 comments

The bulk of the dataframe interchange protocol was done in gh-38. There were still a number of TODOs, however, and more will likely pop up once we have multiple implementations and can actually turn one type of dataframe into another. This is the tracking issue for those TODOs and issues:

  • [ ] Categorical dtypes: we should allow having null as a category; it should not have a specified meaning, it's just another category that should (e.g.) roundtrip correctly. See conversation in 8 Apr meeting.
  • [ ] Categorical dtypes: should they be a dtype in themselves, or should they be a part of the dtype tuple? Currently dtype is (kind, bitwidth, format_str, endianness), with categorical being a value of the kind enum. Is making a 5th element in the dtype, with that element being another dtype 4-tuple, thereby allowing for nesting, sensible?
  • [x] Add a metadata attribute that can be used to store library-specific things. For example, Vaex should be able to store expressions for its virtual columns there. See PR gh-43
  • [x] Add a flag to throw an exception if the export cannot be zero-copy (e.g. for pandas, a copy may be needed because of the block manager, where rows are contiguous and columns are not; add a test for that). See PR gh-44
  • [x] Add a string dtype, with variable-length strings implemented with the same scheme as Arrow uses (an offsets and a data buffer, see https://github.com/data-apis/dataframe-api/pull/38#discussion_r609818874). See PR gh-45
  • [x] Signature of the from_dataframe protocol? See https://github.com/data-apis/dataframe-api/issues/42 and meeting of 20 May.
  • [x] What can be reused between implementations in different libraries, and can/should we have a reference implementation? --> question needs answering somewhere.
  • [ ] What is the ownership for buffers, who owns the memory? This should be clearly spelled out in the docs. An owner attribute is perhaps needed. See meeting minutes 4 March, https://github.com/data-apis/dataframe-api/issues/39, and comments on this PR.
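
For the string dtype item above, here is a minimal sketch of the offsets-plus-data layout that Arrow uses for variable-length strings. It is only an illustration in plain Python, not the protocol's implementation: the helper names `encode_strings` and `decode_strings` are hypothetical, plain `bytes` and `list` objects stand in for real buffers, and null handling (a separate validity mask in Arrow) is left out.

```python
from typing import List, Tuple


def encode_strings(values: List[str]) -> Tuple[bytes, List[int]]:
    """Pack strings into one UTF-8 data buffer plus a list of offsets.

    Hypothetical helper: string i occupies data[offsets[i]:offsets[i + 1]],
    so the offsets list always has len(values) + 1 entries.
    """
    data = bytearray()
    offsets = [0]
    for value in values:
        data.extend(value.encode("utf-8"))
        # The end of this string is also the start of the next one.
        offsets.append(len(data))
    return bytes(data), offsets


def decode_strings(data: bytes, offsets: List[int]) -> List[str]:
    """Rebuild the strings by slicing the data buffer at consecutive offsets."""
    return [
        data[start:stop].decode("utf-8")
        for start, stop in zip(offsets, offsets[1:])
    ]


data, offsets = encode_strings(["a", "", "héllo"])
roundtrip = decode_strings(data, offsets)
```

Offsets count bytes, not characters, so a multi-byte character such as "é" advances the offset by two. An empty string shows up as two equal consecutive offsets.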

rgommers · Jun 25 '21 20:06