pandas icon indicating copy to clipboard operation
pandas copied to clipboard

BUG: Add np.uintc to _factorizers in merge.py to fix KeyError when merging DataFrames with uintc columns

Open Tirthchoksi22 opened this issue 1 year ago • 30 comments

  • [x] closes #58713 (Replace xxxx with the GitHub issue number)
  • [x] Tests added and passed if fixing a bug or adding a new feature
  • [x] All code checks passed.
  • [x] Added type annotations to new arguments/methods/functions.
  • [x] Added an entry in the latest doc/source/whatsnew/vX.X.X.rst file if fixing a bug or adding a new feature.

Tirthchoksi22 avatar May 15 '24 07:05 Tirthchoksi22

hiii @myles the PR is ready to merge

Tirthchoksi22 avatar May 15 '24 09:05 Tirthchoksi22

thanks @Tirthchoksi22

the same issue was fixed for np.intc in #53175.

This is only for Windows and is it a regression from a previous release?

if so could probably use the previous fix/tests/release note for this PR?

simonjayhawkins avatar May 15 '24 09:05 simonjayhawkins

Yes, it appears that there is indeed a regression on Windows machines. The issue seems to have resurfaced after being resolved in a previous release.

thanks @Tirthchoksi22

the same issue was fixed for np.intc in #53175.

This is only for Windows and is it a regression from a previous release?

if so could probably use the previous fix/tests/release note for this PR?

Tirthchoksi22 avatar May 15 '24 09:05 Tirthchoksi22

@simonjayhawkins After reviewing the code and considering the previous fix, it appears that the solution implemented in this pull request is indeed identical to the one used in the previous resolution of the regression.

thanks @Tirthchoksi22

the same issue was fixed for np.intc in #53175.

This is only for Windows and is it a regression from a previous release?

if so could probably use the previous fix/tests/release note for this PR?

Tirthchoksi22 avatar May 15 '24 09:05 Tirthchoksi22

@simonjayhawkins Also guide me what to do next do i have to create new PR with previous fix/tests/release or this is ok ??as this would be my first Open Source Contribution

Tirthchoksi22 avatar May 15 '24 09:05 Tirthchoksi22

you can make changes to you branch and push those changes to update this PR. No need to close and open a new PR.

All bug fixes and regression fixes need a test to ensure that the issue does not resurface here. In this case, looking at the fix for np.intc the parameterization of test_join_multi_dtypes was updated. Perhaps could do the same here?

For the release note, if you (I can't check a windows only bug easily) determine for which pandas release it was a regression, or if that is many releases ago, then we maybe need a note either under the bug fixes or regression section of the release notes. If a bug fix would go in the 3.0 what's new. If a regression, would go in the 2.2.x release notes (but that depends if we are doing another patch release)

cc @lithomas1

simonjayhawkins avatar May 15 '24 10:05 simonjayhawkins

@simonjayhawkins Thank you for your feedback. Upon further review, I realized that the changes made to the merge.py file were minimal and consisted of adding just one extra line (np.uintc: libhashtable.UInt32Factorizer) to the _factorizers dictionary. The majority of the changes were indeed in the test file, where I added tests to ensure proper handling of numpy.uintc columns.

Additionally, I'd like to confirm that the release note has already been included in the bug fixes and regression section, documenting the resolution of the issue.

Given this information, I believe that the changes made align with the expectations outlined in your feedback. Please let me know if there are any further adjustments or considerations needed.

Thank you for your continued guidance and support throughout this process.

Tirthchoksi22 avatar May 15 '24 10:05 Tirthchoksi22

ci is failing,

/pre-commit

simonjayhawkins avatar May 15 '24 11:05 simonjayhawkins

ci is failing,

/pre-commit

so what to do now ??

Tirthchoksi22 avatar May 15 '24 11:05 Tirthchoksi22

I thought that comment should have triggered the bot.

simonjayhawkins avatar May 15 '24 11:05 simonjayhawkins

/pre-commit

simonjayhawkins avatar May 15 '24 11:05 simonjayhawkins

@github-actions pre-commit

simonjayhawkins avatar May 15 '24 11:05 simonjayhawkins

/pre-commit

I dont think there's a command like this

Tirthchoksi22 avatar May 15 '24 11:05 Tirthchoksi22

pre-commit.ci autofix

simonjayhawkins avatar May 15 '24 11:05 simonjayhawkins

@simonjayhawkins Could you kindly confirm if the changes proposed in this pull request are satisfactory, or if there are any additional adjustments needed before merging?

Tirthchoksi22 avatar May 15 '24 12:05 Tirthchoksi22

pre-commit.ci autofix

Tirthchoksi22 avatar May 15 '24 13:05 Tirthchoksi22

I was browsing recent pull requests and have a Windows machine so thought I would give this a quick test for you. It looks like the difference is that on Windows np.uintc is not aliased to any of the types already in the _factorizers dictionary:

Python 3.12.3 (tags/v3.12.3:f6650f9, Apr  9 2024, 14:05:25) [MSC v.1938 64 bit (AMD64)] on win32
Type "help", "copyright", "credits" or "license" for more information.
>>> import numpy as np
>>> np.uintc is np.int64
False
>>> np.uintc is np.longlong
False
>>> np.uintc is np.int32
False
>>> np.uintc is np.int16
False
>>> np.uintc is np.int8
False
>>> np.uintc is np.uint64
False
>>> np.uintc is np.uint32
False
>>> np.uintc is np.uint16
False
>>> np.uintc is np.uint8
False
>>> np.uintc is np.bool_
False
>>> np.uintc is np.float64
False
>>> np.uintc is np.float32
False
>>> np.uintc is np.complex64
False
>>> np.uintc is np.complex128
False
>>> np.uintc is np.object_
False

Whereas at least in Linux it is aliased to the np.uint32 type:

Python 3.12.3 (main, Apr 10 2024, 05:33:47) [GCC 13.2.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import numpy as np
>>> np.uintc is np.int64
False
>>> np.uintc is np.longlong
False
>>> np.uintc is np.int32
False
>>> np.uintc is np.int16
False
>>> np.uintc is np.int8
False
>>> np.uintc is np.uint64
False
>>> np.uintc is np.uint32
True
>>> np.uintc is np.uint16
False
>>> np.uintc is np.uint8
False
>>> np.uintc is np.bool_
False
>>> np.uintc is np.float64
False
>>> np.uintc is np.float32
False
>>> np.uintc is np.complex64
False
>>> np.uintc is np.complex128
False
>>> np.uintc is np.object_
False

This is probably related to Windows using the LLP64 64-bit data model vs LP64 used elsewhere (https://en.wikipedia.org/wiki/64-bit_computing#64-bit_data_models). https://stackoverflow.com/questions/76155091/np-uint32-np-uintc-on-windows gives a possible explanation.

cmjcharlton avatar May 15 '24 13:05 cmjcharlton

I was browsing recent pull requests and have a Windows machine so thought I would give this a quick test for you. It looks like the difference is that on Windows np.uintc is not aliased to any of the types already in the _factorizers dictionary:

Python 3.12.3 (tags/v3.12.3:f6650f9, Apr  9 2024, 14:05:25) [MSC v.1938 64 bit (AMD64)] on win32
Type "help", "copyright", "credits" or "license" for more information.
>>> import numpy as np
>>> np.uintc is np.int64
False
>>> np.uintc is np.longlong
False
>>> np.uintc is np.int32
False
>>> np.uintc is np.int16
False
>>> np.uintc is np.int8
False
>>> np.uintc is np.uint64
False
>>> np.uintc is np.uint32
False
>>> np.uintc is np.uint16
False
>>> np.uintc is np.uint8
False
>>> np.uintc is np.bool_
False
>>> np.uintc is np.float64
False
>>> np.uintc is np.float32
False
>>> np.uintc is np.complex64
False
>>> np.uintc is np.complex128
False
>>> np.uintc is np.object_
False

Whereas at least in Linux it is aliased to the np.uint32 type:

Python 3.12.3 (main, Apr 10 2024, 05:33:47) [GCC 13.2.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import numpy as np
>>> np.uintc is np.int64
False
>>> np.uintc is np.longlong
False
>>> np.uintc is np.int32
False
>>> np.uintc is np.int16
False
>>> np.uintc is np.int8
False
>>> np.uintc is np.uint64
False
>>> np.uintc is np.uint32
True
>>> np.uintc is np.uint16
False
>>> np.uintc is np.uint8
False
>>> np.uintc is np.bool_
False
>>> np.uintc is np.float64
False
>>> np.uintc is np.float32
False
>>> np.uintc is np.complex64
False
>>> np.uintc is np.complex128
False
>>> np.uintc is np.object_
False

This is probably related to Windows using the LLP64 64-bit data model vs LP64 used elsewhere (https://en.wikipedia.org/wiki/64-bit_computing#64-bit_data_models). https://stackoverflow.com/questions/76155091/np-uint32-np-uintc-on-windows gives a possible explanation.

thanks for the explanation mate it helps a lot

Tirthchoksi22 avatar May 15 '24 14:05 Tirthchoksi22

pre-commit.ci autofix

Tirthchoksi22 avatar May 15 '24 15:05 Tirthchoksi22

@simonjayhawkins your requested changes are done and this PR is ready to merge now

Tirthchoksi22 avatar May 16 '24 06:05 Tirthchoksi22

pre-commit.ci autofix

Tirthchoksi22 avatar May 16 '24 17:05 Tirthchoksi22

pre-commit.ci autofix

Tirthchoksi22 avatar May 16 '24 17:05 Tirthchoksi22

@Tirthchoksi22 on a procedural note, the pre-commit.ci autofix is useful for maintainers to correct the code. As the PR author, you can install pre-commit in your local dev environment to check the code changes you make before pushing them to the PR.

simonjayhawkins avatar May 16 '24 17:05 simonjayhawkins

@Tirthchoksi22 on a procedural note, the pre-commit.ci autofix is useful for maintainers to correct the code. As the PR author, you can install pre-commit in your local dev environment to check the code changes you make before pushing them to the PR.

ok sorry I didn't know about that

Tirthchoksi22 avatar May 16 '24 17:05 Tirthchoksi22

Can you please use that comment now because i didnt have install pre-commit in my local environment right now

Tirthchoksi22 avatar May 16 '24 17:05 Tirthchoksi22

Can you please use that comment now because i didnt have install pre-commit in my local environment right now

feel free to issue that command yourself. I was not suggesting that you could not use it. (However, if you look at the pandas documentation, you will see a section for setting up a development environment)

simonjayhawkins avatar May 16 '24 17:05 simonjayhawkins

pre-commit.ci autofix

Tirthchoksi22 avatar May 16 '24 18:05 Tirthchoksi22

pre-commit.ci autofix

Tirthchoksi22 avatar May 16 '24 19:05 Tirthchoksi22

@simonjayhawkins @WillAyd @mroeschke Guys any update ???

Tirthchoksi22 avatar May 17 '24 17:05 Tirthchoksi22

@simonjayhawkins ??

Tirthchoksi22 avatar May 22 '24 14:05 Tirthchoksi22