BUG: large pivot_table has incorrect output with Python 3.14

Open joshuanapoli opened this issue 4 weeks ago • 1 comments

Pandas version checks

  • [x] I have checked that this issue has not already been reported.

  • [x] I have confirmed this bug exists on the latest version of pandas.

  • [x] I have confirmed this bug exists on the main branch of pandas.

Reproducible Example

import sys

import pandas as pd

print(f"Python version: {sys.version}")
print(f"pandas version: {pd.__version__}")
print()

num_indices = 100000  # OK with 10,000; fails with 100,000
metrics = [
    "apple",
    "banana",
]

data_rows = []
for idx in range(num_indices):
    data_rows.append({"idx": idx, "metric": "apple", "value": 2 * idx})
    data_rows.append({"idx": idx, "metric": "banana", "value": 3 * idx})
    data_rows.append({"idx": idx, "metric": "coconut", "value": 4 * idx})

df = pd.DataFrame(data_rows)

print(f"Generated dataset: {len(df):,} rows")
print(f"Expected rows after pivot: {num_indices:,}")
print()

print("Pivoting data...")
pivoted = df.pivot_table(
    index=["idx"],
    columns="metric",
    values="value",
    aggfunc="first",
)

print("After pivot:")
print(f"  Total rows: {len(pivoted):,}")
print(f"  Unique indices: {pivoted.index.nunique():,}")
print(f"  Has duplicate indices: {pivoted.index.duplicated().any()}")

if pivoted.index.duplicated().any():
    print("  BUG: DUPLICATE INDICES")
    print()
    print("Example duplicates:")
    dup_indices = pivoted.index[pivoted.index.duplicated(keep=False)]
    for idx in dup_indices.unique()[:3]:
        print(pivoted.loc[idx])
        print()
else:
    print()
    print("OK")

status = 0 if not pivoted.index.duplicated().any() else 1
sys.exit(status)

Issue Description

With Python 3.14, the pivot_table function gives a corrupted output when the input is large. On smaller input (fewer rows or columns), the output is correct. The example code shows duplicated index values. In my production application, I see both missing output rows and duplicated index values.

With Python 3.13, the pivot_table function always gives a correct output.

I'm testing on pandas 2.3.3 and 3.0.0rc0+13.g8be8439bce.

Here is the failing output from the test program:

joshuanapoli@mac cvec-data-analysis % poetry run python pandas_bug_report.py
Python version: 3.14.2 (main, Dec  5 2025, 16:49:16) [Clang 17.0.0 (clang-1700.4.4.1)]
pandas version: 3.0.0rc0+13.g8be8439bce

Generated dataset: 300,000 rows
Expected rows after pivot: 100,000

Pivoting data...
After pivot:
  Total rows: 100,000
  Unique indices: 33,334
  Has duplicate indices: True
  BUG: DUPLICATE INDICES

Example duplicates:
metric  apple  banana  coconut
idx
1           2       3        4
1           4       6        8
1           6       9       12

metric  apple  banana  coconut
idx
2           8      12       16
2          10      15       20
2          12      18       24

metric  apple  banana  coconut
idx
3          14      21       28
3          16      24       32
3          18      27       36

Expected Behavior

Python version: 3.13.3 (main, Apr 8 2025, 13:54:08) [Clang 16.0.0 (clang-1600.0.26.6)]
pandas version: 3.0.0rc0+13.g8be8439bce

Generated dataset: 300,000 rows
Expected rows after pivot: 100,000

Pivoting data...
After pivot:
  Total rows: 100,000
  Unique indices: 100,000
  Has duplicate indices: False

OK

Installed Versions

INSTALLED VERSIONS

commit : 8be8439bce89c30f7ebf4db7a01bf79143e6bcae
python : 3.14.2
python-bits : 64
OS : Darwin
OS-release : 25.1.0
Version : Darwin Kernel Version 25.1.0: Mon Oct 20 19:34:05 PDT 2025; root:xnu-12377.41.6~2/RELEASE_ARM64_T6041
machine : arm64
processor : arm
byteorder : little
LC_ALL : None
LANG : C.UTF-8
LOCALE : C.UTF-8

pandas : 3.0.0rc0+13.g8be8439bce
numpy : 1.26.4
dateutil : 2.9.0.post0
pip : 25.0.1
Cython : None
sphinx : None
IPython : None
adbc-driver-postgresql: None
adbc-driver-sqlite : None
bs4 : None
bottleneck : None
fastparquet : None
fsspec : None
html5lib : None
hypothesis : None
gcsfs : None
jinja2 : None
lxml.etree : None
matplotlib : 3.10.7
numba : None
numexpr : None
odfpy : None
openpyxl : 3.1.5
psycopg2 : None
pymysql : None
pyarrow : 22.0.0
pyiceberg : None
pyreadstat : None
pytest : 9.0.2
python-calamine : None
pytz : None
pyxlsb : None
s3fs : None
scipy : 1.16.3
sqlalchemy : None
tables : None
tabulate : None
xarray : None
xlrd : None
xlsxwriter : None
zstandard : None
qtpy : None
pyqt5 : None

joshuanapoli, Dec 09 '25 18:12

Thanks for the report. On pandas 2.3, are you setting copy_on_write to either True or False? If you haven't yet tested, can you test this with copy_on_write set to False?

rhshadrach, Dec 09 '25 23:12
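The copy_on_write check requested above could be run along these lines. This is only a sketch: the helper name `pivot_with_cow` and the five-index frame are illustrative and not from the report, and on pandas 3.x, where Copy-on-Write is always enabled, the `mode.copy_on_write` option is assumed to be deprecated and to have no effect (hence the suppressed warnings). To actually test the report, the small frame would be replaced with the 100,000-index one from the reproducible example.

```python
import warnings

import pandas as pd


def pivot_with_cow(df, copy_on_write):
    """Run the report's pivot with mode.copy_on_write set explicitly.

    Hypothetical helper for illustration; not part of pandas.
    """
    with warnings.catch_warnings():
        # Assumption: pandas 3.x may warn that this option is deprecated.
        warnings.simplefilter("ignore")
        with pd.option_context("mode.copy_on_write", copy_on_write):
            return df.pivot_table(
                index=["idx"],
                columns="metric",
                values="value",
                aggfunc="first",
            )


# Small stand-in for the report's dataset (the report uses 100,000 indices).
rows = []
for idx in range(5):
    rows.append({"idx": idx, "metric": "apple", "value": 2 * idx})
    rows.append({"idx": idx, "metric": "banana", "value": 3 * idx})
    rows.append({"idx": idx, "metric": "coconut", "value": 4 * idx})
small = pd.DataFrame(rows)

for flag in (False, True):
    result = pivot_with_cow(small, flag)
    has_dups = bool(result.index.duplicated().any())
    print(f"copy_on_write={flag}: rows={len(result)}, duplicates={has_dups}")
```

If the duplicated index shows up with both settings, Copy-on-Write is unlikely to be the cause.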

@joshuanapoli do you know how numpy 1.26 got installed in the Python 3.14 env? (I don't think there are Python 3.14 wheels for such an old numpy version, so did something force an older numpy to be built and installed from source?)

jorisvandenbossche, Dec 12 '25 14:12

According to the analysis by @AKHIL-149 in https://github.com/pandas-dev/pandas/pull/63324, the issue might be due to the combination of Python 3.14 and numpy 1.26. But since numpy 1.26 was released well before Python 3.14, that is not really a supported combination. (At the time it was released, it supported up to Python 3.12, and I don't think NumPy will still consider fixes to 1.x for newer Python versions.)

jorisvandenbossche, Dec 12 '25 15:12
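To spot this kind of mismatch in an environment before debugging pandas itself, a small standard-library check might look like the sketch below. The table `MAX_PYTHON_FOR_NUMPY` and the helper names are hypothetical. The only entry comes from the comment above (numpy 1.26 supported up to Python 3.12 when it was released), and other NumPy series are deliberately left as "unknown" rather than guessed.

```python
import sys
from importlib import metadata

# Assumption taken from the thread: NumPy 1.26 supported up to Python 3.12
# at release. Other NumPy series are intentionally not listed.
MAX_PYTHON_FOR_NUMPY = {(1, 26): (3, 12)}


def combo_status(python_version, numpy_version):
    """Return True/False for a known pairing, or None if the series is unlisted."""
    major, minor = (int(part) for part in numpy_version.split(".")[:2])
    max_python = MAX_PYTHON_FOR_NUMPY.get((major, minor))
    if max_python is None:
        return None
    return tuple(python_version[:2]) <= max_python


def installed_numpy_version():
    """Return the installed NumPy version string, or None if it is not installed."""
    try:
        return metadata.version("numpy")
    except metadata.PackageNotFoundError:
        return None


numpy_version = installed_numpy_version()
if numpy_version is None:
    print("numpy is not installed in this environment")
else:
    status = combo_status(sys.version_info, numpy_version)
    print(f"Python {sys.version_info[0]}.{sys.version_info[1]}, numpy {numpy_version}: supported={status}")
```

For the environment in this report (Python 3.14.2, numpy 1.26.4), `combo_status` returns `False`.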