ValueError: cannot handle a non-unique multi-index!
This might be related to Dask not supporting multi-indexes. My code was failing randomly, which at first made me assume there was a problem in the input data. Running with the versions dask 1.2.2, numpy 1.16.3, and pandas 0.24.2, the minimal example below fails. Is there a way of making this error message more intuitive, or of making this operation work?
```python
import numpy as np
import pandas as pd
import dask.dataframe as dd

df = pd.DataFrame({'ind_a': np.arange(100), 'ind_b': 1, 'var': 'whatever'})
df = dd.from_pandas(df, npartitions=90)

# Only fails when grouping with two variables.
df['nr'] = df.groupby(['ind_a', 'ind_b']).cumcount()
len(df)
```
Note that this only happens when we have multiple threads working with the same data.
```python
df['nr'] = df.groupby(['ind_a', 'ind_b']).cumcount()
len(df)
```
```python
In [4]: dask.config.set(scheduler='multiprocessing')
Out[4]: <dask.config.set at 0x113d8b208>
In [5]: df['nr'] = df.groupby(['ind_a', 'ind_b']).cumcount()
...: len(df)
...:
Out[5]: 100
In [6]: dask.config.set(scheduler='single-threaded')
Out[6]: <dask.config.set at 0x113d9ae80>
In [7]: df['nr'] = df.groupby(['ind_a', 'ind_b']).cumcount()
...: len(df)
```
It seems like some piece of data is being manipulated in-place by two threads.
If that's true, would it be possible to inject copies in-between steps to identify where the in-place manipulation is occurring?
I'm not sure. I think a more productive line is to demonstrate the thread safety issue with pandas-only code and a ThreadPool, and see if changes can be made to pandas. @marberi are you interested in exploring that?
@TomAugspurger Could you outline how I could make such a test? I am quite familiar with pandas as a user, but not with how one would debug this issue.
I'm not entirely sure either. This is a bit tricky, since it's not clear what section of pandas is mutating the shared object.
I would try a couple approaches:
- Use a ThreadPoolExecutor to apply multiple reductions (cumcount, cummax) to the same DataFrameGroupBy object at the same time. That would help narrow down whether the issue is in pandas' groupby code, or in the indexing dask does before getting there.
- Try adding finer and finer-grained thread locks on the functions used as part of dask's cumcount.
This still may not end up finding this issue... This is a tricky one.
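The first suggestion above can be sketched with pandas alone, no dask involved. This is only a sketch of the suggested stress test (the `hammer` helper name is mine, not from the thread): share a single `DataFrameGroupBy` across a thread pool and call `cumcount()` concurrently. On an affected pandas version this may raise intermittently, so it is worth rerunning a few times.

```python
# Sketch of the suggested pandas-only stress test: one shared
# DataFrameGroupBy, hammered concurrently from several threads.
from concurrent.futures import ThreadPoolExecutor

import numpy as np
import pandas as pd

df = pd.DataFrame({'ind_a': np.arange(100), 'ind_b': 1, 'var': 'whatever'})
gb = df.groupby(['ind_a', 'ind_b'])  # a single groupby object shared by all threads

def hammer(_):
    # cumcount() is the reduction the failing dask example calls under the hood
    return gb.cumcount()

with ThreadPoolExecutor(max_workers=8) as pool:
    results = list(pool.map(hammer, range(32)))

print(all(len(r) == 100 for r in results))
```

If this loop raises while a single-threaded loop over `hammer` does not, the thread-safety problem is in pandas itself rather than in dask's indexing.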
@marberi, does Tom's suggested course of action above make sense? Do you have additional questions?
I will give it a try. I tried some simple debugging this morning, inserting print statements into the pandas code to look for hints. It is quite tricky, since the state at times changes between the "is_unique" check and the duplicate detection. I will keep looking. However, when testing whether upgrading pandas to the latest version helps, I ran into another issue (see attachment). Are you aware of this problem? It seems unrelated, but it is something that should be fixed. new_message.txt
Opened https://github.com/dask/dask/issues/4880 for that new message.
https://github.com/pandas-dev/pandas/issues/21150 may be relevant for the original issue, but there isn't much info there.
Thanks, it looks very relevant. The pandas code called by dask crashes in multiple places, but the "is_unique" check always seems to be involved.
Yes, I missed that earlier. You might try adding a threading.Lock around uses of is_unique / do_unique_check just to verify.
I tried adding a lock in "is_unique" in the Pandas source code, as shown below. Then rerunning I still have the same issue. Does this look like a correct usage of locking?
```python
@property
def is_unique(self):
    import threading
    lock = threading.Lock()
    lock.acquire_lock()
    print('is rebuilt: 2')
    if self.need_unique_check:
        self._do_unique_check()
    ans = self.unique == 1
    lock.release()
    return ans
```
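One thing to note about the attempt above: because the `Lock` is created inside the property body, every call gets its own fresh lock object, so no two threads ever contend on the same lock and it cannot provide mutual exclusion; a lock shared between callers (e.g. at module or class level) is needed for that. A stdlib-only sketch of the difference (helper names are mine, for illustration):

```python
import threading

shared_lock = threading.Lock()  # one Lock object shared by every caller

def try_enter_shared():
    # Real mutual exclusion: all threads contend on the same object.
    return shared_lock.acquire(blocking=False)

def try_enter_local():
    # A Lock created per call is private to that call; nothing else can
    # ever be holding it, so this "lock" excludes nobody.
    lock = threading.Lock()
    return lock.acquire(blocking=False)

# Simulate another thread currently inside the critical section:
shared_lock.acquire()
blocked = try_enter_shared()      # False: the shared lock is held elsewhere
always_free = try_enter_local()   # True: the per-call lock is always free
shared_lock.release()

print(blocked, always_free)  # False True
```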
Yes. Pandas has multiple index classes, and some override is_unique with their own implementations.
Ok, I got one step further on this. The "is_unique" method in the Index class, found in pandas/pandas/core/indexes/base.py, has a "@cache_readonly" decorator. When I comment out this cache, the problem disappears. This appears to be a custom decorator implemented in the pandas code.
Yeah, the output of is_unique is computed once and cached for the lifetime of the object. Future calls to is_unique won't even reach the lock once the value has been cached.
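The shape of the race can be reproduced with a pure-Python sketch of such a cache-on-first-access descriptor (an illustrative sketch only, not pandas' actual `cache_readonly` implementation): two threads that both observe a cache miss will both run the underlying function, so any state that function mutates is touched concurrently. The barrier below forces both threads into the miss window deterministically.

```python
import threading

class cache_readonly_sketch:
    """Cache-on-first-access property descriptor (illustrative sketch only)."""
    def __init__(self, func):
        self.func = func

    def __set_name__(self, owner, name):
        self.name = name

    def __get__(self, obj, objtype=None):
        if obj is None:
            return self
        cache = obj.__dict__
        if self.name not in cache:
            # Race window: two threads can both see a miss here, so
            # self.func runs twice on the same object.
            cache[self.name] = self.func(obj)
        return cache[self.name]


class Demo:
    calls = 0
    barrier = threading.Barrier(2)  # forces both threads into the miss window

    @cache_readonly_sketch
    def value(self):
        Demo.calls += 1
        Demo.barrier.wait()  # park here until the second thread also misses
        return 42


d = Demo()
results = []
threads = [threading.Thread(target=lambda: results.append(d.value)) for _ in range(2)]
for t in threads:
    t.start()
for t in threads:
    t.join()

print(Demo.calls)  # 2: the "cached" function ran once per thread
print(results)     # [42, 42]
```

This is why commenting out the cache makes the symptom disappear: without the cache there is no partially-initialized shared state for a second thread to observe.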
Would this be an acceptable fix? I tested acquiring locks inside the `__get__` method itself and it did not work.
https://github.com/marberi/pandas/commit/ad33858298961d63ab0bc70caf1d04b3ca02b5fc
May be worth posting on the pandas issue I linked to. I'm not familiar with pandas' indexing code.
Ok, also posted there. Let's see what they say.
Is this still an issue, @marberi?
@jakirkham
Creating a new environment with: dask: 2.20.0 numpy: 1.18.5 pandas: 1.0.5
I can confirm the error is still there. When testing this last year, we localized where the problem comes from, but did not determine the best way to develop a fix.
I am definitely still seeing the issue. Note that as a workaround, I was able to group on a single column, built as a concatenation of the two values being grouped on, and get things moving again.
old code:
```python
ddf['NEW_VALUE'] = ddf.groupby(['GROUP_KEY_1', 'GROUP_KEY_2'])['value'].cumsum()
```
new code to work around the issue:
```python
ddf['GROUPER'] = ddf['GROUP_KEY_1'].astype(str) + ddf['GROUP_KEY_2'].astype(str)
ddf['NEW_VALUE'] = ddf.groupby(['GROUPER'])['value'].cumsum()
```
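One caveat with this workaround, shown here in plain pandas (the toy data is mine, the column names are taken from the comment above): naive string concatenation can merge distinct key pairs, e.g. `('a', 'bc')` and `('ab', 'c')` both become `'abc'`. Inserting a separator character that cannot occur inside the keys keeps distinct pairs distinct.

```python
import pandas as pd

ddf_like = pd.DataFrame({
    'GROUP_KEY_1': ['a', 'a', 'ab'],
    'GROUP_KEY_2': ['bc', 'bc', 'c'],
    'value': [1, 2, 3],
})

# Reference result: group on both columns directly.
by_pair = ddf_like.groupby(['GROUP_KEY_1', 'GROUP_KEY_2'])['value'].cumsum()

# Naive concatenation: ('a', 'bc') and ('ab', 'c') both become 'abc'.
naive = ddf_like['GROUP_KEY_1'] + ddf_like['GROUP_KEY_2']
by_naive = ddf_like.groupby(naive)['value'].cumsum()

# A separator that never occurs in the keys keeps distinct pairs distinct.
safe = ddf_like['GROUP_KEY_1'] + '|' + ddf_like['GROUP_KEY_2']
by_safe = ddf_like.groupby(safe)['value'].cumsum()

print(list(by_pair))   # [1, 3, 3]
print(list(by_naive))  # [1, 3, 6]  third row wrongly merged into the first group
print(list(by_safe))   # [1, 3, 3]  matches the two-column grouping
```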
While reporting another bug, I wanted to test whether the problem reported here is still an issue. The code in the original report (https://github.com/dask/dask/issues/4845#issue-448576561) appears to work fine with Dask version 2022.9.1 (Linux, Python 3.10.6, Conda). Could someone else confirm this?
Yeah, I am no longer seeing the exception in the original example. Maybe it is worth adding this as a test to Dask (to protect against regressions). Would this be of interest to you, @marberi?
@jakirkham Added a test here: https://github.com/dask/dask/pull/9506
As mentioned, when running "pytest" I could not get it to run the new test. It might be something trivial. Tests are still running.
A few years later, and this is still an error. I'm using Dask 2023.7.1 and the error happened to me. It was painful to finally see that the behavior is stochastic: sometimes the code works and sometimes it doesn't. Even with Dask 2024.3.1 (the latest at the time I'm writing this), it still happens.
Does someone have an update on this error? How can an issue stay open for such a long time? =/
Not from my side. Hopefully this gets fixed.
@Giatroo do you have a reproducer? The initial example works now
The initial example still doesn't work. You just have to try a few times, since the error appears randomly.
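Since the failure is stochastic, a single successful run proves little either way. A small stdlib-only harness (the `probe_flaky` helper name is mine, for illustration) can make reproduction attempts systematic by rerunning a snippet and collecting any exceptions. The demo below uses a deterministic stand-in for the flaky dask snippet; in practice you would pass a closure that runs the original example.

```python
import traceback

def probe_flaky(fn, attempts=20):
    """Run fn repeatedly; count successes/failures and keep the first traceback."""
    successes = failures = 0
    first_tb = None
    for _ in range(attempts):
        try:
            fn()
        except Exception:
            failures += 1
            if first_tb is None:
                first_tb = traceback.format_exc()
        else:
            successes += 1
    return successes, failures, first_tb

# Deterministic stand-in for the flaky dask snippet:
calls = {'n': 0}

def sometimes_fails():
    calls['n'] += 1
    if calls['n'] % 3 == 0:  # fail on every third call
        raise ValueError('cannot handle a non-unique multi-index!')

successes, failures, tb = probe_flaky(sometimes_fails, attempts=9)
print(successes, failures)  # 6 3
```

Reporting "N failures out of M attempts on version X" alongside the first traceback would make reproduction claims in this thread much easier to compare.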
Here's a print of an env I've just created by installing dask[dataframe] on the latest version.