ECLSK data produces NaNs and/or no correlation in HMA
Environment details
If you are already running SDV, please indicate the following details about the environment in which you are running it:
- SDV version: 1.15
- Python version: 3.10.12
- Operating System: Windows 10 (Google Colab)
Problem description
When I fit a synthesizer to the public ECLSK dataset (attached and here: https://nces.ed.gov/ecls/) and sample from it, several strange outcomes occur. The most notable involves the OUTCOME column: in the synthetic data it either takes a single value in every row or comes out as NaNs. There does not appear to be anything unusual about that column in the original data. Please advise.
What I already tried
Adjusting column dtypes, culling the dataset to fewer columns
link to colab: https://colab.research.google.com/drive/1pT81wxCReMNxam3ZP-6u3IM74R_0Czgh#scrollTo=YN16L5Ywcbou
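For anyone trying to confirm the same symptoms, here is a minimal sketch of a check on a single column. It assumes the real and synthetic tables are already available as pandas DataFrames; the helper name `column_health` and the toy data are hypothetical and are not taken from the notebook or from SDV.

```python
import pandas as pd


def column_health(real: pd.DataFrame, synthetic: pd.DataFrame, column: str) -> dict:
    """Compare the NaN rate and distinct-value count of one column in both tables."""
    real_col = real[column]
    synth_col = synthetic[column]
    synth_unique = int(synth_col.nunique(dropna=True))
    return {
        "real_nan_fraction": float(real_col.isna().mean()),
        "synthetic_nan_fraction": float(synth_col.isna().mean()),
        "real_unique": int(real_col.nunique(dropna=True)),
        "synthetic_unique": synth_unique,
        # True when every non-missing synthetic value is the same (or all are missing)
        "synthetic_is_constant": synth_unique <= 1,
    }


# Toy stand-ins for the real and synthetic tables (hypothetical values).
real = pd.DataFrame({"OUTCOME": [0, 1, 1, 0, 1]})
synthetic = pd.DataFrame({"OUTCOME": [1.0, 1.0, None, 1.0, 1.0]})

report = column_health(real, synthetic, "OUTCOME")
print(report)
```

Running the same helper on the output of the synthesizer's sampling step, for OUTCOME and for a few columns that look fine, may help show whether the problem is limited to that one column.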
Hi there @awesomeisfree, I apologize for the late reply here. Thank you for sharing your code and datasets; I was able to reproduce the issue on my end when using HMA Synthesizer. Unfortunately, we haven't determined the cause yet, so we will leave this issue open until we find out more.
While I know this isn't immediately helpful because you're using SDV Community, I will mention that this problem doesn't occur when using HSA Synthesizer (which is available in SDV Enterprise). If you want to learn more about SDV Enterprise while we investigate the issue with HMA Synthesizer, you can reach out to us here.