What is relationship between XTOT and (nu18+n1820+n21)?

Open martinholmer opened this issue 7 years ago • 17 comments

Shouldn't the value of XTOT be equal to the value of (nu18+n1820+n21) for every filing unit in both the CPS data and in the PUF data?

martinholmer avatar Feb 12 '18 17:02 martinholmer

When I add a test of my expectations, I get this puzzling result:

______________________________ test_ubi_variables ______________________________

cps_path = '/Users/mrh/work/OSPC/tax-calculator/taxcalc/tests/../cps.csv.gz'

    @pytest.mark.one
    def test_ubi_variables(cps_path):
        """
        Ensure that the three UBI head-count variables add up to XTOT variable.
        """
        cpsdf = pd.read_csv(cps_path)
        xtot = cpsdf['XTOT']
        nsum = cpsdf['nu18'] + cpsdf['n1820'] + cpsdf['n21']
        if not np.allclose(xtot, nsum):
            print 'number of diffs', np.sum(xtot != nsum)
            print 'number xtot < nsum', np.sum(xtot < nsum)
            print 'number xtot > nsum', np.sum(xtot > nsum)
>           assert 'XTOT' == '(nu18+n1820+n21)'
E           AssertionError: assert 'XTOT' == '(nu18+n1820+n21)'
E             - XTOT
E             + (nu18+n1820+n21)

tests/test_cpscsv.py:140: AssertionError
----------------------------- Captured stdout call -----------------------------
number of diffs 17944
number xtot < nsum 679
number xtot > nsum 17265
============================= 441 tests deselected =============================
=================== 1 failed, 441 deselected in 6.81 seconds ==============

Are my expectations unreasonable?

Did I make a mistake in developing the new test?

I don't understand what's going on here.

@MattHJensen @Amy-Xu @andersonfrailey

martinholmer avatar Feb 12 '18 17:02 martinholmer

@martinholmer, my initial reaction is that test should work.

For households with multiple tax units, we check to see if any of the units should be combined. It's possible that there's an issue when they're combined that causes the discrepancy you've pointed out. I'll need to look into it more before I can give you a definitive answer.

andersonfrailey avatar Feb 12 '18 19:02 andersonfrailey

@andersonfrailey, Thanks for the quick feedback on issue #149.

martinholmer avatar Feb 12 '18 19:02 martinholmer

@andersonfrailey, It would seem to me that resolving issue #149 should have a high priority because the accuracy of reforms that repeal benefits and add a UBI is open to question.

martinholmer avatar Feb 13 '18 21:02 martinholmer

@martinholmer I agree. I'll post updates to this issue as I pin down the exact cause.

andersonfrailey avatar Feb 14 '18 14:02 andersonfrailey

Briefly going back to your initial question in this issue, @martinholmer.

> Shouldn't the value of XTOT be equal to the value of (nu18+n1820+n21) for every filing unit in both the CPS data and in the PUF data?

This might not be the case with PUF data. nu18, n1820, and n21 are pulled from the CPS generated tax units that are matched to PUF units, while XTOT is from the IRS-PUF and capped at 5. Thus, there could be some variation between XTOT and the sum of the UBI variables in the PUF.

That being said, the test you posted using the CPS file should pass. XTOT is defined in the SAS scripts as XXTOT = TXPYE + DEPNE;, where TXPYE is equal to 2 if it is a joint unit and 1 otherwise. DEPNE is the number of dependents in the unit and is incremented by 1 each time a new dependent is added. We increment the UBI age variables first when the tax unit is created for the head of the unit, then whenever a spouse or dependent is added. As I mentioned in my previous comment, the issue is likely due to tax units being combined and something being ignored.
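
To make the expected identity concrete, here is a minimal Python sketch of the counting logic described above. It is not the SAS code: the `Person` class and `count_unit` function are hypothetical, and the age boundaries (under 18, 18 to 20, 21 and over) are an assumption based on the variable names.

```python
from dataclasses import dataclass


@dataclass
class Person:
    age: int


def count_unit(head, spouse=None, dependents=()):
    """Return XTOT and the three UBI head counts for one hypothetical unit."""
    members = [head]
    if spouse is not None:
        members.append(spouse)
    members.extend(dependents)

    txpye = 2 if spouse is not None else 1  # 2 for a joint unit, 1 otherwise
    depne = len(dependents)                 # one per dependent added
    xtot = txpye + depne

    # Assumed age bands; every member lands in exactly one of them.
    nu18 = sum(p.age < 18 for p in members)
    n1820 = sum(18 <= p.age <= 20 for p in members)
    n21 = sum(p.age >= 21 for p in members)
    return {"XTOT": xtot, "nu18": nu18, "n1820": n1820, "n21": n21}


unit = count_unit(Person(40), Person(38), [Person(19), Person(12)])
print(unit)
```

Because each member is counted once in XTOT and once in exactly one age band, the two totals agree whenever the counters are updated together, which is why the CPS test is expected to pass.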

andersonfrailey avatar Feb 14 '18 15:02 andersonfrailey

@andersonfrailey said in taxdata issue #149:

> > Shouldn't the value of XTOT be equal to the value of (nu18+n1820+n21) for every filing unit in both the CPS data and in the PUF data?
>
> This might not be the case with PUF data. nu18, n1820, and n21 are pulled from the CPS generated tax units that are matched to PUF units, while XTOT is from the IRS-PUF and capped at 5. Thus, there could be some variation between XTOT and the sum of the UBI variables in the PUF.

Thanks for pointing this out about XTOT in the PUF. But it seems even more complicated than you suggest. Here are two parts of the 2011 PUF documentation:

[image: screen shot 2018-02-14 at 1 00 11 pm]

and

[image: screen shot 2018-02-14 at 1 01 02 pm]

So, as you say, there are some filing units for which the identity XTOT==(nu18+n1820+n21) should not be expected to hold (because of the limits that are imposed on the components of XTOT, which are not in the puf.csv file).

So, it will be difficult to write an accurate XTOT==(nu18+n1820+n21) test for the PUF data in Tax-Calculator because the puf.csv file does not contain the six components of XTOT. About all I can do is see if there are any filing units in the puf.csv file where XTOT > (nu18+n1820+n21), which would be a genuine inconsistency, right? Does it seem correct to say that any filing unit with XTOT < (nu18+n1820+n21) may be consistent but just have had XTOT capped by IRS?
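
If that reading is right, each PUF filing unit falls into one of three cases. A minimal pandas sketch, using made-up rows rather than real PUF records (the `status` column and its labels are hypothetical):

```python
import numpy as np
import pandas as pd

# Made-up rows for illustration only; not real PUF records.
df = pd.DataFrame({
    "XTOT":  [2, 5, 3, 1],
    "nu18":  [0, 4, 0, 0],
    "n1820": [0, 0, 0, 0],
    "n21":   [2, 2, 2, 1],
})
nsum = df["nu18"] + df["n1820"] + df["n21"]

# XTOT > nsum cannot be explained by capping; XTOT < nsum might be.
df["status"] = np.select(
    [df["XTOT"] > nsum, df["XTOT"] < nsum],
    ["inconsistent", "possibly capped"],
    default="equal",
)
print(df)
```

Only the "inconsistent" rows would count as failures in a PUF test written this way.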

And, as we agree, none of these considerations apply to the XTOT==(nu18+n1820+n21) test on CPS data.

martinholmer avatar Feb 14 '18 18:02 martinholmer

@andersonfrailey, I've revised the PUF data test comparing the values of XTOT and (nu18+n1820+n21) for each filing unit in the new puf.csv file. The revisions try to account for the fact that the XTOT value may be capped in the PUF data. But I'm still getting 13442 (out of 249087) filing units with an XTOT value greater than their (nu18+n1820+n21) value.

Does it look as if I have done the test correctly?

============================= test session starts ==============================
platform darwin -- Python 2.7.14, pytest-3.2.1, py-1.4.34, pluggy-0.4.0
rootdir: /Users/mrh/work/OSPC/tax-calculator, inifile: setup.cfg
plugins: xdist-1.17.1
collected 443 items                                                             

tests/test_pufcsv.py F

=================================== FAILURES ===================================
_____________________________ test_ubi_n_variables _____________________________

puf_path = '/Users/mrh/work/OSPC/tax-calculator/taxcalc/tests/../../puf.csv'

    @pytest.mark.requires_pufcsv
    @pytest.mark.one
    def test_ubi_n_variables(puf_path):
        """
        Ensure that the three UBI n* variables add up to XTOT variable,
        recognizing that XTOT values are often capped in the IRS-SOI PUF,
        so that XTOT < NSUM might not indicate any data inconsistency.
        """
        pufdf = pd.read_csv(puf_path)
        xtot = pufdf['XTOT']
        nsum = pufdf['nu18'] + pufdf['n1820'] + pufdf['n21']
        if not np.sum(xtot > nsum) == 0:
            print('number xtot > nsum is:', np.sum(xtot > nsum))
>           assert 'XTOT' <= '(nu18+n1820+n21)'
E           AssertionError: assert 'XTOT' <= '(nu18+n1820+n21)'

tests/test_pufcsv.py:360: AssertionError
----------------------------- Captured stdout call -----------------------------
number xtot > nsum is: 13442
============================= 442 tests deselected =============================
=================== 1 failed, 442 deselected in 4.40 seconds ===================

martinholmer avatar Feb 14 '18 18:02 martinholmer

That looks better. We control for number of dependents and filing type during our matching process, so except in cases where the unit has more than five people, everything should line up accordingly.

andersonfrailey avatar Feb 14 '18 21:02 andersonfrailey

@andersonfrailey said:

> That [revised test] looks better. We control for number of dependents and filing type during our matching process, so except in cases where the unit has more than five people, everything should line up accordingly.

OK. So, the 13,442 filing units in the new puf.csv file with XTOT > (nu18+n1820+n21) are still under investigation, right?

martinholmer avatar Feb 14 '18 22:02 martinholmer

@martinholmer asked:

> OK. So, the 13,442 filing units in the new puf.csv file with XTOT > (nu18+n1820+n21) are still under investigation, right?

Right. I've spent the day looking through the problems with the CPS file to see what type of filer has the most errors and how large they are.

About 96% of the units that fail the test fail because the UBI variables are smaller than XTOT. About 95% of the units with errors have one fewer UBI recipient than XTOT implies.

Here's the distribution: [image: faildist]

The y-axis is the number of failing units; the x-axis is the "error", that is, XTOT minus the number of UBI recipients. A negative error implies UBI recipients > XTOT. A positive error implies UBI recipients < XTOT.
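
For anyone who wants to reproduce a tabulation like this, here is a rough pandas sketch. The rows are made up, and reading the error as XTOT minus the sum of the three UBI variables is an assumption inferred from the sign convention above.

```python
import pandas as pd

# Made-up rows for illustration only; not real CPS records.
cps = pd.DataFrame({
    "XTOT":  [2, 2, 1, 4, 3],
    "nu18":  [0, 0, 1, 2, 1],
    "n1820": [0, 0, 0, 0, 0],
    "n21":   [1, 2, 2, 2, 1],
})
nsum = cps["nu18"] + cps["n1820"] + cps["n21"]

# Positive error: fewer UBI recipients than XTOT implies.
error = cps["XTOT"] - nsum

# Count failing units at each error size, ordered by error.
dist = error[error != 0].value_counts().sort_index()
print(dist)
```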

I still haven't pinned down why some units have this problem and others don't. In any group of filers (dependent, single, joint, head of household), at most 6% of the units are affected.

Dependent filers account for 4% of the error units. Of the error units, 100% of those where the number of UBI recipients is larger than XTOT are dependent filers. This leads me to believe that some of the dependent filers are being assigned UBI recipients for the tax unit that claims them as well as themselves. The difference is usually one or two additional UBI recipients, but there is one unit that shows 11 total UBI recipients and has XTOT = 1.

Joint filers account for about 85% of the error units. Of the 15,271 joint filers with this error, all but five show XTOT greater than the number of UBI recipients by one or two.

Because this only occurs in about 4% of all tax units, I'm inclined to believe there is an issue with how those units are flagged during the creation process. I'll be primarily looking into that over the next couple of days.

andersonfrailey avatar Feb 14 '18 23:02 andersonfrailey

I've spent the past couple of days digging into this issue and wanted to give an update on what I've been able to find.

The root of the issue is in our matching process. During the match we break up both files into groups and match within groups. Our data is split by the number of dependents and filing status (along with a few other variables), but not by an actual count of the number of people in a tax unit.

I haven't figured out why tax units with the same filing status and number of dependents would differ like we're seeing, but when I added a control for the number of people in the unit it eliminated the issue of XTOT being greater than the sum of nu18, n1820, and n21. There are 5,578 (unweighted) units where XTOT is less than the sum of nu18, n1820, and n21, but as discussed previously in this issue that is reasonable given the cap on XTOT in the PUF.
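
To illustrate what the added control does, here is a toy sketch in pandas. It is not the matching code: the real match does more than pair every unit within a cell, and the column names (`filing_status`, `n_dep`, `n_people`, `nsum`) and the rows are all hypothetical.

```python
import pandas as pd

# Made-up units; all column names here are hypothetical.
puf = pd.DataFrame({
    "puf_id": [1, 2],
    "filing_status": ["joint", "joint"],
    "n_dep": [1, 1],
    "XTOT": [3, 3],
})
cps = pd.DataFrame({
    "cps_id": [10, 11],
    "filing_status": ["joint", "joint"],
    "n_dep": [1, 1],
    "nsum": [3, 2],  # nu18 + n1820 + n21 for the CPS unit
})

# Cells defined only by filing status and number of dependents.
loose = puf.merge(cps, on=["filing_status", "n_dep"])

# Same cells plus a people-count control, capped at 5 like XTOT in the PUF.
tight = puf.assign(n_people=puf["XTOT"]).merge(
    cps.assign(n_people=cps["nsum"].clip(upper=5)),
    on=["filing_status", "n_dep", "n_people"],
)

print("loose pairs with XTOT > nsum:", int((loose["XTOT"] > loose["nsum"]).sum()))
print("tight pairs with XTOT > nsum:", int((tight["XTOT"] > tight["nsum"]).sum()))
```

In this toy data the looser cells allow a three-person PUF unit to be paired with a two-person CPS unit, and the extra key rules that pairing out. It says nothing about why the CPS unit's count differs in the first place.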

I haven't looked at what (if any) effects this change has had on tax calculations, but I'll come up with some numbers and post them in the PR with the updates to the matching scripts.

cc @martinholmer

andersonfrailey avatar May 01 '18 19:05 andersonfrailey

@andersonfrailey, Thanks for looking into taxdata issue #149.
Your plan for resolving this issue sounds sensible.

martinholmer avatar May 01 '18 20:05 martinholmer

Fixed for CPS data in PR #151 and for PUF data in PR #188.

martinholmer avatar May 23 '18 20:05 martinholmer

I'm still seeing this issue, e.g. RECID 247186 has XTOT=1, nu18=2, n1820=0, n21=3. Here's the distribution of (nu18+n1820+n21) - max(XTOT, 1), limited to those with the UBI variables <= 5 due to XTOT top-coding (notebook): [image]
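
A rough sketch of that calculation in pandas, not the notebook itself. The rows and RECID values are made up (the first row copies the variable values quoted above), and applying the <= 5 filter to the sum of the three UBI variables is an assumption.

```python
import pandas as pd

# Made-up rows; the first copies the XTOT/nu18/n1820/n21 values quoted above.
df = pd.DataFrame({
    "RECID": [1, 2, 3, 4],
    "XTOT":  [1, 0, 5, 2],
    "nu18":  [2, 0, 3, 0],
    "n1820": [0, 0, 0, 0],
    "n21":   [3, 1, 4, 2],
})
nsum = df["nu18"] + df["n1820"] + df["n21"]

# (nu18+n1820+n21) - max(XTOT, 1), computed row by row.
diff = nsum - df["XTOT"].clip(lower=1)

# Drop units whose UBI count exceeds 5, where XTOT top-coding could explain a gap.
dist = diff[nsum <= 5].value_counts().sort_index()
print(dist)
```

Any nonzero entry that survives the filter is a difference that top-coding alone would not account for.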

MaxGhenis avatar Feb 21 '19 21:02 MaxGhenis

Should this be reopened?

MaxGhenis avatar Mar 15 '19 07:03 MaxGhenis

@MaxGhenis, I'll reopen this until I get some more time to look into the issue. I've got some work to do on Tax-Brain that I need to prioritize for now and can get back to this once that's completed.

andersonfrailey avatar Mar 18 '19 00:03 andersonfrailey