Tax-Calculator icon indicating copy to clipboard operation
Tax-Calculator copied to clipboard

Question: Is this a safe way to use highly customized growfactors?

Open donboyd5 opened this issue 4 years ago • 5 comments

Note: next comment gives result of this method.

Question: Is the procedure I describe below for using highly customized growfactors likely to have unanticipated and unintended side effects?

Goal is to advance puf.csv to 2017 as follows:

  • use custom stage 1 growfactors, including new growfactors not currently included in growfactors.csv
  • use stage 2 weights that are in puf_weights.csv
  • do not use stage 3 ratio adjustment
  • apply 2017 law
  • (I will reweight the resulting file but I want to start with the weights in puf_weights.csv and modify them as little as possible)

Issues and concerns:

  • if I were only using different growfactors, I know that I could provide a custom growfactors.csv with different values
  • however, in addition to changing values for a few existing growfactors, I want to create new growfactors - for example, I want to apply a custom growfactor to state and local tax deduction variables e18400 and e18500 (call it ASALT or something like that), which currently use a generic growfactor, ATXPY
  • I don't think I can add a custom growfactor simply by changing growfactors.csv
  • (idea for possible enhancement: maybe use a dict to map the name of every single puf variable grown in the _extrapolate function to the name of a variable to be provided in a custom growfactors.csv, and then loop through the puf variable names and apply the growfactor from the corresponding growfactor column)
  • so I think what I need to do is write my own version of _extrapolate and apply it to puf.csv before creating a Records object with the so-adjusted puf.csv
  • the question then is how to meet the next 2 requirements of applying stage 2 weights and not applying stage 3 ratio adjustments
  • the problem is with the first of those two requirements; the Data class does not like the idea of providing weights but not providing growfactors - it has the following code which says they are inconsistent:
        if data is not None:
            # check consistency of specified gfactors and weights
            if gfactors is None and weights is None:
                self.__aging_data = False
            elif gfactors is not None and weights is not None:
                self.__aging_data = True
            else:
                raise ValueError('gfactors and weights are inconsistent')

This seems easy enough to trick by providing a file of growfactors that are all one.

Here is the procedure I am planning to use:

  1. read puf.csv into a pandas dataframe
  2. modify dollar values in the dataframe with my own customized version of the _extrapolate function
  3. read in a dummy growfactors.csv file where all values are 1
  4. Advance puf.csv to 2017 using puf_weights.csv and the dummy growfactors

Thus, the whole process would look like this:

PUF_NAME = INDIR + 'puf.csv'
WEIGHTS_NAME = INDIR + 'puf_weights.csv'

puf = pd.read_csv(PUF_NAME)

# TODO: create growfactors_custom.csv, which has the expanded set of growfactors
# TODO: write extrapolate_custom based on _extrapolate
puf_extrap = extrapolate_custom(puf, 'growfactors_custom.csv')

# TODO: create growfactors_ones.csv - same format as growfactors.csv but all ones
gfactor_ones = tc.GrowFactors('growfactors_ones.csv')

recs = tc.Records(data=puf_extrap,
                  start_year=2011,
                  gfactors=gfactor_ones,
                  weights=WEIGHTS_NAME,
                  adjust_ratios=None)  # don't use puf_ratios

pol = tc.Policy()
calc = tc.Calculator(policy=pol, records=recs)
calc.advance_to_year(2017)
calc.calc_all()

My questions are:

  • does this seem like the right way to go
  • would it create any unintended or bad side effects (assume that I write my custom extrapolate function properly, and extrapolate the same variables that _extrapolate extrapolates - no more, no less)

Many thanks.

donboyd5 avatar Nov 02 '20 11:11 donboyd5

This method appears to have worked like a charm.

I wrote my own function that I call before creating a records object and used the general approach above. I had to make 2 changes in my function to the _extrapolate code in records.py (in addition to replacing self with a copy of puf):

  1. comment out the code that extrapolates benefits, which pertains only to the cps I guess; I haven't chased this down and don't need it for now
  2. several lines, included in the screenshot below, use the "[:]" notation on the lefthand side and that triggered warning messages; I am new to python and don't quite understand why that notation might be needed here; several occur within np.where clause but one does not; my reading suggests it may create a view rather than change a file in place but of course we want the file to change here; I could not find via stackoverflow etc. a clear explanation of why this might be needed and I experimented on made-up arrays to see if it altered the results and, at least in my experimentation, it did not; as a result, I removed all of the "[:]" notations and checked the results (which now ran without warnings) and all seemed good. Maybe that notation is needed for a class method whereas my function is freestanding, I don't know.

However, if someone who understands python better than me could explain why this is used on these lines I'd appreciate it.

image

donboyd5 avatar Nov 03 '20 09:11 donboyd5

@donboyd5, I'm glad this method for customized growfactors is working well. I will look over that in more depth shortly, but first I wanted to remember the motivation for the [:]s.

several lines, included in the screenshot below, use the "[:]" notation on the lefthand side and that triggered warning messages;

A google search for "numpy [:]" will offer some help, but I found an alternative technique for investigating even better: Go to the file in question and click "blame" in the upper right.

image

That will add a pane on the left showing the PR where each line was last changed.

Jumping to some of the lines using [:], we see:

image

One of @martinholmer's useful commit titles looks particularly relevant: "Incorporate several efficiency suggestions made by T.J..."

So I click on that, and it takes me here.

image

This is just one commit in a PR, but I want to see the discussion in the PR that led to it, so I click on the #603 in the upper left. There I find this helpful comment from @talumbau : https://github.com/PSLmodels/Tax-Calculator/pull/603#discussion_r53331956

There is more of interest in that comment about efficient memory allocation when using numpy arrays, but the key part about [:] is below:

image

MattHJensen avatar Nov 05 '20 13:11 MattHJensen

On to the substance of this issue: I don't see anything wrong with the approach you're using. If it appears to be working now, that is excellent.

One note, though, relating to this comment:

I don't think I can add a custom growfactor simply by changing growfactors.csv

This is true, it would be a bit more involved. But I don't think it would be that much more involved. I suspect you want to keep moving forward now that you have something working, but if you reconsider and would like to use a modified _extrapolate method that is part of the Records class of a fork of Tax-Calculator rather than a new, "off-calculator" version of _extrapolate, I would be happy to set up an example and then walk you through how I did it.

MattHJensen avatar Nov 05 '20 14:11 MattHJensen

@donboyd5, I'm glad this method for customized growfactors is working well. I will look over that in more depth shortly, but first I wanted to remember the motivation for the [:]s.

several lines, included in the screenshot below, use the "[:]" notation on the lefthand side and that triggered warning messages;

A google search for "numpy [:]" will offer some help, but I found an alternative technique for investigating even better: Go to the file in question and click "blame" in the upper right.

image

That will add a pane on the left showing the PR where each line was last changed.

Jumping to some of the lines using [:], we see:

image

One of @martinholmer's useful commit titles looks particularly relevant: "Incorporate several efficiency suggestions made by T.J..."

So I click on that, and it takes me here.

image

This is just one commit in a PR, but I want to see the discussion in the PR that led to it, so I click on the #603 in the upper left. There I find this helpful comment from @talumbau : #603 (comment)

There is more of interest in that comment about efficient memory allocation when using numpy arrays, but the key part about [:] is below:

image

Thanks, @MattHJensen! I wondered what "Blame" was about and now I know. Very useful. Always important to learn how to fish. Plus, I learned something important about numpy.where() and memory allocation.

donboyd5 avatar Nov 05 '20 15:11 donboyd5

On to the substance of this issue: I don't see anything wrong with the approach you're using. If it appears to be working now, that is excellent.

One note, though, relating to this comment:

I don't think I can add a custom growfactor simply by changing growfactors.csv

This is true, it would be a bit more involved. But I don't think it would be that much more involved. I suspect you want to keep moving forward now that you have something working, but if you reconsider and would like to use a modified _extrapolate method that is part of the Records class of a fork of Tax-Calculator rather than a new, "off-calculator" version of _extrapolate, I would be happy to set up an example and then walk you through how I did it.

Many thanks, @MattHJensen. As you inferred, I'm good for now as there are a lot of things in line ahead of this. My eventual goal is to be able to create a growfactors_custom.csv file that has any set of columns I desire, and then pass that file or file name or growfactor object to _extrapolate_custom along with a dict that maps every desired-to-be-grown puf variable (with the puf variable names as dict keys) to a column in growfactors_custom.csv (with growfactor_custom.csv column names as dict values), and then have _extrapolate_custom loop through the dict keys and multiply puf values for that puf variable by the growfactor defined by the key-value combination in the dict, for the year in question (not sure yet whether I should use the key-value combination (pufvariable-growfactorscolumn) to look up the value in a matrix of growfactors, or whether I should have the dict include the values for each year as lists -- TBD). This requires some thought about how to deal with the numpy.where() situations where a single puf variable might have two different growfactors depending on a logical condition. It seems a shame to let that prevent flexible customization, so I would look for a way to work around it.

Right now, as you inferred, I simply created an _extrapolate_custom function that hard codes the column names of a growfactors_custom dataframe (obtained from growfactors_custom.csv) in the multiplication process and looks up the year in the dataframe. It is a little slow (maybe 2 or 3 seconds) which seems far longer than something like this should take.

Once I figure out the dict approach as a free-standing function, I would love to know how to incorporate it as a method of Records in a fork of tax-calculator. I think I can figure out how to implement the dict approach as a function and will take a crack at it when other things are behind me and will seek feedback after I puzzle through it. If you think there is a better design than this for how to do completely flexible growfactor customization I'd much appreciate knowing; I have been thinking of the dict approach because my understanding is that value lookup is extremely fast with dicts.

donboyd5 avatar Nov 05 '20 15:11 donboyd5