Tax-Calculator
Tax-Calculator copied to clipboard
Question: Is this a safe way to use highly customized growfactors?
Note: next comment gives result of this method.
Question: Is the procedure I describe below for using highly customized growfactors likely to have unanticipated and unintended side effects?
Goal is to advance puf.csv to 2017 as follows:
- use custom stage 1 growfactors, including new growfactors not currently included in growfactors.csv
- use stage 2 weights that are in puf_weights.csv
- do not use stage 3 ratio adjustment
- apply 2017 law
- (I will reweight the resulting file but I want to start with the weights in puf_weights.csv and modify them as little as possible)
Issues and concerns:
- if I were only using different growfactors, I know that I could provide a custom growfactors.csv with different values
- however, in addition to changing values for a few existing growfactors, I want to create new growfactors - for example, I want to apply a custom growfactor to state and local tax deduction variables e18400 and e18500 (call it ASALT or something like that), which currently use a generic growfactor, ATXPY
- I don't think I can add a custom growfactor simply by changing growfactors.csv
- (idea for possible enhancement: maybe use a dict to map the name of every single puf variable grown in the
_extrapolate
function to the name of a variable to be provided in a custom growfactors.csv, and then loop through the puf variable names and apply the growfactor from the corresponding growfactor column) - so I think what I need to do is write my own version of
_extrapolate
and apply it to puf.csv before creating a Records object with the so-adjusted puf.csv - the question then is how to meet the next 2 requirements of applying stage 2 weights and not applying stage 3 ratio adjustments
- the problem is with the first of those two requirements; the Data class does not like the idea of providing weights but not providing growfactors - it has the following code which says they are inconsistent:
if data is not None:
# check consistency of specified gfactors and weights
if gfactors is None and weights is None:
self.__aging_data = False
elif gfactors is not None and weights is not None:
self.__aging_data = True
else:
raise ValueError('gfactors and weights are inconsistent')
This seems easy enough to trick by providing a file of growfactors that are all one.
Here is the procedure I am planning to use:
- read puf.csv into a pandas dataframe
- modify dollar values in the dataframe with my own customized version of the
_extrapolate
function - read in a dummy growfactors.csv file where all values are 1
- Advance puf.csv to 2017 using puf_weights.csv and the dummy growfactors
Thus, the whole process would look like this:
PUF_NAME = INDIR + 'puf.csv'
WEIGHTS_NAME = INDIR + 'puf_weights.csv'
puf = pd.read_csv(PUF_NAME)
# TODO: create growfactors_custom.csv, which has the expanded set of growfactors
# TODO: write extrapolate_custom based on _extrapolate
puf_extrap = extrapolate_custom(puf, 'growfactors_custom.csv')
# TODO: create growfactors_ones.csv - same format as growfactors.csv but all ones
gfactor_ones = tc.GrowFactors('growfactors_ones.csv')
recs = tc.Records(data=puf_extrap,
start_year=2011,
gfactors=gfactor_ones,
weights=WEIGHTS_NAME,
adjust_ratios=None) # don't use puf_ratios
pol = tc.Policy()
calc = tc.Calculator(policy=pol, records=recs)
calc.advance_to_year(2017)
calc.calc_all()
My questions are:
- does this seem like the right way to go
-
would it create any unintended or bad side effects (assume that I write my custom extrapolate function properly, and extrapolate the same variables that
_extrapolate
extrapolates - no more, no less)
Many thanks.
This method appears to have worked like a charm.
I wrote my own function that I call before creating a records object and used the general approach above. I had to make 2 changes in my function to the _extrapolate
code in records.py (in addition to replacing self with a copy of puf):
- comment out the code that extrapolates benefits, which pertains only to the cps I guess; I haven't chased this down and don't need it for now
- several lines, included in the screenshot below, use the "[:]" notation on the lefthand side and that triggered warning messages; I am new to python and don't quite understand why that notation might be needed here; several occur within np.where clause but one does not; my reading suggests it may create a view rather than change a file in place but of course we want the file to change here; I could not find via stackoverflow etc. a clear explanation of why this might be needed and I experimented on made-up arrays to see if it altered the results and, at least in my experimentation, it did not; as a result, I removed all of the "[:]" notations and checked the results (which now ran without warnings) and all seemed good. Maybe that notation is needed for a class method whereas my function is freestanding, I don't know.
However, if someone who understands python better than me could explain why this is used on these lines I'd appreciate it.
@donboyd5, I'm glad this method for customized growfactors is working well. I will look over that in more depth shortly, but first I wanted to remember the motivation for the [:]s.
several lines, included in the screenshot below, use the "[:]" notation on the lefthand side and that triggered warning messages;
A google search for "numpy [:]" will offer some help, but I found an alternative technique for investigating even better: Go to the file in question and click "blame" in the upper right.
That will add a pane on the left showing the PR where each line was last changed.
Jumping to some of the lines using [:], we see:
One of @martinholmer's useful commit titles looks particularly relevant: "Incorporate several efficiency suggestions made by T.J..."
So I click on that, and it takes me here.
This is just one commit in a PR, but I want to see the discussion in the PR that led to it, so I click on the #603 in the upper left. There I find this helpful comment from @talumbau : https://github.com/PSLmodels/Tax-Calculator/pull/603#discussion_r53331956
There is more of interest in that comment about efficient memory allocation when using numpy arrays, but the key part about [:] is below:
On to the substance of this issue: I don't see anything wrong with the approach you're using. If it appears to be working now, that is excellent.
One note, though, relating to this comment:
I don't think I can add a custom growfactor simply by changing growfactors.csv
This is true, it would be a bit more involved. But I don't think it would be that much more involved. I suspect you want to keep moving forward now that you have something working, but if you reconsider and would like to use a modified _extrapolate
method that is part of the Records class of a fork of Tax-Calculator rather than a new, "off-calculator" version of _extrapolate
, I would be happy to set up an example and then walk you through how I did it.
@donboyd5, I'm glad this method for customized growfactors is working well. I will look over that in more depth shortly, but first I wanted to remember the motivation for the [:]s.
several lines, included in the screenshot below, use the "[:]" notation on the lefthand side and that triggered warning messages;
A google search for "numpy [:]" will offer some help, but I found an alternative technique for investigating even better: Go to the file in question and click "blame" in the upper right.
That will add a pane on the left showing the PR where each line was last changed.
Jumping to some of the lines using [:], we see:
One of @martinholmer's useful commit titles looks particularly relevant: "Incorporate several efficiency suggestions made by T.J..."
So I click on that, and it takes me here.
This is just one commit in a PR, but I want to see the discussion in the PR that led to it, so I click on the #603 in the upper left. There I find this helpful comment from @talumbau : #603 (comment)
There is more of interest in that comment about efficient memory allocation when using numpy arrays, but the key part about [:] is below:
Thanks, @MattHJensen! I wondered what "Blame" was about and now I know. Very useful. Always important to learn how to fish. Plus, I learned something important about numpy.where() and memory allocation.
On to the substance of this issue: I don't see anything wrong with the approach you're using. If it appears to be working now, that is excellent.
One note, though, relating to this comment:
I don't think I can add a custom growfactor simply by changing growfactors.csv
This is true, it would be a bit more involved. But I don't think it would be that much more involved. I suspect you want to keep moving forward now that you have something working, but if you reconsider and would like to use a modified
_extrapolate
method that is part of the Records class of a fork of Tax-Calculator rather than a new, "off-calculator" version of_extrapolate
, I would be happy to set up an example and then walk you through how I did it.
Many thanks, @MattHJensen. As you inferred, I'm good for now as there are a lot of things in line ahead of this. My eventual goal is to be able to create a growfactors_custom.csv file that has any set of columns I desire, and then pass that file or file name or growfactor object to _extrapolate_custom along with a dict that maps every desired-to-be-grown puf variable (with the puf variable names as dict keys) to a column in growfactors_custom.csv (with growfactor_custom.csv column names as dict values), and then have _extrapolate_custom loop through the dict keys and multiply puf values for that puf variable by the growfactor defined by the key-value combination in the dict, for the year in question (not sure yet whether I should use the key-value combination (pufvariable-growfactorscolumn) to look up the value in a matrix of growfactors, or whether I should have the dict include the values for each year as lists -- TBD). This requires some thought about how to deal with the numpy.where() situations where a single puf variable might have two different growfactors depending on a logical condition. It seems a shame to let that prevent flexible customization, so I would look for a way to work around it.
Right now, as you inferred, I simply created an _extrapolate_custom function that hard codes the column names of a growfactors_custom dataframe (obtained from growfactors_custom.csv) in the multiplication process and looks up the year in the dataframe. It is a little slow (maybe 2 or 3 seconds) which seems far longer than something like this should take.
Once I figure out the dict approach as a free-standing function, I would love to know how to incorporate it as a method of Records in a fork of tax-calculator. I think I can figure out how to implement the dict approach as a function and will take a crack at it when other things are behind me and will seek feedback after I puzzle through it. If you think there is a better design than this for how to do completely flexible growfactor customization I'd much appreciate knowing; I have been thinking of the dict approach because my understanding is that value lookup is extremely fast with dicts.