[WIP] Add operator to support in-place updating of input tensor
Add a prim operator to support nvfuser's fd.add_output(output, alias_input), which allows in-place updates of an input tensor. The motivating use case is updating the running stats in batch norm.
This prim operator (currently named prims.input_as_output; happy to change it if you have a better name) should only be used internally when decomposing other operators. Its torch implementation is copy_, and its nvfuser implementation is fd.add_output(output, alias_input).
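To make the intended semantics concrete, here is a minimal eager-mode sketch of how a copy_-based reference could update batch norm's running mean. The helper names `input_as_output_reference` and `update_running_mean` and their signatures are assumptions for illustration only, not the prim registered in this PR; the argument order simply mirrors fd.add_output(output, alias_input).

```python
import torch


def input_as_output_reference(output: torch.Tensor, alias_input: torch.Tensor) -> None:
    """Hypothetical eager reference: write `output` into `alias_input` in place.

    Returns None, since the result is observed only through the aliased input.
    """
    alias_input.copy_(output)


def update_running_mean(x: torch.Tensor, running_mean: torch.Tensor, momentum: float = 0.1) -> None:
    # Per-channel mean over the batch and spatial dims of an NCHW tensor.
    batch_mean = x.mean(dim=(0, 2, 3))
    # Standard exponential moving average used for batch norm running stats.
    new_mean = (1 - momentum) * running_mean + momentum * batch_mean
    # Overwrite the caller's buffer instead of returning a new tensor.
    input_as_output_reference(new_mean, running_mean)


x = torch.ones(4, 3, 8, 8)
running_mean = torch.zeros(3)
ptr_before = running_mean.data_ptr()
result = update_running_mean(x, running_mean)
```

After the call, `running_mean` still refers to the same storage but holds the updated values, and `result` is None, which is the "output is None" case mentioned in the TODOs.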
TODOs: The current implementation is not yet compatible with the existing framework in some places, and I'm still debugging it. For example, some of our existing code may assume the first output is always a proxy, but here the output is None.
cc: @IvanYashchuk