graph_def_editor
ValueError("Operation {} does not belong to given graph".format(op)) when running get walk ops functions
Hello,
I'm currently using your library to do some operations on the graph of a TensorFlow 2 model, and I'm having trouble figuring out the proper way to convert a tensor to either a gde.Node or gde.Tensor object for use with the library's functions. I'm converting my tensors as follows (gra is the name of my gde.Graph object, for reference):

xs_g = [gde.Node(x, x.name, x.op, g=gra) for x in xs]
ys_g = gde.Node(ys, ys.name, ys.op, g=gra)

After converting the tensors this way, when I run get_backward_walk_ops on my ys_g I get a placeholder operation, and when I run get_forward_walk_ops on the xs_g I get ValueError("Operation {} does not belong to given graph".format(op)). Looking at the code in the util file, I see that this error is raised after checking whether the op has a value for its graph attribute, so I'm guessing that attribute is what's causing issues in my code. How can I make sure it gets a value during the conversion? Any help is appreciated, thank you!
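For comparison, a conversion that avoids constructing gde.Node by hand is to look the objects up in the deserialized graph by name; objects returned by the graph itself would already carry the graph reference that the util check looks for. A minimal sketch, assuming gde.Graph exposes get_node_by_name and get_tensor_by_name lookups (assumptions about the API, not confirmed in this thread):

import tensorflow.compat.v1 as tf
import graph_def_editor as gde

# Build a trivial graph to convert.
tf_g = tf.Graph()
with tf_g.as_default():
    x = tf.placeholder(tf.float32, [2, 2], name="x")
    y = tf.identity(x, name="y")

# Deserialize into a gde.Graph, as in the snippets below.
gra = gde.Graph(tf_g.as_graph_def())

# Look the objects up by name so they are gra's own copies, with their
# graph attribute and outputs already populated.
x_node = gra.get_node_by_name(x.op.name)    # gde.Node for the "x" op
y_tensor = gra.get_tensor_by_name(y.name)   # gde.Tensor for "y:0"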
Thanks for reaching out @NicholasMcElroy! Might you have a self-contained piece of Python code that reproduces the problem you are seeing?
It's a bit complex as this is a function that uses variables from another script, but here's the snippet I'm working on:
def gradients(ys, xs, graph, grad_ys=None, **kwargs):
    # Serialize graph for use within this function
    g = gde.Graph(graph.as_graph_def())
    xs_g = []
    for x in xs:
        xs_g.append(gde.Node(x, x.name, x.op, g=g))
    ys_g = gde.Node(ys, ys.name, ys.op, g=g)
    # Get a list of forward and backward operations
    ops_list = gde.make_list_of_op(g, allow_graph=True)
    back_ops = gde.get_backward_walk_ops(ys_g,
                                         inclusive=True)
    debug_print("back_ops: %s", back_ops)
    fwd_ops = gde.get_forward_walk_ops(xs_g,
                                       inclusive=True,
                                       within_ops=back_ops)
And here's where the function is called:
tf_g = tf.Graph()
with tf_g.as_default():
    args = parser.parse_args()
    enc = encoder.get_encoder(args.model_name, models_dir=args.models_dir)
    hparams = model.default_hparams()
    with open(os.path.join('models', args.model_name, 'hparams.json')) as f:
        hparams.override_from_dict(json.load(f))

    if args.sample_length > hparams.n_ctx:
        raise ValueError(
            "Can't get samples longer than window size: %s" % hparams.n_ctx)

    with tf.Session() as sess:
        # Fully static shape required to make memory accounting in
        # twremat accurate.
        train_context = tf.placeholder(tf.int32, [args.batch_size, 1024])
        train_context_in = randomize(train_context, hparams, args.noise)
        train_output = model.model(hparams=hparams, X=train_context_in)
        train_loss = tf.reduce_mean(
            tf.nn.sparse_softmax_cross_entropy_with_logits(
                labels=train_context[:, 1:], logits=train_output['logits'][:, :-1]))

        if args.val_every > 0:
            val_context = tf.placeholder(tf.int32, [args.val_batch_size, None])
            val_output = model.model(hparams=hparams, X=val_context)
            val_loss = tf.reduce_mean(
                tf.nn.sparse_softmax_cross_entropy_with_logits(
                    labels=val_context[:, 1:], logits=val_output['logits'][:, :-1]))
            val_loss_summary = tf.summary.scalar('val_loss', val_loss)

        sample_context = tf.placeholder(tf.int32, [args.batch_size, None])
        tf_sample = sample.sample_sequence(
            hparams=hparams,
            length=args.sample_length,
            context=sample_context,
            batch_size=args.batch_size,
            temperature=1.0,
            top_k=args.top_k,
            top_p=args.top_p)

        all_vars = [v for v in tf.trainable_variables() if 'model' in v.name]
        train_vars = [v for v in all_vars if '/h' in v.name] if args.only_train_transformer_layers else all_vars

        opt_grads = gradients(train_loss, train_vars, tf_g)
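Compressed into something closer to the self-contained reproduction asked for above, the pattern would look roughly like the following sketch (hypothetical: the tiny graph is an invention here, and the gde.Node argument list simply copies the gradients() snippet, so it may not match the actual constructor signature):

import tensorflow.compat.v1 as tf
import graph_def_editor as gde

# Build a minimal graph: one variable feeding one downstream op.
tf_g = tf.Graph()
with tf_g.as_default():
    v = tf.Variable([1.0], name="v")
    out = tf.identity(v, name="out")

# Round-trip through a GraphDef, as gradients() does.
g = gde.Graph(tf_g.as_graph_def())

# Hand-construct a gde.Node instead of fetching g's own copy; a node
# built this way is never registered with g, so its graph attribute
# and outputs never get populated from g's contents.
node = gde.Node(out, out.name, out.op, g=g)

# Walking from the hand-built node is where the errors above surface.
fwd_ops = gde.get_forward_walk_ops(node, inclusive=True)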
Sorry, I'm still having trouble reproducing this. Could you provide a stack trace so I can see which of the calls from get_forward_walk_ops() to get_unique_graph() is triggering this error?
I've been messing around with it a bit, so the error I'm getting is now a little different, but here's the current stack trace:
Traceback (most recent call last):
  File "./traintest.py", line 325, in <module>
    main()
  File "./traintest.py", line 146, in main
    opt_grads = tensorgrader.gradients(train_loss, train_vars, tf_g)
  File "/usr/local/lib/python3.7/dist-packages/tensorflow/python/eager/def_function.py", line 889, in __call__
    result = self._call(*args, **kwds)
  File "/usr/local/lib/python3.7/dist-packages/tensorflow/python/eager/def_function.py", line 933, in _call
    self._initialize(args, kwds, add_initializers_to=initializers)
  File "/usr/local/lib/python3.7/dist-packages/tensorflow/python/eager/def_function.py", line 764, in _initialize
    *args, **kwds))
  File "/usr/local/lib/python3.7/dist-packages/tensorflow/python/eager/function.py", line 3050, in _get_concrete_function_internal_garbage_collected
    graph_function, _ = self._maybe_define_function(args, kwargs)
  File "/usr/local/lib/python3.7/dist-packages/tensorflow/python/eager/function.py", line 3444, in _maybe_define_function
    graph_function = self._create_graph_function(args, kwargs)
  File "/usr/local/lib/python3.7/dist-packages/tensorflow/python/eager/function.py", line 3289, in _create_graph_function
    capture_by_value=self._capture_by_value),
  File "/usr/local/lib/python3.7/dist-packages/tensorflow/python/framework/func_graph.py", line 999, in func_graph_from_py_func
    func_outputs = python_func(*func_args, **func_kwargs)
  File "/usr/local/lib/python3.7/dist-packages/tensorflow/python/eager/def_function.py", line 672, in wrapped_fn
    out = weak_wrapped_fn().__wrapped__(*args, **kwds)
  File "/usr/local/lib/python3.7/dist-packages/tensorflow/python/framework/func_graph.py", line 986, in wrapper
    raise e.ag_error_metadata.to_exception(e)
ValueError: in user code:

    /content/drive/MyDrive/nlp/tensorgrader.py:30 gradients *
        fwd_ops = gde.get_forward_walk_ops(xs_n,
    /usr/local/lib/python3.7/dist-packages/graph_def_editor/select.py:466 get_forward_walk_ops *
        for new_t in op.outputs:
    /usr/local/lib/python3.7/dist-packages/graph_def_editor/node.py:170 outputs
        raise ValueError("Outputs of {} have not been set".format(self))

    ValueError: Outputs of Node[<bound method BaseResourceVariable.value of <tf.Variable 'model/h11/attn/c_attn/w:0' shape=(1, 768, 2304) dtype=float32>>|name: "model/h11/attn/c_attn/w"
    op: "VarHandleOp"
    attr {
      key: "_class"
      value {
        list {
          s: "loc:@model/h11/attn/c_attn/w"
        }
      }
    }
    attr {
      key: "allowed_devices"
      value {
        list {
        }
      }
    }
    attr {
      key: "container"
      value {
        s: ""
      }
    }
    attr {
      key: "dtype"
      value {
        type: DT_FLOAT
      }
    }
    attr {
      key: "shape"
      value {
        shape {
          dim {
            size: 1
          }
          dim {
            size: 768
          }
          dim {
            size: 2304
          }
        }
      }
    }
    attr {
      key: "shared_name"
      value {
        s: "model/h11/attn/c_attn/w"
      }
    }
    ] have not been set
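The node in the trace is the VarHandleOp behind one of the model's weight variables, and the failure comes from the outputs property at node.py:170. To gauge how many nodes end up in this state, one could drop a check like the following sketch into gradients() right after g is built (treating g.nodes as an iterable of the graph's gde.Node objects is an assumption about the API here):

# Diagnostic sketch: list every node in the deserialized graph whose
# outputs were never populated; n.outputs raising ValueError is the
# same failure shown at node.py:170 in the trace above.
for n in g.nodes:
    try:
        _ = n.outputs
    except ValueError:
        print("outputs not set for:", n.name)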
Sorry for the delay in getting back to this. The most recent stack trace seems to indicate that there's a problem in the conversion from protocol buffers to Node and Graph objects. I've added some defensive type-checking code to the Node class's constructor that will hopefully catch the problem closer to its root cause. The code is currently in this branch: https://github.com/frreiss/graph_def_editor_fred/tree/node-type-check

Could you try running your program against the code in that branch and seeing what error results?
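For reference, one way to test against that branch without a manual clone is pip's standard Git install syntax (the exact command, .git suffix, and --upgrade flag are conventional, not from this thread):

pip install --upgrade git+https://github.com/frreiss/graph_def_editor_fred.git@node-type-check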