Model.fit call generates error ValueError: Shape must be rank 2 but is rank 3
I get the error below in my TFRS prototype, where I build a model with user IDs and item IDs and no other features.
Does this error indicate a mismatch of batch sizes? If so, between what and what? In the 'retrieval' example, I don't see any batch-size matching, e.g. between the train dataset and any other dataset. Any clues would be appreciated.
WARNING:tensorflow:Model was constructed with shape (None, 1000) for input KerasTensor(type_spec=TensorSpec(shape=(None, 1000), dtype=tf.string, name='string_lookup_1_input'), name='string_lookup_1_input', description="created by layer 'string_lookup_1_input'"), but it was called on an input with incompatible shape (None,).
WARNING:tensorflow:From /home/hadoop/.local/lib/python3.7/site-packages/tensorflow/python/ops/array_ops.py:5049: calling gather (from tensorflow.python.ops.array_ops) with validate_indices is deprecated and will be removed in a future version.
Instructions for updating:
The `validate_indices` argument has no effect. Indices are always validated on CPU and never validated on GPU.
Traceback (most recent call last):
File "/mnt/tmp/spark-23c1419e-4a5c-4ec7-a86f-1f6f23be73d3/recsys_tfrs_songs.py", line 90, in <module>
main(sys.argv)
File "/mnt/tmp/spark-23c1419e-4a5c-4ec7-a86f-1f6f23be73d3/recsys_tfrs_songs.py", line 61, in main
model_maker.train_and_evaluate(model, NUM_TRAIN_EPOCHS)
File "/mnt/tmp/spark-23c1419e-4a5c-4ec7-a86f-1f6f23be73d3/recsys-deps.zip/recommender_system/recsys_tf/recsys_tfrs_model.py", line 151, in train_and_evaluate
File "/home/hadoop/.local/lib/python3.7/site-packages/tensorflow/python/keras/engine/training.py", line 1152, in fit
tmp_logs = self.train_function(iterator)
File "/home/hadoop/.local/lib/python3.7/site-packages/tensorflow/python/eager/def_function.py", line 867, in __call__
result = self._call(*args, **kwds)
File "/home/hadoop/.local/lib/python3.7/site-packages/tensorflow/python/eager/def_function.py", line 911, in _call
self._initialize(args, kwds, add_initializers_to=initializers)
File "/home/hadoop/.local/lib/python3.7/site-packages/tensorflow/python/eager/def_function.py", line 749, in _initialize
*args, **kwds))
File "/home/hadoop/.local/lib/python3.7/site-packages/tensorflow/python/eager/function.py", line 3045, in _get_concrete_function_internal_garbage_collected
graph_function, _ = self._maybe_define_function(args, kwargs)
File "/home/hadoop/.local/lib/python3.7/site-packages/tensorflow/python/eager/function.py", line 3439, in _maybe_define_function
graph_function = self._create_graph_function(args, kwargs)
File "/home/hadoop/.local/lib/python3.7/site-packages/tensorflow/python/eager/function.py", line 3284, in _create_graph_function
capture_by_value=self._capture_by_value),
File "/home/hadoop/.local/lib/python3.7/site-packages/tensorflow/python/framework/func_graph.py", line 998, in func_graph_from_py_func
func_outputs = python_func(*func_args, **func_kwargs)
File "/home/hadoop/.local/lib/python3.7/site-packages/tensorflow/python/eager/def_function.py", line 657, in wrapped_fn
out = weak_wrapped_fn().__wrapped__(*args, **kwds)
File "/home/hadoop/.local/lib/python3.7/site-packages/tensorflow/python/framework/func_graph.py", line 985, in wrapper
raise e.ag_error_metadata.to_exception(e)
ValueError: in user code:
/home/hadoop/.local/lib/python3.7/site-packages/tensorflow/python/keras/engine/training.py:847 train_function *
return step_function(self, iterator)
/home/hadoop/.local/lib/python3.7/site-packages/tensorflow_recommenders/tasks/retrieval.py:157 call *
update_op = self._factorized_metrics.update_state(query_embeddings,
/home/hadoop/.local/lib/python3.7/site-packages/tensorflow_recommenders/metrics/factorized_top_k.py:83 update_state *
top_k_predictions, _ = self._candidates(query_embeddings, k=self._k)
/home/hadoop/.local/lib/python3.7/site-packages/tensorflow_recommenders/layers/factorized_top_k.py:224 top_k *
joined_scores = tf.concat([state_scores, x_scores], axis=1)
/home/hadoop/.local/lib/python3.7/site-packages/tensorflow/python/util/dispatch.py:206 wrapper **
return target(*args, **kwargs)
/home/hadoop/.local/lib/python3.7/site-packages/tensorflow/python/ops/array_ops.py:1768 concat
return gen_array_ops.concat_v2(values=values, axis=axis, name=name)
/home/hadoop/.local/lib/python3.7/site-packages/tensorflow/python/ops/gen_array_ops.py:1208 concat_v2
"ConcatV2", values=values, axis=axis, name=name)
/home/hadoop/.local/lib/python3.7/site-packages/tensorflow/python/framework/op_def_library.py:750 _apply_op_helper
attrs=attr_protos, op_def=op_def)
/home/hadoop/.local/lib/python3.7/site-packages/tensorflow/python/framework/func_graph.py:600 _create_op_internal
compute_device)
/home/hadoop/.local/lib/python3.7/site-packages/tensorflow/python/framework/ops.py:3554 _create_op_internal
op_def=op_def)
/home/hadoop/.local/lib/python3.7/site-packages/tensorflow/python/framework/ops.py:2031 __init__
control_input_ops, op_def)
/home/hadoop/.local/lib/python3.7/site-packages/tensorflow/python/framework/ops.py:1872 _create_c_op
raise ValueError(str(e))
ValueError: Shape must be rank 2 but is rank 3 for '{{node concat}} = ConcatV2[N=2, T=DT_FLOAT, Tidx=DT_INT32](args_0, args_2, concat/axis)' with input shapes: [?,0], [?,?,?], [].
Here's what my datasets look like. Where is the mismatch?
********************************************************************************
@@@ items_ds in init type: <class 'tensorflow.python.data.ops.dataset_ops.MapDataset'>
@@@ record type: <class 'tensorflow.python.data.ops.dataset_ops.TakeDataset'>
@@@ x type: <class 'numpy.ndarray'>
@@@ x is ndarray
b'music:376223'
********************************************************************************
********************************************************************************
@@@ events_ds in init type: <class 'tensorflow.python.data.ops.dataset_ops.MapDataset'>
@@@ record type: <class 'tensorflow.python.data.ops.dataset_ops.TakeDataset'>
@@@ x type: <class 'dict'>
{'item_id': b'music:12274071', 'user_id': b'artist:15523352'}
********************************************************************************
********************************************************************************
@@@ train_events_ds in init type: <class 'tensorflow.python.data.ops.dataset_ops.TakeDataset'>
@@@ record type: <class 'tensorflow.python.data.ops.dataset_ops.TakeDataset'>
@@@ x type: <class 'dict'>
{'item_id': b'music:12274071', 'user_id': b'artist:15523352'}
********************************************************************************
********************************************************************************
@@@ cached_train_event_ds in create_model type: <class 'tensorflow.python.data.ops.dataset_ops.CacheDataset'>
@@@ record type: <class 'tensorflow.python.data.ops.dataset_ops.TakeDataset'>
@@@ x type: <class 'dict'>
{'item_id': array([b'music:12274071', b'music:12501193', b'music:7864297', ...,
b'music:11953766', b'music:10805147', b'music:11953766'],
dtype=object),
'user_id': array([b'artist:15523352', b'artist:12930551', b'artist:31057444', ...,
b'artist:32581820', b'artist:36023938', b'artist:30037204'],
dtype=object)}
********************************************************************************
cached_train_event_ds is what gets passed into the Model.fit method
I have a similar issue.
@maciejkula Hi Maciej, could we please have someone assigned to this issue? There are two occurrences now, for me and for Erik.
@erikmajlath
I have a similar issue.
Is the stack trace exactly, or nearly exactly, the same? And similar datasets?
I think I got past this. When I switched to compiling with run_eagerly=True, the error was different: it said it got shape [32] (my embedding size) but expected a matrix. That gave me the idea that I must be sending one dimension too few into training. My mistake was that I had forgotten to create batches from the dataset with .batch(batch_size).
I hope this helps.
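Roughly, the fix looked like this (a sketch; the dataset and model names are illustrative):

import tensorflow as tf

BATCH_SIZE = 8192  # illustrative; whatever fits your setup

# before, train_ds elements were single {'user_id': ..., 'item_id': ...} examples,
# so the towers saw scalars where they expected a batch dimension
cached_train = train_ds.shuffle(100_000).batch(BATCH_SIZE).cache()
model.fit(cached_train, epochs=3)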
@erikmajlath It's something along these lines for me too. Here's what I see when I run the retrieval sample:
the movies ds is shaped like this:
movies = movies.map(lambda x: x["movie_title"])
<MapDataset shapes: (), types: tf.string>
the candidates ds is shaped like this:
cands = movies.batch(128).map(movie_model)
metrics = tfrs.metrics.FactorizedTopK(candidates=cands)
cands is:
<MapDataset shapes: (None, 32), types: tf.float32>
However, when I run my code, my items dataset is
cands = items_ds.batch(128).map(item_model)
metrics = tfrs.metrics.FactorizedTopK(candidates=cands)
candidates:
<MapDataset shapes: (None, 1000, 32), types: tf.float32>
I'm getting that extra 1000 when I invoke
items_ds = tf.data.experimental.make_csv_dataset(
local_file_list, column_names=["item_id"], batch_size=1000, num_parallel_reads=50, sloppy=True,
)
I'm thinking of trying to invoke it without setting the batch size so this 1K doesn't get wired in (hmm, batch_size is required there). If not, maybe run_eagerly=True as you were saying...
@erikmajlath @maciejkula
I set run_eagerly=True but am still getting that error. I'm not grokking how I end up with rank 2 against rank 3.
My train dataset is just like the 'retrieval' example: it has elements of type dict, with an item_id that's an array of strings and a user_id that's an array of strings. Presumably, that's rank 2.
Both the query tower and the candidate tower are rank 1; they're arrays of IDs.
input shapes: [8192,0], [?,8192,?], []
This last shape [] seems extraneous. Where could it come from?
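To convince myself of the mechanics, I mimicked the shapes from the message with plain tensors (toy values, just to reproduce the failure):

import tensorflow as tf

state_scores = tf.zeros([8192, 0])       # rank 2, like the metric's empty state scores
x_scores = tf.zeros([1, 8192, 10])       # rank 3: one batch dimension too many
tf.concat([state_scores, x_scores], axis=1)  # fails with a rank-mismatch error like the one above

That reproduces the rank mismatch, and the trailing [] in the message appears to be the scalar axis input to ConcatV2, not one of my tensors.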
OK, after going through my code, I finally found the issue. For me, it was that I had already batched the dataset before batching it again for the candidates. So in your case:
cands = items_ds.batch(128).map(item_model)
metrics = tfrs.metrics.FactorizedTopK(candidates=cands)
candidates:
<MapDataset shapes: (None, 1000, 32), types: tf.float32>
I think items_ds has already been batched, and that is where the extra dimension comes from.
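A quick way to check is to print the element spec before batching (a sketch):

print(items_ds.element_spec)
# TensorSpec(shape=(), ...)      -> unbatched, safe to .batch(128)
# TensorSpec(shape=(1000,), ...) -> already batched; .batch(128) adds a second batch dimension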
@MaiziXiao You're exactly right, that was my issue too. The items dataset got batched twice.
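Concretely, since make_csv_dataset always returns batched elements, unbatching before building the candidates fixed it for me, roughly (a sketch of my setup):

items_ds = tf.data.experimental.make_csv_dataset(
    local_file_list, column_names=["item_id"], batch_size=1000,
    num_parallel_reads=50, sloppy=True,
)
# undo the CSV reader's built-in batching so each element is a single item_id again
items_ds = items_ds.unbatch().map(lambda x: x["item_id"])

cands = items_ds.batch(128).map(item_model)   # now <MapDataset shapes: (None, 32), ...>
metrics = tfrs.metrics.FactorizedTopK(candidates=cands)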
The problem here is really the error message:
Shape must be rank 2 but is rank 3 for '{{node concat}} = ConcatV2[N=2, T=DT_FLOAT, Tidx=DT_INT32](args_0, args_2, concat/axis)' with input shapes: [?,0], [?,?,?], [].
First of all, it's hard to tell which dataset it's talking about, and secondly, why the [?,0], [?,?,?], []? If it at least listed the actual shape dimensions, it would give you an immediate clue as to what's going on.
We can probably close this issue now, though in my opinion this super-confusing message structure is a bug that TF should fix.
I have the same issue.
I also have the same issue, and it happens when the validation_data parameter is added to the .fit method. Otherwise the training process goes fine.
I'm just trying out the framework to see if I can use it on a project, so I also use the training set as the validation set.
These are the steps to construct training and validation datasets:
import pandas as pd
import tensorflow as tf

df1 = pd.read_csv('ratings.csv')
df1 = df1[["userId", "movieId"]]
df2 = df1[["movieId"]]
u_m_values = df2["movieId"].unique()

train = tf.data.Dataset.from_tensor_slices(dict(df1))
test = tf.data.Dataset.from_tensor_slices(dict(df1))
movies = tf.data.Dataset.from_tensor_slices(u_m_values)  # was u_values, which is undefined

cached_train = train.batch(1000).cache()
cached_test = test.batch(1000).cache()
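Just in case the cause is the same as above: when the error only appears with validation_data, it's worth double-checking that the candidates dataset handed to FactorizedTopK is batched exactly once (a sketch; movie_model stands in for whatever your candidate tower is):

import tensorflow_recommenders as tfrs

candidates = movies.batch(128).map(movie_model)  # shapes should be (None, embedding_dim)
task = tfrs.tasks.Retrieval(
    metrics=tfrs.metrics.FactorizedTopK(candidates=candidates)
)

model.fit(cached_train, validation_data=cached_test, epochs=3)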