Model.fit call generates error ValueError: Shape must be rank 2 but is rank 3
I get the error below in my TFRS prototype, where I build a model with user IDs and item IDs and no other features.
Does this error indicate a mismatch of batch sizes? If so, between what and what? In the 'retrieval' example, I don't see any batch-size matching, e.g. between the train dataset and any other dataset. Any clues would be appreciated.
WARNING:tensorflow:Model was constructed with shape (None, 1000) for input KerasTensor(type_spec=TensorSpec(shape=(None, 1000), dtype=tf.string, name='string_lookup_1_input'), name='string_lookup_1_input', description="created by layer 'string_lookup_1_input'"), but it was called on an input with incompatible shape (None,).
WARNING:tensorflow:From /home/hadoop/.local/lib/python3.7/site-packages/tensorflow/python/ops/array_ops.py:5049: calling gather (from tensorflow.python.ops.array_ops) with validate_indices is deprecated and will be removed in a future version.
Instructions for updating:
The `validate_indices` argument has no effect. Indices are always validated on CPU and never validated on GPU.
Traceback (most recent call last):
File "/mnt/tmp/spark-23c1419e-4a5c-4ec7-a86f-1f6f23be73d3/recsys_tfrs_songs.py", line 90, in <module>
main(sys.argv)
File "/mnt/tmp/spark-23c1419e-4a5c-4ec7-a86f-1f6f23be73d3/recsys_tfrs_songs.py", line 61, in main
model_maker.train_and_evaluate(model, NUM_TRAIN_EPOCHS)
File "/mnt/tmp/spark-23c1419e-4a5c-4ec7-a86f-1f6f23be73d3/recsys-deps.zip/recommender_system/recsys_tf/recsys_tfrs_model.py", line 151, in train_and_evaluate
File "/home/hadoop/.local/lib/python3.7/site-packages/tensorflow/python/keras/engine/training.py", line 1152, in fit
tmp_logs = self.train_function(iterator)
File "/home/hadoop/.local/lib/python3.7/site-packages/tensorflow/python/eager/def_function.py", line 867, in __call__
result = self._call(*args, **kwds)
File "/home/hadoop/.local/lib/python3.7/site-packages/tensorflow/python/eager/def_function.py", line 911, in _call
self._initialize(args, kwds, add_initializers_to=initializers)
File "/home/hadoop/.local/lib/python3.7/site-packages/tensorflow/python/eager/def_function.py", line 749, in _initialize
*args, **kwds))
File "/home/hadoop/.local/lib/python3.7/site-packages/tensorflow/python/eager/function.py", line 3045, in _get_concrete_function_internal_garbage_collected
graph_function, _ = self._maybe_define_function(args, kwargs)
File "/home/hadoop/.local/lib/python3.7/site-packages/tensorflow/python/eager/function.py", line 3439, in _maybe_define_function
graph_function = self._create_graph_function(args, kwargs)
File "/home/hadoop/.local/lib/python3.7/site-packages/tensorflow/python/eager/function.py", line 3284, in _create_graph_function
capture_by_value=self._capture_by_value),
File "/home/hadoop/.local/lib/python3.7/site-packages/tensorflow/python/framework/func_graph.py", line 998, in func_graph_from_py_func
func_outputs = python_func(*func_args, **func_kwargs)
File "/home/hadoop/.local/lib/python3.7/site-packages/tensorflow/python/eager/def_function.py", line 657, in wrapped_fn
out = weak_wrapped_fn().__wrapped__(*args, **kwds)
File "/home/hadoop/.local/lib/python3.7/site-packages/tensorflow/python/framework/func_graph.py", line 985, in wrapper
raise e.ag_error_metadata.to_exception(e)
ValueError: in user code:
/home/hadoop/.local/lib/python3.7/site-packages/tensorflow/python/keras/engine/training.py:847 train_function *
return step_function(self, iterator)
/home/hadoop/.local/lib/python3.7/site-packages/tensorflow_recommenders/tasks/retrieval.py:157 call *
update_op = self._factorized_metrics.update_state(query_embeddings,
/home/hadoop/.local/lib/python3.7/site-packages/tensorflow_recommenders/metrics/factorized_top_k.py:83 update_state *
top_k_predictions, _ = self._candidates(query_embeddings, k=self._k)
/home/hadoop/.local/lib/python3.7/site-packages/tensorflow_recommenders/layers/factorized_top_k.py:224 top_k *
joined_scores = tf.concat([state_scores, x_scores], axis=1)
/home/hadoop/.local/lib/python3.7/site-packages/tensorflow/python/util/dispatch.py:206 wrapper **
return target(*args, **kwargs)
/home/hadoop/.local/lib/python3.7/site-packages/tensorflow/python/ops/array_ops.py:1768 concat
return gen_array_ops.concat_v2(values=values, axis=axis, name=name)
/home/hadoop/.local/lib/python3.7/site-packages/tensorflow/python/ops/gen_array_ops.py:1208 concat_v2
"ConcatV2", values=values, axis=axis, name=name)
/home/hadoop/.local/lib/python3.7/site-packages/tensorflow/python/framework/op_def_library.py:750 _apply_op_helper
attrs=attr_protos, op_def=op_def)
/home/hadoop/.local/lib/python3.7/site-packages/tensorflow/python/framework/func_graph.py:600 _create_op_internal
compute_device)
/home/hadoop/.local/lib/python3.7/site-packages/tensorflow/python/framework/ops.py:3554 _create_op_internal
op_def=op_def)
/home/hadoop/.local/lib/python3.7/site-packages/tensorflow/python/framework/ops.py:2031 __init__
control_input_ops, op_def)
/home/hadoop/.local/lib/python3.7/site-packages/tensorflow/python/framework/ops.py:1872 _create_c_op
raise ValueError(str(e))
ValueError: Shape must be rank 2 but is rank 3 for '{{node concat}} = ConcatV2[N=2, T=DT_FLOAT, Tidx=DT_INT32](args_0, args_2, concat/axis)' with input shapes: [?,0], [?,?,?], [].
Here's what my datasets look like. Where is the mismatch?
********************************************************************************
@@@ items_ds in init type: <class 'tensorflow.python.data.ops.dataset_ops.MapDataset'>
@@@ record type: <class 'tensorflow.python.data.ops.dataset_ops.TakeDataset'>
@@@ x type: <class 'numpy.ndarray'>
@@@ x is ndarray
b'music:376223'
********************************************************************************
********************************************************************************
@@@ events_ds in init type: <class 'tensorflow.python.data.ops.dataset_ops.MapDataset'>
@@@ record type: <class 'tensorflow.python.data.ops.dataset_ops.TakeDataset'>
@@@ x type: <class 'dict'>
{'item_id': b'music:12274071', 'user_id': b'artist:15523352'}
********************************************************************************
********************************************************************************
@@@ train_events_ds in init type: <class 'tensorflow.python.data.ops.dataset_ops.TakeDataset'>
@@@ record type: <class 'tensorflow.python.data.ops.dataset_ops.TakeDataset'>
@@@ x type: <class 'dict'>
{'item_id': b'music:12274071', 'user_id': b'artist:15523352'}
********************************************************************************
********************************************************************************
@@@ cached_train_event_ds in create_model type: <class 'tensorflow.python.data.ops.dataset_ops.CacheDataset'>
@@@ record type: <class 'tensorflow.python.data.ops.dataset_ops.TakeDataset'>
@@@ x type: <class 'dict'>
{'item_id': array([b'music:12274071', b'music:12501193', b'music:7864297', ...,
b'music:11953766', b'music:10805147', b'music:11953766'],
dtype=object),
'user_id': array([b'artist:15523352', b'artist:12930551', b'artist:31057444', ...,
b'artist:32581820', b'artist:36023938', b'artist:30037204'],
dtype=object)}
********************************************************************************
cached_train_event_ds is what gets passed into the Model.fit method
I have a similar issue.
@maciejkula Hi Maciej, could we please have someone assigned to this issue? There are two occurrences now, for me and for Erik.
@erikmajlath
I have a similar issue.
Is the stack trace exactly, or nearly exactly, the same? And similar datasets?
I think I got past this. When I switched to compiling with run_eagerly=True, the error was different: it said it got shape [32] (my embedding size) but expected a matrix. That gave me the idea that I must be sending one dimension too few into training. My mistake was that I had forgotten to create batches from the dataset with .batch(batch_size).
I hope this helps.
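Roughly, the fix looked like this (a sketch; the dataset and model names are illustrative):

import tensorflow as tf

BATCH_SIZE = 8192  # illustrative; whatever fits your setup

# before, train_ds elements were single {'user_id': ..., 'item_id': ...} examples,
# so the towers saw scalars where they expected a batch dimension
cached_train = train_ds.shuffle(100_000).batch(BATCH_SIZE).cache()
model.fit(cached_train, epochs=3)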
@erikmajlath It's something along these lines for me too. Here's what I see when I run the retrieval sample:
the movies ds is shaped like this:
movies = movies.map(lambda x: x["movie_title"])
<MapDataset shapes: (), types: tf.string>
the candidates ds is shaped like this:
cands = movies.batch(128).map(movie_model)
metrics = tfrs.metrics.FactorizedTopK(candidates=cands)
cands is:
<MapDataset shapes: (None, 32), types: tf.float32>
However, when I run my code, my items dataset is
cands = items_ds.batch(128).map(item_model)
metrics = tfrs.metrics.FactorizedTopK(candidates=cands)
candidates:
<MapDataset shapes: (None, 1000, 32), types: tf.float32>
I'm getting that extra 1000 when I invoke
items_ds = tf.data.experimental.make_csv_dataset(
local_file_list, column_names=["item_id"], batch_size=1000, num_parallel_reads=50, sloppy=True,
)
I'm thinking of trying to invoke it without setting the batch size so this 1K doesn't get wired in (hmm, batch_size is required there). If not, maybe run_eagerly=True as you were saying...
@erikmajlath @maciejkula
I set run_eagerly=True but am still getting that error. I'm not grokking how I end up with rank 2 against rank 3.
My train dataset is just like the 'retrieval' example: it has elements of type dict, with an item_id that's an array of strings and a user_id that's an array of strings. Presumably, that's rank 2.
Both the query tower and the candidate tower are rank 1; they're arrays of IDs.
input shapes: [8192,0], [?,8192,?], []
This last shape [] seems extraneous. Where could it come from?
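To convince myself of the mechanics, I mimicked the shapes from the message with plain tensors (toy values, just to reproduce the failure):

import tensorflow as tf

state_scores = tf.zeros([8192, 0])       # rank 2, like the metric's empty state scores
x_scores = tf.zeros([1, 8192, 10])       # rank 3: one batch dimension too many
tf.concat([state_scores, x_scores], axis=1)  # fails with a rank-mismatch error like the one above

That reproduces the rank mismatch, and the trailing [] in the message appears to be the scalar axis input to ConcatV2, not one of my tensors.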
OK, after going through my code, I finally found the issue. For me, it was that I had already batched the dataset before batching it again for the candidates. So in your case:
cands = items_ds.batch(128).map(item_model)
metrics = tfrs.metrics.FactorizedTopK(candidates=cands)
candidates:
<MapDataset shapes: (None, 1000, 32), types: tf.float32>
I think items_ds has already been batched, and that is where the extra dimension comes from.
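A quick way to check is to print the element spec before batching (a sketch):

print(items_ds.element_spec)
# TensorSpec(shape=(), ...)      -> unbatched, safe to .batch(128)
# TensorSpec(shape=(1000,), ...) -> already batched; .batch(128) adds a second batch dimension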
@MaiziXiao You're exactly right, that was my issue too. The items dataset got batched twice.
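Concretely, since make_csv_dataset always returns batched elements, unbatching before building the candidates fixed it for me, roughly (a sketch of my setup):

items_ds = tf.data.experimental.make_csv_dataset(
    local_file_list, column_names=["item_id"], batch_size=1000,
    num_parallel_reads=50, sloppy=True,
)
# undo the CSV reader's built-in batching so each element is a single item_id again
items_ds = items_ds.unbatch().map(lambda x: x["item_id"])

cands = items_ds.batch(128).map(item_model)   # now <MapDataset shapes: (None, 32), ...>
metrics = tfrs.metrics.FactorizedTopK(candidates=cands)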
The problem here is really the error message:
Shape must be rank 2 but is rank 3 for '{{node concat}} = ConcatV2[N=2, T=DT_FLOAT, Tidx=DT_INT32](args_0, args_2, concat/axis)' with input shapes: [?,0], [?,?,?], [].
First of all, it's hard to tell which dataset it's talking about, and secondly, why the [?,0], [?,?,?], []? If it at least listed the actual shape dimensions, it would give you an immediate clue as to what's going on.
We can probably close this issue now, though in my opinion this super-confusing message structure is a bug that TF should fix.
I have the same issue.
I also have the same issue, and it happens when the validation_data parameter is added to the .fit method. Otherwise the training process goes fine.
I'm just trying out the framework to see if I can use it on a project, so I also use the training set as the validation set.
These are the steps to construct training and validation datasets:
import pandas as pd
import tensorflow as tf

df1 = pd.read_csv('ratings.csv')
df1 = df1[["userId", "movieId"]]
df2 = df1[["movieId"]]
u_m_values = df2["movieId"].unique()

train = tf.data.Dataset.from_tensor_slices(dict(df1))
test = tf.data.Dataset.from_tensor_slices(dict(df1))
movies = tf.data.Dataset.from_tensor_slices(u_m_values)  # was u_values, which is undefined

cached_train = train.batch(1000).cache()
cached_test = test.batch(1000).cache()
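Just in case the cause is the same as above: when the error only appears with validation_data, it's worth double-checking that the candidates dataset handed to FactorizedTopK is batched exactly once (a sketch; movie_model stands in for whatever your candidate tower is):

import tensorflow_recommenders as tfrs

candidates = movies.batch(128).map(movie_model)  # shapes should be (None, embedding_dim)
task = tfrs.tasks.Retrieval(
    metrics=tfrs.metrics.FactorizedTopK(candidates=candidates)
)

model.fit(cached_train, validation_data=cached_test, epochs=3)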