
Bug: default adapter assumes the metadata column in the source data is a dict

Open · amangup opened this issue 11 months ago

In the last line of the method below, data.pop("metadata") can return a value that is not a dict (whatever type the source's metadata column happens to have), and the | merge with data then fails.

File: src/datatrove/pipeline/readers/base.py


    def _default_adapter(self, data: dict, path: str, id_in_file: int | str):
        """
        The default data adapter to adapt input data into the datatrove Document format

        Args:
            data: a dictionary with the "raw" representation of the data
            path: file path or source for this sample
            id_in_file: its id in this particular file or source

        Returns: a dictionary with text, id, media and metadata fields

        """
        return {
            "text": data.pop(self.text_key, ""),
            "id": data.pop(self.id_key, f"{path}/{id_in_file}"),
            "media": data.pop("media", []),
            "metadata": data.pop("metadata", {}) | data,  # remaining data goes into metadata
        }
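
One possible defensive rewrite is sketched below. This is not an upstream patch, just one option: if the incoming metadata value is not a dict, preserve it under a key (the key name "metadata" is an arbitrary choice here) rather than attempting the | merge.

    def _default_adapter(self, data: dict, path: str, id_in_file: int | str):
        metadata = data.pop("metadata", {})
        if not isinstance(metadata, dict):
            # arbitrary choice: keep a non-dict metadata value (e.g. a plain
            # string column) under a key instead of merging it directly
            metadata = {"metadata": metadata}
        return {
            "text": data.pop(self.text_key, ""),
            "id": data.pop(self.id_key, f"{path}/{id_in_file}"),
            "media": data.pop("media", []),
            "metadata": metadata | data,  # remaining fields still go into metadata
        }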

It happened when I tried to tokenize FineMath, which has a metadata column of plain string type.

Traceback (most recent call last):
  File "/usr/local/lib/python3.10/dist-packages/multiprocess/pool.py", line 125, in worker
    result = (True, func(*args, **kwds))
  File "/usr/local/lib/python3.10/dist-packages/datatrove/executor/local.py", line 76, in _launch_run_for_rank
    return self._run_for_rank(rank, local_rank)
  File "/usr/local/lib/python3.10/dist-packages/datatrove/executor/base.py", line 109, in _run_for_rank
    raise e
  File "/usr/local/lib/python3.10/dist-packages/datatrove/executor/base.py", line 90, in _run_for_rank
    pipelined_data = pipeline_step(pipelined_data, rank, self.world_size)
  File "/usr/local/lib/python3.10/dist-packages/datatrove/pipeline/base.py", line 119, in __call__
    return self.run(data, rank, world_size)
  File "/usr/local/lib/python3.10/dist-packages/datatrove/pipeline/tokens/tokenizer.py", line 390, in run
    outputfile: TokenizedFile = self.write_unshuffled(data, unshuf_filename)
  File "/usr/local/lib/python3.10/dist-packages/datatrove/pipeline/tokens/tokenizer.py", line 359, in write_unshuffled
    for batch in batched(data, self.batch_size):
  File "/usr/local/lib/python3.10/dist-packages/datatrove/utils/batching.py", line 20, in batched
    while batch := list(itertools.islice(it, n)):
  File "/usr/local/lib/python3.10/dist-packages/datatrove/pipeline/readers/huggingface.py", line 125, in run
    document = self.get_document_from_dict(line, self.dataset, f"{rank:05d}/{li}")
  File "/usr/local/lib/python3.10/dist-packages/datatrove/pipeline/readers/huggingface.py", line 60, in get_document_from_dict
    document = super().get_document_from_dict(data, source_file, id_in_file)
  File "/usr/local/lib/python3.10/dist-packages/datatrove/pipeline/readers/base.py", line 79, in get_document_from_dict
    parsed_data = self.adapter(data, source_file, id_in_file)
  File "/usr/local/lib/python3.10/dist-packages/datatrove/pipeline/readers/base.py", line 65, in _default_adapter
    "metadata": data.pop("metadata", {}) | data,  # remaining data goes into metadata
TypeError: unsupported operand type(s) for |: 'str' and 'dict'
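
As a workaround until this is fixed, the reader's adapter argument can be used to normalize the value before delegating to the default adapter. A sketch, assuming custom adapters are bound as methods on the reader (so they receive self, matching the _default_adapter signature above); the dataset id and options below are illustrative and may need adjusting:

    from datatrove.pipeline.readers import HuggingFaceDatasetReader

    def safe_adapter(self, data: dict, path: str, id_in_file: int | str) -> dict:
        # wrap a non-dict metadata value before handing off to the default adapter
        if "metadata" in data and not isinstance(data["metadata"], dict):
            data["metadata"] = {"metadata": data.pop("metadata")}
        return self._default_adapter(data, path, id_in_file)

    reader = HuggingFaceDatasetReader(
        "HuggingFaceTB/finemath",  # illustrative dataset id
        dataset_options={"name": "finemath-3plus", "split": "train"},  # illustrative config
        adapter=safe_adapter,
    )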

amangup · Jan 25 '25

Same problem here.

DarthMurse · Mar 05 '25

Same problem here.

ftgreat · Mar 17 '25