datatrove
Bug: Default Adapter assumes type of metadata column in source data
In the last line of the method below, data.pop("metadata", {}) can return a value that is not a dict, in which case the | merge with the remaining data fails.
File: src/datatrove/pipeline/readers/base.py
def _default_adapter(self, data: dict, path: str, id_in_file: int | str):
    """
    The default data adapter to adapt input data into the datatrove Document format

    Args:
        data: a dictionary with the "raw" representation of the data
        path: file path or source for this sample
        id_in_file: its id in this particular file or source

    Returns: a dictionary with text, id, media and metadata fields
    """
    return {
        "text": data.pop(self.text_key, ""),
        "id": data.pop(self.id_key, f"{path}/{id_in_file}"),
        "media": data.pop("media", []),
        "metadata": data.pop("metadata", {}) | data,  # remaining data goes into metadata
    }
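
A straightforward fix would be to normalize the metadata value before merging, e.g. by nesting a non-dict value under its original key. A minimal sketch of what the default adapter could do instead (not an official patch):

def _default_adapter(self, data: dict, path: str, id_in_file: int | str):
    metadata = data.pop("metadata", {})
    if not isinstance(metadata, dict):
        # e.g. FineMath stores metadata as a plain string
        metadata = {"metadata": metadata}
    return {
        "text": data.pop(self.text_key, ""),
        "id": data.pop(self.id_key, f"{path}/{id_in_file}"),
        "media": data.pop("media", []),
        "metadata": metadata | data,  # remaining data goes into metadata
    }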
This happened when I tried to tokenize FineMath, whose metadata column is a plain string rather than a dict.
Traceback (most recent call last):
  File "/usr/local/lib/python3.10/dist-packages/multiprocess/pool.py", line 125, in worker
    result = (True, func(*args, **kwds))
  File "/usr/local/lib/python3.10/dist-packages/datatrove/executor/local.py", line 76, in _launch_run_for_rank
    return self._run_for_rank(rank, local_rank)
  File "/usr/local/lib/python3.10/dist-packages/datatrove/executor/base.py", line 109, in _run_for_rank
    raise e
  File "/usr/local/lib/python3.10/dist-packages/datatrove/executor/base.py", line 90, in _run_for_rank
    pipelined_data = pipeline_step(pipelined_data, rank, self.world_size)
  File "/usr/local/lib/python3.10/dist-packages/datatrove/pipeline/base.py", line 119, in __call__
    return self.run(data, rank, world_size)
  File "/usr/local/lib/python3.10/dist-packages/datatrove/pipeline/tokens/tokenizer.py", line 390, in run
    outputfile: TokenizedFile = self.write_unshuffled(data, unshuf_filename)
  File "/usr/local/lib/python3.10/dist-packages/datatrove/pipeline/tokens/tokenizer.py", line 359, in write_unshuffled
    for batch in batched(data, self.batch_size):
  File "/usr/local/lib/python3.10/dist-packages/datatrove/utils/batching.py", line 20, in batched
    while batch := list(itertools.islice(it, n)):
  File "/usr/local/lib/python3.10/dist-packages/datatrove/pipeline/readers/huggingface.py", line 125, in run
    document = self.get_document_from_dict(line, self.dataset, f"{rank:05d}/{li}")
  File "/usr/local/lib/python3.10/dist-packages/datatrove/pipeline/readers/huggingface.py", line 60, in get_document_from_dict
    document = super().get_document_from_dict(data, source_file, id_in_file)
  File "/usr/local/lib/python3.10/dist-packages/datatrove/pipeline/readers/base.py", line 79, in get_document_from_dict
    parsed_data = self.adapter(data, source_file, id_in_file)
  File "/usr/local/lib/python3.10/dist-packages/datatrove/pipeline/readers/base.py", line 65, in _default_adapter
    "metadata": data.pop("metadata", {}) | data,  # remaining data goes into metadata
TypeError: unsupported operand type(s) for |: 'str' and 'dict'
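
Until the default adapter handles this, a workaround is to pass a custom adapter to the reader. A minimal sketch, assuming the adapter is bound to the reader instance (datatrove's custom-adapter convention, where the callable receives self as its first argument) and using FineMath as in the traceback; the dataset and config names are illustrative:

from datatrove.pipeline.readers import HuggingFaceDatasetReader

def string_safe_adapter(self, data: dict, path: str, id_in_file: int | str):
    # wrap a non-dict metadata value so the default | merge works
    if "metadata" in data and not isinstance(data["metadata"], dict):
        data["metadata"] = {"metadata": data["metadata"]}
    return self._default_adapter(data, path, id_in_file)

reader = HuggingFaceDatasetReader(
    "HuggingFaceTB/finemath",
    dataset_options={"name": "finemath-3plus", "split": "train"},
    adapter=string_safe_adapter,
)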
Same problem here.