haystack icon indicating copy to clipboard operation
haystack copied to clipboard

Strange behavior from `FileTypeClassifier`

Open anakin87 opened this issue 1 year ago • 3 comments

Describe the bug When passed a list of files with different extensions, FileTypeClassifier throws an error. Reading the code, this behavior is expected, but I can't understand why... https://github.com/deepset-ai/haystack/blob/c91316e862c3fb751b3e8996ddd5f99b5563ae81/haystack/nodes/file_classifier/file_type.py#L74-L78 Shouldn't this node classify the files and route them to different destinations? If this behavior is correct, it's not understandable from the docs.

Error message

Traceback (most recent call last): File "prova_classifier.py", line 12, in classifier.run(file_paths=docs_to_index) File "/home/anakin87/apps/haystack/haystack/nodes/file_classifier/file_type.py", line 114, in run extension = self._get_extension(paths) File "/home/anakin87/apps/haystack/haystack/nodes/file_classifier/file_type.py", line 98, in _get_extension raise ValueError(f"Multiple file types are not allowed at once.") ValueError: Multiple file types are not allowed at once.

To Reproduce

import os
from haystack.nodes import FileTypeClassifier

classifier = FileTypeClassifier(supported_types=["py", "sh", "png", "yml"])

docs_to_index = next(os.walk('./ata2'))[2]  # Substitute the path to reproduce
print("Docs to index:")
for doc in docs_to_index:
    print(f" - {doc}")

classifier.run(file_paths=docs_to_index)

Other little note https://github.com/deepset-ai/haystack/blob/c91316e862c3fb751b3e8996ddd5f99b5563ae81/haystack/nodes/file_classifier/file_type.py#L36-L41 Probably this limit is no more valid, standing to recent work and to https://github.com/deepset-ai/haystack/blob/c91316e862c3fb751b3e8996ddd5f99b5563ae81/haystack/nodes/file_classifier/file_type.py#L54-L56 Should the docstring be modified?

anakin87 avatar Aug 08 '22 19:08 anakin87

@ZanSara - looping you in here. This seems to have been changed in a PR from a while ago here

TuanaCelik avatar Aug 09 '22 22:08 TuanaCelik

Hello @anakin87 and welcome to the fantastic world of decision nodes in Haystack :joy:

The limitation you just met in FileTypeClassifier is by design. Decision nodes were not well thought out when Pipeline was first implemented, they came later as we realized we wanted to have parallel retrieval, document routing, classification nodes, etc. Therefore they're quirky to say the least.

The hurdle here is that BaseComponent.run returns a tuple that looks like {my output dictionary}, "output_<n>". It's a single tuple with the name of one edge, so in the current form, you can only direct all output on one edge or another. The amazing fact is that currently you can receive output from multiple nodes in parallel, but you cannot send output on multiple edges in parallel (with catches: see the discussion on the "magic" inputs parameter and JoinNode subclasses if you want a headache: https://github.com/deepset-ai/haystack/pull/2981#issuecomment-1207850632 and other linked issues). The "illusion" of branched pipelines is that separate run() calls might take separate paths, which is fine for querying, but awkward for indexing.

To make this work, the solution is to not hit that branch: pass just one file at a time to the indexing pipeline, or group them by time before feeding them into the run method. Or you can use the run_batch method... if only it was working :smile: Note that run_batch doesn't have the limitations of run, but most implementations are buggy as they're not well tested, and very few users seems to be aware of them.

On a good note, this is an issue we're very aware of (at least I am) and we have plans to address this situation: https://github.com/deepset-ai/haystack/issues/2807. So for now I'd recommend not wasting too much energy on an implementation that has little hope for incremental improvements and just wait for the big-bang coming later this year. If there's anything you always wanted pipelines to be doing, go comment under that issue and there's a good change it will see the light before the end of the year.

ZanSara avatar Aug 10 '22 09:08 ZanSara

PS: yes the docstrings needs to be modified, thank you for pointing that out. @agnieszka-m I think @brandenchan is working on it in https://github.com/deepset-ai/haystack-website/issues/374

ZanSara avatar Aug 10 '22 09:08 ZanSara

Thank you for the clarification @ZanSara! Should I close this issue or leave it open?

anakin87 avatar Aug 10 '22 09:08 anakin87

I think this one can be closed, but let's list it into the main refactoring issue for extra context :+1:

ZanSara avatar Aug 10 '22 09:08 ZanSara