
Save the result of "Preprocessing Text"

Open PrimozGodec opened this issue 5 years ago • 3 comments

@Rezabagheriloye reported in https://github.com/biolab/orange3/issues/5035:

I want to save the result of "Preprocessing Text" on the Corpus in CSV or TXT format. I used "Save Data", but the result was the same as the original corpus. Could anyone show me how to save the clean data that results from "Preprocessing Text"?

I wanted to suggest pickling the Corpus as a solution and discovered two new issues:

  • When I save the preprocessed corpus to a file, I get the following error (workflow: bug-save-corpus.ows.zip):
Traceback (most recent call last):
  File "/Users/primoz/Documents/orange3/Orange/widgets/utils/save/owsavebase.py", line 212, in save_file
    self._try_save()
  File "/Users/primoz/Documents/orange3/Orange/widgets/utils/save/owsavebase.py", line 223, in _try_save
    self.do_save()
  File "/Users/primoz/Documents/orange3/Orange/widgets/data/owsave.py", line 76, in do_save
    self.writer.write(self.filename, self.data, self.add_type_annotations)
  File "/Users/primoz/Documents/orange3/Orange/data/io_base.py", line 575, in write
    return cls.write_file(filename, data)
  File "/Users/primoz/Documents/orange3/Orange/data/io.py", line 222, in write_file
    pickle.dump(data, f, protocol=PICKLE_PROTOCOL)
TypeError: cannot pickle 'dict_keys' object
---------------------------------------------
  • Should we disable saving a corpus to CSV, TAB, ... and allow only .pkl, as is done for sparse data? Users are confused when they save a corpus to CSV and then discover that the preprocessing is not stored together with the corpus.
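
For reference, the `TypeError` in the traceback above can be reproduced without Orange. This is only a minimal sketch: which attribute of the preprocessed Corpus actually holds the `dict_keys` view is not identified in this issue, so the `state` dictionary and the `make_picklable` helper below are hypothetical stand-ins.

```python
import pickle
from collections.abc import KeysView

# Hypothetical object state: one value is a dict view, as in the traceback.
vocabulary = {"orange": 0, "text": 1}
state = {"name": "corpus", "used_words": vocabulary.keys()}

# Pickling fails because dict views cannot be pickled.
try:
    pickle.dumps(state)
    error_message = None
except TypeError as exc:
    error_message = str(exc)


def make_picklable(obj_state):
    """Return a copy of the state with dict key views materialised as lists."""
    return {
        key: list(value) if isinstance(value, KeysView) else value
        for key, value in obj_state.items()
    }


# After converting the view to a list, the round trip succeeds.
restored = pickle.loads(pickle.dumps(make_picklable(state)))
print(error_message)
print(restored["used_words"])
```

Converting the view to a list (or tuple) where it is stored is usually enough; whether that is the right fix for the Corpus depends on where the view is created.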

PrimozGodec avatar Oct 21 '20 12:10 PrimozGodec

Should we disable saving a corpus to CSV, TAB, ... and allow only .pkl, as is done for sparse data? Users are confused when they save a corpus to CSV and then discover that the preprocessing is not stored together with the corpus.

This would also prevent saving corpora downloaded with the Twitter, Wikipedia and similar widgets to CSV. I'm not in favour of removing it.

While I agree it is slightly confusing, I think it is common practice (in NLTK for example) to have a separate tokens object. I'd rather give a warning or describe this better in the docs.

ajdapretnar avatar Oct 21 '20 12:10 ajdapretnar

Agree with you @ajdapretnar, I would definitely add a warning to the widget.

PrimozGodec avatar Oct 21 '20 13:10 PrimozGodec

An idea: if a Corpus is on the input of Save Data, the widget raises a warning saying "To keep preprocessing, save as pickle (.pkl)". This should be implemented in orange3.
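
A rough sketch of the check, in plain Python rather than the actual widget code. The function name, the `has_tokens` flag, and the set of pickle suffixes are all assumptions made for illustration; the real Save Data widget would need to detect a preprocessed Corpus in its own way.

```python
from pathlib import Path
from typing import Optional

# Assumed set of suffixes that mean "pickle"; adjust to what the writer supports.
PICKLE_SUFFIXES = {".pkl", ".pickle"}
WARNING_TEXT = "To keep preprocessing, save as pickle (.pkl)"


def preprocessing_warning(has_tokens: bool, filename: str) -> Optional[str]:
    """Return a warning if preprocessed data is saved to a non-pickle file.

    has_tokens: hypothetical flag, True when the input is a preprocessed Corpus.
    filename: the path chosen in the save dialog.
    """
    suffixes = {suffix.lower() for suffix in Path(filename).suffixes}
    if has_tokens and not (suffixes & PICKLE_SUFFIXES):
        return WARNING_TEXT
    return None


print(preprocessing_warning(True, "corpus.csv"))
print(preprocessing_warning(True, "corpus.pkl"))
```

Saving to CSV stays possible this way, so corpora from Twitter, Wikipedia and similar widgets are unaffected; the user is only told that the tokens will not be stored.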

ajdapretnar avatar Apr 01 '22 08:04 ajdapretnar