Save the result of "Preprocessing Text"
@Rezabagheriloye reported in https://github.com/biolab/orange3/issues/5035:
I want to save the result of "Preprocessing Text" of the Corpus in CSV or TXT format. I used "Save Data", but the result was the same as the original corpus. Could anyone show me how to save the clean data that "Preprocessing Text" produces?
I wanted to suggest pickling the Corpus as a solution, but discovered two new issues:
- Saving the preprocessed corpus to a file fails with the following error (workflow: bug-save-corpus.ows.zip):
Traceback (most recent call last):
  File "/Users/primoz/Documents/orange3/Orange/widgets/utils/save/owsavebase.py", line 212, in save_file
    self._try_save()
  File "/Users/primoz/Documents/orange3/Orange/widgets/utils/save/owsavebase.py", line 223, in _try_save
    self.do_save()
  File "/Users/primoz/Documents/orange3/Orange/widgets/data/owsave.py", line 76, in do_save
    self.writer.write(self.filename, self.data, self.add_type_annotations)
  File "/Users/primoz/Documents/orange3/Orange/data/io_base.py", line 575, in write
    return cls.write_file(filename, data)
  File "/Users/primoz/Documents/orange3/Orange/data/io.py", line 222, in write_file
    pickle.dump(data, f, protocol=PICKLE_PROTOCOL)
TypeError: cannot pickle 'dict_keys' object
---------------------------------------------
- Should we disable saving a corpus to CSV, TAB, ... and allow only .pkl, as is done for sparse data? Users are confused when they save a corpus to CSV and then discover that the preprocessing is not stored together with the corpus.
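For reference, the TypeError in the traceback above is what Python's pickle raises whenever an object holds a dictionary view (the result of `dict.keys()`) instead of a concrete list. The sketch below reproduces it with a plain dict and shows one possible workaround. The `state` dict and the helper names are hypothetical stand-ins, not the actual Corpus internals, and the real fix in the add-on may look different.

```python
import pickle


def pickle_error(obj):
    """Return the pickling error message, or None if pickling succeeds."""
    try:
        pickle.dumps(obj)
    except TypeError as exc:
        return str(exc)
    return None


def without_dict_views(state):
    """Return a copy of `state` with dict views replaced by plain lists."""
    view_types = (type({}.keys()), type({}.values()), type({}.items()))
    return {
        key: list(value) if isinstance(value, view_types) else value
        for key, value in state.items()
    }


# Hypothetical object state that caches a dict view, as a stand-in for
# whatever attribute of the preprocessed corpus holds the `dict_keys`.
counts = {"orange": 2, "corpus": 1}
state = {"counts": counts, "vocabulary": counts.keys()}

error_before = pickle_error(state)                      # message mentions dict_keys
error_after = pickle_error(without_dict_views(state))   # None: now picklable
restored = pickle.loads(pickle.dumps(without_dict_views(state)))
```

Converting the view to a list at the point where it is stored (or in `__getstate__`) would be the usual way to make such an object picklable.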
Should we disable saving a corpus to CSV, TAB, ... and allow only .pkl, as is done for sparse data? Users are confused when they save a corpus to CSV and then discover that the preprocessing is not stored together with the corpus.
This would also prevent saving corpora downloaded with the Twitter, Wikipedia, and similar widgets to CSV. I am not in favour of removing it.
While I agree it is slightly confusing, I think it is common practice (in NLTK for example) to have a separate tokens object. I'd rather give a warning or describe this better in the docs.
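If the tokens are kept as a separate object, one way to answer the original question is to write them out next to the original text. This is only a sketch using the standard library: it assumes the preprocessed tokens are already available as a list of token lists, one per document (how to obtain them from the Corpus is not shown here), and `write_tokens_csv` and the sample data are hypothetical.

```python
import csv
import io


def write_tokens_csv(documents, tokens, out):
    """Write one row per document: the original text and its joined tokens."""
    writer = csv.writer(out)
    writer.writerow(["text", "tokens"])
    for text, doc_tokens in zip(documents, tokens):
        # Tokens are joined with spaces so each document stays in one cell.
        writer.writerow([text, " ".join(doc_tokens)])


# Made-up sample data standing in for a corpus and its preprocessed tokens.
documents = ["The cats are sleeping.", "Dogs bark!"]
tokens = [["cat", "sleep"], ["dog", "bark"]]

buffer = io.StringIO()
write_tokens_csv(documents, tokens, buffer)
csv_text = buffer.getvalue()
```

To write to disk instead of a buffer, open the file with `open(path, "w", newline="", encoding="utf-8")` and pass it as `out`.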
Agree with you @ajdapretnar, I would definitely add a warning to the widget.
An idea: if a Corpus is on the input of Save Data, the widget raises a warning saying "To keep preprocessing, save as pickle (.pkl)". This should be implemented in orange3.
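A rough sketch of the check behind such a warning, written as a plain function rather than actual widget code. The function name, the `has_preprocessing` flag, and the set of extensions are assumptions for illustration; the real Save Data widget would hook this into its own warning mechanism.

```python
import os

# Assumed set of extensions that keep the whole Python object.
PICKLE_EXTENSIONS = {".pkl", ".pickle"}


def preprocessing_warning(filename, has_preprocessing):
    """Return a warning message if saving would drop preprocessing, else None."""
    extension = os.path.splitext(filename)[1].lower()
    if has_preprocessing and extension not in PICKLE_EXTENSIONS:
        return "To keep preprocessing, save as pickle (.pkl)."
    return None


csv_message = preprocessing_warning("tweets.csv", has_preprocessing=True)
pkl_message = preprocessing_warning("tweets.pkl", has_preprocessing=True)
plain_message = preprocessing_warning("tweets.csv", has_preprocessing=False)
```

Only a preprocessed corpus saved to a non-pickle format triggers the message, so saving downloaded corpora to CSV keeps working without a warning.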