vowpal_wabbit
Allow multiple data files as input
Describe the bug
Call vw with a bad argument and notice that vw does not return a non-zero exit code. Detecting whether vw rejected the arguments requires reading the output and looking for a line that says "sailing on!", which is not a robust way to report an error.
To Reproduce
Steps to reproduce the behavior: For example (notice the vw parameters are "bad vw arguments" which are invalid parameters): VW COMMAND:
E:\sharathm\github\sharathmalladi-mwt-ds\DataScience>vw bad vw arguments -d D:/tmp/124bb2ca-a99f-489e-b29c-bc142baa6f51\6359742a010048a58c1892eabd731d4c\6359742a010048a58c1892eabd731d4c_merged_data_2019-01-03_2019-01-03.json.gz -p D:/tmp/124bb2ca-a99f-489e-b29c-bc142baa6f51\6359742a010048a58c1892eabd731d4c\6359742a010048a58c1892eabd731d4c_merged_data_2019-01-03_2019-01-03.json.gz.Custom Policy 1.pred
predictions = D:/tmp/124bb2ca-a99f-489e-b29c-bc142baa6f51\6359742a010048a58c1892eabd731d4c\6359742a010048a58c1892eabd731d4c_merged_data_2019-01-03_2019-01-03.json.gz.Custom
Num weight bits = 18
learning rate = 0.5
initial_t = 0
power_t = 0.5
using no cache
Reading datafile = bad
can't open 'bad', sailing on!
num sources = 0
average since example example current current current
loss last counter weight label predict features
finished run
number of examples = 0
weighted example sum = 0.000000
weighted label sum = 0.000000
average loss = n.a.
total feature number = 0
E:\sharathm\github\sharathmalladi-mwt-ds\DataScience>echo %ERRORLEVEL%
0
Expected behavior
The error code after invoking vw should be non-zero since vw did not successfully output the predictions.
Observed Behavior
We instead get back an output that has a line that reads: can't open 'bad', sailing on!
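Until vw signals this case through its exit status, a caller is left scanning the output for the warning, which is the brittle workaround described above. A minimal sketch in Python of that scan, assuming the warning keeps the exact wording shown in this report (VW 8.6.1); the helper name `find_unopened_files` is hypothetical and not part of VW:

```python
import re
from typing import List

# Matches the warning line from the log above, e.g.: can't open 'bad', sailing on!
# Assumption: the wording comes from this report and may differ in other VW versions.
_WARNING = re.compile(r"can't open '(?P<path>[^']*)', sailing on!")


def find_unopened_files(vw_output: str) -> List[str]:
    """Return the paths that vw reported it could not open."""
    return [m.group("path") for m in _WARNING.finditer(vw_output)]


sample = "Reading datafile = bad\ncan't open 'bad', sailing on!\nnum sources = 0\n"
print(find_unopened_files(sample))  # ['bad']
```

A wrapper could treat a non-empty result as a failure, but this depends on a message string that is not a stable interface, which is why a non-zero exit code is the better fix.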
Environment
What version of VW did you use? 8.6.1
What OS or language did you use? Windows command line
Additional context
None
In this situation only "bad" is looked at out of "bad vw arguments", as a positional parameter for the --data option. This is a shortcut that's been around for some time. The remaining tokens, "vw" and "arguments", are then ignored as unused values rather than treated as options. The positional parameter actually overrides the value given by --data, and since "bad" is not a file, a warning is printed when it can't be opened. So VW does actually exit successfully, since there was no data to train on.
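The precedence described here can be illustrated with a toy model. This is only a sketch of the behavior as explained in this comment, not VW's actual option parser; the function name `resolve_data_file` is hypothetical, and only -d/--data and -p/--predictions are modelled, each assumed to take one value:

```python
from typing import List, Optional, Tuple

# Assumption: only these options are modelled, and each takes exactly one value.
OPTIONS_WITH_VALUE = {"-d", "--data", "-p", "--predictions"}


def resolve_data_file(argv: List[str]) -> Tuple[Optional[str], List[str]]:
    """Return (effective data file, ignored positional tokens)."""
    named_data = None
    positionals = []
    i = 0
    while i < len(argv):
        token = argv[i]
        if token in OPTIONS_WITH_VALUE and i + 1 < len(argv):
            if token in ("-d", "--data"):
                named_data = argv[i + 1]
            i += 2  # skip the option and its value
        else:
            # Anything else is treated as a positional token in this toy model.
            positionals.append(token)
            i += 1
    if positionals:
        # The first positional token wins over --data; the rest are ignored.
        return positionals[0], positionals[1:]
    return named_data, []


print(resolve_data_file(["bad", "vw", "arguments", "-d", "data.json.gz", "-p", "out.pred"]))
# ('bad', ['vw', 'arguments'])
```

With no positional tokens, the same function falls back to the value given by -d, which matches the expectation of the original report.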
Yes, this seems counterintuitive. Handling the positional parameter in combination with the named parameter has been kind of tricky. I do agree, this seems like a bug. Not 100% sure how to deal with it yet.
So I think what could be done here is support multiple data files as input, and then if none of the files can be opened, VW will exit with a non-zero return code. You would also need to pass --no_stdin in order for it to work, though, as stdin is treated as another input file.
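A rough sketch of the proposed rule, in Python rather than VW's own code: try every data file, keep going past the ones that fail, and return a non-zero code only when no source is usable. The function name `exit_code_for_sources` is hypothetical, and the handling of stdin follows the description above rather than any existing implementation:

```python
import os
import tempfile
from typing import List


def exit_code_for_sources(data_files: List[str], no_stdin: bool) -> int:
    """Return 0 if at least one input source is usable, else 1."""
    opened = 0
    for path in data_files:
        try:
            with open(path, "rb"):
                opened += 1
        except OSError:
            # Keep the existing warning, but remember that this file failed.
            print(f"can't open '{path}', sailing on!")
    # Without --no_stdin, stdin counts as one more input source,
    # so the run is never considered to have no data.
    if not no_stdin:
        opened += 1
    return 0 if opened > 0 else 1


# Demo with one file that does not exist and one that does.
workdir = tempfile.mkdtemp()
missing = os.path.join(workdir, "missing.dat")
present = os.path.join(workdir, "present.dat")
with open(present, "wb") as handle:
    handle.write(b"")

print(exit_code_for_sources([missing], no_stdin=True))
print(exit_code_for_sources([missing, present], no_stdin=True))
```

The first call reports failure because nothing could be opened; the second succeeds because one of the two files is readable.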
+1 to supporting multiple data files as inputs. I wanted this useful feature for a long time.
#2355 additionally proposed support for globbing as well as passing a directory to the -d option.
@jackgerrits Is anyone working on this issue?
No, you're welcome to work on this
It's still open? Looks interesting to me.
Hi @dnabanita7, yes this is still open. Please feel free to work on it :)
Hello, I'm new to the VW codebase! @jackgerrits could you help me with navigating to which file(s) would need to be changed to support this feature? I'd like to get started with working on this issue.
@jackgerrits Is it still open? Can I work on this?