Will Howes
Running the code from the [README](https://github.com/oduwsdl/sumgram#python-script-usage), specifically `sumgrams = get_top_sumgrams(doc_lst, ngram, params=params)`, returns the following exception: ``` sumgram.py, 1178, (, InvalidParameterError('The \'stop_words\' parameter of CountVectorizer must be a str among...
Fairly straightforward tests for the `URLToString` function. I'm still not sure what the following logic in `URLToString` does, so I wasn't able to write a test for it ([link](https://github.com/internetarchive/Zeno/blob/main/internal/pkg/utils/url_string.go#L28)):...
Using a mocked server, perform a series of crawls that test whether the resulting WARC matches the parameters given to the crawl at runtime. It's important to clarify that this...
The UA string should be generated to match whatever Zeno would generate by default. Okay to fix in a separate issue. _Originally posted by @willmhowes in https://github.com/internetarchive/Zeno/pull/514#discussion_r2492528609_
Input and expected output should be defined together, either directly in the test or side by side in the file storing the test URLs. _Originally posted by @willmhowes in https://github.com/internetarchive/Zeno/pull/514#discussion_r2492547447_
## Problem If the seencheck in the preprocessor returns an error of any kind, there is a panic: https://github.com/internetarchive/Zeno/blob/2abbcd32abdf62cab1eb0031b27f9b13c48c41c7/internal/pkg/preprocessor/preprocessor.go#L140-L142 This behavior makes sense to guarantee that fundamental problems are not occurring...
Rather than archiving all assets extracted from every URL, there should be a way to limit by: - number of assets - file type of assets - total time spent...
https://github.com/internetarchive/Zeno/blob/a6c07f77bffc3ad3bc8a05abb5ef95404e00b763/internal/pkg/archiver/archiver.go#L33-L42 The `archiver` struct is designed to support two HTTP clients, so that some traffic can be selectively routed through a proxy. Here is where...