Will Howes

Results 10 issues of Will Howes

Running the code from the [README](https://github.com/oduwsdl/sumgram#python-script-usage), specifically `sumgrams = get_top_sumgrams(doc_lst, ngram, params=params)`, returns the following exception: ``` sumgram.py, 1178, (, InvalidParameterError('The \'stop_words\' parameter of CountVectorizer must be a str among...

Fairly straightforward tests for the `URLToString` function. However, I'm still not sure what the following logic in `URLToString` does, so I wasn't able to write a test for it ([link](https://github.com/internetarchive/Zeno/blob/main/internal/pkg/utils/url_string.go#L28)):...

enhancement

Using a mocked server, perform a series of crawls that test whether the resulting WARC matches the parameters given to the crawl at runtime. It's important to clarify that this...

enhancement
internal-only

The UA string should be generated to match whatever Zeno would generate by default. Okay to fix in a separate issue. _Originally posted by @willmhowes in https://github.com/internetarchive/Zeno/pull/514#discussion_r2492528609_

Input and expected output should be defined together, either in the test directly or together in the file storing the test URLs. _Originally posted by @willmhowes in https://github.com/internetarchive/Zeno/pull/514#discussion_r2492547447_

## Problem

If the seencheck in the preprocessor returns an error of any kind, there is a panic: https://github.com/internetarchive/Zeno/blob/2abbcd32abdf62cab1eb0031b27f9b13c48c41c7/internal/pkg/preprocessor/preprocessor.go#L140-L142

This behavior makes sense to guarantee that fundamental problems are not occurring...

Rather than archiving all assets extracted from every URL, there should be a way to limit by:
- number of assets
- file type of assets
- total time spent...

enhancement
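
To make the asset-limiting request above more concrete, here is a minimal sketch of how the first two limits (asset count and file type) might be applied to a list of extracted asset URLs. The `AssetLimits` type and `FilterAssets` function are hypothetical names invented for illustration and are not part of Zeno; the time-based limit is truncated in the issue preview and is not covered here.

```go
package main

import (
	"fmt"
	"net/url"
	"path"
	"strings"
)

// AssetLimits is a hypothetical configuration type, not an existing Zeno struct.
type AssetLimits struct {
	MaxAssets   int             // 0 means no limit on the number of assets
	AllowedExts map[string]bool // empty or nil means every file type is allowed
}

// FilterAssets returns the assets that pass the limits, preserving input order.
func FilterAssets(assets []string, limits AssetLimits) []string {
	kept := make([]string, 0, len(assets))
	for _, raw := range assets {
		// Stop as soon as the count limit is reached.
		if limits.MaxAssets > 0 && len(kept) >= limits.MaxAssets {
			break
		}
		u, err := url.Parse(raw)
		if err != nil {
			continue // skip URLs that cannot be parsed
		}
		// Compare the extension of the URL path only, ignoring the query string.
		ext := strings.ToLower(path.Ext(u.Path))
		if len(limits.AllowedExts) > 0 && !limits.AllowedExts[ext] {
			continue
		}
		kept = append(kept, raw)
	}
	return kept
}

func main() {
	assets := []string{
		"https://example.com/a.png",
		"https://example.com/b.js",
		"https://example.com/c.PNG?x=1",
		"https://example.com/d.png",
	}
	limits := AssetLimits{MaxAssets: 2, AllowedExts: map[string]bool{".png": true}}
	fmt.Println(FilterAssets(assets, limits))
}
```

Filtering by extension is only one possible reading of "file type"; a real implementation might instead inspect the `Content-Type` response header, which this sketch does not attempt.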

https://github.com/internetarchive/Zeno/blob/a6c07f77bffc3ad3bc8a05abb5ef95404e00b763/internal/pkg/archiver/archiver.go#L33-L42 The `archiver` struct is defined with the idea of supporting two HTTP clients in mind, so that some traffic can selectively be routed through a proxy. Here is where...

enhancement
P3
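
As an illustration of the two-client idea described in the last issue above, the following is a minimal sketch of selecting between a direct and a proxied `http.Client`. It does not reproduce Zeno's `archiver` struct: the `dualClient` type, its fields, and the rule of routing by domain suffix are all assumptions made for this example, since the issue preview is truncated before it explains how traffic is selected.

```go
package main

import (
	"fmt"
	"net/http"
	"net/url"
	"strings"
)

// dualClient is a hypothetical type holding one direct and one proxied client.
type dualClient struct {
	direct       *http.Client
	proxied      *http.Client
	proxyDomains []string // assumption: hosts under these domains use the proxy
}

// newDualClient builds both clients; proxyURL is used only by the proxied one.
func newDualClient(proxyURL string, proxyDomains []string) (*dualClient, error) {
	pu, err := url.Parse(proxyURL)
	if err != nil {
		return nil, err
	}
	return &dualClient{
		// A Transport with a nil Proxy function never uses a proxy.
		direct:       &http.Client{Transport: &http.Transport{}},
		proxied:      &http.Client{Transport: &http.Transport{Proxy: http.ProxyURL(pu)}},
		proxyDomains: proxyDomains,
	}, nil
}

// clientFor returns the proxied client when host matches a configured domain
// or one of its subdomains, and the direct client otherwise.
func (d *dualClient) clientFor(host string) *http.Client {
	host = strings.ToLower(host)
	for _, dom := range d.proxyDomains {
		if host == dom || strings.HasSuffix(host, "."+dom) {
			return d.proxied
		}
	}
	return d.direct
}

func main() {
	c, err := newDualClient("http://127.0.0.1:8080", []string{"example.com"})
	if err != nil {
		panic(err)
	}
	fmt.Println(c.clientFor("cdn.example.com") == c.proxied)
	fmt.Println(c.clientFor("example.org") == c.direct)
}
```

Keeping the selection logic in one method means callers never need to know which client they received, which is one way the "selective routing" goal could be kept separate from the archiving code.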