Nick Ruest

44 comments by Nick Ruest

Might as well list everything we can do with [twarc utils](https://github.com/edsu/twarc/tree/master/utils). Twarc utils that I heavily use:

- deduplicate
- embeds (embedded media in a tweet)
- filter_date
- geojson...
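To illustrate the kind of work the `deduplicate` util does, here is a minimal sketch of dropping repeated tweets from line-oriented JSON. It is not twarc's own code: the function name `deduplicate`, the use of `id_str` (falling back to `id`) as the key, and the sample data are all assumptions made for the example.

```python
import json


def deduplicate(lines):
    """Yield each tweet (as a dict) the first time its id is seen.

    Assumes one JSON-encoded tweet per line, keyed by "id_str" or "id".
    """
    seen = set()
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        tweet = json.loads(line)
        tweet_id = tweet.get("id_str") or tweet.get("id")
        if tweet_id in seen:
            continue  # already emitted this tweet
        seen.add(tweet_id)
        yield tweet


# Hypothetical sample: the first tweet appears twice.
sample = [
    '{"id_str": "1", "text": "first"}',
    '{"id_str": "2", "text": "second"}',
    '{"id_str": "1", "text": "first"}',
]
unique = list(deduplicate(sample))
```

Keeping only the ids in memory, rather than whole tweets, lets the same approach stream through a large file line by line.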

@jrwiebe @lintool @ianmilligan1 finally wrote it up. Let me know if I missed anything, or y'all want me to provide more info.

Sorry, yeah, that's kinda vague. I meant overall, we've analyzed 170T of collections ranging from under 1G of WARCs up to 12T with the same basic Spark settings. I...

Reading the garbage collection post reminded me that Spark allocates (not sure if that is the right word) ~55-60% of "storage" memory for executors. A smarter version of me would...
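A rough sketch of where a figure in that range can come from, assuming it reflects Spark's unified memory region: with the default `spark.memory.fraction` of 0.6 and the 300 MiB Spark reserves from the heap, execution and storage together get `(heap - 300 MiB) * 0.6`. The 30 GiB executor heap below is a hypothetical value, not one taken from the comment.

```python
RESERVED_MIB = 300  # memory Spark sets aside before applying the fraction


def unified_memory_mib(executor_heap_mib, memory_fraction=0.6):
    """Approximate size of the shared execution + storage region, in MiB."""
    return (executor_heap_mib - RESERVED_MIB) * memory_fraction


heap_mib = 30 * 1024  # hypothetical 30 GiB executor heap
unified_mib = unified_memory_mib(heap_mib)
share_of_heap = unified_mib / heap_mib

print(f"unified region: {unified_mib:.0f} MiB ({share_of_heap:.1%} of heap)")
```

Because of the fixed 300 MiB reservation, the share is slightly under 60% and shrinks further for small heaps.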

I'm going to start adding collection info on failed collections here:

| id | size | spark version | aut version |
|----|-------|-------------------|---------------|
| 593/12190 | 141G | 2.4.3 |...

I'm not certain that record size is the issue. I've collected some data from a number of collections, both ones that fail and ones that succeed. Large record size doesn't _appear_ to be the...

Hit a very similar error a couple times over the last few days on two different collections with Java 11 and Spark 3.0.0. I had initially thought it was a...

One important point I left out, which should reinforce the conclusion of the above comment, is that I'm fairly certain the issue isn't WARC/ARC file specific. One example to...

@lintool's comment [above](https://github.com/archivesunleashed/aut/issues/317#issuecomment-548487796) hints at the problem as well; all those records above 2G.
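For context on why 2G is a meaningful threshold on the JVM: a Java array can hold at most 2^31 - 1 elements, so a record payload larger than that cannot fit in a single `byte[]`. A minimal sketch for flagging such records follows, assuming record sizes are already known in bytes. The helper name `oversized` and the sample records are hypothetical, and the excerpt does not confirm that this limit is the actual cause.

```python
JAVA_MAX_ARRAY_LEN = 2 ** 31 - 1  # largest possible Java array length


def oversized(records, limit=JAVA_MAX_ARRAY_LEN):
    """Return the names of records whose size in bytes exceeds the limit."""
    return [name for name, size in records if size > limit]


# Hypothetical (name, size in bytes) pairs.
records = [
    ("record-a", 512 * 1024 ** 2),  # 512 MiB
    ("record-b", 2 * 1024 ** 3),    # exactly 2 GiB, one byte over the limit
    ("record-c", 3 * 1024 ** 3),    # 3 GiB
]
too_big = oversized(records)
```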