lambdaupb

Results 30 comments of lambdaupb

Some files were copied into this repo: https://github.com/oaken-source/pyd2s/tree/master/docs

@psycho23 I am very interested in the material not being lost. zippyshare or whatever.

Starting the JVM with `-XX:+UseG1GC -XX:+UseStringDeduplication -Xlog:stringdedup*=debug` leads to the following debug output of the G1GC string deduplication:

```
[163.650s][info ][gc,stringdedup] Concurrent String Deduplication (163.650s)
[163.650s][info ][gc,stringdedup] Concurrent String Deduplication...
```

This is the overall figure with just a loaded pipeline: `tokenize,ssplit,pos,lemma,ner,depparse,coref,quote`. I think the issue description covers most of the duplicates. It should be possible to catch most of it doing some...

Distsim.lexicon and NERFeatureFactory.lexicon seem to be deserialized and so deduplication needs to be injected with a magic deserialize method.

```java
private void readObject(java.io.ObjectInputStream in) throws IOException, ClassNotFoundException {
    in.defaultReadObject();
    StringDedup.INST.dedupInplace(lexicon);...
```
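To make the `readObject` idea concrete, here is a minimal, self-contained sketch of the hook. It is not the code from the commit: `StringPool`, `LexiconHolder`, and `ReadObjectDedupDemo` are hypothetical names, the lexicon is assumed to be a `Map<String, String>`, and only the map values are canonicalized.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;
import java.util.HashMap;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

/** Hypothetical string pool; a stand-in, not the StringDedup class from the comment. */
final class StringPool {
    private final ConcurrentHashMap<String, String> pool = new ConcurrentHashMap<>();

    /** Returns the first instance seen so far that equals s (or s itself). */
    String canonical(String s) {
        if (s == null) {
            return null; // ConcurrentHashMap does not accept null keys
        }
        String previous = pool.putIfAbsent(s, s);
        return previous != null ? previous : s;
    }
}

/** Hypothetical serializable class whose lexicon values contain many equal strings. */
final class LexiconHolder implements Serializable {
    private static final long serialVersionUID = 1L;
    static final StringPool POOL = new StringPool();

    final Map<String, String> lexicon = new HashMap<>();

    // Called by Java serialization instead of the default field restore.
    private void readObject(ObjectInputStream in) throws IOException, ClassNotFoundException {
        in.defaultReadObject();
        // Swap every value for its canonical instance, in place.
        lexicon.replaceAll((key, value) -> POOL.canonical(value));
    }
}

final class ReadObjectDedupDemo {
    /** Serializes and deserializes the holder in memory. */
    static LexiconHolder roundTrip(LexiconHolder holder) throws IOException, ClassNotFoundException {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (ObjectOutputStream out = new ObjectOutputStream(bytes)) {
            out.writeObject(holder);
        }
        try (ObjectInputStream in =
                 new ObjectInputStream(new ByteArrayInputStream(bytes.toByteArray()))) {
            return (LexiconHolder) in.readObject();
        }
    }

    public static void main(String[] args) throws Exception {
        LexiconHolder holder = new LexiconHolder();
        // Two distinct String instances with equal content.
        holder.lexicon.put("cat", new String("C12"));
        holder.lexicon.put("dog", new String("C12"));

        LexiconHolder copy = roundTrip(holder);
        // Identity comparison: true only if both values share one instance.
        System.out.println(copy.lexicon.get("cat") == copy.lexicon.get("dog"));
    }
}
```

The pool holds strong references, so pooled strings are never collected. That is probably acceptable for model data that lives as long as the process, but it is a design choice worth checking against the real usage.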

056c413b2468ce6937dfa6aeb4ae03235e5fa09a comes out at 3243 MB, so an 82 MB improvement. It's quick to measure, though: just download VisualVM (https://visualvm.github.io/) and run a main class that sets up the pipeline and then sleeps.

Yeah, just the master branch in this repo at 056c413. I just put the models jar manually into the project in IntelliJ and start my main class in the IDE.

That should work great as well. But be sure to call System.gc() a few times.
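As a rough illustration of measuring after a few `System.gc()` calls, the sketch below reads the used heap through the standard `MemoryMXBean`. `HeapProbe` and the placeholder pipeline object are hypothetical; the real pipeline setup would go where the comment indicates. `System.gc()` is only a request to the JVM, so the number is an approximation.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.MemoryUsage;
import java.lang.ref.Reference;

final class HeapProbe {
    /** Requests a few GC cycles, then returns the used heap in bytes. */
    static long usedHeapBytesAfterGc(int gcRounds) throws InterruptedException {
        for (int i = 0; i < gcRounds; i++) {
            System.gc();       // a hint; the JVM may ignore it
            Thread.sleep(100); // give concurrent collector work a moment to finish
        }
        MemoryUsage heap = ManagementFactory.getMemoryMXBean().getHeapMemoryUsage();
        return heap.getUsed();
    }

    public static void main(String[] args) throws Exception {
        // Placeholder: construct the pipeline here and keep the reference alive.
        Object pipeline = new Object();

        long used = usedHeapBytesAfterGc(3);
        System.out.printf("used heap: %d MB%n", used / (1024 * 1024));

        // Sleep so a profiler such as VisualVM can still attach to the process.
        Thread.sleep(60_000);
        Reference.reachabilityFence(pipeline); // keeps the pipeline reachable until here
    }
}
```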

Well, it accounts for around 300MB extra total, and the models are loaded sequentially. I think it would still achieve lower peak usage. Deduplicating the strings before serializing is probably...

It is very magical. I don't know how much Java serialization is used in this project in general, but replacing it with more boring solutions might be advisable in the...