Can't index 19 GB of source files with OpenGrok 1.0
Hello all,
I am trying to index 19 GB of source files with OpenGrok 1.0, but indexing is failing every time. Here is the relevant information:
Java - jdk1.8.0_51
Ant - apache-ant-1.9.0
Source files - 19 GB (single project)
My changes to the OpenGrok script:
OPENGROK_GENERATE_HISTORY=off
OPENGROK_SCAN_REPOS=false
OPENGROK_VERBOSE=true
OPENGROK_FLUSH_RAM_BUFFER_SIZE="-m 256"
export OPENGROK_GENERATE_HISTORY OPENGROK_SCAN_REPOS OPENGROK_VERBOSE OPENGROK_FLUSH_RAM_BUFFER_SIZE
JAVA_OPTS="${JAVA_OPTS:--Xmx32g -d64 -server}"
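For context, with those variables set the stock wrapper script is invoked roughly like this (a sketch; the instance base and source path are placeholders, not the actual ones used):
export OPENGROK_INSTANCE_BASE=/opt/opengrok    # assumed instance directory
./OpenGrok index /path/to/source               # the 19 GB source root would go here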
I am using a shared machine on which I am allowed to use up to 32 GB of RAM, but the indexing process gets killed when it reaches the memory usage limit. I also tried JAVA_OPTS="${JAVA_OPTS:--Xmx16g -d64 -server}"; however, that setting throws a "java.lang.OutOfMemoryError: Java heap space" error.
Please help me with setting up OpenGrok to index large sources.
Thanks in advance!!
Could you try with the latest 1.1 RC?
I need to use OpenGrok 1.0 for my use case. Could you please suggest changes for that setup? I will also try OpenGrok 1.1 and will update here.
If you really need to use 1.0, then run the indexer with -XX:+HeapDumpOnOutOfMemoryError (and -XX:HeapDumpPath= to specify where the file should be saved) and take a look at the dump with MAT (https://www.eclipse.org/mat/) to see what the biggest consumers are.
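A minimal sketch of what that invocation could look like (the heap size, dump path, and source/data roots are placeholders):
java -Xmx16g -XX:+HeapDumpOnOutOfMemoryError \
     -XX:HeapDumpPath=/var/tmp/opengrok-heap.hprof \
     -jar opengrok.jar -s /path/to/source -d /path/to/data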
Any luck?
Our repo is around 450 GB. With version 1.0 or older, I've found it's best to create a small index (with a subset of files) and/or remove the files that cause the memory errors until you can get a full index to build, and then slowly remove the ignore-file options.
What are these files?
Corporate code files from multiple code repositories.
I mean - is there something special about them?
I wouldn't consider them special at all. Just many, many years and lines of code, plus the random other files included in any corporate code repository.
I was just making the point that 19 GB of source files isn't very large in comparison, and OpenGrok can handle it (with only minor tweaks, if any). The newer versions (1.1+) are getting even better. In fact, I've indexed our company's network drives (for fun), which were multiple TBs, just to see if it would work (and it did).
My only question for this person would be: what are the sizes and types of files he's indexing? I have noticed issues with large XML files and some other types (like vdx), but I feel like that's a different conversation.
@vladak - Sorry for replying late. The indexing completed with 1.0 after all. The settings were JAVA_OPTS="${JAVA_OPTS:--Xmx16384m -d64 -server}", and I was allowed to use 32 GB of RAM on the machine. My mistake was allowing the Java heap to grow up to 32 GB when the machine only had 32 GB of RAM. One should set the maximum Java heap size to less than the system's RAM.
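Spelled out, the setting that ultimately worked leaves roughly half of the physical memory to the OS and other processes (the value is the one from this thread; only the comment is added):
# machine has 32 GB of RAM, so cap the indexer heap well below that
JAVA_OPTS="${JAVA_OPTS:--Xmx16384m -d64 -server}"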
@jvaneck - I have many questions for you; I hope you can find some time for me :-).
- What were your changes to index the 450 GB repo?
- What was the configuration of the system?
- How long did it take to index the data?
Which version do you want me to base my answers on?
It would be great if you could answer based on version 1.0.
Version 1.0 is better than the 0.x versions but not as good as the 1.1 RCs.
My logs are showing that 1.1rc38 indexed 3ish TB in 2 days and 6 hours, although I'm not 100% sure that is correct.
1.0, and especially the 0.x versions, usually took at least a month (often multiple months) to index 250 GB because of the number of times it would stop or fail and I would have to start it again.
Ironically, I've moved a lot of this to run 100% off the network so that I can have multiple machines working on the different processes. I realize this is a very poor implementation; it was mostly done as a stepping stone, given the amount of time I can spend on this (I do it in my spare time at work).
I have played with many different options, but the thing I have found to work best is to not try to index everything at once. Getting the index to build with a small subset and then adding to it incrementally seems to work best. I also pay close attention to the logs and ignore files that cause it to error out.
Here is an example of one of my initial index-create scripts (for Windows). The specs of the Windows machine are 16 GB of RAM and an Intel Xeon E5-2690 @ 2.90 GHz (3 processors). I typically run absolutely nothing but the one index during the initial "create". I then run this script multiple times until it completes, without adding more files. Then I add more files to the source path and repeat the process.
It's also worth pointing out that the corporate LAN I'm running this over can (when working well) transfer up to 1 Gbps, and it will often hit that when transferring large files.
java -Djava.util.logging.config.file=Y:\grok-g\logging.properties -jar Y:\grok-g\bin\opengrok.jar ^
-S ^
-s G:\ ^
-d Y:\grok-g\data ^
-W Y:\grok-g\etc\configuration.xml ^
-U localhost:2430 ^
-G ^
-T 16 ^
-z 1 ^
-i *test* ^
-r on ^
-c Y:\grok-g\bin\ctags.exe ^
-P ^
-O on ^
-a on ^
-w /g
We tend to ignore problematic files (there are ignore options for the indexer based on file or directory regexps), but big files or files with long tokens used to be a problem in old OpenGrok versions; we have tried to improve the analyzers to chew on anything. I think the XML analyzer is not included in https://github.com/oracle/opengrok/blob/master/opengrok-indexer/pom.xml#L262 and I am seriously tempted to include it there, at which point the OOM issues should stop, since we will limit tokens to 32k. (I think I also had more language analyzers there; perhaps we can do it for all analyzers. The false positives hit by this limit will be a sacrifice we can live with; after all, Solr/Lucene do it by default anyway.)
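For reference, the ignore options mentioned above are plain -i arguments to the indexer, the same mechanism used with -i *test* in the Windows script earlier in this thread; the patterns below are made-up examples:
java -jar opengrok.jar -s /path/to/source -d /path/to/data \
     -i '*.vdx' -i '*generated*'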
+1 for including the XML analyzer
+1 for doing anything/something to fix the OOM issues. Even if the indexer just did something like automatically adding files that cause it to fail to an 'ignore list', that would be nice.
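Nothing like that exists in the indexer today; just to illustrate the idea, a hypothetical wrapper (names and paths are made up) could keep a plain-text ignore list and expand it into -i options on each run, with entries added by hand whenever the logs show a file breaking the indexing:
#!/bin/sh
# hypothetical wrapper: feed an accumulated ignore list to the indexer
IGNORE_FILE=/opt/opengrok/etc/ignore.list    # one glob per line (no spaces), maintained from the indexer logs
IGNORE_ARGS=""
while read -r pattern; do
    IGNORE_ARGS="$IGNORE_ARGS -i $pattern"
done < "$IGNORE_FILE"
java -Xmx16g -jar opengrok.jar -s /path/to/source -d /path/to/data $IGNORE_ARGS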