Labels: backend

slow solr imports

Open hroberts opened this issue 4 years ago • 2 comments

solr imports are very slow.

here are the last ten imports and the size of the import queue:

mediacloud=# select * from solr_imports order by solr_imports_id desc limit 10;
 solr_imports_id |        import_date         | full_import | num_stories
-----------------+----------------------------+-------------+-------------
         1492176 | 2019-10-29 10:10:22.209454 | f           |      200000
         1492175 | 2019-10-26 04:50:56.321699 | f           |        1067
         1492174 | 2019-10-26 04:41:52.876781 | f           |        1109
         1492173 | 2019-10-26 04:38:48.971453 | f           |        1866
         1492172 | 2019-10-26 04:37:44.932217 | f           |        1629
         1492171 | 2019-10-26 04:36:40.717088 | f           |        1894
         1492170 | 2019-10-26 04:35:36.210169 | f           |        2029
         1492169 | 2019-10-26 04:34:32.072143 | f           |        2034
         1492168 | 2019-10-26 04:33:28.680186 | f           |        1343
         1492167 | 2019-10-26 04:32:25.855111 | f           |        1051
(10 rows)

mediacloud=# select count(*) from solr_import_stories;
  count
---------
 1760927
(1 row)

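As a rough back-of-envelope from the numbers above: the Oct 26 runs move about 1,000-2,000 stories roughly once a minute, so at that rate the ~1.76M-row queue would take on the order of a day to clear even without interruptions.

# rough catch-up estimate from the figures above (assumes the Oct 26 rate holds)
backlog = 1_760_927          # rows in solr_import_stories
stories_per_minute = 1_500   # ~1,000-2,000 stories per ~64-second import
print(f"~{backlog / stories_per_minute / 60:.0f} hours to clear the backlog")  # ~20 hours
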
The current container log shows a bunch of entries like this:

2019-10-29T14:26:20.553449874Z INFO mediawords.db.result.result: Slow query (1 seconds): with _block_processed_stories as ( select max( processed_stories_id ) [...], ({},)

I think an ancillary part of the problem is that the solr imports are slow enough that an occasional deployment resets the process and forces it to start over.

hroberts, Oct 29 '19 14:10

I think this is because they have been OOMing constantly, as 100k stories (the default chunk size) is quite a lot to load into RAM at once.

I've increased the RAM limit from 4 GB to 8 GB, and they seem to be slowly catching up. If that doesn't fix the issue, we can also reduce the chunk size to around 50k stories and just import more often.
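
A minimal sketch of the chunked-import idea, to make the memory trade-off concrete: the table and column names come from the queries above, but the DSN, the Solr update URL, and the document builder are assumptions, not the actual Media Cloud importer. With a loop like this, peak memory scales with the chunk size rather than with the size of solr_import_stories.

import psycopg2
import requests

CHUNK_SIZE = 50_000  # smaller chunks bound peak memory; the import just runs more often
SOLR_UPDATE_URL = "http://localhost:8983/solr/mediacloud/update?commit=true"  # assumed URL/collection

def build_doc(row):
    # placeholder: the real importer builds a full Solr document per story
    return {"stories_id": row[0]}

conn = psycopg2.connect("dbname=mediacloud")
while True:
    with conn.cursor() as cur:
        # assumes the queue table keys on stories_id
        cur.execute(
            "select stories_id from solr_import_stories order by stories_id limit %s",
            (CHUNK_SIZE,),
        )
        rows = cur.fetchall()
    if not rows:
        break

    docs = [build_doc(r) for r in rows]  # only CHUNK_SIZE docs held in RAM at once
    requests.post(SOLR_UPDATE_URL, json=docs).raise_for_status()

    with conn.cursor() as cur:
        cur.execute(
            "delete from solr_import_stories where stories_id = any(%s)",
            ([r[0] for r in rows],),
        )
    conn.commit()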

pypt, Nov 26 '19 19:11

The jobs were still dying every 20 minutes or so, so I decreased the chunk size to 20k stories. It is catching up now.

hroberts, Nov 26 '19 19:11