backend
slow solr imports
Solr imports are very slow.
Here are the last ten imports and the size of the import queue:
mediacloud=# select * from solr_imports order by solr_imports_id desc limit 10;
 solr_imports_id |        import_date         | full_import | num_stories
-----------------+----------------------------+-------------+-------------
         1492176 | 2019-10-29 10:10:22.209454 | f           |      200000
         1492175 | 2019-10-26 04:50:56.321699 | f           |        1067
         1492174 | 2019-10-26 04:41:52.876781 | f           |        1109
         1492173 | 2019-10-26 04:38:48.971453 | f           |        1866
         1492172 | 2019-10-26 04:37:44.932217 | f           |        1629
         1492171 | 2019-10-26 04:36:40.717088 | f           |        1894
         1492170 | 2019-10-26 04:35:36.210169 | f           |        2029
         1492169 | 2019-10-26 04:34:32.072143 | f           |        2034
         1492168 | 2019-10-26 04:33:28.680186 | f           |        1343
         1492167 | 2019-10-26 04:32:25.855111 | f           |        1051
(10 rows)
mediacloud=# select count(*) from solr_import_stories;
  count
---------
 1760927
(1 row)
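For scale, a quick back-of-the-envelope check (the queue size comes from the count above; 100k is the default chunk size mentioned later in this thread, so this is an illustration, not actual import code):

```python
import math

queue_size = 1_760_927  # rows in solr_import_stories, per the count above
chunk_size = 100_000    # default import chunk size

# Each import run clears at most one chunk, so draining the backlog
# takes at least this many runs (assuming nothing new gets queued):
runs_needed = math.ceil(queue_size / chunk_size)
print(runs_needed)  # 18
```

So even at the default chunk size, the backlog needs roughly 18 successful runs in a row, which the OOM-killed jobs described below never got.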
The current container log shows a bunch of entries like this:
2019-10-29T14:26:20.553449874Z INFO mediawords.db.result.result: Slow query (1 seconds): with _block_processed_stories as ( select max( processed_stories_id ) [...], ({},)
I think a secondary part of the problem is that the Solr imports are slow enough that the occasional deployment resets the process and forces the import to start over.
I think this is because they have been OOMing constantly as 100k stories (the default chunk size) is quite a bit to load into RAM at once.
I've increased the RAM limit from 4 GB to 8 GB and they seem to be slowly catching up. If it doesn't fix the issue, we can also reduce the chunk size to 50k or so stories and just import more often.
The jobs were still dying every 20 minutes or so, so I decreased the job size to 20k stories. It is catching up now.
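The chunking trade-off above can be sketched as follows. This is a hypothetical illustration, not the actual Media Cloud import code; the function name and the in-memory backlog are made up, but the numbers match the queue size from the query output.

```python
def chunk_story_ids(story_ids, chunk_size=20_000):
    """Split a backlog of story IDs into fixed-size chunks so that each
    Solr import request only holds chunk_size documents in RAM at once."""
    for i in range(0, len(story_ids), chunk_size):
        yield story_ids[i:i + chunk_size]

# With the ~1.76M-story backlog above, a 20k chunk size means ~89 import
# runs instead of 18 at the 100k default, but each run uses roughly a
# fifth of the memory, keeping it under the container's limit.
backlog = list(range(1_760_927))
chunks = list(chunk_story_ids(backlog))
print(len(chunks), len(chunks[0]), len(chunks[-1]))  # 89 20000 927
```

Smaller chunks trade more frequent commits for a bounded per-run memory footprint, which is the relevant constraint when the jobs are being OOM-killed.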