Sync from where it stopped

Open subhasniveus opened this issue 3 years ago • 9 comments

Hi,

We need to achieve the following:

A. Sync from a collection.
B. If the sync stops partway, it should pick up from where it left off rather than restart from the beginning.
C. Once the whole collection has been synced, any further changes (updates/inserts) in the collection should trigger a sync of only those changes.
D. The index in Elasticsearch should be updated rather than overwritten if it already exists.

MongoDB version: 5.0 (Atlas)
Elasticsearch version: 7.14

Below are the toml file attributes:

change-stream-namespaces = [COLLECTION_NAME]
direct-read-namespaces = [COLLECTION_NAME]
elasticsearch-max-conns = 10
dropped-collections = true
dropped-databases = true
resume = true
verbose = true
resume-name = "res_name"
index-as-update = true

[[mapping]]
namespace = "COLLECTION_NAME"
index = "INDEX_NAME"

[[script]]
namespace = "COLLECTION_NAME"
script = """
..some changes..
"""

Issue description: A, C, and D from above work as expected, but B does not.

We have even tried with resume_strategy = 1 and observed no difference.

Kindly help. Let me know if there are further questions.

subhasniveus avatar Jul 25 '22 14:07 subhasniveus

@rwynn can you please help?

subhasniveus avatar Jul 26 '22 17:07 subhasniveus

monstache direct reads (aka full sync) cannot be resumed, only the change stream can be resumed. So with direct reads you can only set direct-read-stateful=true which will record when the entire set of direct read collections has been synced and not re-sync those collections on subsequent restarts.
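To make the distinction concrete, here is a minimal sketch of the settings involved. It reuses the option names already shown in this thread; the namespace is a placeholder, and the comments only paraphrase the behavior described above rather than document anything further.

```toml
# Full sync of existing documents (direct reads).
# Per the explanation above, this part cannot be resumed partway through.
direct-read-namespaces = ["mydb.mycollection"]   # placeholder namespace

# Record when the entire set of direct-read collections has been synced,
# so those collections are not re-synced on subsequent restarts.
direct-read-stateful = true

# Ongoing changes (change stream). This is the part that can be resumed.
change-stream-namespaces = ["mydb.mycollection"] # placeholder namespace
resume = true
```

With this combination, an interrupted full sync would start over on the next run, while change events would continue from the saved resume point.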

rwynn avatar Jul 26 '22 23:07 rwynn

Thanks for the quick reply @rwynn .

Couple of followup questions:

  1. Do you mean that if I remove direct-read-namespaces from the config above, it should satisfy all of A, B, C, and D?
  2. When we sync views, I think we have to list them in direct-read-namespaces, which means there is no option to resume if the sync stops partway, right? Is there any other way?

Thanks again.

subhasniveus avatar Jul 27 '22 04:07 subhasniveus

@rwynn please let us know. We are stuck on our production deployment.

subhasniveus avatar Aug 01 '22 08:08 subhasniveus

  1. yes, should be.
  2. you can sync documents from a view via relate configurations. The change events get routed through the view before going to Elasticsearch. https://rwynn.github.io/monstache-site/advanced/#mongodb-view-replication.
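For point 2, a relate configuration might look roughly like the sketch below. This is an assumption-laden illustration, not a verified config: the view name `monTestDB.collection_1_view` is hypothetical, and the exact set of `[[relate]]` keys should be checked against the view replication section of the docs linked above.

```toml
# Listen for change events on the underlying collection, not on the view.
change-stream-namespaces = ["monTestDB.collection_1"]

# Route each change event on the collection through the view
# before the resulting document goes to Elasticsearch.
[[relate]]
namespace = "monTestDB.collection_1"            # source collection
with-namespace = "monTestDB.collection_1_view"  # hypothetical view name
keep-src = false                                # assumed: index only the view document

[[mapping]]
namespace = "monTestDB.collection_1_view"       # hypothetical view name
index = "INDEX_NAME"
```

Because the events arrive through the change stream, the resume settings apply to them in the same way as for a regular collection.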

rwynn avatar Aug 02 '22 03:08 rwynn

Scenarios A, C, and D work well, but B doesn't in the case of a fresh (initial) sync.

The scenario is as follows:

  1. Fresh run of monstache, i.e. no monstache collection has been created yet
  2. Start a long-running sync with monstache -f config.toml, say 1000 documents to sync
  3. Stop the process partway, say after 500 documents
  4. Restart the monstache process

Expected: after step 4, the sync should resume from the 500th document (the pending documents that were not synced earlier).

Actual: it doesn't sync at all.

It still listens for changes to documents and syncs the changed documents properly.

Sample TOML file used:

mongo-url = "mongodb://XXXX"
elasticsearch-urls = ["http://YYYY/"]
elasticsearch-max-conns = 10
verbose = true
change-stream-namespaces = ["monTestDB.collection_1", "monTestDB.collection_2"]
direct-read-namespaces = ["monTestDB.collection_1", "monTestDB.collection_2"]
direct-read-stateful = true
index-as-update = true
index-oplog-time = true
resume = true
resume-write-unsafe = true
resume-strategy = 1

[[mapping]]
namespace = "monTestDB.collection_1"
index = "product_v7_test"

[[mapping]]
namespace = "monTestDB.collection_2"
index = "product_v8_test"

[[script]]
namespace = "monTestDB.collection_2"
script = """
module.exports = function(doc, ns, updateDesc) {
  // Simulate a long-running sync by busy-looping on one document.
  if (doc._id === 2) {
    for (var i = 0; i < 99999999999999; i++) {
      console.log("===================" + i);
    }
  }
  return doc;
}
"""

Here I have simulated a long-running sync using a for loop.

Below is the screenshot of the logs after restart (step 4). Kindly note the versions from the screenshot. monstache_logs

A few observations:

  1. directreads collection has collection names inserted immediately after sync is started and doesn't wait to complete
  2. tokens collection is not created until full sync is completed and a document is changed
  3. resume-strategy = 0 doesn't create tokens collection even when mongodb version is 6+ (the one which I use)

Appreciate any help/feedback @rwynn

subhasniveus avatar Sep 21 '22 17:09 subhasniveus

Hi @subhasniveus there were some fixes related to direct-read-stateful that went into monstache 6.7.10. This was to ensure directreads collection is not written to when the process exits prematurely. Specifically, https://github.com/rwynn/monstache/commit/60c486778f9b12c6639b36149eb47d03ef0d8dfa.

From the output it looks like you are using monstache 6.7.7. Can you try with the latest version to see if this gives better results?

The tokens collection is only populated in response to a change event, not a direct read of existing data. It is only used when the resume strategy is 1, so this is expected.

rwynn avatar Sep 22 '22 01:09 rwynn

The expectation of direct reads resuming from where they were interrupted (e.g. start from 500th document) will not occur. Direct reads always sync everything in the collection.

The resume option only applies to change events not the direct reads. When direct reads are stateful they will not occur at all if they have previously been recorded as completed.
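As a rough summary of how the two mechanisms line up with the settings used in this thread, the sketch below annotates each option with the state it relates to. The namespaces come from the sample config above; the comments restate what has been said in this thread and are not a full description of either feature.

```toml
# --- Change events (resumable) ---
change-stream-namespaces = ["monTestDB.collection_1", "monTestDB.collection_2"]
resume = true
resume-strategy = 1    # the tokens collection is only used with this strategy,
                       # and is only populated in response to a change event

# --- Direct reads (not resumable) ---
direct-read-namespaces = ["monTestDB.collection_1", "monTestDB.collection_2"]
direct-read-stateful = true   # completed namespaces are recorded in the
                              # directreads collection; once recorded, the
                              # direct reads do not run again on restart
```

In other words, a restart either repeats the whole direct read or skips it entirely, depending on whether it was recorded as completed; it never continues from the middle.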

rwynn avatar Sep 22 '22 01:09 rwynn

@rwynn thank you for the response.

I tried the same scenario with the updated version 6.7.10, and the collection name is still written to the directreads collection as soon as I start the process. When I restart, it doesn't sync at all, neither from the start nor from where I killed the process.

subhasniveus avatar Sep 22 '22 04:09 subhasniveus

Hi @subhasniveus,

Did you drop the directreads collection before running the test? Monstache will not clear this collection, it has to be done manually.

Also, can you attach the monstache log and indicate the approximate time you are stopping the process?

rwynn avatar Sep 24 '22 00:09 rwynn

Yes, @rwynn. I dropped the whole monstache DB before running the config file afresh. In fact, it doesn't matter when I stop the process: the directreads collection is created immediately and the names are inserted. I press Ctrl+C to terminate the process.

Please find the attached log files: logs_initial.txt, generated on the first (fresh) run, which was terminated partway, and logs_second_run.txt, from the second run.

Also, please find below screenshots of the monstache database and the directreads collection from the fresh run: monstache_collection, directreads_collection.

Please let me know if you need more data for analysis. Thank you.

subhasniveus avatar Sep 25 '22 13:09 subhasniveus

Hi @subhasniveus,

Thanks for the logs. I'm not sure what could be going on. The function at this line should be the only code saving to the directreads collection:
https://github.com/rwynn/monstache/blob/1868ebba0d221fc00ee4473bb3bcbae0cd15c224/monstache.go#L1677

That function has only one reference, and it should be preceded in the logs by this log statement:
https://github.com/rwynn/monstache/blob/1868ebba0d221fc00ee4473bb3bcbae0cd15c224/monstache.go#L4453

infoLog.Println("Direct reads completed")

I didn't notice that line being logged in the output of the 1st run. So I'm a little puzzled why the names of those collections would have been stored in the directreads collection before the 2nd run.

rwynn avatar Sep 25 '22 17:09 rwynn

Actually, never mind, I do see it in the middle.

What I don't see is this log line https://github.com/rwynn/monstache/blob/1868ebba0d221fc00ee4473bb3bcbae0cd15c224/monstache.go#L4302

Starting clean shutdown

rwynn avatar Sep 25 '22 17:09 rwynn

I added some commits that should reduce false positives (direct read ns written as completed before they really are). However, it would take a much larger change to ensure that there were actually no errors associated with any of the direct reads.

If you notice any errors in the logs it would be best to reset the direct read state and do a full sync.

rwynn avatar Sep 25 '22 19:09 rwynn

Thanks @rwynn. Please update here when the commits are released. Yes, as you said, resetting the direct read state and doing a full sync is best, but it is a little problematic when the sync is a very long-running process that takes a day or two. If possible, kindly add a feature so that monstache can identify where it terminated and restart from that point.

I will close this issue. Thanks for your time.

subhasniveus avatar Sep 26 '22 05:09 subhasniveus