DAGSfM

I got it to run in a distributed manner, but it's only using 1 worker while I had 2 workers on

Open jchen706 opened this issue 3 years ago • 17 comments

I used 50 images per cluster with 100 images. It grouped them into 2 clusters of 66 images each. But the clusters only ran on 1 worker, not on both workers in parallel.

image

image

The second worker is idle in the picture.

image

It is stuck at 0% with 2 workers. With 1 worker on localhost, the maximum time was 40 minutes. This run lasted longer than 40 minutes if we sum both workers' time.

jchen706 avatar Aug 20 '20 00:08 jchen706

Another question: I know the workers are running the clusters, but does the master also run a cluster? I only see the master sending images to the workers for the local SfM, so it looks like the master is just the task scheduler.

Another question: the Distributed Mapper is the only thing running distributed across the devices, right?

Do merging and bundle adjustment happen on the master?

Is transferring images to each worker blocking?

Another question: image In the line I highlighted, what do the I0819 and the 22381 next to the time mean?

jchen706 avatar Aug 20 '20 01:08 jchen706

The single-worker behavior is due to a race condition between the three threads in the master. I wrote a thread-safe distributed task controller and pushed it to the dev branch, so this issue should now be fixed.
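
To illustrate what thread-safe task distribution means here, the following is a minimal sketch and not DAGSfM's actual controller. It assumes one dispatcher thread per worker and a shared queue of cluster IDs; the names `distribute`, `serve`, and the worker labels are hypothetical.

```python
import queue
import threading


def distribute(cluster_ids, worker_names):
    """Assign every cluster to exactly one worker using a thread-safe queue.

    Hypothetical sketch: each worker gets a dispatcher thread that pulls the
    next pending cluster ID. queue.Queue does its own locking, so two threads
    can never take the same cluster.
    """
    tasks = queue.Queue()
    for cluster_id in cluster_ids:
        tasks.put(cluster_id)

    assignments = {name: [] for name in worker_names}
    lock = threading.Lock()  # guards the shared assignments dict

    def serve(name):
        while True:
            try:
                cluster_id = tasks.get_nowait()
            except queue.Empty:
                return  # no pending clusters left
            with lock:
                assignments[name].append(cluster_id)

    threads = [threading.Thread(target=serve, args=(n,)) for n in worker_names]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return assignments


if __name__ == "__main__":
    print(distribute(range(2), ["worker0", "worker1"]))
```

Which worker receives which cluster depends on thread scheduling, but every cluster is handed out exactly once.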

AIBluefisher avatar Aug 20 '20 15:08 AIBluefisher

In a distributed system, we must specify a master and workers. The master is only responsible for task scheduling and does not need to do the work a worker does. You can, however, run a master and a worker on the same physical server.

AIBluefisher avatar Aug 20 '20 15:08 AIBluefisher

image

It seems to be an infinite loop.

image

jchen706 avatar Aug 21 '20 02:08 jchen706

I forgot to commit some code. Try the newest branch!

AIBluefisher avatar Aug 21 '20 14:08 AIBluefisher

I merged the code manually, and I currently don't have more than one machine to test the distributed code. Just keep this issue open if you have any problems.

AIBluefisher avatar Aug 21 '20 14:08 AIBluefisher

Also check your logs in the log directory. If you started the master and worker correctly, you should see output like the following:

I0821 22:31:43.197476 29655 image_clustering.cpp:450] Analysing Statistics...
I0821 22:31:43.197505 29655 image_clustering.cpp:356] Images Clustering Config:
- image upperbound: 17
- completeness ratio: 0.7
- cluster type: NCUT
Images Clutering Summary:
Clusters number: 3
Total graph cutting time: 0.003766 seconds
Total graph cutting number: 1
Total graph expansion time: 0.022821 seconds
Total graph expansion number: 1
Total time took: 0.026587 seconds
Total iteration number: 0
Images number expanded from 52 to 83
Repeated Ratio: 0.596154
Edges number reduced from 952 to 447
Lost ratio: 0.530462
I0821 22:31:43.197710 29655 image_clustering.h:43] 28 nodes
I0821 22:31:43.197749 29655 image_clustering.h:49] 158 edges
I0821 22:31:43.197821 29655 image_clustering.h:43] 27 nodes
I0821 22:31:43.197854 29655 image_clustering.h:49] 162 edges
I0821 22:31:43.197918 29655 image_clustering.h:43] 28 nodes
I0821 22:31:43.197948 29655 image_clustering.h:49] 127 edges
I0821 22:31:43.198192 29655 distributed_mapper_controller.cpp:202] cluster 0 has 28 images.
I0821 22:31:43.428670 29655 distributed_mapper_controller.cpp:202] cluster 1 has 27 images.
I0821 22:31:43.681368 29655 distributed_mapper_controller.cpp:202] cluster 2 has 28 images.
I0821 22:31:44.132424 29658 distributed_task_manager.inl:85] Transferring images to worker #0.
I0821 22:31:44.134135 29659 distributed_task_manager.inl:111] start update running info
I0821 22:31:44.136473 29659 distributed_task_manager.inl:113] end update running info
I0821 22:31:44.342805 29658 distributed_task_manager.inl:91] Transferring images to worker #0 completed.
I0821 22:31:44.343420 29658 distributed_task_manager.h:47] Call run sfm
I0821 22:31:44.704735 29658 distributed_task_manager.h:49] end call RunSfM
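
The two ratios in the clustering summary above are consistent with simple relative changes. The formulas below are inferred from the printed numbers, not taken from the DAGSfM source, and the function names are hypothetical.

```python
def repeated_ratio(images_before, images_after):
    """Fraction of extra (repeated) images added by graph expansion (inferred formula)."""
    return (images_after - images_before) / images_before


def lost_ratio(edges_before, edges_after):
    """Fraction of view-graph edges dropped by clustering (inferred formula)."""
    return (edges_before - edges_after) / edges_before


if __name__ == "__main__":
    print(round(repeated_ratio(52, 83), 6))  # 0.596154, matches "Repeated Ratio"
    print(round(lost_ratio(952, 447), 6))    # 0.530462, matches "Lost ratio"
```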

Make sure your output has distributed_task_manager.h:47] Call run sfm.
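
The log lines above match the glog prefix layout: a severity letter followed by month and day, then the time, a thread ID, and `file:line]` before the message. Assuming that layout, a small script can scan a log for the `Call run sfm` line. This is a sketch; `parse_glog_line` and `sfm_was_dispatched` are hypothetical helpers, not part of DAGSfM.

```python
import re

# Assumed glog prefix: "Lmmdd hh:mm:ss.uuuuuu threadid file:line] msg"
GLOG_LINE = re.compile(
    r"^(?P<severity>[IWEF])(?P<month>\d{2})(?P<day>\d{2}) "
    r"(?P<time>\d{2}:\d{2}:\d{2}\.\d{6})\s+(?P<thread_id>\d+) "
    r"(?P<file>[^:\s]+):(?P<line>\d+)\] (?P<message>.*)$"
)

SEVERITY = {"I": "INFO", "W": "WARNING", "E": "ERROR", "F": "FATAL"}


def parse_glog_line(text):
    """Split one log line into its prefix fields; return None if it has no prefix."""
    match = GLOG_LINE.match(text)
    if match is None:
        return None
    record = match.groupdict()
    record["severity"] = SEVERITY[record["severity"]]
    for key in ("month", "day", "thread_id", "line"):
        record[key] = int(record[key])
    return record


def sfm_was_dispatched(log_lines):
    """Return True if any line carries the 'Call run sfm' message."""
    for text in log_lines:
        record = parse_glog_line(text)
        if record is not None and record["message"] == "Call run sfm":
            return True
    return False


if __name__ == "__main__":
    sample = "I0821 22:31:44.343420 29658 distributed_task_manager.h:47] Call run sfm"
    print(parse_glog_line(sample))
    print(sfm_was_dispatched([sample]))
```

Under this reading, a prefix such as `I0819` means an INFO message logged on August 19, and the number after the time is the ID of the thread that wrote the line.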

AIBluefisher avatar Aug 21 '20 14:08 AIBluefisher

About the feature extraction error with the Gerrard Hall dataset from COLMAP (https://colmap.github.io/datasets.html#): I just have the gerrard-hall folder with a log folder and an images folder. It's a COLMAP error.

The feature extraction process got killed.

image Last error for me for now.

jchen706 avatar Aug 22 '20 03:08 jchen706

Try using the GPU version of feature extraction. Otherwise it takes a long time on large-scale datasets and may be killed by the operating system.

AIBluefisher avatar Aug 22 '20 10:08 AIBluefisher

I ran into the same problem. I have 3 computers in total: one is set as the master and the other two as workers. But when the master started, both workers' statuses were IDLE.

Cluster Id  IP                  Worker Status  Progress  Task Status  Time
0           10.134.93.68:8080   IDLE           0/0 %     mapping      00:00:00
1           10.134.92.104:8080  IDLE           0/0 %     mapping      00:00:00
I0929 09:12:13.576000 30726 distributed_task_manager.inl:111] start update running info
I0929 09:12:13.620071 30726 distributed_task_manager.inl:113] end update running info

Have you solved this problem?

After approximately 10 minutes, worker 0 started running. However, worker 2 remained IDLE.

Yzhbuaa avatar Sep 29 '20 09:09 Yzhbuaa

Could you show me the running information of the workers? Make sure the command from the master has been sent to the workers and that the workers received it. From the Progress item, it seems the data was not correctly sent or received. Maybe you should start with distributed mode on a single computer and see what's going on, so that I can help you as much as I can.

AIBluefisher avatar Sep 29 '20 13:09 AIBluefisher

My config.txt:

1 10.134.93.68 8080 /mnt/common_storage/distributed_sfm_test/images

Running information of the workers:

Could not create logging file: No such file or directory
COULD NOT CREATE A LOGGINGFILE 20200929-222957.3103!I0929 22:29:57.778079 3104 worker.cpp:15] Worker get running info
I0929 22:29:59.047288 3104 worker.cpp:15] Worker get running info
I0929 22:30:00.097246 3105 worker.cpp:15] Worker get running info
I0929 22:30:01.500092 3104 worker.cpp:15] Worker get running info
I0929 22:30:02.534713 3105 worker.cpp:15] Worker get running info
I0929 22:30:03.569555 3104 worker.cpp:15] Worker get running info
I0929 22:30:04.607694 3105 worker.cpp:15] Worker get running info
I0929 22:30:05.642477 3104 worker.cpp:15] Worker get running info
I0929 22:30:06.687559 3105 worker.cpp:15] Worker get running info
I0929 22:30:07.713531 3105 worker.cpp:15] Worker get running info
I0929 22:30:08.739782 3105 worker.cpp:15] Worker get running info
I0929 22:30:09.765760 3105 worker.cpp:15] Worker get running info

Progress item:

Cluster Id  IP                 Worker Status  Progress  Task Status  Time
0           10.134.93.68:8080  IDLE           0/0 %     mapping      00:00:00
I0929 14:36:15.011899 103077 distributed_task_manager.inl:111] start update running info
I0929 14:36:15.043043 103077 distributed_task_manager.inl:113] end update running info

Yzhbuaa avatar Sep 29 '20 14:09 Yzhbuaa

It seems the data was not sent to the workers, since the worker status is IDLE and the progress is 0/0 (the first 0 denotes the number of registered cameras, the second 0 denotes the total number of images on this worker). My suggestion is to add logging to the distributed mode function in order to locate the bug. For example, maybe the matching data was not retrieved, so there is no data to distribute and no SfM task is assigned to the workers.
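
As a rough aid for reading the status table, the sketch below splits one row and reports whether the worker has received any images. It assumes the columns are whitespace-separated in the order cluster ID, address, worker status, progress, a literal `%`, task status, and time; `parse_status_row` and `has_received_data` are hypothetical names.

```python
def parse_status_row(row):
    """Parse one row of the master's status table (assumed column order)."""
    cluster_id, address, worker_status, progress, _percent, task, elapsed = row.split()
    registered, total = (int(part) for part in progress.split("/"))
    return {
        "cluster_id": int(cluster_id),
        "address": address,
        "worker_status": worker_status,
        "registered_cameras": registered,  # first number in Progress
        "total_images": total,             # second number in Progress
        "task": task,
        "time": elapsed,
    }


def has_received_data(row):
    """A total of 0 images suggests no cluster was delivered to this worker."""
    return parse_status_row(row)["total_images"] > 0


if __name__ == "__main__":
    print(has_received_data("0 10.134.93.68:8080 IDLE 0/0 % mapping 00:00:00"))
    print(has_received_data("0 10.134.93.68:8080 NONIDLE 28/89 % mapping 00:08:05"))
```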

AIBluefisher avatar Sep 29 '20 14:09 AIBluefisher

After approximately 15 minutes, the worker started to reconstruct:

Cluster Id  IP                 Worker Status  Progress  Task Status  Time
0           10.134.93.68:8080  NONIDLE        28/89 %   mapping      00:08:05
I0929 14:52:14.798004 103077 distributed_task_manager.inl:111] start update running info
I0929 14:52:14.799695 103077 distributed_task_manager.inl:113] end update running info

I have 133 images in total, which are divided into 2 clusters. The first cluster has 89 images.

I set --transfer_images_to_server to 0, because all the servers are connected to a storage server and all the images are stored on it.

Yzhbuaa avatar Sep 29 '20 14:09 Yzhbuaa

I've been busy recently, so it will take me a while to reproduce this issue. You're encouraged to debug the code, and feel free to fix this issue.

AIBluefisher avatar Sep 29 '20 14:09 AIBluefisher

Okay, I will try to debug the code and my settings. Thank you!

Yzhbuaa avatar Sep 29 '20 15:09 Yzhbuaa

The problem was solved after I set --transfer_images_to_server to 1. I am wondering how to save the image transfer time by using the storage server's shared folder?

Yzhbuaa avatar Sep 29 '20 15:09 Yzhbuaa