
WIP: Improve scantime

Open rollator opened this issue 8 years ago • 40 comments

Update 1.2

Reworked the distance calculations; they should be seriously faster now. To achieve that I had to add scipy, numpy, and pyproj as dependencies.

Would appreciate it if some folks with bigger datasets than mine (>2k spawn points) could test it.
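
For illustration, the reworked distance step could look roughly like this vectorised sketch (the actual code in the PR may differ; `distances_m` is a hypothetical helper name):

```python
import numpy as np
import pyproj

geod = pyproj.Geod(ellps='WGS84')

def distances_m(circle, points):
    """Geodesic distances in metres from one circle centre to many
    (lat, lon) points, computed in a single vectorised pyproj call
    instead of a per-pair Python loop."""
    lats = np.array([p[0] for p in points])
    lons = np.array([p[1] for p in points])
    _, _, dist = geod.inv(np.full_like(lons, circle[1]),
                          np.full_like(lats, circle[0]),
                          lons, lats)
    return dist
```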

Update 1.1

Did some refactoring to improve code quality and fixed something which may very well have led to an infinite loop.

Update 1

What do you need to run the PR at this stage?

  1. set a new variable in the config file: SAMPLES_PER_POINT = 5 (see the example after this list). Increasing the value increases the computation time needed before we can start, but results in a smaller set of scanning points. However, increasing it beyond 10 won't improve the result significantly.
  2. the area needs to be pre-scanned with the normal version as you need data about the spawn points.
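
A minimal config.py excerpt with the new setting (shown only as an illustration; the rest of your config stays as it is):

```python
# config.py
SAMPLES_PER_POINT = 5  # more samples: longer precomputation, fewer scan points
```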

Will it reduce server requests?

No, not unless you reduce the grid parameters in the config file. You can expect the number of points to scan to drop by about 50%, so you could use 50% fewer workers to get the same results as before, which in turn would lead to fewer server requests.

Still missing:

  • better integration into the worker cycle. Will start working on this part when PR #216 is merged.
  • a setting to decide whether we should first scan everything normally or can work directly with the reduced scan set.
  • ~~code quality is seriously lacking, will be fixed soon.~~ fixed
  • scanning efficiency can be improved even further. But as this also requires changes on the worker side, I'll stall until PR #216 is merged.

TL;DR: set SAMPLES_PER_POINT = 5 in config.py and have the area pre-scanned. With these changes you will have to do 50% fewer scans.

edit: I just realized I was talking bullshit about the efficiency gain. It wasn't 30% more effective for me; it reduced the scanning points to 30%, so the speedup is way higher. Corrected the corresponding passages.


Reduce the scan time by calculating a set of scan points which covers all distinct spawns. This leads to a reduced number of points to scan without missing a single pokemon.

As the title states, it's still work in progress. What's missing at this point:

  • A proper integration into the worker cycle, so that every 24 hours it runs a 2-hour session in which it makes an extensive scan sweep to detect possible changes in spawn points. Right now there are no sweeps at all; it just starts directly with the reduced set. I'm not quite sure how to accomplish this, would love some help here.
  • a setting with which one can inform the program that the area has already been scanned (so that it can start right away with the reduced scan point set)
  • code quality: It's a prototype which was hacked together as fast as possible, so there will be improvements here xD

rollator avatar Aug 11 '16 23:08 rollator

Would this mean a reduction of requests sent to the Niantic servers?

SapitoSucio avatar Aug 12 '16 17:08 SapitoSucio

That depends on the user. If you don't adjust the grid parameters in the config file it will still make more or less the same number of requests, but each point will be scanned more often.

As there are fewer points to scan (500 instead of 1500 for me), you could use fewer accounts to scan the same area, or keep the number of accounts and scan a larger area. Or don't change anything at all and have a higher refresh rate :)

Right now I'm working with only 2/3 of the workers I used before these changes, but I've also reduced the workload for every worker.

rollator avatar Aug 12 '16 17:08 rollator

Hello @rollator, just to be clear: as this is right now, I need a previously filled database, and your change will draw the spawn points that need to be scanned from there, right? Also, what is the definition of a "sample" and a "point" as in SAMPLES_PER_POINT?

crhbetz avatar Aug 12 '16 20:08 crhbetz

Oh yes, sorry I should have mentioned that:

  • As of right now you need a pre-filled database for the area you want to scan. (This will certainly be changed in the future.)
  • SAMPLES_PER_POINT is a value used in the calculation of the set of scanning points. As the exact computation would be infeasible by all means, the minimal needed set can only be approximated. First, all distinct spawn points are pulled from the database; these are the "points". As a second step, we visit each of these points, generate a specific number of circles (SAMPLES_PER_POINT) around it so that the point lies inside each generated circle, and add all of them to a collection of circles. In a third step we iterate over the circles and choose those which cover the most points, until all points are covered. As the generation of circles is random, more circles lead to a better set to choose from. However, it's also computationally expensive to calculate the distances from points to circle centres. For example, if we have 2000 spawn points and SAMPLES_PER_POINT = 5, we'll generate 5 * 2000 = 10'000 circles. Then we need to check the distances from 10'000 circles to 2000 points, which is 10'000 * 2000 = 20'000'000 pairs to calculate (actually the real number is a bit lower thanks to fancy data structures called interval trees). So increasing SAMPLES_PER_POINT leads to fewer scan points, but at some point there won't be a better solution, while the computation time still increases drastically. A sketch of the procedure follows below.
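
A minimal sketch of the sampling and greedy selection described above; the helper names (`sample_circles`, `greedy_cover`, `covers`) are hypothetical, and the real implementation uses interval trees and geodesic distances rather than this naive loop:

```python
import math
import random

def sample_circles(points, samples_per_point, radius=70.0):
    """For each spawn point, generate candidate circle centres that
    still cover that point (centre less than `radius` metres away)."""
    circles = []
    for lat, lon in points:
        for _ in range(samples_per_point):
            d = random.uniform(0, radius)           # offset distance < radius
            theta = random.uniform(0, 2 * math.pi)  # random direction
            # rough metres-to-degrees conversion, good enough for a sketch
            dlat = d * math.cos(theta) / 111111.0
            dlon = d * math.sin(theta) / (111111.0 * math.cos(math.radians(lat)))
            circles.append((lat + dlat, lon + dlon))
    return circles

def greedy_cover(points, circles, covers):
    """Repeatedly pick the circle covering the most uncovered points.
    `covers(circle, point)` decides coverage, e.g. distance <= radius."""
    uncovered = set(range(len(points)))
    chosen = []
    while uncovered:
        best = max(circles,
                   key=lambda c: sum(covers(c, points[i]) for i in uncovered))
        if not any(covers(best, points[i]) for i in uncovered):
            break  # no circle covers anything left; shouldn't happen here
        chosen.append(best)
        uncovered -= {i for i in uncovered if covers(best, points[i])}
    return chosen
```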

rollator avatar Aug 12 '16 20:08 rollator

With my data set, I'm getting:

Traceback (most recent call last):
  File "C:\Users\vDrag0n\Desktop\pokeminer-master\worker.py", line 378, in <module>
    spawn_workers(workers, status_bar=args.status_bar)
  File "C:\Users\vDrag0n\Desktop\pokeminer-master\worker.py", line 308, in spawn_workers
    points = utils.calculate_minimal_pointset(spawn_points)
  File "C:\Users\vDrag0n\Desktop\pokeminer-master\utils.py", line 210, in calculate_minimal_pointset
    foo = map_points_to_circles(interval_of_circles, interval_of_points)
  File "C:\Users\vDrag0n\Desktop\pokeminer-master\utils.py", line 106, in map_points_to_circles
    points = IntervalTree(map(lambda point, circles=circles: map_point(point,circles),points))
  File "C:\Python27\lib\site-packages\intervaltree\intervaltree.py", line 254, in __init__
    self.top_node = Node.from_intervals(self.all_intervals)
  File "C:\Python27\lib\site-packages\intervaltree\node.py", line 64, in from_intervals
    node = node.init_from_sorted(sorted(intervals))
  File "C:\Python27\lib\site-packages\intervaltree\interval.py", line 185, in __lt__
    return self.__cmp__(other) < 0
  File "C:\Python27\lib\site-packages\intervaltree\interval.py", line 170, in __cmp__
    return -1 if self.data < other.data else 1
RuntimeError: maximum recursion depth exceeded in cmp

vDrag0n avatar Aug 13 '16 10:08 vDrag0n

@vDrag0n I just added a small commit which could fix it for you, but I'm not really sure. Could you try the new version and tell me whether it works now, and what your number of spawn points was? (It will be printed in the console before the most resource-intensive part, so you'll have plenty of time to take note :)

rollator avatar Aug 13 '16 11:08 rollator

  • you use variable names like "foo", "meh" or "anything" which should really be cleaned up
  • in utils line 220 you start a while loop with a single break. Easy fix: "while condition" (see the sketch after this list).
  • you sometimes import something like math inside a function without using it
  • you have a function with something like 100 lines of code. Maybe break that up
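
For the second bullet, a minimal illustration of the suggested rewrite (the data and loop body are made up for the example):

```python
points = [1, 2, 3]

# before: unconditional loop exited by a single break
while True:
    if not points:
        break
    points.pop()

points = [1, 2, 3]

# after: the exit condition moves into the while header
while points:
    points.pop()
```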

Aiyubi avatar Aug 13 '16 13:08 Aiyubi

> you use variable names like "foo", "meh" or "anything" which should really be cleaned up
> in utils line 220 you start a while loop with a single break. Easy fix: "while condition".
> you sometimes import something like math inside a function without using it
> you have a function with something like 100 lines of code. Maybe break that up

Thanks for the styling tips, but as I already stated in the very first post, I'm very much aware that the code quality is seriously lacking at this point. It's a prototype and more of a POC than code I'd actually want to have merged. As the title suggests, this PR is still very much work in progress and nowhere near production quality.

rollator avatar Aug 13 '16 13:08 rollator

Using your spawn point retrieval SQL, I'm pulling 26k records from a 1.2 million row database.

I've tried changing Python's recursion limit, and just setting SAMPLES_PER_POINT to 1.

edit: Tried the small fix. It takes about 20 minutes to calculate the point set, but it looks like it works now.

vDrag0n avatar Aug 13 '16 17:08 vDrag0n

That did the trick? Great, thank you for testing :D Would you mind sharing some data about the scanning efficiency gain, for example how many points you had to scan before versus now? It would be really interesting to see how much (if at all) one can expect to gain on average, as for now I only have my own numbers (~500 instead of 1700).

The long runtime isn't really unexpected, as the calculations are really computationally intensive, and calculating distances on the Earth's surface takes quite some time compared to a flat plane :/

rollator avatar Aug 13 '16 18:08 rollator

It all depends on how dense your spawn points are and where you're scanning. Since I'm scanning a large area with parks, schools, and inlets, I'm saving about 50%.

As for the speed problem, it's only using one core. I'm sure it could be quicker if you just multithreaded it.
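
For illustration, a minimal sketch of parallelising the distance step across cores with multiprocessing (function names and toy data are hypothetical, not the PR's actual code):

```python
import math
from functools import partial
from multiprocessing import Pool

def haversine(a, b):
    """Great-circle distance in metres between two (lat, lon) pairs."""
    lat1, lon1, lat2, lon2 = map(math.radians, a + b)
    h = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
    return 2 * 6371000 * math.asin(math.sqrt(h))

def distances_to_all(points, circle):
    return [haversine(circle, p) for p in points]

if __name__ == '__main__':
    circles = [(47.00, 8.00), (47.10, 8.10)]   # toy data
    points = [(47.05, 8.05), (47.20, 8.20)]
    pool = Pool()                              # one worker process per core
    rows = pool.map(partial(distances_to_all, points), circles)
    pool.close()
    pool.join()
```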

vDrag0n avatar Aug 13 '16 18:08 vDrag0n

> It all depends on how dense your spawn points are and where you're scanning.

Exactly, that's why I'm asking: so we can have a rough estimate which doesn't depend on only one scan location.

As for the speed problem, that's definitely something I'll be looking into.

rollator avatar Aug 13 '16 18:08 rollator

tested this: `circles = IntervalTree(map(lambda interval_circle, points=points: map_circle(interval_circle, points), circles))`. This line has been running for 4 h and still isn't finished :/ btw: 16047 spawn points

Aiyubi avatar Aug 13 '16 19:08 Aiyubi

what's your SAMPLES_PER_POINT value?

rollator avatar Aug 13 '16 20:08 rollator

@rollator 2 h on 5; realized it might be the problem and then did 4 h on 1. Debugging it, it seems that the map part of the line I just mentioned will alone take just over 1 h. And then it still needs to create the tree....

Aiyubi avatar Aug 13 '16 20:08 Aiyubi

By default, the function get_cell_ids in pgoapi/utilities.py sets the cell_id radius to 1000 meters. The maximum radius allowed by the server is 1500 meters and the maximum number of cells allowed is 100.

In worker.py, the get_cell_ids request uses the default cell radius, which means that even with the default 70 m scan_radius the worker will still request map objects for a 1000 m radius from the server. This leads to more data in each request and response, and more time spent processing objects in that 1000 m radius. I don't know what the minimum allowed radius is, but I set mine to 500 m.
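
For illustration, the change described would look roughly like this at the call site in worker.py (the exact signature of get_cell_ids may differ between pgoapi versions; the radius argument is an assumption based on the description above):

```python
from pgoapi import utilities as pgoapi_utils

point = (47.0, 8.0)  # example (lat, lon)
# assumption: get_cell_ids accepts a radius argument, per the description above
cell_ids = pgoapi_utils.get_cell_ids(point[0], point[1], radius=500)
```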

If we look into the get_map_objects response, there are wild_pokemons, catchable_pokemons, nearby_pokemons, and spawn_points. Each spawn_points entry contains a latitude and longitude, which could be the same as the latitude and longitude in the sightings table. A spawn_points entry could be a pokemon that is too far away from the worker. I suspect that if the scan_radius is set too wide, the worker will possibly miss the pokemon because it is not close enough to the spawn point, and the related point will be mislabelled.

spawn_points can be almost 1500 meters away from the worker, so what about starting from the spawn_points in the map objects response?

  1. Set the workers to scan at the maximum radius (or 1000 m as the default).
  2. Run the scanning for about an hour or more.
  3. Store every spawn_points entry obtained from this scanning process.
  4. Even better if there is another worker that checks each spawn point for pokemon sightings while the scan process is running.
  5. Compare the spawn_points scan results with what is already in the DB. Add spawn points with no match to a new list.
  6. Generate points with the scan_radius (mine is 80 m). Map each spawn point (from the DB and from the unmatched scan results) to the generated points (see the sketch after this list).
  7. Points with no sightings could then be excluded from the scan point set.
  8. Points with no match could be added to the scan point set with a mark.
  9. If a marked point still has no sightings after a couple of scans, remove it from the scan point set.
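
For illustration, steps 5 and 6 could look roughly like this (the helper names and the `covers` callback are hypothetical):

```python
def unmatched_spawn_points(scanned, db_points, precision=6):
    """Step 5: return scanned spawn points that are not yet in the DB.
    Coordinates are rounded to avoid float-comparison noise."""
    known = {(round(lat, precision), round(lon, precision))
             for lat, lon in db_points}
    return [(lat, lon) for lat, lon in scanned
            if (round(lat, precision), round(lon, precision)) not in known]

def map_to_scan_points(spawn_points, scan_points, covers):
    """Step 6: map each spawn point to the scan points covering it.
    `covers(scan_point, spawn_point)` checks distance <= scan_radius."""
    return {sp: [p for p in scan_points if covers(p, sp)]
            for sp in spawn_points}
```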

gunawanputra avatar Aug 13 '16 20:08 gunawanputra

That's a big fucking problem right there, I didn't expect it to scale THAT badly. I'll look into it as fast as possible. Concerning possible solutions, do you think adding numpy as a dependency would get accepted? It would really speed up large array operations :)

rollator avatar Aug 13 '16 20:08 rollator

@gunawanputra right now you won't get any information about the location of a pokemon outside the 70 m radius, so you would still need to scan all the points within the cell even when you know that there are pokemon in it.

rollator avatar Aug 13 '16 20:08 rollator

@rollator update: it turns out that, given how Python performs the map, it is ~1 h including the tree creation. Will need to do some more debugging to find where all the other hours go.

Aiyubi avatar Aug 13 '16 21:08 Aiyubi

@Aiyubi I would guess that the 3 hours are spent in that huge horrible while True: loop :/ Right now I'm refactoring the code a bit; while tinkering around to improve things I simply had enough of working with this horrible code.

rollator avatar Aug 13 '16 21:08 rollator

Sooo it finished with samples = 1:

  • 16047 spawn points
  • 4482 optimized scan points
  • 7884 non-optimized scan points

Aiyubi avatar Aug 14 '16 06:08 Aiyubi

@Aiyubi could you test it again please? It should run MUCH faster now, which could enable you to increase the sample size and decrease the number of optimized scan points.

rollator avatar Aug 14 '16 14:08 rollator

I have a DB with over 300k rows. Will be trying this soon.

theresthatguy avatar Aug 14 '16 15:08 theresthatguy

@theresthatguy I wasn't talking about db entries, I was talking about spawn points. My db contains ~350k entries, so we're pretty similar in that sense, so I guess it should run for no more than a few seconds for you, and you can even set SAMPLES_PER_POINT to a higher value (~40 was fine for me) to achieve better results :)

rollator avatar Aug 14 '16 16:08 rollator

Number of spawnpoints: 16047
Generating samples...
Starting distance calculations
Distance calculations finished
Calculating best scan points
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "scanpoints.py", line 345, in calculate_minimal_pointset
    circle_indexes = choose_best_scan_points(inside_matrix)
  File "scanpoints.py", line 276, in choose_best_scan_points
    mult_matrix = scipy.sparse.diags(identity_array[0])
TypeError: diags() takes at least 2 arguments (1 given)

Aiyubi avatar Aug 14 '16 21:08 Aiyubi

I've tested it and it works much quicker after the refactor. Only the checking of spawn points takes a long time.

vDrag0n avatar Aug 15 '16 00:08 vDrag0n

@Aiyubi which Python & scipy versions are you using? This looks like an older scipy version.

@vDrag0n by "checking of spawn points" you mean the "safety check"? That part will be removed asap; however, I first wanted confirmation from other testers that the part where it prints 'lonely circle found' never gets executed. I haven't removed it because it shouldn't ever execute, but I wanted to be really sure. If you want, you can change line 359 in scanpoints.py from `scan_points = _check_scan_point_set_for_safety(points, scan_points)` to `#scan_points = _check_scan_point_set_for_safety(points, scan_points)` and it won't do this slow check. How long did it take you to run it now?

rollator avatar Aug 15 '16 00:08 rollator

About 5 minutes with 10 samples, before the safety check.
It's hard to check whether it gets printed or not. There's no pause for user input, and no logging of the message if there is one, since the worker goes straight to scanning. But I did purposely add a pause after the calculations completed and did not see any lonely circles with my data set.

vDrag0n avatar Aug 15 '16 01:08 vDrag0n

You're right. Added a commit to log the results to minimal_scan_calculations.log.

rollator avatar Aug 15 '16 05:08 rollator

@rollator you were right. I did pip install for the wrong Python version and had an old scipy.

  • you are missing pyproj in the requirements
  • I realized that this does not take gyms into account. Is it unlikely enough that we can assume we don't need this? I mean, for my ~100 km² that would be just 449 extra points (with an obviously bigger range)
  • seems to be working pretty well

Aiyubi avatar Aug 15 '16 15:08 Aiyubi