
Experiment with using Mozilla Location Service to detect new cell towers

Open simonft opened this issue 4 years ago • 6 comments

Mozilla Location Service has some cell sites that wigle and opencellid don't, at least in NYC. It doesn't look like there's a good way to search for a specific cell tower using the API, but they do provide gzipped versions of the full data here: https://location.services.mozilla.com/downloads.

We could either build and run our own API to query the data, or download it and ingest it into either the mysql database we already have or a sqlite db.

The sqlite db seems like the easiest option. Building and running an API is probably more work than it's worth before we know it's useful, and importing into the mysql db means we'd be storing a whole copy for each individual project.
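
A rough sketch of what the sqlite route might look like, using only the Python standard library. The column names (`radio`, `mcc`, `net`, `area`, `cell`, `lon`, `lat`, `range`) are an assumption about the header of the downloaded export and should be checked against the real file; the table name `cells`, the `range_m` column, and both function names are made up for illustration.

```python
import csv
import gzip
import sqlite3

# ASSUMPTION: these names match the header row of the downloaded export.
CSV_COLUMNS = ("radio", "mcc", "net", "area", "cell", "lon", "lat", "range")


def ingest_cells(csv_gz_path, db_path):
    """Load a gzipped cell export into a SQLite table; return the row count."""
    conn = sqlite3.connect(db_path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS cells ("
        "radio TEXT, mcc INTEGER, net INTEGER, area INTEGER, cell INTEGER, "
        "lon REAL, lat REAL, range_m INTEGER, "
        "PRIMARY KEY (radio, mcc, net, area, cell))"
    )
    with gzip.open(csv_gz_path, "rt", newline="") as fh:
        reader = csv.DictReader(fh)
        # Stream rows so the full export never has to sit in memory.
        rows = (tuple(row[name] for name in CSV_COLUMNS) for row in reader)
        conn.executemany(
            "INSERT OR REPLACE INTO cells VALUES (?, ?, ?, ?, ?, ?, ?, ?)", rows
        )
    conn.commit()
    count = conn.execute("SELECT COUNT(*) FROM cells").fetchone()[0]
    conn.close()
    return count


def lookup_cell(db_path, mcc, net, area, cell):
    """Return (lon, lat, range_m) for a cell, or None if it is not known."""
    conn = sqlite3.connect(db_path)
    row = conn.execute(
        "SELECT lon, lat, range_m FROM cells "
        "WHERE mcc = ? AND net = ? AND area = ? AND cell = ?",
        (mcc, net, area, cell),
    ).fetchone()
    conn.close()
    return row
```

Each project would then only need the one `.db` file next to it, and a lookup is a single indexed query.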

simonft avatar Jun 20 '20 19:06 simonft

I can spin you guys up a CRUD API on AWS Lambda if it helps. Once I understand the requirements it'd likely be a day's work. Let me pull that DB and I'll give you an idea on effort; as I said, I think it's pretty minimal :-)

marcfielding1 avatar Aug 10 '20 15:08 marcfielding1

that would be rad! definitely want to check as many reference DBs as possible. @marcfielding1 any idea how much work and $$ that would be?

cooperq avatar Aug 14 '20 20:08 cooperq

@cooperq Depends what you need. If it's just a couple of endpoints you can just buy me a beer if you ever come this way - obviously if the EFF finds themselves in need in the future hopefully you'd think of me! You can get me on LinkedIn here: https://www.linkedin.com/in/marc-fielding-a8bb3293/ - if you like we can arrange a zoom call and I'll get it spun up within a day I reckon.

I also reckon if you use API gateway edge optimized endpoints it'll give global points of presence at pretty good latency.

I'm just unpacking the data now. If it's just search functionality you need, that's really easy. I'm just thinking about security really - the easiest way would be to have JWT tokens, but then you'd require a sign-up interface, which I can also create a simple React app and endpoints for. I think it's important to consider auth for it because as an organization the EFF is a target for the bad guys who'd love to spam the hell out of an endpoint. Trouble is Lambda won't fall over, it just scales, which could get a tad expensive.

First question is really what sort of queries you want: any of the fields in that CSV or just particular ones? Also, do you want a radial search on long and lat?
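
For reference, "radial search" here would mean something like the sketch below: return every known tower within some distance of a point. This is only an illustration in plain Python with a brute-force scan; the function names and the `lat`/`lon` dictionary keys are hypothetical, and a real service would lean on the datastore's own geo index rather than looping over every row.

```python
import math

EARTH_RADIUS_M = 6371000.0  # mean Earth radius in metres


def haversine_m(lat1, lon1, lat2, lon2):
    """Great-circle distance in metres between two lat/lon points."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlam = math.radians(lon2 - lon1)
    a = (
        math.sin(dphi / 2) ** 2
        + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2
    )
    return 2 * EARTH_RADIUS_M * math.asin(math.sqrt(a))


def radial_search(towers, lat, lon, radius_m):
    """Return (distance_m, tower) pairs within radius_m, nearest first."""
    hits = []
    for tower in towers:
        distance = haversine_m(lat, lon, tower["lat"], tower["lon"])
        if distance <= radius_m:
            hits.append((distance, tower))
    hits.sort(key=lambda hit: hit[0])
    return hits
```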

Second question: do you guys have AWS infrastructure or would you like me to host it? Running costs will be pretty small: you'll need an Elasticsearch instance capable of holding the 13 million masts + other DBs, then ongoing Lambda costs moving forward. The reason I recommend Lambda at this point is I don't have a traffic profile; I'm guessing it's very sporadic, without millions and millions of requests?

Also, would you like the ability to create masts in the DB so that data between operators/sweeps persists and can be compared against historical data in the future? This would make sense if you create user profiles, since Bob can do his stuff and Steve could then leverage historical data to check for masts popping up and disappearing etc.

If you're recording the data gathered by crocodile hunter we can start experimenting with #32 (machine learning) to provide a probability score. Really you need as much data as possible on each mast, then labelled samples of fake ones - which in itself isn't a hard thing to do. Unsupervised learning, i.e. "anomaly detection", is also possible with this type of recording of data, especially if the handshake is important.

On the vulns you discussed in 4G in terms of handshake: as I understand it, the handshake is where things get funky. What if you recorded the handshake across as many masts as possible so you could weed out the ones that did odd stuff (like downgrading to 2G)?

Anyway let me know when you're free and we can iron out the details.

marcfielding1 avatar Aug 16 '20 11:08 marcfielding1

Any ideas on the above? I've got a bit of time in the next couple of weeks so could knock this out for you with the right info :-)

marcfielding1 avatar Aug 21 '20 20:08 marcfielding1

DM me on twitter and we can chat, same username as here: @cooperq

cooperq avatar Aug 25 '20 18:08 cooperq

I'm not sure if the two of you have talked on twitter, but my thoughts:

I'm not sure if we need a fancy searching API. I think we're just going to be searching by tower id and cell id, and Cooper, correct me if I'm wrong, but I don't think we need/want to return anything if only one of them matches. It's probably good enough to use a key/value store where the key is "{tower_id}_{cell_id}". Then if the tower isn't in the database, or if it's only been seen far away in the past, it can be flagged. For that we can probably get away with just using, say, an S3 bucket. Either it can require an AWS access key to use, or we can just leave it open. A malicious actor would have to make quite a few requests to it before the amount of money starts to get noticeable.
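
A minimal sketch of that lookup, to make the idea concrete. A plain dict stands in for the key/value store (with S3 each key would be an object name instead); the function names, the `lat`/`lon` record fields, the status strings, and the 5 km threshold are all placeholders I made up, not anything the project has settled on. Distance uses a flat-earth (equirectangular) approximation, which should be close enough for a "far away" check over a few kilometres.

```python
import math

EARTH_RADIUS_M = 6371000.0  # mean Earth radius in metres


def tower_key(tower_id, cell_id):
    """Build the combined key; a match on only one id is never a hit."""
    return f"{tower_id}_{cell_id}"


def approx_distance_m(lat1, lon1, lat2, lon2):
    """Equirectangular approximation of the distance in metres."""
    x = math.radians(lon2 - lon1) * math.cos(math.radians((lat1 + lat2) / 2))
    y = math.radians(lat2 - lat1)
    return EARTH_RADIUS_M * math.hypot(x, y)


def check_tower(store, tower_id, cell_id, lat, lon, max_distance_m=5000.0):
    """Classify an observed tower against the reference store.

    Returns "unknown" if the key is absent, "far_from_known_location" if
    the stored position is further than max_distance_m away, else "ok".
    """
    record = store.get(tower_key(tower_id, cell_id))
    if record is None:
        return "unknown"
    distance = approx_distance_m(lat, lon, record["lat"], record["lon"])
    if distance > max_distance_m:
        return "far_from_known_location"
    return "ok"
```

Anything that comes back as something other than "ok" would be a candidate for flagging.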

simonft avatar Sep 13 '20 17:09 simonft