csvbase icon indicating copy to clipboard operation
csvbase copied to clipboard

`sentry.io` and `privacy.txt`

Open KOLANICH opened this issue 2 years ago • 3 comments

Disclaimer: I'm not a lawyer at all and I'm not your attorney, it's up to you on your own risk to choose your actions, I'm not liable for anything

privacy.txt says

If you experience a crash (eg a 500 error) we collect extra information about your configuration, strictly for the purposes of debugging that crash. That is shared with Sentry (https://sentry.io/) and they retain it for 30 days.

https://github.com/calpaterson/csvbase/blob/a57b45f1c056cdbedc8f404d8ada79b2b8258c8f/csvbase/web/func.py#L25-31 says

def set_current_user(user: User) -> None:
    g.current_user = user

    # This is duplication but very convenient for jinja templates
    g.current_username = user.username

    sentry.set_user(user)

https://github.com/calpaterson/csvbase/blob/a57b45f1c056cdbedc8f404d8ada79b2b8258c8f/csvbase/web/main/bp.py#L47 says

from ..func import (
    ...,
    set_current_user,
    ...
)

https://github.com/calpaterson/csvbase/blob/a57b45f1c056cdbedc8f404d8ada79b2b8258c8f/csvbase/web/main/bp.py#L1440 says

def sign_in_user(user: User, session: Optional[Any] = None) -> None:
    """Sets the current user and sets a cookie to keep them loged in across
    requests.

    """
    set_current_user(user)

and sign_in_user is used in

https://github.com/calpaterson/csvbase/blob/a57b45f1c056cdbedc8f404d8ada79b2b8258c8f/csvbase/web/main/bp.py#L192

@bp.post("/new-table")
def new_table_form_submission() -> Response:

https://github.com/calpaterson/csvbase/blob/a57b45f1c056cdbedc8f404d8ada79b2b8258c8f/csvbase/web/main/bp.py#L1294

@bp.route("/register", methods=["GET", "POST"])
def register() -> Response:

https://github.com/calpaterson/csvbase/blob/a57b45f1c056cdbedc8f404d8ada79b2b8258c8f/csvbase/web/main/bp.py#L1332

@bp.route("/sign-in", methods=["GET", "POST"])
def sign_in() -> Response:

and CopyView.post https://github.com/calpaterson/csvbase/blob/a57b45f1c056cdbedc8f404d8ada79b2b8258c8f/csvbase/web/main/bp.py#L793 , which is used in

https://github.com/calpaterson/csvbase/blob/a57b45f1c056cdbedc8f404d8ada79b2b8258c8f/csvbase/web/main/bp.py#L851

bp.add_url_rule(
    "/<username>/<table_name>/copy", view_func=CopyView.as_view("copy_view")
)

.

Shouldn't one change the text of privacy.txt to make it say that "sentry.io gets an uuid (a pseudonimized identifier linked to a user's record in our database) of a user when he does the following actions: register, sign in, creates a new table or copies an existing table"?

KOLANICH avatar Sep 20 '23 07:09 KOLANICH

Hi, thanks for taking the time to raise this issue.

There are a few things here, let me know if you agree:

  1. I naively assumed that calling sentry.set_user wouldn't send any information to sentry - until (and if) that user saw an uncaught exception. I should review their source to confirm that this is the case.
    • btw set_current_user is actually used in many, many places, not just when you register, create a table, copy a table, etc. It's called on nearly every page load if you're browsing while signed in.
  2. I should make clear that sentry also get your user uuid if you're signed in (which, as you say: is a randomised pseudonym)
  3. You haven't mentioned this but I think I should probably improve the language in privacy.txt to make it clear it's not just signed in users who get their data (eg browser version, operating system) sent to sentry - actually that can happen even if not signed in.
  4. And finally, perhaps I should look at sending sentry just a hash of the user-uuid? That might be better, perhaps in the future I might want to expose the user-uuid somewhere.

Thoughts?

calpaterson avatar Sep 20 '23 13:09 calpaterson

calling sentry.set_user wouldn't send any information to sentry

Thanks for the info. In fact I have never personally used sentry.io so I'm unfamiliar to the API they have. I have checked the source of sentry_sdk and it seems you are right. I'd be nice to add a some words into the docstring saying that sentry_sdk.set_user call doesn't trigger event reporting, but just adds some info to other events. 2. 👍 3. 👍 4. I guess hashing defeats the purpose of sending a user id to Sentry, if I understand right that the purpose of that service is to collect data points and then somehow analyse them, and when there is anything wrong, then (I have never sentry.io myself and I may be wrong) you may want to map the id on Sentry to id on your service. While you still can undo hashing, it would require you to hash whole your database of users if you need that. I guess that instead of hashing one can apply a block cipher, with exact details depending on requirements. For example if one wants to make points somehow unlinkable to each other on Sentry side, for example making Sentry unable to detect patterns cross-sessions, one can tweak the cipher with session id, one more option is to somehow map time to some bucket id (roughly speaking, measure the time from session start and divide by some big value (month, week, 3 days, etc), and maybe even reset the start time after some long non-use period) for tweaking too. Also I wonder if uuids would have any use after it instead of integer incrementing ids. Of course they can be used to merge databases of different instances into one, but IMHO it is quite a rare operation that can be dealt separately (by assigning new ids) and doesn't worth an overhead on using uuids internally.

KOLANICH avatar Sep 20 '23 14:09 KOLANICH

From my point of view, Sentry is just a place to record (automatically) information about crashes. For a webapp like csvbase, that pretty much just means 500s. It groups different instances of an error together and shows a view where you can see what happened: the url they were on, what version of the server was running at the time, etc. I'm only sending the user id so I can map it back to see who had the error as part of manual investigation.

I initially chose uuids for the security reasons - to avoid user enumeration bugs. Same for tables. There is some overhead to using them but postgres has a native type for uuids and so that is fairly small and I can probably go for a while until there is any real problem. Switching id types as you might imagine would mean changing a lot of code now :)

Alright, I'll use this issue to track the job of improving the privacy policy, if that's ok :)

calpaterson avatar Sep 21 '23 07:09 calpaterson