`sentry.io` and `privacy.txt`
Disclaimer: I'm not a lawyer at all and I'm not your attorney, it's up to you on your own risk to choose your actions, I'm not liable for anything
privacy.txt says
If you experience a crash (eg a 500 error) we collect extra information about your configuration, strictly for the purposes of debugging that crash. That is shared with Sentry (https://sentry.io/) and they retain it for 30 days.
https://github.com/calpaterson/csvbase/blob/a57b45f1c056cdbedc8f404d8ada79b2b8258c8f/csvbase/web/func.py#L25-31 says
def set_current_user(user: User) -> None:
g.current_user = user
# This is duplication but very convenient for jinja templates
g.current_username = user.username
sentry.set_user(user)
https://github.com/calpaterson/csvbase/blob/a57b45f1c056cdbedc8f404d8ada79b2b8258c8f/csvbase/web/main/bp.py#L47 says
from ..func import (
...,
set_current_user,
...
)
https://github.com/calpaterson/csvbase/blob/a57b45f1c056cdbedc8f404d8ada79b2b8258c8f/csvbase/web/main/bp.py#L1440 says
def sign_in_user(user: User, session: Optional[Any] = None) -> None:
"""Sets the current user and sets a cookie to keep them loged in across
requests.
"""
set_current_user(user)
and sign_in_user is used in
https://github.com/calpaterson/csvbase/blob/a57b45f1c056cdbedc8f404d8ada79b2b8258c8f/csvbase/web/main/bp.py#L192
@bp.post("/new-table")
def new_table_form_submission() -> Response:
https://github.com/calpaterson/csvbase/blob/a57b45f1c056cdbedc8f404d8ada79b2b8258c8f/csvbase/web/main/bp.py#L1294
@bp.route("/register", methods=["GET", "POST"])
def register() -> Response:
https://github.com/calpaterson/csvbase/blob/a57b45f1c056cdbedc8f404d8ada79b2b8258c8f/csvbase/web/main/bp.py#L1332
@bp.route("/sign-in", methods=["GET", "POST"])
def sign_in() -> Response:
and CopyView.post
https://github.com/calpaterson/csvbase/blob/a57b45f1c056cdbedc8f404d8ada79b2b8258c8f/csvbase/web/main/bp.py#L793 , which is used in
https://github.com/calpaterson/csvbase/blob/a57b45f1c056cdbedc8f404d8ada79b2b8258c8f/csvbase/web/main/bp.py#L851
bp.add_url_rule(
"/<username>/<table_name>/copy", view_func=CopyView.as_view("copy_view")
)
.
Shouldn't one change the text of privacy.txt to make it say that "sentry.io gets an uuid (a pseudonimized identifier linked to a user's record in our database) of a user when he does the following actions: register, sign in, creates a new table or copies an existing table"?
Hi, thanks for taking the time to raise this issue.
There are a few things here, let me know if you agree:
- I naively assumed that calling
sentry.set_userwouldn't send any information to sentry - until (and if) that user saw an uncaught exception. I should review their source to confirm that this is the case.- btw
set_current_useris actually used in many, many places, not just when you register, create a table, copy a table, etc. It's called on nearly every page load if you're browsing while signed in.
- btw
- I should make clear that sentry also get your user uuid if you're signed in (which, as you say: is a randomised pseudonym)
- You haven't mentioned this but I think I should probably improve the language in privacy.txt to make it clear it's not just signed in users who get their data (eg browser version, operating system) sent to sentry - actually that can happen even if not signed in.
- And finally, perhaps I should look at sending sentry just a hash of the user-uuid? That might be better, perhaps in the future I might want to expose the user-uuid somewhere.
Thoughts?
calling sentry.set_user wouldn't send any information to sentry
Thanks for the info. In fact I have never personally used sentry.io so I'm unfamiliar to the API they have. I have checked the source of sentry_sdk and it seems you are right. I'd be nice to add a some words into the docstring saying that sentry_sdk.set_user call doesn't trigger event reporting, but just adds some info to other events.
2. 👍
3. 👍
4. I guess hashing defeats the purpose of sending a user id to Sentry, if I understand right that the purpose of that service is to collect data points and then somehow analyse them, and when there is anything wrong, then (I have never sentry.io myself and I may be wrong) you may want to map the id on Sentry to id on your service. While you still can undo hashing, it would require you to hash whole your database of users if you need that. I guess that instead of hashing one can apply a block cipher, with exact details depending on requirements. For example if one wants to make points somehow unlinkable to each other on Sentry side, for example making Sentry unable to detect patterns cross-sessions, one can tweak the cipher with session id, one more option is to somehow map time to some bucket id (roughly speaking, measure the time from session start and divide by some big value (month, week, 3 days, etc), and maybe even reset the start time after some long non-use period) for tweaking too. Also I wonder if uuids would have any use after it instead of integer incrementing ids. Of course they can be used to merge databases of different instances into one, but IMHO it is quite a rare operation that can be dealt separately (by assigning new ids) and doesn't worth an overhead on using uuids internally.
From my point of view, Sentry is just a place to record (automatically) information about crashes. For a webapp like csvbase, that pretty much just means 500s. It groups different instances of an error together and shows a view where you can see what happened: the url they were on, what version of the server was running at the time, etc. I'm only sending the user id so I can map it back to see who had the error as part of manual investigation.
I initially chose uuids for the security reasons - to avoid user enumeration bugs. Same for tables. There is some overhead to using them but postgres has a native type for uuids and so that is fairly small and I can probably go for a while until there is any real problem. Switching id types as you might imagine would mean changing a lot of code now :)
Alright, I'll use this issue to track the job of improving the privacy policy, if that's ok :)