datasketch
Connecting AWS keyspaces cassandra
How do I connect to AWS Keyspaces (Cassandra), given that it requires an SSL certificate and the service's user name and password? How can these be passed to the `MinHashLSH` constructor? This is how to connect to AWS Keyspaces using Python:

```python
from ssl import SSLContext, PROTOCOL_TLSv1_2, CERT_REQUIRED

from cassandra.auth import PlainTextAuthProvider
from cassandra.cluster import Cluster

ssl_context = SSLContext(PROTOCOL_TLSv1_2)
ssl_context.load_verify_locations('path_to_file/sf-class2-root.crt')
ssl_context.verify_mode = CERT_REQUIRED
auth_provider = PlainTextAuthProvider(username='ServiceUserName', password='ServicePassword')
cluster = Cluster(['cassandra.us-east-2.amazonaws.com'], ssl_context=ssl_context,
                  auth_provider=auth_provider, port=9142)
session = cluster.connect()
r = session.execute('select * from system_schema.keyspaces')
print(r.current_rows)
```
Have you tried passing these as part of the connection configs in `cassandra`?
http://ekzhu.com/datasketch/lsh.html#connecting-to-existing-minhash-lsh
Yes, I have tried, but it doesn't accept the parameters required to connect to AWS Keyspaces. You can have a look at the parameters in the snippet above. I am currently able to connect to a local Cassandra, but connecting to AWS Keyspaces fails.
I haven't used AWS Cassandra. @ostefano do you have experience with this?
No experience with Cassandra on AWS either, unfortunately.
It is possible to connect to AWS Keyspaces by slightly tweaking the kwargs and the `get_session()` method in `CassandraSharedSession`. However, AWS Keyspaces does not yet support the `SELECT DISTINCT` query needed for `QUERY_GET_KEYS`. I have provided code below to demonstrate. Perhaps there is a way to rewrite the query to get around this constraint.
Calling `MinHashLSH` with AWS Keyspaces:
```python
from datasketch import MinHashLSH

# ssl_context and auth_provider are built as in the connection snippet in the question.
# The extra keys are only honoured by the modified get_session() shown next.
lsh = MinHashLSH(
    threshold=0.5, num_perm=128, storage_config={
        'type': 'cassandra',
        'basename': b'testing',
        'cassandra': {
            'seeds': ['cassandra.us-west-2.amazonaws.com'],
            'keyspace': 'tutorialkeyspace',
            'ssl_context': ssl_context,
            'auth_provider': auth_provider,
            'port': 9142,
            'replication': {
                'class': 'SimpleStrategy',
                'replication_factor': '3',
            },
            'drop_keyspace': False,
            'drop_tables': False,
        }
    }
)
```
Adjusting the `Cluster` instantiation for the AWS kwargs:
```python
# Method of CassandraSharedSession
@classmethod
def get_session(cls, seeds, **kwargs):
    keyspace = kwargs["keyspace"]
    replication = kwargs["replication"]
    if cls.__session is None:
        # Allow dependency injection
        session = kwargs.get("session")
        if session is None:
            ssl_context = kwargs.get("ssl_context")
            if ssl_context is None:
                # Plain connection, e.g. a local Cassandra
                cluster = c_cluster.Cluster(seeds)
            else:
                # AWS Keyspaces: TLS plus service credentials
                cluster = c_cluster.Cluster(
                    seeds,
                    ssl_context=ssl_context,
                    auth_provider=kwargs.get("auth_provider"),
                    port=kwargs.get("port", 9142),
                )
            session = cluster.connect()
        cls.__session = session
    if cls.__session.keyspace != keyspace:
        if kwargs.get("drop_keyspace", False):
            cls.__session.execute(cls.QUERY_DROP_KEYSPACE.format(keyspace))
        cls.__session.execute(cls.QUERY_CREATE_KEYSPACE.format(
            keyspace=keyspace,
            replication=str(replication),
        ))
        cls.__session.set_keyspace(keyspace)
    return cls.__session
```
@alexalbracht-firstparty thanks! Would you like to submit a PR to address this?
That specific query is only used once, and only to get all keys using the TOKEN function (special case).
Now, whether you can add a switch and handle AWS differently boils down to the following test:
- create a table with a PK containing a CK
- insert 4 records so that each PK is used twice (two distinct PKs in total)
- run the query you want to run without DISTINCT and see if you get 2 or 4 records.
If you get back 2, then you can safely go ahead and remove DISTINCT from the query when using AWS.
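The test above could be scripted roughly as follows. This is only a sketch: the table name `distinct_check`, its `pk`/`ck` columns, and both helper functions are hypothetical and not part of datasketch. It assumes `session` is a connected `cassandra.cluster.Session` with a keyspace already set, and that the table is usable right after `CREATE TABLE`, which may not hold on a managed service.

```python
import re

# Hypothetical table used only for this check; not part of datasketch.
TABLE = "distinct_check"


def without_distinct(query):
    """Drop the DISTINCT keyword from a CQL SELECT, leaving the rest untouched."""
    return re.sub(r"\bSELECT\s+DISTINCT\b", "SELECT", query,
                  count=1, flags=re.IGNORECASE)


def count_rows_without_distinct(session, query):
    """Insert 2 partition keys x 2 clustering keys, then count the rows
    returned by `query` once DISTINCT has been removed from it."""
    session.execute(
        f"CREATE TABLE IF NOT EXISTS {TABLE} "
        "(pk text, ck text, PRIMARY KEY (pk, ck))"
    )
    for pk, ck in [("a", "1"), ("a", "2"), ("b", "1"), ("b", "2")]:
        # %s placeholders are filled in by the driver for simple statements.
        session.execute(
            f"INSERT INTO {TABLE} (pk, ck) VALUES (%s, %s)", (pk, ck)
        )
    rows = session.execute(without_distinct(query))
    return sum(1 for _ in rows)
```

Calling `count_rows_without_distinct(session, query)` with the query you intend to run against `distinct_check` gives the number to compare against 2 and 4 as described above.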