
Connecting AWS keyspaces cassandra

Open Priyabrata409 opened this issue 3 years ago • 7 comments

How can I connect to AWS Keyspaces (Cassandra)? It asks for an SSL certificate and the service's user name and password. How can these be passed to MinHashLSH's constructor? This is how to connect to AWS Cassandra using Python:

    from cassandra.cluster import Cluster
    from ssl import SSLContext, PROTOCOL_TLSv1_2, CERT_REQUIRED
    from cassandra.auth import PlainTextAuthProvider

    ssl_context = SSLContext(PROTOCOL_TLSv1_2)
    ssl_context.load_verify_locations('path_to_file/sf-class2-root.crt')
    ssl_context.verify_mode = CERT_REQUIRED
    auth_provider = PlainTextAuthProvider(username='ServiceUserName', password='ServicePassword')
    cluster = Cluster(['cassandra.us-east-2.amazonaws.com'], ssl_context=ssl_context, auth_provider=auth_provider, port=9142)
    session = cluster.connect()
    r = session.execute('select * from system_schema.keyspaces')
    print(r.current_rows)

Priyabrata409 avatar Feb 18 '22 09:02 Priyabrata409

Have you tried passing these as part of the connection configs in cassandra? http://ekzhu.com/datasketch/lsh.html#connecting-to-existing-minhash-lsh

ekzhu avatar Jun 02 '22 19:06 ekzhu

Yes, I have tried, but it doesn't accept the parameters required to connect to AWS Cassandra. Could you have a look at the parameters? I am currently able to connect to a local Cassandra, but connecting to AWS Keyspaces fails.

Priyabrata409 avatar Jun 02 '22 19:06 Priyabrata409

I haven't used AWS Cassandra. @ostefano do you have experience with this?

ekzhu avatar Jun 07 '22 05:06 ekzhu

No experience with Cassandra AWS either unfortunately

ostefano avatar Jun 07 '22 09:06 ostefano

It is possible to connect to AWS Keyspaces by slightly tweaking the kwargs and the get_session() method in CassandraSharedSession. However, AWS Keyspaces does not yet support the SELECT DISTINCT query needed for QUERY_GET_KEYS. I have provided code below to demonstrate. Perhaps there is a way to rewrite the query to get around this constraint.

Call algorithm with AWS keyspaces

```python
# ssl_context, auth_provider and port are the objects/values built for the
# plain cassandra-driver connection (see the snippet in the original question).
lsh = MinHashLSH(
    threshold=0.5, num_perm=128, storage_config={
        'type': 'cassandra',
        'basename': b'testing',
        'cassandra': {
            'seeds': ['cassandra.us-west-2.amazonaws.com'],
            'keyspace': 'tutorialkeyspace',
            # Pass the objects themselves; {ssl_context} would create a set.
            'ssl_context': ssl_context,
            'auth_provider': auth_provider,
            'port': port,
            'replication': {
                'class': 'SimpleStrategy',
                'replication_factor': '3',
            },
            'drop_keyspace': False,
            'drop_tables': False,
        }
    }
)
```

Adjust Cluster instantiation for AWS kwargs

    def get_session(cls, seeds, **kwargs):
        keyspace = kwargs["keyspace"]
        replication = kwargs["replication"]

        if cls.__session is None:
            # Allow dependency injection
            session = kwargs.get("session")
            if session is None:
                # .get() avoids a KeyError when no ssl_context is configured
                ssl_context = kwargs.get("ssl_context")
                if ssl_context is None:
                    # Plain connection, e.g. a local Cassandra node
                    cluster = c_cluster.Cluster(seeds)
                else:
                    # TLS and credentials, as needed for AWS Keyspaces
                    cluster = c_cluster.Cluster(
                        seeds,
                        ssl_context=ssl_context,
                        auth_provider=kwargs.get("auth_provider"),
                        port=kwargs.get("port", 9142),
                    )
                session = cluster.connect()
            cls.__session = session

        if cls.__session.keyspace != keyspace:
            if kwargs.get("drop_keyspace", False):
                cls.__session.execute(cls.QUERY_DROP_KEYSPACE.format(keyspace))
            cls.__session.execute(cls.QUERY_CREATE_KEYSPACE.format(
                keyspace=keyspace,
                replication=str(replication),
            ))
            cls.__session.set_keyspace(keyspace)
        return cls.__session
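
The connection arguments that get_session() forwards to Cluster could also be collected in a small helper, so that only the options actually present in the config are passed on. This is a rough sketch and not part of datasketch: `build_cluster_kwargs` and `default_tls_port` are hypothetical names, and the 9142 default is simply the port used in the snippets in this thread.

```python
def build_cluster_kwargs(config, default_tls_port=9142):
    """Pick the optional Cluster keyword arguments out of a config dict.

    Hypothetical helper: returns only the options that are set, so a plain
    local connection still ends up calling Cluster(seeds) with no extras.
    """
    cluster_kwargs = {}
    for name in ("ssl_context", "auth_provider"):
        value = config.get(name)
        if value is not None:
            cluster_kwargs[name] = value

    port = config.get("port")
    if port is not None:
        # An explicit port always wins
        cluster_kwargs["port"] = port
    elif "ssl_context" in cluster_kwargs:
        # Assumption: fall back to the TLS port used elsewhere in this thread
        cluster_kwargs["port"] = default_tls_port
    return cluster_kwargs
```

Inside get_session() this would be used as `c_cluster.Cluster(seeds, **build_cluster_kwargs(kwargs))`, which keeps a single code path for both local and AWS connections.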

alexalbracht-firstparty avatar Feb 05 '24 19:02 alexalbracht-firstparty

@alexalbracht-firstparty thanks! Would you like to submit a PR to address this?

ekzhu avatar Feb 06 '24 10:02 ekzhu

That specific query is only used once, and only to get all keys using the TOKEN function (special case).

Now, whether you can add a switch and handle AWS differently boils down to the following test:

  1. create a table with a PK containing a CK
  2. insert 4 records so that the same PK is used twice (i.e. 2 distinct PK values, each with 2 rows)
  3. run the query you want to run without distinct and see if you get 2 or 4 records.

If you get back 2, then you can safely go ahead and remove DISTINCT from the query when using AWS.
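
The test above could be scripted roughly as follows. This is only a sketch under assumptions: the table name `distinct_probe`, its columns, and the function name are made up for illustration, the SELECT is a simplified stand-in for the real QUERY_GET_KEYS, and `session` is expected to be an already connected cassandra-driver session with a keyspace set. On a managed service, table creation may not be instantaneous, so a wait before the inserts may be needed.

```python
PROBE_TABLE = "distinct_probe"  # hypothetical table name, for illustration only


def count_rows_without_distinct(session, table=PROBE_TABLE):
    """Run the 3-step probe and return how many rows the SELECT yields."""
    # 1. table whose primary key has a partition key (key) and a clustering key (value)
    session.execute(
        f"CREATE TABLE IF NOT EXISTS {table} "
        "(key text, value text, PRIMARY KEY (key, value))"
    )
    # 2. four records: two partition keys, each used twice
    for key, value in [("a", "1"), ("a", "2"), ("b", "1"), ("b", "2")]:
        session.execute(
            f"INSERT INTO {table} (key, value) VALUES (%s, %s)", (key, value)
        )
    # 3. the key query without DISTINCT
    rows = session.execute(f"SELECT key, TOKEN(key) AS f_token FROM {table}")
    return len(list(rows))
```

A result of 2 would mean DISTINCT is redundant on that backend; a result of 4 would mean the keys still need to be deduplicated some other way.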

ostefano avatar Feb 06 '24 10:02 ostefano