presto-python-client

Provide convenient way for loading query results into a data frame

Open kebab-mai-haddi opened this issue 5 years ago • 16 comments

I want to create a data frame from Hive using Presto. I have it working, with one exception: my data contains empty strings, which would become NaN if the corresponding CSV file were read with pandas (pd.read_csv()).

I read through the documentation but found nothing that addresses this.

My code for creating a data frame from Presto is pretty straightforward:

def get_pandas_dataframe(self, hql, parameters=None):
    # Without an Airflow connection, query Presto directly through the client.
    if not self.airflow_conn_reqd:
        import pandas
        cursor = self.presto_client.cursor()
        try:
            cursor.execute(self._strip_sql(hql), parameters)
            data = cursor.fetchall()
        except DatabaseError as e:
            raise PrestoException(self._get_pretty_exception_message(e))
        column_descriptions = cursor.description
        if data:
            df = pandas.DataFrame(data)
            # The first field of each description entry is the column name.
            df.columns = [c[0] for c in column_descriptions]
        else:
            df = pandas.DataFrame()
        return df
    # Otherwise delegate to get_pandas_df with the same query.
    return self.get_pandas_df(hql)

As you can see, the frame is built with df = pandas.DataFrame(data). I read the documentation of pandas.DataFrame() and could not find an equivalent of the keep_default_na option of pd.read_csv(), so that empty strings '' would be treated as NaN the way read_csv treats them.
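One possible workaround, as a minimal sketch rather than anything the client provides: convert the empty strings before the rows reach pandas.DataFrame(). The helper names blank_to_nan and rows_to_dataframe are hypothetical, and the sketch assumes rows as returned by cursor.fetchall() and a DB-API style cursor.description.

```python
import math


def blank_to_nan(rows, blank=""):
    """Return a copy of rows with every value equal to `blank` replaced by NaN.

    `rows` is any iterable of row sequences, e.g. the result of cursor.fetchall().
    """
    nan = float("nan")
    return [tuple(nan if value == blank else value for value in row) for row in rows]


def rows_to_dataframe(rows, description):
    """Build a DataFrame from DB-API rows, treating empty strings as NaN.

    `description` is cursor.description; its first field per entry is the column name.
    pandas is imported lazily so blank_to_nan stays usable without it.
    """
    import pandas

    columns = [entry[0] for entry in description]
    return pandas.DataFrame(blank_to_nan(rows), columns=columns)


if __name__ == "__main__":
    cleaned = blank_to_nan([("a", ""), ("", 1)])
    print(cleaned[0][0], math.isnan(cleaned[0][1]))
```

If the frame has already been built, df.replace('', float('nan')) should have a similar effect on an existing DataFrame.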

kebab-mai-haddi avatar Jun 17 '19 11:06 kebab-mai-haddi

@avisrivastava254084 can you describe the schema of the tables you are reading and provide an example with rows from this table? In other words, could you describe the data that you're querying from Presto and the values you expect in the pandas DataFrame?

ggreg avatar Jun 25 '19 03:06 ggreg

@ggreg sorry for the late response. Can we close this issue and discuss the main topic, "Creating pandas data frame in presto client", instead? This issue

kebab-mai-haddi avatar Jun 26 '19 21:06 kebab-mai-haddi

This is the repository for the Presto Python client. The other issue is in the Presto engine codebase.

ggreg avatar Jun 26 '19 21:06 ggreg

@avisrivastava254084 Aviral, prestodb/presto-python-client (this repo) is the proper place for Python client enhancements. The other repo, prestodb/presto, contains the code for Presto itself. Are you thinking of making changes in the Presto client or in the Presto code?

mbasmanova avatar Jun 26 '19 21:06 mbasmanova

I am so sorry, I am thinking of the Presto client only. Let's continue here.

So, do you guys think we should have pandas data frame creation in here as well? I could copy the code from Airflow's PrestoHook, because I use both. I think it would be easier for Presto users who code in Python if they were given this feature: creating a pandas data frame using Presto.

Does this make sense?

kebab-mai-haddi avatar Jun 26 '19 22:06 kebab-mai-haddi

@avisrivastava254084 Makes sense to me.

mbasmanova avatar Jun 26 '19 23:06 mbasmanova

I'd like to avoid introducing a dependency on pandas. I was thinking of providing an interface to return the data in the right format.

ggreg avatar Jun 26 '19 23:06 ggreg

@ggreg I am telling you from my own experience: this library is in use, and data scientists need the ability to create data frames. Imagine being a data scientist and having the right abstractions over S3 (where the data is stored), Hive (where the metadata is stored) and SQL (Presto, which is the engine).

Beyond pandas, we should also think about implementing the same for pyspark, i.e. creating a Spark data frame from Presto.

But I would like your input on this and your thought process; maybe I am biased.

kebab-mai-haddi avatar Jun 27 '19 09:06 kebab-mai-haddi

@avisrivastava254084 I appreciate the value of supporting this. My main concern is to not conflate too many things into the Presto Python client library.

What do you think about providing these features in another library, called presto-python-common (I welcome better names too :) )?

ggreg avatar Jul 18 '19 22:07 ggreg

Name sounds good. Shall I begin?

kebab-mai-haddi avatar Jul 19 '19 16:07 kebab-mai-haddi

I'm preparing the repo so that it follows the requirements for an open source repository and makes it easy to contribute new modules. I'll ping when it is ready.

Meanwhile, what should we call the module for manipulating data frames? Should we just call it prestodb.dataframe and provide a get(sql, **kwargs) -> pandas.DataFrame function?
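To make the proposal concrete, here is a rough sketch of what such a get function could look like. It is an assumption, not an existing module: prestodb.dataframe does not exist in the client, the fetch helper and the parameters argument are hypothetical, and the keyword arguments are assumed to be forwarded unchanged to prestodb.dbapi.connect.

```python
def fetch(connection, sql, parameters=None):
    """Run sql on any DB-API 2.0 connection and return (column_names, rows)."""
    cursor = connection.cursor()
    if parameters is None:
        cursor.execute(sql)
    else:
        cursor.execute(sql, parameters)
    rows = cursor.fetchall()
    # Read the description after fetching, as the code earlier in this thread does.
    columns = [entry[0] for entry in cursor.description]
    return columns, rows


def get(sql, parameters=None, **kwargs):
    """Hypothetical prestodb.dataframe.get: query Presto, return a pandas.DataFrame.

    kwargs are passed to prestodb.dbapi.connect (host, port, user, catalog, ...).
    pandas and prestodb are imported lazily, so pandas stays an optional dependency.
    """
    import pandas
    import prestodb.dbapi

    connection = prestodb.dbapi.connect(**kwargs)
    columns, rows = fetch(connection, sql, parameters)
    return pandas.DataFrame(rows, columns=columns)
```

Keeping the pandas import inside get means the core client would not need pandas installed, which speaks to the dependency concern raised earlier in the thread.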

ggreg avatar Jul 22 '19 23:07 ggreg

Looks good to me.

kebab-mai-haddi avatar Jul 23 '19 00:07 kebab-mai-haddi

Is there any progress on this? It would be incredibly helpful for my use cases.

adeora avatar Feb 21 '20 16:02 adeora

@ggreg any update? I am still up for this.

kebab-mai-haddi avatar Feb 21 '20 16:02 kebab-mai-haddi

@ggreg no longer works on the project. @mayankgarg1990 Do you know what might help here?

mbasmanova avatar Feb 21 '20 17:02 mbasmanova

Thanks @mbasmanova !

@mayankgarg1990 As you can see, there is clearly a need for this. Let me know if you guys need some context; if time is an issue, we can speed this up on a call or something.

hi at aviralsrivastava dot com

kebab-mai-haddi avatar Feb 21 '20 17:02 kebab-mai-haddi