Unstructured data to structured data conversion via `EXTRACT_COLUMN`

Open hershd23 opened this issue 1 year ago • 13 comments

Added a custom function for extracting columns from unstructured data. New file: ../evadb/functions/extract_columns.py

hershd23 avatar Nov 03 '23 08:11 hershd23

@xzdandy I created a Python notebook as well, but it gets gitignored while the rest of the tutorial notebooks don't. Any idea why?

hershd23 avatar Nov 03 '23 08:11 hershd23

Solves #1235

hershd23 avatar Nov 03 '23 08:11 hershd23

@xzdandy @pchunduri6 I moved to a "one-column-at-a-time" implementation as you recommended.

The notebook has the implementation.

hershd23 avatar Nov 28 '23 02:11 hershd23

@xzdandy @pchunduri6 I think this PR is ready for review for the one-column-at-a-time case.

For the other changes discussed with either of you, I think it makes sense to take them up in a separate PR; otherwise this one will bloat. Let me know what you think.

hershd23 avatar Nov 29 '23 04:11 hershd23

Can we also add a long integration test for the function under https://github.com/georgia-tech-db/evadb/tree/staging/test/integration_tests/long/functions? We can skip the test in CircleCI because it needs an OpenAI key, but I think it is good to have one.

It can either be end-to-end (i.e., SQL queries) or directly test the function class.

xzdandy avatar Nov 29 '23 09:11 xzdandy
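
A minimal sketch of how such a test could skip itself when no OpenAI key is available, using only the standard library's `unittest`. The helper `run_extract_column`, the class name, and the sample text are hypothetical placeholders, not code from this PR; the real test would call the function class directly or issue the SQL query.

```python
import os
import unittest

# True only when an OpenAI key is present in the environment.
OPENAI_KEY_SET = bool(os.environ.get("OPENAI_API_KEY"))


def run_extract_column(text: str, field_name: str) -> str:
    """Hypothetical placeholder: replace with the real EXTRACT_COLUMN call
    (either the function class or an end-to-end SQL query)."""
    raise NotImplementedError("wire this to the function under test")


@unittest.skipUnless(OPENAI_KEY_SET, "requires OPENAI_API_KEY")
class ExtractColumnLongTest(unittest.TestCase):
    def test_extracts_requested_field(self):
        # Illustrative input only; not taken from the PR's notebook.
        text = "Order 42 was shipped to Alice in Atlanta."
        result = run_extract_column(text, "customer name")
        self.assertIn("Alice", result)
```

With this shape, CI environments without the key report the test as skipped rather than failed, while a developer with a key set locally still exercises the function.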

Yes @xzdandy on it

hershd23 avatar Nov 29 '23 21:11 hershd23

Also, this is failing the linter check for a Colab notebook. Can you point me towards information on how to add that?

hershd23 avatar Nov 30 '23 07:11 hershd23

> Also, this is failing the linter check for a Colab notebook. Can you point me towards information on how to add that?

Remove the last empty cell.

xzdandy avatar Dec 01 '23 07:12 xzdandy

12-01-2023 17:31:12 [check_notebook_format:295] ERROR: ERROR: Notebook /Users/hershdhillon23/projects/evadb/script/formatting/../../tutorials/20-structured-data.ipynb does not contain correct Colab link -- update the link.

I do not have a Colab link right now.

hershd23 avatar Dec 01 '23 22:12 hershd23

> 12-01-2023 17:31:12 [check_notebook_format:295] ERROR: ERROR: Notebook /Users/hershdhillon23/projects/evadb/script/formatting/../../tutorials/20-structured-data.ipynb does not contain correct Colab link -- update the link.
>
> I do not have a Colab link right now.

The current notebook does not actually work on Colab. I tried to get it working yesterday, and I think it needs several modifications. One fix that would help: can you add EXTRACT_COLUMN to the bootstrap functions in https://github.com/georgia-tech-db/evadb/blob/staging/evadb/functions/function_bootstrap_queries.py?

xzdandy avatar Dec 02 '23 06:12 xzdandy
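
A rough sketch of what registering the function as a bootstrap query might look like. It assumes the bootstrap module builds `CREATE FUNCTION IF NOT EXISTS ... IMPL '...'` query strings from an installation-directory constant; the constant's value, the variable names, and the list below are illustrative stand-ins and should be checked against the actual file.

```python
# Assumption: stand-in for the installation-directory constant used by the
# bootstrap module; the real value is resolved by EvaDB itself.
EVADB_INSTALLATION_DIR = "/path/to/evadb"

# Assumption: the implementation file is functions/extract_columns.py,
# as named in the PR description.
extract_column_function_query = """CREATE FUNCTION IF NOT EXISTS EXTRACT_COLUMN
        IMPL '{}/functions/extract_columns.py';
        """.format(EVADB_INSTALLATION_DIR)

# Assumption: bootstrap queries are collected in a list and executed at startup.
# This list is hypothetical and only shows where the new query would be added.
bootstrap_queries = [
    extract_column_function_query,
]
```

Because the statement uses `IF NOT EXISTS`, running it on every startup should be harmless once the function is already registered.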

Should we perform this operation using ChatGPT directly or use something like pandasAI to write a function using LLM and then extract the column we need? Writing a function is much cheaper token cost-wise, but less robust. @hershd23 @xzdandy Any thoughts?

pchunduri6 avatar Dec 04 '23 15:12 pchunduri6

> Should we perform this operation using ChatGPT directly or use something like pandasAI to write a function using LLM and then extract the column we need? Writing a function is much cheaper token cost-wise, but less robust. @hershd23 @xzdandy Any thoughts?

Hi @pchunduri6, I think it depends on the task. If the column to extract follows a pattern, we can generate a regex to save cost and improve efficiency. If the task is semantic, on the other hand, we need to rely on the LLM to extract the information.

xzdandy avatar Dec 04 '23 20:12 xzdandy
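
The trade-off described above can be sketched as a pattern-first extractor with an LLM fallback. This is only an illustration under stated assumptions: `extract_field`, `extract_with_pattern`, and the `llm_fallback` callable are hypothetical names, the email regex is a simplified example, and the fallback here is a stub rather than a real ChatGPT call.

```python
import re
from typing import Callable, Optional


def extract_with_pattern(text: str, pattern: str) -> Optional[str]:
    """Return the first regex match (or its first group), or None if no match."""
    match = re.search(pattern, text)
    if match is None:
        return None
    return match.group(1) if match.groups() else match.group(0)


def extract_field(
    text: str,
    pattern: Optional[str],
    llm_fallback: Callable[[str], str],
) -> str:
    """Try the cheap regex path first; call the LLM only when it fails.

    `llm_fallback` is a hypothetical callable standing in for an LLM request.
    """
    if pattern is not None:
        value = extract_with_pattern(text, pattern)
        if value is not None:
            return value
    return llm_fallback(text)


# Pattern-based field: handled by the regex, so no tokens are spent.
email = extract_field(
    "Contact: alice@example.com for details",
    r"[\w.+-]+@[\w-]+\.[\w.-]+",
    llm_fallback=lambda t: "<llm result>",
)

# Semantic field: no pattern is available, so the (stubbed) LLM path is used.
sentiment = extract_field(
    "The delivery was late and the box was damaged.",
    None,
    llm_fallback=lambda t: "<llm result>",
)
```

Here `email` comes from the regex and `sentiment` from the stubbed fallback, so token cost is paid only for rows the pattern cannot handle.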