datahub icon indicating copy to clipboard operation
datahub copied to clipboard

Oracle ingestion -> Not ingesting system table space

Open jjoyce0510 opened this issue 2 years ago • 2 comments

Describe the bug

Reported originally by @atulsaurav. When ingesting from Oracle, we do not ingest any system table space tables. This should at least be

To Reproduce Steps to reproduce the behavior:

  1. Go to '...'
  2. Click on '....'
  3. Scroll down to '....'
  4. See error

Expected behavior A clear and concise description of what you expected to happen.

Screenshots If applicable, add screenshots to help explain your problem.

Desktop (please complete the following information):

  • OS: [e.g. iOS]
  • Browser [e.g. chrome, safari]
  • Version [e.g. 22]

Additional context Add any other context about the problem here.

jjoyce0510 avatar Jun 14 '22 16:06 jjoyce0510

Here is some additional information There is an example from SQLAlchemy version used in datahub -

The get_table_names function does a few things -

  • if you look at the query in this function, it queries ALL_TABLES. This would only pull tables names that the id has select privileges on. Usually it is harder to get a master id created that has access to all data in all tables. The other option is to request catalog access on oracle databases, where in the id doesnot has SELECT privileges on any tables except oracle catalog tables. however in this case pulling table list from ALL_TABLES is not helpful. However with catalog privileges, DBA_TABLES does present a listing of all tables on the database. So right off the bat, not all tables are pulled by simply using this method. We patched this on our side by replacing ALL_* tables with DBA_* tables. That fixed part of the problem.
  • The second level of exclusions result from the filter criteria in the query used in this function -
    sql_str = "SELECT table_name FROM dba_tables WHERE "
        if self.exclude_tablespaces:
            sql_str += (
                "nvl(tablespace_name, 'no tablespace') "
                "NOT IN (%s) AND "
                % (", ".join(["'%s'" % ts for ts in self.exclude_tablespaces]))
            )
        sql_str += (
            "OWNER = :owner "
            "AND IOT_NAME IS NULL "
            "AND DURATION IS NOT NULL"
        )

The excluded_tablespaces and DURATION above filters out tables in SYS and SYSAUX tablespaces, and temporary tables respectively.

Excluded tablespaces can be overridden by specifying additional connection parameters in the recipe (maybe a good idea to document it) but temp tables may need a patch.

Anyhow, it may still be a good idea to document the current state so users are aware of scenarios in which they may be missing tables in the catalog.

I am happy to discuss more on this.. feel free to slack me.

atulsaurav avatar Jun 22 '22 21:06 atulsaurav

Hi @atulsaurav I believe the behavior of get_table_names is not a bug in DataHub. You may need to contact the SqlAlchemy team and specifically, the oracle driver team who has written the function in such a manner. They might have used ALL_TABLES instead of DBA_TABLES for security concerns.

You can control excluded_tablespaces from the oracle source recipe. Please find the sample oracle source recipe with options

source:
  type: oracle
  config:
    # Coordinates
    host_port: <host>:<port>
    database: <database>

    # Credentials
    username: <user-name>
    password: <password>
    options:
        exclude_tablespaces: ["SYSAUX"]

sink:
  type: "datahub-rest"
  config:
    server: "http://localhost:8080" 

Please let me know if you face any issues.

siddiquebagwan avatar Aug 02 '22 13:08 siddiquebagwan

Hey people, wouldn't be necessary to override the behavior of the other method to use the DBA_* tables too? E.g: get_columns (DBA_TAB_COLUMNS), get_table_comment (DBA_TAB_COMMENTS), etc.

I'm testing the oracle ingestion with a user with only metadata privilege (tables DBA_*). The ingestion works but the table's columns are not ingested. I think it is because these method are still reading metadata using the ALL_* tables (from sqlalchemy), that only return data from tables where the user can read from, and our oracle user can't (ideally we don't want to grant select on tables to the user, because we're only interested in metadata).

pdraguiar avatar Sep 06 '22 17:09 pdraguiar

Fixed by PR: https://github.com/datahub-project/datahub/pull/5592. Closing.

rslanka avatar Sep 09 '22 00:09 rslanka