
Uploading tables with string columns to the database in Python 3


Python 3 treats all strings as unicode (link), which uses a different number of bytes per character than an ASCII string. This causes problems when uploading tables with string columns (such as TILENAME) into a VARCHAR2 column in the database, leading to errors like

ORA-12899: value too large for column "BECHTOL"."Y6A1_EXTINCTION"."TILENAME" (actual: 24, maximum: 12)
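
For a sense of where the extra bytes come from, here is a minimal sketch (the tile name value is made up for illustration; the point is the itemsize difference between numpy's fixed-width bytes and unicode dtypes):

import numpy as np

tilename = 'DES0000+0209'  # hypothetical 12-character tile name
print(len(tilename.encode('ascii')))  # 12 bytes when encoded as ASCII
print(np.dtype('S12').itemsize)       # 12: fixed-width bytes dtype
print(np.dtype('U12').itemsize)       # 48: numpy unicode uses 4 bytes per character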

Here's a more detailed illustration of the issue:

import sys

import numpy as np
from astropy.table import Table
import easyaccess as ea

print('easyaccess version:', ea.__version__)
print('python version:', sys.version)

query = '''
SELECT tilename 
FROM y6a1_coadd_object_summary
WHERE rownum < 10;
'''

try:
    print('=== Stay within pandas ===')
    connection = ea.connect()
    df_pandas = connection.query_to_pandas(query)
    connection.pandas_to_db(df_pandas, tablename='ea_test_pandas', append=False)
    connection.describe_table('ea_test_pandas') 
    connection.close()
except Exception as e:
    print(e)
    print('pandas upload error')
    connection.close()

fileout = 'easyaccess_test.fits'

try:
    print('=== Save to file and then upload directly ===')
    connection = ea.connect()
    connection.query_and_save(query, fileout)
    connection.load_table(fileout, name='ea_test_fileio', memsize=100)
    connection.describe_table('ea_test_fileio')
    connection.close()
except Exception as e:
    print(e)
    print('file upload error')
    connection.close()

try:
    print('=== Save to file and then upload via pandas ===')
    data = Table.read(fileout, format='fits')
    df_multistep = data.to_pandas()
    connection = ea.connect()
    connection.pandas_to_db(df_multistep, tablename='ea_test_multistep', append=False)
    connection.describe_table('ea_test_multistep') 
    connection.close()
except Exception as e:
    print(e)
    print('fileio to pandas upload error')
    connection.close()

try:
    print('=== Save to file, convert encoding, then upload via pandas ===')
    data = Table.read(fileout, format='fits')
    df_convert = data.to_pandas()
    df_convert['TILENAME'] = np.array(list(map(lambda s: s.decode('ascii'), df_convert['TILENAME'])), dtype=object)
    connection = ea.connect()
    connection.pandas_to_db(df_convert, tablename='ea_test_convert', append=False)
    connection.describe_table('ea_test_convert') 
    connection.close()
except Exception as e:
    print(e)
    print('convert upload error')
    connection.close()

This is a simple test that downloads the TILENAME from the first 10 rows of the table and then tries to upload it to a new table in several different ways. Among these four examples, only the first and fourth currently work with Python 3.

Here's the critical line that converts the encoding in the DataFrame. It's a little clumsy because there isn't a vectorized function for the encoding conversion.

df_convert['TILENAME'] = np.array(list(map(lambda s: s.decode('ascii'), df_convert['TILENAME'])), dtype=object)
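
As an aside, pandas' Series.str.decode can express the same conversion more compactly; it still applies the decode element-wise under the hood, but it returns an object-dtype column of native strings directly:

df_convert['TILENAME'] = df_convert['TILENAME'].str.decode('ascii')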

Note that the datatype for the converted column must be object and not a fixed-width numpy string dtype. I've also checked that using NVARCHAR2 instead of VARCHAR2 as the column type in the database does not work (and anyway, we probably don't want unicode columns in the database).
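
To see why the dtype matters, compare what numpy infers for decoded strings (a minimal sketch with placeholder values):

import numpy as np

decoded = [b'DES0000+0209'.decode('ascii'), b'DES0001+0209'.decode('ascii')]
print(np.array(decoded).dtype)                # <U12: fixed-width unicode again
print(np.array(decoded, dtype=object).dtype)  # object: references to native Python str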

I'm not sure what the best place in the code is to do this. All the table uploads seem to go through the insert_data function, but the conversion in the pandas DataFrame would need to happen before that function is called, so it seems the fix would have to be made in multiple places.
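
One possible shape for a shared fix would be a small helper that decodes any bytes-valued columns of a DataFrame before the upload paths diverge. This is only a sketch; decode_bytes_columns is a hypothetical name, not an existing easyaccess function:

import pandas as pd

def decode_bytes_columns(df: pd.DataFrame, encoding='ascii') -> pd.DataFrame:
    # Decode bytes-valued object columns to native str; leave everything else untouched.
    for col in df.columns:
        if len(df) > 0 and df[col].dtype == object and isinstance(df[col].iloc[0], bytes):
            df[col] = df[col].str.decode(encoding)
    return df

Each upload entry point (e.g. pandas_to_db and load_table) could then call this once before handing off to insert_data.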

bechtol · Oct 22 '19