
invalid unicode in segments statistics

Open verajosemanuel opened this issue 3 years ago • 19 comments

I've been working in R with the DuckDB package with no issues so far, but while trying to load a big dataframe into the database I got an error I cannot understand:

Error: TransactionContext Error: Failed to commit: INTERNAL Error: Invalid unicode detected in segment statistics update!

My guess is that the dataframe is too wide (635 columns) or too long (500,000 rows).

Windows 10, DuckDB JDBC connection (0.24 driver), RStudio.

verajosemanuel avatar Apr 19 '21 15:04 verajosemanuel

Unrelated question, why are you using the JDBC driver with RStudio? We have a dedicated R package....

hannes avatar Apr 19 '21 19:04 hannes

I use a database client via JDBC, and the R package for code. I did not explain myself properly. Is this size related? I need to load all my data into one table. Maybe if I split it into different files...

verajosemanuel avatar Apr 20 '21 06:04 verajosemanuel

No, the issue should certainly not be size related. But it would help us a lot if you could try to create a minimal reproducible example that demonstrates the issue.

hannes avatar Apr 20 '21 07:04 hannes

Sadly, the data is protected. I assume any data file with more than 600 columns and 500,000 rows will give the same result. I will try to generate such a file.

verajosemanuel avatar Apr 21 '21 21:04 verajosemanuel

Great, thanks. Otherwise there is little we can do.

hannes avatar Apr 22 '21 07:04 hannes

Same issue here. What are the possible causes in this case? (Invalid unicode detected in segment statistics update!) Is it related to encoding? I may have latin1 characters in my data.

albersonmiranda avatar Nov 16 '21 20:11 albersonmiranda

What API are you using and how are you loading data into the database? Could you make a reproducible example of the problem? This error is always a bug as the system should convert all data to UTF8 when ingesting the data. Perhaps one of the ingestion methods does not convert or verify correctly.

Mytherin avatar Nov 16 '21 20:11 Mytherin

The R API. I just tried with a small subset (100 lines) and it worked. The whole thing is 4.2 GB so I'm not quite sure how to debug it. Will try some more sampling :|

albersonmiranda avatar Nov 16 '21 21:11 albersonmiranda

Alright, so I subset my data until I pinpointed an entry that contains this "SN\xaa" character. Is that a known bug?

albersonmiranda avatar Nov 17 '21 16:11 albersonmiranda

That is not a known bug. Could you please file a bug report? Thanks!

Mytherin avatar Nov 17 '21 16:11 Mytherin

@Mytherin, I'm having the same issue with the attached file "test.csv":

db_con <- DBI::dbConnect(duckdb::duckdb(), dbdir="./test.duckdb", read_only=FALSE)
data_from_csv <- data.table::fread("./test.csv", encoding = "UTF-8")
DBI::dbWriteTable(db_con, "test", data_from_csv, overwrite=TRUE) 

which throws this error message: Error in duckdb_execute(res) : duckdb_execute_R: Failed to run query Error: TransactionContext Error: Failed to commit: INTERNAL Error: Invalid unicode detected in segment statistics update!

I believe the problem is with the UTF-8 encoding, because when I import the .csv without the encoding option it works (however, the special characters from my language are lost).
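
Maybe a workaround would be to read the file as Latin-1 and convert the character columns to UTF-8 explicitly before writing? A rough, untested sketch (it assumes the file really is Latin-1/Windows-1252 encoded):

data_from_csv <- data.table::fread("./test.csv", encoding = "Latin-1")
char_cols <- names(data_from_csv)[sapply(data_from_csv, is.character)]
for (col in char_cols) {
  # convert each character column from Latin-1 bytes to UTF-8
  data.table::set(data_from_csv, j = col,
                  value = iconv(data_from_csv[[col]], from = "latin1", to = "UTF-8"))
}
DBI::dbWriteTable(db_con, "test", data_from_csv, overwrite = TRUE)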

GitHunter0 avatar Jan 10 '22 23:01 GitHunter0

I also encountered this while using connection.from_df(df).create('table_name') and CREATE TABLE on a large dataset with various country names. Unfortunately, reproducibility is tricky and I haven't found a good shareable data file, but it seems this happens more often with a low memory limit set via PRAGMA memory_limit (if I set it to 1GB, it always happens).

victorlin avatar Jan 11 '22 07:01 victorlin

@GitHunter0 Thanks for providing a reproducible example, I can confirm the issue at least in the latest CRAN release.

hannes avatar Jan 11 '22 07:01 hannes

You're welcome @hannesmuehleisen. Would you consider reopening this issue?

GitHunter0 avatar Jan 11 '22 13:01 GitHunter0

I've had this issue now, too. I used the R package and dbplyr. It was a join resulting in a biggish table (4 billion records, 2 columns). I was able to use tally() to find that out (as I wanted to check whether pulling it into R would be feasible), but compute() failed to create the result. That is of course exactly the range of records where DuckDB would be handy for an R user. disk.frame did the job.
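
Roughly, the shape of the pipeline was this (the database file, table, and column names below are placeholders, not my real ones):

library(dplyr)
library(dbplyr)

con <- DBI::dbConnect(duckdb::duckdb(), "big.duckdb")     # placeholder database file
joined <- tbl(con, "left_tbl") %>%
  inner_join(tbl(con, "right_tbl"), by = "key")           # placeholder table/column names
joined %>% tally()     # works: returns the ~4 billion row count
joined %>% compute()   # fails with the segment statistics error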

KnutJaegersberg avatar May 19 '22 05:05 KnutJaegersberg

I'm having the same issue using duckdb 0.5.0 and R 4.2.1 running on Windows 10. I'm using open-source files from CMS, downloadable at the CMS Medicare data site (click the "download" option; only one file is needed to reproduce, and the example uses 2020, the most recent file).

Each CSV file contains a table with ~1M rows; the example has 1,161,542 rows and 73 columns, with doctor information, drugs prescribed, and other details (each year is in a different file; I'm using only 2020 for illustration, but it occurred with earlier years too). I'm able to write the table only if I subset it to <200,000 rows. It also fails at a similar limit on other files. Clarification: the full data set reads into memory using readr::read_csv just fine; I only subset it to find where the error during the duckdb::dbWriteTable step occurs. I cannot identify any "hidden" unicode characters in the referenced field.

# reproducible example
cms_provider_files <- list.files(path = here::here("data_by_provider/csv_files_by_year/"), pattern = "*.csv", full.names = T)

con = dbConnect(duckdb::duckdb(), 
                dbdir=here::here("data_by_provider/db_byprovider/providerdb.duckdb"))
df_2020 <- readr::read_csv(cms_provider_files[8], n_max = 100000) # total rows = 1161542, cols = 73
df_2020 <- readr::read_csv(cms_provider_files[8], n_max = 150000) # dbWriteTable below works fine
df_2020 <- readr::read_csv(cms_provider_files[8], n_max = 200000) # dbWriteTable below ERRORs
df_2020 <- readr::read_csv(cms_provider_files[8], n_max = Inf)    # dbWriteTable below ERRORs
duckdb::dbWriteTable(con, "test_data", df_2020, overwrite = T) 
# Error: rapi_execute: Failed to run query
# Error: TransactionContext Error: Failed to commit: INTERNAL Error: Invalid unicode detected in segment statistics update!

Note: I'm using the readr package to parse the files because duckdb_read_csv errors as well. I'm not sure whether that's related, because it occurs on an earlier row (example below, with a possible workaround sketch after it). I thought the readr method would get around this, but not so.

duckdb_read_csv(con, "test_df2020", cms_provider_files[8])
# Error: rapi_execute: Failed to run query
# Error: Invalid Input Error: Could not convert string 'K1H 8' to INT32 between line 56321 and 57345 in column 12. Parser options: DELIMITER=',', QUOTE='"', ESCAPE='"' (default), HEADER=1, SAMPLE_SIZE=10240, IGNORE_ERRORS=0, ALL_VARCHAR=0 
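
A possible workaround for that type-detection error (a sketch only, and it does not address the unicode error) would be to let DuckDB scan the whole file before inferring column types, or to read every column as VARCHAR; SAMPLE_SIZE and ALL_VARCHAR are the parser options shown in the message above:

DBI::dbExecute(con, sprintf(
  "CREATE TABLE test_df2020 AS
   SELECT * FROM read_csv_auto('%s', SAMPLE_SIZE=-1)",
  cms_provider_files[8]
))
# or use read_csv_auto('...', ALL_VARCHAR=1) to defer all type conversion to later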

Session Info:

> sessionInfo()
R version 4.2.1 (2022-06-23 ucrt)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 10 x64 (build 19044)

Matrix products: default

locale:
[1] LC_COLLATE=English_United States.utf8  LC_CTYPE=English_United States.utf8    LC_MONETARY=English_United States.utf8
[4] LC_NUMERIC=C                           LC_TIME=English_United States.utf8    

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
 [1] dbplyr_2.2.1    duckdb_0.5.0    DBI_1.1.3       forcats_0.5.2   stringr_1.4.1   dplyr_1.0.9     purrr_0.3.4     readr_2.1.2     tidyr_1.2.0    
[10] tibble_3.1.8    ggplot2_3.3.6   tidyverse_1.3.2

JMLuther avatar Sep 07 '22 16:09 JMLuther

I can get around this error by parsing the data into memory via readr::read_csv with an explicit column specification (using col_types = list(...)), and then passing each result to duckdb::dbWriteTable(...).

I've been able to read the first 7 files of ~24M each using this method, but I still get an error with the latest 2020 file... (this is an additional file set at CMS.gov... which is my intended, more detailed and larger data set, rather than the first one I listed above). I believe the data is too large to read all files into memory at once.

I've used this code, which works except for the last file (2020), which throws the unicode error (the temp df is just file names and years):

walk2(temp$cms_provider_files, temp$year,
        ~{
          readr::read_csv(.x, n_max = Inf, col_types = cms_providerdb_spec) %>% 
            mutate(year = .y) %>% 
            duckdb::dbWriteTable(conn = con_provider, name =  paste0("cms_providerdb_", .y), value = .,  overwrite = T)
        })

I tried duckdb_read_csv() to read them all, but it does not let me pass the column specification via a colClasses argument, and I also want to add a column indicating the year.

The 2020 data file (MUP_DPR_RY22_P04_V10_DY20_NPIBN_0.csv) reads into memory without issue; the same error occurs when trying to append to duckdb somewhere between rows 20,500,000 and 20,600,000 (rows 20.6M to the end save OK).

df_puf2020 <- readr::read_csv(cms_provider_files[8], n_max = Inf, col_types = cms_providerdb_spec) 
nrow(df_puf2020)
# [1] 25209729
duckdb::dbWriteTable(conn = con_provider, name =  "cms_providerdb_2020", value = df_puf2020[1:20500000, ],  overwrite = T) # OK
duckdb::dbWriteTable(conn = con_provider, name =  "cms_providerdb_2020", value = df_puf2020[20600000:25209729, ],  overwrite = F, append = T) # OK
duckdb::dbWriteTable(conn = con_provider, name =  "cms_providerdb_2020", value = df_puf2020[20500001:20599999, ],  overwrite = F, append = T) # Error
duckdb::dbWriteTable(conn = con_provider, name =  "cms_providerdb_2020", value = df_puf2020[20500001:20550001, ],  overwrite = F, append = T) # OK

JMLuther avatar Sep 10 '22 18:09 JMLuther

I found the culprit in my data. The issue is clearly unrecognized characters in the data file. I identified the location of non-ASCII characters in the rows that throw the error (20,500,000 - 20,600,000) using the code below, and then fixed it. Once that single value was fixed, I was able to save to the duckdb database.

map(df_puf2020, ~which(!stringi::stri_enc_isascii(.x)))
# $Prscrbr_City
df_puf2020$Prscrbr_City[20565445] # "Mayag\xe3\x9cez"
df_puf2020$Prscrbr_City[20565445] <- "Mayaguez"

Interestingly, there were a number of non-ASCII characters in sections that did not cause an error and did not require "fixing" for the database save to succeed. Here they are, just in case it helps in some way:

# $Prscrbr_Last_Org_Name
bad_ascii_lastname <- c(22057728, 23970569, 23970570, 23970571, 23970572, 23970573)
df_puf2020$Prscrbr_Last_Org_Name[bad_ascii_lastname]
# [1] "Pe�a-Alvarado"   "Rivera-Monta�ez" "Rivera-Monta�ez" "Rivera-Monta�ez" "Rivera-Monta�ez" "Rivera-Monta�ez"

# $Prscrbr_First_Name
bad_ascii_firstname <- c(10254424, 10254425,  10254426,  10254427,  10254428,  10254429,  10254430,  10254431,  10254432,  10254433,  10254434,  
                         10254435,  10254436,  10254437,  10254438,  10254439,  10254440,  10254441,  10254442, 10254443,  10254444,  10254445,  
                         10254446,  10254447,  10254448,  10254449,  10254450,  10254451,  10254452,  10254453,  10254454,  10254455,  10254456, 
                         10254457,  10254458,  10254459,  10254460,  10254461)
df_puf2020$Prscrbr_First_Name[bad_ascii_firstname] <- "Concepcion" # "Concepci�n" 
# [1] "Concepci�n" "Concepci�n" "Concepci�n" "Concepci�n" "Concepci�n" "Concepci�n" "Concepci�n" "Concepci�n" "Concepci�n" "Concepci�n" "Concepci�n" "Concepci�n" "Concepci�n"
# [14] "Concepci�n" "Concepci�n" "Concepci�n" "Concepci�n" "Concepci�n" "Concepci�n" "Concepci�n" "Concepci�n" "Concepci�n" "Concepci�n" "Concepci�n" "Concepci�n" "Concepci�n"
# [27] "Concepci�n" "Concepci�n" "Concepci�n" "Concepci�n" "Concepci�n" "Concepci�n" "Concepci�n" "Concepci�n" "Concepci�n" "Concepci�n" "Concepci�n" "Concepci�n"

JMLuther avatar Sep 12 '22 02:09 JMLuther

It would seem we need to double-check the UTF-8 verification for R data frame writing.
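
For instance, a data frame whose strings are declared UTF-8 but actually contain invalid byte sequences might be a starting point for a reproduction (untested sketch, reusing the bytes from the report above):

library(DBI)

con <- dbConnect(duckdb::duckdb())
bad <- "Mayag\xe3\x9cez"        # the invalid UTF-8 byte sequence reported above
Encoding(bad) <- "UTF-8"        # declare it UTF-8 without converting the bytes
df <- data.frame(city = c("San Juan", bad), stringsAsFactors = FALSE)
dbWriteTable(con, "cities", df) # may trigger the segment statistics error if verification is skipped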

hannes avatar Sep 12 '22 04:09 hannes

Hi, I converted my JSON data to a Parquet file (using Java), and used a "CREATE TABLE ... AS SELECT * FROM read_parquet(...)" SQL statement to import the data into DuckDB, and hit the same exception. My data has 3 fields; one of them is a numeric type and contains null values.

Stongtong avatar Oct 13 '22 09:10 Stongtong

Could you share the Parquet file with us (e.g. send to my email at [email protected])?

Mytherin avatar Oct 13 '22 09:10 Mytherin

Sorry, the file data is confidential. Writing the Parquet file with Spark instead fixed this exception for me. Thanks.

Stongtong avatar Oct 17 '22 08:10 Stongtong

We've had reports of this problem from quite a few users of Splink (a Python library which uses DuckDB as a backend). Although sadly I don't have a minimal reproducible example (because, being on a Mac, I've never seen the error myself), a common theme seems to be that these errors occur on Windows, with data sourced from Microsoft SQL Server. I will do my best to try and obtain a reprex.

https://github.com/moj-analytical-services/splink/issues/664

RobinL avatar Oct 21 '22 19:10 RobinL

How is your data ingested into DuckDB? Are you inserting through Pandas DataFrames, CSV files, or inserting through prepared statements?

Mytherin avatar Oct 22 '22 06:10 Mytherin

How is your data ingested into DuckDB? Are you inserting through Pandas DataFrames, CSV files, or inserting through prepared statements?

Via Pandas dataframes

RobinL avatar Oct 22 '22 06:10 RobinL

It is likely the unicode ingestion verification there is missing or incomplete - we can have a look to see if we can create a reproducible example by intentionally creating pandas data frames with bogus unicode data.

Mytherin avatar Oct 22 '22 06:10 Mytherin

Thank you. I will also try to find time to see if I can recreate the problem. I've also asked another of our users whether they may be able to share their problematic data.

RobinL avatar Oct 22 '22 06:10 RobinL

FWIW, I've tried to reproduce the error on a Windows 11 machine by creating an input csv populated with a variety of values from the big list of naughty strings.

I've been unable to reproduce the error. I think the 'problem' (i.e. the reason it is not causing an error) is probably that my csv file, whilst containing various unicode characters, is correctly encoded in utf-8. But I've experimented by incorrectly saving unicode characters in Windows-1252 encoding as well, with no luck in reproducing the problem either.

My best guess at the moment is that this is something to do with Microsoft SQL Server, which (I believe) usually stores data in Windows-1252 (cp1252) encoding, and with how this data gets loaded into Pandas. But this is pure guesswork based on encoding errors I've run into in the past when interacting with MS SQL Server, and the fact that all of our users who've seen this error so far have sourced the data from MS SQL Server.

RobinL avatar Oct 22 '22 17:10 RobinL

Getting this error in Python when performing a join on two already existing duckdb tables. I have tried on both Windows 10 and on a Mac M1. The error happens whenever I try to join specific tables.

The original tables were read from csv files specifying utf-8, with no errors.

I will try to provide two reasonably sized files to reproduce, but it might be hard to narrow down, as both tables are rather large (>25 million rows) and, from reading the above, it seems the problem might be isolated to a single row somewhere?

---------------------------------------------------------------------------
InternalException                         Traceback (most recent call last)
Input In [12], in <cell line: 1>()
----> 1 con.execute("CREATE TABLE med_works_gender AS \
      2             SELECT med_works_authors.id as work_id, \
      3             med_works_authors.author_id, \
      4             authors_gender.ga_gender as author_gender \
      5             FROM med_works_authors \
      6             LEFT JOIN authors_gender \
      7             ON med_works_authors.author_id = authors_gender.id")
      9 con.execute("SELECT * FROM med_works_gender LIMIT 5").df()

InternalException: INTERNAL Error: Invalid unicode detected in segment statistics update!

joxang avatar Oct 24 '22 23:10 joxang

Sorry you are running into this!

Maybe you could narrow it down with a binary search using limit and offset?

Alex-Monahan avatar Oct 25 '22 01:10 Alex-Monahan