Invalid unicode in segment statistics
Working in R with the DuckDB package with no issues so far; while trying to load a big dataframe into the database I got an error I cannot understand:
Error: TransactionContext Error: Failed to commit: INTERNAL Error: Invalid unicode detected in segment statistics update!
My guess is the dataframe is too wide (635 columns) or too long (500,000 rows).
Windows 10, duckdb JDBC connection (0.24 driver), RStudio.
Unrelated question: why are you using the JDBC driver with RStudio? We have a dedicated R package....
I use a database client via JDBC and the R package for code. I did not explain myself properly. Is this size-related? I need to load all my data into one table. Maybe if I split it into different files....
No, the issue should certainly not be size related. But it would help us a lot if you could try to create a minimal reproducible example that demonstrates the issue.
Sadly, the data is protected. I assume any data file with more than 600 columns and 500,000 rows will give the same result. I will try to generate such a file.
Great, thanks. Otherwise there is little we can do.
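For what it's worth, a size-only test could be sketched roughly as follows (random ASCII values, so it can only confirm or rule out the width/row-count hypothesis, not an encoding-related bug, and it is memory-hungry at this scale):
# Sketch: build a synthetic 600-column, 500,000-row frame of random ASCII strings
# and write it, purely to test whether size alone triggers the error.
library(DBI)
set.seed(1)
cols <- replicate(600, sample(letters, 5e5, replace = TRUE), simplify = FALSE)
wide_df <- as.data.frame(setNames(cols, paste0("col", seq_along(cols))))
con <- dbConnect(duckdb::duckdb(), dbdir = "size_test.duckdb")
dbWriteTable(con, "wide_test", wide_df, overwrite = TRUE)
dbDisconnect(con, shutdown = TRUE)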
Same issue here. What are the possible causes in this case ("Invalid unicode detected in segment statistics update!")? Is it related to encoding? I may have latin1 chars in my data.
What API are you using and how are you loading data into the database? Could you make a reproducible example of the problem? This error is always a bug as the system should convert all data to UTF8 when ingesting the data. Perhaps one of the ingestion methods does not convert or verify correctly.
R API. Just tried with a small subset (100 lines) and it worked. The whole thing is 4.2gb so I'm not quite sure how to debug it. Will try some more sampling :|
Alright, so I just subset my data until I pinpointed an entry containing this "SN\xaa" char. Is that a known bug?
That is not a known bug. Could you please file a bug report? Thanks!
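As an aside, rather than subsetting by hand, entries that are not valid UTF-8 can be located programmatically. A minimal sketch, assuming the data frame is called my_df:
# Sketch: for each character column, list the positions that fail UTF-8 validation.
bad_utf8 <- lapply(
  Filter(is.character, my_df),
  function(col) which(!validUTF8(col))
)
bad_utf8[lengths(bad_utf8) > 0]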
@Mytherin, I'm having the same issue with the attached file test.csv:
db_con <- DBI::dbConnect(duckdb::duckdb(), dbdir="./test.duckdb", read_only=FALSE)
data_from_csv <- data.table::fread("./test.csv", encoding = "UTF-8")
DBI::dbWriteTable(db_con, "test", data_from_csv, overwrite=TRUE)
which throws this error message: Error in duckdb_execute(res) : duckdb_execute_R: Failed to run query Error: TransactionContext Error: Failed to commit: INTERNAL Error: Invalid unicode detected in segment statistics update!
I believe the problem is with the UTF-8 encoding, because when I import the .csv without the encoding option, it works (however, the special characters from my language are lost).
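If the file is in fact latin1-encoded, one possible workaround is to convert the character columns to UTF-8 yourself before writing. A rough sketch (that the source encoding is latin1 is an assumption; adjust as needed):
# Sketch: read without declaring the encoding, then convert character columns to UTF-8.
data_from_csv <- data.table::fread("./test.csv")
chr_cols <- names(data_from_csv)[sapply(data_from_csv, is.character)]
data_from_csv[, (chr_cols) := lapply(.SD, iconv, from = "latin1", to = "UTF-8"),
              .SDcols = chr_cols]
DBI::dbWriteTable(db_con, "test", data_from_csv, overwrite = TRUE)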
I also encountered this while using connection.from_df(df).create('table_name') and CREATE TABLE on a large dataset with various country names. Unfortunately, reproducibility is tricky and I haven't found a good shareable data file, but it seems like this happens more often with a low memory limit set by PRAGMA memory_limit (if I set it to 1GB it always happens).
@GitHunter0 Thanks for providing a reproducible example, I can confirm the issue at least in the latest CRAN release.
You're welcome @hannesmuehleisen. Would you consider reopening this issue?
I've had this issue now, too. I used the R package and dbplyr. It was a join resulting in a biggish table (4 billion records, 2 columns). I was able to use tally to find that out (as I wanted to check whether pulling it into R was feasible), but not compute to create the result. That's of course the range of record counts where duckdb would be handy for an R user. Disk.frame did the job.
I'm having the same issue using duckdb 0.5.0 and R 4.2.1, running on Windows 10.
I'm using open source files from CMS, downloadable at the CMS Medicare data site (click the "download" option; only one file is needed to reproduce, and the example uses 2020, the most recent file).
Each CSV file contains a table with ~1M rows; the example has 1,161,542 rows and 73 columns, with doctor information, prescribed drugs and other details (each year is in a different file; I'm using only 2020 for illustration, but it occurred with earlier years too).
I'm able to write the table only if it is subset to <200,000 rows. It also fails at a similar limit on other files.
Clarification: the full data set reads into memory using readr::read_csv just fine, but I subset it to find where the error occurs during the duckdb::dbWriteTable step.
I cannot identify any "hidden" unicode characters in the referenced field.
# reproducible example
cms_provider_files <- list.files(path = here::here("data_by_provider/csv_files_by_year/"), pattern = "*.csv", full.names = T)
con = dbConnect(duckdb::duckdb(),
dbdir=here::here("data_by_provider/db_byprovider/providerdb.duckdb"))
df_2020 <- readr::read_csv(cms_provider_files[8], n_max = 100000) # works fine; total rows = 1161542, cols = 73
df_2020 <- readr::read_csv(cms_provider_files[8], n_max = 150000) # works fine
df_2020 <- readr::read_csv(cms_provider_files[8], n_max = 200000) # ERROR
df_2020 <- readr::read_csv(cms_provider_files[8], n_max = Inf) # ERROR
duckdb::dbWriteTable(con, "test_data", df_2020, overwrite = T)
# Error: rapi_execute: Failed to run query
# Error: TransactionContext Error: Failed to commit: INTERNAL Error: Invalid unicode detected in segment statistics update!
Note: I'm using the readr package to parse the files because duckdb_read_csv errors as well. Not sure if it is related, because it occurs on an earlier row (example below). I thought the readr method would get around this, but not so.
duckdb_read_csv(con, "test_df2020", cms_provider_files[8])
# Error: rapi_execute: Failed to run query
# Error: Invalid Input Error: Could not convert string 'K1H 8' to INT32 between line 56321 and 57345 in column 12. Parser options: DELIMITER=',', QUOTE='"', ESCAPE='"' (default), HEADER=1, SAMPLE_SIZE=10240, IGNORE_ERRORS=0, ALL_VARCHAR=0
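One possible way around that type-sniffing failure (a sketch only: SAMPLE_SIZE=-1 asks DuckDB to scan the whole file before deciding column types, and ALL_VARCHAR=1 would skip type detection entirely) is to load the file via SQL:
# Sketch: let DuckDB infer column types from the entire file rather than the default sample.
DBI::dbExecute(con, sprintf(
  "CREATE TABLE test_df2020 AS SELECT * FROM read_csv_auto('%s', SAMPLE_SIZE=-1)",
  cms_provider_files[8]
))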
Session Info:
> sessionInfo()
R version 4.2.1 (2022-06-23 ucrt)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 10 x64 (build 19044)
Matrix products: default
locale:
[1] LC_COLLATE=English_United States.utf8 LC_CTYPE=English_United States.utf8 LC_MONETARY=English_United States.utf8
[4] LC_NUMERIC=C LC_TIME=English_United States.utf8
attached base packages:
[1] stats graphics grDevices utils datasets methods base
other attached packages:
[1] dbplyr_2.2.1 duckdb_0.5.0 DBI_1.1.3 forcats_0.5.2 stringr_1.4.1 dplyr_1.0.9 purrr_0.3.4 readr_2.1.2 tidyr_1.2.0
[10] tibble_3.1.8 ggplot2_3.3.6 tidyverse_1.3.2
I can get around this error by parsing the data into memory via readr::read_csv with an explicit column format (using col_types = list(...)), and then passing each result to duckdb::dbWriteTable(...).
I've been able to read the first 7 files of 24M each using this method, but still get an error with the latest 2020 file... (additional file set at CMS.gov...which is my intended, more detailed/larger data set, rather than the first one I listed above). I believe the data is too large to read all files into memory at once.
I've used this code, which works except for the last file (2020), which throws the unicode error (the temp df is just file names and years):
walk2(temp$cms_provider_files, temp$year,
~{
readr::read_csv(.x, n_max = Inf, col_types = cms_providerdb_spec) %>%
mutate(year = .y) %>%
duckdb::dbWriteTable(conn = con_provider, name = paste0("cms_providerdb_", .y), value = ., overwrite = T)
})
I tried duckdb_read_csv() to read them all, but it does not let me pass the column specification via the colClasses argument, and I also want to add a column indicating the year.
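Both limitations could perhaps be side-stepped by letting DuckDB read each CSV itself and adding the year in SQL. A rough sketch, assuming SAMPLE_SIZE=-1 also resolves the type-detection issue mentioned earlier:
# Sketch: ingest each yearly file via DuckDB's CSV reader and tag it with its year.
purrr::walk2(temp$cms_provider_files, temp$year, ~{
  DBI::dbExecute(con_provider, sprintf(
    "CREATE TABLE cms_providerdb_%s AS
     SELECT *, %s AS year FROM read_csv_auto('%s', SAMPLE_SIZE=-1)",
    .y, .y, .x
  ))
})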
The 2020 data file (MUP_DPR_RY22_P04_V10_DY20_NPIBN_0.csv) reads into memory without issue; the same error occurs when trying to append to duckdb somewhere between rows 20,500,000 and 20,600,000 (rows 20.6M through the end save OK).
df_puf2020 <- readr::read_csv(cms_provider_files[8], n_max = Inf, col_types = cms_providerdb_spec)
nrow(df_puf2020)
# [1] 25209729
duckdb::dbWriteTable(conn = con_provider, name = "cms_providerdb_2020", value = df_puf2020[1:20500000, ], overwrite = T) # OK
duckdb::dbWriteTable(conn = con_provider, name = "cms_providerdb_2020", value = df_puf2020[20600000:25209729, ], overwrite = F, append = T) # OK
duckdb::dbWriteTable(conn = con_provider, name = "cms_providerdb_2020", value = df_puf2020[20500001:20599999, ], overwrite = F, append = T) # Error
duckdb::dbWriteTable(conn = con_provider, name = "cms_providerdb_2020", value = df_puf2020[20500001:20550001, ], overwrite = F, append = T) # OK
I found the culprit in my data. The issue is clearly with non-recognized characters in the data file. I identified the locations of non-ASCII characters in the rows that throw the error (20,500,000 to 20,600,000) using the code below, and then fixed the offending value. Once that single value was fixed, I was able to save to the duckdb database.
map(df_puf2020, ~which(!stringi::stri_enc_isascii(.x))) # positions of non-ASCII entries, per column
# $Prscrbr_City
df_puf2020$Prscrbr_City[20565445] # "Mayag\xe3\x9cez"
df_puf2020$Prscrbr_City[20565445] <- "Mayaguez"
Interestingly, there were a number of non-ASCII characters in sections that did not cause an error and did not require "fixing" for the database to save. Here they are, just in case it helps in some way:
# $Prscrbr_Last_Org_Name
bad_ascii_lastname <- c(22057728, 23970569, 23970570, 23970571, 23970572, 23970573)
df_puf2020$Prscrbr_Last_Org_Name[bad_ascii_lastname]
# [1] "Pe�a-Alvarado" "Rivera-Monta�ez" "Rivera-Monta�ez" "Rivera-Monta�ez" "Rivera-Monta�ez" "Rivera-Monta�ez"
# $Prscrbr_First_Name
bad_ascii_firstname <- c(10254424, 10254425, 10254426, 10254427, 10254428, 10254429, 10254430, 10254431, 10254432, 10254433, 10254434,
10254435, 10254436, 10254437, 10254438, 10254439, 10254440, 10254441, 10254442, 10254443, 10254444, 10254445,
10254446, 10254447, 10254448, 10254449, 10254450, 10254451, 10254452, 10254453, 10254454, 10254455, 10254456,
10254457, 10254458, 10254459, 10254460, 10254461)
df_puf2020$Prscrbr_First_Name[bad_ascii_firstname] <- "Concepcion" # "Concepci�n"
# [1] "Concepci�n" "Concepci�n" "Concepci�n" "Concepci�n" "Concepci�n" "Concepci�n" "Concepci�n" "Concepci�n" "Concepci�n" "Concepci�n" "Concepci�n" "Concepci�n" "Concepci�n"
# [14] "Concepci�n" "Concepci�n" "Concepci�n" "Concepci�n" "Concepci�n" "Concepci�n" "Concepci�n" "Concepci�n" "Concepci�n" "Concepci�n" "Concepci�n" "Concepci�n" "Concepci�n"
# [27] "Concepci�n" "Concepci�n" "Concepci�n" "Concepci�n" "Concepci�n" "Concepci�n" "Concepci�n" "Concepci�n" "Concepci�n" "Concepci�n" "Concepci�n" "Concepci�n"
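A more general pre-write clean-up might look like the sketch below (sub = "byte" replaces invalid sequences with an escaped hex form rather than guessing the intended character, so the affected values should still be reviewed afterwards):
# Sketch: force every character column to valid UTF-8 before dbWriteTable(),
# substituting invalid byte sequences instead of erroring.
fix_utf8 <- function(df) {
  chr <- vapply(df, is.character, logical(1))
  df[chr] <- lapply(df[chr], iconv, from = "UTF-8", to = "UTF-8", sub = "byte")
  df
}
duckdb::dbWriteTable(con_provider, "cms_providerdb_2020", fix_utf8(df_puf2020), overwrite = TRUE)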
It would seem we need to double-check the UTF verification for R data frame writing.
Hi, I converted my JSON data to a Parquet file (with Java) and used "CREATE TABLE ... AS SELECT * FROM read_parquet()" SQL to import the data into DuckDB, and hit the same exception. My data has 3 fields; one of them is a number type and has null values.
Could you share the Parquet file with us (e.g. send it to my email at [email protected])?
Sorry, the data is confidential. Writing the Parquet file with Spark fixed this exception for me. Thanks.
We've had reports of this problem from quite a few users of Splink (a Python library which uses DuckDB as a backend). Although sadly I haven't got a minimal reproducible example (because, being on Mac, I've never seen the error myself), a common theme seems to be that these errors occur on Windows, with data sourced from Microsoft SQL Server. I will do my best to try and obtain a reprex.
https://github.com/moj-analytical-services/splink/issues/664
How is your data ingested into DuckDB? Are you inserting through Pandas DataFrames, CSV files, or inserting through prepared statements?
Via Pandas dataframes
It is likely the unicode ingestion verification there is missing or incomplete - we can have a look to see if we can create a reproducible example by intentionally creating pandas data frames with bogus unicode data.
Thank you. I will also try to find time to see if I can recreate the problem. I've also asked another of our users whether they may be able to share their problematic data.
FWIW, I've tried to reproduce the error on a Windows 11 machine by creating an input csv populated with a variety of values from the big list of naughty strings.
I've been unable to reproduce the error. I think the 'problem' (i.e. the reason it is not causing an error) is probably that my csv file, whilst containing various unicode characters, is correctly encoded in utf-8. But I've experimented by incorrectly saving unicode characters in Windows-1252 encoding as well, with no luck in reproducing the problem either.
My best guess at the moment is that this is something to do with Microsoft SQL Server, which (I believe), usually stores data in Windows-1252 (cp1252) encoding, and how this data gets loaded into Pandas. But this is pure guesswork based on encoding errors I've run into in the past interacting with MS SQL Server, and the fact that all of our users who've seen this error so far have sourced the data from MS SQL Server.
Getting this error in Python when performing a join on two already existing duckdb tables. Have tried in both windows 10, and on a Mac M1. The error happens whenever trying to join specific tables.
The original tables were read from csv files specifying utf-8, with no errors.
Will try and provide two reasonably sized files to reproduce, but it might be hard to narrow down as both tables are rather large (>25 million rows) and, from reading the previous comments, it seems that the problem might be isolated to a single row somewhere?
---------------------------------------------------------------------------
InternalException Traceback (most recent call last)
Input In [12], in <cell line: 1>()
----> 1 con.execute("CREATE TABLE med_works_gender AS \
2 SELECT med_works_authors.id as work_id, \
3 med_works_authors.author_id, \
4 authors_gender.ga_gender as author_gender \
5 FROM med_works_authors \
6 LEFT JOIN authors_gender \
7 ON med_works_authors.author_id = authors_gender.id")
9 con.execute("SELECT * FROM med_works_gender LIMIT 5").df()
InternalException: INTERNAL Error: Invalid unicode detected in segment statistics update!
Sorry you are running into this!
Maybe you could narrow it down with a binary search using limit and offset?
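Something along these lines, for example (sketched here with the R client, though the same SQL applies from Python; the source table, connection object, and split points are assumptions):
# Sketch: copy successively smaller LIMIT/OFFSET windows of a source table until
# the offending row range is isolated (no ORDER BY, so row order is whatever the scan yields).
probe <- function(offset, limit) {
  DBI::dbExecute(con, "DROP TABLE IF EXISTS probe")
  DBI::dbExecute(con, sprintf(
    "CREATE TABLE probe AS SELECT * FROM med_works_authors LIMIT %d OFFSET %d",
    limit, offset
  ))
}
probe(0, 12500000)         # first half of ~25M rows: does copying it succeed?
probe(12500000, 12500000)  # second half: if this fails, keep halving this window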