
Issue on adding enums via `tiledb_array_schema()`

Open cgiachalis opened this issue 2 months ago • 0 comments

The help page of `tiledb_array_schema()` is not very informative on how to map a list of enumerations (factors) to enum attributes.

One might assume that passing a named list of enums, where each element corresponds to an enum attribute, should work, but that's not always true; and even when it does work, the mapping can be incorrect.

The safest option is to map all attributes (setting NULL for non-enums and matching the attributes' position order). This is not ideal with many attributes.
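The "map all attributes" workaround can be automated with a few lines of base R. The sketch below uses a hypothetical helper, `pad_enum_list()` (not part of the tiledb package), which expands a partial named enum list into a full list that follows the attribute order and holds NULL for non-enum attributes. It assumes the attribute names are already known as a character vector.

```r
# Hypothetical helper, not part of the tiledb package: expand a partial,
# named enum list into a full list aligned with the attribute order.
pad_enum_list <- function(attr_names, enums) {
  stopifnot(is.character(attr_names), is.list(enums), !is.null(names(enums)))

  unknown <- setdiff(names(enums), attr_names)
  if (length(unknown) > 0) {
    stop("enums refer to unknown attributes: ", paste(unknown, collapse = ", "))
  }

  # vector("list", n) yields n NULL entries; `[<-` fills slots by name
  # without dropping the remaining NULL placeholders.
  out <- setNames(vector("list", length(attr_names)), attr_names)
  out[names(enums)] <- enums
  out
}

attr_names <- c("col1", "enum1", "col2", "enum2", "enum3")

# The enum list may be partial and in any order.
enums <- list(enum3 = c("aa"), enum1 = c("A", "B"), enum2 = c("yes", "no"))

full <- pad_enum_list(attr_names, enums)
str(full)
```

The result has one entry per attribute, in attribute order, so it has the same shape as the list that works in the example further down.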

Note that `fromDataframe()` maps all attributes, so it has no problem mapping the right attributes as enums.

In the example that follows, creating the schema with enums doesn't throw any error, but the enum mappings end up wrong.

Source schema helper

test_schema_creation <- function(enum_list) {

sch <- tiledb_array_schema(
  domain = tiledb_domain(c(
    tiledb_dim(
      name = "id",
      domain = c(NULL, NULL),
      tile = NULL,
      type = "ASCII",
      filter_list = tiledb_filter_list(c(
        tiledb_filter_set_option(tiledb_filter("ZSTD"), "COMPRESSION_LEVEL", -1)
      ))
    )
  )),
  attrs = c(

    tiledb_attr(
      name = "col1",
      type = "INT32",
      ncells = 1,
      nullable = FALSE,
      filter_list = tiledb_filter_list(c(
        tiledb_filter_set_option(tiledb_filter("ZSTD"), "COMPRESSION_LEVEL", -1)
      ))
      ),
    tiledb_attr(
      name = "enum1",
      type = "INT32",
      ncells = 1,
      nullable = FALSE,
      filter_list = tiledb_filter_list(c(
        tiledb_filter_set_option(tiledb_filter("ZSTD"), "COMPRESSION_LEVEL", -1)
      )),
      enumeration = TRUE
    ),
    tiledb_attr(
      name = "col2",
      type = "INT32",
      ncells = 1,
      nullable = FALSE,
      filter_list = tiledb_filter_list(c(
        tiledb_filter_set_option(tiledb_filter("ZSTD"), "COMPRESSION_LEVEL", -1)
      ))
    ),
    tiledb_attr(
      name = "enum2",
      type = "INT32",
      ncells = 1,
      nullable = FALSE,
      filter_list = tiledb_filter_list(c(
        tiledb_filter_set_option(tiledb_filter("ZSTD"), "COMPRESSION_LEVEL", -1)
      )),
      enumeration = TRUE
    ),
    tiledb_attr(
      name = "enum3",
      type = "INT32",
      ncells = 1,
      nullable = FALSE,
      filter_list = tiledb_filter_list(c(
        tiledb_filter_set_option(tiledb_filter("ZSTD"), "COMPRESSION_LEVEL", -1)
      )),
      enumeration = TRUE
    )
  ),
  cell_order = "COL_MAJOR",
  tile_order = "COL_MAJOR",
  capacity = 10000,
  sparse = TRUE,
  allows_dups = FALSE,
  coords_filter_list = tiledb_filter_list(c(
    tiledb_filter_set_option(tiledb_filter("ZSTD"), "COMPRESSION_LEVEL", -1)
  )),
  offsets_filter_list = tiledb_filter_list(c(
    tiledb_filter_set_option(tiledb_filter("ZSTD"), "COMPRESSION_LEVEL", -1)
  )),
  validity_filter_list = tiledb_filter_list(c(
    tiledb_filter_set_option(tiledb_filter("RLE"), "COMPRESSION_LEVEL", -1)
  )),
  enumerations = enum_list
)
sch
}

library(tiledb) # TileDB R 0.33.1 

# `sch` defines 3 enum attributes: `enum1`, `enum2` and  `enum3` but `tiledb_array_create()` will set all 
#  attributes as enum (case 1).

# case 1: map enum attributes only [NOT OK]
uri <- tempfile()

enums <- list(
    enum1 =  c("A", "B"),
    enum2 = c("yes", "no"),
    enum3 = c("aa")
  )
  
sch <- test_schema_creation(enums)
tiledb_array_create(uri, sch)
arr <- tiledb_array(uri)

tiledb_array_has_enumeration(arr)
# col1 enum1  col2 enum2 enum3 
# TRUE  TRUE  TRUE  TRUE  TRUE 
 

# case 2: map all attributes with exact order [ OK ]
uri <- tempfile()

enums <- list(
    col1 = NULL,
    enum1 =  c("A", "B"),
    col2 = NULL,
    enum2 = c("yes", "no"),
    enum3 = c("aa")
)
  
sch <- test_schema_creation(enums)
tiledb_array_create(uri, sch)
arr <- tiledb_array(uri)

tiledb_array_has_enumeration(arr)
#  col1 enum1  col2 enum2 enum3 
# FALSE  TRUE FALSE  TRUE  TRUE 

# case 3: map all attributes, but out of order [NOT OK]
uri <- tempfile()
enums <- list(
    enum1 =  c("A", "B"),
    col1 = NULL, 
    col2 = NULL,
    enum2 = c("yes", "no"),
    enum3 = c("aa"))
  
sch <- test_schema_creation(enums)
tiledb_array_create(uri, sch)
arr <- tiledb_array(uri)

tiledb_array_has_enumeration(arr)
# col1 enum1  col2 enum2 enum3 
# TRUE  TRUE FALSE  TRUE  TRUE 

The problem is at the C++ level, within `libtiledb_array_schema`, which accepts an enum list of any length but whose mapping loop pairs entries with attributes by position, as if the list covered all attributes. In the first example the enum list has length 3, so the mapping is applied to the first 3 attributes: `col1`, `enum1` and `col2`. Surprisingly, `enum2` and `enum3` are also set as enums although the C++ loop never reaches them, so the mapping for those must happen somewhere else.

https://github.com/TileDB-Inc/TileDB-R/blob/0cb76896c03fbeffa4f057d3ad47140ff0253e3e/src/libtiledb.cpp#L1913-L1930
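To make the positional pairing described above concrete, here is a toy model in plain R. It does not call TileDB and is not the actual C++ code; `positional_targets()` is a hypothetical function that only mimics "entry i of the enum list is applied to attribute i, unless the entry is NULL".

```r
# Toy model of positional pairing (an assumption based on the description
# above, not the real C++ loop): enum entry i is applied to attribute i.
positional_targets <- function(attr_names, enums) {
  hit <- character(0)
  for (i in seq_along(enums)) {
    if (!is.null(enums[[i]]) && i <= length(attr_names)) {
      hit <- c(hit, attr_names[i])
    }
  }
  hit
}

attr_names <- c("col1", "enum1", "col2", "enum2", "enum3")

# case 1: enum attributes only
case1 <- list(enum1 = c("A", "B"), enum2 = c("yes", "no"), enum3 = c("aa"))
case1_hits <- positional_targets(attr_names, case1)

# case 3: all attributes, out of order
case3 <- list(enum1 = c("A", "B"), col1 = NULL, col2 = NULL,
              enum2 = c("yes", "no"), enum3 = c("aa"))
case3_hits <- positional_targets(attr_names, case3)

print(case1_hits)
print(case3_hits)
```

Under this model, case 1 touches `col1`, `enum1` and `col2`, and case 3 touches `col1`, `enum2` and `enum3`. That would explain why `col1` is wrongly reported as an enum in both cases; the attributes that are reported as enums without being touched here are the ones whose mapping seems to come from elsewhere.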


One solution that fixes the issue could be:

  1. at the R level, add names to the attribute list
  2. at the C++ level, match the right attribute using the enum name.

I will follow up with a PR for discussion.
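The idea behind the two steps can be sketched in R, independently of how the C++ side would implement it. `named_targets()` is a hypothetical function for illustration only: it drops NULL entries and then looks up each remaining enum by name, so neither the length nor the order of the enum list matters.

```r
# Illustration only (hypothetical function, not the proposed patch):
# resolve each non-NULL enum entry to its attribute by name.
named_targets <- function(attr_names, enums) {
  keep <- !vapply(enums, is.null, logical(1))
  enums <- enums[keep]

  idx <- match(names(enums), attr_names)
  if (anyNA(idx)) {
    stop("no attribute named: ", paste(names(enums)[is.na(idx)], collapse = ", "))
  }
  # Report the matched attributes in schema order.
  attr_names[sort(idx)]
}

attr_names <- c("col1", "enum1", "col2", "enum2", "enum3")

partial   <- list(enum1 = c("A", "B"), enum2 = c("yes", "no"), enum3 = c("aa"))
unordered <- list(enum1 = c("A", "B"), col1 = NULL, col2 = NULL,
                  enum2 = c("yes", "no"), enum3 = c("aa"))

partial_hits   <- named_targets(attr_names, partial)
unordered_hits <- named_targets(attr_names, unordered)

print(partial_hits)
print(unordered_hits)
```

Both inputs resolve to `enum1`, `enum2` and `enum3`, which is the mapping the three cases in the example were meant to produce.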

cgiachalis, Oct 14 '25 08:10