
`nonexistent datasets` check in `read_all`

Open stress-tess opened this issue 2 years ago • 4 comments

Capturing conversation in #1146:

The `else` block in `read_all` builds `nonexistent` from any names in `datasets` that can't be found and raises a `ValueError`: https://github.com/Bears-R-Us/arkouda/blob/883d7010ebd6464751c6e948f2914c48125d7fda/arkouda/pdarrayIO.py#L284-L294

It looks like L289-290 should never run, because the check there is identical to the one at L286-287.

@bmcdonald3 noted (paraphrased):

I think that 289-290 will never run in the current state, so that can likely be removed. `nonexistent` checks whether the requested datasets exist in the file. For example, say you had a file with the schema {"col1": int_col, "col2": str_col} and you call ak.read_all("my-file", ['col1', 'bad-name']); then nonexistent would be {'bad-name'}, because it is the set of the input datasets minus the ones that actually exist in the file. It hits the else case because the input is not just a string but a list of datasets. I think that check should be moved outside of the else and should always run over the datasets to make sure no bad dataset names slip by.

As Ben suggested, I think read_all should be updated to something like

    if datasets is None:
        datasets = get_datasets_allow_errors(filenames) if allow_errors else get_datasets(filenames[0])
    if isinstance(datasets, str):
        datasets = [datasets]

    # This check now runs unconditionally, not just in the else branch
    nonexistent = set(datasets) - \
        (set(get_datasets_allow_errors(filenames)) if allow_errors else set(get_datasets(filenames[0])))
    if len(nonexistent) > 0:
        raise ValueError("Dataset(s) not found: {}".format(nonexistent))
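To make the intent concrete, here is a minimal standalone sketch of that check. `get_datasets` is a stand-in for the real arkouda helper (the `allow_errors` branch is omitted), and the file contents and dataset names are hypothetical:

```python
def get_datasets(filename):
    # Stand-in for arkouda's helper: pretend the file holds these datasets.
    return ["col1", "col2"]

def read_all_check(filenames, datasets=None):
    if datasets is None:
        datasets = get_datasets(filenames[0])
    if isinstance(datasets, str):
        datasets = [datasets]
    # The set difference runs unconditionally, so a bad name is caught
    # whether the caller passed a string, a list, or nothing at all.
    nonexistent = set(datasets) - set(get_datasets(filenames[0]))
    if len(nonexistent) > 0:
        raise ValueError("Dataset(s) not found: {}".format(nonexistent))
    return datasets

read_all_check(["my-file"], "col1")                   # ok: returns ["col1"]
# read_all_check(["my-file"], ["col1", "bad-name"])   # would raise ValueError
```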

Feedback from @reuster986 would be great since (I believe) he is the original author of much of the IO code and can hopefully let us know if we're missing something here

Note: if this change is made, io_test.py would likely need to be updated as bad dataset names will be caught sooner and will result in a ValueError instead of a RuntimeError

Originally posted by @pierce314159 in https://github.com/Bears-R-Us/arkouda/pull/1146#discussion_r816151850

stress-tess avatar Feb 28 '22 22:02 stress-tess

I agree with the consensus here -- removing the redundant check in 289-290 and moving the "nonexistent" check outside the if/else so it always runs. I'm not sure why I wrote it that way to begin with, it's been so long... I was probably just coding fast and made a mistake. Good catch @pierce314159 and @bmcdonald3 !

reuster986 avatar Mar 01 '22 13:03 reuster986

Noting that this code has moved to pdarrayIO.py:read() https://github.com/Bears-R-Us/arkouda/blob/acc1a44141a46afba4cb9962477a159429a5725b/arkouda/pdarrayIO.py#L125-L137

Ethan-DeBandi99 avatar Apr 19 '22 18:04 Ethan-DeBandi99

Upon making the update and running some tests, I believe the check was placed inside the `if` block because dataset names are allowed to reach into a GROUP for a sub-dataset.

For example: saving a Strings array with dataset name `strings` creates a GROUP containing two DATASETs, `values` and `segments`. These can be accessed by passing `strings/values` or `strings/segments` to the read method. Such names fail the existence check, because it only compares against the top-level names in the file, where `strings` appears but `strings/values` and `strings/segments` do not.

My initial thought is to split the passed names on `/` and ensure the first portion exists. I don't know if we have a way to look deeper and check the existence of all parts.
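The split-on-`/` idea could be sketched as follows; this is a hypothetical illustration, with `top_level` standing in for the names actually read from the file:

```python
# Validate only the first path component against the file's top-level names.
def first_components(datasets):
    return {name.split("/")[0] for name in datasets}

top_level = {"strings", "_arkouda_metadata"}   # assumed file contents
requested = ["strings/values", "strings/segments", "bad-name"]

nonexistent = first_components(requested) - top_level
print(nonexistent)  # {'bad-name'}
```

Note that `strings/values` passes because its first component `strings` exists, even though nothing below the GROUP level is verified.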

Does anyone have a preference on how this is handled? @pierce314159 @Ethan-DeBandi99 @reuster986 @bmcdonald3

For reference, this is a truncated version of what the file looks like in HDF5:

GROUP "/" {
   GROUP "_arkouda_metadata" {
      .... Contents
   }
   GROUP "strings" {
      DATASET "segments" {
         DATA {
         (0): 0, 16, 32, 48, 64, 80, 96, 112, 128, 144, 160, 177, 194, 211,
         (14): 228, 245, 262, 279, 296, 313, 330, 347, 364, 381, 398
         }
      }
      DATASET "values" {
         DATATYPE  H5T_STD_U8LE
         DATASPACE  SIMPLE { ( 415 ) / ( 415 ) }
         DATA {
         (0): 116, 101, 115, 116, 105, 110, 103, 32, 115, 116, 114, 105, 110,
         (13): 103, 48, 0, 116, 101, 115, 116, 105, 110, 103, 32, 115, 116,
         }
      }
   }
}
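Looking deeper than the first component amounts to walking the hierarchy one path segment at a time. A hedged sketch, using a nested dict to stand in for the HDF5 structure shown above (with h5py on the client side, `"strings/values" in f` performs a similar nested lookup, but arkouda's check runs against names reported by the server):

```python
# A nested dict standing in for the HDF5 hierarchy; None marks a DATASET.
schema = {
    "_arkouda_metadata": {},
    "strings": {"segments": None, "values": None},
}

def path_exists(tree, path):
    # Walk every "/"-separated component, not just the first.
    node = tree
    for part in path.split("/"):
        if not isinstance(node, dict) or part not in node:
            return False
        node = node[part]
    return True

print(path_exists(schema, "strings/values"))  # True
print(path_exists(schema, "strings/typo"))    # False
```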

joshmarshall1 avatar May 03 '22 13:05 joshmarshall1

@joshmarshall1 - I agree with you. We will need to split the names apart to read them, and there are a lot of things that need to be updated within the HDF5 workflows. Based on how things are used today, I think removing the existence check here is fine, because we aren't doing checks of the kind you mentioned anywhere currently. I don't disagree that we should be able to; that is part of the updates I am working toward to make HDF5 more full featured.

Ethan-DeBandi99 avatar Jul 06 '22 17:07 Ethan-DeBandi99