Raise specific errors (and error_code) instead of UnexpectedError

Open severo opened this issue 1 year ago • 23 comments

The following query on the production database gives the number of datasets with at least one cache entry with error_code "UnexpectedError", grouped by the underlying "cause_exception".

For the most common ones (DatasetGenerationError, HfHubHTTPError, OSError, etc.), we would benefit from raising a specific error with its own error code. This would allow us to:

  • retry automatically, if needed
  • show an adequate error message to the users
  • even: adapt the way we show the dataset viewer on the Hub
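
As a rough illustration of the idea, the sketch below maps a caught exception to a specific error that carries its own error code and a retry hint. Every name in it (`ViewerError`, `ERROR_BY_CAUSE`, `to_specific_error`, the error-code strings, and the choice of which causes are retryable) is hypothetical and not taken from the codebase.

```python
from typing import Optional


class ViewerError(Exception):
    """Hypothetical base error carrying an error code and a retry hint."""

    def __init__(self, message: str, code: str, retryable: bool,
                 cause: Optional[BaseException] = None):
        super().__init__(message)
        self.code = code
        self.retryable = retryable
        self.cause = cause


# Hypothetical mapping: cause exception name -> (error code, retryable?)
ERROR_BY_CAUSE = {
    "ConnectionError": ("ExternalConnectionError", True),
    "FileNotFoundError": ("DataFileNotFoundError", False),
}


def to_specific_error(exc: BaseException) -> ViewerError:
    """Wrap exc in a specific error when its type is known, else fall back to UnexpectedError."""
    code, retryable = ERROR_BY_CAUSE.get(type(exc).__name__, ("UnexpectedError", False))
    return ViewerError(str(exc), code=code, retryable=retryable, cause=exc)
```

With such a mapping, a caller could decide to retry based on `retryable` and pick a user-facing message based on `code`, instead of treating every failure as the same UnexpectedError.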

null means it has no details.cause_exception. These cache entries should be inspected more closely. See https://github.com/huggingface/datasets-server/issues/1123 in particular, which is one of the cases where no cause exception is reported.

db.cachedResponsesBlue.aggregate([
    {$match: {error_code: "UnexpectedError"}},
    {$group: {_id: {cause: "$details.cause_exception", dataset: "$dataset"}, count: {$sum: 1}}},
    {$group: {_id: "$_id.cause", count: {$sum: 1}}},
    {$sort: {count: -1}}
])
{ _id: 'DatasetGenerationError', count: 1964 }
{ _id: null, count: 1388 }
{ _id: 'HfHubHTTPError', count: 1154 }
{ _id: 'OSError', count: 433 }
{ _id: 'FileNotFoundError', count: 242 }
{ _id: 'FileExistsError', count: 198 }
{ _id: 'ValueError', count: 186 }
{ _id: 'TypeError', count: 160 }
{ _id: 'ConnectionError', count: 146 }
{ _id: 'RuntimeError', count: 86 }
{ _id: 'NonMatchingSplitsSizesError', count: 83 }
{ _id: 'FileSystemError', count: 62 }
{ _id: 'ClientResponseError', count: 52 }
{ _id: 'ArrowInvalid', count: 45 }
{ _id: 'ParquetResponseEmptyError', count: 43 }
{ _id: 'RepositoryNotFoundError', count: 41 }
{ _id: 'ManualDownloadError', count: 39 }
{ _id: 'IndexError', count: 28 }
{ _id: 'AttributeError', count: 16 }
{ _id: 'KeyError', count: 15 }
{ _id: 'GatedRepoError', count: 13 }
{ _id: 'NotImplementedError', count: 11 }
{ _id: 'ExpectedMoreSplits', count: 9 }
{ _id: 'PermissionError', count: 8 }
{ _id: 'BadRequestError', count: 7 }
{ _id: 'NonMatchingChecksumError', count: 6 }
{ _id: 'AssertionError', count: 4 }
{ _id: 'NameError', count: 4 }
{ _id: 'UnboundLocalError', count: 3 }
{ _id: 'JSONDecodeError', count: 3 }
{ _id: 'ZeroDivisionError', count: 3 }
{ _id: 'InvalidDocument', count: 3 }
{ _id: 'DoesNotExist', count: 3 }
{ _id: 'EOFError', count: 3 }
{ _id: 'ImportError', count: 3 }
{ _id: 'NotADirectoryError', count: 2 }
{ _id: 'RarCannotExec', count: 2 }
{ _id: 'ReadTimeout', count: 2 }
{ _id: 'ChunkedEncodingError', count: 2 }
{ _id: 'ExpectedMoreDownloadedFiles', count: 2 }
{ _id: 'InvalidConfigName', count: 2 }
{ _id: 'ModuleNotFoundError', count: 2 }
{ _id: 'Exception', count: 2 }
{ _id: 'MissingBeamOptions', count: 2 }
{ _id: 'HTTPError', count: 1 }
{ _id: 'BadZipFile', count: 1 }
{ _id: 'OverflowError', count: 1 }
{ _id: 'HFValidationError', count: 1 }
{ _id: 'IsADirectoryError', count: 1 }
{ _id: 'OperationalError', count: 1 }
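
Note that the query counts datasets, not cache entries: the first $group collapses all entries that share a (cause, dataset) pair into one row, and the second $group counts those rows per cause. A minimal pure-Python sketch of the same logic, run on made-up entries (the sample data and the function name `datasets_per_cause` are illustrative only):

```python
from collections import Counter


def datasets_per_cause(entries):
    """Count distinct datasets per cause_exception among UnexpectedError entries."""
    # First $group: one element per distinct (cause, dataset) pair.
    pairs = {
        ((e.get("details") or {}).get("cause_exception"), e["dataset"])
        for e in entries
        if e.get("error_code") == "UnexpectedError"
    }
    # Second $group: number of distinct datasets per cause.
    counts = Counter(cause for cause, _ in pairs)
    # Same ordering as {$sort: {count: -1}}.
    return sorted(counts.items(), key=lambda item: -item[1])


sample = [
    {"dataset": "a", "error_code": "UnexpectedError", "details": {"cause_exception": "OSError"}},
    {"dataset": "a", "error_code": "UnexpectedError", "details": {"cause_exception": "OSError"}},
    {"dataset": "b", "error_code": "UnexpectedError", "details": {"cause_exception": "OSError"}},
    {"dataset": "c", "error_code": "UnexpectedError", "details": {}},
    {"dataset": "d", "error_code": "ConfigNamesError", "details": {}},
]
result = datasets_per_cause(sample)
```

Here dataset "a" has two OSError entries but is counted once, and the entry without a cause exception ends up under `None`, the Python counterpart of the null row.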

severo avatar Jun 28 '23 08:06 severo

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

Please note that issues that do not follow the contributing guidelines are likely to be ignored.

github-actions[bot] avatar Jul 28 '23 15:07 github-actions[bot]

We need to do it to provide better feedback to the user, and to retry when appropriate.

severo avatar Aug 07 '23 15:08 severo

Copying from #1462

Updated query (Without errors from parent):

db.cachedResponsesBlue.aggregate([{$match: {error_code: "UnexpectedError", kind:"split-duckdb-index", "details.copied_from_artifact":{$exists:false}}},{$group: {_id: {cause: "$details.cause_exception"}, count: {$sum: 1}}},{$sort: {count: -1}}])

Of the 128617 records currently in the cache collection, these are the top kinds of UnexpectedError:

[
  { _id: { cause: 'HfHubHTTPError' }, count: 4429 },
  { _id: { cause: 'HTTPException' }, count: 2570 },
  { _id: { cause: 'Error' }, count: 54 },
  { _id: { cause: 'BinderException' }, count: 41 },
  { _id: { cause: 'CatalogException' }, count: 38 },
  { _id: { cause: 'ParserException' }, count: 29 },
  { _id: { cause: 'InvalidInputException' }, count: 22 },
  { _id: { cause: 'RuntimeError' }, count: 8 },
  { _id: { cause: 'IOException' }, count: 5 },
  { _id: { cause: 'BadRequestError' }, count: 2 },
  { _id: { cause: 'NotPrimaryError' }, count: 2 },
  { _id: { cause: 'EntryNotFoundError' }, count: 2 }
]

Since this is a new job runner, most of these should be evaluated in case there is a bug in the code.
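
The `"details.copied_from_artifact": {$exists: false}` condition in the updated query keeps only the errors raised by the step itself and drops those copied from a parent step. A small Python sketch of the same filter, on made-up entries (the function name and the content of the `copied_from_artifact` value are assumptions made for illustration):

```python
def is_own_unexpected_error(entry, kind=None):
    """True for UnexpectedError entries raised by the step itself, not copied from a parent."""
    if entry.get("error_code") != "UnexpectedError":
        return False
    if kind is not None and entry.get("kind") != kind:
        return False
    # Mirrors {"details.copied_from_artifact": {$exists: false}}.
    return "copied_from_artifact" not in (entry.get("details") or {})


sample = [
    {"kind": "split-duckdb-index", "error_code": "UnexpectedError",
     "details": {"cause_exception": "IOException"}},
    {"kind": "split-duckdb-index", "error_code": "UnexpectedError",
     "details": {"cause_exception": "IOException",
                 "copied_from_artifact": {"kind": "config-parquet-and-info"}}},
    {"kind": "config-parquet-and-info", "error_code": "UnexpectedError", "details": None},
]
own = [e for e in sample if is_own_unexpected_error(e, kind="split-duckdb-index")]
```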

severo avatar Aug 11 '23 15:08 severo

Updating list:

datasets_server_cache> db.cachedResponsesBlue.aggregate([{$match: {error_code: "UnexpectedError", "details.copied_from_artifact":{$exists:false}}},{$group: {_id: {cause: "$details.cause_exception"}, count: {$sum: 1}}},{$sort: {count: -1}}])
[
  { _id: { cause: 'AttributeError' }, count: 9876 },
  { _id: { cause: 'ClientResponseError' }, count: 6034 },
  { _id: { cause: 'DatasetGenerationError' }, count: 5674 },
  { _id: { cause: 'ParserException' }, count: 3058 },
  { _id: { cause: 'TypeError' }, count: 2689 },
  { _id: { cause: 'IOException' }, count: 1961 },
  { _id: { cause: 'InvalidInputException' }, count: 1814 },
  { _id: { cause: 'ZeroDivisionError' }, count: 1693 },
  { _id: { cause: 'FileNotFoundError' }, count: 1687 },
  { _id: { cause: 'HfHubHTTPError' }, count: 1316 },
  { _id: { cause: 'HTTPException' }, count: 1216 },
  { _id: { cause: 'NonMatchingSplitsSizesError' }, count: 1141 },
  { _id: { cause: 'EntryNotFoundError' }, count: 895 },
  { _id: { cause: 'ValueError' }, count: 827 },
  { _id: { cause: 'BinderException' }, count: 789 },
  { _id: { cause: 'KeyError' }, count: 608 },
  { _id: { cause: 'ParquetResponseEmptyError' }, count: 598 },
  { _id: { cause: 'NotImplementedError' }, count: 509 },
  { _id: { cause: 'CachedArtifactNotFoundError' }, count: 457 },
  { _id: { cause: null }, count: 370 },
  { _id: { cause: 'ReadTimeout' }, count: 329 },
  { _id: { cause: 'ConnectionError' }, count: 264 },
  { _id: { cause: 'LocationParseError' }, count: 191 },
  { _id: { cause: 'OSError' }, count: 186 },
  { _id: { cause: 'IndexError' }, count: 155 },
  { _id: { cause: 'AssertionError' }, count: 84 },
  { _id: { cause: 'BadZipFile' }, count: 63 },
  { _id: { cause: 'ArrowInvalid' }, count: 57 },
  { _id: { cause: 'OutOfRangeException' }, count: 53 },
  { _id: { cause: 'CatalogException' }, count: 44 },
  { _id: { cause: 'ModuleNotFoundError' }, count: 41 },
  { _id: { cause: 'RuntimeError' }, count: 39 },
  { _id: { cause: 'LocalEntryNotFoundError' }, count: 26 },
  { _id: { cause: 'UnboundLocalError' }, count: 26 },
  { _id: { cause: 'FileExistsError' }, count: 24 },
  { _id: { cause: 'Error' }, count: 24 },
  { _id: { cause: 'RepositoryNotFoundError' }, count: 21 },
  { _id: { cause: 'InvalidOperation' }, count: 16 },
  { _id: { cause: 'ExpectedMoreSplits' }, count: 15 },
  { _id: { cause: 'ImportError' }, count: 12 },
  { _id: { cause: 'ServerDisconnectedError' }, count: 11 },
  { _id: { cause: 'NameError' }, count: 9 },
  { _id: { cause: 'SyntaxError' }, count: 8 },
  { _id: { cause: 'PermissionError' }, count: 6 },
  { _id: { cause: 'InternalException' }, count: 5 },
  { _id: { cause: 'ChunkedEncodingError' }, count: 5 },
  { _id: { cause: 'InvalidDocument' }, count: 4 },
  { _id: { cause: 'ParserError' }, count: 3 },
  { _id: { cause: 'DoesNotExist' }, count: 3 },
  { _id: { cause: 'ConversionException' }, count: 3 },
  { _id: { cause: 'NonStreamableDatasetError' }, count: 3 },
  { _id: { cause: 'SSLError' }, count: 3 },
  { _id: { cause: 'Exception' }, count: 3 },
  { _id: { cause: 'GatedRepoError' }, count: 3 },
  { _id: { cause: 'JSONDecodeError' }, count: 2 },
  { _id: { cause: 'InvalidConfigName' }, count: 2 },
  { _id: { cause: 'FileSystemError' }, count: 1 },
  { _id: { cause: 'AutoReconnect' }, count: 1 },
  { _id: { cause: 'TypeMismatchException' }, count: 1 },
  { _id: { cause: 'HFValidationError' }, count: 1 },
  { _id: { cause: 'EOFError' }, count: 1 },
  { _id: { cause: 'OperationalError' }, count: 1 },
  { _id: { cause: 'TransactionException' }, count: 1 },
  { _id: { cause: 'NotPrimaryError' }, count: 1 },
  { _id: { cause: 'UnicodeDecodeError' }, count: 1 },
  { _id: { cause: 'OutOfMemoryException' }, count: 1 }
]

AndreaFrancis avatar Sep 11 '23 19:09 AndreaFrancis

After some manual cache maintenance (removing obsolete records whose config or split no longer exists), this is the updated list. Mostly AttributeError and ClientResponseError were reduced:

[
  { _id: { cause: 'DatasetGenerationError' }, count: 3791 },
  { _id: { cause: 'TypeError' }, count: 2222 },
  { _id: { cause: 'ParserException' }, count: 2095 },
  { _id: { cause: 'InvalidInputException' }, count: 1750 },
  { _id: { cause: 'FileNotFoundError' }, count: 1393 },
  { _id: { cause: 'ZeroDivisionError' }, count: 1224 },
  { _id: { cause: 'HfHubHTTPError' }, count: 1128 },
  { _id: { cause: 'NonMatchingSplitsSizesError' }, count: 1116 },
  { _id: { cause: 'IOException' }, count: 1035 },
  { _id: { cause: 'CachedArtifactNotFoundError' }, count: 745 },
  { _id: { cause: 'HTTPException' }, count: 526 },
  { _id: { cause: 'NotImplementedError' }, count: 493 },
  { _id: { cause: 'BinderException' }, count: 462 },
  { _id: { cause: 'KeyError' }, count: 454 },
  { _id: { cause: 'ReadTimeout' }, count: 311 },
  { _id: { cause: 'ParquetResponseEmptyError' }, count: 292 },
  { _id: { cause: 'ConnectionError' }, count: 201 },
  { _id: { cause: 'ValueError' }, count: 187 },
  { _id: { cause: 'AttributeError' }, count: 127 },
  { _id: { cause: 'IndexError' }, count: 107 },
  { _id: { cause: 'OSError' }, count: 102 },
  { _id: { cause: 'ClientResponseError' }, count: 94 },
  { _id: { cause: 'EntryNotFoundError' }, count: 92 },
  { _id: { cause: 'AssertionError' }, count: 84 },
  { _id: { cause: 'BadZipFile' }, count: 61 },
  { _id: { cause: 'OutOfRangeException' }, count: 46 },
  { _id: { cause: 'ModuleNotFoundError' }, count: 43 },
  { _id: { cause: 'LocationParseError' }, count: 29 },
  { _id: { cause: 'ArrowInvalid' }, count: 28 },
  { _id: { cause: 'CatalogException' }, count: 26 },
  { _id: { cause: 'LocalEntryNotFoundError' }, count: 19 },
  { _id: { cause: 'Error' }, count: 16 },
  { _id: { cause: 'ServerDisconnectedError' }, count: 9 },
  { _id: { cause: 'SyntaxError' }, count: 8 },
  { _id: { cause: 'InvalidOperation' }, count: 8 },
  { _id: { cause: 'RuntimeError' }, count: 7 },
  { _id: { cause: 'PermissionError' }, count: 6 },
  { _id: { cause: 'UnboundLocalError' }, count: 6 },
  { _id: { cause: 'NameError' }, count: 5 },
  { _id: { cause: 'NonStreamableDatasetError' }, count: 3 },
  { _id: { cause: 'Exception' }, count: 3 },
  { _id: { cause: 'ChunkedEncodingError' }, count: 3 },
  { _id: { cause: 'SSLError' }, count: 3 },
  { _id: { cause: 'ExpectedMoreSplits' }, count: 2 },
  { _id: { cause: 'ConversionException' }, count: 2 },
  { _id: { cause: null }, count: 2 },
  { _id: { cause: 'ParserError' }, count: 2 },
  { _id: { cause: 'RepositoryNotFoundError' }, count: 2 },
  { _id: { cause: 'OperationalError' }, count: 1 },
  { _id: { cause: 'UnicodeDecodeError' }, count: 1 },
  { _id: { cause: 'TransactionException' }, count: 1 },
  { _id: { cause: 'OutOfMemoryException' }, count: 1 },
  { _id: { cause: 'DoesNotExist' }, count: 1 },
  { _id: { cause: 'ImportError' }, count: 1 },
  { _id: { cause: 'HFValidationError' }, count: 1 },
  { _id: { cause: 'JSONDecodeError' }, count: 1 },
  { _id: { cause: 'EOFError' }, count: 1 },
  { _id: { cause: 'TypeMismatchException' }, count: 1 },
  { _id: { cause: 'InternalException' }, count: 1 }
]

AndreaFrancis avatar Sep 18 '23 11:09 AndreaFrancis

Update of UnexpectedErrors count by kind:

db.cachedResponsesBlue.aggregate([{$match: {error_code: "UnexpectedError", "details.copied_from_artifact":{$exists:false}}},{$group: {_id: {kindkind: "$kind"}, count: {$sum: 1}}},{$sort: {count: -1}}])
[
  { _id: { kindkind: 'config-parquet-and-info' }, count: 9117 },
  { _id: { kindkind: 'split-descriptive-statistics' }, count: 6685 },
  { _id: { kindkind: 'split-duckdb-index' }, count: 591 },
  { _id: { kindkind: 'split-first-rows-from-parquet' }, count: 11 }
]

For split-first-rows-from-parquet, this will be fixed by https://github.com/huggingface/datasets-server/pull/2126

AndreaFrancis avatar Nov 17 '23 15:11 AndreaFrancis

Interesting that only 4 steps produce all the unexpected errors.

severo avatar Nov 17 '23 16:11 severo

For KeyError, see https://github.com/huggingface/huggingface_hub/issues/1853

severo avatar Nov 22 '23 10:11 severo

Current state:

db.cachedResponsesBlue.aggregate([
    {$match: {error_code: "UnexpectedError"}},
    {$group: {_id: {cause: "$details.cause_exception", dataset: "$dataset"}, count: {$sum: 1}}},
    {$group: {_id: "$_id.cause", count: {$sum: 1}}},
    {$sort: {count: -1}}
])
{ _id: 'DatasetGenerationError', count: 2767 }
{ _id: 'HfHubHTTPError', count: 795 }
{ _id: 'TypeError', count: 633 }
{ _id: 'ZeroDivisionError', count: 621 }
{ _id: 'IOException', count: 514 }
{ _id: 'ReadTimeout', count: 245 }
{ _id: 'OSError', count: 151 }
{ _id: 'BinderException', count: 127 }
{ _id: 'ConnectionError', count: 119 }
{ _id: 'ValueError', count: 103 }
{ _id: 'ParserException', count: 91 }
{ _id: 'EntryNotFoundError', count: 66 }
{ _id: 'NotImplementedError', count: 66 }
{ _id: 'FileNotFoundError', count: 60 }
{ _id: 'NonMatchingSplitsSizesError', count: 43 }
{ _id: 'BrokenPipeError', count: 39 }
{ _id: 'InvalidInputException', count: 36 }
{ _id: 'IndexError', count: 30 }
{ _id: 'OutOfRangeException', count: 30 }
{ _id: 'HTTPException', count: 21 }
{ _id: 'LocationParseError', count: 17 }
{ _id: 'RuntimeError', count: 15 }
{ _id: 'KeyError', count: 13 }
{ _id: 'BadZipFile', count: 9 }
{ _id: 'Error', count: 7 }
{ _id: 'ExpectedMoreSplits', count: 5 }
{ _id: 'ArrowInvalid', count: 5 }
{ _id: 'ConversionException', count: 4 }
{ _id: 'NameError', count: 4 }
{ _id: 'AssertionError', count: 4 }
{ _id: 'AttributeError', count: 3 }
{ _id: 'ModuleNotFoundError', count: 3 }
{ _id: 'PermissionError', count: 3 }
{ _id: 'NotPrimaryError', count: 3 }
{ _id: 'ParserError', count: 3 }
{ _id: 'ChunkedEncodingError', count: 2 }
{ _id: 'LocalEntryNotFoundError', count: 2 }
{ _id: 'RepositoryNotFoundError', count: 2 }
{ _id: 'UnboundLocalError', count: 2 }
{ _id: 'Exception', count: 2 }
{ _id: 'TypeMismatchException', count: 2 }
{ _id: 'ClientResponseError', count: 2 }
{ _id: 'JSONDecodeError', count: 1 }
{ _id: 'InvalidConfigName', count: 1 }
{ _id: 'GatedRepoError', count: 1 }
{ _id: 'CachedArtifactNotFoundError', count: 1 }
{ _id: 'HFValidationError', count: 1 }
{ _id: 'RarCannotExec', count: 1 }
{ _id: 'OutOfMemoryException', count: 1 }
{ _id: 'ImportError', count: 1 }
{ _id: 'NonStreamableDatasetError', count: 1 }
{ _id: 'OperationalError', count: 1 }
{ _id: 'SyntaxError', count: 1 }
{ _id: 'UnicodeDecodeError', count: 1 }
{ _id: 'EOFError', count: 1 }

severo avatar Nov 22 '23 10:11 severo

Updated list of UnexpectedErrors by kind:

[
  { _id: { kindkind: 'config-parquet-and-info' }, count: 8500 },
  { _id: { kindkind: 'split-descriptive-statistics' }, count: 2628 },
  { _id: { kindkind: 'split-duckdb-index' }, count: 794 }
]

AndreaFrancis avatar Dec 18 '23 20:12 AndreaFrancis

Current state:

db.cachedResponsesBlue.aggregate([
    {$match: {error_code: "UnexpectedError"}},
    {$group: {_id: {cause: "$details.cause_exception", dataset: "$dataset"}, count: {$sum: 1}}},
    {$group: {_id: "$_id.cause", count: {$sum: 1}}},
    {$sort: {count: -1}}
])
{ _id: 'DatasetGenerationError', count: 3963 }
{ _id: 'TypeError', count: 958 }
{ _id: 'HfHubHTTPError', count: 778 }
{ _id: 'DatasetGenerationCastError', count: 287 }
{ _id: 'OSError', count: 219 }
{ _id: 'ValueError', count: 182 }
{ _id: 'ReadTimeout', count: 172 }
{ _id: 'ParserException', count: 127 }
{ _id: 'BinderException', count: 108 }
{ _id: 'ConnectionError', count: 103 }
{ _id: 'EntryNotFoundError', count: 77 }
{ _id: 'InvalidInputException', count: 76 }
{ _id: 'IOException', count: 72 }
{ _id: 'NotImplementedError', count: 69 }
{ _id: 'FileNotFoundError', count: 59 }
{ _id: 'ComputeError', count: 57 }
{ _id: 'NonMatchingSplitsSizesError', count: 50 }
{ _id: 'ColumnNotFoundError', count: 46 }
{ _id: 'RuntimeError', count: 34 }
{ _id: 'IndexError', count: 25 }
{ _id: 'ConversionException', count: 23 }
{ _id: 'HTTPException', count: 20 }
{ _id: 'ZeroDivisionError', count: 19 }
{ _id: 'LocationParseError', count: 15 }
{ _id: 'KeyError', count: 12 }
{ _id: 'BadZipFile', count: 11 }
{ _id: 'ArrowInvalid', count: 10 }
{ _id: 'ExpectedMoreSplits', count: 8 }
{ _id: 'ParserError', count: 8 }
{ _id: 'Error', count: 8 }
{ _id: 'InvalidOperationError', count: 7 }
{ _id: 'SchemaError', count: 5 }
{ _id: 'ReadError', count: 5 }
{ _id: 'AssertionError', count: 4 }
{ _id: 'ArrowCapacityError', count: 4 }
{ _id: 'NameError', count: 4 }
{ _id: 'PermissionError', count: 3 }
{ _id: 'AttributeError', count: 3 }
{ _id: 'JSONDecodeError', count: 3 }
{ _id: 'DuplicateError', count: 2 }
{ _id: 'TypeMismatchException', count: 2 }
{ _id: 'RarCannotExec', count: 2 }
{ _id: 'UnboundLocalError', count: 2 }
{ _id: 'Exception', count: 2 }
{ _id: 'TransactionException', count: 2 }
{ _id: 'ChunkedEncodingError', count: 2 }
{ _id: 'UnicodeDecodeError', count: 2 }
{ _id: 'ClientResponseError', count: 2 }
{ _id: 'ModuleNotFoundError', count: 2 }
{ _id: 'InvalidConfigName', count: 1 }
{ _id: 'OperationalError', count: 1 }
{ _id: 'GatedRepoError', count: 1 }
{ _id: 'CachedArtifactNotFoundError', count: 1 }
{ _id: 'HFValidationError', count: 1 }
{ _id: 'ImportError', count: 1 }
{ _id: 'OutOfRangeException', count: 1 }
{ _id: 'NonStreamableDatasetError', count: 1 }
{ _id: 'NotPrimaryError', count: 1 }
{ _id: 'RepositoryNotFoundError', count: 1 }
{ _id: 'LocalEntryNotFoundError', count: 1 }
db.cachedResponsesBlue.aggregate([{$match: {error_code: "UnexpectedError", "details.copied_from_artifact":{$exists:false}}},{$group: {_id: {kindkind: "$kind"}, count: {$sum: 1}}},{$sort: {count: -1}}])
{ _id: { kindkind: 'config-parquet-and-info' }, count: 9338 }
{ _id: { kindkind: 'split-descriptive-statistics' }, count: 2868 }
{ _id: { kindkind: 'split-duckdb-index' }, count: 847 }
{ _id: { kindkind: 'split-first-rows-from-parquet' }, count: 2 }

severo avatar Feb 06 '24 14:02 severo

I would bet that most errors occur for datasets with a script. I propose to recreate all of these datasets... In most cases, it will create a DatasetWithScriptNotSupportedError error instead of some weird-looking error.

Number of unique datasets:

db.cachedResponsesBlue.aggregate([
    { $match: { error_code: "UnexpectedError" } },
    { $group: { _id: null, uniqueValues: { $addToSet: "$dataset" } } },
    { $project: { _id: 0, uniqueValues: 1 } },
    { $unwind: "$uniqueValues" },
    { $group: { _id: null, count: { $sum: 1 } } },
    { $project: { _id: 0, count: 1 } }
]);
{ count: 7484 }
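
The pipeline above builds the set of dataset names with $addToSet, then unwinds and counts its elements. The same computation in plain Python, on made-up entries (the assumed entry shape is a dict with `dataset` and `error_code` keys, and the function name is illustrative):

```python
def count_unique_datasets(entries):
    """Number of distinct datasets that have at least one UnexpectedError entry."""
    return len({e["dataset"] for e in entries if e.get("error_code") == "UnexpectedError"})


sample = [
    {"dataset": "a", "error_code": "UnexpectedError"},
    {"dataset": "a", "error_code": "UnexpectedError"},
    {"dataset": "b", "error_code": "UnexpectedError"},
    {"dataset": "c", "error_code": "ConfigNamesError"},
]
unique_count = count_unique_datasets(sample)
```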

I'm recreating the datasets one by one, with:

DATASETS=(...)
# Quote the array expansion so that each dataset name stays a single word
for dataset in "${DATASETS[@]}"; do
  curl -H "Authorization: Bearer $HF_TOKEN" -X POST "https://datasets-server.huggingface.co/admin/recreate-dataset?dataset=$dataset&priority=low"
done
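
For reference, a Python sketch that only builds the request URLs and percent-encodes the dataset name. The endpoint and the `dataset` and `priority` parameters come from the command above; the function name and the example dataset names are made up, and whether the admin endpoint needs the encoding is not stated in this thread. The sketch sends no request.

```python
from urllib.parse import urlencode

ADMIN_URL = "https://datasets-server.huggingface.co/admin/recreate-dataset"


def recreate_url(dataset: str, priority: str = "low") -> str:
    """Build the recreate-dataset URL, percent-encoding the query parameters."""
    return f"{ADMIN_URL}?{urlencode({'dataset': dataset, 'priority': priority})}"


# Hypothetical dataset names, for illustration only.
urls = [recreate_url(name) for name in ["user/dataset-one", "user/dataset-two"]]
```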

I scaled the admin service from 2 to 4; let's see if it improves things.

Requests are being processed at a rate of approximately 1 per second, so the 7484 datasets should hopefully be done in about two hours.
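
The estimate can be checked with a quick calculation, assuming one request per dataset and the approximate rate quoted above:

```python
num_datasets = 7484        # unique datasets reported by the query above
requests_per_second = 1.0  # approximate processing rate

hours = num_datasets / requests_per_second / 3600
print(f"{hours:.2f} hours")
```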

severo avatar Feb 06 '24 22:02 severo

Today:

Number of datasets, by step and cause exception:
db.cachedResponsesBlue.aggregate([
  { $match: { error_code: "UnexpectedError", "details.copied_from_artifact": { $exists: false } } },
  {
    $group: {
      _id: { kind: "$kind", cause: "$details.cause_exception", dataset: "$dataset" },
      count: { $sum: 1 },
    },
  },
  { $group: { _id: { kind: "$_id.kind", cause: "$_id.cause" }, count: { $sum: 1 } } },
  { $sort: { "_id.kind": 1, count: -1 } },
  { $project: { _id: 0, kind: "$_id.kind", num_datasets: "$count", cause: "$_id.cause" } } 
]);
{ kind: 'config-parquet-and-info', num_datasets: 2486, cause: 'DatasetGenerationError' }
{ kind: 'config-parquet-and-info', num_datasets: 1226, cause: 'DatasetGenerationCastError' }
{ kind: 'config-parquet-and-info', num_datasets: 575, cause: 'OSError' }
{ kind: 'config-parquet-and-info', num_datasets: 64, cause: 'ValueError' }
{ kind: 'config-parquet-and-info', num_datasets: 32, cause: 'NotImplementedError' }
{ kind: 'config-parquet-and-info', num_datasets: 30, cause: 'NonMatchingSplitsSizesError' }
{ kind: 'config-parquet-and-info', num_datasets: 18, cause: 'ZeroDivisionError' }
{ kind: 'config-parquet-and-info', num_datasets: 15, cause: 'RuntimeError' }
{ kind: 'config-parquet-and-info', num_datasets: 14, cause: 'ArrowInvalid' }
{ kind: 'config-parquet-and-info', num_datasets: 11, cause: 'HfHubHTTPError' }
{ kind: 'config-parquet-and-info', num_datasets: 8, cause: 'ParserError' }
{ kind: 'config-parquet-and-info', num_datasets: 7, cause: 'BadZipFile' }
{ kind: 'config-parquet-and-info', num_datasets: 6, cause: 'ReadError' }
{ kind: 'config-parquet-and-info', num_datasets: 5, cause: 'ArrowCapacityError' }
{ kind: 'config-parquet-and-info', num_datasets: 2, cause: 'TypeError' }
{ kind: 'config-parquet-and-info', num_datasets: 2, cause: 'IndexError' }
{ kind: 'config-parquet-and-info', num_datasets: 2, cause: 'ExpectedMoreSplits' }
{ kind: 'config-parquet-and-info', num_datasets: 2, cause: 'RarCannotExec' }
{ kind: 'config-parquet-and-info', num_datasets: 2, cause: 'JSONDecodeError' }
{ kind: 'config-parquet-and-info', num_datasets: 1, cause: 'AttributeError' }
{ kind: 'config-parquet-and-info', num_datasets: 1, cause: 'ModuleNotFoundError' }
{ kind: 'config-parquet-and-info', num_datasets: 1, cause: 'FileNotFoundError' }
{ kind: 'config-parquet-and-info', num_datasets: 1, cause: 'UnicodeDecodeError' }
{ kind: 'config-parquet-and-info', num_datasets: 1, cause: 'ConnectionError' }
{ kind: 'config-parquet-and-info', num_datasets: 1, cause: 'ImportError' }
{ kind: 'split-descriptive-statistics', num_datasets: 935, cause: 'TypeError' }
{ kind: 'split-descriptive-statistics', num_datasets: 56, cause: 'ValueError' }
{ kind: 'split-descriptive-statistics', num_datasets: 35, cause: 'ColumnNotFoundError' }
{ kind: 'split-descriptive-statistics', num_datasets: 12, cause: 'ComputeError' }
{ kind: 'split-descriptive-statistics', num_datasets: 5, cause: 'InvalidOperationError' }
{ kind: 'split-descriptive-statistics', num_datasets: 4, cause: 'SchemaError' }
{ kind: 'split-descriptive-statistics', num_datasets: 2, cause: 'DuplicateError' }
{ kind: 'split-duckdb-index', num_datasets: 123, cause: 'InvalidInputException' }
{ kind: 'split-duckdb-index', num_datasets: 109, cause: 'ParserException' }
{ kind: 'split-duckdb-index', num_datasets: 49, cause: 'IOException' }
{ kind: 'split-duckdb-index', num_datasets: 6, cause: 'ConversionException' }
{ kind: 'split-duckdb-index', num_datasets: 5, cause: 'Error' }
{ kind: 'split-duckdb-index', num_datasets: 2, cause: 'TypeMismatchException' }
{ kind: 'split-duckdb-index', num_datasets: 1, cause: 'TransactionException' }
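
The pipeline above deduplicates on (kind, cause, dataset), counts datasets per (kind, cause), and sorts by kind first and then by descending count. A pure-Python sketch of the same steps, on made-up entries (the function name and sample data are illustrative only):

```python
from collections import Counter


def datasets_by_kind_and_cause(entries):
    """Count distinct datasets per (kind, cause), excluding errors copied from a parent."""
    triples = {
        (e["kind"], (e.get("details") or {}).get("cause_exception"), e["dataset"])
        for e in entries
        if e.get("error_code") == "UnexpectedError"
        and "copied_from_artifact" not in (e.get("details") or {})
    }
    counts = Counter((kind, cause) for kind, cause, _ in triples)
    rows = [{"kind": k, "num_datasets": n, "cause": c} for (k, c), n in counts.items()]
    # Same ordering as {$sort: {"_id.kind": 1, count: -1}}.
    return sorted(rows, key=lambda r: (r["kind"], -r["num_datasets"]))


def _entry(kind, dataset, cause, copied=False):
    details = {"cause_exception": cause}
    if copied:
        details["copied_from_artifact"] = {}
    return {"kind": kind, "dataset": dataset, "error_code": "UnexpectedError", "details": details}


sample = [
    _entry("split-duckdb-index", "c", "IOException"),
    _entry("config-parquet-and-info", "a", "ValueError"),
    _entry("config-parquet-and-info", "a", "OSError"),
    _entry("config-parquet-and-info", "a", "OSError"),
    _entry("config-parquet-and-info", "b", "OSError"),
    _entry("split-duckdb-index", "a", "OSError", copied=True),
]
rows = datasets_by_kind_and_cause(sample)
```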

severo avatar Feb 08 '24 13:02 severo

Today:

Atlas atlas-x5jgb3-shard-0 [primary] datasets_server_cache> db.cachedResponsesBlue.aggregate([{$match: {error_code: "UnexpectedError", "details.copied_from_artifact":{$exists:false}}},{$group: {_id: {kind: "$kind"}, count: {$sum: 1}}},{$sort: {count: -1}}])

[
  { _id: { kind: 'config-parquet-and-info' }, count: 6215 },
  { _id: { kind: 'split-descriptive-statistics' }, count: 2173 },
  { _id: { kind: 'split-duckdb-index' }, count: 2034 },
  { _id: { kind: 'split-duckdb-index-010' }, count: 777 },
  { _id: { kind: 'split-first-rows' }, count: 1 }
]

AndreaFrancis avatar Feb 23 '24 13:02 AndreaFrancis

Today:

Atlas atlas-x5jgb3-shard-0 [primary] datasets_server_cache> db.cachedResponsesBlue.aggregate([{$match: {error_code: "UnexpectedError", "details.copied_from_artifact":{$exists:false}}},{$group: {_id: {kind: "$kind"}, count: {$sum: 1}}},{$sort: {count: -1}}]) 
[
  { _id: { kind: 'config-parquet-and-info' }, count: 7373 },
  { _id: { kind: 'split-descriptive-statistics' }, count: 3808 },
  { _id: { kind: 'split-duckdb-index' }, count: 3285 },
  { _id: { kind: 'split-first-rows' }, count: 206 }
]

AndreaFrancis avatar Mar 04 '24 17:03 AndreaFrancis

Today:

db.cachedResponsesBlue.aggregate([{$match: {error_code: "UnexpectedError", "details.copied_from_artifact":{$exists:false}}},{$group: {_id: {kind: "$kind"}, count: {$sum: 1}}},{$sort: {count: -1}}])
[
  { _id: { kind: 'config-parquet-and-info' }, count: 6668 },
  { _id: { kind: 'split-descriptive-statistics' }, count: 3667 },
  { _id: { kind: 'split-duckdb-index' }, count: 2941 },
  { _id: { kind: 'dataset-loading-tags' }, count: 1539 },
  { _id: { kind: 'split-first-rows' }, count: 30 }
]

AndreaFrancis avatar Mar 14 '24 19:03 AndreaFrancis

The last PR (#2796) had a big impact!

72K -> 20K entries

They were replaced with 36K DatasetGenerationError and 12K DatasetGenerationCastError entries.

severo avatar May 14 '24 06:05 severo

Today:

db.cachedResponsesBlue.aggregate([{$match: {error_code: "UnexpectedError", "details.copied_from_artifact":{$exists:false}}},{$group: {_id: {kind: "$kind"}, count: {$sum: 1}}},{$sort: {count: -1}}])
{ _id: { kind: 'split-duckdb-index' }, count: 2871 }
{ _id: { kind: 'dataset-compatible-libraries' }, count: 2546 }
{ _id: { kind: 'split-descriptive-statistics' }, count: 1683 }
{ _id: { kind: 'config-parquet-and-info' }, count: 1407 }
{ _id: { kind: 'split-first-rows' }, count: 68 }
{ _id: { kind: 'split-image-url-columns' }, count: 2 }

severo avatar May 14 '24 06:05 severo

After refreshing some records:

Atlas atlas-x5jgb3-shard-0 [primary] datasets_server_cache> db.cachedResponsesBlue.aggregate([{$match: {error_code: "UnexpectedError", "details.copied_from_artifact":{$exists:false}}},{$group: {_id: {kind: "$kind"}, count: {$sum: 1}}},{$sort: {count: -1}}])
[
  { _id: { kind: 'split-duckdb-index' }, count: 1380 },
  { _id: { kind: 'config-parquet-and-info' }, count: 1171 },
  { _id: { kind: 'split-descriptive-statistics' }, count: 676 },
  { _id: { kind: 'dataset-compatible-libraries' }, count: 619 },
  { _id: { kind: 'split-first-rows' }, count: 68 },
  { _id: { kind: 'split-image-url-columns' }, count: 2 }
]

AndreaFrancis avatar May 14 '24 19:05 AndreaFrancis

Today (almost half of yesterday's):

Atlas atlas-x5jgb3-shard-0 [primary] datasets_server_cache> db.cachedResponsesBlue.aggregate([{$match: {error_code: "UnexpectedError", "details.copied_from_artifact":{$exists:false}}},{$group: {_id: {kind: "$kind"}, count: {$sum: 1}}},{$sort: {count: -1}}])
[
  { _id: { kind: 'split-duckdb-index' }, count: 1236 },
  { _id: { kind: 'config-parquet-and-info' }, count: 588 },
  { _id: { kind: 'split-descriptive-statistics' }, count: 301 },
  { _id: { kind: 'dataset-compatible-libraries' }, count: 209 },
  { _id: { kind: 'split-first-rows' }, count: 68 },
  { _id: { kind: 'split-image-url-columns' }, count: 2 }
]

Atlas atlas-x5jgb3-shard-0 [primary] datasets_server_cache> db.cachedResponsesBlue.countDocuments({error_code: "UnexpectedError", "details.copied_from_artifact":{$exists:false}})
2405

AndreaFrancis avatar May 15 '24 14:05 AndreaFrancis

Today:

db.cachedResponsesBlue.aggregate([
  { $match: { error_code: "UnexpectedError", "details.copied_from_artifact": { $exists: false } } },
  {
    $group: {
      _id: { kind: "$kind", cause: "$details.cause_exception", dataset: "$dataset" },
      count: { $sum: 1 },
    },
  },
  { $group: { _id: { kind: "$_id.kind", cause: "$_id.cause" }, count: { $sum: 1 } } },
  { $sort: { count: -1, "_id.kind": 1 } },
  { $project: { _id: 0, kind: "$_id.kind", num_datasets: "$count", cause: "$_id.cause" } } 
]);

{ kind: 'dataset-compatible-libraries', num_datasets: 1507, cause: 'FileNotFoundError' }
{ kind: 'split-duckdb-index', num_datasets: 288, cause: 'ParserException' }
{ kind: 'split-duckdb-index', num_datasets: 262, cause: 'HfHubHTTPError' }
{ kind: 'config-parquet-and-info', num_datasets: 203, cause: 'ValueError' }
{ kind: 'split-duckdb-index', num_datasets: 181, cause: 'UnidentifiedImageError' }
{ kind: 'dataset-filetypes', num_datasets: 160, cause: 'BadZipFile' }
{ kind: 'split-descriptive-statistics', num_datasets: 157, cause: 'ReadTimeout' }
{ kind: 'config-parquet-and-info', num_datasets: 148, cause: 'PermissionError' }
{ kind: 'split-duckdb-index', num_datasets: 144, cause: 'BinderException' }
{ kind: 'dataset-filetypes', num_datasets: 140, cause: 'ValueError' }
{ kind: 'split-duckdb-index', num_datasets: 134, cause: 'ReadTimeout' }
{ kind: 'split-descriptive-statistics', num_datasets: 121, cause: 'ValueError' }
{ kind: 'dataset-compatible-libraries', num_datasets: 96, cause: 'UnicodeDecodeError' }
{ kind: 'dataset-compatible-libraries', num_datasets: 93, cause: 'HfHubHTTPError' }
{ kind: 'split-duckdb-index', num_datasets: 77, cause: 'ValueError' }
{ kind: 'config-parquet-and-info', num_datasets: 73, cause: 'ArrowInvalid' }
{ kind: 'config-parquet-and-info', num_datasets: 69, cause: 'ReadTimeout' }
{ kind: 'config-parquet-and-info', num_datasets: 65, cause: 'RuntimeError' }
{ kind: 'config-parquet-and-info', num_datasets: 52, cause: 'ReadError' }
{ kind: 'split-first-rows', num_datasets: 52, cause: 'ServerDisconnectedError' }
{ kind: 'split-duckdb-index', num_datasets: 50, cause: 'SchemaError' }
{ kind: 'split-duckdb-index', num_datasets: 49, cause: 'ComputeError' }
{ kind: 'split-duckdb-index', num_datasets: 48, cause: 'InvalidInputException' }
{ kind: 'config-parquet-and-info', num_datasets: 44, cause: 'HfHubHTTPError' }
{ kind: 'split-duckdb-index', num_datasets: 42, cause: 'ColumnNotFoundError' }
{ kind: 'split-descriptive-statistics', num_datasets: 40, cause: 'ColumnNotFoundError' }
{ kind: 'split-duckdb-index', num_datasets: 40, cause: 'TypeError' }
{ kind: 'split-descriptive-statistics', num_datasets: 35, cause: 'ConnectionError' }
{ kind: 'split-duckdb-index', num_datasets: 32, cause: 'EntryNotFoundError' }
{ kind: 'dataset-filetypes', num_datasets: 31, cause: 'TypeError' }
{ kind: 'split-first-rows', num_datasets: 28, cause: 'ClientResponseError' }
{ kind: 'config-parquet-and-info', num_datasets: 25, cause: 'NonMatchingSplitsSizesError' }
{ kind: 'config-parquet-and-info', num_datasets: 24, cause: 'ArrowTypeError' }
{ kind: 'split-descriptive-statistics', num_datasets: 24, cause: 'EntryNotFoundError' }
{ kind: 'split-duckdb-index', num_datasets: 24, cause: 'ConnectionError' }
{ kind: 'config-parquet-and-info', num_datasets: 21, cause: 'FileNotFoundError' }
{ kind: 'config-parquet-and-info', num_datasets: 19, cause: 'KeyError' }
{ kind: 'dataset-filetypes', num_datasets: 19, cause: 'FileNotFoundError' }
{ kind: 'split-duckdb-index', num_datasets: 19, cause: 'DecompressionBombError' }
{ kind: 'config-parquet-and-info', num_datasets: 18, cause: 'ConnectionError' }
{ kind: 'config-parquet-and-info', num_datasets: 18, cause: 'ZeroDivisionError' }
{ kind: 'config-parquet-and-info', num_datasets: 15, cause: 'DatasetGenerationError' }
{ kind: 'config-parquet-and-info', num_datasets: 15, cause: 'BadZipFile' }
{ kind: 'config-parquet-and-info', num_datasets: 15, cause: 'IndexError' }
{ kind: 'split-descriptive-statistics', num_datasets: 14, cause: 'ComputeError' }
{ kind: 'config-parquet-and-info', num_datasets: 13, cause: 'ParserError' }
{ kind: 'config-parquet-and-info', num_datasets: 13, cause: 'NotImplementedError' }
{ kind: 'config-parquet-and-info', num_datasets: 11, cause: 'ArrowCapacityError' }
{ kind: 'dataset-filetypes', num_datasets: 11, cause: 'HfHubHTTPError' }
{ kind: 'split-duckdb-index', num_datasets: 10, cause: 'IOException' }
{ kind: 'split-first-rows', num_datasets: 10, cause: 'AttributeError' }
{ kind: 'split-first-rows', num_datasets: 9, cause: 'OSError' }
{ kind: 'split-duckdb-index', num_datasets: 8, cause: 'KeyError' }
{ kind: 'split-duckdb-index', num_datasets: 8, cause: 'ArrowInvalid' }
{ kind: 'split-first-rows', num_datasets: 8, cause: 'ArrowInvalid' }
{ kind: 'config-parquet-and-info', num_datasets: 7, cause: 'TypeError' }
{ kind: 'config-parquet-and-info', num_datasets: 7, cause: 'OSError' }
{ kind: 'split-first-rows', num_datasets: 7, cause: 'ValueError' }
{ kind: 'config-parquet-and-info', num_datasets: 6, cause: 'JSONDecodeError' }
{ kind: 'split-duckdb-index', num_datasets: 6, cause: 'InternalException' }
{ kind: 'split-image-url-columns', num_datasets: 6, cause: 'TypeError' }
{ kind: 'config-parquet-and-info', num_datasets: 5, cause: 'HTTPError' }
{ kind: 'split-duckdb-index', num_datasets: 5, cause: 'ConversionException' }
{ kind: 'config-parquet-and-info', num_datasets: 4, cause: 'DatasetGenerationCastError' }
{ kind: 'split-descriptive-statistics', num_datasets: 4, cause: 'InvalidOperationError' }
{ kind: 'split-descriptive-statistics', num_datasets: 4, cause: 'HfHubHTTPError' }
{ kind: 'split-duckdb-index', num_datasets: 4, cause: 'TypeMismatchException' }
{ kind: 'split-first-rows', num_datasets: 4, cause: 'FSTimeoutError' }
{ kind: 'config-parquet-and-info', num_datasets: 3, cause: 'UnpicklingError' }
{ kind: 'config-parquet-and-info', num_datasets: 3, cause: 'ExpectedMoreSplits' }
{ kind: 'split-duckdb-index', num_datasets: 3, cause: 'Error' }
{ kind: 'config-parquet-and-info', num_datasets: 2, cause: 'UnicodeDecodeError' }
{ kind: 'dataset-compatible-libraries', num_datasets: 2, cause: 'ValueError' }
{ kind: 'split-descriptive-statistics', num_datasets: 2, cause: 'DuplicateError' }
{ kind: 'split-descriptive-statistics', num_datasets: 2, cause: 'SchemaError' }
{ kind: 'split-descriptive-statistics', num_datasets: 2, cause: 'KeyError' }
{ kind: 'config-parquet-and-info', num_datasets: 1, cause: 'ImportError' }
{ kind: 'config-parquet-and-info', num_datasets: 1, cause: 'ChunkedEncodingError' }
{ kind: 'config-parquet-and-info', num_datasets: 1, cause: 'IsADirectoryError' }
{ kind: 'config-parquet-and-info', num_datasets: 1, cause: 'EmptyDataError' }
{ kind: 'config-parquet-and-info', num_datasets: 1, cause: 'EOFError' }
{ kind: 'dataset-compatible-libraries', num_datasets: 1, cause: 'EmptyDatasetError' }
{ kind: 'dataset-filetypes', num_datasets: 1, cause: 'ConnectionError' }
{ kind: 'split-descriptive-statistics', num_datasets: 1, cause: 'RuntimeError' }
{ kind: 'split-descriptive-statistics', num_datasets: 1, cause: 'TypeError' }
{ kind: 'split-duckdb-index', num_datasets: 1, cause: 'error' }
{ kind: 'split-duckdb-index', num_datasets: 1, cause: 'TransactionException' }
{ kind: 'split-duckdb-index', num_datasets: 1, cause: 'FileNotFoundError' }
{ kind: 'split-duckdb-index', num_datasets: 1, cause: 'OutOfMemoryException' }
{ kind: 'split-duckdb-index', num_datasets: 1, cause: 'RuntimeError' }
{ kind: 'split-first-rows', num_datasets: 1, cause: 'ClientConnectorError' }
{ kind: 'split-first-rows', num_datasets: 1, cause: 'UnicodeDecodeError' }
{ kind: 'split-first-rows', num_datasets: 1, cause: 'ClientPayloadError' }

severo avatar Jul 30 '24 16:07 severo

Note that we currently have 14K UnexpectedError entries, which is about 0.1% of the total cache entries. So: not that crucial either. I'll reduce the priority.

Maybe more important is to replace ConfigNamesError with the underlying error (100K entries), and to make DatasetGenerationError (50K entries) more explicit, to help users debug their data files.
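
One possible way to surface the underlying error is to walk the exception chain. The sketch below assumes the wrapping error was raised with `raise ... from ...`, so that `__cause__` is set; whether that holds for ConfigNamesError and DatasetGenerationError is an assumption, and `root_cause` is a hypothetical helper, not an existing function.

```python
def root_cause(exc: BaseException) -> BaseException:
    """Follow the __cause__ chain to the innermost exception."""
    seen = set()
    # The `seen` set guards against a (pathological) cyclic chain.
    while exc.__cause__ is not None and id(exc) not in seen:
        seen.add(id(exc))
        exc = exc.__cause__
    return exc


# Stand-in exceptions: a generic wrapper raised from a more specific error.
try:
    try:
        raise ValueError("bad column")
    except ValueError as inner:
        raise RuntimeError("generation failed") from inner
except RuntimeError as outer:
    found = root_cause(outer)
```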

severo avatar Aug 01 '24 11:08 severo

I created https://github.com/huggingface/dataset-viewer/issues/3010 and https://github.com/huggingface/dataset-viewer/issues/3011

severo avatar Aug 01 '24 11:08 severo