dataset-viewer
Raise specific errors (and error_code) instead of UnexpectedError
The following query on the production database gives the number of datasets with at least one cache entry with error_code "UnexpectedError", grouped by the underlying "cause_exception".
For the most common ones (DatasetGenerationError, HfHubHTTPError, OSError, etc.) we would benefit from raising a specific error with its own error code. It would allow us to:
- retry automatically, if needed
- show an adequate error message to the users
- even: adapt the way we show the dataset viewer on the Hub
null means the entry has no details.cause_exception. These cache entries should be inspected more closely. See https://github.com/huggingface/datasets-server/issues/1123 in particular, which is one of the cases where no cause exception is reported.
db.cachedResponsesBlue.aggregate([
{$match: {error_code: "UnexpectedError"}},
{$group: {_id: {cause: "$details.cause_exception", dataset: "$dataset"}, count: {$sum: 1}}},
{$group: {_id: "$_id.cause", count: {$sum: 1}}},
{$sort: {count: -1}}
])
{ _id: 'DatasetGenerationError', count: 1964 }
{ _id: null, count: 1388 }
{ _id: 'HfHubHTTPError', count: 1154 }
{ _id: 'OSError', count: 433 }
{ _id: 'FileNotFoundError', count: 242 }
{ _id: 'FileExistsError', count: 198 }
{ _id: 'ValueError', count: 186 }
{ _id: 'TypeError', count: 160 }
{ _id: 'ConnectionError', count: 146 }
{ _id: 'RuntimeError', count: 86 }
{ _id: 'NonMatchingSplitsSizesError', count: 83 }
{ _id: 'FileSystemError', count: 62 }
{ _id: 'ClientResponseError', count: 52 }
{ _id: 'ArrowInvalid', count: 45 }
{ _id: 'ParquetResponseEmptyError', count: 43 }
{ _id: 'RepositoryNotFoundError', count: 41 }
{ _id: 'ManualDownloadError', count: 39 }
{ _id: 'IndexError', count: 28 }
{ _id: 'AttributeError', count: 16 }
{ _id: 'KeyError', count: 15 }
{ _id: 'GatedRepoError', count: 13 }
{ _id: 'NotImplementedError', count: 11 }
{ _id: 'ExpectedMoreSplits', count: 9 }
{ _id: 'PermissionError', count: 8 }
{ _id: 'BadRequestError', count: 7 }
{ _id: 'NonMatchingChecksumError', count: 6 }
{ _id: 'AssertionError', count: 4 }
{ _id: 'NameError', count: 4 }
{ _id: 'UnboundLocalError', count: 3 }
{ _id: 'JSONDecodeError', count: 3 }
{ _id: 'ZeroDivisionError', count: 3 }
{ _id: 'InvalidDocument', count: 3 }
{ _id: 'DoesNotExist', count: 3 }
{ _id: 'EOFError', count: 3 }
{ _id: 'ImportError', count: 3 }
{ _id: 'NotADirectoryError', count: 2 }
{ _id: 'RarCannotExec', count: 2 }
{ _id: 'ReadTimeout', count: 2 }
{ _id: 'ChunkedEncodingError', count: 2 }
{ _id: 'ExpectedMoreDownloadedFiles', count: 2 }
{ _id: 'InvalidConfigName', count: 2 }
{ _id: 'ModuleNotFoundError', count: 2 }
{ _id: 'Exception', count: 2 }
{ _id: 'MissingBeamOptions', count: 2 }
{ _id: 'HTTPError', count: 1 }
{ _id: 'BadZipFile', count: 1 }
{ _id: 'OverflowError', count: 1 }
{ _id: 'HFValidationError', count: 1 }
{ _id: 'IsADirectoryError', count: 1 }
{ _id: 'OperationalError', count: 1 }
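The two $group stages in the query above can be easy to misread: the first collapses all cache entries of a dataset that share a cause into one row, and the second counts those rows per cause, so each count is a number of datasets rather than a number of cache entries. A minimal in-memory sketch of the same logic, assuming documents shaped like the cache entries (the sample documents and the function name are made up for illustration):

```javascript
// In-memory equivalent of the two $group stages, for illustration only.
// `entries` mimics cache documents; the sample values below are invented.
function countDatasetsByCause(entries) {
  // Stage 1: keep one (cause, dataset) pair, however many entries share it.
  const pairs = new Set();
  for (const entry of entries) {
    if (entry.error_code !== "UnexpectedError") continue; // $match
    const cause = entry.details?.cause_exception ?? null; // missing -> null
    pairs.add(JSON.stringify([cause, entry.dataset]));
  }
  // Stage 2: count the distinct datasets per cause.
  const counts = new Map();
  for (const pair of pairs) {
    const [cause] = JSON.parse(pair);
    counts.set(cause, (counts.get(cause) ?? 0) + 1);
  }
  // Equivalent of {$sort: {count: -1}}.
  return [...counts.entries()]
    .map(([cause, count]) => ({ _id: cause, count }))
    .sort((a, b) => b.count - a.count);
}

const sampleEntries = [
  { dataset: "user/a", error_code: "UnexpectedError", details: { cause_exception: "OSError" } },
  { dataset: "user/a", error_code: "UnexpectedError", details: { cause_exception: "OSError" } },
  { dataset: "user/b", error_code: "UnexpectedError", details: { cause_exception: "OSError" } },
  { dataset: "user/c", error_code: "UnexpectedError", details: {} },
  { dataset: "user/d", error_code: "ConfigNamesError", details: {} },
];
console.log(countDatasetsByCause(sampleEntries));
```

Here "user/a" has two failing entries with the same cause but is counted once, and the entry without a cause exception ends up under null.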
We need to do it to provide better feedback to the user, and to retry when appropriate.
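As a rough illustration of the goal (specific error code, better message, retry when appropriate), here is a hedged sketch of a lookup from cause exception to a specific error. The error codes other than UnexpectedError, the retry flags and the messages are assumptions made up for this example, not the project's actual definitions:

```javascript
// Hypothetical mapping from the underlying cause exception to a specific
// error. Codes, retry flags and messages are illustrative assumptions.
const SPECIFIC_ERRORS = new Map([
  ["DatasetGenerationError", {
    errorCode: "DatasetGenerationError",
    retryable: false,
    message: "The dataset could not be generated from its data files.",
  }],
  ["HfHubHTTPError", {
    errorCode: "HubRequestError", // hypothetical name
    retryable: true,
    message: "A request to the Hub failed. The job may succeed on retry.",
  }],
  ["ReadTimeout", {
    errorCode: "ReadTimeoutError", // hypothetical name
    retryable: true,
    message: "Reading the data timed out. The job may succeed on retry.",
  }],
]);

// Falls back to UnexpectedError when the cause is unknown or missing.
function classifyError(causeException) {
  return SPECIFIC_ERRORS.get(causeException) ?? {
    errorCode: "UnexpectedError",
    retryable: false,
    message: "An unexpected error occurred.",
  };
}

console.log(classifyError("HfHubHTTPError").retryable); // true
console.log(classifyError(null).errorCode); // UnexpectedError
```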
Copying from #1462
Updated query (Without errors from parent):
db.cachedResponsesBlue.aggregate([{$match: {error_code: "UnexpectedError", kind:"split-duckdb-index", "details.copied_from_artifact":{$exists:false}}},{$group: {_id: {cause: "$details.cause_exception"}, count: {$sum: 1}}},{$sort: {count: -1}}])
Out of the 128617 records currently in the cache collection, these are the top causes of UnexpectedError:
[
{ _id: { cause: 'HfHubHTTPError' }, count: 4429 },
{ _id: { cause: 'HTTPException' }, count: 2570 },
{ _id: { cause: 'Error' }, count: 54 },
{ _id: { cause: 'BinderException' }, count: 41 },
{ _id: { cause: 'CatalogException' }, count: 38 },
{ _id: { cause: 'ParserException' }, count: 29 },
{ _id: { cause: 'InvalidInputException' }, count: 22 },
{ _id: { cause: 'RuntimeError' }, count: 8 },
{ _id: { cause: 'IOException' }, count: 5 },
{ _id: { cause: 'BadRequestError' }, count: 2 },
{ _id: { cause: 'NotPrimaryError' }, count: 2 },
{ _id: { cause: 'EntryNotFoundError' }, count: 2 }
]
Since this is a new job runner, most of these should be evaluated in case there is a bug in the code.
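Since the same query keeps being rerun with small variations (one kind or all kinds, with or without entries copied from a parent artifact), it could be generated by a small helper. This is only a sketch; the function name and its options are hypothetical:

```javascript
// Hypothetical helper that builds the aggregation pipeline used above.
// `kind` restricts the query to one step; `excludeCopied` drops entries
// whose error was copied from a parent artifact.
function buildCausePipeline({ kind, excludeCopied = true } = {}) {
  const match = { error_code: "UnexpectedError" };
  if (kind) match.kind = kind;
  if (excludeCopied) match["details.copied_from_artifact"] = { $exists: false };
  return [
    { $match: match },
    { $group: { _id: { cause: "$details.cause_exception" }, count: { $sum: 1 } } },
    { $sort: { count: -1 } },
  ];
}

// In mongosh, the result would be passed to aggregate():
// db.cachedResponsesBlue.aggregate(buildCausePipeline({ kind: "split-duckdb-index" }))
console.log(JSON.stringify(buildCausePipeline({ kind: "split-duckdb-index" }), null, 2));
```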
Updating list:
datasets_server_cache> db.cachedResponsesBlue.aggregate([{$match: {error_code: "UnexpectedError", "details.copied_from_artifact":{$exists:false}}},{$group: {_id: {cause: "$details.cause_exception"}, count: {$sum: 1}}},{$sort: {count: -1}}])
[
{ _id: { cause: 'AttributeError' }, count: 9876 },
{ _id: { cause: 'ClientResponseError' }, count: 6034 },
{ _id: { cause: 'DatasetGenerationError' }, count: 5674 },
{ _id: { cause: 'ParserException' }, count: 3058 },
{ _id: { cause: 'TypeError' }, count: 2689 },
{ _id: { cause: 'IOException' }, count: 1961 },
{ _id: { cause: 'InvalidInputException' }, count: 1814 },
{ _id: { cause: 'ZeroDivisionError' }, count: 1693 },
{ _id: { cause: 'FileNotFoundError' }, count: 1687 },
{ _id: { cause: 'HfHubHTTPError' }, count: 1316 },
{ _id: { cause: 'HTTPException' }, count: 1216 },
{ _id: { cause: 'NonMatchingSplitsSizesError' }, count: 1141 },
{ _id: { cause: 'EntryNotFoundError' }, count: 895 },
{ _id: { cause: 'ValueError' }, count: 827 },
{ _id: { cause: 'BinderException' }, count: 789 },
{ _id: { cause: 'KeyError' }, count: 608 },
{ _id: { cause: 'ParquetResponseEmptyError' }, count: 598 },
{ _id: { cause: 'NotImplementedError' }, count: 509 },
{ _id: { cause: 'CachedArtifactNotFoundError' }, count: 457 },
{ _id: { cause: null }, count: 370 },
{ _id: { cause: 'ReadTimeout' }, count: 329 },
{ _id: { cause: 'ConnectionError' }, count: 264 },
{ _id: { cause: 'LocationParseError' }, count: 191 },
{ _id: { cause: 'OSError' }, count: 186 },
{ _id: { cause: 'IndexError' }, count: 155 },
{ _id: { cause: 'AssertionError' }, count: 84 },
{ _id: { cause: 'BadZipFile' }, count: 63 },
{ _id: { cause: 'ArrowInvalid' }, count: 57 },
{ _id: { cause: 'OutOfRangeException' }, count: 53 },
{ _id: { cause: 'CatalogException' }, count: 44 },
{ _id: { cause: 'ModuleNotFoundError' }, count: 41 },
{ _id: { cause: 'RuntimeError' }, count: 39 },
{ _id: { cause: 'LocalEntryNotFoundError' }, count: 26 },
{ _id: { cause: 'UnboundLocalError' }, count: 26 },
{ _id: { cause: 'FileExistsError' }, count: 24 },
{ _id: { cause: 'Error' }, count: 24 },
{ _id: { cause: 'RepositoryNotFoundError' }, count: 21 },
{ _id: { cause: 'InvalidOperation' }, count: 16 },
{ _id: { cause: 'ExpectedMoreSplits' }, count: 15 },
{ _id: { cause: 'ImportError' }, count: 12 },
{ _id: { cause: 'ServerDisconnectedError' }, count: 11 },
{ _id: { cause: 'NameError' }, count: 9 },
{ _id: { cause: 'SyntaxError' }, count: 8 },
{ _id: { cause: 'PermissionError' }, count: 6 },
{ _id: { cause: 'InternalException' }, count: 5 },
{ _id: { cause: 'ChunkedEncodingError' }, count: 5 },
{ _id: { cause: 'InvalidDocument' }, count: 4 },
{ _id: { cause: 'ParserError' }, count: 3 },
{ _id: { cause: 'DoesNotExist' }, count: 3 },
{ _id: { cause: 'ConversionException' }, count: 3 },
{ _id: { cause: 'NonStreamableDatasetError' }, count: 3 },
{ _id: { cause: 'SSLError' }, count: 3 },
{ _id: { cause: 'Exception' }, count: 3 },
{ _id: { cause: 'GatedRepoError' }, count: 3 },
{ _id: { cause: 'JSONDecodeError' }, count: 2 },
{ _id: { cause: 'InvalidConfigName' }, count: 2 },
{ _id: { cause: 'FileSystemError' }, count: 1 },
{ _id: { cause: 'AutoReconnect' }, count: 1 },
{ _id: { cause: 'TypeMismatchException' }, count: 1 },
{ _id: { cause: 'HFValidationError' }, count: 1 },
{ _id: { cause: 'EOFError' }, count: 1 },
{ _id: { cause: 'OperationalError' }, count: 1 },
{ _id: { cause: 'TransactionException' }, count: 1 },
{ _id: { cause: 'NotPrimaryError' }, count: 1 },
{ _id: { cause: 'UnicodeDecodeError' }, count: 1 },
{ _id: { cause: 'OutOfMemoryException' }, count: 1 }
]
After some manual cache maintenance (removing obsolete records whose config or split no longer exists), this is the updated list. The main reductions are in AttributeError and ClientResponseError:
[
{ _id: { cause: 'DatasetGenerationError' }, count: 3791 },
{ _id: { cause: 'TypeError' }, count: 2222 },
{ _id: { cause: 'ParserException' }, count: 2095 },
{ _id: { cause: 'InvalidInputException' }, count: 1750 },
{ _id: { cause: 'FileNotFoundError' }, count: 1393 },
{ _id: { cause: 'ZeroDivisionError' }, count: 1224 },
{ _id: { cause: 'HfHubHTTPError' }, count: 1128 },
{ _id: { cause: 'NonMatchingSplitsSizesError' }, count: 1116 },
{ _id: { cause: 'IOException' }, count: 1035 },
{ _id: { cause: 'CachedArtifactNotFoundError' }, count: 745 },
{ _id: { cause: 'HTTPException' }, count: 526 },
{ _id: { cause: 'NotImplementedError' }, count: 493 },
{ _id: { cause: 'BinderException' }, count: 462 },
{ _id: { cause: 'KeyError' }, count: 454 },
{ _id: { cause: 'ReadTimeout' }, count: 311 },
{ _id: { cause: 'ParquetResponseEmptyError' }, count: 292 },
{ _id: { cause: 'ConnectionError' }, count: 201 },
{ _id: { cause: 'ValueError' }, count: 187 },
{ _id: { cause: 'AttributeError' }, count: 127 },
{ _id: { cause: 'IndexError' }, count: 107 },
{ _id: { cause: 'OSError' }, count: 102 },
{ _id: { cause: 'ClientResponseError' }, count: 94 },
{ _id: { cause: 'EntryNotFoundError' }, count: 92 },
{ _id: { cause: 'AssertionError' }, count: 84 },
{ _id: { cause: 'BadZipFile' }, count: 61 },
{ _id: { cause: 'OutOfRangeException' }, count: 46 },
{ _id: { cause: 'ModuleNotFoundError' }, count: 43 },
{ _id: { cause: 'LocationParseError' }, count: 29 },
{ _id: { cause: 'ArrowInvalid' }, count: 28 },
{ _id: { cause: 'CatalogException' }, count: 26 },
{ _id: { cause: 'LocalEntryNotFoundError' }, count: 19 },
{ _id: { cause: 'Error' }, count: 16 },
{ _id: { cause: 'ServerDisconnectedError' }, count: 9 },
{ _id: { cause: 'SyntaxError' }, count: 8 },
{ _id: { cause: 'InvalidOperation' }, count: 8 },
{ _id: { cause: 'RuntimeError' }, count: 7 },
{ _id: { cause: 'PermissionError' }, count: 6 },
{ _id: { cause: 'UnboundLocalError' }, count: 6 },
{ _id: { cause: 'NameError' }, count: 5 },
{ _id: { cause: 'NonStreamableDatasetError' }, count: 3 },
{ _id: { cause: 'Exception' }, count: 3 },
{ _id: { cause: 'ChunkedEncodingError' }, count: 3 },
{ _id: { cause: 'SSLError' }, count: 3 },
{ _id: { cause: 'ExpectedMoreSplits' }, count: 2 },
{ _id: { cause: 'ConversionException' }, count: 2 },
{ _id: { cause: null }, count: 2 },
{ _id: { cause: 'ParserError' }, count: 2 },
{ _id: { cause: 'RepositoryNotFoundError' }, count: 2 },
{ _id: { cause: 'OperationalError' }, count: 1 },
{ _id: { cause: 'UnicodeDecodeError' }, count: 1 },
{ _id: { cause: 'TransactionException' }, count: 1 },
{ _id: { cause: 'OutOfMemoryException' }, count: 1 },
{ _id: { cause: 'DoesNotExist' }, count: 1 },
{ _id: { cause: 'ImportError' }, count: 1 },
{ _id: { cause: 'HFValidationError' }, count: 1 },
{ _id: { cause: 'JSONDecodeError' }, count: 1 },
{ _id: { cause: 'EOFError' }, count: 1 },
{ _id: { cause: 'TypeMismatchException' }, count: 1 },
{ _id: { cause: 'InternalException' }, count: 1 }
]
Update of UnexpectedErrors count by kind:
db.cachedResponsesBlue.aggregate([{$match: {error_code: "UnexpectedError", "details.copied_from_artifact":{$exists:false}}},{$group: {_id: {kindkind: "$kind"}, count: {$sum: 1}}},{$sort: {count: -1}}])
[
{ _id: { kindkind: 'config-parquet-and-info' }, count: 9117 },
{ _id: { kindkind: 'split-descriptive-statistics' }, count: 6685 },
{ _id: { kindkind: 'split-duckdb-index' }, count: 591 },
{ _id: { kindkind: 'split-first-rows-from-parquet' }, count: 11 }
]
For split-first-rows-from-parquet it will be fixed with https://github.com/huggingface/datasets-server/pull/2126
Interesting that only 4 steps produce all the unexpected errors.
For KeyError, see https://github.com/huggingface/huggingface_hub/issues/1853
Current state:
db.cachedResponsesBlue.aggregate([
{$match: {error_code: "UnexpectedError"}},
{$group: {_id: {cause: "$details.cause_exception", dataset: "$dataset"}, count: {$sum: 1}}},
{$group: {_id: "$_id.cause", count: {$sum: 1}}},
{$sort: {count: -1}}
])
{ _id: 'DatasetGenerationError', count: 2767 }
{ _id: 'HfHubHTTPError', count: 795 }
{ _id: 'TypeError', count: 633 }
{ _id: 'ZeroDivisionError', count: 621 }
{ _id: 'IOException', count: 514 }
{ _id: 'ReadTimeout', count: 245 }
{ _id: 'OSError', count: 151 }
{ _id: 'BinderException', count: 127 }
{ _id: 'ConnectionError', count: 119 }
{ _id: 'ValueError', count: 103 }
{ _id: 'ParserException', count: 91 }
{ _id: 'EntryNotFoundError', count: 66 }
{ _id: 'NotImplementedError', count: 66 }
{ _id: 'FileNotFoundError', count: 60 }
{ _id: 'NonMatchingSplitsSizesError', count: 43 }
{ _id: 'BrokenPipeError', count: 39 }
{ _id: 'InvalidInputException', count: 36 }
{ _id: 'IndexError', count: 30 }
{ _id: 'OutOfRangeException', count: 30 }
{ _id: 'HTTPException', count: 21 }
{ _id: 'LocationParseError', count: 17 }
{ _id: 'RuntimeError', count: 15 }
{ _id: 'KeyError', count: 13 }
{ _id: 'BadZipFile', count: 9 }
{ _id: 'Error', count: 7 }
{ _id: 'ExpectedMoreSplits', count: 5 }
{ _id: 'ArrowInvalid', count: 5 }
{ _id: 'ConversionException', count: 4 }
{ _id: 'NameError', count: 4 }
{ _id: 'AssertionError', count: 4 }
{ _id: 'AttributeError', count: 3 }
{ _id: 'ModuleNotFoundError', count: 3 }
{ _id: 'PermissionError', count: 3 }
{ _id: 'NotPrimaryError', count: 3 }
{ _id: 'ParserError', count: 3 }
{ _id: 'ChunkedEncodingError', count: 2 }
{ _id: 'LocalEntryNotFoundError', count: 2 }
{ _id: 'RepositoryNotFoundError', count: 2 }
{ _id: 'UnboundLocalError', count: 2 }
{ _id: 'Exception', count: 2 }
{ _id: 'TypeMismatchException', count: 2 }
{ _id: 'ClientResponseError', count: 2 }
{ _id: 'JSONDecodeError', count: 1 }
{ _id: 'InvalidConfigName', count: 1 }
{ _id: 'GatedRepoError', count: 1 }
{ _id: 'CachedArtifactNotFoundError', count: 1 }
{ _id: 'HFValidationError', count: 1 }
{ _id: 'RarCannotExec', count: 1 }
{ _id: 'OutOfMemoryException', count: 1 }
{ _id: 'ImportError', count: 1 }
{ _id: 'NonStreamableDatasetError', count: 1 }
{ _id: 'OperationalError', count: 1 }
{ _id: 'SyntaxError', count: 1 }
{ _id: 'UnicodeDecodeError', count: 1 }
{ _id: 'EOFError', count: 1 }
Updated list of UnexpectedErrors by kind:
[
{ _id: { kindkind: 'config-parquet-and-info' }, count: 8500 },
{ _id: { kindkind: 'split-descriptive-statistics' }, count: 2628 },
{ _id: { kindkind: 'split-duckdb-index' }, count: 794 }
]
Current state:
db.cachedResponsesBlue.aggregate([
{$match: {error_code: "UnexpectedError"}},
{$group: {_id: {cause: "$details.cause_exception", dataset: "$dataset"}, count: {$sum: 1}}},
{$group: {_id: "$_id.cause", count: {$sum: 1}}},
{$sort: {count: -1}}
])
{ _id: 'DatasetGenerationError', count: 3963 }
{ _id: 'TypeError', count: 958 }
{ _id: 'HfHubHTTPError', count: 778 }
{ _id: 'DatasetGenerationCastError', count: 287 }
{ _id: 'OSError', count: 219 }
{ _id: 'ValueError', count: 182 }
{ _id: 'ReadTimeout', count: 172 }
{ _id: 'ParserException', count: 127 }
{ _id: 'BinderException', count: 108 }
{ _id: 'ConnectionError', count: 103 }
{ _id: 'EntryNotFoundError', count: 77 }
{ _id: 'InvalidInputException', count: 76 }
{ _id: 'IOException', count: 72 }
{ _id: 'NotImplementedError', count: 69 }
{ _id: 'FileNotFoundError', count: 59 }
{ _id: 'ComputeError', count: 57 }
{ _id: 'NonMatchingSplitsSizesError', count: 50 }
{ _id: 'ColumnNotFoundError', count: 46 }
{ _id: 'RuntimeError', count: 34 }
{ _id: 'IndexError', count: 25 }
{ _id: 'ConversionException', count: 23 }
{ _id: 'HTTPException', count: 20 }
{ _id: 'ZeroDivisionError', count: 19 }
{ _id: 'LocationParseError', count: 15 }
{ _id: 'KeyError', count: 12 }
{ _id: 'BadZipFile', count: 11 }
{ _id: 'ArrowInvalid', count: 10 }
{ _id: 'ExpectedMoreSplits', count: 8 }
{ _id: 'ParserError', count: 8 }
{ _id: 'Error', count: 8 }
{ _id: 'InvalidOperationError', count: 7 }
{ _id: 'SchemaError', count: 5 }
{ _id: 'ReadError', count: 5 }
{ _id: 'AssertionError', count: 4 }
{ _id: 'ArrowCapacityError', count: 4 }
{ _id: 'NameError', count: 4 }
{ _id: 'PermissionError', count: 3 }
{ _id: 'AttributeError', count: 3 }
{ _id: 'JSONDecodeError', count: 3 }
{ _id: 'DuplicateError', count: 2 }
{ _id: 'TypeMismatchException', count: 2 }
{ _id: 'RarCannotExec', count: 2 }
{ _id: 'UnboundLocalError', count: 2 }
{ _id: 'Exception', count: 2 }
{ _id: 'TransactionException', count: 2 }
{ _id: 'ChunkedEncodingError', count: 2 }
{ _id: 'UnicodeDecodeError', count: 2 }
{ _id: 'ClientResponseError', count: 2 }
{ _id: 'ModuleNotFoundError', count: 2 }
{ _id: 'InvalidConfigName', count: 1 }
{ _id: 'OperationalError', count: 1 }
{ _id: 'GatedRepoError', count: 1 }
{ _id: 'CachedArtifactNotFoundError', count: 1 }
{ _id: 'HFValidationError', count: 1 }
{ _id: 'ImportError', count: 1 }
{ _id: 'OutOfRangeException', count: 1 }
{ _id: 'NonStreamableDatasetError', count: 1 }
{ _id: 'NotPrimaryError', count: 1 }
{ _id: 'RepositoryNotFoundError', count: 1 }
{ _id: 'LocalEntryNotFoundError', count: 1 }
db.cachedResponsesBlue.aggregate([{$match: {error_code: "UnexpectedError", "details.copied_from_artifact":{$exists:false}}},{$group: {_id: {kindkind: "$kind"}, count: {$sum: 1}}},{$sort: {count: -1}}])
{ _id: { kindkind: 'config-parquet-and-info' }, count: 9338 }
{ _id: { kindkind: 'split-descriptive-statistics' }, count: 2868 }
{ _id: { kindkind: 'split-duckdb-index' }, count: 847 }
{ _id: { kindkind: 'split-first-rows-from-parquet' }, count: 2 }
I would bet that most errors occur for datasets with a script. I propose to recreate all of these datasets... In most cases, it will create a DatasetWithScriptNotSupportedError error instead of some weird-looking error.
Number of unique datasets:
db.cachedResponsesBlue.aggregate([
{ $match: { error_code: "UnexpectedError" } },
{ $group: { _id: null, uniqueValues: { $addToSet: "$dataset" } } },
{ $project: { _id: 0, uniqueValues: 1 } },
{ $unwind: "$uniqueValues" },
{ $group: { _id: null, count: { $sum: 1 } } },
{ $project: { _id: 0, count: 1 } }
]);
{ count: 7484 }
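The unique-dataset count above can probably be obtained with a shorter pipeline: group by dataset, then use the $count stage. This is a suggestion, not the query that was actually run, and it avoids accumulating every dataset name into a single array. The in-memory function below only illustrates what is being counted; its name and the sample documents are made up:

```javascript
// Suggested alternative pipeline (not what was run in this thread).
const uniqueDatasetsPipeline = [
  { $match: { error_code: "UnexpectedError" } },
  { $group: { _id: "$dataset" } }, // one document per distinct dataset
  { $count: "count" },             // number of documents from the stage above
];

// In-memory illustration of the same count.
function countUniqueDatasets(entries) {
  const datasets = new Set();
  for (const entry of entries) {
    if (entry.error_code === "UnexpectedError") datasets.add(entry.dataset);
  }
  return datasets.size;
}

const sampleCache = [
  { dataset: "user/a", error_code: "UnexpectedError" },
  { dataset: "user/a", error_code: "UnexpectedError" },
  { dataset: "user/b", error_code: "UnexpectedError" },
  { dataset: "user/c", error_code: "ConfigNamesError" },
];
console.log(countUniqueDatasets(sampleCache)); // 2
```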
I'm recreating the datasets one by one, with:
DATASETS=(...)
for dataset in "${DATASETS[@]}"; do curl -H "Authorization: Bearer $HF_TOKEN" -X POST "https://datasets-server.huggingface.co/admin/recreate-dataset?dataset=$dataset&priority=low"; done;
Scaled the admin service from 2 to 4, let's see if it improves something.
They are being processed at a rate of roughly 1 request per second, so we should hopefully be done in about two hours.
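The two-hour estimate can be checked with a quick calculation, assuming the 7484 datasets counted above are the ones being recreated and the rate stays near 1 request per second (the function name is just for illustration):

```javascript
// Estimated duration in hours for recreating `numDatasets` datasets
// at a steady rate of `requestsPerSecond`.
function estimateHours(numDatasets, requestsPerSecond) {
  return numDatasets / requestsPerSecond / 3600;
}

console.log(estimateHours(7484, 1).toFixed(2)); // 2.08
```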
Today:
Number of datasets, by step and cause exception:
db.cachedResponsesBlue.aggregate([ { $match: { error_code: "UnexpectedError", "details.copied_from_artifact": { $exists: false } } }, { $group: { _id: { kind: "$kind", cause: "$details.cause_exception", dataset: "$dataset" }, count: { $sum: 1 }, }, }, { $group: { _id: { kind: "$_id.kind", cause: "$_id.cause" }, count: { $sum: 1 } } }, { $sort: { "_id.kind": 1, count: -1 } }, { $project: { _id: 0, kind: "$_id.kind", num_datasets: "$count", cause: "$_id.cause" } } ]);
{ kind: 'config-parquet-and-info', num_datasets: 2486, cause: 'DatasetGenerationError' }
{ kind: 'config-parquet-and-info', num_datasets: 1226, cause: 'DatasetGenerationCastError' }
{ kind: 'config-parquet-and-info', num_datasets: 575, cause: 'OSError' }
{ kind: 'config-parquet-and-info', num_datasets: 64, cause: 'ValueError' }
{ kind: 'config-parquet-and-info', num_datasets: 32, cause: 'NotImplementedError' }
{ kind: 'config-parquet-and-info', num_datasets: 30, cause: 'NonMatchingSplitsSizesError' }
{ kind: 'config-parquet-and-info', num_datasets: 18, cause: 'ZeroDivisionError' }
{ kind: 'config-parquet-and-info', num_datasets: 15, cause: 'RuntimeError' }
{ kind: 'config-parquet-and-info', num_datasets: 14, cause: 'ArrowInvalid' }
{ kind: 'config-parquet-and-info', num_datasets: 11, cause: 'HfHubHTTPError' }
{ kind: 'config-parquet-and-info', num_datasets: 8, cause: 'ParserError' }
{ kind: 'config-parquet-and-info', num_datasets: 7, cause: 'BadZipFile' }
{ kind: 'config-parquet-and-info', num_datasets: 6, cause: 'ReadError' }
{ kind: 'config-parquet-and-info', num_datasets: 5, cause: 'ArrowCapacityError' }
{ kind: 'config-parquet-and-info', num_datasets: 2, cause: 'TypeError' }
{ kind: 'config-parquet-and-info', num_datasets: 2, cause: 'IndexError' }
{ kind: 'config-parquet-and-info', num_datasets: 2, cause: 'ExpectedMoreSplits' }
{ kind: 'config-parquet-and-info', num_datasets: 2, cause: 'RarCannotExec' }
{ kind: 'config-parquet-and-info', num_datasets: 2, cause: 'JSONDecodeError' }
{ kind: 'config-parquet-and-info', num_datasets: 1, cause: 'AttributeError' }
{ kind: 'config-parquet-and-info', num_datasets: 1, cause: 'ModuleNotFoundError' }
{ kind: 'config-parquet-and-info', num_datasets: 1, cause: 'FileNotFoundError' }
{ kind: 'config-parquet-and-info', num_datasets: 1, cause: 'UnicodeDecodeError' }
{ kind: 'config-parquet-and-info', num_datasets: 1, cause: 'ConnectionError' }
{ kind: 'config-parquet-and-info', num_datasets: 1, cause: 'ImportError' }
{ kind: 'split-descriptive-statistics', num_datasets: 935, cause: 'TypeError' }
{ kind: 'split-descriptive-statistics', num_datasets: 56, cause: 'ValueError' }
{ kind: 'split-descriptive-statistics', num_datasets: 35, cause: 'ColumnNotFoundError' }
{ kind: 'split-descriptive-statistics', num_datasets: 12, cause: 'ComputeError' }
{ kind: 'split-descriptive-statistics', num_datasets: 5, cause: 'InvalidOperationError' }
{ kind: 'split-descriptive-statistics', num_datasets: 4, cause: 'SchemaError' }
{ kind: 'split-descriptive-statistics', num_datasets: 2, cause: 'DuplicateError' }
{ kind: 'split-duckdb-index', num_datasets: 123, cause: 'InvalidInputException' }
{ kind: 'split-duckdb-index', num_datasets: 109, cause: 'ParserException' }
{ kind: 'split-duckdb-index', num_datasets: 49, cause: 'IOException' }
{ kind: 'split-duckdb-index', num_datasets: 6, cause: 'ConversionException' }
{ kind: 'split-duckdb-index', num_datasets: 5, cause: 'Error' }
{ kind: 'split-duckdb-index', num_datasets: 2, cause: 'TypeMismatchException' }
{ kind: 'split-duckdb-index', num_datasets: 1, cause: 'TransactionException' }
Today:
Atlas atlas-x5jgb3-shard-0 [primary] datasets_server_cache> db.cachedResponsesBlue.aggregate([{$match: {error_code: "UnexpectedError", "details.copied_from_artifact":{$exists:false}}},{$group: {_id: {kind: "$kind"}, count: {$sum: 1}}},{$sort: {count: -1}}])
[
{ _id: { kind: 'config-parquet-and-info' }, count: 6215 },
{ _id: { kind: 'split-descriptive-statistics' }, count: 2173 },
{ _id: { kind: 'split-duckdb-index' }, count: 2034 },
{ _id: { kind: 'split-duckdb-index-010' }, count: 777 },
{ _id: { kind: 'split-first-rows' }, count: 1 }
]
Today:
Atlas atlas-x5jgb3-shard-0 [primary] datasets_server_cache> db.cachedResponsesBlue.aggregate([{$match: {error_code: "UnexpectedError", "details.copied_from_artifact":{$exists:false}}},{$group: {_id: {kind: "$kind"}, count: {$sum: 1}}},{$sort: {count: -1}}])
[
{ _id: { kind: 'config-parquet-and-info' }, count: 7373 },
{ _id: { kind: 'split-descriptive-statistics' }, count: 3808 },
{ _id: { kind: 'split-duckdb-index' }, count: 3285 },
{ _id: { kind: 'split-first-rows' }, count: 206 }
]
Today:
db.cachedResponsesBlue.aggregate([{$match: {error_code: "UnexpectedError", "details.copied_from_artifact":{$exists:false}}},{$group: {_id: {kind: "$kind"}, count: {$sum: 1}}},{$sort: {count: -1}}])
[
{ _id: { kind: 'config-parquet-and-info' }, count: 6668 },
{ _id: { kind: 'split-descriptive-statistics' }, count: 3667 },
{ _id: { kind: 'split-duckdb-index' }, count: 2941 },
{ _id: { kind: 'dataset-loading-tags' }, count: 1539 },
{ _id: { kind: 'split-first-rows' }, count: 30 }
]
The last PR (#2796) had a big impact!
UnexpectedError entries went from 72K to 20K.
They were replaced with 36K DatasetGenerationError and 12K DatasetGenerationCastError entries.
Today:
db.cachedResponsesBlue.aggregate([{$match: {error_code: "UnexpectedError", "details.copied_from_artifact":{$exists:false}}},{$group: {_id: {kind: "$kind"}, count: {$sum: 1}}},{$sort: {count: -1}}])
{ _id: { kind: 'split-duckdb-index' }, count: 2871 }
{ _id: { kind: 'dataset-compatible-libraries' }, count: 2546 }
{ _id: { kind: 'split-descriptive-statistics' }, count: 1683 }
{ _id: { kind: 'config-parquet-and-info' }, count: 1407 }
{ _id: { kind: 'split-first-rows' }, count: 68 }
{ _id: { kind: 'split-image-url-columns' }, count: 2 }
After refreshing some records:
Atlas atlas-x5jgb3-shard-0 [primary] datasets_server_cache> db.cachedResponsesBlue.aggregate([{$match: {error_code: "UnexpectedError", "details.copied_from_artifact":{$exists:false}}},{$group: {_id: {kind: "$kind"}, count: {$sum: 1}}},{$sort: {count: -1}}])
[
{ _id: { kind: 'split-duckdb-index' }, count: 1380 },
{ _id: { kind: 'config-parquet-and-info' }, count: 1171 },
{ _id: { kind: 'split-descriptive-statistics' }, count: 676 },
{ _id: { kind: 'dataset-compatible-libraries' }, count: 619 },
{ _id: { kind: 'split-first-rows' }, count: 68 },
{ _id: { kind: 'split-image-url-columns' }, count: 2 }
]
Today (Almost half of yesterday's):
Atlas atlas-x5jgb3-shard-0 [primary] datasets_server_cache> db.cachedResponsesBlue.aggregate([{$match: {error_code: "UnexpectedError", "details.copied_from_artifact":{$exists:false}}},{$group: {_id: {kind: "$kind"}, count: {$sum: 1}}},{$sort: {count: -1}}])
[
{ _id: { kind: 'split-duckdb-index' }, count: 1236 },
{ _id: { kind: 'config-parquet-and-info' }, count: 588 },
{ _id: { kind: 'split-descriptive-statistics' }, count: 301 },
{ _id: { kind: 'dataset-compatible-libraries' }, count: 209 },
{ _id: { kind: 'split-first-rows' }, count: 68 },
{ _id: { kind: 'split-image-url-columns' }, count: 2 }
]
Atlas atlas-x5jgb3-shard-0 [primary] datasets_server_cache> db.cachedResponsesBlue.countDocuments({error_code: "UnexpectedError", "details.copied_from_artifact":{$exists:false}})
2405
Today:
db.cachedResponsesBlue.aggregate([
{ $match: { error_code: "UnexpectedError", "details.copied_from_artifact": { $exists: false } } },
{
$group: {
_id: { kind: "$kind", cause: "$details.cause_exception", dataset: "$dataset" },
count: { $sum: 1 },
},
},
{ $group: { _id: { kind: "$_id.kind", cause: "$_id.cause" }, count: { $sum: 1 } } },
{ $sort: { count: -1, "_id.kind": 1 } },
{ $project: { _id: 0, kind: "$_id.kind", num_datasets: "$count", cause: "$_id.cause" } }
]);
{ kind: 'dataset-compatible-libraries', num_datasets: 1507, cause: 'FileNotFoundError' }
{ kind: 'split-duckdb-index', num_datasets: 288, cause: 'ParserException' }
{ kind: 'split-duckdb-index', num_datasets: 262, cause: 'HfHubHTTPError' }
{ kind: 'config-parquet-and-info', num_datasets: 203, cause: 'ValueError' }
{ kind: 'split-duckdb-index', num_datasets: 181, cause: 'UnidentifiedImageError' }
{ kind: 'dataset-filetypes', num_datasets: 160, cause: 'BadZipFile' }
{ kind: 'split-descriptive-statistics', num_datasets: 157, cause: 'ReadTimeout' }
{ kind: 'config-parquet-and-info', num_datasets: 148, cause: 'PermissionError' }
{ kind: 'split-duckdb-index', num_datasets: 144, cause: 'BinderException' }
{ kind: 'dataset-filetypes', num_datasets: 140, cause: 'ValueError' }
{ kind: 'split-duckdb-index', num_datasets: 134, cause: 'ReadTimeout' }
{ kind: 'split-descriptive-statistics', num_datasets: 121, cause: 'ValueError' }
{ kind: 'dataset-compatible-libraries', num_datasets: 96, cause: 'UnicodeDecodeError' }
{ kind: 'dataset-compatible-libraries', num_datasets: 93, cause: 'HfHubHTTPError' }
{ kind: 'split-duckdb-index', num_datasets: 77, cause: 'ValueError' }
{ kind: 'config-parquet-and-info', num_datasets: 73, cause: 'ArrowInvalid' }
{ kind: 'config-parquet-and-info', num_datasets: 69, cause: 'ReadTimeout' }
{ kind: 'config-parquet-and-info', num_datasets: 65, cause: 'RuntimeError' }
{ kind: 'config-parquet-and-info', num_datasets: 52, cause: 'ReadError' }
{ kind: 'split-first-rows', num_datasets: 52, cause: 'ServerDisconnectedError' }
{ kind: 'split-duckdb-index', num_datasets: 50, cause: 'SchemaError' }
{ kind: 'split-duckdb-index', num_datasets: 49, cause: 'ComputeError' }
{ kind: 'split-duckdb-index', num_datasets: 48, cause: 'InvalidInputException' }
{ kind: 'config-parquet-and-info', num_datasets: 44, cause: 'HfHubHTTPError' }
{ kind: 'split-duckdb-index', num_datasets: 42, cause: 'ColumnNotFoundError' }
{ kind: 'split-descriptive-statistics', num_datasets: 40, cause: 'ColumnNotFoundError' }
{ kind: 'split-duckdb-index', num_datasets: 40, cause: 'TypeError' }
{ kind: 'split-descriptive-statistics', num_datasets: 35, cause: 'ConnectionError' }
{ kind: 'split-duckdb-index', num_datasets: 32, cause: 'EntryNotFoundError' }
{ kind: 'dataset-filetypes', num_datasets: 31, cause: 'TypeError' }
{ kind: 'split-first-rows', num_datasets: 28, cause: 'ClientResponseError' }
{ kind: 'config-parquet-and-info', num_datasets: 25, cause: 'NonMatchingSplitsSizesError' }
{ kind: 'config-parquet-and-info', num_datasets: 24, cause: 'ArrowTypeError' }
{ kind: 'split-descriptive-statistics', num_datasets: 24, cause: 'EntryNotFoundError' }
{ kind: 'split-duckdb-index', num_datasets: 24, cause: 'ConnectionError' }
{ kind: 'config-parquet-and-info', num_datasets: 21, cause: 'FileNotFoundError' }
{ kind: 'config-parquet-and-info', num_datasets: 19, cause: 'KeyError' }
{ kind: 'dataset-filetypes', num_datasets: 19, cause: 'FileNotFoundError' }
{ kind: 'split-duckdb-index', num_datasets: 19, cause: 'DecompressionBombError' }
{ kind: 'config-parquet-and-info', num_datasets: 18, cause: 'ConnectionError' }
{ kind: 'config-parquet-and-info', num_datasets: 18, cause: 'ZeroDivisionError' }
{ kind: 'config-parquet-and-info', num_datasets: 15, cause: 'DatasetGenerationError' }
{ kind: 'config-parquet-and-info', num_datasets: 15, cause: 'BadZipFile' }
{ kind: 'config-parquet-and-info', num_datasets: 15, cause: 'IndexError' }
{ kind: 'split-descriptive-statistics', num_datasets: 14, cause: 'ComputeError' }
{ kind: 'config-parquet-and-info', num_datasets: 13, cause: 'ParserError' }
{ kind: 'config-parquet-and-info', num_datasets: 13, cause: 'NotImplementedError' }
{ kind: 'config-parquet-and-info', num_datasets: 11, cause: 'ArrowCapacityError' }
{ kind: 'dataset-filetypes', num_datasets: 11, cause: 'HfHubHTTPError' }
{ kind: 'split-duckdb-index', num_datasets: 10, cause: 'IOException' }
{ kind: 'split-first-rows', num_datasets: 10, cause: 'AttributeError' }
{ kind: 'split-first-rows', num_datasets: 9, cause: 'OSError' }
{ kind: 'split-duckdb-index', num_datasets: 8, cause: 'KeyError' }
{ kind: 'split-duckdb-index', num_datasets: 8, cause: 'ArrowInvalid' }
{ kind: 'split-first-rows', num_datasets: 8, cause: 'ArrowInvalid' }
{ kind: 'config-parquet-and-info', num_datasets: 7, cause: 'TypeError' }
{ kind: 'config-parquet-and-info', num_datasets: 7, cause: 'OSError' }
{ kind: 'split-first-rows', num_datasets: 7, cause: 'ValueError' }
{ kind: 'config-parquet-and-info', num_datasets: 6, cause: 'JSONDecodeError' }
{ kind: 'split-duckdb-index', num_datasets: 6, cause: 'InternalException' }
{ kind: 'split-image-url-columns', num_datasets: 6, cause: 'TypeError' }
{ kind: 'config-parquet-and-info', num_datasets: 5, cause: 'HTTPError' }
{ kind: 'split-duckdb-index', num_datasets: 5, cause: 'ConversionException' }
{ kind: 'config-parquet-and-info', num_datasets: 4, cause: 'DatasetGenerationCastError' }
{ kind: 'split-descriptive-statistics', num_datasets: 4, cause: 'InvalidOperationError' }
{ kind: 'split-descriptive-statistics', num_datasets: 4, cause: 'HfHubHTTPError' }
{ kind: 'split-duckdb-index', num_datasets: 4, cause: 'TypeMismatchException' }
{ kind: 'split-first-rows', num_datasets: 4, cause: 'FSTimeoutError' }
{ kind: 'config-parquet-and-info', num_datasets: 3, cause: 'UnpicklingError' }
{ kind: 'config-parquet-and-info', num_datasets: 3, cause: 'ExpectedMoreSplits' }
{ kind: 'split-duckdb-index', num_datasets: 3, cause: 'Error' }
{ kind: 'config-parquet-and-info', num_datasets: 2, cause: 'UnicodeDecodeError' }
{ kind: 'dataset-compatible-libraries', num_datasets: 2, cause: 'ValueError' }
{ kind: 'split-descriptive-statistics', num_datasets: 2, cause: 'DuplicateError' }
{ kind: 'split-descriptive-statistics', num_datasets: 2, cause: 'SchemaError' }
{ kind: 'split-descriptive-statistics', num_datasets: 2, cause: 'KeyError' }
{ kind: 'config-parquet-and-info', num_datasets: 1, cause: 'ImportError' }
{ kind: 'config-parquet-and-info', num_datasets: 1, cause: 'ChunkedEncodingError' }
{ kind: 'config-parquet-and-info', num_datasets: 1, cause: 'IsADirectoryError' }
{ kind: 'config-parquet-and-info', num_datasets: 1, cause: 'EmptyDataError' }
{ kind: 'config-parquet-and-info', num_datasets: 1, cause: 'EOFError' }
{ kind: 'dataset-compatible-libraries', num_datasets: 1, cause: 'EmptyDatasetError' }
{ kind: 'dataset-filetypes', num_datasets: 1, cause: 'ConnectionError' }
{ kind: 'split-descriptive-statistics', num_datasets: 1, cause: 'RuntimeError' }
{ kind: 'split-descriptive-statistics', num_datasets: 1, cause: 'TypeError' }
{ kind: 'split-duckdb-index', num_datasets: 1, cause: 'error' }
{ kind: 'split-duckdb-index', num_datasets: 1, cause: 'TransactionException' }
{ kind: 'split-duckdb-index', num_datasets: 1, cause: 'FileNotFoundError' }
{ kind: 'split-duckdb-index', num_datasets: 1, cause: 'OutOfMemoryException' }
{ kind: 'split-duckdb-index', num_datasets: 1, cause: 'RuntimeError' }
{ kind: 'split-first-rows', num_datasets: 1, cause: 'ClientConnectorError' }
{ kind: 'split-first-rows', num_datasets: 1, cause: 'UnicodeDecodeError' }
{ kind: 'split-first-rows', num_datasets: 1, cause: 'ClientPayloadError' }
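Because the list above is sorted by count across all steps, the rows of one step are scattered. A small sketch of how the output could be regrouped to show the top causes per step; the function name is hypothetical and the sample rows are the first five rows of the output above:

```javascript
// The first five rows of the output above.
const rows = [
  { kind: "dataset-compatible-libraries", num_datasets: 1507, cause: "FileNotFoundError" },
  { kind: "split-duckdb-index", num_datasets: 288, cause: "ParserException" },
  { kind: "split-duckdb-index", num_datasets: 262, cause: "HfHubHTTPError" },
  { kind: "config-parquet-and-info", num_datasets: 203, cause: "ValueError" },
  { kind: "split-duckdb-index", num_datasets: 181, cause: "UnidentifiedImageError" },
];

// Groups rows by step (kind) and keeps the `limit` most frequent causes.
function topCausesByKind(rows, limit = 3) {
  const byKind = new Map();
  for (const { kind, num_datasets, cause } of rows) {
    if (!byKind.has(kind)) byKind.set(kind, []);
    byKind.get(kind).push({ cause, num_datasets });
  }
  for (const causes of byKind.values()) {
    causes.sort((a, b) => b.num_datasets - a.num_datasets);
    causes.splice(limit); // drop everything after the first `limit` causes
  }
  return byKind;
}

for (const [kind, causes] of topCausesByKind(rows, 2)) {
  console.log(kind, causes.map((c) => `${c.cause} (${c.num_datasets})`).join(", "));
}
```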
Note that we currently have 14K UnexpectedError entries, which is about 0.1% of the total cache entries. So: not that crucial either. I'll reduce the priority.
Maybe more important is to replace ConfigNamesError with the underlying error (100K entries), and to make the DatasetGenerationError (50K entries) more explicit, to help users debug their data files.
I created https://github.com/huggingface/dataset-viewer/issues/3010 and https://github.com/huggingface/dataset-viewer/issues/3011