List of all annotated files in test datasets
I'm accessing the evaluation datasets directly via huggingface datasets like this:
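import datasets

# cache_dir is assumed to be defined earlier as the local cache directory
ds = datasets.load_dataset(
    "DBD-research-group/BirdSet",
    "HSN",
    trust_remote_code=True,
    cache_dir=cache_dir,
)
# Feature holding the eBird species codes, and its list of class names
ebird_code_feature = ds['test'].info.features['ebird_code']
classes = ebird_code_feature.names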
I can loop over the entries and get the file names, but some audio files may have been reviewed by annotators and ended up with zero annotations, so they would not show up among the entries. Does the dataset contain a list of all annotated audio files somewhere?
Thanks!
Hi @sammlapp, AFAIK, we do not provide this. Could you loop over the annotations to collect the annotated filenames and then subtract that set from the set of all filenames?
How would I get all of the file names? Simply by globbing the .ogg files downloaded when the dataset is loaded?
@sammlapp, something like this should work:
import os
import pandas as pd

# Define the directory path
directory_path = '/workspace/data_birdset/HSN/downloads/extracted'

# Walk through the directory and collect all file paths
file_paths = []
for root, _dirs, files in os.walk(directory_path):
    for file in files:
        file_paths.append(os.path.join(root, file))

# Save the file paths in a pandas DataFrame
file_paths_df = pd.DataFrame(file_paths, columns=['File Path'])

# Keep only the file name, not the full path
file_paths_df['File Name'] = file_paths_df['File Path'].apply(os.path.basename)

# Remove files whose names start with XC
file_paths_df = file_paths_df[~file_paths_df['File Name'].str.startswith('XC')]

print(file_paths_df.head())
print(file_paths_df.describe())
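Once the file paths are collected, the subtraction step could look roughly like the sketch below. The helper name unannotated_files and the toy paths are made up for illustration. Matching on base names is an assumption that only holds if file names are unique across folders.

```python
import os


def unannotated_files(all_paths, annotated_paths):
    """Return sorted base names found on disk but absent from the annotations.

    Hypothetical helper: both arguments are iterables of path strings.
    Paths are compared by base name only, which assumes unique file names.
    """
    on_disk = {os.path.basename(p) for p in all_paths}
    annotated = {os.path.basename(p) for p in annotated_paths}
    return sorted(on_disk - annotated)


# Toy example with made-up file names (not real BirdSet files)
all_paths = ["/data/a/rec_001.ogg", "/data/a/rec_002.ogg", "/data/b/rec_003.ogg"]
annotated_paths = ["/cache/x/rec_001.ogg", "/cache/y/rec_003.ogg", "/cache/y/rec_003.ogg"]
print(unannotated_files(all_paths, annotated_paths))  # ['rec_002.ogg']
```

With the real data, all_paths could be file_paths_df['File Path'] from the snippet above, and annotated_paths would come from whichever column of ds['test'] stores the audio path. I am assuming that column is named 'filepath', so please check ds['test'].column_names first.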
OK, thanks. Consider it a feature request from me to provide the list of annotated audio files as an attribute of dataset.info.