list of all annotated files in test datasets

Open sammlapp opened this issue 10 months ago • 4 comments

I'm accessing the evaluation datasets directly via huggingface datasets like this:

import datasets

# cache_dir is a local cache directory path defined elsewhere
ds = datasets.load_dataset(
    "DBD-research-group/BirdSet",
    "HSN",
    trust_remote_code=True,
    cache_dir=cache_dir,
)
classes = ds['test'].info.features['ebird_code'].names
l = ds['test'].info.features['ebird_code']

I can loop over the entries and get the file names, but some audio files may have been reviewed by annotators and ended up with zero annotations, so they would not appear among the entries. Does the dataset contain a list of all annotated audio files somewhere?

Thanks!

sammlapp avatar Apr 17 '25 20:04 sammlapp

Hi @sammlapp, AFAIK, we do not provide this. Could you also loop over the annotations and then subtract that set from the set of all filenames?

raphaelschwinger avatar Apr 22 '25 13:04 raphaelschwinger

How would I get all of the file names? By simply globbing the .ogg files that are downloaded when the dataset is loaded?

sammlapp avatar Apr 22 '25 14:04 sammlapp

@sammlapp, something like this should work:

import os
import pandas as pd

# Define the directory path
directory_path = '/workspace/data_birdset/HSN/downloads/extracted'

# Walk through the directory and collect all file paths
file_paths = []
for root, dirs, files in os.walk(directory_path):
    for file in files:
        file_paths.append(os.path.join(root, file))

# Store the file paths in a pandas DataFrame
file_paths_df = pd.DataFrame(file_paths, columns=['File Path'])

# Keep only the file name, not the full path
file_paths_df['File Name'] = file_paths_df['File Path'].apply(os.path.basename)
# Remove files whose name starts with XC
file_paths_df = file_paths_df[~file_paths_df['File Name'].str.startswith('XC')]
print(file_paths_df.head())
print(file_paths_df.describe())

raphaelschwinger avatar Apr 22 '25 14:04 raphaelschwinger

ok thanks - consider it a feature request from me to provide the list of annotated audio files as an attribute of dataset.info

sammlapp avatar Apr 24 '25 19:04 sammlapp
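
For reference, the set subtraction suggested earlier in the thread could look roughly like the sketch below. It is a minimal illustration, not something provided by BirdSet: the helper name `files_without_annotations` and the example paths are hypothetical, and it assumes recordings can be matched by base name alone.

```python
from pathlib import Path


def files_without_annotations(all_paths, annotated_paths):
    """Return sorted base names found on disk but absent from the annotations."""
    # Compare by base name, since the two sources may use different directories
    all_names = {Path(p).name for p in all_paths}
    annotated_names = {Path(p).name for p in annotated_paths}
    return sorted(all_names - annotated_names)


# Hypothetical example values, not real BirdSet files
all_paths = [
    "/data/extracted/a/rec_001.ogg",
    "/data/extracted/a/rec_002.ogg",
    "/data/extracted/b/rec_003.ogg",
]
annotated_paths = ["/cache/rec_001.ogg", "/cache/rec_003.ogg"]

missing = files_without_annotations(all_paths, annotated_paths)
print(missing)
```

With the variables from this thread, `all_paths` would be the `'File Path'` column of `file_paths_df`, and `annotated_paths` would come from whichever column of `ds['test']` holds the recording path (the column name is not stated here). If two recordings in different folders share a base name, they would be merged by this comparison, so a relative path may be a safer key in that case.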