VAL_CENTER index in camelyon17_dataset.py
Hi everyone,
I hope this is the right place to ask a question about the Camelyon17 dataset. My question is regarding the center-metadata indices for TEST_CENTER and VAL_CENTER, as defined in the camelyon17_dataset.py file. According to that file, the test and validation (OOD) centers are 0-indexed, with TEST_CENTER at index 2 and VAL_CENTER at index 1. My understanding is that this should correspond to the images shown in columns 5 and 4 of the paper (see the first image for reference). Is that correct?
When I naively plot the images by center label, one row per center (see the second image), I would expect rows 2 and 1 (zero-indexed) to show the test and validation (OOD) slides. Instead, the validation center index appears to be switched: the test images do correspond to index 2, but the validation (OOD) images correspond to index 4 rather than 1. Inspecting the images directly in the data/patches directory shows the same behaviour: for example, patient 96 from center 4 looks like Val (OOD), while patient 34 from center 1 looks like part of train, at least visually to a layman.
Did I misunderstand something in the indexing or do you have any clue what could be wrong? Below are the images for reference and the code I used to generate them:
Wilds slides:
Slides as per my Code:
Thanks in advance for any clarification!
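As a non-visual cross-check, the split implied by each center index can be tabulated directly from the metadata. The following is a minimal sketch, assuming a CSV with a `center` column as in metadata.csv and the two constants quoted above. `split_for_center` and `count_patches` are hypothetical helper names, the sample rows are made up, and the in-distribution validation split is ignored for simplicity.

```python
import csv
import io
from collections import Counter

# Values quoted in this post from camelyon17_dataset.py (0-indexed centers).
VAL_CENTER = 1
TEST_CENTER = 2


def split_for_center(center):
    """Return the OOD split implied by a center index; all others are 'train'.

    Simplification: the in-distribution validation split is not modelled.
    """
    if center == VAL_CENTER:
        return "val"
    if center == TEST_CENTER:
        return "test"
    return "train"


def count_patches(csv_file):
    """Count rows per (center, implied split) in a metadata-style CSV file object."""
    counts = Counter()
    for row in csv.DictReader(csv_file):
        center = int(row["center"])
        counts[(center, split_for_center(center))] += 1
    return counts


# Made-up rows with placeholder patient IDs, only to show the output shape.
sample = io.StringIO("patient,center\nA,1\nB,4\nB,4\nC,2\n")
for (center, split), n in sorted(count_patches(sample).items()):
    print(center, split, n)
```

On the real data, the same function could be called on an open handle to metadata.csv, and the result compared against the rows of the plot.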
import os
import pandas as pd
import torch
import numpy as np
from PIL import Image
import matplotlib.pyplot as plt
from collections import defaultdict
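# constants
DATA_DIR = '/data/camelyon17_v1.0'
PATCHES_DIR = os.path.join(DATA_DIR, 'patches')
METADATA_CSV = os.path.join(DATA_DIR, 'metadata.csv')
NUM_DOMAINS = 5  # Camelyon17 has five centers (indices 0-4)
NUM_CLASSES = 2  # tumor vs. no tumor
MAX_IMAGES_PER_COMBINATION = 5  # assumed value; not shown in the post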
# Load the metadata (patient IDs are kept as strings)
metadata_df = pd.read_csv(
    METADATA_CSV,
    dtype={'patient': 'str'}
)
# Get labels
y_array = torch.LongTensor(metadata_df['tumor'].values)
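# Get input image paths
# (file layout assumed to match the one used in camelyon17_dataset.py)
input_paths = [
    os.path.join(
        PATCHES_DIR,
        f'patient_{patient}_node_{node}',
        f'patch_patient_{patient}_node_{node}_x_{x}_y_{y}.png')
    for patient, node, x, y in metadata_df[['patient', 'node', 'x_coord', 'y_coord']].values
]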
# Get domains (centers)
centers = metadata_df['center'].astype(int).values
# Organize images into a dictionary keyed by (domain, class)
images_dict = defaultdict(list)
for img_path, label, domain in zip(input_paths, y_array, centers):
    key = (domain, label.item())
    # Only open images until each (domain, class) bucket is full
    if len(images_dict[key]) < MAX_IMAGES_PER_COMBINATION:
        try:
            img = Image.open(img_path).convert('RGB')
            images_dict[key].append(img)
        except Exception as e:
            print(f"Error loading image {img_path}: {e}")
# plot: one row per domain (center), columns grouped by class
fig, axes = plt.subplots(nrows=NUM_DOMAINS, ncols=NUM_CLASSES * MAX_IMAGES_PER_COMBINATION, figsize=(24, 12))
plt.subplots_adjust(wspace=0.05, hspace=0.05)
for domain_idx in range(NUM_DOMAINS):
    for class_idx in range(NUM_CLASSES):
        key = (domain_idx, class_idx)
        images = images_dict.get(key, [])
        for img_idx in range(MAX_IMAGES_PER_COMBINATION):
            col_idx = class_idx * MAX_IMAGES_PER_COMBINATION + img_idx
            ax = axes[domain_idx, col_idx]
            if img_idx < len(images):
                ax.imshow(images[img_idx])
            ax.axis('off')
            if domain_idx == 0 and img_idx == 0:
                ax.set_title(f"Class {class_idx}")
        # Add domain labels to the first image in each row
        if class_idx == 0:
            ax = axes[domain_idx, 0]
            ax.text(-30, 32, f"Domain {domain_idx}", rotation=90, va='center', fontsize=12)
            #ax.text(-150, images[0].size[1] // 2, f"Domain {domain_idx}", rotation=90, va='center', fontsize=12)
plt.show()
Or, more simply, save a few validation images through the wilds API and inspect them:
import os
from wilds import get_dataset
def save_images():
    # Create the 'images' directory if it doesn't exist
    if not os.path.exists('images'):
        os.makedirs('images')
    # Load the camelyon17 dataset
    dataset = get_dataset(dataset='camelyon17', download=True)
    # Get the validation (OOD) subset
    val_data = dataset.get_subset('val')
    # Save the first 10 images from the validation set
    for i in range(10):
        x, y, metadata = val_data[i]
        # Assumes x is a PIL image (no transform passed); file name is illustrative
        x.save(os.path.join('images', f'val_{i}.png'))

if __name__ == '__main__':
    save_images()
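To check individual patients such as the two mentioned above, one option is to list which center indices each patient ID appears with in the metadata. This is only a sketch, assuming a CSV with `patient` and `center` columns as in metadata.csv. `centers_by_patient` is a hypothetical helper name and the demo rows use placeholder IDs, not real metadata.

```python
import csv
import io
from collections import defaultdict


def centers_by_patient(csv_file):
    """Map each patient ID (kept as a string) to a sorted list of its center indices."""
    seen = defaultdict(set)
    for row in csv.DictReader(csv_file):
        seen[row["patient"]].add(int(row["center"]))
    return {patient: sorted(centers) for patient, centers in seen.items()}


# Placeholder rows, only to show how the result is shaped.
demo = io.StringIO("patient,center\n001,4\n001,4\n002,1\n")
for patient, patient_centers in sorted(centers_by_patient(demo).items()):
    print(patient, patient_centers)
```

A patient that maps to more than one center, or to a center other than the one its directory name suggests, would point to a labelling issue rather than a plotting one.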
Cheers, David
Originally posted by @David-Drexlin in https://github.com/p-lambda/wilds/discussions/163