
Pipeline for inference "You seem to be using the pipelines sequentially on GPU. In order to maximize efficiency please use a dataset"

Open MLBurnham opened this issue 2 years ago • 7 comments

System Info

Transformers 4.16.2
Windows 10
Python 3.9.12
Datasets 2.2.2

Who can help?

@Narsil

Information

  • [ ] The official example scripts
  • [X] My own modified scripts

Tasks

  • [ ] An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
  • [X] My own task or dataset (give details below)

Reproduction

I'm currently using the zero-shot classification pipeline with datasets and batching. The "You seem to be using the pipelines sequentially on GPU. In order to maximize efficiency please use a dataset" warning appears on each iteration of my loop. I am using datasets and I am batching, so I can't tell whether this warning is a bug or just not descriptive enough to help me diagnose the real issue.

import pandas as pd
from tqdm import tqdm
from datasets import Dataset
from transformers import pipeline
from transformers.pipelines.pt_utils import KeyDataset

# initialize pipeline
classifier = pipeline("zero-shot-classification", model='MoritzLaurer/DeBERTa-v3-large-mnli-fever-anli-ling-wanli', device=0, batch_size=24)
# convert pandas df to dataset
dataset = Dataset.from_pandas(data)

# loop through documents according to subsamples that contain the target name in the text
for i in tqdm(range(len(targets)), desc="Classifying docs"):
    target = targets[i]
    # define template
    template = 'The author of this doc {} ' + target + '.'
    # get the subset of samples that contain the target
    samples = dataset.filter(lambda row: row[target] == 1)
    # use classifier to get predictions for each sample
    res = []
    for result in classifier(KeyDataset(samples, 'text'), labels, hypothesis_template=template, multi_label=False, batch_size=32):
        res.append(result)
    # add results to pandas df
    data.loc[data[target] == 1, label_col_names[i]] = pd.Series([label['labels'][0] for label in res], index=data.index[data[target] == 1])

As a side note, I appear to be getting significantly worse performance when using datasets and batching vs. just converting samples to a list and classifying sequentially. I'm assuming that's just a function of my data and not related to any bug though.

Expected behavior

Batched classification without the "You seem to be using the pipelines sequentially on GPU. In order to maximize efficiency please use a dataset" warning.

MLBurnham avatar Mar 27 '23 05:03 MLBurnham

Hey, there are a few things:

First:

  • I cannot really reproduce your example since your data is missing, meaning I'm not able to see exactly what's going on for your particular case.

Second:

There are 2 things at play: streaming vs. n-calls, and batching vs. no batching. Streaming is always better than doing n calls on a GPU because, in the streaming fashion, we can make use of a torch DataLoader, meaning a separate thread handles data preparation, which should keep the GPU busier. However, this has the most significant impact when the actual GPU runtime is small (making the CPU overhead more visible).
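
Concretely, the two call patterns look like this (a minimal sketch, assuming a CUDA device at index 0 and reusing the model from your snippet):

from transformers import pipeline

classifier = pipeline(
    "zero-shot-classification",
    model="MoritzLaurer/DeBERTa-v3-large-mnli-fever-anli-ling-wanli",
    device=0,  # assumes a CUDA device at index 0
)
labels = ["politics", "science"]
texts = ["This is a test"] * 100

# n-calls: one __call__ per item, so CPU pre/post-processing runs between
# forward passes and the GPU idles in the gaps
results = [classifier(text, candidate_labels=labels) for text in texts]

# streaming: a single __call__ over a generator lets the pipeline feed the GPU
# from a DataLoader worker thread while the main thread consumes results
results = list(classifier((t for t in texts), candidate_labels=labels))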

The second is batching, which is not automatically a win: https://huggingface.co/docs/transformers/main_classes/pipelines#pipeline-batching

In your particular case, using a GTX 970 this is what I get:

No batching, streaming
100%|████████████████████████████████████████████████████████████████████████████████████████| 100/100 [00:15<00:00,  6.50it/s]
Batching, streaming
100%|████████████████████████████████████████████████████████████████████████████████████████| 100/100 [00:03<00:00, 32.92it/s]
No batching, no streaming
  8%|███████▏                                                                                  | 8/100 [00:01<00:14,  6.55it/s]/home/nicolas/src/transformers/src/transformers/pipelines/base.py:1070: UserWarning: You seem to be using the pipelines sequentially on GPU. In order to maximize efficiency please use a dataset
  warnings.warn(
100%|████████████████████████████████████████████████████████████████████████████████████████| 100/100 [00:15<00:00,  6.55it/s]

So it seems batching is helping (understandable here: the data is extremely aligned so there's no wasted padding, and the model seems simple enough).

Script:

from transformers import pipeline
import tqdm

# initialize pipeline
classifier = pipeline(
    "zero-shot-classification",
    model="MoritzLaurer/DeBERTa-v3-large-mnli-fever-anli-ling-wanli",
    device=0,
)


candidate_labels = ["politics", "science", "fashion"]


TOTAL = 100
SENTENCE = "This is a test"


def data():
    for i in range(TOTAL):
        yield SENTENCE


print("No batching, streaming")
for result in tqdm.tqdm(classifier(data(), candidate_labels=candidate_labels), total=TOTAL):
    pass
    # print(result)
print("Batching, streaming")
for result in tqdm.tqdm(classifier(data(), candidate_labels=candidate_labels, batch_size=24), total=TOTAL):
    pass
    # print(result)
print("No batching, no streaming")
for i in tqdm.tqdm(range(TOTAL)):
    result = classifier(SENTENCE, candidate_labels=candidate_labels)
    pass
    # print(result)

Narsil avatar Mar 27 '23 07:03 Narsil

Note:

    for result in classifier(KeyDataset(samples, 'text'), labels, hypothesis_template = template, multi_label = False, batch_size = 32):

This is the line of code I'm concerned about. It's perfectly OK if there's a relatively low number of different labels (meaning a low number of datasets being created). However, if you're creating datasets with very small amounts of data, then the overhead of creating the dataset + dataloader + spawning the threads might actually kill performance here.

Narsil avatar Mar 27 '23 07:03 Narsil

Thank you for your assistance, this is all very insightful. My dataset is a set of tweets with three categories; I had assumed it was overhead slowing things down but wasn't sure.

That said, I'm still not really clear on what is triggering this warning, and it seems to be inconsistent. Passing the data via KeyDataset(), a list, or a generator like in your example all seem to trigger the warning, but never consistently. In the attached screenshot I used a generator and the warning wasn't triggered on the first two iterations of the loop, but then was triggered on the third and every iteration thereafter.

I once passed the data as a list and the warning wasn't triggered on any iteration of the loop, but when I refreshed the data and re-ran the loop with no changes it was triggered on the second and all subsequent iterations.

Below I've shared the complete code and a sample of the data if that's helpful. This version uses the generator function for batching rather than the KeyDataset() function. The warning is almost always triggered. I tried removing the classification loop from the function as well and the warning still triggered, weirdly on the 7th and 8th iteration of the loop.

import pandas as pd
from transformers import pipeline
from datasets import Dataset
from tqdm import tqdm

# initialize classifier
classifier = pipeline("zero-shot-classification", model='MoritzLaurer/DeBERTa-v3-large-mnli-fever-anli-ling-wanli', device = 1, batch_size = 16)

# define data streamer
def data_stream(samples):
    for i in range(samples.num_rows):
        yield samples['text'][i]

# classifier function with batching option
def classify_tweets(targets, labels, label_columns, classifier, data, batching=False):
    """
    Classify tweets based on given targets and labels using a HuggingFace pipeline.

    Args:
    - targets: list of targets in the data frame that will be classified
    - labels: list of labels that will be passed to the template
    - label_columns: name of the label columns
    - classifier: HuggingFace pipeline object
    - data: pandas DataFrame that contains the tweets to classify
    - batching: whether to use batching or not

    Returns:
    - pandas DataFrame with modified columns

    """

    # Create label column names
    label_col_names = [target + '_lab' for target in targets]
    data = data.copy() # suppress setting with copy warning

    # convert to huggingface dataset for batching
    dataset = Dataset.from_pandas(data) if batching else None

    # Classify tweets for each target
    for i in tqdm(range(len(targets)), desc="Classifying tweets"):
        target = targets[i]
        # define template
        template = 'The author of this tweet {} ' + target +'.'

        if batching:
            samples = dataset.filter(lambda text: text[targets[i]] == 1)
            # Use classifier to get predictions for each sample
            res = []
            for result in classifier(data_stream(samples), labels, hypothesis_template = template, multi_label = False, batch_size = 32):
                res.append(result)
        else:
            # Use classifier to get predictions from list of text samples with the target
            res = classifier(list(data.loc[data[target] == 1, 'text']), labels, hypothesis_template=template, multi_label=False)

        # Add results to dataframe
        data.loc[data[target] == 1, label_col_names[i]] = [label['labels'][0] for label in res]

    # recode results to integers
    for column in tqdm(label_col_names, desc="Re-coding results"):
        data.loc[:,column] = data[column].replace(to_replace = {'supports':-1, 'opposes':1, 'does not express an opinion about': 0})
    
    # Fill NaN values with zero
    data[label_col_names] = data[label_col_names].fillna(0)
    # Create columns for liberal and conservative classifications
    data[label_columns + '_lib'] = [1 if label <= -1 else 0 for label in data[label_col_names].sum(axis = 1)]
    data[label_columns + '_con'] = [1 if label >= 1 else 0 for label in data[label_col_names].sum(axis = 1)]

    return data

# define targets to be classified and labels to use
targets = ['Stewart', 'Oliver', 'Maddow', 'Hayes', 'O\'Donnell', 'Klein', 'Krugman', 'Thunberg']
labels = ['supports', 'opposes', 'does not express an opinion about']

lib_df = classify_tweets(targets = targets, labels = labels, label_columns = 'libmed', classifier = classifier, data = lib_df, batching=False)

libsample.csv

MLBurnham avatar Mar 27 '23 19:03 MLBurnham

The warning is generated simply after 10 different calls of the pipeline on GPU (since with streaming there's only 1 call):

https://github.com/huggingface/transformers/blob/main/src/transformers/pipelines/base.py#L1069
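
For illustration, here's a minimal sketch of the trigger (assuming a CUDA device at index 0; call_count is the counter the linked check reads):

from transformers import pipeline

classifier = pipeline(
    "zero-shot-classification",
    model="MoritzLaurer/DeBERTa-v3-large-mnli-fever-anli-ling-wanli",
    device=0,  # the check only fires when running on GPU
)
labels = ["politics", "science"]

# 12 separate __call__ invocations: the 11th pushes call_count past 10 and
# the "using the pipelines sequentially on GPU" warning is emitted
for _ in range(12):
    classifier("This is a test", candidate_labels=labels)
print(classifier.call_count)  # 12

# a single streamed call over a generator increments the counter only once
results = list(classifier((t for t in ["a", "b", "c"]), candidate_labels=labels))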

I'll look into this more thoroughly tomorrow.

Narsil avatar Mar 27 '23 20:03 Narsil

Ahh, that makes sense. So my current loop will trigger the warning regardless of whether or not I'm streaming, because it divides the data based on which hypotheses should be used. I'm not sure if there is a more appropriate triggering condition or if the wording of the warning could be tweaked. Might be worth a look though, in case there is some other poor soul out there like me thinking their data isn't properly streaming/batching.

Appreciate your help!

MLBurnham avatar Mar 28 '23 00:03 MLBurnham

Ok, I had to rework your example so that I could understand what was going on:

Ultimately I see similar results:

Batching
124it [00:24,  5.07it/s]
No Batching
124it [00:32,  3.77it/s]
Raw iteration
124it [00:34,  3.63it/s]

In terms of management, the main thing is that your n targets are actually n different datasets. With the sample I got, I don't think it's actually an issue, but with much larger datasets, iterating over the ignored values might start to become a significant overhead (especially with added targets).

I think having n different datasets, and iterating on each is perfectly OK.

In order to ignore the warning, you could just reset the call count (classifier.call_count = 0). I don't think adding a new parameter is worth the effort, since the overhead is still there and the warning can also just be safely ignored. (The warning is there mostly to avoid naive calls on each separate item, which do seem slower in my tests, even if not by much.)

from transformers import pipeline
import pandas as pd
from datasets import Dataset
from tqdm import tqdm

# initialize classifier
classifier = pipeline(
    "zero-shot-classification",
    model="MoritzLaurer/DeBERTa-v3-large-mnli-fever-anli-ling-wanli",
    device=0,
)
# define targets to be classified and labels to use
lib_df = pd.read_csv("libsample.csv")
dataset = Dataset.from_pandas(lib_df)
candidate_labels = ["supports", "opposes", "does not express an opinion about"]


def data(dataset, target):
    for row in dataset:
        if row[target]:
            yield row["text"]


# for target in ["Stewart", "Oliver", "Maddow", "Hayes", "O'Donnell", "Klein", "Krugman", "Thunberg"]:
for target in ["Stewart"]:
    hypothesis_template = "The author of this tweet {} " + target + "."
    print("Batching")
    for result in tqdm(
        classifier(
            data(dataset, target),
            candidate_labels=candidate_labels,
            hypothesis_template=hypothesis_template,
            multi_label=False,
            batch_size=32,
        ),
    ):
        pass
    print("No Batching")
    for result in tqdm(
        classifier(
            data(dataset, target),
            candidate_labels=candidate_labels,
            hypothesis_template=hypothesis_template,
            multi_label=False,
            batch_size=1,
        ),
    ):
        pass
        # print(result)
    print("Raw iteration")
    for text in tqdm(
        data(dataset, target),
    ):
        result = classifier(
            text,
            candidate_labels=candidate_labels,
            hypothesis_template=hypothesis_template,
            multi_label=False,
        )
        pass
        # print(result)
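
And if you want to silence the warning in that loop, the call_count reset I mentioned can go at the top of each target iteration (a minimal sketch reusing the names from the script above):

for target in ["Stewart"]:
    classifier.call_count = 0  # reset so the sequential-use warning never fires
    hypothesis_template = "The author of this tweet {} " + target + "."
    results = list(
        classifier(
            data(dataset, target),
            candidate_labels=candidate_labels,
            hypothesis_template=hypothesis_template,
            multi_label=False,
            batch_size=32,
        )
    )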

Narsil avatar Mar 28 '23 14:03 Narsil

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

Please note that issues that do not follow the contributing guidelines are likely to be ignored.

github-actions[bot] avatar Apr 26 '23 15:04 github-actions[bot]

I'm encountering a warning similar to what has been previously discussed here. The warning appears when I try to use a Transformers pipeline with a PyTorch DataLoader. My setup involves the following package versions:

transformers==4.37.2
torch==2.1.2

Here's the code snippet that reproduces the issue:

import torch
from torch.utils.data import Dataset, DataLoader
import transformers
from tqdm import tqdm

class TextDataset(Dataset):
    def __init__(self, texts):
        self.texts = texts

    def __len__(self):
        return len(self.texts)

    def __getitem__(self, idx):
        text = self.texts[idx]
        return {"text": text}

def process_dataset(data_loader, pipe):
    all_results = []
    for i, batch in tqdm(enumerate(data_loader)):
        outputs = pipe(batch['text'])
        all_results.append(outputs)
    return all_results

# Model and tokenizer initialization
model_id = "mistralai/Mixtral-8x7B-Instruct-v0.1"
tokenizer = transformers.AutoTokenizer.from_pretrained(model_id, padding=True)
model = transformers.AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    device_map='auto',
    pad_token_id=tokenizer.eos_token_id,
)

# Pipeline configuration
pipeline = transformers.pipeline(
    "text-generation",
    model=model,
    tokenizer=tokenizer,
    do_sample=True,
    return_full_text=False,   # Set to True if using with langchain
    temperature=0.1,          # Controls the 'randomness' of outputs
    top_p=0.15,               # Selects from top tokens whose cumulative probability is 15%
    top_k=0,                  # Selects from top 0 tokens, relying on top_p instead
    max_new_tokens=4096,      # Maximum number of tokens to generate
    repetition_penalty=1.1,   # Penalizes repetition in output
)

# Dataset and DataLoader setup
data = TextDataset(['sample text 1', 'sample text 2', 'sample text 3', 'sample text 4'])
dataloader = DataLoader(data, batch_size=4, shuffle=False)

# Process the dataset
results = process_dataset(dataloader, pipeline)

GonyRosenman avatar Feb 25 '24 18:02 GonyRosenman