How to show progress bar in pipeline? [QUESTION]

Open Hansyvea opened this issue 1 year ago • 4 comments

Hi, I have been using Stanza's bulk_process to tokenize and sentence-split (ssplit) a rather large collection of texts stored in a DataFrame. How can I show a progress bar while the pipeline runs?

import stanza
import pandas as pd

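# Load the articles and take the first 1000 texts as a test sample
dummy_df = pd.read_parquet("../Data/Data_Frame/1987.parquet")
dummy = list(dummy_df.head(1000).TEXT)
# Tokenization-only pipeline (tokenize also performs sentence splitting)
nlp = stanza.Pipeline(lang='en', processors='tokenize')
# Process the whole list of strings in one call
docs = nlp.bulk_process(dummy)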
...

Hansyvea avatar Dec 17 '23 18:12 Hansyvea

Sorry, but that functionality currently does not exist (for the tokenize annotator, at least)

AngledLuffa avatar Dec 17 '23 20:12 AngledLuffa
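
Since the pipeline itself does not report progress, one possible workaround is to split the input into batches and print a counter after each batch. The sketch below is only an illustration, not part of Stanza: `process_with_progress` and its parameters are hypothetical names, the batch size is a guess to tune, and it assumes the batch function returns one result per input item (as `nlp.bulk_process` does for a list of strings in the code above).

```python
import sys
import time


def process_with_progress(items, process_batch, batch_size=32, stream=sys.stderr):
    """Apply process_batch to items in fixed-size batches and report progress.

    process_batch is any callable that takes a list and returns a list of
    results (hypothetical helper, not a Stanza API).
    """
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    results = []
    total = len(items)
    start = time.monotonic()
    for begin in range(0, total, batch_size):
        batch = items[begin:begin + batch_size]
        results.extend(process_batch(batch))
        done = min(begin + batch_size, total)
        elapsed = time.monotonic() - start
        # "\r" rewrites the same terminal line, giving a simple progress display
        stream.write(f"\rprocessed {done}/{total} ({100 * done / total:.0f}%) in {elapsed:.1f}s")
        stream.flush()
    if total:
        stream.write("\n")
    return results


# Assumed usage with the pipeline from the question:
#   docs = process_with_progress(dummy, nlp.bulk_process, batch_size=64)
```

A progress-bar library such as tqdm could replace the manual writes by wrapping the `range(...)` loop. Smaller batches update the display more often, but they may reduce throughput.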

Sorry, but that functionality currently does not exist (for the tokenize annotator, at least)

Thank you for your timely reply. Another question came up while I was test-running the tokenization pipeline: the GPU utilization is rather low (5% to 14% on my RTX 3070). How can I increase GPU usage to make the whole process faster?

The data is the NYT Annotated Corpus, and there are 100,000 articles in the DataFrame dummy_df. I started the pipeline on the whole DataFrame before going to sleep, only to find that the program had crashed for an unknown reason.

Hansyvea avatar Dec 18 '23 03:12 Hansyvea

Certain annotators do quite a bit of their processing on the CPU. Fixing that and getting better GPU utilization would be a bit of a project. It is on the list of things to do, though.

AngledLuffa avatar Dec 18 '23 04:12 AngledLuffa
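
The cause of the overnight crash is not known from this thread, so the following is only a defensive sketch: process the corpus in batches and write each batch's output to disk, so that a rerun skips batches that already finished. `process_resumable`, the `part_*.json` file naming, and the batch size are all hypothetical choices for illustration, and the batch function is assumed to return JSON-serializable data.

```python
import json
from pathlib import Path


def process_resumable(texts, process_batch, out_dir, batch_size=1000):
    """Process texts in batches, saving one JSON file per batch.

    Batches whose output file already exists are skipped, so the function
    can be rerun after a crash (hypothetical helper, not a Stanza API).
    """
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    parts = []
    for begin in range(0, len(texts), batch_size):
        part = out / f"part_{begin:09d}.json"
        if not part.exists():
            rows = process_batch(texts[begin:begin + batch_size])
            # Write to a temporary file first, then rename, so an interrupted
            # write is never mistaken for a finished batch
            tmp = part.with_suffix(".tmp")
            tmp.write_text(json.dumps(rows), encoding="utf-8")
            tmp.replace(part)
        parts.append(part)
    return parts


# Assumed usage with the pipeline from the question, converting each
# Stanza Document to plain dicts before saving:
#   process_resumable(
#       list(dummy_df.TEXT),
#       lambda batch: [doc.to_dict() for doc in nlp.bulk_process(batch)],
#       "tokenized_parts",
#   )
```

Keeping batches small also bounds how many documents are held in memory at once. Whether memory was the problem here is not established.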

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

stale[bot] avatar Mar 17 '24 11:03 stale[bot]