NeMo-Curator
NeMo Curator Not Extracting Thai Language Content
Describe the bug
I have been using NeMo Curator to extract data from Common Crawl with the download_common_crawl function from nemo_curator.download. My target language is Thai, but after running an experiment that downloads data from 10 URLs in the 2024-10 snapshot, none of the extracted data contains any Thai content.
Upon further investigation, I found that the lang_detect function in nemo_curator.download does detect a substantial amount of Thai content, but the extraction step does not capture it. I am using the JusTextExtractor algorithm for extraction, and the issue appears to lie in this step, since the Thai content is identified but never extracted.
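My working hypothesis, which I have not yet verified, is that jusText simply has no Thai stop list: JusTextExtractor classifies paragraphs by stopword density, so a language without a bundled stop list cannot be extracted at all. A quick standalone check against the jusText package directly (a sketch, assuming jusText is importable inside the container):

import justext

# Languages for which jusText bundles a stop list; if "Thai" is missing here,
# there is nothing to hand to JusTextExtractor for Thai pages.
available = sorted(justext.get_stoplists())
print(available)
print("Thai" in available)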
Steps/Code to reproduce bug
This code runs in a Jupyter Notebook within a NeMo Framework Docker container.
- Download and extract the data from Common Crawl 2024-10 to JSONL
Set the environment
%env DASK_DATAFRAME__QUERY_PLANNING False
from nemo_curator.download import download_common_crawl
from nemo_curator.utils.distributed_utils import get_client
client = get_client(cluster_type="cpu", n_workers=10)
start_snapshot = "2024-10"
end_snapshot = "2024-10"
output_directory = "/my/working/dir/data"
url_limit = 10
common_crawl = download_common_crawl(
output_directory, start_snapshot, end_snapshot, url_limit=url_limit
)
dataset = common_crawl.df.compute()
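The extracted records are also written as JSONL files under output_directory. A quick way to peek at what landed on disk (a sketch; I am assuming the default JSONL output and that the files sit directly in the output directory):

import glob, json

# List the extracted JSONL files and show the fields of the first record.
files = sorted(glob.glob("/my/working/dir/data/*.jsonl"))
print(files[:3])
if files:
    with open(files[0]) as f:
        print(json.loads(f.readline()).keys())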
- Preview the dataset to explore the languages that have been extracted
dataset
Print all the unique languages
unique_languages = dataset['language'].unique()
print(unique_languages)
['VIETNAMESE' 'ENGLISH' 'DUTCH' 'POLISH' 'RUSSIAN' 'FRENCH' 'LITHUANIAN' 'ROMANIAN' 'INDONESIAN' 'CATALAN' 'SPANISH' 'GREEK' 'HUNGARIAN' 'GALICIAN' 'GERMAN' 'PORTUGUESE' 'NORWEGIAN' 'UKRAINIAN' 'HINDI' 'ITALIAN' 'ARABIC' 'PERSIAN' 'BULGARIAN' 'ARMENIAN' 'TURKISH' 'TAMIL' 'KOREAN' 'SWEDISH' 'CZECH' 'SLOVENIAN' 'AZERBAIJANI' 'SLOVAK' 'DANISH' 'MALAY' 'FINNISH' 'ESTONIAN' 'CROATIAN' 'LATVIAN' 'HEBREW' 'GEORGIAN' 'NEPALI' 'SERBIAN' 'MACEDONIAN' 'BENGALI' 'AFRIKAANS' 'TELUGU' 'MALAYALAM' 'BOSNIAN' 'ALBANIAN' 'CEBUANO' 'BASQUE' 'ESPERANTO' 'SWAHILI' 'URDU' 'KANNADA' 'KAZAKH' 'UZBEK' 'KYRGYZ' 'LUXEMBOURGISH' 'BELARUSIAN' 'ICELANDIC' 'TAGALOG' 'MARATHI' 'WELSH' 'NORWEGIAN_N' 'GUJARATI' 'IRISH' 'MALTESE' 'BRETON' 'YORUBA' 'IGBO' 'OCCITAN' 'JAVANESE' 'TURKMEN']
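THAI does not appear in this list. Per-language record counts tell the same story (a sketch using the same dataset DataFrame):

# Count extracted records per detected language; no THAI rows are present.
print(dataset["language"].value_counts())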
Filter the column "language" to find the Thai language data
filtered_df = dataset[dataset['language'] == 'THAI']
filtered_df
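Given the unique-language list above, this filter comes back empty. To rule out a simple casing mismatch in the label, a case-insensitive check can be used (a sketch; it is also empty, per the list above):

# Case-insensitive filter on the language label; empty, per the unique-language list.
print(len(dataset[dataset["language"].str.upper() == "THAI"]))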
Expected behavior
I expected NeMo Curator to extract Thai content from the specified URLs in the Common Crawl dataset, particularly since the lang_detect function is detecting Thai. Using the JusTextExtractor algorithm, the extracted dataset should contain Thai text corresponding to the detected language. I anticipated that the extracted output files would include Thai content for further analysis, but none of the Thai text appears in the current extraction.
Environment overview (please complete the following information)
- Environment location: NeMo Framework Docker container running on a virtual machine with an H100 GPU
- Method of NeMo-Curator install:
Docker pull
docker pull nvcr.io/nvidia/nemo:dev
Docker run
docker run \
-it \
-d \
--name beam_nemo \
-v /my/workspace/data:/workspace/data \
nvcr.io/nvidia/nemo:dev
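For completeness, the installed NeMo Curator version inside the container can be checked with the snippet below (a sketch; I am assuming the distribution is named nemo-curator):

from importlib.metadata import version

# Print the installed NeMo Curator package version (distribution name assumed).
print(version("nemo-curator"))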
Additional context
I have been editing the core code in nemo_curator.download to investigate further. The modified code below confirms that the lang_detect function successfully detects Thai content within the Common Crawl records:
class CommonCrawlWARCExtractor(DocumentExtractor):

    def __init__(self, algorithm=JusTextExtractor()):
        self._stop_lists = get_stop_list_dict()
        self.algorithm = algorithm
        super().__init__()

    def extract(self, content):
        html = decode_html(content)
        if html is not None:
            # Language detection and HTML extraction
            lang = lang_detect(html)
            # Added this print statement to confirm that Thai language is detected
            if lang == "THAI":
                print(lang)
            text = None
            if lang in self._stop_lists:
                text = self.algorithm.extract_text(html, self._stop_lists[lang])
            if text is not None:
                if len(text) > 0:
                    text = "\n\n".join(text)
                    meta = {"language": lang}
                    return meta, text
                else:
                    return None, None
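The check that looks suspicious to me is lang in self._stop_lists: extract_text is only called when the detected language has an entry in the dict returned by get_stop_list_dict(), so if THAI is not a key there, text stays None and the record is silently dropped. A standalone check (a sketch; I am assuming get_stop_list_dict lives in nemo_curator.download.commoncrawl, adjust the import if your build differs):

from nemo_curator.download.commoncrawl import get_stop_list_dict

# List the languages that have stop lists available to the extractor.
stop_lists = get_stop_list_dict()
print(sorted(stop_lists.keys()))
print("THAI" in stop_lists)  # if False, extract() never calls extract_text for Thai pages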
Example output
After running dataset = common_crawl.df.compute(), the resulting DataFrame contains records for the languages listed above, but no Thai rows.