NeMo-Curator
NeMo Curator Not Extracting Thai Language Content
Describe the bug
I have been using NeMo Curator to extract data from Common Crawl with the download_common_crawl function from nemo_curator.download. My target language is Thai, but after running an experiment that downloads data from 10 URLs in the 2024-10 snapshot, none of the extracted data contains any Thai content.
Upon further investigation, I found that the lang_detect function in nemo_curator.download does detect a substantial amount of Thai content, but the extraction step does not capture it. I am using the JusTextExtractor algorithm for extraction, and the issue appears to lie in this step, since the Thai content is identified but never extracted.
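My working hypothesis, which I have not yet verified, is that jusText simply has no Thai stop list: JusTextExtractor classifies paragraphs by stopword density, so a language without a bundled stop list cannot be extracted at all. A quick standalone check against the jusText package directly (a sketch, assuming jusText is importable inside the container):

import justext

# Languages for which jusText bundles a stop list; if "Thai" is missing here,
# there is nothing to hand to JusTextExtractor for Thai pages.
available = sorted(justext.get_stoplists())
print(available)
print("Thai" in available)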
Steps/Code to reproduce bug
This code runs in a Jupyter Notebook within a NeMo Framework Docker container.
- Download and extract the data from Common Crawl 2024-10 to JSONL
Set the environment
%env DASK_DATAFRAME__QUERY_PLANNING False
from nemo_curator.download import download_common_crawl
from nemo_curator.utils.distributed_utils import get_client
client = get_client(cluster_type="cpu", n_workers=10)
start_snapshot = "2024-10"
end_snapshot = "2024-10"
output_directory = "/my/working/dir/data"
url_limit = 10
common_crawl = download_common_crawl(
output_directory, start_snapshot, end_snapshot, url_limit=url_limit
)
dataset = common_crawl.df.compute()
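The extracted records are also written as JSONL files under output_directory. A quick way to peek at what landed on disk (a sketch; I am assuming the default JSONL output and that the files sit directly in the output directory):

import glob, json

# List the extracted JSONL files and show the fields of the first record.
files = sorted(glob.glob("/my/working/dir/data/*.jsonl"))
print(files[:3])
if files:
    with open(files[0]) as f:
        print(json.loads(f.readline()).keys())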
- Preview the dataset to explore the languages that have been extracted
dataset
Print all the unique languages
unique_languages = dataset['language'].unique()
print(unique_languages)
['VIETNAMESE' 'ENGLISH' 'DUTCH' 'POLISH' 'RUSSIAN' 'FRENCH' 'LITHUANIAN' 'ROMANIAN' 'INDONESIAN' 'CATALAN' 'SPANISH' 'GREEK' 'HUNGARIAN' 'GALICIAN' 'GERMAN' 'PORTUGUESE' 'NORWEGIAN' 'UKRAINIAN' 'HINDI' 'ITALIAN' 'ARABIC' 'PERSIAN' 'BULGARIAN' 'ARMENIAN' 'TURKISH' 'TAMIL' 'KOREAN' 'SWEDISH' 'CZECH' 'SLOVENIAN' 'AZERBAIJANI' 'SLOVAK' 'DANISH' 'MALAY' 'FINNISH' 'ESTONIAN' 'CROATIAN' 'LATVIAN' 'HEBREW' 'GEORGIAN' 'NEPALI' 'SERBIAN' 'MACEDONIAN' 'BENGALI' 'AFRIKAANS' 'TELUGU' 'MALAYALAM' 'BOSNIAN' 'ALBANIAN' 'CEBUANO' 'BASQUE' 'ESPERANTO' 'SWAHILI' 'URDU' 'KANNADA' 'KAZAKH' 'UZBEK' 'KYRGYZ' 'LUXEMBOURGISH' 'BELARUSIAN' 'ICELANDIC' 'TAGALOG' 'MARATHI' 'WELSH' 'NORWEGIAN_N' 'GUJARATI' 'IRISH' 'MALTESE' 'BRETON' 'YORUBA' 'IGBO' 'OCCITAN' 'JAVANESE' 'TURKMEN']
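THAI does not appear in this list. Per-language record counts tell the same story (a sketch using the same dataset DataFrame):

# Count extracted records per detected language; no THAI rows are present.
print(dataset["language"].value_counts())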
Filter the column "language" to find the Thai language data
filtered_df = dataset[dataset['language'] == 'THAI']
filtered_df
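Given the unique-language list above, this filter comes back empty. To rule out a simple casing mismatch in the label, a case-insensitive check can be used (a sketch; it is also empty, per the list above):

# Case-insensitive filter on the language label; empty, per the unique-language list.
print(len(dataset[dataset["language"].str.upper() == "THAI"]))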
Expected behavior
I expected NeMo Curator to extract Thai content from the specified URLs in the Common Crawl dataset, particularly since the lang_detect function is detecting Thai. Using the JusTextExtractor algorithm, the extracted dataset should contain Thai text corresponding to the detected language. I anticipated that the extracted output files would include Thai content for further analysis, but none of the Thai text appears in the current extraction.
Environment overview (please complete the following information)
- Environment location: NeMo Framework Docker container running on a virtual machine with an H100 GPU
- Method of NeMo-Curator install:
Docker pull
docker pull nvcr.io/nvidia/nemo:dev
Docker run
docker run \
-it \
-d \
--name beam_nemo \
-v /my/workspace/data:/workspace/data \
nvcr.io/nvidia/nemo:dev
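For completeness, the installed NeMo Curator version inside the container can be checked with the snippet below (a sketch; I am assuming the distribution is named nemo-curator):

from importlib.metadata import version

# Print the installed NeMo Curator package version (distribution name assumed).
print(version("nemo-curator"))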
Additional context
I have been editing the core code in nemo_curator.download to investigate further. The modified code below confirms that the lang_detect function successfully detects Thai content within the Common Crawl records:
class CommonCrawlWARCExtractor(DocumentExtractor):

    def __init__(self, algorithm=JusTextExtractor()):
        self._stop_lists = get_stop_list_dict()
        self.algorithm = algorithm
        super().__init__()

    def extract(self, content):
        html = decode_html(content)
        if html is not None:
            # Language detection and HTML extraction
            lang = lang_detect(html)
            # Added this print statement to confirm that Thai language is detected
            if lang == "THAI":
                print(lang)
            text = None
            if lang in self._stop_lists:
                text = self.algorithm.extract_text(html, self._stop_lists[lang])
            if text is not None:
                if len(text) > 0:
                    text = "\n\n".join(text)
                    meta = {"language": lang}
                    return meta, text
                else:
                    return None, None
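The check that looks suspicious to me is lang in self._stop_lists: extract_text is only called when the detected language has an entry in the dict returned by get_stop_list_dict(), so if THAI is not a key there, text stays None and the record is silently dropped. A standalone check (a sketch; I am assuming get_stop_list_dict lives in nemo_curator.download.commoncrawl, adjust the import if your build differs):

from nemo_curator.download.commoncrawl import get_stop_list_dict

# List the languages that have stop lists available to the extractor.
stop_lists = get_stop_list_dict()
print(sorted(stop_lists.keys()))
print("THAI" in stop_lists)  # if False, extract() never calls extract_text for Thai pages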
Example output
After running dataset = common_crawl.df.compute(), the resulting DataFrame contains records for the languages listed above, but no Thai rows.