Known Words CSV Exporter with MeCab Lemmatization and Advanced Filtering
This tool allows users to extract words from their Anki collection, process them (optionally using MeCab for Japanese lemmatization), and export them to a CSV file. It's designed to help users build and maintain lists of known vocabulary, potentially for use with other language learning tools or for analysis.
Core Functionality:
The add-on provides a dialog interface to configure and execute the export process. Users can:
- Specify Anki Data Source:
  - Filter notes by note type name (e.g., "Japanese," "Basic").
  - Select the specific field within those notes that contains the words/sentences to process (e.g., "Expression," "Sentence").
  - Set a minimum card interval to only include words from mature cards.
- Manage CSV Output:
  - Read Existing CSV: Optionally load an existing "known words" CSV. The add-on expects "Word" and "Source" columns.
  - Operation Mode:
    - Update Selected CSV: Merge new Anki data with an existing CSV, adding new words, updating sources, and removing words no longer found in Anki (if they were previously marked as "anki" source).
    - Save As New CSV: Export all processed words to a new CSV file.
  - Automatic Filename Timestamping: Optionally append a _YYYY-MM-DD_HHMMSS timestamp to new CSV filenames.
- Advanced Word Processing (especially for Japanese):
  - MeCab Lemmatization:
    - If MeCab (a Japanese morphological analyzer) is available, users can choose to lemmatize words from the Anki field (e.g., "食べました" -> "食べる").
    - Custom Stopwords: Define a list of custom stopwords (lemmas) to be excluded from the export. These can either supplement the built-in stopwords or replace them entirely.
    - Part-of-Speech (POS) Filtering: Common particles, symbols, prefixes, etc., are automatically filtered out during lemmatization.
    - MeCab Test Tool: A built-in utility allows users to test the current MeCab lemmatization settings (stopwords, POS filtering) on sample Japanese text.
  - Basic Word Extraction (if MeCab is unavailable or disabled):
    - Words are extracted by splitting the field content on common delimiters and removing punctuation/HTML.
- Dictionary Filtering:
  - Optionally filter the extracted words/lemmas against a user-provided dictionary file (plain text, one word per line). Only words/lemmas present in this dictionary will be included in the final CSV.
- Settings Persistence:
  - The dialog remembers the last used settings (paths, filters, options) for convenience.
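The data-handling steps listed above can be sketched as a handful of small, pure Python functions. This is only an illustration under stated assumptions, not the add-on's actual code: all function names are hypothetical, the search string uses Anki's standard browser search syntax (the add-on may query the collection differently), and the merge rule assumes that a word already present in the CSV keeps its existing source.

```python
import html
import os
import re
from datetime import datetime

ANKI_SOURCE = "anki"  # assumed label for words that originate from Anki


def build_search_query(note_type, min_interval):
    """Anki browser search for a note type and a minimum interval in days."""
    return f'"note:{note_type}" prop:ivl>={min_interval}'


def basic_extract(field_text):
    """Fallback extraction: strip HTML tags, then split on non-word characters."""
    text = html.unescape(re.sub(r"<[^>]+>", " ", field_text))
    # \w is Unicode-aware, so kana and kanji count as word characters.
    return [w for w in re.split(r"[^\w]+", text) if w]


def filter_by_dictionary(words, dictionary_words):
    """Keep only words that appear in the user-provided dictionary."""
    allowed = set(dictionary_words)
    return [w for w in words if w in allowed]


def merge_known_words(existing, anki_words):
    """Merge Anki words into an existing {word: source} mapping."""
    merged = {}
    for word, source in existing.items():
        # Drop Anki-sourced words that are no longer found in Anki.
        if source == ANKI_SOURCE and word not in anki_words:
            continue
        merged[word] = source
    for word in sorted(anki_words):
        # Assumption: words already in the CSV keep their existing source.
        merged.setdefault(word, ANKI_SOURCE)
    return merged


def timestamped_name(path, now):
    """Insert a _YYYY-MM-DD_HHMMSS stamp before the file extension."""
    root, ext = os.path.splitext(path)
    return f"{root}{now.strftime('_%Y-%m-%d_%H%M%S')}{ext}"
```

Keeping these steps free of Anki and Qt imports makes them easy to test outside the add-on. Words with a non-Anki source survive the merge even when they are absent from the collection, which matches the "Update Selected CSV" behaviour described above.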
Key Components:
- ExportVocabCsvDialog: The main Qt dialog for user interaction and settings.
- KnownWordsProcessor: Handles the core logic of reading CSVs, fetching/processing Anki data, merging word lists, and writing output CSVs.
- MeCabProcessor: Encapsulates all MeCab-related functionality, including initialization, lemmatization, POS filtering, stopword management, and self-testing.
- Graceful degradation if MeCab is not installed or properly configured (lemmatization features will be disabled).
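A rough sketch of the kind of logic MeCabProcessor might encapsulate is shown below; it is not the add-on's implementation. It assumes the `MeCab` module comes from a binding such as mecab-python3, that tokens have already been parsed into `(surface, pos, lemma)` tuples, and that POS tags follow the IPADIC naming. The names `filter_lemmas`, `effective_stopwords`, and `EXCLUDED_POS` are hypothetical, as is the content of the excluded-POS set.

```python
try:
    import MeCab  # optional dependency; may not be installed
except ImportError:
    MeCab = None

# When False, callers should fall back to basic word extraction.
MECAB_AVAILABLE = MeCab is not None

# Hypothetical defaults using IPADIC tags: particles, symbols, prefixes.
EXCLUDED_POS = frozenset({"助詞", "記号", "接頭詞"})


def effective_stopwords(custom, builtin, replace_builtin=False):
    """Custom stopwords either supplement or replace the built-in ones."""
    if replace_builtin:
        return set(custom)
    return set(builtin) | set(custom)


def filter_lemmas(tokens, stopwords=frozenset(), excluded_pos=EXCLUDED_POS):
    """Return lemmas from (surface, pos, lemma) tuples after filtering."""
    result = []
    for surface, pos, lemma in tokens:
        if pos in excluded_pos:
            continue  # POS filtering
        # IPADIC reports "*" when no base form is known; use the surface then.
        word = lemma if lemma and lemma != "*" else surface
        if word in stopwords:
            continue  # stopword filtering operates on lemmas
        result.append(word)
    return result
```

For example, the tokens of "猫を食べました" with stopwords `{"ます", "た"}` would reduce to the lemmas 猫 and 食べる, since を is dropped as a particle. Guarding the import at module level lets the rest of the add-on check a single flag and disable the lemmatization options when MeCab is missing.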
Purpose & Use Cases:
- Creating a "known words" list for import into reading assistance tools (e.g., browser extensions that highlight known/unknown words on Japanese websites).
- Tracking vocabulary acquisition over time.
- Generating word lists for further study or analysis.
- Migrating vocabulary data between different systems.