Known Words CSV Exporter with MeCab Lemmatization and Advanced Filtering

Open tunjan opened this issue 7 months ago • 0 comments

This tool allows users to extract words from their Anki collection, process them (optionally using MeCab for Japanese lemmatization), and export them to a CSV file. It's designed to help users build and maintain lists of known vocabulary, potentially for use with other language learning tools or for analysis.

Core Functionality:

The add-on provides a dialog interface to configure and execute the export process. Users can:

Specify Anki Data Source:
- Filter notes by note type name (e.g., "Japanese," "Basic").
- Select the specific field within those notes that contains the words/sentences to process (e.g., "Expression," "Sentence").
- Set a minimum card interval so that only words from sufficiently mature cards are included.
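
As a rough illustration of how these three settings could translate into a card lookup, here is a minimal sketch that builds an Anki search string. The function name `build_search_query` is hypothetical, and the sketch assumes the note type name contains no double quotes; the issue does not say how the add-on actually queries the collection.

```python
def build_search_query(note_type: str, min_interval_days: int = 0) -> str:
    """Build an Anki search string for one note type and a minimum interval.

    Hypothetical helper: assumes note_type contains no double quotes.
    """
    # Quote the whole term so note type names with spaces still match.
    terms = [f'"note:{note_type}"']
    if min_interval_days > 0:
        # prop:ivl compares the card's current interval, measured in days.
        terms.append(f"prop:ivl>={min_interval_days}")
    return " ".join(terms)


print(build_search_query("Japanese", 21))
```

Inside Anki, a string like this could be passed to the collection's card search (for example `mw.col.find_cards(query)`), after which the chosen field would be read from each matching card's note.
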
Manage CSV Output:
- Read Existing CSV: Optionally load an existing "known words" CSV. The add-on expects "Word" and "Source" columns.
- Operation Mode:
  - Update Selected CSV: Merge new Anki data into an existing CSV: add new words, update sources, and remove words that are no longer found in Anki (only if their source was previously marked as anki).
  - Save As New CSV: Export all processed words to a new CSV file.
- Automatic Filename Timestamping: Optionally append a _YYYY-MM-DD_HHMMSS timestamp to new CSV filenames.
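
The update mode and the filename timestamp can be sketched as follows. This is one possible reading of the behaviour described above, not the add-on's actual code: the names `merge_known_words`, `timestamped_path`, and `ANKI_SOURCE` are hypothetical, and "updating sources" is interpreted here as setting the source to anki for every word currently found in Anki.

```python
from datetime import datetime
from pathlib import Path
from typing import Optional

ANKI_SOURCE = "anki"  # assumed value of the Source column for Anki-derived words


def merge_known_words(existing: dict[str, str], anki_words: set[str]) -> dict[str, str]:
    """Merge words found in Anki into an existing {word: source} mapping."""
    merged = {}
    for word, source in existing.items():
        # Drop words that came from Anki but are no longer found there.
        if source == ANKI_SOURCE and word not in anki_words:
            continue
        merged[word] = source
    for word in sorted(anki_words):
        # Assumption: every word currently found in Anki is (re)marked as anki.
        merged[word] = ANKI_SOURCE
    return merged


def timestamped_path(path: str, now: Optional[datetime] = None) -> Path:
    """Append _YYYY-MM-DD_HHMMSS to the file name, keeping the extension."""
    p = Path(path)
    stamp = (now or datetime.now()).strftime("%Y-%m-%d_%H%M%S")
    return p.with_name(f"{p.stem}_{stamp}{p.suffix}")
```

Words with any other source (for example, entries added by hand) survive the merge even when they are absent from Anki, which matches the "if previously marked as anki" condition.
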
Advanced Word Processing (especially for Japanese):
- MeCab Lemmatization:
  - If MeCab (Japanese morphological analyzer) is available, users can choose to lemmatize words from the Anki field (e.g., "食べました" -> "食べる").
  - Custom Stopwords: Define a list of custom stopwords (lemmas) to be excluded from the export. These can either supplement built-in stopwords (like する, ある) or replace them entirely.
  - Part-of-Speech (POS) Filtering: Common particles, symbols, prefixes, etc., are automatically filtered out during lemmatization.
  - MeCab Test Tool: A built-in utility allows users to test the current MeCab lemmatization settings (stopwords, POS filtering) on sample Japanese text.
- Basic Word Extraction (if MeCab is unavailable or disabled):
  - Words are extracted by splitting the field content by common delimiters and removing punctuation/HTML.
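
To make the two processing paths concrete, here is a minimal sketch of the filtering steps. It deliberately does not call MeCab: `filter_lemmas` takes tokens that are assumed to have already been parsed into `(surface, pos, lemma)` tuples. All names, the stopword defaults, and the excluded POS set (IPADIC-style labels, with auxiliary verbs included under "etc.") are assumptions for illustration, not the add-on's real lists.

```python
import html
import re

# Assumed defaults; the add-on's real lists are not shown in this issue.
BUILTIN_STOPWORDS = {"する", "ある"}
# IPADIC-style labels: particle, auxiliary verb, symbol, prefix.
EXCLUDED_POS = {"助詞", "助動詞", "記号", "接頭詞"}


def effective_stopwords(custom, replace_builtin=False):
    """Custom stopwords either supplement or fully replace the built-ins."""
    custom = set(custom)
    return custom if replace_builtin else BUILTIN_STOPWORDS | custom


def filter_lemmas(tokens, stopwords):
    """Keep unique lemmas that pass POS and stopword filtering.

    tokens: iterable of (surface, pos, lemma) tuples parsed from MeCab output.
    """
    kept = []
    for surface, pos, lemma in tokens:
        if pos in EXCLUDED_POS:
            continue
        # IPADIC reports "*" when no base form is known; use the surface then.
        word = surface if lemma in ("", "*") else lemma
        if word not in stopwords and word not in kept:
            kept.append(word)
    return kept


def basic_extract(field_text):
    """Fallback without MeCab: strip HTML, then split on delimiters."""
    text = re.sub(r"<[^>]+>", " ", field_text)  # remove HTML tags
    text = html.unescape(text)  # decode entities such as &amp;
    # Split on whitespace and common ASCII/Japanese punctuation.
    parts = re.split(r"[\s,.;:!?、。・「」()（）]+", text)
    return [p for p in parts if p]
```

With hand-written tokens for "食べました" (`食べ`/動詞/食べる, `まし`/助動詞/ます, `た`/助動詞/た), `filter_lemmas` keeps only `食べる`. Note that the fallback cannot segment unspaced Japanese text, so it is mainly useful for fields that hold one word or delimiter-separated words.
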
Dictionary Filtering:
- Optionally filter the extracted words/lemmas against a user-provided dictionary file (plain text, one word per line). Only words/lemmas present in this dictionary will be included in the final CSV.
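
Dictionary filtering amounts to a set-membership check. The sketch below assumes the dictionary is UTF-8 text with one word per line; the function names `parse_dictionary` and `filter_by_dictionary` are hypothetical.

```python
def parse_dictionary(lines):
    """Build a lookup set from an iterable of lines, one word per line."""
    # Blank lines are skipped and surrounding whitespace is ignored.
    return {line.strip() for line in lines if line.strip()}


def filter_by_dictionary(words, dictionary):
    """Keep only the words present in the dictionary, preserving order."""
    return [w for w in words if w in dictionary]
```

Because `parse_dictionary` accepts any iterable of lines, an open file can be passed directly, for example `with open(path, encoding="utf-8") as f: dictionary = parse_dictionary(f)`.
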
Settings Persistence:
- The dialog remembers the last used settings (paths, filters, options) for convenience.
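
The issue does not say where the settings are stored. As one hedged possibility, the sketch below persists them to a JSON file; the setting keys in `DEFAULTS` and both function names are hypothetical. An Anki add-on could equally rely on the add-on manager's own config storage instead.

```python
import json
from pathlib import Path

# Hypothetical setting keys; the real dialog stores paths, filters, and options.
DEFAULTS = {"note_type": "", "field": "", "min_interval": 0, "use_mecab": True}


def load_settings(path):
    """Return saved settings layered over the defaults."""
    settings = dict(DEFAULTS)
    try:
        data = json.loads(Path(path).read_text(encoding="utf-8"))
    except (FileNotFoundError, json.JSONDecodeError):
        return settings  # missing or corrupt file: fall back to defaults
    if isinstance(data, dict):
        settings.update(data)
    return settings


def save_settings(path, settings):
    """Write the settings as UTF-8 JSON so Japanese text stays readable."""
    text = json.dumps(settings, ensure_ascii=False, indent=2)
    Path(path).write_text(text, encoding="utf-8")
```

Layering saved values over defaults means a newly added option still gets a sensible value when an older settings file is loaded.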

Key Components:

ExportVocabCsvDialog: The main Qt dialog for user interaction and settings.
KnownWordsProcessor: Handles the core logic of reading CSVs, fetching/processing Anki data, merging word lists, and writing output CSVs.
MeCabProcessor: Encapsulates all MeCab-related functionality, including initialization, lemmatization, POS filtering, stopword management, and self-testing.
Graceful degradation: if MeCab is not installed or not properly configured, lemmatization features are disabled and basic word extraction remains available.
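
A common way to get this kind of graceful degradation is to guard both the import and the tagger construction. The following is a minimal sketch, assuming the `MeCab` module from the mecab-python3 package; `create_tagger` and `lemmatization_enabled` are hypothetical names, not taken from the add-on.

```python
try:
    import MeCab  # provided by the mecab-python3 package, if installed
except ImportError:
    MeCab = None


def create_tagger():
    """Return a MeCab tagger, or None if MeCab cannot be used."""
    if MeCab is None:
        return None
    try:
        return MeCab.Tagger()
    except RuntimeError:
        # MeCab is importable but failed to initialize,
        # for example because no dictionary could be found.
        return None


tagger = create_tagger()
# The dialog could use this flag to disable the lemmatization options.
lemmatization_enabled = tagger is not None
```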

Purpose & Use Cases:

Creating a "known words" list for import into reading assistance tools (e.g., browser extensions that highlight known/unknown words on Japanese websites).
Tracking vocabulary acquisition over time.
Generating word lists for further study or analysis.
Migrating vocabulary data between different systems.

May 08 '25 20:05 tunjan