Known Words CSV Exporter with MeCab Lemmatization and Advanced Filtering

Open • tunjan opened this issue 7 months ago • 0 comments

This tool allows users to extract words from their Anki collection, process them (optionally using MeCab for Japanese lemmatization), and export them to a CSV file. It's designed to help users build and maintain lists of known vocabulary, potentially for use with other language learning tools or for analysis.

Core Functionality:

The add-on provides a dialog interface to configure and execute the export process. Users can:

  1. Specify Anki Data Source:

    • Filter notes by note type name (e.g., "Japanese," "Basic").
    • Select the specific field within those notes that contains the words/sentences to process (e.g., "Expression," "Sentence").
    • Set a minimum card interval to only include words from mature cards.
  2. Manage CSV Output:

    • Read Existing CSV: Optionally load an existing "known words" CSV. The add-on expects "Word" and "Source" columns.
    • Operation Mode:
      • Update Selected CSV: Merge new Anki data into an existing CSV: add new words, update sources, and remove words that are no longer found in Anki (only if their source was previously marked as anki).
      • Save As New CSV: Export all processed words to a new CSV file.
    • Automatic Filename Timestamping: Optionally append a _YYYY-MM-DD_HHMMSS timestamp to new CSV filenames.
  3. Advanced Word Processing (especially for Japanese):

    • MeCab Lemmatization:
      • If MeCab (a Japanese morphological analyzer) is available, users can choose to lemmatize words from the Anki field (e.g., "食べました" -> "食べる").
      • Custom Stopwords: Define a list of custom stopwords (lemmas) to be excluded from the export. These can either supplement built-in stopwords (like する, ある) or replace them entirely.
      • Part-of-Speech (POS) Filtering: Common particles, symbols, prefixes, etc., are automatically filtered out during lemmatization.
      • MeCab Test Tool: A built-in utility allows users to test the current MeCab lemmatization settings (stopwords, POS filtering) on sample Japanese text.
    • Basic Word Extraction (if MeCab is unavailable or disabled):
      • Words are extracted by splitting the field content by common delimiters and removing punctuation/HTML.
  4. Dictionary Filtering:

    • Optionally filter the extracted words/lemmas against a user-provided dictionary file (plain text, one word per line). Only words/lemmas present in this dictionary will be included in the final CSV.
  5. Settings Persistence:

    • The dialog remembers the last used settings (paths, filters, options) for convenience.
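
The "Update Selected CSV" merge and the filename timestamp described above can be sketched roughly as follows. This is a minimal illustration, not the add-on's actual code: the function names (timestamped_name, read_known_words, merge_known_words, write_known_words) are hypothetical, and the sketch assumes that rows from non-Anki sources keep their existing source value, since the issue does not spell out how sources are updated.

```python
import csv
import io
import sys
from datetime import datetime
from pathlib import Path


def timestamped_name(path, now=None):
    """Append a _YYYY-MM-DD_HHMMSS suffix before the file extension."""
    p = Path(path)
    now = now or datetime.now()
    stamp = now.strftime("%Y-%m-%d_%H%M%S")
    return str(p.with_name(f"{p.stem}_{stamp}{p.suffix}"))


def read_known_words(fileobj):
    """Read a CSV with 'Word' and 'Source' columns into {word: source}."""
    reader = csv.DictReader(fileobj)
    return {row["Word"]: row["Source"] for row in reader if row.get("Word")}


def merge_known_words(existing, anki_words, anki_source="anki"):
    """Merge Anki words into an existing word list.

    Assumed rules: rows from other sources are always kept; rows marked
    with the Anki source are kept only if still present in Anki; words
    new to the list are added with the Anki source.
    """
    merged = {}
    for word, source in existing.items():
        if source != anki_source or word in anki_words:
            merged[word] = source
    for word in anki_words:
        merged.setdefault(word, anki_source)
    return merged


def write_known_words(words, fileobj):
    """Write {word: source} back out with the expected header row."""
    writer = csv.writer(fileobj)
    writer.writerow(["Word", "Source"])
    for word, source in sorted(words.items()):
        writer.writerow([word, source])


if __name__ == "__main__":
    old_csv = io.StringIO("Word,Source\n食べる,anki\n猫,manual\n古い,anki\n")
    merged = merge_known_words(read_known_words(old_csv), {"食べる", "犬"})
    write_known_words(merged, sys.stdout)
```

Keeping the merge as a pure function over a dictionary makes the removal rule easy to test without touching an Anki collection or the file system.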

Key Components:

  • ExportVocabCsvDialog: The main Qt dialog for user interaction and settings.
  • KnownWordsProcessor: Handles the core logic of reading CSVs, fetching/processing Anki data, merging word lists, and writing output CSVs.
  • MeCabProcessor: Encapsulates all MeCab-related functionality, including initialization, lemmatization, POS filtering, stopword management, and self-testing.
  • Graceful degradation: if MeCab is not installed or not properly configured, lemmatization features are disabled.
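
One way the MeCab fallback, POS and stopword filtering, and dictionary filtering might fit together is sketched below. It is an illustration under stated assumptions rather than the add-on's implementation: it assumes the Python MeCab bindings (e.g. the mecab-python3 package) with an IPADIC-style dictionary, where the first feature is the part of speech and the seventh is the base form. The stopword and POS sets and all function names are hypothetical.

```python
import html
import re

try:
    import MeCab  # optional dependency; may be missing
except ImportError:
    MeCab = None

# Assumed defaults for illustration; the add-on's real lists are not shown.
DEFAULT_STOPWORDS = frozenset({"する", "ある"})
# IPADIC tags: particles, auxiliary verbs, symbols, prefixes.
EXCLUDED_POS = frozenset({"助詞", "助動詞", "記号", "接頭詞"})


def make_tagger():
    """Return a MeCab tagger, or None if MeCab is missing or misconfigured."""
    if MeCab is None:
        return None
    try:
        return MeCab.Tagger()
    except RuntimeError:  # typically raised when no dictionary is found
        return None


def lemmas_from_mecab_output(parsed, stopwords=DEFAULT_STOPWORDS):
    """Extract lemmas from MeCab's default text output (IPADIC layout assumed)."""
    lemmas = []
    for line in parsed.splitlines():
        if line == "EOS" or "\t" not in line:
            continue
        surface, features = line.split("\t", 1)
        fields = features.split(",")
        if fields[0] in EXCLUDED_POS:
            continue
        # Fall back to the surface form when no base form is given.
        lemma = fields[6] if len(fields) > 6 and fields[6] != "*" else surface
        if lemma not in stopwords:
            lemmas.append(lemma)
    return lemmas


def basic_words(field_text):
    """Fallback extraction: strip HTML tags and punctuation, split on whitespace."""
    text = html.unescape(re.sub(r"<[^>]+>", " ", field_text))
    text = re.sub(r"[^\w\s]", " ", text)
    return text.split()


def extract_words(field_text, tagger=None, dictionary=None):
    """Lemmatize with MeCab when a tagger is given, else use basic splitting."""
    if tagger is not None:
        words = lemmas_from_mecab_output(tagger.parse(field_text))
    else:
        words = basic_words(field_text)
    if dictionary is not None:
        words = [w for w in words if w in dictionary]
    return words
```

A caller would pass `make_tagger()` as the `tagger` argument; when it returns None, `extract_words` silently uses the basic path. Other dictionaries such as UniDic lay out their features differently, so the field indices would need adjusting.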

Purpose & Use Cases:

  • Creating a "known words" list for import into reading assistance tools (e.g., browser extensions that highlight known/unknown words on Japanese websites).
  • Tracking vocabulary acquisition over time.
  • Generating word lists for further study or analysis.
  • Migrating vocabulary data between different systems.

tunjan • May 08 '25 20:05