Initial set-up

Open soliviantar opened this issue 2 years ago • 3 comments

Hi. I am trying to get a dictionary from the eswiktionary dump. But I am a stranger to coding, so I am probably doing lots of stuff wrong.

I downloaded the dump and created the executable, but I get an error every time I run it. I think I'm not setting the Settings.toml file correctly or that maybe I should be putting it somewhere else.

This is the output of the executable (in PowerShell 7, as admin):

PS D:\IDM\dictionary-builder-master\target\release> .\dictionary-builder.exe
dictionnary-builder will use D:\IDM\dictionary-builder-master\target\release\dump\eswiktionary-latest-pages-articles-multistream.xml
thread 'main' panicked at src\main.rs:59:54:
Unable to create file: Os { code: 5, kind: PermissionDenied, message: "Acceso denegado." }
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace
PS D:\IDM\dictionary-builder-master\target\release> .\dictionary-builder.exe RUST_BACKTRACE=1
thread 'main' panicked at src\main.rs:25:79:
called `Result::unwrap()` on an `Err` value: configuration file "RUST_BACKTRACE=1Settings.toml" not found
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace
PS D:\IDM\dictionary-builder-master\target\release>

This is my Settings.toml (which I put inside the \release\ folder now):

root="D:\\IDM\\dictionary-builder-master\\target\\release\\dico"
words_file="D:\\IDM\\dictionary-builder-master\\target\\release\\dico\\words"
excluded_words_file="D:\\IDM\\dictionary-builder-master\\target\\release\\dico\\excluded"
xml_dump="D:\\IDM\\dictionary-builder-master\\target\\release\\dump\\eswiktionary-latest-pages-articles-multistream.xml"
with_definition = true
expression = true
language_filter = true
language = "Spanish"
language_short = "es"

Any help would be appreciated.

soliviantar avatar Oct 20 '23 21:10 soliviantar

You are not doing anything wrong; the readme's instructions were missing or misleading. You must create the root dico folder yourself before running the program, and I have added that to the readme. I have also added a fix to handle the Spanish dump properly, so make sure you update your program to get it. Last but not least, I have added a warning section that spells out what can be expected from dictionary-builder, to avoid disappointment. If all goes well, with the latest eswiktionary-latest-pages-articles-multistream.xml you should end up with:
[INFO dictionary_builder] total number of entries:819815
[INFO dictionary_builder] total number of removed entries:141492
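
For readers wondering what "the folder must exist first" looks like in code, here is a minimal sketch of a program creating the missing parent folder itself before creating a file inside it. This is not dictionary-builder's actual code: the helper name `create_output_file` and the `dico_sketch` path are hypothetical, and the thread does not confirm that this would avoid the exact PermissionDenied panic shown above.

```rust
use std::fs::{self, File};
use std::io;
use std::path::Path;

/// Hypothetical helper: makes sure the parent folder of `file_path`
/// exists, then creates (or truncates) the file itself.
fn create_output_file(file_path: &Path) -> io::Result<File> {
    if let Some(parent) = file_path.parent() {
        // Does nothing if the folder is already there.
        fs::create_dir_all(parent)?;
    }
    File::create(file_path)
}

fn main() -> io::Result<()> {
    // Assumed layout, loosely mirroring `root` and `words_file` in Settings.toml.
    let words_file = std::env::temp_dir().join("dico_sketch").join("words");
    create_output_file(&words_file)?;
    println!("created {}", words_file.display());
    Ok(())
}
```

Returning `io::Result` instead of panicking also lets the caller print the offending path, which makes this kind of set-up problem easier to diagnose.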

newca12 avatar Oct 21 '23 13:10 newca12

Oh, ok, thanks! I will try that then. I had created the dico folder somewhere, I believe. I'll check it again.

Also, having an example of some extracted data on the readme would be good.

By the way, what does "expression" in the settings mean?

soliviantar avatar Oct 21 '23 13:10 soliviantar

Basically, if expression is set to true, an entry in the dump (a potential word) that contains a space is treated as an expression, which can be very wrong for some languages. If it is set to false, all entries containing spaces are simply discarded.
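
The rule described above can be sketched roughly as follows. This only illustrates the described behaviour and is not taken from the dictionary-builder source; the `Entry` enum and the `classify` function are hypothetical names.

```rust
/// How a dump entry ends up being treated in this sketch.
#[derive(Debug, PartialEq)]
enum Entry {
    Word(String),
    Expression(String),
}

/// Classifies a dump entry title. Returns `None` when the entry is discarded.
fn classify(title: &str, expression: bool) -> Option<Entry> {
    if title.contains(' ') {
        if expression {
            // expression = true: keep entries with spaces as expressions.
            Some(Entry::Expression(title.to_string()))
        } else {
            // expression = false: entries with spaces are dropped.
            None
        }
    } else {
        Some(Entry::Word(title.to_string()))
    }
}

fn main() {
    for title in ["casa", "buenos días"] {
        println!(
            "{title}: expression=true -> {:?}, expression=false -> {:?}",
            classify(title, true),
            classify(title, false)
        );
    }
}
```

With `expression = false`, "buenos días" would be dropped while "casa" is kept either way.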

newca12 avatar Oct 21 '23 13:10 newca12