Define a list of Station Name Suffixes, and use this to "intelligently" insert missing spaces into station names?

Open stringandstickytape opened this issue 10 years ago • 4 comments

stringandstickytape • Jan 12 '15 12:01

This might be harder than expected. For instance, Maddavo's stations file lists:

LUYTEN 674-15 => Nobleport

If that's correct, we have no way of knowing that this is a one-word station name and not "Noble" followed by the suffix " Port". But maybe this is a best-effort algorithm that can't get it right every time. Or maybe the correct fix is to keep updating station.csv and hope the problem falls away over time.

This list of station name suffixes was extracted from Maddavo's stations file:

Anderton Andrade Apology Arena Asylum Ayres Base Beacon Camp Centre Chernobyl City Claim Coliseum Colony Co-Operative Cousens Depot Dive Dixon Doc Dock Enterprise Escape Estate Exchange Exile Eyrie Folly Fort Foundation Freeport Gambit Gate Gateway Goose Halt Ham Hanger Hangout Harrison Haven Hideout High Hold Holdings Holm Home Hope Horizons Hospital Hq Hub Inheritance Installation Jao Klarix Lab Laboratory Lambada Landing Lane Legacy Lincoln Lofthus Lucas Manoevre Manwaring Market Masters Matt Mausoleum Memorial Mine Mines Mojo Mortuary Nest Orbital Orbiter Outpost Owl Park Phoenix Plant Platform Point Port Post Pride Principality Progress Prospect Reach Refinery Reformatory Relay Research Reserve Rest Retreat Ring Sanctuary Scott Settlement Shipyard Silo Spaceport Station Stop Survey Terminal Thiemann Town Vision Vista Wart Way Works Yola Young
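
A minimal sketch of what the best-effort version might look like, assuming the suffix list above is loaded into a collection. `StationNameFixer`, `InsertMissingSpace` and the `knownNames` exception set are hypothetical names for illustration, not existing RegulatedNoise code:

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical helper for illustration only; not part of RegulatedNoise.
public static class StationNameFixer
{
    // Best-effort: if a name has no space and ends with a known suffix,
    // insert a space before that suffix. Names listed in knownNames
    // (e.g. one-word names taken from station.csv) are left untouched.
    public static string InsertMissingSpace(
        string name, IEnumerable<string> suffixes, ISet<string> knownNames)
    {
        if (name.Contains(" ") || knownNames.Contains(name))
            return name;

        // Try the longest suffixes first so "Spaceport" wins over "Port".
        foreach (var suffix in suffixes.OrderByDescending(s => s.Length))
        {
            if (!name.EndsWith(suffix, StringComparison.OrdinalIgnoreCase))
                continue;

            int cut = name.Length - suffix.Length;

            // If the whole name is itself a suffix, there is nothing to split.
            // Otherwise keep the original casing on both sides of the new space.
            return cut == 0
                ? name
                : name.Substring(0, cut) + " " + name.Substring(cut);
        }

        return name; // no suffix matched
    }
}
```

With the suffixes "Port", "Spaceport" and "City", a made-up input such as "ExampleCity" becomes "Example City". "Nobleport" becomes "Noble port" unless it is in `knownNames`, which is exactly the failure case described above. Short suffixes such as "Ham" or "Way" are likely to split other legitimate one-word names in the same way.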

stringandstickytape • Jan 12 '15 13:01

Code to extract suffixes and prefixes from Maddavo's data:

        using System.Collections.Generic;
        using System.IO;
        using System.Linq;

        // The last word of each multi-word station name is a suffix; everything
        // before it (or the whole of a one-word name) is a prefix.
        var suffixes = new List<string>();
        var prefixes = new List<string>();

        using (var reader = File.OpenText("station.csv"))
        {
            reader.ReadLine(); // skip the header row

            string line;
            while ((line = reader.ReadLine()) != null)
            {
                // Naive split: assumes no commas inside quoted fields.
                var values = line.Split(',');
                if (values.Length < 2 || values[1].Length < 2)
                    continue;

                // The station name is the second column. Drop its first and last
                // character, which are assumed to be the surrounding quotes.
                var name = values[1].Substring(1, values[1].Length - 2);
                var lastSpace = name.LastIndexOf(' ');

                if (lastSpace < 0)
                {
                    prefixes.Add(name);
                }
                else
                {
                    prefixes.Add(name.Substring(0, lastSpace));
                    suffixes.Add(name.Substring(lastSpace + 1));
                }
            }
        }

        // Write each list out de-duplicated and sorted, one entry per line.
        File.WriteAllLines("suffixes.txt", suffixes.Distinct().OrderBy(x => x));
        File.WriteAllLines("prefixes.txt", prefixes.Distinct().OrderBy(x => x));

stringandstickytape • Jan 12 '15 13:01

Hm. Now that dumb bug is fixed, we should reassess. This may not be necessary at all: Tesseract is pretty good at getting the spaces right if no-one Replaces them back out again...

stringandstickytape • Jan 12 '15 22:01

I've still had some stations missing the spaces, but now it's only about 5-10% of the time rather than 70%.

Lknechtli • Jan 13 '15 12:01