
`chess.pgn.read_headers` inserts empty header entries related to newlines and empty movetext

Open MatijaSi opened this issue 1 year ago • 6 comments

I am trying to parse a largeish (7,000,000 games) PGN file using read_headers. However, it scanned only 84,039 games before stopping as if it had finished (no error message).

I managed to narrow it down to this testcase:

testcase = """
[Event "World Cup"]
[Site "Khanty-Mansiysk RUS"]
[Date "2009.11.29"]
[Round "3.4"]
[White "Li Chao2"]
[Black "Gashimov,V"]
[Result "0-1"]
[WhiteElo "2596"]
[BlackElo "2758"]
[EventDate "2009.11.21"]
[HashCode "00000000"]
[TotalPlyCount "0"]


{ Both Chinese players were late to the board for game two and were
defaulted }


0-1


[Event "World Cup"]
[Site "Khanty-Mansiysk RUS"]
[Date "2009.11.29"]
[Round "3.4"]
[White "Naiditsch,A"]
[Black "Svidler,P"]
[Result "1/2-1/2"]
[WhiteElo "2689"]
[BlackElo "2754"]
[ECO "C97"]
[Opening "Ruy Lopez"]
[Variation "closed, Chigorin defence"]
[EventDate "2009.11.21"]
[HashCode "312a25cf"]
[TotalPlyCount "121"]


1. e4 e5 2. Nf3 Nc6 3. Bb5 a6 4. Ba4 Nf6 5. O-O Be7 6. Re1 b5 7. Bb3 O-O 8.
c3 d6 9. h3 Na5 10. Bc2 c5 11. d4 Qc7 12. d5 c4 13. Nbd2 Nb7 14. Nf1 Nc5
15. Kh1 Bd7 16. Ng3 Ne8 17. Nh2 Bh4 18. Rf1 Qd8 19. Nf3 Bxg3 20. fxg3 f5
21. Ng5 h6 22. Ne6 Bxe6 23. dxe6 fxe4 24. Rxf8+ Kxf8 25. Qh5 Qf6 26. Be3
Nd3 27. Qg4 Nc7 28. Qxe4 Qxe6 29. Qh7 Qg8 30. Bxd3 cxd3 31. Qxd3 Qe6 32.
Qh7 Qg8 33. Rf1+ Ke7 34. Qg6 Rf8 35. Rd1 Rf6 36. Qg4 g5 37. Kh2 Qxa2 38.
Qc8 Ne8 39. Bb6 Kf8 40. Bd8 Rf7 41. h4 gxh4 42. Bxh4 Qc4 43. Qxa6 Kg7 44.
Qa8 Qg4 45. Ra1 Qd7 46. Qb8 Qc6 47. b4 Kg6 48. Ra8 Ng7 49. Ra2 Qd5 50. Ra6
Rd7 51. Rb6 Nf5 52. Rxb5 Qf7 53. Rb6 Kh7 54. Qa8 Nxh4 55. gxh4 Qf4+ 56. Kh1
Qxh4+ 57. Kg1 Rg7 58. Qf3 Qe1+ 59. Kh2 Qh4+ 60. Kg1 Qe1+ 61. Kh2 1/2-1/2
"""

import io
import chess.pgn

f = io.StringIO(testcase)

while True:
    headers = chess.pgn.read_headers(f)
    print(headers)

    if not headers:
        break

Which prints:

Headers(Event='World Cup', Site='Khanty-Mansiysk RUS', Date='2009.11.29', Round='3.4', White='Li Chao2', Black='Gashimov,V', Result='0-1', WhiteElo='2596', BlackElo='2758', EventDate='2009.11.21', HashCode='00000000', TotalPlyCount='0')
Headers()

MatijaSi avatar Jun 07 '24 07:06 MatijaSi

Investigating a bit further, there seems to be some issue related to newlines between games.

For example:

testcase = """
[Event "World Cup"]
[Site "Khanty-Mansiysk RUS"]
[Date "2009.11.29"]
[Round "3.4"]
[White "Li Chao2"]
[Black "Gashimov,V"]
[Result "0-1"]
[WhiteElo "2596"]
[BlackElo "2758"]
[EventDate "2009.11.21"]
[HashCode "00000000"]
[TotalPlyCount "0"]

{ Both Chinese players were late to the board for game two and were
defaulted }

0-1

[Event "World Cup"]
[Site "Khanty-Mansiysk RUS"]
[Date "2009.11.29"]
[Round "3.4"]
[White "Naiditsch,A"]
[Black "Svidler,P"]
[Result "1/2-1/2"]
[WhiteElo "2689"]
[BlackElo "2754"]
[ECO "C97"]
[Opening "Ruy Lopez"]
[Variation "closed, Chigorin defence"]
[EventDate "2009.11.21"]
[HashCode "312a25cf"]
[TotalPlyCount "121"]

1. e4 e5 2. Nf3 Nc6 3. Bb5 a6 4. Ba4 Nf6 5. O-O Be7 6. Re1 b5 7. Bb3 O-O 8.
c3 d6 9. h3 Na5 10. Bc2 c5 11. d4 Qc7 12. d5 c4 13. Nbd2 Nb7 14. Nf1 Nc5
15. Kh1 Bd7 16. Ng3 Ne8 17. Nh2 Bh4 18. Rf1 Qd8 19. Nf3 Bxg3 20. fxg3 f5
21. Ng5 h6 22. Ne6 Bxe6 23. dxe6 fxe4 24. Rxf8+ Kxf8 25. Qh5 Qf6 26. Be3
Nd3 27. Qg4 Nc7 28. Qxe4 Qxe6 29. Qh7 Qg8 30. Bxd3 cxd3 31. Qxd3 Qe6 32.
Qh7 Qg8 33. Rf1+ Ke7 34. Qg6 Rf8 35. Rd1 Rf6 36. Qg4 g5 37. Kh2 Qxa2 38.
Qc8 Ne8 39. Bb6 Kf8 40. Bd8 Rf7 41. h4 gxh4 42. Bxh4 Qc4 43. Qxa6 Kg7 44.
Qa8 Qg4 45. Ra1 Qd7 46. Qb8 Qc6 47. b4 Kg6 48. Ra8 Ng7 49. Ra2 Qd5 50. Ra6
Rd7 51. Rb6 Nf5 52. Rxb5 Qf7 53. Rb6 Kh7 54. Qa8 Nxh4 55. gxh4 Qf4+ 56. Kh1
Qxh4+ 57. Kg1 Rg7 58. Qf3 Qe1+ 59. Kh2 Qh4+ 60. Kg1 Qe1+ 61. Kh2 1/2-1/2
"""

import io
import chess.pgn

f = io.StringIO(testcase)

games = []
while True:
    headers = chess.pgn.read_headers(f)
    games.append(headers)

    if headers is None:
        break

Leads to games being (note the empty Headers() between both "real" games):

[Headers(Event='World Cup', Site='Khanty-Mansiysk RUS', Date='2009.11.29', Round='3.4', White='Li Chao2', Black='Gashimov,V', Result='0-1', WhiteElo='2596', BlackElo='2758', EventDate='2009.11.21', HashCode='00000000', TotalPlyCount='0'),
 Headers(),
 Headers(Event='World Cup', Site='Khanty-Mansiysk RUS', Date='2009.11.29', Round='3.4', White='Naiditsch,A', Black='Svidler,P', Result='1/2-1/2', WhiteElo='2689', BlackElo='2754', ECO='C97', Opening='Ruy Lopez', Variation='closed, Chigorin defence', EventDate='2009.11.21', HashCode='312a25cf', TotalPlyCount='121'),
 None]

While the file from the original issue:

testcase = """
[Event "World Cup"]
[Site "Khanty-Mansiysk RUS"]
[Date "2009.11.29"]
[Round "3.4"]
[White "Li Chao2"]
[Black "Gashimov,V"]
[Result "0-1"]
[WhiteElo "2596"]
[BlackElo "2758"]
[EventDate "2009.11.21"]
[HashCode "00000000"]
[TotalPlyCount "0"]


{ Both Chinese players were late to the board for game two and were
defaulted }


0-1


[Event "World Cup"]
[Site "Khanty-Mansiysk RUS"]
[Date "2009.11.29"]
[Round "3.4"]
[White "Naiditsch,A"]
[Black "Svidler,P"]
[Result "1/2-1/2"]
[WhiteElo "2689"]
[BlackElo "2754"]
[ECO "C97"]
[Opening "Ruy Lopez"]
[Variation "closed, Chigorin defence"]
[EventDate "2009.11.21"]
[HashCode "312a25cf"]
[TotalPlyCount "121"]


1. e4 e5 2. Nf3 Nc6 3. Bb5 a6 4. Ba4 Nf6 5. O-O Be7 6. Re1 b5 7. Bb3 O-O 8.
c3 d6 9. h3 Na5 10. Bc2 c5 11. d4 Qc7 12. d5 c4 13. Nbd2 Nb7 14. Nf1 Nc5
15. Kh1 Bd7 16. Ng3 Ne8 17. Nh2 Bh4 18. Rf1 Qd8 19. Nf3 Bxg3 20. fxg3 f5
21. Ng5 h6 22. Ne6 Bxe6 23. dxe6 fxe4 24. Rxf8+ Kxf8 25. Qh5 Qf6 26. Be3
Nd3 27. Qg4 Nc7 28. Qxe4 Qxe6 29. Qh7 Qg8 30. Bxd3 cxd3 31. Qxd3 Qe6 32.
Qh7 Qg8 33. Rf1+ Ke7 34. Qg6 Rf8 35. Rd1 Rf6 36. Qg4 g5 37. Kh2 Qxa2 38.
Qc8 Ne8 39. Bb6 Kf8 40. Bd8 Rf7 41. h4 gxh4 42. Bxh4 Qc4 43. Qxa6 Kg7 44.
Qa8 Qg4 45. Ra1 Qd7 46. Qb8 Qc6 47. b4 Kg6 48. Ra8 Ng7 49. Ra2 Qd5 50. Ra6
Rd7 51. Rb6 Nf5 52. Rxb5 Qf7 53. Rb6 Kh7 54. Qa8 Nxh4 55. gxh4 Qf4+ 56. Kh1
Qxh4+ 57. Kg1 Rg7 58. Qf3 Qe1+ 59. Kh2 Qh4+ 60. Kg1 Qe1+ 61. Kh2 1/2-1/2
"""

import io
import chess.pgn

f = io.StringIO(testcase)

games = []
while True:
    headers = chess.pgn.read_headers(f)
    games.append(headers)

    if headers is None:
        break

Leads to games being (again plenty of empties):

[Headers(Event='World Cup', Site='Khanty-Mansiysk RUS', Date='2009.11.29', Round='3.4', White='Li Chao2', Black='Gashimov,V', Result='0-1', WhiteElo='2596', BlackElo='2758', EventDate='2009.11.21', HashCode='00000000', TotalPlyCount='0'),
 Headers(),
 Headers(),
 Headers(Event='World Cup', Site='Khanty-Mansiysk RUS', Date='2009.11.29', Round='3.4', White='Naiditsch,A', Black='Svidler,P', Result='1/2-1/2', WhiteElo='2689', BlackElo='2754', ECO='C97', Opening='Ruy Lopez', Variation='closed, Chigorin defence', EventDate='2009.11.21', HashCode='312a25cf', TotalPlyCount='121'),
 Headers(),
 None]

So the code in my original report is slightly wrong: it checks whether headers is falsy:

if not headers:
    break

instead of comparing them to None:

if headers is None:
    break
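As a stopgap, the two cases can be told apart in a small wrapper. This is only a sketch: `iter_nonempty_headers` is a name I made up, and it assumes the reader returns None at end of input and an empty mapping for the spurious entries, as in the outputs above. The reader is passed in as a parameter, so `chess.pgn.read_headers` can be supplied at the call site.

```python
from typing import Callable, Iterator, Mapping, Optional, TextIO


def iter_nonempty_headers(
    handle: TextIO,
    read_headers: Callable[[TextIO], Optional[Mapping[str, str]]],
) -> Iterator[Mapping[str, str]]:
    """Yield header mappings until the reader signals end of input with None.

    Hypothetical helper, not part of python-chess. Empty mappings are
    skipped instead of being treated as the end of the file.
    """
    while True:
        headers = read_headers(handle)
        if headers is None:
            return  # real end of input
        if len(headers) == 0:
            continue  # spurious empty entry: skip it and keep reading
        yield headers
```

Usage would look like `for headers in iter_nonempty_headers(f, chess.pgn.read_headers): ...`. Note that this hides the empty entries rather than explaining them.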

However, this is probably still a bug in the library, since an empty line probably shouldn't count as an empty game. Additionally, it's somehow related to the movetext being empty, since if we provide movetext we get a different result:

testcase = """
[Event "World Cup"]
[Site "Khanty-Mansiysk RUS"]
[Date "2009.11.29"]
[Round "3.4"]
[White "Li Chao2"]
[Black "Gashimov,V"]
[Result "0-1"]
[WhiteElo "2596"]
[BlackElo "2758"]
[EventDate "2009.11.21"]
[HashCode "00000000"]
[TotalPlyCount "0"]


1. e4 e5 0-1


[Event "World Cup"]
[Site "Khanty-Mansiysk RUS"]
[Date "2009.11.29"]
[Round "3.4"]
[White "Naiditsch,A"]
[Black "Svidler,P"]
[Result "1/2-1/2"]
[WhiteElo "2689"]
[BlackElo "2754"]
[ECO "C97"]
[Opening "Ruy Lopez"]
[Variation "closed, Chigorin defence"]
[EventDate "2009.11.21"]
[HashCode "312a25cf"]
[TotalPlyCount "121"]


1. e4 e5 2. Nf3 Nc6 3. Bb5 a6 4. Ba4 Nf6 5. O-O Be7 6. Re1 b5 7. Bb3 O-O 8.
c3 d6 9. h3 Na5 10. Bc2 c5 11. d4 Qc7 12. d5 c4 13. Nbd2 Nb7 14. Nf1 Nc5
15. Kh1 Bd7 16. Ng3 Ne8 17. Nh2 Bh4 18. Rf1 Qd8 19. Nf3 Bxg3 20. fxg3 f5
21. Ng5 h6 22. Ne6 Bxe6 23. dxe6 fxe4 24. Rxf8+ Kxf8 25. Qh5 Qf6 26. Be3
Nd3 27. Qg4 Nc7 28. Qxe4 Qxe6 29. Qh7 Qg8 30. Bxd3 cxd3 31. Qxd3 Qe6 32.
Qh7 Qg8 33. Rf1+ Ke7 34. Qg6 Rf8 35. Rd1 Rf6 36. Qg4 g5 37. Kh2 Qxa2 38.
Qc8 Ne8 39. Bb6 Kf8 40. Bd8 Rf7 41. h4 gxh4 42. Bxh4 Qc4 43. Qxa6 Kg7 44.
Qa8 Qg4 45. Ra1 Qd7 46. Qb8 Qc6 47. b4 Kg6 48. Ra8 Ng7 49. Ra2 Qd5 50. Ra6
Rd7 51. Rb6 Nf5 52. Rxb5 Qf7 53. Rb6 Kh7 54. Qa8 Nxh4 55. gxh4 Qf4+ 56. Kh1
Qxh4+ 57. Kg1 Rg7 58. Qf3 Qe1+ 59. Kh2 Qh4+ 60. Kg1 Qe1+ 61. Kh2 1/2-1/2
"""

import io
import chess.pgn

f = io.StringIO(testcase)

games = []
while True:
    headers = chess.pgn.read_headers(f)
    games.append(headers)

    if headers is None:
        break

Leads to (note that now there is no Headers() between games, but one extra still got appended):

[Headers(Event='World Cup', Site='Khanty-Mansiysk RUS', Date='2009.11.29', Round='3.4', White='Li Chao2', Black='Gashimov,V', Result='0-1', WhiteElo='2596', BlackElo='2758', EventDate='2009.11.21', HashCode='00000000', TotalPlyCount='0'),
 Headers(Event='World Cup', Site='Khanty-Mansiysk RUS', Date='2009.11.29', Round='3.4', White='Naiditsch,A', Black='Svidler,P', Result='1/2-1/2', WhiteElo='2689', BlackElo='2754', ECO='C97', Opening='Ruy Lopez', Variation='closed, Chigorin defence', EventDate='2009.11.21', HashCode='312a25cf', TotalPlyCount='121'),
 Headers(),
 None]

MatijaSi avatar Jun 07 '24 08:06 MatijaSi

I have also had this problem.

If I put a blank line between the games, it works. So:

Example 1, BAD: only the first game is parsed (no blank line between games):

[Event "?"]
[Site "?"]
[Date "????.??.??"]
[Round "?"]
[White "?"]
[Black "?"]
[Result "*"]

*
[Event "?"]
[Site "?"]
[Date "????.??.??"]
[Round "?"]
[White "?"]
[Black "?"]
[Result "*"]

*

Example 2, GOOD: both games are parsed (a blank line between the games):

[Event "?"]
[Site "?"]
[Date "????.??.??"]
[Round "?"]
[White "?"]
[Black "?"]
[Result "*"]

*

[Event "?"]
[Site "?"]
[Date "????.??.??"]
[Round "?"]
[White "?"]
[Black "?"]
[Result "*"]

*

I think that both examples should work.
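Until then, one possible workaround is to normalize the text before handing it to the parser, turning input like Example 1 into input like Example 2. The sketch below is only that: `separate_games` is a hypothetical helper, and it assumes that any line starting with "[" directly after a non-blank, non-tag line begins a new game. A multi-line comment whose continuation line starts with "[" would break that assumption.

```python
from typing import Iterable, Iterator


def separate_games(lines: Iterable[str]) -> Iterator[str]:
    """Yield the input lines, adding a blank line before a tag line
    that directly follows movetext.

    Hypothetical pre-processing step, not part of python-chess.
    """
    previous = ""
    for line in lines:
        starts_tag = line.lstrip().startswith("[")
        # "Text" here means a non-blank line that is not itself a tag line.
        prev_is_text = previous.strip() != "" and not previous.lstrip().startswith("[")
        if starts_tag and prev_is_text:
            yield "\n"  # the separator that Example 1 is missing
        yield line
        previous = line
```

Input that already has the blank line passes through unchanged, so the helper can be applied to every file. The result can be joined and wrapped in io.StringIO before parsing.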

tage64 avatar Aug 21 '24 21:08 tage64

This is tricky to deal with for chess.pgn.read_game() with its current interface: It reads the file line by line, without being able to look ahead. And so with the parser at <-

[Header "A"]


1. e4

<-
[Header "B"]

a decision has to be made:

  • Guess that the game contains consecutive empty lines (not allowed!) and will continue. In this example, it would incorrectly consume the first header of the second game, which is bad.
  • Guess that the game is terminated by consecutive empty lines and is just missing a result marker like * or 1-0 (not allowed!). This terminates the game too early in your examples, which is bad.

Currently the parser always does the latter. This is not a bug, because the PGN is invalid anyway, but maybe some heuristics can be added to better deal with it.

Robustly handling all of this would require changing the API, so that the parser can look ahead one line, without necessarily consuming it. Pushing this back to 2.x, for that reason.

niklasf avatar Sep 27 '24 17:09 niklasf

Hey niklasf, here we actually have a result marker; what we are missing is the movetext. I guess a minimal testcase would be:

[Header "A"]

{ Comment }

0-1

[Header "B"]

1. e4

1-0

I haven't tried it, though, since I don't have Python on this computer.

MatijaSi avatar Oct 16 '24 07:10 MatijaSi

I've written a class that can assist with looking ahead in a PGN without necessarily consuming the line. There are two methods that can be used to address the lookahead difficulties.

  • iterator.putback(line) puts line at the front of the iterator so that it is returned on the next loop. For example, if the PGN reader comes to a line with header information while scanning a game (if line.startswith("["):), then the current game can be finalized and the header line returned to the iterator to start the next game (iterator.putback(line)).
  • iterator.lookahead() returns the next line while preserving it for the next loop. Similarly to the above, if iterator.lookahead().startswith("["): can be used to detect the end of a game that is missing its result marker. Since lookahead() returns None at the end of input, that case needs a check first.

Here's the code with some usage example below. Let me know if this could be useful.

from typing import Iterable, Iterator, Optional

class PreviewIterator:
    def __init__(self, source: Iterable[str]) -> None:
        self.source = iter(source)
        self.putback_line: Optional[str] = None

    def __iter__(self) -> Iterator[str]:
        return self

    def __next__(self) -> str:
        if self.putback_line is not None:
            line = self.putback_line
            self.putback_line = None
            return line
        else:
            return next(self.source)

    def putback(self, line: str) -> None:
        self.putback_line = line

    def lookahead(self) -> Optional[str]:
        try:
            line = next(self)
        except StopIteration:
            return None

        self.putback(line)
        return line


lines = ["first", "second", "third repeat", "fourth"]
line_iterator = PreviewIterator(lines)
for line in line_iterator:
    print(line)
    if line.endswith("repeat"):
        line_iterator.putback(line.removesuffix("repeat"))

print("")

line_iterator_2 = PreviewIterator(lines)
for line in line_iterator_2:
    print(line)
    look_ahead = line_iterator_2.lookahead()
    if look_ahead and look_ahead.endswith("repeat"):
        print("+++")

Output:

first
second
third repeat
third
fourth

first
second
+++
third repeat
fourth

MarkZH avatar Oct 16 '24 11:10 MarkZH

Yes. I think for 2.x I'd like to replace the stateless chess.pgn.read_game(f: TextIO) -> Optional[Game] with something like a stateful

class PgnReader:
    def __init__(self, f: TextIO): ...
    def read_game(self) -> Optional[Game]: ...

that can internally use a PreviewIterator like you suggested, or save information for the next game, if needed.
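To illustrate the "saves information for the next game" idea, here is a rough, self-contained sketch for headers only. Everything in it is hypothetical (`SketchHeaderReader`, `TAG_RE`, the `pending` attribute) and it is not how python-chess parses PGN. It assumes that a tag line seen after movetext starts the next game, and it ignores blank lines entirely.

```python
import io
import re
from typing import Dict, Optional, TextIO

# Simplified tag-pair pattern for this sketch, e.g. [Event "World Cup"].
TAG_RE = re.compile(r'^\[(\w+)\s+"(.*)"\]\s*$')


class SketchHeaderReader:
    """Hypothetical stateful reader; not part of python-chess."""

    def __init__(self, handle: TextIO) -> None:
        self.handle = handle
        # Tag line that belongs to the next game, saved between calls.
        self.pending: Optional[str] = None

    def _next_line(self) -> str:
        if self.pending is not None:
            line, self.pending = self.pending, None
            return line
        return self.handle.readline()

    def read_headers(self) -> Optional[Dict[str, str]]:
        headers: Dict[str, str] = {}
        in_movetext = False
        while True:
            line = self._next_line()
            if line == "":  # readline() returns "" only at end of input
                return headers if headers else None
            match = TAG_RE.match(line)
            if match and in_movetext:
                self.pending = line  # first tag of the next game: keep it
                return headers
            if match:
                headers[match.group(1)] = match.group(2)
            elif line.strip() and headers:
                in_movetext = True  # comment, moves, or result marker


sample = io.StringIO('[Event "A"]\n\n\n{ comment }\n\n\n0-1\n\n\n[Event "B"]\n\n1. e4 1-0\n')
reader = SketchHeaderReader(sample)
while (headers := reader.read_headers()) is not None:
    print(headers)
```

Because the reader object owns the saved line, repeated blank lines between header, comment, and result no longer produce empty entries. The trade-off is that a game with tags but no movetext at all would be merged into the following game, so a real implementation would need a better boundary rule.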

niklasf avatar Oct 16 '24 14:10 niklasf