"Trimming pages file" parses records incorrectly when the title contains parentheses

Open hut8 opened this issue 1 year ago • 3 comments

I'm trying to import the enwiki-20221001 dump. It contains this page:

+----------+--------------------+
| page_id  | page_title         |
+----------+--------------------+
| 71701640 | 104-2,3,(6),(7),11 |
+----------+--------------------+

This page ends up as the following line in pages.txt.gz, which has a truncated title and only two columns instead of three:

71701640   104-2,3,(6

Here's some context for surrounding lines:

71701608        Alberta_Sovereignty_Act 0
71701611        Homeland_Defence_Act    0
71701613        Berlin_Nobody   0
71701617        Miss_Grand_Nepal_2022   0
71701639        Pgm2_c  1
71701640   104-2,3,(6
71701649        2022_Binh_Duong_karaoke_bar_fire        0
71701668        Chapel_of_the_Christ,_San_Pablo_del_Monte       0
71701673        Ximena_Aguilera 0
71701676        Wedding_dress_of_Katharine_Worsley      0
71701682        2022–23_Central_Michigan_Chippewas_men\'s_basketball_team       0
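
To illustrate one way this truncation could arise, here is a minimal Python sketch. It assumes (I have not confirmed this against the actual trimming script) that records are separated by splitting the SQL `INSERT` values on the literal sequence `),(` without checking whether that sequence sits inside a quoted title. The sample string, its column layout, and both function names are hypothetical and only for illustration.

```python
# Hypothetical excerpt of the VALUES part of an INSERT statement,
# reduced to three columns (id, namespace, title) for illustration.
SAMPLE = (
    "(71701639,1,'Pgm2_c'),"
    "(71701640,0,'104-2,3,(6),(7),11'),"
    "(71701649,0,'2022_Binh_Duong_karaoke_bar_fire')"
)


def naive_split(values):
    """Split on every '),(' regardless of quoting (the suspected failure mode)."""
    return values[1:-1].split("),(")


def quote_aware_split(values):
    """Split on '),(' only when it occurs outside a single-quoted string."""
    body = values[1:-1]  # drop the outermost '(' and ')'
    records, current = [], []
    in_quote = False
    i = 0
    while i < len(body):
        ch = body[i]
        if in_quote and ch == "\\":
            # Keep a backslash escape (such as \') together with its character.
            current.append(body[i:i + 2])
            i += 2
            continue
        if ch == "'":
            in_quote = not in_quote
        if not in_quote and body.startswith("),(", i):
            records.append("".join(current))
            current = []
            i += 3
            continue
        current.append(ch)
        i += 1
    records.append("".join(current))
    return records


if __name__ == "__main__":
    for record in naive_split(SAMPLE):
        print("naive:      ", record)
    for record in quote_aware_split(SAMPLE):
        print("quote-aware:", record)
```

With the naive split, the record for page 71701640 is cut off after `104-2,3,(6`, which matches the truncated title in pages.txt.gz. The quote-aware version keeps the title whole and also leaves escaped quotes such as `men\'s` intact.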

I will do some more research on this shortly.

hut8, Oct 29 '22 19:10