sdow
"Trimming pages file" is parsing records wrong when the title contains parenthesis
I'm trying to import the enwiki-20221001 dump. It contains the following page:
+----------+--------------------+
| page_id | page_title |
+----------+--------------------+
| 71701640 | 104-2,3,(6),(7),11 |
+----------+--------------------+
This ends up creating the following line in pages.txt.gz, which has a truncated title and only two columns instead of three:
71701640 104-2,3,(6
Here are the surrounding lines for context:
71701608 Alberta_Sovereignty_Act 0
71701611 Homeland_Defence_Act 0
71701613 Berlin_Nobody 0
71701617 Miss_Grand_Nepal_2022 0
71701639 Pgm2_c 1
71701640 104-2,3,(6
71701649 2022_Binh_Duong_karaoke_bar_fire 0
71701668 Chapel_of_the_Christ,_San_Pablo_del_Monte 0
71701673 Ximena_Aguilera 0
71701676 Wedding_dress_of_Katharine_Worsley 0
71701682 2022–23_Central_Michigan_Chippewas_men\'s_basketball_team 0
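A minimal sketch of one way this truncation could happen. It assumes, without having checked the actual trimming script, that records are separated by splitting the `INSERT` values on the literal `),(` sequence. The title `104-2,3,(6),(7),11` contains that sequence itself, so a naive split would cut the record right after `(6`. The sample rows are simplified and hypothetical (id, namespace, title, is_redirect only; the real dump has more columns), and `naive_split` and `quote_aware_records` are illustrative names, not functions from the project.

```python
import re

# Hypothetical, simplified rows: (page_id, namespace, 'title', is_redirect).
SAMPLE = (
    "(71701639,0,'Pgm2_c',1),"
    "(71701640,0,'104-2,3,(6),(7),11',0),"
    "(71701649,0,'2022_Binh_Duong_karaoke_bar_fire',0)"
)


def naive_split(values):
    """Split records on the literal '),(' separator (assumed failure mode)."""
    return values.strip("()").split("),(")


# Match one record, treating the quoted title as a unit so that
# parentheses, commas, and backslash-escaped quotes inside it are kept.
RECORD = re.compile(r"\((\d+),(-?\d+),'((?:[^'\\]|\\.)*)',(\d+)\)")


def quote_aware_records(values):
    """Return (page_id, title, is_redirect) tuples without splitting titles."""
    return [(int(m[1]), m[3], int(m[4])) for m in RECORD.finditer(values)]


if __name__ == "__main__":
    for chunk in naive_split(SAMPLE):
        print("naive:", chunk)
    for record in quote_aware_records(SAMPLE):
        print("quote-aware:", record)
```

With the naive split, the three sample records become four chunks, and the second one ends at `'104-2,3,(6`, which matches the truncated title seen in pages.txt.gz. The quote-aware pattern keeps the full title because it only ends a record after the closing quote of the title.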
I will do some more research on this shortly.