Repeating text due to chap2text

Open jakowisp opened this issue 8 months ago • 1 comments

The following partial output from "--scan" shows the issue: the text preceding the actual section is repeated in each part. The description is right, and debugging shows that the start and end ids are correct.

Part: 1. INTRODUCTION TO SOLUTION ARCHITECTURE Part No: 8 Length: 791 1 INTRODUCTION TO SOLUTION ARCHITECTURE

This book is a foundation-level introduction to the discipline of solution architecture which uses a holistic approach to analyse problems and design solutions using the best available evidence from all relevant s

Part: 1.1 Architecture Part No: 9 Length: 5279 1 INTRODUCTION TO SOLUTION ARCHITECTURE

Part: 1.2 Solution architecture Part No: 10 Length: 9735 1 INTRODUCTION TO SOLUTION ARCHITECTURE

This diff shows how the issue can be resolved in 'chap2text':

556c556,566
<         remove = False
---
>         '''
>         There was an assumption that no elements would occur before the element_id.
>         This resulted in repeated text.
>         '''
>         remove=True

558,559c568,571
<             if not remove and end_element_id is not None and elm.get('id') == end_element_id:
<                 remove = True
---
>             if elm.get('id') == element_id:
>                 remove=False
>             if end_element_id is not None and elm.get('id') == end_element_id:
>                 remove=True

In the existing code, only elements at end_element_id and after were removed, which left any elements before element_id in the output.
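The difference between the two versions can be reproduced without the real parser. The sketch below is only a stand-in: `kept_ids_old` and `kept_ids_patched` are hypothetical helpers that copy the flag logic from the diff and run it over a plain list of made-up element ids, assuming that an element is dropped whenever `remove` is true. It does not use the actual element tree from 'chap2text'.

```python
def kept_ids_old(ids, element_id, end_element_id):
    """Flag logic before the patch: nothing is removed until end_element_id."""
    remove = False
    kept = []
    for elm_id in ids:
        if not remove and end_element_id is not None and elm_id == end_element_id:
            remove = True
        if not remove:
            kept.append(elm_id)
    return kept


def kept_ids_patched(ids, element_id, end_element_id):
    """Flag logic after the patch: removal is on until element_id is seen."""
    remove = True
    kept = []
    for elm_id in ids:
        if elm_id == element_id:
            remove = False
        if end_element_id is not None and elm_id == end_element_id:
            remove = True
        if not remove:
            kept.append(elm_id)
    return kept


# Hypothetical ids: one heading from the parent chapter, then two sections.
ids = ["intro", "sec1_1", "sec1_1_body", "sec1_2", "sec1_2_body"]

print(kept_ids_old(ids, "sec1_1", "sec1_2"))      # ['intro', 'sec1_1', 'sec1_1_body']
print(kept_ids_patched(ids, "sec1_1", "sec1_2"))  # ['sec1_1', 'sec1_1_body']
```

With the old logic the "intro" element survives in the extracted section, which matches the repeated heading seen in the "--scan" output.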

Apr 22 '25 17:04 jakowisp

The proposed code contains a bug: it will skip all text if element_id is None.

Fix that and submit a PR for it. Looks good.
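One possible way to handle the None case, shown as a minimal sketch rather than the final patch: start in removing mode only when there is a start id to wait for. `kept_ids_fixed` is a hypothetical helper working on a plain list of made-up ids, assuming an element is dropped whenever `remove` is true. It is not the actual 'chap2text' function.

```python
def kept_ids_fixed(ids, element_id=None, end_element_id=None):
    """Keep ids from element_id (inclusive) up to end_element_id (exclusive).

    If element_id is None, keeping starts at the first element.
    If end_element_id is None, keeping continues to the last element.
    """
    # Only start in removing mode when there is a start id to wait for.
    remove = element_id is not None
    kept = []
    for elm_id in ids:
        if element_id is not None and elm_id == element_id:
            remove = False
        if end_element_id is not None and elm_id == end_element_id:
            remove = True
        if not remove:
            kept.append(elm_id)
    return kept


# Hypothetical ids, used only for illustration.
ids = ["intro", "sec1_1", "sec1_1_body", "sec1_2", "sec1_2_body"]

print(kept_ids_fixed(ids, None, "sec1_1"))      # ['intro']
print(kept_ids_fixed(ids, "sec1_1", "sec1_2"))  # ['sec1_1', 'sec1_1_body']
```

The explicit `element_id is not None` guard on the start check also prevents a None start id from matching elements that simply have no id attribute.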

Jun 16 '25 13:06 calledit