The new Chunkers "does not work" - Excel files

Open aropb opened this issue 10 months ago • 8 comments

Context / Scenario

It looks like this is a Cyrillic problem.

Settings: MaxTokensPerParagraph=1000, OverlappingTokens=200

I checked all the text after the decoder, and it is correct. For example (xlsx, but the same thing happens with other formats):

"| 3856 | ГАРАКОЛОБСКИЙ ДЕТСКИЙ | ..."

Chunk from km-default:

"| 3856 | ГАРАКОЛОБСКИЙЙ ДЕТСКИЙЙ | ..."

I also see that sometimes words are cut off!
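
A quick way to confirm this kind of corruption independently of the chunker is to check that every chunk appears verbatim in the decoded text. Below is a minimal Python sketch of that check. `find_corrupted_chunks` is a hypothetical helper, not part of Kernel Memory, and it assumes the chunker only trims whitespace and does not otherwise normalize the text:

```python
def find_corrupted_chunks(source: str, chunks: list[str]) -> list[int]:
    """Return the indexes of chunks whose trimmed text is not a verbatim substring of source.

    Assumption: a correct chunker only cuts the text, it never rewrites it,
    so each chunk must be found unchanged inside the decoded source.
    """
    return [i for i, chunk in enumerate(chunks) if chunk.strip() not in source]


# Decoded text (correct) and two chunks, the second one with doubled letters.
source = "| 3856 | ГАРАКОЛОБСКИЙ ДЕТСКИЙ | ..."
chunks = [
    "| 3856 | ГАРАКОЛОБСКИЙ ДЕТСКИЙ |",
    "| 3856 | ГАРАКОЛОБСКИЙЙ ДЕТСКИЙЙ |",
]
bad = find_corrupted_chunks(source, chunks)
print(bad)
```

Running the same check over the real chunks would show whether the doubled letters are introduced by the chunker or earlier in the pipeline.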

Maybe this is an encoding problem in the separators used by MarkDownChunker and PlainTextChunker (for example ! ? or ⁉)?

private static readonly SeparatorTrie s_explicitSeparators = new([
    // Symbol + space
    ". ", ".\t", ".\n", "\n\n", // note: covers also the case of multiple '.' like "....\n"
    "? ", "?\t", "?\n", // note: covers also the case of multiple '?' and '!?' like "?????\n" and "?!?\n"
    "! ", "!\t", "!\n", // note: covers also the case of multiple '!' and '?!' like "!!!\n" and "!?!\n"
    "⁉ ", "⁉\t", "⁉\n",
    "⁈ ", "⁈\t", "⁈\n",
    "⁇ ", "⁇\t", "⁇\n",
    "… ", "…\t", "…\n",
    // Multi-char separators without space, ordered by length
    "!!!!", "????", "!!!", "???", "?!?", "!?!", "!?", "?!", "!!", "??", "....", "...", "..",
    // 1 char separators without space
    ".", "?", "!", "⁉", "⁈", "⁇", "…",
]);

Another important point: for Excel, a chunk should, whenever possible, end at the end of a row ("\n"), and this boundary should have high priority. Ideally, chunks should consist of complete sentences and complete lines (rows in Excel).
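
To illustrate the row-first behavior requested here, below is a minimal Python sketch that packs whole rows into chunks and only starts a new chunk at a "\n" boundary. It is not how Kernel Memory works today. `chunk_by_rows` is a hypothetical name, the whitespace-based token count is a stand-in for a real tokenizer, and overlap (OverlappingTokens) is left out to keep the sketch short:

```python
def chunk_by_rows(text: str, max_tokens: int, count_tokens=lambda s: len(s.split())) -> list[str]:
    """Group complete rows into chunks without ever cutting inside a row.

    Assumptions: rows are separated by "\n", and count_tokens is a rough
    stand-in for the real tokenizer. A single row larger than max_tokens
    is emitted as its own oversized chunk rather than being split.
    """
    chunks, current, current_tokens = [], [], 0
    for row in text.split("\n"):
        row_tokens = count_tokens(row)
        # Close the current chunk if adding this row would exceed the limit.
        if current and current_tokens + row_tokens > max_tokens:
            chunks.append("\n".join(current))
            current, current_tokens = [], 0
        current.append(row)
        current_tokens += row_tokens
    if current:
        chunks.append("\n".join(current))
    return chunks


table = "| 1 | a b |\n| 2 | c d |\n| 3 | e f |"
row_chunks = chunk_by_rows(table, max_tokens=12)
print(row_chunks)
```

With this approach every chunk starts and ends on a row boundary, so no table row is ever cut in the middle.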

For example, "Version 1.1.1" is also mistakenly split into parts: the algorithm treats "Version 1." as the end of a sentence.
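
One possible heuristic, shown here as a minimal Python sketch and not as the chunker's actual logic, is to treat ".", "!", "?" and "…" as sentence ends only when they are followed by whitespace. `split_sentences` is a hypothetical name. This keeps "1.1.1" intact, but it is only a partial fix: abbreviations and numbered paragraphs such as "231. " would still be split:

```python
import re

# Split only at whitespace that directly follows sentence-ending punctuation.
# A "." followed by a digit or letter (as in "1.1.1") is not a split point.
_SENTENCE_BOUNDARY = re.compile(r"(?<=[.!?…])\s+")


def split_sentences(text: str) -> list[str]:
    """Split text into sentences, keeping dotted tokens such as version numbers whole."""
    return [part for part in _SENTENCE_BOUNDARY.split(text) if part]


sentences = split_sentences("Released Version 1.1.1 today. It works!")
print(sentences)
```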

For English (sentences are cut):


Chunk 1:

" an emerging local plan that has either been submitted for examination or has reached Regulation 18 or Regulation 19 (Town and Country Planning (Local Planning) (England) Regulations 2012) stage, including both a policies map and proposed allocations towards meeting housing need. This provision does not apply to authorities who are not required to demonstrate a housing land supply, as set out These arrangements will apply for a period of two years from the in paragraph 76. publication date of this revision of the Framework. For the purposes of plan-making 227. The policies in the original National Planning Policy Framework published in March 2012 will apply for the purpose of examining plans, where those plans were submitted on or before 24 January 2019. Where such plans are withdrawn or otherwise do not proceed to become part of the development plan, the policies contained in this Framework will apply to any subsequent plan produced for the area concerned. 228. For the purposes of the policy on larger-scale development in paragraph 22, this applies only to plans that have not reached Regulation 19 of the Town and Country Planning (Local Planning) (England) Regulations 2012 (pre-submission) of this Framework was published on 20 stage at the point the previous version 79 As an exception to this, the policy contained in paragraph 76 and the related reference in footnote 8 of this Framework should only be taken into account as a material consideration when dealing with applications made on or after the date of publication of this version of the Framework. 80 Unless these strategic policies have been reviewed and found not to require updating. Where local housing need is used as the basis for assessing whether a four year supply of specific deliverable sites exists, it should be calculated using the standard method set out in national planning guidance. 
65July 2021 (for Spatial Development Strategies this would refer to consultation under section 335(2) of the Greater London Authority Act 1999). 229. For the purposes of the policy on renewable and low carbon energy and heat in plans in paragraph 160, this policy does not apply to plans that have reached Regulation 19 of the Town and Country Planning (Local Planning) (England) Regulations 2012 (pre-submission) stage, or that reach this stage within three months of the date of publication of the previous version of this Framework published on 5 September 2023. For Spatial Development Strategies, paragraph 160 does not apply to strategies that have reached consultation under section 335(2) of the Greater London Authority Act 1999 or that reach this stage within three months of the date of publication of the previous version of this Framework published on 5 September 2023. 230. The policies in this Framework (published on 19 December 2023) will apply for the purpose of examining plans, where those plans reach regulation 19 of the Town and Country Planning (Local Planning) (England) Regulations 2012 (pre- submission) stage after 19 March 2024. Plans that reach pre-submission consultation on or before this date will be examined under the relevant previous version of the Framework in accordance with the above arrangements. For Spatial Development Strategies, this Framework applies to strategies that have reached consultation under section 335(2) of the Greater London Authority Act 1999 after 19 March 2024. Strategies that reach this stage on or before this date will be examined under the relevant previous version of the Framework in accordance with the above arrangements. Where plans or strategies are withdrawn or otherwise do not proceed to become part of the development plan, the policies contained in this Framework will apply to any subsequent plan or strategy produced for the area concerned. 231. 
The Government will continue to explore with individual areas the potential for planning freedoms and flexibilities, for example where this would facilitate an increase in the amount of housing that can be delivered. "

Chunk 2:

"submission) stage after 19 March 2024. Plans that reach pre-submission consultation on or before this date will be examined under the relevant previous version of the Framework in accordance with the above arrangements. For Spatial Development Strategies, this Framework applies to strategies that have reached consultation under section 335(2) of the Greater London Authority Act 1999 after 19 March 2024. Strategies that reach this stage on or before this date will be examined under the relevant previous version of the Framework in accordance with the above arrangements. Where plans or strategies are withdrawn or otherwise do not proceed to become part of the development plan, the policies contained in this Framework will apply to any subsequent plan or strategy produced for the area concerned. 231. The Government will continue to explore with individual areas the potential for planning freedoms and flexibilities, for example where this would facilitate an increase in the amount of housing that can be delivered. 66Annex 2: Glossary Affordable housing: housing for sale or rent, for those whose needs are not met by the market (including housing that provides a subsidised route to home ownership and/or is for essential local workers); and which complies with one or more of the following 81 definitions : a) Affordable housing for rent: meets all of the following conditions: (a) the rent is set in accordance with the Government’s rent policy for Social Rent or Affordable Rent, or is at least 20% below local market rents (including service charges where applicable); (b) the landlord is a registered provider, except where it is included as part of a Build to Rent scheme (in which case the landlord need not be a registered provider); and (c) it includes provisions to remain at an affordable price for future eligible households, or for the subsidy to be recycled for alternative affordable housing provision. 
For Build to Rent schemes affordable housing for rent is expected to be the normal form of affordable housing provision (and, in this context, is known as Affordable Private Rent). b) Starter homes: is as specified in Sections 2 and 3 of the Housing and Planning Act 2016 and any secondary legislation made under these sections. The definition of a starter home should reflect the meaning set out in statute and any such secondary legislation at the time of plan-preparation or decision-making. Where secondary legislation has the effect of limiting a household’s eligibility to purchase a starter home to those with a particular maximum level of household income, those restrictions should be used. c) Discounted market sales housing: is that sold at a discount of at least 20% below local market value. Eligibility is determined with regard to local incomes and local house prices. Provisions should be in place to ensure housing remains at a discount for future eligible households. d) Other affordable routes to home ownership: is housing provided for sale that provides a route to ownership for those who could not achieve home ownership through the market. It includes shared ownership, relevant equity loans, other low cost homes for sale (at a price equivalent to at least 20% below local market value) and rent to buy (which includes a period of intermediate rent). Where public grant funding is provided, there should be provisions for the homes to remain at an affordable price for future eligible households, or for any receipts to be recycled for alternative affordable housing provision, or refunded to Government or the relevant authority specified in the funding agreement. Air quality management areas: Areas designated by local authorities because they are not likely to achieve national air quality objectives by the relevant deadlines. Ancient or veteran tree: A tree which, because of its age, size and condition, is of exceptional biodiversity, cultural or heritage value. 
All ancient trees are veteran trees. Not all veteran trees are old enough to be ancient, but are old relative to other trees of the same species. Very few trees of any species reach the ancient life-stage. 81 This definition should be read in conjunction with relevant policy contained in the Affordable Homes Update Written Ministerial Statement published on 24 May 2021. 67Ancient woodland: An area that has been wooded continuously since at least 1600 AD. It includes ancient semi-natural woodland and plantations on ancient woodland sites (PAWS). Annual position statement: A document setting out the 5 year housing land supply position on 1st April each year, prepared by the local planning authority in consultation with developers and others who have an impact on delivery. Archaeological interest: There will be archaeological interest in a heritage asset if it holds, or potentially holds, evidence of past human activity worthy of expert investigation at some point. "

Chunk 3:

" or heritage value. All ancient trees are veteran trees. Not all veteran trees are old enough to be ancient, but are old relative to other trees of the same species. Very few trees of any species reach the ancient life-stage. 81 This definition should be read in conjunction with relevant policy contained in the Affordable Homes Update Written Ministerial Statement published on 24 May 2021. 67Ancient woodland: An area that has been wooded continuously since at least 1600 AD. It includes ancient semi-natural woodland and plantations on ancient woodland sites (PAWS). Annual position statement: A document setting out the 5 year housing land supply position on 1st April each year, prepared by the local planning authority in consultation with developers and others who have an impact on delivery. Archaeological interest: There will be archaeological interest in a heritage asset if it holds, or potentially holds, evidence of past human activity worthy of expert investigation at some point. Article 4 direction: A direction made under Article 4 of the Town and Country Planning (General Permitted Development) (England) Order 2015 which withdraws permitted development rights granted by that Order. Best and most versatile agricultural land: Land in grades 1, 2 and 3a of the Agricultural Land Classification. Brownfield land: See Previously developed land. Brownfield land registers: Registers of previously developed land that local planning authorities consider to be appropriate for residential development, having regard to criteria in the Town and Country Planning (Brownfield Land Registers) Regulations 2017. Local planning authorities will be able to trigger a grant of permission in principle for residential development on suitable sites in their registers where they follow the required procedures. Build to Rent: Purpose built housing that is typically 100% rented out. 
It can form part of a wider multi-tenure development comprising either flats or houses, but should be on the same site and/or contiguous with the main development. Schemes will usually offer longer tenancy agreements of three years or more, and will typically be professionally managed stock in single ownership and management control. Climate change adaptation: Adjustments made to natural or human systems in response to the actual or anticipated impacts of climate change, to mitigate harm or exploit beneficial opportunities. Climate change mitigation: Action to reduce the impact of human activity on the climate system, primarily through reducing greenhouse gas emissions. Coastal change management area: An area identified in plans as likely to be affected by physical change to the shoreline through erosion, coastal landslip, permanent inundation or coastal accretion. Community forest: An area identified through the England Community Forest Programme to revitalise countryside and green space in and around major conurbations. Community Right to Build Order: An Order made by the local planning authority (under the Town and Country Planning Act 1990) that grants planning permission for a site- specific development proposal or classes of development. 68Community-led developments: A development instigated and taken forward by a not- for-profit organisation set up and run primarily for the purpose of meeting the housing needs of its members and the wider local community, rather than being a primarily commercial enterprise. The organisation is created, managed and democratically controlled by its members. It may take any one of various legal forms including a community land trust, housing co-operative and community benefit society. Membership of the organisation is open to all beneficiaries and prospective beneficiaries of that organisation. 
The organisation should own, manage or steward the homes in a manner consistent with its purpose, for example through a mutually supported arrangement with a Registered Provider of Social Housing. The benefits of the development to the specified community should be clearly defined and consideration given to how these benefits can be protected over time, including in the event of the organisation being wound up. Competent person (to prepare site investigation information): A person with a recognised relevant qualification, sufficient experience in dealing with the type(s) of pollution or land instability, and membership of a relevant professional organisation. Conservation (for heritage policy): The process of maintaining and managing change to a heritage asset in a way that sustains and, where appropriate, enhances its significance. Decentralised energy: Local renewable and local low carbon energy sources. Deliverable: To be considered deliverable, sites for housing should be available now, offer a suitable location for development now, and be achievable with a realistic prospect that housing will be delivered on the site within five years. "


Thanks.

What happened?

Critical error!

Importance

edge case

Platform, Language, Versions

KM 0.97.250211.1, LLamaSharp 0.21.0, .NET 9.0.1

aropb, Feb 12 '25 09:02