
Fix(html): Handle `<br>` elements to insert line breaks in text

Open Dhruv-Maradiya opened this issue 1 year ago • 6 comments

Fixes #1090 by updating the DOM parser to handle `<br>` elements and insert line breaks (`\n`) when converting HTML content to plain text.

Initially, I thought adding a simple condition might not be a reliable solution. So, I decided to check how HTML-to-text conversion is handled in Chromium and found a similar approach. Here's the link.
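To illustrate the idea only, here is a minimal sketch of the technique described above: while collecting the text of an HTML fragment, emit a newline for each `<br>` element. It uses Python's standard-library `html.parser` rather than the Dart `html` package, so it is not the code in this PR; the names `TextWithBreaks` and `html_to_text` are hypothetical and chosen for this example.

```python
from html.parser import HTMLParser


class TextWithBreaks(HTMLParser):
    """Collects text content and emits a newline for every <br> element.

    Hypothetical helper for illustration; not part of the Dart package.
    """

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self._parts = []

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag names, so <BR> and <br/> both land here
        # (self-closing tags are routed through handle_starttag by default).
        if tag == "br":
            self._parts.append("\n")

    def handle_data(self, data):
        # Plain text between tags is kept as-is.
        self._parts.append(data)

    def text(self):
        return "".join(self._parts)


def html_to_text(markup):
    """Return the text of `markup` with each <br> replaced by a newline."""
    parser = TextWithBreaks()
    parser.feed(markup)
    parser.close()  # flush any text buffered after the last tag
    return parser.text()


print(repr(html_to_text("Hello<br>World")))  # 'Hello\nWorld'
```

The sketch handles only `<br>`; block-level elements such as `<p>` or `<div>` add no line breaks here, which matches the narrow scope stated in the description.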


- [x] I’ve reviewed the contributor guide and applied the relevant portions to this PR.
Contribution guidelines:

Note that many Dart repos have a weekly cadence for reviewing PRs - please allow for some latency before initial review feedback.

Dhruv-Maradiya avatar Dec 25 '24 12:12 Dhruv-Maradiya

PR Health

Breaking changes :warning:
| Package | Change | Current Version | New Version | Needed Version | Looking good? |
| --- | --- | --- | --- | --- | --- |
| html | Breaking | 0.15.6 | 0.15.7-wip | 0.16.0 | Got "0.15.7-wip" expected >= "0.16.0" (breaking changes) :warning: |

This check can be disabled by tagging the PR with skip-breaking-check.

Changelog Entry :heavy_check_mark:
Package Changed Files

Changes to files need to be accounted for in their respective changelogs.

This check can be disabled by tagging the PR with skip-changelog-check.

Coverage :heavy_check_mark:
| File | Coverage |
| --- | --- |
| pkgs/html/lib/dom.dart | :green_heart: 65 % :arrow_up: 1 % |

This check for test coverage is informational (issues shown here will not fail the PR).

This check can be disabled by tagging the PR with skip-coverage-check.

API leaks :warning:

The following packages contain symbols visible in the public API, but not exported by the library. Export these symbols or remove them from your publicly visible API.

Package Leaked API symbol Leaking sources
html HtmlTokenizer html/parser.dart::HtmlParser::tokenizer
html Token tokenizer.dart::HtmlTokenizer
tokenizer.dart::HtmlTokenizer::tokenQueue
tokenizer.dart::HtmlTokenizer::currentToken
tokenizer.dart::HtmlTokenizer::currentToken
token.dart::TagToken
token.dart::DoctypeToken
token.dart::StringToken
tokenizer.dart::HtmlTokenizer::current
token.dart::StartTagToken
token.dart::CommentToken
html/parser.dart::Phase::processComment
html/parser.dart::Phase::processDoctype
token.dart::CharactersToken
html/parser.dart::Phase::processCharacters
token.dart::SpaceCharactersToken
html/parser.dart::Phase::processSpaceCharacters
html/parser.dart::Phase::processStartTag
html/parser.dart::Phase::startTagHtml
token.dart::EndTagToken
html/parser.dart::Phase::processEndTag
html/parser.dart::HtmlParser::inForeignContent::token
html/parser.dart::HtmlParser::parseRCDataRawtext::token
html/parser.dart::BeforeHeadPhase::startTagOther
html/parser.dart::BeforeHeadPhase::endTagImplyHead
html/parser.dart::InHeadPhase::startTagOther
html/parser.dart::InHeadPhase::endTagHtmlBodyBr
html/parser.dart::AfterHeadPhase::startTagOther
html/parser.dart::AfterHeadPhase::endTagHtmlBodyBr
html/parser.dart::InBodyPhase::startTagProcessInHead
html/parser.dart::InBodyPhase::startTagButton
html/parser.dart::InBodyPhase::startTagOther
html/parser.dart::InBodyPhase::endTagHtml
html/parser.dart::InTablePhase::startTagCol
html/parser.dart::InTablePhase::startTagImplyTbody
html/parser.dart::InTablePhase::startTagTable
html/parser.dart::InTablePhase::startTagStyleScript
html/parser.dart::InCaptionPhase::startTagTableElement
html/parser.dart::InCaptionPhase::startTagOther
html/parser.dart::InCaptionPhase::endTagTable
html/parser.dart::InCaptionPhase::endTagOther
html/parser.dart::InColumnGroupPhase::startTagOther
html/parser.dart::InColumnGroupPhase::endTagOther
html/parser.dart::InTableBodyPhase::startTagTableCell
html/parser.dart::InTableBodyPhase::startTagTableOther
html/parser.dart::InTableBodyPhase::startTagOther
html/parser.dart::InTableBodyPhase::endTagTable
html/parser.dart::InTableBodyPhase::endTagOther
html/parser.dart::InRowPhase::startTagTableOther
html/parser.dart::InRowPhase::startTagOther
html/parser.dart::InRowPhase::endTagTable
html/parser.dart::InRowPhase::endTagTableRowGroup
html/parser.dart::InRowPhase::endTagOther
html/parser.dart::InCellPhase::startTagTableOther
html/parser.dart::InCellPhase::startTagOther
html/parser.dart::InCellPhase::endTagImply
html/parser.dart::InCellPhase::endTagOther
html/parser.dart::InSelectPhase::startTagInput
html/parser.dart::InSelectPhase::startTagScript
html/parser.dart::InSelectPhase::startTagOther
html/parser.dart::InSelectInTablePhase::startTagTable
html/parser.dart::InSelectInTablePhase::startTagOther
html/parser.dart::InSelectInTablePhase::endTagTable
html/parser.dart::InSelectInTablePhase::endTagOther
html/parser.dart::AfterBodyPhase::startTagOther
html/parser.dart::AfterBodyPhase::endTagHtml::token
html/parser.dart::AfterBodyPhase::endTagOther
html/parser.dart::InFramesetPhase::startTagNoframes
html/parser.dart::InFramesetPhase::startTagOther
html/parser.dart::AfterFramesetPhase::startTagNoframes
html/parser.dart::AfterAfterBodyPhase::startTagOther
html/parser.dart::AfterAfterFramesetPhase::startTagNoFrames
html HtmlInputStream tokenizer.dart::HtmlTokenizer::stream
html TagToken tokenizer.dart::HtmlTokenizer::currentTagToken
token.dart::StartTagToken
token.dart::EndTagToken
html/parser.dart::InTableBodyPhase::startTagTableOther::token
html/parser.dart::InTableBodyPhase::endTagTable::token
html DoctypeToken tokenizer.dart::HtmlTokenizer::currentDoctypeToken
treebuilder.dart::TreeBuilder::insertDoctype::token
html/parser.dart::Phase::processDoctype::token
html StringToken tokenizer.dart::HtmlTokenizer::currentStringToken
token.dart::StringToken::add
treebuilder.dart::TreeBuilder::insertComment::token
token.dart::CommentToken
token.dart::CharactersToken
token.dart::SpaceCharactersToken
html/parser.dart::InBodyPhase::processSpaceCharactersDropNewline::token
html/parser.dart::InTableTextPhase::characterTokens
html TreeBuilder html/parser.dart::HtmlParser::tree
html/parser.dart::Phase::tree
html/parser.dart::HtmlParser::new::tree
html ActiveFormattingElements treebuilder.dart::TreeBuilder::activeFormattingElements
html StartTagToken treebuilder.dart::TreeBuilder::insertRoot::token
treebuilder.dart::TreeBuilder::createElement::token
treebuilder.dart::TreeBuilder::insertElement::token
treebuilder.dart::TreeBuilder::insertElementNormal::token
treebuilder.dart::TreeBuilder::insertElementTable::token
html/parser.dart::Phase::processStartTag::token
html/parser.dart::Phase::startTagHtml::token
html/parser.dart::HtmlParser::adjustMathMLAttributes::token
html/parser.dart::HtmlParser::adjustSVGAttributes::token
html/parser.dart::HtmlParser::adjustForeignAttributes::token
html/parser.dart::BeforeHeadPhase::startTagHead::token
html/parser.dart::BeforeHeadPhase::startTagOther::token
html/parser.dart::InHeadPhase::startTagHead::token
html/parser.dart::InHeadPhase::startTagBaseLinkCommand::token
html/parser.dart::InHeadPhase::startTagMeta::token
html/parser.dart::InHeadPhase::startTagTitle::token
html/parser.dart::InHeadPhase::startTagNoScriptNoFramesStyle::token
html/parser.dart::InHeadPhase::startTagScript::token
html/parser.dart::InHeadPhase::startTagOther::token
html/parser.dart::AfterHeadPhase::startTagBody::token
html/parser.dart::AfterHeadPhase::startTagFrameset::token
html/parser.dart::AfterHeadPhase::startTagFromHead::token
html/parser.dart::AfterHeadPhase::startTagHead::token
html/parser.dart::AfterHeadPhase::startTagOther::token
html/parser.dart::InBodyPhase::addFormattingElement::token
html/parser.dart::InBodyPhase::startTagProcessInHead::token
html/parser.dart::InBodyPhase::startTagBody::token
html/parser.dart::InBodyPhase::startTagFrameset::token
html/parser.dart::InBodyPhase::startTagCloseP::token
html/parser.dart::InBodyPhase::startTagPreListing::token
html/parser.dart::InBodyPhase::startTagForm::token
html/parser.dart::InBodyPhase::startTagListItem::token
html/parser.dart::InBodyPhase::startTagPlaintext::token
html/parser.dart::InBodyPhase::startTagHeading::token
html/parser.dart::InBodyPhase::startTagA::token
html/parser.dart::InBodyPhase::startTagFormatting::token
html/parser.dart::InBodyPhase::startTagNobr::token
html/parser.dart::InBodyPhase::startTagButton::token
html/parser.dart::InBodyPhase::startTagAppletMarqueeObject::token
html/parser.dart::InBodyPhase::startTagXmp::token
html/parser.dart::InBodyPhase::startTagTable::token
html/parser.dart::InBodyPhase::startTagVoidFormatting::token
html/parser.dart::InBodyPhase::startTagInput::token
html/parser.dart::InBodyPhase::startTagParamSource::token
html/parser.dart::InBodyPhase::startTagHr::token
html/parser.dart::InBodyPhase::startTagImage::token
html/parser.dart::InBodyPhase::startTagIsIndex::token
html/parser.dart::InBodyPhase::startTagTextarea::token
html/parser.dart::InBodyPhase::startTagIFrame::token
html/parser.dart::InBodyPhase::startTagRawtext::token
html/parser.dart::InBodyPhase::startTagOpt::token
html/parser.dart::InBodyPhase::startTagSelect::token
html/parser.dart::InBodyPhase::startTagRpRt::token
html/parser.dart::InBodyPhase::startTagMath::token
html/parser.dart::InBodyPhase::startTagSvg::token
html/parser.dart::InBodyPhase::startTagMisplaced::token
html/parser.dart::InBodyPhase::startTagOther::token
html/parser.dart::InTablePhase::startTagCaption::token
html/parser.dart::InTablePhase::startTagColgroup::token
html/parser.dart::InTablePhase::startTagCol::token
html/parser.dart::InTablePhase::startTagRowGroup::token
html/parser.dart::InTablePhase::startTagImplyTbody::token
html/parser.dart::InTablePhase::startTagTable::token
html/parser.dart::InTablePhase::startTagStyleScript::token
html/parser.dart::InTablePhase::startTagInput::token
html/parser.dart::InTablePhase::startTagForm::token
html/parser.dart::InTablePhase::startTagOther::token
html/parser.dart::InCaptionPhase::startTagTableElement::token
html/parser.dart::InCaptionPhase::startTagOther::token
html/parser.dart::InColumnGroupPhase::startTagCol::token
html/parser.dart::InColumnGroupPhase::startTagOther::token
html/parser.dart::InTableBodyPhase::startTagTr::token
html/parser.dart::InTableBodyPhase::startTagTableCell::token
html/parser.dart::InTableBodyPhase::startTagOther::token
html/parser.dart::InRowPhase::startTagTableCell::token
html/parser.dart::InRowPhase::startTagTableOther::token
html/parser.dart::InRowPhase::startTagOther::token
html/parser.dart::InCellPhase::startTagTableOther::token
html/parser.dart::InCellPhase::startTagOther::token
html/parser.dart::InSelectPhase::startTagOption::token
html/parser.dart::InSelectPhase::startTagOptgroup::token
html/parser.dart::InSelectPhase::startTagSelect::token
html/parser.dart::InSelectPhase::startTagInput::token
html/parser.dart::InSelectPhase::startTagScript::token
html/parser.dart::InSelectPhase::startTagOther::token
html/parser.dart::InSelectInTablePhase::startTagTable::token
html/parser.dart::InSelectInTablePhase::startTagOther::token
html/parser.dart::InForeignContentPhase::adjustSVGTagNames::token
html/parser.dart::AfterBodyPhase::startTagOther::token
html/parser.dart::InFramesetPhase::startTagFrameset::token
html/parser.dart::InFramesetPhase::startTagFrame::token
html/parser.dart::InFramesetPhase::startTagNoframes::token
html/parser.dart::InFramesetPhase::startTagOther::token
html/parser.dart::AfterFramesetPhase::startTagNoframes::token
html/parser.dart::AfterFramesetPhase::startTagOther::token
html/parser.dart::AfterAfterBodyPhase::startTagOther::token
html/parser.dart::AfterAfterFramesetPhase::startTagNoFrames::token
html/parser.dart::AfterAfterFramesetPhase::startTagOther::token
html TagAttribute token.dart::StartTagToken::attributeSpans
html CommentToken html/parser.dart::Phase::processComment::token
html CharactersToken html/parser.dart::Phase::processCharacters::token
html/parser.dart::InTablePhase::insertText::token
html SpaceCharactersToken html/parser.dart::Phase::processSpaceCharacters::token
html EndTagToken html/parser.dart::Phase::processEndTag::token
html/parser.dart::Phase::popOpenElementsUntil::token
html/parser.dart::BeforeHeadPhase::endTagImplyHead::token
html/parser.dart::BeforeHeadPhase::endTagOther::token
html/parser.dart::InHeadPhase::endTagHead::token
html/parser.dart::InHeadPhase::endTagHtmlBodyBr::token
html/parser.dart::InHeadPhase::endTagOther::token
html/parser.dart::AfterHeadPhase::endTagHtmlBodyBr::token
html/parser.dart::AfterHeadPhase::endTagOther::token
html/parser.dart::InBodyPhase::endTagP::token
html/parser.dart::InBodyPhase::endTagBody::token
html/parser.dart::InBodyPhase::endTagHtml::token
html/parser.dart::InBodyPhase::endTagBlock::token
html/parser.dart::InBodyPhase::endTagForm::token
html/parser.dart::InBodyPhase::endTagListItem::token
html/parser.dart::InBodyPhase::endTagHeading::token
html/parser.dart::InBodyPhase::endTagFormatting::token
html/parser.dart::InBodyPhase::endTagAppletMarqueeObject::token
html/parser.dart::InBodyPhase::endTagBr::token
html/parser.dart::InBodyPhase::endTagOther::token
html/parser.dart::TextPhase::endTagScript::token
html/parser.dart::TextPhase::endTagOther::token
html/parser.dart::InTablePhase::endTagTable::token
html/parser.dart::InTablePhase::endTagIgnore::token
html/parser.dart::InTablePhase::endTagOther::token
html/parser.dart::InCaptionPhase::endTagCaption::token
html/parser.dart::InCaptionPhase::endTagTable::token
html/parser.dart::InCaptionPhase::endTagIgnore::token
html/parser.dart::InCaptionPhase::endTagOther::token
html/parser.dart::InColumnGroupPhase::endTagColgroup::token
html/parser.dart::InColumnGroupPhase::endTagCol::token
html/parser.dart::InColumnGroupPhase::endTagOther::token
html/parser.dart::InTableBodyPhase::endTagTableRowGroup::token
html/parser.dart::InTableBodyPhase::endTagIgnore::token
html/parser.dart::InTableBodyPhase::endTagOther::token
html/parser.dart::InRowPhase::endTagTr::token
html/parser.dart::InRowPhase::endTagTable::token
html/parser.dart::InRowPhase::endTagTableRowGroup::token
html/parser.dart::InRowPhase::endTagIgnore::token
html/parser.dart::InRowPhase::endTagOther::token
html/parser.dart::InCellPhase::endTagTableCell::token
html/parser.dart::InCellPhase::endTagIgnore::token
html/parser.dart::InCellPhase::endTagImply::token
html/parser.dart::InCellPhase::endTagOther::token
html/parser.dart::InSelectPhase::endTagOption::token
html/parser.dart::InSelectPhase::endTagOptgroup::token
html/parser.dart::InSelectPhase::endTagSelect::token
html/parser.dart::InSelectPhase::endTagOther::token
html/parser.dart::InSelectInTablePhase::endTagTable::token
html/parser.dart::InSelectInTablePhase::endTagOther::token
html/parser.dart::AfterBodyPhase::endTagOther::token
html/parser.dart::InFramesetPhase::endTagFrameset::token
html/parser.dart::InFramesetPhase::endTagOther::token
html/parser.dart::AfterFramesetPhase::endTagHtml::token
html/parser.dart::AfterFramesetPhase::endTagOther::token

This check can be disabled by tagging the PR with skip-leaking-check.

License Headers :warning:
// Copyright (c) 2025, the Dart project authors. Please see the AUTHORS file
// for details. All rights reserved. Use of this source code is governed by a
// BSD-style license that can be found in the LICENSE file.
Files
pkgs/html/lib/dom.dart
pkgs/html/test/parser_feature_test.dart

All source files should start with a license header.

Unrelated files missing license headers
Files
pkgs/bazel_worker/benchmark/benchmark.dart
pkgs/benchmark_harness/integration_test/perf_benchmark_test.dart
pkgs/boolean_selector/example/example.dart
pkgs/clock/lib/clock.dart
pkgs/clock/lib/src/clock.dart
pkgs/clock/lib/src/default.dart
pkgs/clock/lib/src/stopwatch.dart
pkgs/clock/lib/src/utils.dart
pkgs/clock/test/clock_test.dart
pkgs/clock/test/default_test.dart
pkgs/clock/test/stopwatch_test.dart
pkgs/clock/test/utils.dart
pkgs/coverage/lib/src/coverage_options.dart
pkgs/html/example/main.dart
pkgs/html/lib/dom_parsing.dart
pkgs/html/lib/html_escape.dart
pkgs/html/lib/parser.dart
pkgs/html/lib/src/constants.dart
pkgs/html/lib/src/encoding_parser.dart
pkgs/html/lib/src/html_input_stream.dart
pkgs/html/lib/src/list_proxy.dart
pkgs/html/lib/src/query_selector.dart
pkgs/html/lib/src/token.dart
pkgs/html/lib/src/tokenizer.dart
pkgs/html/lib/src/treebuilder.dart
pkgs/html/lib/src/utils.dart
pkgs/html/test/dom_test.dart
pkgs/html/test/parser_test.dart
pkgs/html/test/query_selector_test.dart
pkgs/html/test/selectors/level1_baseline_test.dart
pkgs/html/test/selectors/level1_lib.dart
pkgs/html/test/selectors/selectors.dart
pkgs/html/test/support.dart
pkgs/html/test/tokenizer_test.dart
pkgs/html/test/trie_test.dart
pkgs/html/tool/generate_trie.dart
pkgs/pubspec_parse/test/git_uri_test.dart
pkgs/stack_trace/example/example.dart
pkgs/watcher/test/custom_watcher_factory_test.dart
pkgs/yaml_edit/example/example.dart

This check can be disabled by tagging the PR with skip-license-check.

github-actions[bot] avatar Dec 30 '24 01:12 github-actions[bot]

Hey, thanks for reviewing this! 🙌 It’s been a few months since I worked on it, and I was still getting familiar with the codebase at the time — so I’ll need to refresh myself on the changes. I'll take a look as soon as I can. Appreciate your feedback!

Dhruv-Maradiya avatar Apr 19 '25 07:04 Dhruv-Maradiya

@Dhruv-Maradiya Just a friendly ping as I am looking through PRs - is there intention to land this?

mosuem avatar Oct 07 '25 12:10 mosuem

Hey @mosuem, sorry for the delay! I’ll try to wrap this up ASAP, most likely today.

Dhruv-Maradiya avatar Oct 08 '25 04:10 Dhruv-Maradiya

Friendly ping :) (No pressure, just happened to walk by this tab in my browser)

mosuem avatar Oct 13 '25 15:10 mosuem

Package publishing

Package Version Status Publish tag (post-merge)
package:bazel_worker 1.1.4 already published at pub.dev
package:benchmark_harness 2.4.0-wip WIP (no publish necessary)
package:boolean_selector 2.1.2 already published at pub.dev
package:browser_launcher 1.1.3 already published at pub.dev
package:cli_config 0.2.1-wip WIP (no publish necessary)
package:cli_util 0.5.0-wip WIP (no publish necessary)
package:clock 1.1.3-wip WIP (no publish necessary)
package:code_builder 4.11.0 already published at pub.dev
package:coverage 1.15.0 already published at pub.dev
package:csslib 1.0.2 already published at pub.dev
package:extension_discovery 2.1.0 already published at pub.dev
package:file 7.0.2-wip WIP (no publish necessary)
package:file_testing 3.1.0-wip WIP (no publish necessary)
package:glob 2.1.3 already published at pub.dev
package:graphs 2.3.3-wip WIP (no publish necessary)
package:html 0.15.7-wip WIP (no publish necessary)
package:io 1.1.0-wip WIP (no publish necessary)
package:json_rpc_2 4.0.0 already published at pub.dev
package:markdown 7.3.1-wip WIP (no publish necessary)
package:mime 2.0.0 already published at pub.dev
package:oauth2 2.0.4 ready to publish oauth2-v2.0.4
package:package_config 2.3.0-wip WIP (no publish necessary)
package:pool 1.5.2 already published at pub.dev
package:process 5.0.5 already published at pub.dev
package:pub_semver 2.2.0 already published at pub.dev
package:pubspec_parse 1.5.1-wip WIP (no publish necessary)
package:source_map_stack_trace 2.1.3-wip WIP (no publish necessary)
package:source_maps 0.10.14-wip WIP (no publish necessary)
package:source_span 1.10.1 already published at pub.dev
package:sse 4.1.8 already published at pub.dev
package:stack_trace 1.12.1 already published at pub.dev
package:stream_channel 2.1.4 already published at pub.dev
package:stream_transform 2.1.2-wip WIP (no publish necessary)
package:string_scanner 1.4.1 already published at pub.dev
package:term_glyph 1.2.3-wip WIP (no publish necessary)
package:test_reflective_loader 0.4.0 already published at pub.dev
package:timing 1.0.2 already published at pub.dev
package:unified_analytics 8.0.6 ready to publish unified_analytics-v8.0.6
package:watcher 1.1.5-wip WIP (no publish necessary)
package:yaml 3.1.3 already published at pub.dev
package:yaml_edit 2.2.2 already published at pub.dev

Documentation at https://github.com/dart-lang/ecosystem/wiki/Publishing-automation.

github-actions[bot] avatar Oct 31 '25 13:10 github-actions[bot]

@HosseinYousefi could you take another look?

mosuem avatar Dec 12 '25 13:12 mosuem