
Sentence Splitting Approach in BERT Preprocessing

Open AliHaiderAhmad001 opened this issue 1 year ago • 0 comments

Hi,

I am very impressed with your work on BERT.

Currently I am reproducing the BERT model from scratch for educational purposes. I have finished building the model, but I have a question about preprocessing the data. Note that I am not using the same dataset; instead, I am using the IMDB dataset. I try to emulate your approach as closely as possible.

The case

I treat each review as a document and break each document down into sentences. Since the way the sentences are split seems crucial, I've decided to take the following approach:

  1. In 10 percent of cases, the maximum possible number of words is taken (256 words).
  2. In 80 percent of cases, the text is split on '.', '!', ';', or '?'.
  3. In 10 percent of cases, the text is split at random points.
import random

def split_sentences(text, delimiters=".!?;", max_words=256):
    # Draw a single random number so the three branches are mutually
    # exclusive and the probabilities actually come out to 10% / 80% / 10%.
    # (Calling random.random() separately per branch would skew the split
    # and could fall through all branches, returning None.)
    r = random.random()

    # Split by maximum word count (10% of cases)
    if r < 0.1:
        return split_text_by_maximum_word_count(text, max_words)

    # Split on common punctuation marks (80% of cases)
    if r < 0.9:
        return split_text_by_punctuation_marks(text, delimiters, max_words)

    # Random splitting (10% of cases)
    return random_splitting(text, max_words)
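The three helper functions above are not shown in the issue, so here is a minimal sketch of how they might look; the names match the calls above, but the bodies are assumptions, not the author's actual implementations:

```python
import random
import re

def split_text_by_maximum_word_count(text, max_words):
    # Greedily emit consecutive chunks of up to max_words words each.
    words = text.split()
    return [" ".join(words[i:i + max_words])
            for i in range(0, len(words), max_words)]

def split_text_by_punctuation_marks(text, delimiters, max_words):
    # Split after sentence-ending punctuation (keeping the punctuation),
    # then cap each resulting piece at max_words words.
    pieces = re.split(r"(?<=[" + re.escape(delimiters) + r"])\s+", text)
    result = []
    for piece in pieces:
        words = piece.split()
        for i in range(0, len(words), max_words):
            result.append(" ".join(words[i:i + max_words]))
    return result

def random_splitting(text, max_words):
    # Cut the text at random points, each chunk at most max_words words.
    words = text.split()
    chunks, i = [], 0
    while i < len(words):
        n = random.randint(1, max_words)
        chunks.append(" ".join(words[i:i + n]))
        i += n
    return chunks
```

For example, split_text_by_punctuation_marks("Hello there! How are you? Fine.", ".!?;", 256) would yield the three sentences as separate strings.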

The question:

I would like to know whether my approach is wrong. How did you separate the sentences in your approach?

Thanks

AliHaiderAhmad001 · Oct 17 '23