incubator-answer icon indicating copy to clipboard operation
incubator-answer copied to clipboard

fix/slugify-non-ascii

Open hbsciw opened this issue 1 year ago • 3 comments

This pull request fixes an issue where questions with full Persian titles were causing a "many redirect error" when trying to access them through the link. The issue was caused by the slugify library returning a blank string for the title of these questions. With this fix, the slugify library has been updated to return a proper string for the title, resolving the issue and allowing for smooth access to all questions, regardless of their title.

hbsciw avatar Feb 26 '24 15:02 hbsciw

Firstly, thank you for your contribution. There are a few questions below:

  1. Could you tell us some test cases for Persian to help us reproduce the problem?
  2. If we replace the slugify library directly, it may affect the links that are already generated. I haven't actually tested the various scenarios, but it would be a lot of work. My personal suggestion would be to do post processing separately for specific issues.
  3. Do not change the maximum length of the limit, our current maximum is 150
  4. Do not include debug information in the code fmt.Println

LinkinStars avatar Feb 27 '24 02:02 LinkinStars

@LinkinStars Thank you for response. I solved 3 and 4.

  1. A post with this title has many redirect problem. title= چگونه می‌توانم مساله گزارش شده را ارزبای کنم؟ And post with this title hasn't problem. title = چگونه می‌توانم Bug گزارش شده را ارزیابی کنم؟

  2. I think since the slugs aren't saved, it won't be a problem. I also tested a few things and it didn't seem to be a problem. Have you any specific scenario in your mind? Can you tell us your exact solution?

hbsciw avatar Feb 27 '24 07:02 hbsciw

@LinkinStars Thank you for response. I solved 3 and 4.

  1. A post with this title has many redirect problem. title= چگونه می‌توانم مساله گزارش شده را ارزبای کنم؟ And post with this title hasn't problem. title = چگونه می‌توانم Bug گزارش شده را ارزیابی کنم؟
  2. I think since the slugs aren't saved, it won't be a problem. I also tested a few things and it didn't seem to be a problem. Have you any specific scenario in your mind? Can you tell us your exact solution?

Thank you for the examples. Firstly, you're right that using Persian as a title does cause problems and can lead to redirection errors. The reason is because the url address of the title is empty after formatting. After slugify.Slugify(title) the title is empty.

I have two better solutions to this problem that we can discuss.

solution 1

I don't know if you noticed that there is a convertChinese method in the UrlTitle method. Yes, in fact, Chinese has the same problem. It then relies on this method to identify and convert to English. So maybe we can write a similar method to handle this.

solution 2

You are right, the address behind this url is not saved. It is also only for SEO optimisation. If the language can't be converted to English on the URL, then it doesn't really make a lot of sense for SEO. So I would recommend using a shorter address for access to avoid this problem.

image

BTW

By the way, you don't really have to abandon slugify to achieve the capabilities you need. You can just replace the IsValidCharacterChecker method. Like in the example below.

package main

import (
	"fmt"
	"strings"
	"unicode"

	"github.com/Machiel/slugify"
)

func IsValidCharacterChecker(r rune) bool {
	// TODO: Implement your own character validation
	return true
}

func main() {
	originalTitle := " چگونه می‌توانم مساله گزارش شده را ارزبای کنم؟"
	slug := slugify.New(slugify.Configuration{IsValidCharacterChecker: IsValidCharacterChecker})
	title := slug.Slugify(originalTitle)
	fmt.Println(title)
}

LinkinStars avatar Feb 28 '24 09:02 LinkinStars

https://github.com/apache/incubator-answer/blob/9681c026adfe4ddd0c8ae158b2f95e76e414df9f/pkg/htmltext/htmltext.go#L94

Our final solution is to use a fixed description(topic) when the language cannot be parsed.

LinkinStars avatar May 06 '24 09:05 LinkinStars