java squeaky-clean: change tasks to not include unicode handling

I noticed in the squeaky-clean problem, there's a test as follows:

//  test/java/SqueakyCleanTest.java

    @Test
    public void string_with_no_letters() {
        assertThat(SqueakyClean.clean("\uD83D\uDE00\uD83D\uDE00\uD83D\uDE00")).isEmpty();
    }

However, there's no corresponding instruction to remove "non-standard" characters from the input string, so the test suite defines a different spec than the instructions.

I think the intent of the test is to remove any non-alphanumeric character or underscore from the input string, but I personally feel going too far into the details of Unicode (i.e. what is a "character" anyway?) distracts from the purpose of the exercise and can be discouraging. Perhaps the instructions can be clarified or the test can be removed or ignored.
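For context, here is a minimal sketch of how Java sees that test input. The class and method names are hypothetical and not part of the exercise; only the string literal comes from the test above.

```java
public class EmojiInput {
    // The input used by the string_with_no_letters test: three copies of U+1F600.
    static final String INPUT = "\uD83D\uDE00\uD83D\uDE00\uD83D\uDE00";

    // Number of UTF-16 code units (Java chars) in the string.
    static int charCount(String s) {
        return s.length();
    }

    // Number of Unicode code points in the string.
    static int codePointCount(String s) {
        return s.codePointCount(0, s.length());
    }

    public static void main(String[] args) {
        System.out.println(charCount(INPUT));                        // 6
        System.out.println(codePointCount(INPUT));                   // 3
        System.out.println(Character.isSurrogate(INPUT.charAt(0)));  // true
    }
}
```

Each emoji occupies two chars, so a solution that walks the string char by char only ever sees surrogate halves, never "an emoji".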

Tasks

After some discussion (see the comments below), we agreed to modify the exercise so that it no longer includes any Unicode handling. Here are the tasks to do this:

[ ] Update the current tasks and their examples to not include non-ASCII characters.
[ ] Change the tests to not use non-ASCII characters either.
[ ] Remove the final task concerning Greek letters.

Contributing to this task

If you'd like to contribute to this task, leave a comment below saying that you'd like to work on this issue.
After that, feel free to open a PR fixing the issue. Don't forget to link the PR to this issue.

Oct 23 '21 17:10 jaywritescode

@ystromm

Oct 27 '21 17:10 ericbalawejder

The instructions state:

A valid SqueakyClean name is comprised of zero or more letters and underscores.

This tells me that a valid name consists of zero or more letters and underscores, and therefore does not contain anything other than letters and underscores.

Oct 27 '21 17:10 kotp

@kotp, I agree with your point, and ultimately the test suite defines the specs.

But the intent of the exercise is to teach someone new to Java, and possibly new to programming, about string manipulation, and the details of Unicode distract from that instruction. For example, grokking 'g' < 'v' is much more straightforward than grokking 'Ψ' < '😀'.
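A rough sketch of that difference, with hypothetical class and method names: the ASCII comparison is a one-liner on char, while the emoji cannot even be written as a char literal and has to be read as an int code point.

```java
public class CharOrdering {
    // Comparing two ASCII letters is a plain comparison of their char values.
    static boolean gBeforeV() {
        return 'g' < 'v';
    }

    // Greek capital psi (U+03A8) fits in a single char.
    static int psi() {
        return '\u03A8';
    }

    // The grinning-face emoji (U+1F600) does not fit in a char, so it has to
    // be read from a String as an int code point.
    static int grinningFace() {
        return "\uD83D\uDE00".codePointAt(0);
    }

    public static void main(String[] args) {
        System.out.println(gBeforeV());                            // true
        System.out.println(psi() < grinningFace());                // true
        System.out.println(Integer.toHexString(grinningFace()));   // 1f600
    }
}
```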

Oct 28 '21 16:10 jaywritescode

I do not have the final say, but I think the test makes sense. I am not sure about changing the written specification (the description), though.

The concept taught is char and so "What is a character anyway?" is one of the questions that hopefully is answered by this lesson.

I would also say that once you grok <, every possible example of something < something_else is equally easy to grok.

Oct 28 '21 17:10 kotp

This undocumented test is part of a larger issue: the stated goal, written tasks, hints, and tests all seem to disagree on what we're trying to accomplish. If the purpose of clean is to produce strings composed of zero or more letters and underscores, why don't we simply strip the other characters? What is the purpose of the replacements? Why is isWhitespace recommended when we're only instructed to replace spaces? Why remove Greek letters when "àḃç" is passed through unaltered?
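To make two of those mismatches concrete, here is a small sketch. The isSpace helper and the class name are hypothetical; Character.isWhitespace and Character.isLetter are the standard library methods.

```java
public class WhitespaceVsSpace {
    // Matches only the ASCII space character, as the written task describes.
    static boolean isSpace(char c) {
        return c == ' ';
    }

    public static void main(String[] args) {
        char[] samples = {' ', '\t', '\n'};
        for (char c : samples) {
            // Prints the numeric value so tabs and newlines stay readable.
            System.out.println((int) c + ": isSpace=" + isSpace(c)
                    + ", isWhitespace=" + Character.isWhitespace(c));
        }
        // Character.isLetter accepts accented Latin and Greek letters alike.
        System.out.println(Character.isLetter('\u00E0')); // a with grave accent
        System.out.println(Character.isLetter('\u03B1')); // Greek small alpha
    }
}
```

A tab or newline passes isWhitespace but is not a space, so a solution that follows the hint replaces more characters than the instructions ask for.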

Moreover, is it really a good idea to introduce Unicode support alongside chars without discussing supplementary characters, especially when there are tests containing surrogates? If "What is a character anyway?" is the question being asked, it isn't being adequately addressed by this exercise. In my opinion, that question is beyond the scope of simple char manipulation.
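As an illustration of the supplementary-character problem, the sketch below counts letters two ways. The names are hypothetical, and U+1D400 (MATHEMATICAL BOLD CAPITAL A) is just one example of a letter outside the Basic Multilingual Plane.

```java
public class LetterCounting {
    // Counts letters by looking at each UTF-16 code unit (char) separately.
    static int lettersByChar(String s) {
        int count = 0;
        for (int i = 0; i < s.length(); i++) {
            if (Character.isLetter(s.charAt(i))) {
                count++;
            }
        }
        return count;
    }

    // Counts letters by looking at whole code points.
    static int lettersByCodePoint(String s) {
        return (int) s.codePoints().filter(Character::isLetter).count();
    }

    public static void main(String[] args) {
        // "A" followed by U+1D400, which is stored as a surrogate pair.
        String sample = "A" + new String(Character.toChars(0x1D400));
        System.out.println(lettersByChar(sample));      // 1
        System.out.println(lettersByCodePoint(sample)); // 2
    }
}
```

The char-based loop misses the second letter because neither surrogate half is a letter on its own, which is exactly the kind of detail the exercise never explains.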

My apologies if this is outside the scope of the original issue.

Nov 02 '21 05:11 njhanley

If the exercise is to remain in its current state, an additional instruction needs to be added to the README.md. For example: "Omit all other non alphanumeric characters".

Nov 24 '21 10:11 sonro

possibly new to programming

Just an FYI: teaching "new programmers" is not really a goal. We are not trying to teach people new to programming at Exercism. There is (effectively) an expectation that you already understand at least one programming language. Exercism is about teaching fluency, generally so that a programmer in language X can learn language Y and get fluent quickly.

All that being said, the rest of this discussion seems to be relevant: we appear to be teaching too much at once in this exercise. We probably need to create a separate concept for things like Unicode. Concept exercises are meant to be trivial for someone who is fluent in the language, in the sense that they could easily write the expected solution (i.e. the exemplar).

Dec 04 '21 19:12 jmrunkle

OK, proposal:

We simplify squeaky-clean to literally just teach about basic characters (like the letter "A" or a space " ", etc.).
We add a new concept / exercise for dealing with code points and other fine nuances relating to Unicode.

Dec 10 '21 03:12 jmrunkle

Changes to this exercise will be greatly appreciated. This is coming from someone trying to use Exercism to further their knowledge of Java. Upon encountering the squeaky-clean exercise, I almost gave up on using Exercism completely.

Dec 29 '21 15:12 ericjobrien

Thanks for the additional insight. Now we just need someone to contribute such a change. Adding the new concept will probably be its own issue, for this one I think it is enough for us to remove the unicode specific stuff from the existing exercise.

Dec 29 '21 16:12 jmrunkle

@jmrunkle I can update it after my holidays ;)

Dec 30 '21 09:12 AlbusPortucalis

For example: "Omit all other non alphanumeric characters".

One angle that I don't think has been touched on here is that alphanumeric in unicode is a massive set. I assume we mean Latin alphanumerics, so basically the ASCII subset minus special chars.
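A quick sketch of how far apart the two definitions are. The isAsciiAlphanumeric helper and the class name are hypothetical; Character.isLetterOrDigit is the standard library method.

```java
public class AlphanumericScope {
    // Accepts only the ASCII letters a-z, A-Z and the digits 0-9.
    static boolean isAsciiAlphanumeric(char c) {
        return (c >= 'a' && c <= 'z')
                || (c >= 'A' && c <= 'Z')
                || (c >= '0' && c <= '9');
    }

    public static void main(String[] args) {
        // a, 7, e with acute accent, Greek capital omega, Arabic-Indic digit three
        char[] samples = {'a', '7', '\u00E9', '\u03A9', '\u0663'};
        for (char c : samples) {
            System.out.println(Integer.toHexString(c)
                    + ": unicode=" + Character.isLetterOrDigit(c)
                    + ", ascii=" + isAsciiAlphanumeric(c));
        }
    }
}
```

All five samples count as alphanumeric under Character.isLetterOrDigit, but only the first two pass the ASCII check.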

Otherwise agree with @jmrunkle on this:

for this one I think it is enough for us to remove the unicode specific stuff from the existing exercise.

Dec 30 '21 19:12 barthon-b

Perhaps even more simply stated as English letters and numbers (and possibly whitespace).

Dec 30 '21 19:12 jmrunkle

I agree we should change this exercise according to what is discussed above. I updated the title and the description with a list of tasks and added labels to increase the visibility of the issue.

Jun 19 '22 10:06 andrerfcsantos

This issue has been automatically marked as action/stale because it has not had recent activity. Please update if there are new updates to provide.

Sep 18 '22 04:09 github-actions[bot]

I would like to work on this issue. I have already tried to follow the tasks and change these things in the code. I don't know whether the changes that I made are sufficient and useful.

May 06 '23 09:05 GitteV-2159432

@andrerfcsantos looking at the discussion above, I'm wondering whether it makes sense to keep the task about control characters, or to remove that as well. If the goal of this concept exercise is to give a basic introduction to characters, maybe it's best to focus on the Latin alphabet, numbers, whitespace and punctuation, and leave things like control characters and Unicode for a secondary concept exercise.

Sep 21 '23 11:09 sanderploegsma

Hi @sanderploegsma, I would like to take on this issue. Is the scope to remove the Unicode handling and the Greek letters, or do you think this needs a complete rework?

Jan 24 '24 15:01 manumafe98

@manumafe98 sure, go ahead! As I mentioned above, IMO the exercise should only focus on introducing the char type as a concept; it does not have to handle everything there is to know about chars. This can perhaps be covered in another concept ("advanced chars" or something, idk), or it can be covered by one or more practice exercises.

So I'd remove the following aspects from the exercise:

Control characters
Unicode
Greek letters

Looking at the current instructions, that would leave the following tasks:

Replace any spaces encountered with underscores
Convert kebab-case to camelCase
Omit characters that are not letters (where it should focus only on numbers and special characters like punctuation, no emojis or unicode)
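One possible reading of those three remaining tasks, as a rough sketch only. This is not the exercise's exemplar: the class name is hypothetical, and two choices are assumptions made here, namely that "letters" means ASCII letters and that underscores already present in the input are kept.

```java
public class SqueakyCleanSketch {
    // ASSUMPTION: "letter" means an ASCII letter only.
    static boolean isAsciiLetter(char c) {
        return (c >= 'a' && c <= 'z') || (c >= 'A' && c <= 'Z');
    }

    static String clean(String identifier) {
        StringBuilder result = new StringBuilder();
        boolean upperNext = false; // set after a '-' to build camelCase
        for (char c : identifier.toCharArray()) {
            if (c == ' ' || c == '_') {
                // Task 1: spaces become underscores.
                // ASSUMPTION: existing underscores are kept as they are.
                result.append('_');
            } else if (c == '-') {
                // Task 2: drop the hyphen and capitalize the next letter.
                upperNext = true;
            } else if (isAsciiLetter(c)) {
                result.append(upperNext ? Character.toUpperCase(c) : c);
                upperNext = false;
            }
            // Task 3: everything else (digits, punctuation, emoji halves) is omitted.
        }
        return result.toString();
    }

    public static void main(String[] args) {
        System.out.println(clean("my   Id"));  // my___Id
        System.out.println(clean("a-bc"));     // aBc
        System.out.println(clean("a1-b!"));    // aB
    }
}
```

With this reading, the emoji input from the original test still produces an empty string, but only because surrogate halves are not ASCII letters, not because of any Unicode-specific logic.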

Jan 26 '24 08:01 sanderploegsma
