squeaky-clean: change tasks to not include unicode handling
I noticed in the squeaky-clean problem, there's a test as follows:
// test/java/SqueakyCleanTest.java
@Test
public void string_with_no_letters() {
    assertThat(SqueakyClean.clean("\uD83D\uDE00\uD83D\uDE00\uD83D\uDE00")).isEmpty();
}
However, there's no corresponding instruction to remove "non-standard" characters from the input string, so the test suite defines a different spec than the instructions.
I think the intent of the test is to remove any non-alphanumeric character or underscore from the input string, but I personally feel going too far into the details of Unicode (i.e. what is a "character" anyway?) distracts from the purpose of the exercise and can be discouraging. Perhaps the instructions can be clarified or the test can be removed or ignored.
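To make the Unicode detail concrete, here is a minimal sketch of why a letters-only filter returns an empty string for that input. The class `LetterFilterSketch` and its `keepLetters` method are hypothetical names for illustration, not the exercise's exemplar: each emoji is stored as two `char` values (a surrogate pair), and `Character.isLetter(char)` is false for both halves.

```java
// Hypothetical helper for illustration only; not the exercise's exemplar.
final class LetterFilterSketch {
    // Keeps only the chars for which Character.isLetter(char) returns true.
    static String keepLetters(String input) {
        StringBuilder sb = new StringBuilder();
        for (char c : input.toCharArray()) {
            if (Character.isLetter(c)) {
                sb.append(c);
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String emoji = "\uD83D\uDE00"; // a single emoji, code point U+1F600
        System.out.println(emoji.length());                          // 2 (two chars)
        System.out.println(emoji.codePointCount(0, emoji.length())); // 1 (one code point)
        System.out.println(Character.isLetter(emoji.charAt(0)));     // false
        System.out.println(keepLetters("a" + emoji + "b"));          // ab
    }
}
```

So the test passes only if the solution drops non-letters, which the instructions never ask for explicitly.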
Tasks
After some discussion (see comment below), an agreement was reached to modify the exercise to not include any unicode handling. Here are the tasks to do this:
- [ ] Update the current tasks and their examples to not include non-ascii characters.
- [ ] Change the tests to not use non-ascii characters either.
- [ ] Remove the final task concerning greek letters.
Contributing to this task
- If you'd like to contribute to this task, make a comment below saying that you'd like to work on this issue.
- After that, feel free to make a PR fixing the issue. Don't forget to link the PR to this issue.
@ystromm
When reading the instructions it states:
A valid SqueakyClean name is comprised of zero or more letters and underscores.
This tells me that a valid name does not contain anything other than letters and underscores.
@kotp — I agree with your point and ultimately the test suite defines the specs.
But the intent of the exercise is to teach someone new to Java, and possibly new to programming, about string manipulation, and the details of Unicode distract from that instruction. For example, grokking 'g' < 'v' is much more straightforward than grokking 'Ψ' < '😀'.
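A rough sketch of the difference between those two comparisons, assuming the emoji is U+1F600 as in the test above. `CharOrderingSketch` and `firstCodePointLess` are made-up names. The ASCII comparison works directly on `char` values, while the emoji does not fit in a single `char`, so the comparison has to go through code points instead.

```java
// Hypothetical illustration; names are not from the exercise.
final class CharOrderingSketch {
    // Compares the first code points of two non-empty strings.
    static boolean firstCodePointLess(String a, String b) {
        return a.codePointAt(0) < b.codePointAt(0);
    }

    public static void main(String[] args) {
        // Plain chars compare by their numeric value.
        System.out.println('g' < 'v'); // true (103 < 118)

        // A char literal holding the emoji would not compile, because the
        // emoji needs two chars. Comparing code points works instead.
        String psi = "\u03A8";         // Greek capital letter psi
        String emoji = "\uD83D\uDE00"; // U+1F600 as a surrogate pair
        System.out.println(firstCodePointLess(psi, emoji)); // true (0x3A8 < 0x1F600)
    }
}
```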
I am not the final say, and I think the test makes sense. But I'm not sure about changing the written specification (the description).
The concept taught is char and so "What is a character anyway?" is one of the questions that hopefully is answered by this lesson.
I would also say that once you grok <, every possible example of something < something_else is just as easy to grok.
This undocumented test is part of a larger issue: the stated goal, written tasks, hints, and tests all seem to disagree on what we're trying to accomplish. If the purpose of clean is to produce strings composed of zero or more letters and underscores, why don't we simply strip the other characters? What is the purpose of the replacements? Why is isWhitespace recommended when we're only instructed to replace spaces? Why remove Greek letters when "àḃç" is passed through unaltered?
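The mismatches raised here can be checked against the standard `Character` methods. The snippet below is only a sketch with a hypothetical class and helper name (`CharacterClassSketch`, `allLetters`): `Character.isWhitespace` accepts more than the space character, and `Character.isLetter` accepts accented Latin and Greek letters alike.

```java
// Hypothetical illustration; not part of the exercise.
final class CharacterClassSketch {
    // True if every char in the string is a letter according to Character.isLetter.
    static boolean allLetters(String s) {
        for (char c : s.toCharArray()) {
            if (!Character.isLetter(c)) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        // isWhitespace covers tabs and newlines, not only ' '.
        System.out.println(Character.isWhitespace(' '));  // true
        System.out.println(Character.isWhitespace('\t')); // true
        System.out.println('\t' == ' ');                  // false

        // Accented Latin letters and Greek letters are all "letters".
        System.out.println(allLetters("\u00E0\u1E03\u00E7")); // true (a-grave, b-dot-above, c-cedilla)
        System.out.println(allLetters("\u03B1\u03C9"));       // true (Greek alpha, omega)
    }
}
```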
Moreover, is it really a good idea to introduce Unicode support alongside chars without discussing supplementary characters, especially when there are tests containing surrogates? If "What is a character anyway?" is the question being asked, it isn't being adequately addressed by this exercise. In my opinion, that question is beyond the scope of simple char manipulation.
My apologies if this is outside the scope of the original issue.
If the exercise is to remain in its current state, an additional instruction needs to be added to the README.md. For example: "Omit all other non alphanumeric characters".
possibly new to programming
Just an FYI: teaching "new programmers" is not really a goal. We are not trying to teach people new to programming at exercism. There is (effectively) an expectation that you already understand at least one programming language. Exercism is about teaching fluency - generally so that a programmer in language X can learn language Y and get fluent quickly.
All that being said, the rest of this discussion seems to be somewhat relevant: we appear to be teaching too much at once in this exercise. We probably need to create a separate concept for instruction about things like unicode. The concept exercises are meant to be ones where someone fluent in the language finds it trivial to create the expected solution (i.e. the exemplar).
OK, proposal:
- we simplify squeaky-clean to literally just teach about basic characters (like the letter "A" or a space " ", etc)
- we add a new concept / exercise for dealing with code points and other fine nuances relating to unicode
Changes to this exercise will be greatly appreciated. This is coming from someone trying to use Exercism to further their knowledge of Java. Upon encountering the squeaky-clean exercise, I almost gave up on using Exercism completely.
Thanks for the additional insight. Now we just need someone to contribute such a change. Adding the new concept will probably be its own issue; for this one I think it is enough for us to remove the unicode specific stuff from the existing exercise.
@jmrunkle I can update it after my holidays ;)
For example: "Omit all other non alphanumeric characters".
One angle that I don't think has been touched on here is that alphanumeric in unicode is a massive set. I assume we mean Latin alphanumerics, so basically the ASCII subset minus special chars.
Otherwise agree with @jmrunkle on this:
for this one I think it is enough for us to remove the unicode specific stuff from the existing exercise.
Perhaps even more simply stated as English letters and numbers (and possibly whitespace).
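To illustrate how much wider the Unicode notion of "alphanumeric" is than English letters and numbers, here is a small sketch. `AsciiAlnumSketch` and `isAsciiLetterOrDigit` are hypothetical names and one possible way to express the narrower check, not something the exercise prescribes.

```java
// Hypothetical illustration; the ASCII check is an assumption about
// what "English letters and numbers" would mean in code.
final class AsciiAlnumSketch {
    // True only for a-z, A-Z and 0-9.
    static boolean isAsciiLetterOrDigit(char c) {
        return (c >= 'a' && c <= 'z')
            || (c >= 'A' && c <= 'Z')
            || (c >= '0' && c <= '9');
    }

    public static void main(String[] args) {
        char arabicThree = '\u0663'; // Arabic-Indic digit three
        char omega = '\u03A9';       // Greek capital letter omega

        // The Unicode-aware check accepts both.
        System.out.println(Character.isLetterOrDigit(arabicThree)); // true
        System.out.println(Character.isLetterOrDigit(omega));       // true

        // The ASCII-only check rejects them but accepts plain letters and digits.
        System.out.println(isAsciiLetterOrDigit(arabicThree)); // false
        System.out.println(isAsciiLetterOrDigit(omega));       // false
        System.out.println(isAsciiLetterOrDigit('q'));         // true
        System.out.println(isAsciiLetterOrDigit('7'));         // true
    }
}
```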
I agree we should change this exercise according to what is discussed above. I updated the title and the description with a list of tasks and added labels to increase the visibility of the issue.
This issue has been automatically marked as action/stale because it has not had recent activity. Please update if there are new updates to provide.
I would like to work on this issue. I have already tried to follow the tasks and change these things in the code. I don't know if the changes that I made are sufficient and useful.
@andrerfcsantos looking at the discussion above, I'm wondering whether it makes sense to keep the task about control characters, or to remove that as well. If the goal of this concept exercise is to give a basic introduction of characters, maybe it's best to focus on the Latin alphabet, numbers, whitespace and punctuation, and leave things like control characters, unicode etc for a secondary concept exercise.
Hi @sanderploegsma, I would like to take on this issue. Is the scope to remove the unicode and Greek letters, or do you think this needs a complete reformat?
@manumafe98 sure, go ahead! As I mentioned above, IMO the exercise should only focus on introducing the char type as a concept; it does not have to handle everything there is to know about chars. This can perhaps be covered in another concept ("advanced chars" or something, idk), or it can be covered by one or more practice exercises.
So I'd remove the following aspects from the exercise:
- Control characters
- Unicode
- Greek letters
Looking at the current instructions, that would leave the following tasks:
- Replace any spaces encountered with underscores
- Convert kebab-case to camelCase
- Omit characters that are not letters (where it should focus only on numbers and special characters like punctuation, no emojis or unicode)
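As a rough sketch of what a solution to those three remaining tasks could look like (not the exemplar; `SqueakyCleanSketch` is a hypothetical name, and keeping underscores plus restricting "letters" to ASCII are assumptions on my part):

```java
// Hypothetical sketch of the simplified exercise; not the official exemplar.
final class SqueakyCleanSketch {
    // Assumption: "letters" means ASCII a-z and A-Z only.
    private static boolean isAsciiLetter(char c) {
        return (c >= 'a' && c <= 'z') || (c >= 'A' && c <= 'Z');
    }

    static String clean(String identifier) {
        StringBuilder sb = new StringBuilder();
        boolean upperNext = false; // set after a '-' to build camelCase
        for (char c : identifier.toCharArray()) {
            if (c == ' ') {
                sb.append('_');            // task 1: spaces become underscores
            } else if (c == '-') {
                upperNext = true;          // task 2: drop '-' and capitalize the next letter
            } else if (isAsciiLetter(c)) {
                sb.append(upperNext ? Character.toUpperCase(c) : c);
                upperNext = false;
            } else if (c == '_') {
                sb.append(c);              // assumption: existing underscores are kept
            }
            // task 3: every other char (digits, punctuation, non-ASCII) is omitted
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(clean("my   Id"));    // my___Id
        System.out.println(clean("a-bc"));       // aBc
        System.out.println(clean("a1b!c"));      // abc
    }
}
```

With the scope limited this way, the whole solution stays within plain char comparisons and `Character.toUpperCase`.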