UTF-8: Was it worth it?

Why do we need UTF-8? Why wasn't ASCII enough? Did it just introduce needless complexity?

Emojis and support for more languages (but emojis are more important, of course)

...

>Why wasn't ASCII enough?
Actually, KOI-8R should be the international standard.

>Emoji
Americans ruin another Japanese invention.

How else would there be support for crashing iphones remotely by sending a single character?

And what about windows-1251?

And the Japanese ones are so much better too.

Too mainstream.

it was either UTF-8 or eliminating 99% of the population

>Emojis
They are called smileys you brain-dead millennial. GTFO.

Fuck off grandpa

Emojis...even though they are cancer, they could've just been encoded using ASCII. Every developer agrees :) = smiley.gif. Plus you've got the benefit of elegant fallback.

Languages. Are there really keyboards that type in squiggly brush-stroke hieroglyphic languages? Seems awfully indulgent that those with intelligent language systems must subsidize archaic, badly designed caveman writing.

Retard.

There have been 8 updates already; the current version is UTF-16

>not using bleeding edge UTF-32
it's like asking for your software to become obsolete in a few months

>thinking that one writing system can accommodate every possible language
Grzegorz Brzęczyszczykiewicz is what happens if you try to retardedly force a writing system onto a language that doesn't suit it.

>tfw using greek alphabet daily
>tfw seeing non-ascii characters in plain english texts all the time
I bet you retard enjoyed windows encoding tables

Apparently they exist

zażółć gęślą jaźń

i like my accents thank you

this

*tips fedora* now that's how a neet should talk

Stupid gaijin...

>Seems awfully indulgent that those with intelligent language systems must subsidize archaic, badly designed caveman writing.
Yeah, but the latin script is the standard for some idiotic reason.

UTF-8? Sure.
UTF-16? Fuck no.

Pidorakhi aren't people.

Ascii was not enough.

actually here's a good video on it OP

youtube.com/watch?v=MijmeoH9LT4

Suck a dick, pleb!

ASCII can't even encode English punctuation properly.
“ and ” are not the same as "
’ and ‘ are not the same as '
— is not the same as -
… is not the same as ...
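These really are distinct codepoints that a 7-bit encoding cannot hold — a quick sketch in Python, in the style of the interpreter snippets used elsewhere in the thread:

```python
# Typographic punctuation lives above ASCII's 0-127 range,
# so a 7-bit encoding simply cannot represent it.
for ch in '“”‘’—…':
    print(f"U+{ord(ch):04X} needs {len(ch.encode('utf-8'))} bytes in UTF-8")

# The ASCII stand-ins all fit in 7 bits:
print([ord(c) for c in '"\'-.'])
```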

>his pathetic earth computers can't process ancient antarean, high psician or meta-cassiopeian neurolinear script

UTF-128 is currently being ratified and will release with support for emojis in every single potential shade of human skin colour (but without any other racially defining features), oh and of course twelve-thousand different variations of the homosexual flag.

UTF-16 is more efficient than UTF-8 in some cases.
スマートカエルのポスター。
In [2]: len("スマートカエルのポスター。".encode('UTF-8'))
Out[2]: 39

In [3]: len("スマートカエルのポスター。".encode('UTF-16'))
Out[3]: 28
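For what it's worth, the tradeoff reverses for ASCII-heavy text — a sketch (the English sentence is just a made-up comparison string):

```python
ja = "スマートカエルのポスター。"   # 13 chars, 3 UTF-8 bytes each
en = "Poster of a smart frog."       # 23 ASCII chars

# Note: Python prepends a 2-byte BOM when encoding to bare 'UTF-16'.
print(len(ja.encode('utf-8')), len(ja.encode('utf-16')))   # 39 vs 28
print(len(en.encode('utf-8')), len(en.encode('utf-16')))   # 23 vs 48
```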

There's no need for UTF-128 because UTF-8 and UTF-16 are both of variable width.

They are called "emoticons", you monglers.

>en.wikipedia.org/wiki/Emoticon

I prefer the later.

keep in mind you only preserve ameriburgers

Please stop.

Did you just get hipster over character encodings? Anon, try going outside and getting some fresh air.

>Grzegorz Brzęczyszczykiewicz is what happens if you try to retardedly force a writing system on to a language that doesn't suit it,
Elaborate.

UTF-8, and all variable-length encodings, are cancer. UTF-32 should be enough.

Slavic languages have sounds which can be hard to express in vanilla Latin; Cyrillic is much more suitable - Гpзeгopз Бpeчишкeвич reads much more naturally. I don't know why Polish uses digraphs instead of specialized letters, though.

apparently one 2-byte UTF-16 code unit can encode more symbols than a 1-3 byte UTF-8 sequence? Holy crap.

That's true for everything that fits into the BMP except for Latin, Greek and Cyrillic script. If you're using ASCII characters in Japanese text, you're at a disadvantage against Shift-JIS with either Unicode encoding. Unicode originally was a bigger issue to the Western world which only had used several incompatible 7-bit and 8-bit codes before.

The main advantages of UTF-8 are endian safety and self-synchronization (a byte value always has a definite position in a multi-byte sequence, and you can tell immediately whether input is valid UTF-8).
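That self-synchronization property can be sketched in a few lines: continuation bytes always match the bit pattern 10xxxxxx, so any single byte tells you whether it starts a character.

```python
def next_boundary(data: bytes, i: int) -> int:
    """Scan forward from offset i to the nearest UTF-8 character boundary."""
    while i < len(data) and (data[i] & 0xC0) == 0x80:  # skip continuation bytes
        i += 1
    return i

data = "héllo".encode("utf-8")   # b'h\xc3\xa9llo'
print(next_boundary(data, 2))    # offset 2 is inside 'é'; next boundary is 3
print(next_boundary(data, 1))    # offset 1 is the lead byte of 'é', already a boundary
```

A decoder dropped into the middle of a UTF-8 stream loses at most one character; with UTF-16 or UTF-32, a single lost byte can desynchronize everything that follows.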

No, it's not. It sounds ridiculous.

elaborate
It's either a mess of language-specific digraphs or a mess of language-specific diacritics. You might as well make a new writing system if you have to make so many modifications.

How to write ę or ą to be spelled properly in Cyrillic?

>Гpзeгopз Бpeчишкeвич
>Slavic languages
Generalization.

Gr-ziegor-z Briecziszkiewicz. Cyrillic bullshit.

West Slavic bullshit.

>goal of utf is to eliminate duplicate chars from multiple code pages.
>list whitespace in utf-8

U+0020 SPACE
U+00A0 NO-BREAK SPACE
U+1680 OGHAM SPACE MARK
U+180E MONGOLIAN VOWEL SEPARATOR
U+2000 EN QUAD
U+2001 EM QUAD
U+2002 EN SPACE
U+2003 EM SPACE
U+2004 THREE-PER-EM SPACE
U+2005 FOUR-PER-EM SPACE
U+2006 SIX-PER-EM SPACE
U+2007 FIGURE SPACE
U+2008 PUNCTUATION SPACE
U+2009 THIN SPACE
U+200A HAIR SPACE
U+202F NARROW NO-BREAK SPACE
U+205F MEDIUM MATHEMATICAL SPACE
U+3000 IDEOGRAPHIC SPACE
U+2028 LINE SEPARATOR
U+2029 PARAGRAPH SEPARATOR
U+0009 CHARACTER TABULATION
U+000A LINE FEED (LF)
U+000B LINE TABULATION
U+000C FORM FEED (FF)
U+000D CARRIAGE RETURN (CR)

>Why wasn't ASCII enough?

Because Western Culture needs to be eradicated.

It might make sense to you as someone who has grown up in such an environment, much like how an English speaker accepts the weird way the language is spelled, but digraphs and diacritics are just a clunky way of forcing Latin onto a language that clearly has more sounds than Latin can handle.

I also don't know what we need an encoding with more than 8 bits for.

But I can write zażółć gęślą jaźń with this bullshit.

ѧ and ѫ, probably.
These letters were used for nasal vowels.

>“ and ” are not the same as "
>’ and ‘ are not the same as '
>— is not the same as -
>… is not the same as ...

actually the punctuation overproliferation is fucking cancer

>there are different signs for minus (−), hyphen (‐), hyphen-minus (-), fullwidth hyphen-minus (－), hard hyphen (‑), en dash (–), em dash (—), figure dash (‒), horizontal bar (―) etc.
>nobody uses them correctly
>they're spammed all over the place breaking code
>no human eye can distinguish them w/o comparing

what a fucking nightmare

>Elaborate.

pl.wikipedia.org/wiki/Grzegorz_Brzęczyszczykiewicz

Use google translate.

>I don't know why Polish is using digraphs instead of specialized letters tho.

Because when we were baptised and embraced Latin we once and for all decided that we belong to THE WEST.

This has been settled in the X century.

>we decided
lol nobody asked you

It's a complete nightmare. Most brainlets on this board, though, just get as far as calling the right PHP function.

Imagine if they had to implement a text search / replace function in an actual programming language. Suddenly they would understand.

>reads much more naturally.
Because it's what you do everyday.
If you spent your entire life wiping your ass with your bare hand, toilet paper might sound ridiculous despite easily being the more hygienic and objectively superior method.
There's no merit in having your own special snowflake writing system if exactly the same can be achieved with a slightly modified common system.

Besides, what you wrote reads Grzegorz Brecziszkewicz, not Grzegorz Brzęczyszczykiewicz.
You omitted half of the characters and pretended your system is somehow the superior one, like the fucking retard you are.
Cyrillic has no symbols ą, ę, ń, ł, ó, ć and ź.
Cyrillic doesn't accommodate h/ch or u/ó.


Still no idea what you mean. Are tongue twisters unique to Polish?

Who else did? It's not like the Czech king stormed the land, grabbed pagan Mieszko by his clothes and ordered him to marry his daughter OR ELSE.

actually I'm the Polish poster here (or maybe there's more of us; we're like rabbits, basically everywhere)

>ordered him to marry his daughter OR ELSE.

yeah but Mieszko and co. knew this 'or else' was coming sooner or later

they were wise enough to join the gang early, before the gang turned its eyes on our marshlands for a gang bang

I still use this term too but it doesn't mean anything now

Spurdöländ wöuld be very very säd withöut UTF-8

>hieroglyphic languages
It's a problem for all languages that have more distinct sounds than English. ASCII is therefore insufficient for most languages using the Latin alphabet. German is missing ü, ö, ä, ß; French à, â, ç, ô, è, é, etc.

Emoticons, emojis and kaomojis are different things you cockmonglers.

Emoticons are composed of multiple characters (usually ASCII), kaomoji are weeb emoticons, and emojis are single Unicode characters.

Which are duplicate there?

All the meme letters needed by the language of the West Mongolians (and pretty much the rest of western civilization) are already in ISO-8859-15. It doesn't leave any room for funnies, though.

>Present text to user using "Hello" + U00A0 + "World"
>User searches for "Hello World"
Result not found.

Needless complexity would be dealing with random encodings for different languages, UTF-8 solved it nicely. Even if you believe that non-English speakers should be euthanized, Unicode is still useful for things like math symbols.

>UTF-32
You can encode every possible Unicode codepoint with 21 bits, so something like "UTF-21" would be more efficient for text files (21-bit values would be stored as 32-bit ints in memory anyway, but it would save space on disk).
Constant-width encodings aren't as useful as you think anyway, since some emojis consist of a modifier plus the emoji, meaning you can't get the length of the text just by getting the length of the array, nor can you simply reverse the text by reversing the array of character data.
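This is easy to demonstrate: even counting codepoints (which a fixed-width encoding gives you for free) doesn't match what the user perceives, and naive reversal tears combining sequences apart. A sketch:

```python
s = "cafe\u0301"    # "café" written as 'e' + COMBINING ACUTE ACCENT
print(len(s))       # 5 codepoints, though the user sees 4 characters
print(s[::-1])      # naive reversal detaches the accent from its base letter
print(s[::-1][0] == "\u0301")   # the combining mark now leads the string
```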

Those aren't duplicate characters according to the Unicode standard. You may think that's retarded, but it would still be Unicode's fault and not UTF-8's fault.

/Hello\sWorld

>using a no-break-space instead of a space
You're retarded, I see.

Data presented to user was input by other user.

Fair point regarding UTF/Unicode separation but its a bit like TCP/IP separation. Is UTF-8 commonly used without Unicode?

The point still stands, retardinho. Those characters are extremely rare. No one enters them by accident. Your example is as likely to happen as someone writing "Hello°World" and then some other retard searching for "Hello World" and not getting any search results. Those characters have their uses. Just because you're a little retard who never needs something doesn't mean the rest of the world doesn't need it.
Your retardation is on the same level as this guy's retarded pretension.

>Is UTF-8 commonly used without Unicode?
What does that even mean?

>Is UTF-8 commonly used without Unicode?
UTF literally stands for "Unicode Transformation Format", so no, it isn't. I'm just saying it's a pretty efficient way of encoding Unicode that is backwards compatible with ASCII, so it shouldn't get the blame for Unicode's faults (which aren't really faults anyway, as those spaces serve different purposes).

Wow, PHP programmers are slow. Don't worry Brad, I'll break it down for you.
>User 1 copies and paste something from web into text entry box for application.
>Hits save
>Application dutifully saves "Hello+U00A0+World" into the db, because that's what was copied from the webpage.
>User 2 searches for "Hello World"
>No results found.

You are very very slow on the uptake. Get a different job.

Exactly my point Brad.

You just don't get it, do you? He pasted the text with the character deliberately. "Hello + non-breaking-space + World" isn't the same as "Hello + space + World". Those characters have their use. Just because some initial retard used the wrong character, or someone "maliciously" used the wrong one, doesn't mean the standard (Unicode) is broken.

Clearly if you care that much, you'd get a list of all the whitespace characters and perform checks on it if you want to alter a user's intent. You are a smart programmer, surely you know how to do that.
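Folding whitespace before indexing really is a few lines in most languages — a sketch in Python, assuming collapsing every Unicode whitespace run to a plain space is acceptable for the application:

```python
import re

def fold_whitespace(s: str) -> str:
    # In Python 3, \s in a str pattern already matches Unicode whitespace
    # (NBSP, thin space, ideographic space, ...), not just ASCII blanks.
    return re.sub(r"\s+", " ", s)

print(fold_whitespace("Hello\u00A0World") == "Hello World")   # True
print(fold_whitespace("Hello\u3000World") == "Hello World")   # True
```

Apply it to both the stored text and the query, and "Hello World" matches "Hello" + NBSP + "World" without altering what the user actually saved.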

For those of us who hate romaji

> if you care that much, you'd get a list of all the whitespace characters and perform checks on it
The world should do endless work to accommodate my bad designs

t. Poettering

>later
You would be dead too, probably

The user thought it was "Hello World". How would the user visually be able to differentiate between Hello+Space+World and Hello+NBSP+World?

He wouldn't. That's why he should check his sources or use the proper characters when he writes things himself. Those different whitespaces have a purpose, just like other similar-looking characters do. The Latin small letter q looks identical to the Cyrillic letter q. What are you going to do when someone pastes text with that? Are you going to eliminate all the letters that look like a Latin lowercase q?
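The homoglyph problem is real, and normalization doesn't fix it, since the characters belong to different scripts — a sketch using the classic Latin/Cyrillic 'a' pair as the example:

```python
import unicodedata

latin, cyr = "a", "\u0430"    # visually identical in most fonts
print(latin == cyr)                                # False
print(unicodedata.name(cyr))                       # CYRILLIC SMALL LETTER A
# NFKC folds compatibility variants, but never unifies across scripts:
print(unicodedata.normalize("NFKC", cyr) == latin)   # False
```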

>has no arguments
>resorts to using g memes as arguments
Consider euthanizing or educating yourself. One of those is hopeless for you, the other might be legally outlawed.

>Latin small letter q looks identical to the Cyrillic letter q
"UNIcode" was supposed to solve exactly that issue, namely end all the redundant chars that existed in previous codepage mappings. So Latin-based languages could share their common letters and only encode the umlauts for germans, accents for frenchies, etc.
Now every language must have a "q" and 20 different whitespaces. Oh well, at least pajeet can write "show me big vagana" in native scribble.

because muh 卐

>"UNIcode" was supposed to solve exactly that issue, namely end all the redundant chars that existed in previous codepage mappings.
Bullshit. It was created because there was no way to cover all of the characters from different languages and display them reliably. It literally says so on the Unicode website [0], but you probably know better than the people who created it. Unless you are a person who created or contributed to the Unicode standard. In that case, correct your website.

[0]unicode.org/standard/WhatIsUnicode.html

>For German ü, ö, ä, ß are missing
Not the greatest example though, really, since those characters are literally just abbreviations of ue, oe, ae, and ss.

this, really nice video, plz watch

That's almost harmless. What is particularly fun is the combining characters where ä is not the same as ä. Of course there are standard functions for canonicalization of combining sequences, but that requires pulling in megabytes of Unicode data tables into your program, and at least assumes that everyone else your program communicates with uses the same canonicalization scheme (hint: there are four of them).
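In Python stdlib terms (the four canonicalization schemes the post refers to are NFC, NFD, NFKC and NFKD):

```python
import unicodedata

composed   = "\u00E4"     # ä as one codepoint
decomposed = "a\u0308"    # 'a' followed by COMBINING DIAERESIS

print(composed == decomposed)                                  # False
print(unicodedata.normalize("NFC", decomposed) == composed)    # True
print(unicodedata.normalize("NFD", composed) == decomposed)    # True
```

Two strings that render identically compare unequal until both sides agree on a normalization form, which is exactly the interoperability trap described above.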

>in every single potential shade of human skin colour (but without any other racially defining features)

>Why do we need UTF-8? Why wasn't ASCII enough? Did it just introduce needless complexity?
Because most people don't use a butchered Latin alphabet.

Have fun with different byte orderings, though.

>implying compression algorithms wouldn't do a better job

>elegant fallback
What we had before browsers started replacing Unicode emoji with vector graphics was fine. They looked good.

>Why do we need [Unicode]?
To represent other languages using a single standard.
>Why wasn't ASCII enough?
Not enough codepoints for many languages.
>Did it just introduce needless complexity?
Unicode is as complex as you want it to be. If you want text processing, you need to know how to decode; if you want ordering, or to know what a given character means, you need to take the locale of the text into consideration, and so on.

Extra comments:
There was a big mistake in Unicode, and that was adding Webdings, Wingdings and other symbols before CJK and other scripts that are currently in use in the real world. Many characters of these scripts would fit under 0xFFFF and the rest would need just a third byte. With the dumb addition of the symbols, a huge chunk of the characters of those scripts require 3, 4 or more bytes.

Should I not be able to type © or ≈?

Come on, user.