UTF-8: Was it worth it?

Why do we need UTF-8? Why wasn't ASCII enough? Did it just introduce needless complexity?

Emojis and support for more languages (but emojis are more important, of course)

...

>Why wasn't ASCII enough?
Actually, KOI-8R should be the international standard.

>Emoji
Americans ruin another Japanese invention.

How else would there be support for crashing iphones remotely by sending a single character?

And what about windows-1251?

And the Japanese ones are so much better too.

Too mainstream.

it was either UTF-8 or eliminating 99% of the population

>Emojis
They are called smileys you brain-dead millennial. GTFO.

Fuck off grandpa

Emojis...even though they are cancer, they could've just been encoded using ASCII. Every developer agrees :) = smiley.gif. Plus you've got the benefit of elegant fallback.

Languages. Are there really keyboards that type in squiggly brush-stroke hieroglyphic languages? Seems awfully indulgent that those with intelligent language systems must subsidize archaic, badly designed caveman writing.

Retard.

There have been 8 updates already; the current version is UTF-16

>not using bleeding edge UTF-32
it's like asking for your software to become obsolete in a few months

>thinking that one writing system can accommodate every possible language
Grzegorz Brzęczyszczykiewicz is what happens if you try to retardedly force a writing system onto a language that doesn't suit it.

>tfw using greek alphabet daily
>tfw seeing non-ascii characters in plain english texts all the time
I bet you retard enjoyed windows encoding tables

Apparently they exist

zażółć gęślą jaźń

i like my accents thank you

this

*tips fedora* now that's how a neet should talk

Stupid gaijin...

>Seems awfully indulgent that those with intelligent language systems must subsidize archaic, badly designed caveman writing.
Yeah, but the latin script is the standard for some idiotic reason.

UTF-8? Sure.
UTF-16? Fuck no.

Pidorakhi aren't people.

Ascii was not enough.

actually here's a good video on it OP

youtube.com/watch?v=MijmeoH9LT4

Suck a dick, pleb!

ASCII can't even encode English punctuation properly.
“ and ” are not the same as "
’ and ‘ are not the same as '
— is not the same as -
… is not the same as ...
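These really are distinct codepoints that a 7-bit encoding cannot hold — a quick sketch in Python, in the style of the interpreter snippets used elsewhere in the thread:

```python
# Typographic punctuation lives above ASCII's 0-127 range,
# so a 7-bit encoding simply cannot represent it.
for ch in '“”‘’—…':
    print(f"U+{ord(ch):04X} needs {len(ch.encode('utf-8'))} bytes in UTF-8")

# The ASCII stand-ins all fit in 7 bits:
print([ord(c) for c in '"\'-.'])
```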

>his pathetic earth computers can't process ancient antarean, high psician or meta-cassiopeian neurolinear script

UTF-128 is currently being ratified and will release with support for emojis in every single potential shade of human skin colour (but without any other racially defining features), oh and of course twelve-thousand different variations of the homosexual flag.

UTF-16 is more efficient than UTF-8 in some cases.
スマートカエルのポスター。
In [2]: len("スマートカエルのポスター。".encode('UTF-8'))
Out[2]: 39

In [3]: len("スマートカエルのポスター。".encode('UTF-16'))
Out[3]: 28
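For what it's worth, the tradeoff reverses for ASCII-heavy text — a sketch (the English sentence is just a made-up comparison string):

```python
ja = "スマートカエルのポスター。"   # 13 chars, 3 UTF-8 bytes each
en = "Poster of a smart frog."       # 23 ASCII chars

# Note: Python prepends a 2-byte BOM when encoding to bare 'UTF-16'.
print(len(ja.encode('utf-8')), len(ja.encode('utf-16')))   # 39 vs 28
print(len(en.encode('utf-8')), len(en.encode('utf-16')))   # 23 vs 48
```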

There's no need for UTF-128 because UTF-8 and UTF-16 are both of variable width.

They are called "emoticons", you monglers.

>en.wikipedia.org/wiki/Emoticon

I prefer the later.

keep in mind you only preserve ameriburgers

Please stop.

Did you just get hipster over character encodings? Anon, try going outside and getting some fresh air.

>Grzegorz Brzęczyszczykiewicz is what happens if you try to retardedly force a writing system on to a language that doesn't suit it,
Elaborate.

UTF-8, and all variable-length encodings, are cancer. UTF-32 should be enough.

Slavic languages have sounds which can be hard to express in vanilla Latin; Cyrillic is much more suitable - Гpзeгopз Бpeчишкeвич reads much more naturally. I don't know why Polish uses digraphs instead of specialized letters, though.

apparently one 2-byte UTF-16 code unit can encode more symbols than a 1-3 byte UTF-8 sequence? Holy crap.

That's true for everything that fits into the BMP except for Latin, Greek and Cyrillic script. If you're using ASCII characters in Japanese text, you're at a disadvantage against Shift-JIS with either Unicode encoding. Unicode originally was a bigger issue to the Western world which only had used several incompatible 7-bit and 8-bit codes before.

The main advantages of UTF-8 are endian safety and self-synchronization (a byte value always has a definite position in a multi-byte sequence, and you can tell immediately whether input is valid UTF-8).
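That self-synchronization property can be sketched in a few lines: continuation bytes always match the bit pattern 10xxxxxx, so any single byte tells you whether it starts a character.

```python
def next_boundary(data: bytes, i: int) -> int:
    """Scan forward from offset i to the nearest UTF-8 character boundary."""
    while i < len(data) and (data[i] & 0xC0) == 0x80:  # skip continuation bytes
        i += 1
    return i

data = "héllo".encode("utf-8")   # b'h\xc3\xa9llo'
print(next_boundary(data, 2))    # offset 2 is inside 'é'; next boundary is 3
print(next_boundary(data, 1))    # offset 1 is the lead byte of 'é', already a boundary
```

A decoder dropped into the middle of a UTF-8 stream loses at most one character; with UTF-16 or UTF-32, a single lost byte can desynchronize everything that follows.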

No, it's not. It sounds ridiculous.

elaborate
It's either a mess of language-specific digraphs or a mess of language-specific diacritics. You might as well make a new writing system if you have to make so many modifications.

How to write ę or ą to be spelled properly in Cyrillic?

>Гpзeгopз Бpeчишкeвич
>Slavic languages
Generalization.

Gr-ziegor-z Briecziszkiewicz. Cyrillic bullshit.

West Slavic bullshit.

>goal of utf is to eliminate duplicate chars from multiple code pages.
>list whitespace in utf-8

U+0020 SPACE
U+00A0 NO-BREAK SPACE
U+1680 OGHAM SPACE MARK
U+180E MONGOLIAN VOWEL SEPARATOR
U+2000 EN QUAD
U+2001 EM QUAD
U+2002 EN SPACE
U+2003 EM SPACE
U+2004 THREE-PER-EM SPACE
U+2005 FOUR-PER-EM SPACE
U+2006 SIX-PER-EM SPACE
U+2007 FIGURE SPACE
U+2008 PUNCTUATION SPACE
U+2009 THIN SPACE
U+200A HAIR SPACE
U+202F NARROW NO-BREAK SPACE
U+205F MEDIUM MATHEMATICAL SPACE
U+3000 IDEOGRAPHIC SPACE
U+2028 LINE SEPARATOR
U+2029 PARAGRAPH SEPARATOR
U+0009 CHARACTER TABULATION
U+000A LINE FEED (LF)
U+000B LINE TABULATION
U+000C FORM FEED (FF)
U+000D CARRIAGE RETURN (CR)

>Why wasn't ASCII enough?

Because Western Culture needs to be eradicated.

It might make sense to you as someone who has grown up in such an environment, much like how an English speaker accepts the weird way the language is spelled, but digraphs and diacritics are just a clunky way of forcing Latin onto a language that clearly has more sounds than Latin can handle.

I also don't know what we need an encoding with more than 8 bits for.

But I can write zażółć gęślą jaźń with this bullshit.

ѧ and ѫ, probably.
These letters were used for nasal vowels.

>“ and ” are not the same as "
>’ and ‘ are not the same as '
>— is not the same as -
>… is not the same as ...

actually the punctuation overproliferation is fucking cancer

>there are different signs for minus (−), hyphen (‐), hyphen-minus (-), fullwidth hyphen-minus (－), hard hyphen (‑), en dash (–), em dash (—), figure dash (‒), horizontal bar (―) etc.
>nobody uses them correctly
>they're spammed all over the place breaking code
>no human eye can distinguish them w/o comparing

what a fucking nightmare

>Elaborate.

pl.wikipedia.org/wiki/Grzegorz_Brzęczyszczykiewicz

Use google translate.

>I don't know why Polish is using digraphs instead of specialized letters tho.

Because when we were baptised and embraced Latin we once and for all decided that we belong to THE WEST.

This has been settled in the X century.

>we decided
lol nobody asked you

It's a complete nightmare. Most brainlets on this board, though, just get as far as calling the right PHP function.

Imagine if they had to implement a text search / replace function in an actual programming language. Suddenly they would understand.

>reads much more naturally.
Because it's what you do everyday.
If you spent your entire life wiping your ass with your bare hand, toilet paper might sound ridiculous despite easily being the more hygienic and objectively superior method.
There's no merit in having your own special snowflake writing system if exactly the same can be achieved with a slightly modified common system.

Besides, what you wrote reads Grzegorz Brecziszkewicz, not Grzegorz Brzęczyszczykiewicz.
You omitted half of the characters and pretended your system is somehow the superior one, like the fucking retard you are.
Cyrillic has no symbols ą, ę, ń, ł, ó, ć and ź.
Cyrillic doesn't accommodate h/ch or u/ó.


Still no idea what you mean. Are tongue twisters unique to Polish?

Who else did? It's not like the Czech king stormed the land, grabbed pagan Mieszko by his clothes and ordered him to marry his daughter OR ELSE.

actually I'm the Polish poster here (or maybe there's more of us; we're like rabbits, basically everywhere)

>ordered him to marry his daughter OR ELSE.

yeah but Mieszko and co. knew this 'or else' was coming sooner or later

they were wise enough to join the gang early, before the gang turned its eyes on our marshlands for a gang bang

I still use this term too but it doesn't mean anything now

Spurdöländ wöuld be very very säd withöut UTF-8

>hieroglyphic languages
It's a problem for all languages that have more distinct sounds than English. ASCII is therefore insufficient for most languages using the Latin alphabet. German is missing ü, ö, ä, ß; French à, â, ç, ô, è, é, etc.

Emoticons, emojis and kaomojis are different things you cockmonglers.

Emoticons are composed of multiple characters (usually ASCII), kaomoji are weeb emoticons, and emojis are single Unicode characters.

Which are duplicate there?

All the meme letters needed by the language of the West Mongolians (and pretty much the rest of western civilization) are already in ISO-8859-15. It doesn't leave any room for funnies, though.

>Present text to user using "Hello" + U00A0 + "World"
>User searches for "Hello World"
Result not found.

Needless complexity would be dealing with random encodings for different languages, UTF-8 solved it nicely. Even if you believe that non-English speakers should be euthanized, Unicode is still useful for things like math symbols.

>UTF-32
You can encode every possible Unicode codepoint with 21 bits, so something like "UTF-21" would be more efficient for text files (21-bit values would be stored as 32-bit ints in memory anyway, but it would save space on disk).
Constant-width encodings aren't as useful as you think anyway, since some emojis consist of a modifier plus the emoji, meaning you can't get the length of the text just by getting the length of the array, nor can you simply reverse the text by reversing the array of character data.
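This is easy to demonstrate: even counting codepoints (which a fixed-width encoding gives you for free) doesn't match what the user perceives, and naive reversal tears combining sequences apart. A sketch:

```python
s = "cafe\u0301"    # "café" written as 'e' + COMBINING ACUTE ACCENT
print(len(s))       # 5 codepoints, though the user sees 4 characters
print(s[::-1])      # naive reversal detaches the accent from its base letter
print(s[::-1][0] == "\u0301")   # the combining mark now leads the string
```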

Those aren't duplicate characters according to the Unicode standard. You may think that's retarded, but it would still be Unicode's fault and not UTF-8's fault.

/Hello\sWorld

>using a no-break-space instead of a space
You're retarded, I see.

Data presented to user was input by other user.

Fair point regarding UTF/Unicode separation but its a bit like TCP/IP separation. Is UTF-8 commonly used without Unicode?

The point still stands, retardinho. Those characters are extremely rare. No one enters them by accident. Your example is as likely to happen as someone writing "Hello°World" and then some other retard searching for "Hello World" and not getting any search results. Those characters have their uses. Just because you're a little retard who never needs something doesn't mean the rest of the world doesn't need it.
Your retardation is on the same level as this guy's retarded pretension.

>Is UTF-8 commonly used without Unicode?
What does that even mean?

>Is UTF-8 commonly used without Unicode?
UTF literally stands for "Unicode Transformation Format", so no, it isn't. I'm just saying it's a pretty efficient way of encoding Unicode that is backwards compatible with ASCII, so it shouldn't get the blame for Unicode's faults (which aren't really faults anyway, as those spaces serve different purposes).

Wow, PHP programmers are slow. Don't worry Brad, I'll break it down for you.
>User 1 copies and paste something from web into text entry box for application.
>Hits save
>Application dutifully saves "Hello+U00A0+World" into the db, because that's what was copied from the webpage.
>User 2 searches for "Hello World"
>No results found.

You are very very slow on the uptake. Get a different job.

Exactly my point Brad.

You just don't get it, do you? He pasted the text with the character deliberately. "Hello + non-breaking-space + World" isn't the same as "Hello + space + World". Those characters have their use. Just because some initial retard used the wrong character, or someone "maliciously" used the wrong one, doesn't mean the standard (Unicode) is broken.

Clearly if you care that much, you'd get a list of all the whitespace characters and perform checks on it if you want to alter a user's intent. You are a smart programmer, surely you know how to do that.
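Folding whitespace before indexing really is a few lines in most languages — a sketch in Python, assuming collapsing every Unicode whitespace run to a plain space is acceptable for the application:

```python
import re

def fold_whitespace(s: str) -> str:
    # In Python 3, \s in a str pattern already matches Unicode whitespace
    # (NBSP, thin space, ideographic space, ...), not just ASCII blanks.
    return re.sub(r"\s+", " ", s)

print(fold_whitespace("Hello\u00A0World") == "Hello World")   # True
print(fold_whitespace("Hello\u3000World") == "Hello World")   # True
```

Apply it to both the stored text and the query, and "Hello World" matches "Hello" + NBSP + "World" without altering what the user actually saved.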

For those of us who hate romaji

> if you care that much, you'd get a list of all the whitespace characters and perform checks on it
The world should do endless work to accommodate my bad designs

t. Poettering

>later
You would be dead too, probably

The user thought it was "Hello World". How would the user visually be able to differentiate between Hello+Space+World and Hello+NBSP+World?

He wouldn't. That's why he should check his sources or use the proper characters when he writes things himself. Those different whitespaces have a purpose, just like other similar-looking characters do. The Latin small letter q looks identical to the Cyrillic letter q. What are you going to do when someone pastes text with that? Are you going to eliminate all the letters that look like a Latin lowercase q?
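The homoglyph problem is real, and normalization doesn't fix it, since the characters belong to different scripts — a sketch using the classic Latin/Cyrillic 'a' pair as the example:

```python
import unicodedata

latin, cyr = "a", "\u0430"    # visually identical in most fonts
print(latin == cyr)                                # False
print(unicodedata.name(cyr))                       # CYRILLIC SMALL LETTER A
# NFKC folds compatibility variants, but never unifies across scripts:
print(unicodedata.normalize("NFKC", cyr) == latin)   # False
```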

>has no arguments
>resorts to using g memes as arguments
Consider euthanizing or educating yourself. One of those is hopeless for you, the other might be legally outlawed.

>Latin small letter q looks identical to the Cyrillic letter q
"UNIcode" was supposed to solve exactly that issue, namely end all the redundant chars that existed in previous codepage mappings. So Latin-based languages could share their common letters and only encode the umlauts for germans, accents for frenchies, etc.
Now every language must have a "q" and 20 different whitespaces. Oh well, at least pajeet can write "show me big vagana" in native scribble.

because muh 卐

>"UNIcode" was supposed to solve exactly that issue, namely end all the redundant chars that existed in previous codepage mappings.
Bullshit. It was created because there was no way to cover all of the characters from different languages and display them reliably. It literally says so on the Unicode website [0], but you probably know better than the people who created it. Unless you are a person who created or contributed to the Unicode standard. In that case, correct your website.

[0]unicode.org/standard/WhatIsUnicode.html

>For German ü, ö, ä, ß are missing
Not the greatest example though, really, since those characters are literally just abbreviations of ue, oe, ae, and ss.

this, really nice video, plz watch

That's almost harmless. What is particularly fun is the combining characters where ä is not the same as ä. Of course there are standard functions for canonicalization of combining sequences, but that requires pulling in megabytes of Unicode data tables into your program, and at least assumes that everyone else your program communicates with uses the same canonicalization scheme (hint: there are four of them).
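In Python stdlib terms (the four canonicalization schemes the post refers to are NFC, NFD, NFKC and NFKD):

```python
import unicodedata

composed   = "\u00E4"     # ä as one codepoint
decomposed = "a\u0308"    # 'a' followed by COMBINING DIAERESIS

print(composed == decomposed)                                  # False
print(unicodedata.normalize("NFC", decomposed) == composed)    # True
print(unicodedata.normalize("NFD", composed) == decomposed)    # True
```

Two strings that render identically compare unequal until both sides agree on a normalization form, which is exactly the interoperability trap described above.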

>in every single potential shade of human skin colour (but without any other racially defining features)

>Why do we need UTF-8? Why wasn't ASCII enough? Did it just introduce needless complexity?
Because most people don't use a butchered Latin alphabet.

Have fun with different byte orderings, though.

>implying compression algorithms wouldn't do a better job

>elegant fallback
What we had before browsers started replacing Unicode emoji with vector graphics was fine. They looked good.

>Why do we need [Unicode]?
To represent other languages using a single standard.
>Why wasn't ASCII enough?
Not enough codepoints for many languages.
>Did it just introduce needless complexity?
Unicode is as complex as you want it to be. If you want text processing, you need to know how to decode; if you want ordering, or to know what a given character means, you need to take the locale of the text into consideration, and so on.

Extra comments:
There was a big mistake in Unicode, and that was adding Webdings, Wingdings and other symbols before CJK and other scripts that are currently in use in the real world. Many characters of these scripts would fit under 0xFFFF and the rest would need just a third byte. With the dumb addition of the symbols, a huge chunk of the characters of those scripts require 3, 4 or more bytes.

Should I not be able to type © or ≈?

Come on, user.