
Unicode Confusion


Josh

I have started switching the engine over to Unicode by replacing all occurrences of std::string with std::wstring.  There are a bunch of little functions and variables that have to be changed (char to wchar_t, "" to L"", etc.) but it is pretty straightforward.

Lua 5.3 supposedly supports Unicode strings, but the manual states that the lua_getglobal() function accepts a const char* parameter:
https://www.lua.org/manual/5.3/manual.html#lua_getglobal

There's a little information here but it is not very clear:
https://www.lua.org/manual/5.3/manual.html#6.5

So how are you supposed to make Unicode work in Lua? :blink:  Switching data back and forth between wstrings and strings is a recipe for disaster.

My job is to make tools you love, with the features you want, and performance you can't live without.


I believe this code will successfully open a weird-character file on any platform:

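// Requires <codecvt>, <cstdio>, <locale>, and <string>.
// Before C++20, a u8 literal is a plain char array and can initialize std::string.
std::string filename = u8"⺹.txt";

#ifdef _WIN32
// Windows file APIs take UTF-16, so convert the UTF-8 name first.
// (std::wstring_convert is deprecated since C++17.)
std::wstring_convert<std::codecvt_utf8_utf16<wchar_t>> converter;
auto f = _wfopen(converter.from_bytes(filename).c_str(), L"rb");
#else
// Elsewhere the UTF-8 bytes are passed to fopen unchanged.
auto f = fopen(filename.c_str(), "rb");
#endif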

 

My job is to make tools you love, with the features you want, and performance you can't live without.


Unicode sucks because it uses a variable character size.  This makes search and replace operations very difficult.  However, Linux does not accept wstrings in commands like fopen.  At this point I am thinking we will store strings as wstrings and then convert to UTF-8 std::strings when calling Linux system commands.  Why is everything in Linux designed as if computers have one KB of memory?
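
Here is a rough sketch of what that wstring to UTF-8 conversion could look like. WideToUtf8 is a made-up name, not an existing engine or library function, and it assumes the input is well formed (it does no validation). It covers both the 16-bit wchar_t on Windows and the 32-bit wchar_t on Linux:

```cpp
#include <cstddef>
#include <cstdint>
#include <string>

// Hypothetical helper (not an existing API): encode a std::wstring as UTF-8.
// Assumes well-formed input; unpaired surrogates are not rejected.
std::string WideToUtf8(const std::wstring& wide)
{
    std::string out;
    for (std::size_t i = 0; i < wide.size(); ++i)
    {
        std::uint32_t cp = static_cast<std::uint32_t>(wide[i]);

        // With a 16-bit wchar_t, combine a surrogate pair into one code point.
        if (sizeof(wchar_t) == 2 && cp >= 0xD800 && cp <= 0xDBFF && i + 1 < wide.size())
        {
            const std::uint32_t low = static_cast<std::uint32_t>(wide[i + 1]);
            if (low >= 0xDC00 && low <= 0xDFFF)
            {
                cp = 0x10000 + ((cp - 0xD800) << 10) + (low - 0xDC00);
                ++i;
            }
        }

        if (cp < 0x80)
        {
            out += static_cast<char>(cp); // 1 byte (ASCII)
        }
        else if (cp < 0x800)
        {
            out += static_cast<char>(0xC0 | (cp >> 6)); // 2 bytes
            out += static_cast<char>(0x80 | (cp & 0x3F));
        }
        else if (cp < 0x10000)
        {
            out += static_cast<char>(0xE0 | (cp >> 12)); // 3 bytes
            out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
            out += static_cast<char>(0x80 | (cp & 0x3F));
        }
        else
        {
            out += static_cast<char>(0xF0 | (cp >> 18)); // 4 bytes
            out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
            out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
            out += static_cast<char>(0x80 | (cp & 0x3F));
        }
    }
    return out;
}
```

The result could then be handed to fopen() on Linux, e.g. fopen(WideToUtf8(path).c_str(), "rb").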

The whole Unicode design is idiotic.  They made a very complicated system when all they had to do was use 2 bytes per character and have one number for every character.  I guess making something that actually works would be "boring".  Yes, I know there are ancient Vietnamese characters that are no longer in use that push the character count past 65,000 but who cares about that?  Why should we handicap modern computing for a bunch of Vietnamese people who died three centuries ago?  They're dead so they don't care, and if they had anything interesting to say it would have been made into a movie already.

My job is to make tools you love, with the features you want, and performance you can't live without.


I got a window created with Chinese characters, but I can't print them out to the console:

wprintf(L"%ls \n", L"A wide string");
wprintf(L"%ls \n", L"勝遂記暮恐村日性周報著身催");
wprintf(L"Why? 为什么?\n");

 

My job is to make tools you love, with the features you want, and performance you can't live without.


I'm thinking my console font probably just cannot display the characters.

I tried to write a wstring to a text file but that didn't work out too well either when I opened it in Notepad++.
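
One thing that might help with the text file is writing explicit UTF-16LE bytes with a byte order mark in front, so an editor can detect the encoding. This is only a sketch under that assumption; ToUtf16LeWithBom and WriteTextFile are made-up names, and I have not confirmed this is what went wrong above:

```cpp
#include <cstdint>
#include <fstream>
#include <string>

// Hypothetical helper: serialize a std::wstring as UTF-16LE bytes,
// preceded by the byte order mark FF FE.
std::string ToUtf16LeWithBom(const std::wstring& text)
{
    std::string bytes = "\xFF\xFE"; // BOM for UTF-16 little-endian

    // Append one 16-bit code unit, low byte first.
    auto put16 = [&bytes](std::uint32_t unit) {
        bytes += static_cast<char>(unit & 0xFF);
        bytes += static_cast<char>((unit >> 8) & 0xFF);
    };

    for (wchar_t wc : text)
    {
        std::uint32_t cp = static_cast<std::uint32_t>(wc);
        if (cp >= 0x10000)
        {
            // Only reachable when wchar_t is 32 bits: emit a surrogate pair.
            cp -= 0x10000;
            put16(0xD800 + (cp >> 10));
            put16(0xDC00 + (cp & 0x3FF));
        }
        else
        {
            put16(cp);
        }
    }
    return bytes;
}

// Hypothetical helper: write the encoded bytes in binary mode so the
// runtime does not translate any of them.
bool WriteTextFile(const char* path, const std::wstring& text)
{
    std::ofstream file(path, std::ios::binary);
    if (!file) return false;
    const std::string bytes = ToUtf16LeWithBom(text);
    file.write(bytes.data(), static_cast<std::streamsize>(bytes.size()));
    return static_cast<bool>(file);
}
```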

My job is to make tools you love, with the features you want, and performance you can't live without.


I found this blog post through Reddit a few days ago and thought it was insightful: https://www.joelonsoftware.com/2003/10/08/the-absolute-minimum-every-software-developer-absolutely-positively-must-know-about-unicode-and-character-sets-no-excuses/

It made me realise everything other than UTF-16 is basically a beautiful hack.

It also talks about the wcs functions in C++.


1 hour ago, Einlander said:

I found this blog post through Reddit a few days ago and thought it was insightful: https://www.joelonsoftware.com/2003/10/08/the-absolute-minimum-every-software-developer-absolutely-positively-must-know-about-unicode-and-character-sets-no-excuses/

It made me realise everything other than UTF-16 is basically a beautiful hack.

It also talks about the wcs functions in C++.

Thanks for the info.  You're right, everything but UCS-2 (two byte) Unicode is a stupid idea because it means you are translating text through two layers of conversion.  (The fact that some characters no one uses go beyond the 65,000 character limit does not matter.)
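
For reference, this is roughly what happens to a character above the 16-bit range in UTF-16: it takes two code units (a surrogate pair) instead of one. The helper name below is made up and the code assumes a well-formed string:

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: count code points in a UTF-16 string, assuming it is
// well formed. A high surrogate (0xD800-0xDBFF) starts a two-unit pair.
std::size_t CountCodePointsUtf16(const std::u16string& text)
{
    std::size_t count = 0;
    for (std::size_t i = 0; i < text.size(); ++i)
    {
        const char16_t unit = text[i];
        if (unit >= 0xD800 && unit <= 0xDBFF && i + 1 < text.size())
            ++i; // skip the low surrogate that completes the pair
        ++count;
    }
    return count;
}
```

With pure two-byte UCS-2 there are no surrogate pairs, so size() always equals the character count, but code points above U+FFFF cannot be stored at all.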

So in Leadwerks we will replace all strings with wstring, replace all Windows API calls with their wide (-W) versions, and for Lua or Linux system calls we convert the wstring to UTF-8 (for opening files, etc.).  Strings will be stored in files as UCS-2.

It is interesting to see that all the tech enthusiasts keep claiming UTF-8 is the best but people who actually write software use UTF-16:
https://stackoverflow.com/questions/6162484/why-does-modern-perl-avoid-utf-8-by-default

My job is to make tools you love, with the features you want, and performance you can't live without.


This is getting very complicated and I am reconsidering this.

Why do we need Russian and Chinese characters?

  • Loading or saving a file.
  • Drawing text on the screen or in a GUI element.
  • Storing a variable for one of the above two purposes.

Do we really need to change every other string in Leadwerks in order to accommodate these goals, or can we simply add overloads for a few commands and use std::wstring for internal file path values?

Do we care if the user can name an entity "汽车" in the editor, or should they be expected to use Latin characters for something like this?

I don't know if Lua 5.3 will really support Unicode strings.

I don't know if the Steamworks commands use Unicode at all.  They all just accept a char* value.

I don't know if these will be stored the same way on Windows and Linux.

I still have 2991 errors in the engine to fix.  At first I thought we should change every single variable but now I am not sure if that is a good idea.

I could just add a few commands like this and be done with it:

Widget::SetText(const std::string& text)
Widget::SetText(const std::wstring& text)
Context::DrawText(const std::string& text)
Context::DrawText(const std::wstring& text)
shared_ptr<Model> LoadModel(const std::string& path)
shared_ptr<Model> LoadModel(const std::wstring& path)

However, this means potentially a mix of std::string and std::wstring values will be present in the engine.
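
A sketch of how one pair of overloads could avoid that mix internally: the std::string version decodes its argument and forwards to the std::wstring version, so only wide strings get stored. This assumes narrow strings are UTF-8, which is my assumption and not something decided above. Utf8ToWide is a made-up helper, and the Widget class here is a bare stand-in, not the real engine class:

```cpp
#include <cstddef>
#include <cstdint>
#include <string>

// Hypothetical helper: decode UTF-8 into a std::wstring.
// Assumes valid input; malformed sequences are not detected.
std::wstring Utf8ToWide(const std::string& utf8)
{
    std::wstring out;
    std::size_t i = 0;
    while (i < utf8.size())
    {
        const unsigned char lead = static_cast<unsigned char>(utf8[i]);
        std::uint32_t cp;
        std::size_t extra; // number of continuation bytes that follow
        if (lead < 0x80)      { cp = lead;        extra = 0; }
        else if (lead < 0xE0) { cp = lead & 0x1F; extra = 1; }
        else if (lead < 0xF0) { cp = lead & 0x0F; extra = 2; }
        else                  { cp = lead & 0x07; extra = 3; }
        ++i;
        for (; extra > 0 && i < utf8.size(); --extra, ++i)
            cp = (cp << 6) | (static_cast<unsigned char>(utf8[i]) & 0x3F);

        if (sizeof(wchar_t) == 2 && cp >= 0x10000)
        {
            // 16-bit wchar_t: store as a surrogate pair.
            cp -= 0x10000;
            out += static_cast<wchar_t>(0xD800 + (cp >> 10));
            out += static_cast<wchar_t>(0xDC00 + (cp & 0x3FF));
        }
        else
        {
            out += static_cast<wchar_t>(cp);
        }
    }
    return out;
}

// Stand-in for the engine's Widget class; only the text member is modeled.
class Widget
{
public:
    // The wide overload is the real implementation.
    void SetText(const std::wstring& text) { m_text = text; }

    // The narrow overload decodes UTF-8 and forwards to the wide one.
    void SetText(const std::string& text) { SetText(Utf8ToWide(text)); }

    const std::wstring& GetText() const { return m_text; }

private:
    std::wstring m_text;
};
```

Calling SetText("abc") picks the std::string overload and SetText(L"abc") picks the std::wstring one, so existing calls with narrow literals would still compile.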

My job is to make tools you love, with the features you want, and performance you can't live without.


I would make sure that the rest of the engine works with UTF-8 and let Lua itself fail with the encoding. Lua 5.3 has UTF-8 support (https://www.lua.org/manual/5.3/manual.html#6.5) but it's not very robust.

Since it is still early, you do have the option of choosing a Lua-derived language or a completely different language not based on Lua. Distasteful as it may be, the API is changing and the scripts will need to be updated anyway, so it might be simpler to start over early with something else.

Who knows.

Edited by Einlander

13 hours ago, Einlander said:

I would make sure that the rest of the engine works with UTF-8 and let Lua itself fail with the encoding. Lua 5.3 has UTF-8 support (https://www.lua.org/manual/5.3/manual.html#6.5) but it's not very robust.

Since it is still early, you do have the option of choosing a Lua-derived language or a completely different language not based on Lua. Distasteful as it may be, the API is changing and the scripts will need to be updated anyway, so it might be simpler to start over early with something else.

Who knows.

Then say goodbye to String::Split(), Lower(), Upper(), Mid() and all other string manipulation commands, and your file paths will have to be 100% exact or files will fail to load.  UTF-8 is a fraud and its proponents should be imprisoned for crimes against humanity.

My job is to make tools you love, with the features you want, and performance you can't live without.


56 minutes ago, Einlander said:

Hey now, I like UTF-8, I have just never had to deal with coding anything Unicode on Linux. All OSes could have conflicting implementations. Is there a BSD/public domain lib that handles Unicode?

Haha, yeah that is the catch.  It's basically a compressed format so traversing it is impossible.
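
To be fair, stepping through it can be sketched if you accept walking byte by byte: continuation bytes always look like 10xxxxxx, so anything else starts a new character. The helpers below are made-up names and assume valid UTF-8; they give sequential access only, not constant-time indexing by character:

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: count code points in UTF-8, assuming valid input.
std::size_t CountCodePoints(const std::string& utf8)
{
    std::size_t count = 0;
    for (unsigned char byte : utf8)
    {
        if ((byte & 0xC0) != 0x80) // not a continuation byte: starts a character
            ++count;
    }
    return count;
}

// Hypothetical helper: given the byte offset of a character, return the byte
// offset of the next one (or utf8.size() at the end).
std::size_t NextCodePoint(const std::string& utf8, std::size_t pos)
{
    if (pos >= utf8.size()) return utf8.size();
    ++pos;
    while (pos < utf8.size() &&
           (static_cast<unsigned char>(utf8[pos]) & 0xC0) == 0x80)
    {
        ++pos; // skip continuation bytes of the current character
    }
    return pos;
}
```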

My job is to make tools you love, with the features you want, and performance you can't live without.

