I've begun implementing unicode in Leadwerks Game Engine 5. It's not quite as simple as "switch all string variables to another data type".
First, I will give you a simple explanation of what unicode is. I am not an expert so feel free to make any corrections in the comments below.
When computers first started drawing text we used a single byte for each character. One byte can describe 256 different values and the English language only has 26 letters, 10 digits, and a few other characters for punctuation, so all was well. No one cared or thought about supporting other languages like German with funny umlauts or the thousands of characters in the Chinese language.
Then some people who were too smart for their own good invented a really complicated system called unicode. Unicode allows characters beyond the 256 character limit of a byte because it can use more than one byte per character. But unicode doesn't really store a letter, because that would be too easy. Instead it stores a "code point" which is an abstract concept. Unfortunately the people who originally invented unicode were locked away in a mental asylum where they remain to this day, so no one in the real world actually understands what a code point is.
There are several kinds of unicode but the one favored by nerds who don't write software is UTF-8. UTF-8 uses just one byte per character, unless it uses two, three, or four. Because each character can be a different length there is no way to jump straight to the Nth letter of a string; you have to scan from the beginning and count. Byte-oriented commands like Replace() and Find() still work on valid UTF-8, but commands that index by character or work one byte at a time, like Mid(), Left(), Right(), and Upper(), will cut characters in half or mangle anything outside plain ASCII unless they are written specifically for UTF-8.
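To make the variable-length point concrete, here is a minimal sketch (not engine code) that counts characters in a UTF-8 string and finds where the Nth one starts. The function names CountCodePoints and ByteOffsetOfCodePoint are made up for illustration, and the input is assumed to be valid UTF-8.

```cpp
#include <cstddef>
#include <string>

// Count Unicode code points in a UTF-8 encoded std::string.
// Continuation bytes look like 10xxxxxx, so every byte that does NOT
// match that pattern starts a new code point.
// Assumption: the input is valid UTF-8; nothing is validated here.
std::size_t CountCodePoints(const std::string& utf8)
{
    std::size_t count = 0;
    for (unsigned char c : utf8)
    {
        if ((c & 0xC0) != 0x80) ++count;
    }
    return count;
}

// Byte offset at which the code point with the given index starts.
// This is a linear scan from the start of the string.
// Returns utf8.size() if the index is past the end.
std::size_t ByteOffsetOfCodePoint(const std::string& utf8, std::size_t index)
{
    std::size_t seen = 0;
    for (std::size_t i = 0; i < utf8.size(); ++i)
    {
        if ((static_cast<unsigned char>(utf8[i]) & 0xC0) != 0x80)
        {
            if (seen == index) return i;
            ++seen;
        }
    }
    return utf8.size();
}
```

For the word "héllo" stored as UTF-8, size() reports 6 bytes while CountCodePoints reports 5 characters, and the third character starts at byte 3 rather than byte 2.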
Nonetheless, some people still promote UTF-8 religiously because they think it sounds cool and they don't actually write software. There is even a UTF-8 Everywhere Manifesto. You know who else had a manifesto? This guy, that's who:
Typical UTF-8 proponent.
Here's another person with a "manifesto":
The Unabomber (unibomber? Coincidence???)
The fact is that anyone who writes a manifesto is evil, therefore UTF-8 proponents are evil and should probably be imprisoned for crimes against humanity. Microsoft sensibly solved this problem by using something called a "wide string" for all the Windows internals. A C++ wide string (std::wstring) is a string made up of wchar_t values instead of char values. (The std::string data type is sometimes called a "narrow string"). In C++ you can set the value of a wide string by placing a letter "L" (for long?) in front of the string:
std::wstring s = L"Hello, how are you today?";
The C++ standard does not fix the size of a wchar_t value. It is two bytes on Windows and usually four bytes on Linux and macOS, so wide strings are not laid out the same way across different operating systems, even though the same code compiles on all of them. On Windows, wide strings are basically a different kind of unicode called UTF-16. A single two-byte wchar_t cannot hold a character with an index greater than 65535, so those characters (emoji and some rare historical scripts) are stored as two values called a surrogate pair. Wide strings will actually work with string manipulation commands, with the exception that a command can split a surrogate pair in half, which rarely matters for the text most games display.
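For characters with an index above 65535, UTF-16 uses two 16-bit values called a surrogate pair. Here is a small sketch of how that encoding works. EncodeUtf16 is a hypothetical helper written for this post, not part of Leadwerks, and it uses char16_t so the result is the same regardless of how big wchar_t is on your platform.

```cpp
#include <string>

// Encode one Unicode code point as UTF-16 code units.
// Code points up to U+FFFF fit in one unit; anything above becomes
// a surrogate pair (two units).
// Assumption: cp is a valid code point (<= 0x10FFFF, not itself a surrogate).
std::u16string EncodeUtf16(char32_t cp)
{
    std::u16string out;
    if (cp < 0x10000)
    {
        out.push_back(static_cast<char16_t>(cp));
    }
    else
    {
        cp -= 0x10000;                                               // 20 bits remain
        out.push_back(static_cast<char16_t>(0xD800 + (cp >> 10)));   // high surrogate: top 10 bits
        out.push_back(static_cast<char16_t>(0xDC00 + (cp & 0x3FF))); // low surrogate: bottom 10 bits
    }
    return out;
}
```

The Chinese character 约 (U+7EA6) comes out as a single unit, while the emoji U+1F600 comes out as the pair 0xD83D, 0xDE00. Any command that counts or slices by unit will see that emoji as two "characters".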
For more detail you can read this article about the technical details and history of unicode (thanks @Einlander).
At first I thought "no problem, I will just turn all string variables into wstrings and be done with it". However, after a couple of days it became clear that this would be problematic. Leadwerks interfaces with a lot of third-party libraries like Steamworks and Lua that make heavy use of strings. Typically these libraries will accept a char* value for the input, which we know might be UTF-8 or it might not (another reason UTF-8 is evil). The engine ended up with a TON of string conversions that I might be doing for no reason. I got the compiler down to 2991 errors before I started questioning whether this was really needed.
Exactly what do we need unicode strings for? There are three big uses:
Read and save files.
Display text in different languages.
Print text to the console and log.
Reading files is mostly an automatic process because the user typically uses relative file paths. As long as the engine internally uses a wide string to load files the user can happily use regular old narrow strings without a care in the world (and most people probably will).
Drawing text to the screen or on a GUI widget is very important for supporting different languages, but that is only one use. Is it really necessary to convert every variable in the engine to a wide string just to support this one feature?
Printing strings is even simpler. Can't we just add an overload to print a wide string when one is needed?
I originally wanted to avoid mixing wide and narrow strings, but even with unicode support most users are probably not even going to need to worry about using wide strings at all. Even if they have language files for different translations of their game, they are still likely to just load some strings automatically without writing much code. I may even add a feature that does this automatically for displayed text. So with that in mind, I decided to roll everything back and convert only the parts of the engine that would actually benefit from unicode and wide strings.
Second Try + Global Functions
To make the API simpler Leadwerks 5 will make use of some global functions instead of trying to turn everything into a class. Below are the string global functions I have written:
std::string String(const std::wstring& s);
std::string Right(const std::string& s, const int length);
std::string Left(const std::string& s, const int length);
std::string Replace(const std::string& s, const std::string& from, const std::string& to);
int Find(const std::string& s, const std::string& token);
std::vector<std::string> Split(const std::string& s, const std::string& sep);
std::string Lower(const std::string& s);
std::string Upper(const std::string& s);
There are equivalent functions that work with wide strings.
std::wstring StringW(const std::string& s);
std::wstring Right(const std::wstring& s, const int length);
std::wstring Left(const std::wstring& s, const int length);
std::wstring Replace(const std::wstring& s, const std::wstring& from, const std::wstring& to);
int Find(const std::wstring& s, const std::wstring& token);
std::vector<std::wstring> Split(const std::wstring& s, const std::wstring& sep);
std::wstring Lower(const std::wstring& s);
std::wstring Upper(const std::wstring& s);
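One way to keep narrow and wide versions like these in sync is to write the logic once as a template and forward both overloads to it. This is only a sketch of that idea, not the actual Leadwerks source. The helper names ReplaceT and SplitT are invented here, and the handling of empty separators and empty pieces is my own assumption.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Shared implementation: replace every occurrence of 'from' with 'to'.
// Assumption: an empty 'from' leaves the string unchanged.
template <typename CharT>
std::basic_string<CharT> ReplaceT(const std::basic_string<CharT>& s,
                                  const std::basic_string<CharT>& from,
                                  const std::basic_string<CharT>& to)
{
    if (from.empty()) return s;
    std::basic_string<CharT> out;
    std::size_t pos = 0;
    for (;;)
    {
        const std::size_t hit = s.find(from, pos);
        if (hit == std::basic_string<CharT>::npos) break;
        out.append(s, pos, hit - pos); // text before the match
        out += to;
        pos = hit + from.size();
    }
    out.append(s, pos, std::basic_string<CharT>::npos); // remaining tail
    return out;
}

// Shared implementation: split 's' on every occurrence of 'sep'.
// Assumption: empty pieces are kept, and an empty 'sep' returns the
// whole string as one piece.
template <typename CharT>
std::vector<std::basic_string<CharT>> SplitT(const std::basic_string<CharT>& s,
                                             const std::basic_string<CharT>& sep)
{
    std::vector<std::basic_string<CharT>> parts;
    if (sep.empty()) { parts.push_back(s); return parts; }
    std::size_t pos = 0;
    for (;;)
    {
        const std::size_t hit = s.find(sep, pos);
        if (hit == std::basic_string<CharT>::npos) break;
        parts.push_back(s.substr(pos, hit - pos));
        pos = hit + sep.size();
    }
    parts.push_back(s.substr(pos));
    return parts;
}

// Narrow and wide overloads with the same signatures as the declarations above.
std::string Replace(const std::string& s, const std::string& from, const std::string& to) { return ReplaceT(s, from, to); }
std::wstring Replace(const std::wstring& s, const std::wstring& from, const std::wstring& to) { return ReplaceT(s, from, to); }
std::vector<std::string> Split(const std::string& s, const std::string& sep) { return SplitT(s, sep); }
std::vector<std::wstring> Split(const std::wstring& s, const std::wstring& sep) { return SplitT(s, sep); }
```

Calling Replace("a-b-c", "-", "+") picks the narrow overload and Replace(L"a-b-c", L"-", L"+") picks the wide one, so user code never has to name the template.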
The System::Print() command has become a global Print() command with a couple of overloads for both narrow and wide strings:
void Print(const std::string& s);
void Print(const std::wstring& s);
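Here is a rough sketch of how a pair of overloads like this could share one output path: the wide version converts to UTF-8 and hands off to the narrow version. WideToUtf8 and AppendUtf8 are hypothetical helpers, not engine functions, and I am assuming the console or log expects UTF-8 (the Windows console may need extra setup for that). The input is assumed to be well-formed.

```cpp
#include <cstddef>
#include <cstdint>
#include <iostream>
#include <string>

// Append one code point to a UTF-8 string (1 to 4 bytes).
void AppendUtf8(std::string& out, std::uint32_t cp)
{
    if (cp < 0x80)
    {
        out += static_cast<char>(cp);
    }
    else if (cp < 0x800)
    {
        out += static_cast<char>(0xC0 | (cp >> 6));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    }
    else if (cp < 0x10000)
    {
        out += static_cast<char>(0xE0 | (cp >> 12));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    }
    else
    {
        out += static_cast<char>(0xF0 | (cp >> 18));
        out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    }
}

// Convert a wide string to UTF-8 (hypothetical helper).
// Where wchar_t is 16 bits, surrogate pairs are combined first.
std::string WideToUtf8(const std::wstring& w)
{
    std::string out;
    for (std::size_t i = 0; i < w.size(); ++i)
    {
        std::uint32_t cp = static_cast<std::uint32_t>(w[i]);
        if (sizeof(wchar_t) == 2 && cp >= 0xD800 && cp <= 0xDBFF && i + 1 < w.size())
        {
            const std::uint32_t lo = static_cast<std::uint32_t>(w[i + 1]);
            if (lo >= 0xDC00 && lo <= 0xDFFF)
            {
                cp = 0x10000 + ((cp - 0xD800) << 10) + (lo - 0xDC00);
                ++i; // the low surrogate has been consumed
            }
        }
        AppendUtf8(out, cp);
    }
    return out;
}

// Narrow overload writes the bytes as-is; wide overload converts first.
void Print(const std::string& s) { std::cout << s << '\n'; }
void Print(const std::wstring& s) { Print(WideToUtf8(s)); }
```

Keeping all output on std::cout avoids mixing narrow and wide streams on the same console, which is a common source of garbled or missing text.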
The file system commands are now global functions as well. File system commands can accept a wide or narrow string, but any functions that return a path will always return a wide string:
std::wstring SpecialDir(const std::string);
bool ChangeDir(const std::string& path);
bool ChangeDir(const std::wstring& path);
std::wstring RealPath(const std::string& path);
std::wstring RealPath(const std::wstring& path);
This means if you call ReadFile("info.txt") with a narrow string the file will still be loaded even if it is located somewhere like "C:/Users/约书亚/Documents" and it will work just fine. This is ideal since Lua 5.3 doesn't support wide strings, so your game will still run on computers around the world as long as you just use local paths like this:
Or you can specify the full path with a wide string:
LoadModel(CurrentDir() + L"Models/car.mdl");
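The "narrow or wide in, wide out" idea for paths can be sketched with std::filesystem from C++17. This is not how Leadwerks implements RealPath; RealPathSketch is a made-up name, and the narrow overload assumes the string is in whatever narrow encoding the platform uses for file names.

```cpp
#include <filesystem>
#include <string>

namespace fs = std::filesystem;

// Shared helper: resolve a path against the current directory and
// return it as a wide string.
static std::wstring AbsoluteWide(const fs::path& p)
{
    return fs::absolute(p).wstring();
}

// Hypothetical stand-ins for the RealPath overloads declared above.
// A relative narrow path is joined to the current directory by the
// filesystem library, so the narrow string itself never has to contain
// any non-ASCII parts of the full path.
std::wstring RealPathSketch(const std::string& path)  { return AbsoluteWide(fs::path(path)); }
std::wstring RealPathSketch(const std::wstring& path) { return AbsoluteWide(fs::path(path)); }
```

Calling RealPathSketch("info.txt") and RealPathSketch(L"info.txt") from the same directory gives the same wide result, which is the behavior that lets scripts stick to plain narrow relative paths.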
The window creation and text drawing functions will also get an overload that accepts wide strings. Here's a window created with a Chinese title:
So in conclusion, unicode will be used in Leadwerks and will work for the most part without you needing to know or do anything different, allowing games you develop (and Leadwerks itself) to work correctly on computers all across the world.