10.10. FAQ: MySQL Chinese, Japanese, and Korean Character Sets

This Frequently-Asked-Questions section comes from the experiences of MySQL's Support and Development groups, after handling many enquiries about CJK (Chinese Japanese Korean) issues.

10.10.1. SELECT shows non-Latin characters as "?"s. Why?

You inserted CJK characters with INSERT, but when you do a SELECT, they all look like “?”. The cause is usually a MySQL setting that doesn't match the settings of the application program or the operating system. These are common troubleshooting steps:

  • Find out: what version do you have? The statement SELECT VERSION(); will tell you. This FAQ is for MySQL version 5, so some of the answers here will not apply to you if you have version 4.0 or 4.1.

  • Find out: what character set is the database column really in? Too frequently, people think that the character set will be the same as the server's set (false), or the set used for display purposes (false). Make sure, by saying SHOW CREATE TABLE tablename, or better yet by saying this:

    SELECT character_set_name, collation_name
    FROM   information_schema.columns
    WHERE  table_schema = 'your_database_name'
    AND    table_name   = 'your_table_name'
    AND    column_name  = 'your_column_name';
    

  • Find out: what is the hexadecimal value?

    SELECT HEX(your_column_name)
    FROM your_table_name;
    

    If you see 3F, then that really is the encoding for ?, so no wonder you see “?”. Probably this happened because of a problem converting a particular character from your client character set to the target character set.

  • Find out: is a literal round trip possible, that is, if you select “literal” (or “_introducer hexadecimal-value”), do you get “literal” back as a result? For example, the Japanese Katakana Letter Pe, which looks like ペ, exists in all CJK character sets and has the code point value (hexadecimal coding) 0x30da. Enter:

    SELECT 'ペ' AS `ペ`;         /* or SELECT _ucs2 0x30da; */
    

    If the result doesn't look like ペ, the round trip failed. For bug reports, we might ask you to follow up with SELECT HEX('ペ');. Then we can see whether the client encoding is right.

  • Find out: is it the browser or application? Just use mysql (the MySQL client program, which on Windows will be mysql.exe). If mysql displays correctly but your application doesn't, then your problem is probably “Settings”, but consult also the question about “Troubles with Access (or Perl) (or PHP) (etc.)” much later in this FAQ.

    To find your settings, the statement you need here is SHOW VARIABLES. For example:

    mysql> SHOW VARIABLES LIKE 'char%';
    +--------------------------+----------------------------------------+
    | Variable_name            | Value                                  |
    +--------------------------+----------------------------------------+
    | character_set_client     | utf8                                   |
    | character_set_connection | utf8                                   |
    | character_set_database   | latin1                                 |
    | character_set_filesystem | binary                                 |
    | character_set_results    | utf8                                   |
    | character_set_server     | latin1                                 |
    | character_set_system     | utf8                                   |
    | character_sets_dir       | /usr/local/mysql/share/mysql/charsets/ |
    +--------------------------+----------------------------------------+
    8 rows in set (0.03 sec)
    

    The above are typical character-set settings for an international-oriented client (notice the use of utf8 Unicode) connected to a server in the West (latin1 is a West Europe character set and a default for MySQL).

    Although Unicode (usually the utf8 variant on Unix and the ucs2 variant on Windows) is better than latin1, it's often not what your operating system utilities support best. Many Windows users find that a Microsoft character set, such as cp932 for Japanese Windows, is more suitable.

    If you can't control the server settings and have no idea what your underlying computer uses, then try changing to a common character set for the country that you're in (euckr = Korea, gb2312 or gbk = People's Republic of China, big5 = Hong Kong and Taiwan, sjis or ujis or cp932 or eucjpms = Japan, ucs2 or utf8 = anywhere). Usually it is only necessary to change the client, connection, and results settings, and there is a single statement that changes all three at once, namely SET NAMES. For example:

    SET NAMES 'big5';
    

    Once you get the correct setting, you can make it permanent by editing my.cnf or my.ini. For example you might add lines looking like this:

    [mysqld]
    character-set-server=big5
    [client]
    default-character-set=big5
    

10.10.2. Troubles with GB character sets (Chinese)

MySQL supports the two common variants of the GB (“Guojia Biaozhun” or “National Standard”) character sets which are official in the People's Republic of China: gb2312 and gbk. Sometimes people try to insert gbk characters into gb2312, and it works most of the time because gbk is a superset of gb2312. But eventually they try to insert a rarer Chinese character and it doesn't work. (For an example, see bug #16072 in our bugs database, http://bugs.mysql.com/bug.php?id=16072.)

So we'll try to clarify here exactly which characters are legitimate in gb2312 or gbk, with reference to the official documents. Please check these references before reporting gb2312 or gbk bugs. We now have a graphic listing of the gb2312 characters, currently on the site of Mr. Alexander Barkov (MySQL's principal programmer for character-set issues). The chart is in order according to the gb2312_chinese_ci collation: http://d.udm.net/bar/~bar/charts/gb2312_chinese_ci.html.

MySQL's gbk is in reality “Microsoft code page 936”. This differs from the official gbk for characters A1A4 (middle dot), A1AA (em dash), A6E0-A6F5, and A8BB-A8C0. For a listing of the differences, see http://recode.progiciels-bpi.ca/showfile.html?name=dist/libiconv/gbk.h. For a listing of gbk/Unicode mappings, see http://www.unicode.org/Public/MAPPINGS/VENDORS/MICSFT/WINDOWS/CP936.TXT. For MySQL's listing of gbk characters, see http://d.udm.net/bar/~bar/charts/gbk_chinese_ci.html.
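
Before reporting a bug, a quick self-check is to convert the suspect character to both gb2312 and gbk and compare the results. A minimal sketch (the literal is only a placeholder; substitute the character you are unsure about, and assume your connection character set can represent it):

SELECT HEX(CONVERT('好' USING gb2312)) AS gb2312_hex,
       HEX(CONVERT('好' USING gbk))    AS gbk_hex;
-- A result of 3F ("?") in one column but not the other means the character
-- exists only in the larger (gbk) set.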

10.10.3. Troubles with big5 character set (Chinese)

MySQL supports the Big5 character set which is common in Hong Kong and the Republic of China (Taiwan). MySQL's big5 is in reality “Microsoft code page 950”, which is very similar to the original big5 character set. This is a recent change, starting with MySQL version 4.1.16 / 5.0.16. We made the change as a result of a bug report, bug #12476 in our bugs database, http://bugs.mysql.com/bug.php?id=12476 (title: “Some big5 codes are still missing ...”). For example, the following statements work in the current version of MySQL, but not in old versions:

mysql> create table big5 (big5 char(1) character set big5);
Query OK, 0 rows affected (0.13 sec)

mysql> insert into big5 values (0xf9dc);
Query OK, 1 row affected (0.00 sec)

mysql> select * from big5;
+------+
| big5 |
+------+
| 嫺  |
+------+
1 row in set (0.02 sec)

There is a feature request for adding HKSCS extensions (bug #13577 in our bugs database, http://bugs.mysql.com/bug.php?id=13577). People who need the extension may find the suggested patch for bug #13577 is of interest.

10.10.4. Troubles with character-set conversions (Japanese)

MySQL supports the sjis, ujis, cp932, and eucjpms character sets, as well as Unicode. A common need is to convert between character sets. For example, there might be a Unix server (typically with sjis or ujis) and a Windows client (typically with cp932). But conversions can seem to fail. Here's why. In this conversion table, the ucs2 column is the source, and the sjis/cp932/ujis/eucjpms columns are the destination, that is, what the hexadecimal result would be if we used CONVERT(ucs2) or if we assigned a ucs2 column containing the value to an sjis/cp932/ujis/eucjpms column.

character name             ucs2  sjis  cp932  ujis    eucjpms
--------------             ----  ----  -----  ----    -------

BROKEN BAR                 00A6  3F    3F     8FA2C3  3F
FULLWIDTH BROKEN BAR       FFE4  3F    FA55   3F      8FA2

YEN SIGN                   00A5  3F    3F     20      3F
FULLWIDTH YEN SIGN         FFE5  818F  818F   A1EF    3F

TILDE                      007E  7E    7E     7E      7E
OVERLINE                   203E  3F    3F     20      3F

HORIZONTAL BAR             2015  815C  815C   A1BD    A1BD
EM DASH                    2014  3F    3F     3F      3F

REVERSE SOLIDUS            005C  815F  5C     5C      5C
FULLWIDTH REVERSE SOLIDUS  FF3C  3F    815F   3F      A1C0

WAVE DASH                  301C  8160  3F     A1C1    3F
FULLWIDTH TILDE            FF5E  3F    8160   3F      A1C1

DOUBLE VERTICAL LINE       2016  8161  3F     A1C2    3F
PARALLEL TO                2225  3F    8161   3F      A1C2

MINUS SIGN                 2212  817C  3F     A1DD    3F
FULLWIDTH HYPHEN-MINUS     FF0D  3F    817C   3F      A1DD

CENT SIGN                  00A2  8191  3F     A1F1    3F
FULLWIDTH CENT SIGN        FFE0  3F    8191   3F      A1F1

POUND SIGN                 00A3  8192  3F     A1F2    3F
FULLWIDTH POUND SIGN       FFE1  3F    8192   3F      A1F2

NOT SIGN                   00AC  81CA  3F     A2CC    3F
FULLWIDTH NOT SIGN         FFE2  3F    81CA   3F      A2CC

For example, consider this extract from the table:

                       ucs2 sjis cp932
                       ---- ---- -----
NOT SIGN               00AC 81CA    3F
FULLWIDTH NOT SIGN     FFE2   3F  81CA

It means “for NOT SIGN, which is Unicode U+00AC, MySQL converts to sjis code point 0x81CA and to cp932 code point 3F”. (3F is the question mark (“?”) and is what we always use when we can't convert.) Now, what should we do if we want to convert sjis 81CA to cp932? Our answer is: “?”. There have been serious complaints about this: many people would prefer a “loose” conversion, so that 81CA (NOT SIGN) in sjis becomes 81CA (FULLWIDTH NOT SIGN) in cp932. We are considering changing this behaviour.
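
You can verify this behaviour directly from the table; a minimal check based on the NOT SIGN row above:

SELECT HEX(CONVERT(_sjis 0x81CA USING cp932)) AS cp932_hex;
-- Expect 3F ("?"): under the strict mapping, sjis NOT SIGN has no cp932
-- equivalent, so MySQL substitutes a question mark.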

10.10.5. The Great Yen Sign problem (Japanese)

In SJIS the code for Yen Sign (¥) is 5C. In SJIS the code for Reverse Solidus (\) is 5C. Since the above statements are contradictory, confusion often results. Well, to put it more seriously, some versions of Japanese character sets (both sjis and euc) have treated 5C as a reverse solidus, also known as a backslash, and others have treated it as a yen sign. There's nothing we can do, except take sides: MySQL follows only one version of the JIS (Japanese Industrial Standards) standard description, and 5C is Reverse Solidus, always. Should we make a separate character set where 5C is Yen Sign, as another DBMS (Oracle) does? We haven't decided. Certainly not in version 5.1 or 5.2. But if people keep complaining about The Great Yen Sign Problem, that's one possible solution.
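
You can check which side MySQL takes directly in SQL; a minimal check (the expected value follows from the rule just stated):

SELECT HEX(CONVERT(_sjis 0x5C USING ucs2)) AS unicode_hex;
-- Expect 005C (REVERSE SOLIDUS), not 00A5 (YEN SIGN).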

10.10.6. Troubles with euckr character set (Korean)

MySQL supports the euckr (Extended Unix Code Korea) character set, which is common in South Korea. In theory, problems could arise because there have been several versions of this character set. So far, only one problem has been noted, for Korea's currency symbol. We use the “ASCII” variant of EUC-KR, in which the code point 0x5c is REVERSE SOLIDUS, that is \, instead of the “KS-Roman” variant of EUC-KR, in which the code point 0x5c is WON SIGN, that is “₩”. You can't convert Unicode U+20A9 WON SIGN to euckr:

mysql> SELECT CONVERT('₩' USING euckr) AS euckr,
-> HEX(CONVERT('₩' USING euckr)) AS hexeuckr;
+-------+----------+
| euckr | hexeuckr |
+-------+----------+
| ?     | 3F       |
+-------+----------+
1 row in set (0.00 sec)

MySQL's graphic Korean chart is here: http://d.udm.net/bar/~bar/charts/euckr_korean_ci.html.

10.10.7. The “Data truncated” message

For illustration, we'll make a table with one Unicode (ucs2) column and one Chinese (gb2312) column.

mysql> CREATE TABLE ch
    -> (ucs2 CHAR(3) CHARACTER SET ucs2,
    -> gb2312 CHAR(3) CHARACTER SET gb2312);
Query OK, 0 rows affected (0.05 sec) 

We'll try to place the rare character 汌 in both columns.

mysql> INSERT INTO ch VALUES ('A汌B','A汌B');
Query OK, 1 row affected, 1 warning (0.00 sec) 

Ah, there's a warning. Let's see what it is.

mysql> SHOW WARNINGS;
+---------+------+---------------------------------------------+
| Level   | Code | Message                                     |
+---------+------+---------------------------------------------+
| Warning | 1265 | Data truncated for column 'gb2312' at row 1 |
+---------+------+---------------------------------------------+
1 row in set (0.00 sec)

So it's a warning about the gb2312 column only.

mysql> SELECT ucs2,HEX(ucs2),gb2312,HEX(gb2312) FROM ch;
+-------+--------------+--------+-------------+
| ucs2  | HEX(ucs2)    | gb2312 | HEX(gb2312) |
+-------+--------------+--------+-------------+
| A汌B | 00416C4C0042 | A?B    | 413F42      |
+-------+--------------+--------+-------------+
1 row in set (0.00 sec)

There are several things that need explanation here.

  1. The fact that it's a “warning” rather than an “error” is characteristic of MySQL. We like to try to do what we can, to get the best fit, rather than give up.

  2. The 汌 character isn't in the gb2312 character set. We described that problem earlier.

  3. Admittedly the message is misleading. We didn't “truncate” in this case; we replaced with a question mark. We've had a complaint about this message (bug #9337). But until we come up with something better, just accept that error/warning code 1265 can mean a variety of things.

  4. With SQL_MODE=TRADITIONAL, there would be an error message rather than a warning, and instead of warning 1265 you would see: ERROR 1406 (22001): Data too long for column 'gb2312' at row 1. An example follows this list.
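
For example, repeating the INSERT from above after enabling strict mode (a sketch; output abbreviated):

mysql> SET sql_mode = 'TRADITIONAL';
mysql> INSERT INTO ch VALUES ('A汌B','A汌B');
ERROR 1406 (22001): Data too long for column 'gb2312' at row 1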

10.10.8. Troubles with Access, Perl, PHP, etc.

You can't get things to look right with your special program for a GUI front end or browser? Get a direct connection to the server (with mysql on Unix or with mysql.exe on Windows) and try the same query there. If mysql is okay, then the trouble is probably that your application interface needs some initializing. Use mysql to tell you what character set(s) it uses, by saying SHOW VARIABLES LIKE 'char%';. If it's Access, you're probably connecting with MyODBC. So you'll want to check out the Reference Manual page for configuring an ODBC DSN, and pay attention particularly to the illustrations for “SQL command on connect”. You should enter SET NAMES 'big5' (supposing that you use big5) (you don't need a ; here). If it's ASP, you might need to add SET NAMES in the code. Here is an example that has worked in the past:

<%
Session.CodePage=0
Dim strConnection
Dim Conn
strConnection="driver={MySQL ODBC 3.51 Driver};server=yourserver;uid=yourusername;" \ 
        & "pwd=yourpassword;database=yourdatabase;stmt=SET NAMES 'big5';"
Set Conn = Server.CreateObject(“ADODB.Connection”) 
Conn.Open strConnection
%> 

If it's PHP, here's a slightly different user suggestion:

<?php 
  $link = mysql_connect($host,$usr,$pwd); 
  mysql_select_db($db); 
  if (mysql_error()) { print "Database ERROR: " . mysql_error(); } 
  mysql_query("SET CHARACTER SET utf8", $link); 
  mysql_query("SET NAMES 'utf8'", $link); 
?>

In this case, the tipper used the SET CHARACTER SET statement to change character_set_client and character_set_results, and used SET NAMES to change character_set_client, character_set_connection, and character_set_results. So actually the SET CHARACTER SET statement is redundant. (Incidentally, MySQL people encourage the use of the mysqli extension rather than the older mysql extension that this example uses.) Another thing to check with PHP is the browser's assumptions. Sometimes a meta tag change in the heading area suffices, for example: <meta http-equiv="Content-Type" content="text/html; charset=utf-8">

For Connector/J tips, see the manual section in the Connectors chapter titled “Using Character Sets and Unicode”.

10.10.9. How can I get old MySQL 4.0 behaviour back?

In the old days, with MySQL Version 4.0, there was a single “global” character set for both server and client sides, and the decision was made by the server administrator. We changed that starting with MySQL Version 4.1. What happens now is a “handshake”. The MySQL Reference Manual describes it thus:

When a client connects, it sends to the server the name of the character set that it wants to use. The server uses the name to set the character_set_client, character_set_results, and character_set_connection system variables. In effect, the server performs a SET NAMES operation using the character set name.

The effect of this is: you can't control the client character set by saying mysqld --character-set-server=utf8. But some Asian customers said that they don't like that; they want the MySQL 4.0 behaviour. So we added a mysqld switch, --character-set-client-handshake, which (and this is the interesting part) can be turned off with --skip-character-set-client-handshake. If you start mysqld with --skip-character-set-client-handshake, then the behaviour is like this: when a client connects, it sends to the server the name of the character set that it wants to use, and the server ignores it!

Here is an illustration with the handshake switch on or off. Pretend that your favourite server character set is latin1 (of course that's unlikely in a CJK area, but it's MySQL's default if there's no my.ini or my.cnf file). Pretend that the client operates with utf8 because that's what the client's operating system supports. Start the server with a default character set, latin1:

mysqld --character-set-server=latin1

Start the client with a default character set, utf8:

mysql --default-character-set=utf8

Show what the current settings are:

mysql> SHOW VARIABLES LIKE 'char%';
+--------------------------+----------------------------------------+
| Variable_name            | Value                                  |
+--------------------------+----------------------------------------+
| character_set_client     | utf8                                   |
| character_set_connection | utf8                                   |
| character_set_database   | latin1                                 |
| character_set_filesystem | binary                                 |
| character_set_results    | utf8                                   |
| character_set_server     | latin1                                 |
| character_set_system     | utf8                                   |
| character_sets_dir       | /usr/local/mysql/share/mysql/charsets/ |
+--------------------------+----------------------------------------+
8 rows in set (0.01 sec)

Stop the client. Stop the server with mysqladmin. Start the server again but this time say “skip the handshake”:

mysqld --character-set-server=utf8 --skip-character-set-client-handshake

Start the client with a default character set, utf8, again. Show what the current settings are, again:

mysql> SHOW VARIABLES LIKE 'char%';
+--------------------------+----------------------------------------+
| Variable_name            | Value                                  |
+--------------------------+----------------------------------------+
| character_set_client     | latin1                                 |
| character_set_connection | latin1                                 |
| character_set_database   | latin1                                 |
| character_set_filesystem | binary                                 |
| character_set_results    | latin1                                 |
| character_set_server     | latin1                                 |
| character_set_system     | utf8                                   |
| character_sets_dir       | /usr/local/mysql/share/mysql/charsets/ |
+--------------------------+----------------------------------------+
8 rows in set (0.01 sec)

As you can see by comparing the SHOW VARIABLES results, the server ignores the client's initial settings if the --skip-character-set-client-handshake option is used.

10.10.10. Why do some LIKE and FULLTEXT searches fail?

There is a simple problem with LIKE searches on BINARY and BLOB columns: we need to know where each character ends. With multi-byte character sets, different characters may have different octet lengths. For example, in utf8, A requires one byte but ペ requires three bytes. Illustration:

        mysql> SELECT octet_length(_utf8 'A'), octet_length(_utf8 'ペ');
        +-------------------------+---------------------------+
        | octet_length(_utf8 'A') | octet_length(_utf8 'ペ') |
        +-------------------------+---------------------------+
        |                       1 |                         3 |
        +-------------------------+---------------------------+
        1 row in set (0.00 sec)


If we don't know where the first character ends, we don't know where the second character begins, and even simple-looking searches like LIKE '_A%' will fail. The solution is to use a regular CJK character set in the first place, or to convert to a CJK character set before comparing. Incidentally, this is one reason why MySQL cannot allow encodings of nonexistent characters: it must be strict about rejecting bad input, or it won't know where characters end.

There is a similar problem with FULLTEXT: we need to know where each word ends. With Western writing this is rarely a problem, because there are spaces between words. With Asian writing it is not. We could use half-good solutions, like saying that all Han characters represent words, or depending on (Japanese) changes from Katakana to Hiragana, which are due to grammatical endings. But the only good solution requires a dictionary, and we haven't found a good open-source dictionary.
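
For example, converting before comparing would look roughly like this (a sketch only; the table and column names are hypothetical, and sjis is just one possible target character set):

-- Convert the BLOB to a CJK character set before comparing, so that LIKE
-- matches characters rather than individual bytes.
SELECT *
FROM   your_table
WHERE  CONVERT(your_blob_column USING sjis) LIKE '_A%';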

10.10.11. What CJK character sets are available?

The list of CJK character sets may vary depending on version. For example, the eucjpms character set is a recent addition. But the language name appears in the DESCRIPTION column for every entry in information_schema.character_sets. Therefore, to get a current list of all the non-Unicode CJK character sets, say:

mysql> SELECT character_set_name, description
    -> FROM information_schema.character_sets
    -> WHERE description LIKE '%Chinese%'
    -> OR    description LIKE '%Japanese%'
    -> OR    description LIKE '%Korean%'
    -> ORDER BY character_set_name;
+--------------------+---------------------------+
| character_set_name | description               |
+--------------------+---------------------------+
| big5               | Big5 Traditional Chinese  |
| cp932              | SJIS for Windows Japanese |
| eucjpms            | UJIS for Windows Japanese |
| euckr              | EUC-KR Korean             |
| gb2312             | GB2312 Simplified Chinese |
| gbk                | GBK Simplified Chinese    |
| sjis               | Shift-JIS Japanese        |
| ujis               | EUC-JP Japanese           |
+--------------------+---------------------------+
8 rows in set (0.01 sec)

10.10.12. Is character X available in all character sets?

The majority of everyday-use Chinese/Japanese characters (simplified Chinese and basic non-halfwidth Kana Japanese) appear in all CJK character sets. Here is a stored procedure which accepts a UCS-2 Unicode character, converts it to all other character sets, and displays the results in hexadecimal.

DELIMITER //

CREATE PROCEDURE p_convert (ucs2_char CHAR(1) CHARACTER SET ucs2)
BEGIN

CREATE TABLE tj
             (ucs2 CHAR(1) character set ucs2,
              utf8 CHAR(1) character set utf8,
              big5 CHAR(1) character set big5,
              cp932 CHAR(1) character set cp932,
              eucjpms CHAR(1) character set eucjpms,
              euckr CHAR(1) character set euckr,
              gb2312 CHAR(1) character set gb2312,
              gbk CHAR(1) character set gbk,
              sjis CHAR(1) character set sjis,
              ujis CHAR(1) character set ujis);

INSERT INTO tj (ucs2) VALUES (ucs2_char);

UPDATE tj SET utf8=ucs2,
              big5=ucs2,
              cp932=ucs2,
              eucjpms=ucs2,
              euckr=ucs2,
              gb2312=ucs2,
              gbk=ucs2,
              sjis=ucs2,
              ujis=ucs2;

/* If there's a conversion problem, UPDATE will produce a warning. */

SELECT hex(ucs2) AS ucs2,
       hex(utf8) AS utf8,
       hex(big5) AS big5,
       hex(cp932) AS cp932,
       hex(eucjpms) AS eucjpms,
       hex(euckr) AS euckr,
       hex(gb2312) AS gb2312,
       hex(gbk) AS gbk,
       hex(sjis) AS sjis,
       hex(ujis) AS ujis
FROM tj;

DROP TABLE tj;

END//

The input can be any single ucs2 character, or it can be the code point value (hexadecimal representation) of that character. Here's an example of what P_CONVERT() can do. An earlier answer said that the character “Katakana Letter Pe” appears in all CJK character sets. We know that the code point value of Katakana Letter Pe is 0x30da. (By the way, we got the name from Unicode's list of ucs2 encodings and names: http://www.unicode.org/Public/UNIDATA/UnicodeData.txt.) So we'll say:

mysql> CALL P_CONVERT(0x30da)//
+------+--------+------+-------+---------+-------+--------+------+------+------+
| ucs2 | utf8   | big5 | cp932 | eucjpms | euckr | gb2312 | gbk  | sjis | ujis |
+------+--------+------+-------+---------+-------+--------+------+------+------+
| 30DA | E3839A | C772 | 8379  | A5DA    | ABDA  | A5DA   | A5DA | 8379 | A5DA |
+------+--------+------+-------+---------+-------+--------+------+------+------+
1 row in set (0.04 sec)

Since none of the column values is 3F, we know that every conversion worked.
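
The procedure can also be called with the character itself rather than its code point value; assuming your client character set can represent ペ, this should produce the same row:

CALL p_convert('ペ')//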

10.10.13. Strings don't sort correctly in Unicode (I)

Sometimes people observe that the result of a utf8_unicode_ci or ucs2_unicode_ci search, or of an ORDER BY sort, is not what they think a native speaker would expect. Although we never rule out the chance that there is a bug, we have found in the past that people are not correctly reading the standard table of weights for the Unicode Collation Algorithm. So, here's how to check. The correct table for MySQL is this one: http://www.unicode.org/Public/UCA/4.0.0/allkeys-4.0.0.txt. This is different from the first table you will find by navigating from the unicode.org home page: MySQL deliberately uses the older 4.0.0 “allkeys” table, instead of the current 4.1.0 table, because we are very wary about changing an ordering that affects indexes.

Here is an example of a problem that we handled recently, for a complaint in our bugs database, http://bugs.mysql.com/bug.php?id=16526:

mysql> CREATE TABLE tj (s1 CHAR(1) CHARACTER SET utf8 COLLATE utf8_unicode_ci);
Query OK, 0 rows affected (0.05 sec)

mysql> INSERT INTO tj VALUES ('が'),('か');
Query OK, 2 rows affected (0.00 sec)
Records: 2  Duplicates: 0  Warnings: 0

mysql> SELECT * FROM tj WHERE s1 = 'か';
+------+
| s1   |
+------+
| が  |
| か  |
+------+
2 rows in set (0.00 sec)

If your eyes are sharp, you'll see that the character in the first result row isn't the one we searched for. Why did MySQL retrieve it? First we look at the Unicode code point values, which we can see by reading the hexadecimal numbers for the ucs2 version of the characters:

mysql> SELECT s1,HEX(CONVERT(s1 USING ucs2)) FROM tj;
+------+-----------------------------+
| s1   | HEX(CONVERT(s1 USING ucs2)) |
+------+-----------------------------+
| が  | 304C                        |
| か  | 304B                        |
+------+-----------------------------+
2 rows in set (0.03 sec)

Now let's search for 304B and 304C in the 4.0.0 allkeys table. We'll find these lines:

304B  ; [.1E57.0020.000E.304B] # HIRAGANA LETTER KA
304C  ; [.1E57.0020.000E.304B][.0000.0140.0002.3099] # HIRAGANA LETTER GA; QQCM

The official Unicode names (following the “#” mark) are informative; they tell us the Japanese syllabary (Hiragana), the informal classification (letter, as opposed to digit or punctuation), and the Western identifier (KA or GA, which happen to be the voiced and unvoiced forms of the same letter pair). More importantly, the Primary Weight (the first hexadecimal number inside the square brackets) is 1E57 on both lines. For comparisons in both searching and sorting, MySQL pays attention only to the Primary Weight and ignores all the other numbers. So now we know that we're sorting these two characters correctly according to the Unicode specification. If we wanted to distinguish them, we'd have to use a non-Unicode-Collation-Algorithm collation (utf8_bin or utf8_general_ci), or compare the HEX() values, or say ORDER BY CONVERT(s1 USING sjis). Being correct “according to Unicode” isn't enough, of course: the person who submitted the bug was equally correct. We plan to add another collation for Japanese according to the JIS X 4061 standard, in which voiced/unvoiced letter pairs like KA/GA are distinguishable for ordering purposes.
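
If you do want か and が treated as different in a search, one workaround mentioned above is to force a binary comparison. A minimal sketch, assuming the connection character set is utf8:

SELECT * FROM tj WHERE s1 = 'か' COLLATE utf8_bin;
-- Returns only the か row, because utf8_bin compares code points rather
-- than Unicode Collation Algorithm weights.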

10.10.14. Strings don't sort correctly in Unicode (II)

You're using Unicode (ucs2 or utf8), and you know what the Unicode sort order is (see the previous question and answer), but MySQL still seems to sort your table wrong? This might be easy.

mysql> SHOW CREATE TABLE t\G
******************** 1. row ******************
Table: t
Create Table: CREATE TABLE `t` (
`s1` char(1) CHARACTER SET ucs2 DEFAULT NULL
) ENGINE=MyISAM DEFAULT CHARSET=latin1
1 row in set (0.00 sec)

Hmm, the character set looks okay. Let's look at the information_schema for this column.

mysql> SELECT column_name, character_set_name, collation_name
    -> FROM information_schema.columns
    -> WHERE column_name = 's1'
    -> AND table_name = 't';
+-------------+--------------------+-----------------+
| column_name | character_set_name | collation_name  |
+-------------+--------------------+-----------------+
| s1          | ucs2               | ucs2_general_ci |
+-------------+--------------------+-----------------+
1 row in set (0.01 sec)

Oops, the collation is ucs2_general_ci instead of ucs2_unicode_ci! Here's why:

mysql> SHOW CHARSET LIKE 'ucs2%';
+---------+---------------+-------------------+--------+
| Charset | Description   | Default collation | Maxlen |
+---------+---------------+-------------------+--------+
| ucs2    | UCS-2 Unicode | ucs2_general_ci   |      2 |
+---------+---------------+-------------------+--------+
1 row in set (0.00 sec)

For ucs2 and utf8, the “general” collation is the default. To specify that you wanted a “unicode” collation, you should have specified COLLATE ucs2_unicode_ci.
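
For example, to get the Unicode Collation Algorithm ordering you would create (or alter) the column with an explicit COLLATE clause; a minimal sketch reusing the table above:

CREATE TABLE t (s1 CHAR(1) CHARACTER SET ucs2 COLLATE ucs2_unicode_ci);

-- or, for an existing table:
ALTER TABLE t MODIFY s1 CHAR(1) CHARACTER SET ucs2 COLLATE ucs2_unicode_ci;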

10.10.15. My supplementary characters get rejected

Right. MySQL doesn't support supplementary characters (characters which need more than 3 bytes with UTF-8). We support only what Unicode calls the Basic Multilingual Plane / Plane 0. Only a few very rare Han characters are supplementary; support for them is uncommon. This has led to bug #12600 (http://bugs.mysql.com/bug.php?id=12600) which we rejected as “not a bug”. With utf8, we must truncate an input string when we encounter bytes that we don't understand. Otherwise, we wouldn't know how long the bad multi-byte character is. A workaround is: if you use ucs2 instead of utf8, then the bad characters will change to question marks, but there will be no truncation. Or change the data type to BLOB or BINARY, which have no validity checking. In our bugs database, bug #14052 (http://bugs.mysql.com/bug.php?id=14052) is a feature request for Wikipedia, asking us to support supplementary characters extending ucs2 as well as utf8.
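
If you only need to store the data and can do without character-set validity checking, the BLOB workaround looks roughly like this (a minimal sketch; the table name is made up):

-- BLOB and BINARY columns accept any byte sequence, so supplementary
-- characters are stored untouched, but there is no validity checking and
-- no collation-aware comparison.
CREATE TABLE raw_strings (s1 BLOB);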

10.10.16. Shouldn't it be CJKV (V for Vietnamese)?

No. The term CJKV (Chinese Japanese Korean Vietnamese) refers to character sets which contain Han (originally Chinese) characters. MySQL has no plan to support the old Vietnamese script using Han characters. MySQL does of course support the modern Vietnamese script with Western characters. Another question that has come up (once) is a request for specialized Vietnamese collation, see http://bugs.mysql.com/bug.php?id=4745. We might do something about it someday, if many more requests arise.

10.10.17. Will MySQL fix any CJK problems in version 5.1?

Yes. We're changing the names of files and directories. Here's an example, using mysql as root under Linux:

  1. Create a table with a name containing a Han character:

    mysql> CREATE TABLE tab_楮 (s1 INT);
    Query OK, 0 rows affected (0.07 sec)
    

  2. Find out where MySQL stores database files:

    mysql> SHOW VARIABLES LIKE 'datadir';
    +---------------+-----------------------+
    | Variable_name | Value                 |
    +---------------+-----------------------+
    | datadir       | /usr/local/mysql/var/ |
    +---------------+-----------------------+
    1 row in set (0.00 sec)
    

  3. Look at the directory to see the MyISAM table files:

    # cd /usr/local/mysql/var/dba
    # ls -l tab_*
    -rw-rw----  1 root root    0 2006-05-16 10:22 tab_@696e.MYD
    -rw-rw----  1 root root 1024 2006-05-16 10:22 tab_@696e.MYI
    -rw-rw----  1 root root 8556 2006-05-16 10:22 tab_@696e.frm
    

Notice that MySQL has converted the Han character to @ + (the Unicode value of the Han character), that is, to a purely ASCII representation. This solves an old problem: database files weren't portable, because some computers wouldn't allow a character like 楮 in a file name. Conversion to the new file names will be automatic when you upgrade to version 5.1. This should take care of bug #6313 in our bugs database, http://bugs.mysql.com/bug.php?id=6313.

10.10.18. When will MySQL translate the manual again?

A Beijing-based group has produced a Simplified Chinese version for us under contract. It's complete and can be found on http://dev.mysql.com/doc/#chinese-5.1. It's up to date as of version 5.1.2. The Japanese manual can be downloaded from http://dev.mysql.com/doc/#japanese-4.1. It is still for version 4.1.

10.10.19. Whom can I talk to?

Check http://dev.mysql.com/user-groups/ to see if there is a MySQL user group near you. If there isn't: why not start one yourself? To contact a sales engineer in MySQL KK's Japan office:

Tel: +81(0)3-5326-3133
Fax: +81(0)3-5326-3001
Email: dsaito@mysql.com

To see feature requests about language issues:

  • Go to http://bugs.mysql.com.

  • Click Advanced Search.

  • In the Severity dropdown box, click S4 (Feature Request).

  • In the list box beside Category, click Character Sets.

  • Click the Search button.

You can post CJK questions, or see previous answers, on MySQL's “Character Sets, Collation, Unicode” forum: http://forums.mysql.com/list.php?103. MySQL plans to add native-language forums on http://forums.mysql.com/ very soon.