<html>
<head>
<base href="http://www.jacorb.org/bugzilla/" />
</head>
<body><table border="1" cellspacing="0" cellpadding="8">
<tr>
<th>Bug ID</th>
<td><a class="bz_bug_link
bz_status_NEW "
title="NEW - Incorrect UTF-8 conversion for non-BMP characters"
href="http://www.jacorb.org/bugzilla/show_bug.cgi?id=969">969</a>
</td>
</tr>
<tr>
<th>Summary</th>
<td>Incorrect UTF-8 conversion for non-BMP characters
</td>
</tr>
<tr>
<th>Product</th>
<td>JacORB
</td>
</tr>
<tr>
<th>Version</th>
<td>3.3
</td>
</tr>
<tr>
<th>Hardware</th>
<td>PC
</td>
</tr>
<tr>
<th>OS</th>
<td>Linux
</td>
</tr>
<tr>
<th>Status</th>
<td>NEW
</td>
</tr>
<tr>
<th>Severity</th>
<td>enhancement
</td>
</tr>
<tr>
<th>Priority</th>
<td>P5
</td>
</tr>
<tr>
<th>Component</th>
<td>ORB
</td>
</tr>
<tr>
<th>Assignee</th>
<td>jacorb-bugs@lists.spline.inf.fu-berlin.de
</td>
</tr>
<tr>
<th>Reporter</th>
<td>peter.klotz@ith-icoserve.com
</td>
</tr></table>
<p>
<div>
<pre>The methods read_char() and write_char() in class Utf8CodeSet convert data
from and to UTF-8 one Java char at a time. This breaks for characters outside
the Basic Multilingual Plane (all code points beyond U+FFFF). UTF-16 encodes
these as surrogate pairs, so two Java chars together form a single character.
The following currently happens in JacORB 3.3 when sending Unicode Character
U+1044F (DESERET SMALL LETTER EW, see
<a href="http://www.fileformat.info/info/unicode/char/1044f/index.htm">http://www.fileformat.info/info/unicode/char/1044f/index.htm</a>):
Java UTF-16 string: "\uD801\uDC4F"
Converted into UTF-8 and received by omniORB: "\xed\xa0\x81" "\xed\xb1\x8f"
The correct UTF-8 encoding would be: "\xf0\x90\x91\x8f"
So JacORB treats each surrogate as a character of its own and encodes it
into UTF-8 separately. This yields 6 bytes, whereas the correct encoding is
4 bytes long.
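The two byte sequences can be reproduced with a minimal Java sketch. The names
SurrogateDemo, threeByteHex and fourByteHex are illustrative assumptions and
not JacORB code; the sketch only applies the standard UTF-8 bit patterns.

```java
class SurrogateDemo {
    // Applies the 3-byte UTF-8 pattern to one 16-bit value, as a converter
    // working char by char would do for any value at or above U+0800.
    static String threeByteHex(char c) {
        int b1 = 0xE0 | (c >> 12);
        int b2 = 0x80 | ((c >> 6) & 0x3F);
        int b3 = 0x80 | (c & 0x3F);
        return String.format("%02x %02x %02x", b1, b2, b3);
    }

    // Applies the 4-byte UTF-8 pattern to a supplementary code point
    // (U+10000 and above).
    static String fourByteHex(int cp) {
        int b1 = 0xF0 | (cp >> 18);
        int b2 = 0x80 | ((cp >> 12) & 0x3F);
        int b3 = 0x80 | ((cp >> 6) & 0x3F);
        int b4 = 0x80 | (cp & 0x3F);
        return String.format("%02x %02x %02x %02x", b1, b2, b3, b4);
    }

    public static void main(String[] args) {
        String s = "\uD801\uDC4F"; // U+1044F as a UTF-16 surrogate pair

        // Per-char conversion: prints "ed a0 81 ed b1 8f" (6 bytes, invalid).
        System.out.println(threeByteHex(s.charAt(0)) + " " + threeByteHex(s.charAt(1)));

        // Per-code-point conversion: prints "f0 90 91 8f" (4 bytes, correct).
        System.out.println(fourByteHex(s.codePointAt(0)));
    }
}
```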
To fix this, JacORB must no longer perform its conversion solely on a
per-char basis. The conversion classes should be able to handle Java strings.
This would allow a conversion class to detect surrogate pairs and convert each
pair in a single step into the correct destination encoding.</pre>
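<pre>A string-level conversion of this kind could look roughly like the sketch
below. It is not JacORB's API: the class StringUtf8Encoder and its encode()
method are hypothetical names, and replacing an unpaired surrogate with U+FFFD
is an assumed policy (an implementation might prefer to raise an error).

```java
import java.io.ByteArrayOutputStream;

class StringUtf8Encoder {
    // Encodes a whole Java string to UTF-8, one code point at a time.
    static byte[] encode(String s) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        int i = 0;
        while (i != s.length()) {
            int cp = s.codePointAt(i);     // joins a valid surrogate pair
            i += Character.charCount(cp);  // advances 2 chars for a pair, else 1
            if (cp >= 0x10000) {
                // Supplementary code point: 4 bytes.
                out.write(0xF0 | (cp >> 18));
                out.write(0x80 | ((cp >> 12) & 0x3F));
                out.write(0x80 | ((cp >> 6) & 0x3F));
                out.write(0x80 | (cp & 0x3F));
            } else if (cp >= 0x800) {
                // Assumed policy: an unpaired surrogate becomes U+FFFD.
                if (Character.isSurrogate((char) cp)) {
                    cp = 0xFFFD;
                }
                out.write(0xE0 | (cp >> 12));
                out.write(0x80 | ((cp >> 6) & 0x3F));
                out.write(0x80 | (cp & 0x3F));
            } else if (cp >= 0x80) {
                out.write(0xC0 | (cp >> 6));
                out.write(0x80 | (cp & 0x3F));
            } else {
                out.write(cp);             // ASCII: 1 byte
            }
        }
        return out.toByteArray();
    }

    public static void main(String[] args) {
        StringBuilder hex = new StringBuilder();
        for (byte b : encode("\uD801\uDC4F")) {
            hex.append(String.format("%02x ", b & 0xFF));
        }
        System.out.println(hex.toString().trim()); // prints "f0 90 91 8f"
    }
}
```

Because the loop advances by Character.charCount(cp), both halves of a
surrogate pair are consumed together and never encoded separately.</pre>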
</div>
</p>
</body>
</html>