Skip to main content

STR01-G: Use Char Data Type with Caution for Unicode


Introduction

The char data type was based on the original Unicode specification, which used 16-bit encoding. However, Unicode has since expanded to include characters that require more than 16 bits, with code points now ranging from U+0000 to U+10FFFF. The characters from U+0000 to U+FFFF are part of the Basic Multilingual Plane (BMP), while characters beyond U+FFFF are known as supplementary characters. These supplementary characters, though rare, are used in some contexts, such as family names in Eastern Asia.

Java represents supplementary characters as pairs of Unicode code units called surrogates to support supplementary characters without changing the char data type and causing compatibility issues with older Java programs. According to the Java API, the platform uses the UTF-16 representation in char arrays and the String and StringBuffer classes. Supplementary characters are represented as pairs of char values: the first from the high-surrogates range (U+D800-U+DBFF) and the second from the low-surrogates range (U+DC00-U+DFFF).

Therefore, a char value can only represent BMP code points, including surrogate code points or units of the UTF-16 encoding. For example, the UTF-16 encoding of characters like the capital letter 'A', the German "eszett" or "sharp S" (ß), the Han character (meaning "east"), and the Deseret capital letter long I (𐐀) demonstrate how different characters are represented in UTF-16.

Validating normalized forms

An int value represents all Unicode code points, including supplementary code points. The lower (least significant) 21 bits of int are used to represent Unicode code points, and the upper (most significant) 11 bits must be zero. Similar to UTF-8 (see STR00-J. Don't form strings containing partial characters from variable-width encodings), UTF-16 is a variable-width encoding. Because the UTF-16 representation is also used in char arrays and the String and StringBuffer classes, care must be taken when manipulating string data in Java. Conformance with this requirement typically requires using methods that accept a Unicode code point as an int value and avoiding methods that accept a Unicode code unit as a char value. These latter methods cannot support supplementary characters.

Security Risks

Handling Unicode characters improperly can introduce several security vulnerabilities in Java applications. Here are the primary risks:

1. Incomplete or Incorrect Handling of Supplementary Characters:

Methods that handle only 16-bit char values might misinterpret or ignore supplementary characters. For instance, the Character.isLetter('\uD840') method returns false because it treats values from the surrogate range as undefined characters. This can lead to security vulnerabilities if the application fails to process characters correctly.

2. Code Injection Attacks:

Mishandling of Unicode characters might allow attackers to inject unexpected characters or strings into the application, potentially leading to injection attacks. For example, a function that incorrectly processes Unicode could be exploited to bypass security checks or inject malicious code.

1. Data Integrity and Authenticity Issues:

If supplementary characters are not handled correctly, the integrity and authenticity of data can be compromised. This is particularly important in applications that require accurate processing and storage of multilingual data

Non-Compliant Code Example

This non-compliant code example attempts to trim leading letters from string:

// Non-compliant Code Example

static function trim(string : String) : String {
var cui = 0
while (cui < string.length()) {
var cu = string.charAt(cui)
if (not isLetter(cu)) break
cui++
}
return string.substring(cui)
}

Unfortunately, the trim() method may fail because it uses the character form of the Character.isLetter() method. Methods that accept only a char value cannot support supplementary characters. According to the Java API [API 2014] class Character documentation:

They treat char values from the surrogate ranges as undefined characters. For example, Character.isLetter('\uD840') returns false, even though this specific value, if followed by any low-surrogate value in a string, would represent a letter.

Compliant Solution

This compliant solution corrects the problem with supplementary characters using the integer form of the Character.isLetter() method that accepts a Unicode code point as an int argument. Java library methods that accept an int value support all Unicode characters, including supplementary characters.

// Compliant Solution

static function trim(string : String) : String {
var cpi = 0 // code point index
foreach (cp in string.codePoints().iterator()) {
if (not Character.isLetter(cp)) break
cpi++
}
return string.substring(string.offsetByCodePoints(0, cpi))
}

Risk Assessment

Incorrectly assuming that an object of type char can fully represent a Unicode code point can result in incorrect logic when testing code units from surrogate ranges.

RuleSeverityLikelihoodRemediation CostPriorityLevel
STR01-GMediumLikelyMediumL8L2

Additional Resources