STR01-G: Use Char Data Type with Caution for Unicode
Introduction
The char
data type was based on the original Unicode specification, which used 16-bit encoding. However, Unicode has since expanded to include characters that require more than 16 bits, with code points now ranging from U+0000
to U+10FFFF
. The characters from U+0000
to U+FFFF
are part of the Basic Multilingual Plane (BMP), while characters beyond U+FFFF
are known as supplementary characters. These supplementary characters, though rare, are used in some contexts, such as family names in Eastern Asia.
Java represents supplementary characters as pairs of Unicode code units called surrogates to support supplementary characters without changing the char data type and causing compatibility issues with older Java programs. According to the Java API, the platform uses the UTF-16 representation in char
arrays and the String
and StringBuffer
classes. Supplementary characters are represented as pairs of char
values: the first from the high-surrogates range (U+D800
-U+DBFF
) and the second from the low-surrogates range (U+DC00
-U+DFFF
).
Therefore, a char
value can only represent BMP code points, including surrogate code points or units of the UTF-16 encoding. For example, the UTF-16 encoding of characters like the capital letter 'A
', the German "eszett" or "sharp S" (ß
), the Han character 東
(meaning "east"), and the Deseret capital letter long I (𐐀
) demonstrate how different characters are represented in UTF-16.
An int
value represents all Unicode code points, including supplementary code points. The lower (least significant) 21 bits of int
are used to represent Unicode code points, and the upper (most significant) 11 bits must be zero. Similar to UTF-8 (see STR00-J. Don't form strings containing partial characters from variable-width encodings), UTF-16 is a variable-width encoding. Because the UTF-16 representation is also used in char arrays and the String
and StringBuffer
classes, care must be taken when manipulating string data in Java. Conformance with this requirement typically requires using methods that accept a Unicode code point as an int
value and avoiding methods that accept a Unicode code unit as a char
value. These latter methods cannot support supplementary characters.
Security Risks
Handling Unicode characters improperly can introduce several security vulnerabilities in Java applications. Here are the primary risks:
Methods that handle only 16-bit char
values might misinterpret or ignore supplementary characters. For instance, the Character.isLetter('\uD840')
method returns false because it treats values from the surrogate range as undefined characters. This can lead to security vulnerabilities if the application fails to process characters correctly.
Mishandling of Unicode characters might allow attackers to inject unexpected characters or strings into the application, potentially leading to injection attacks. For example, a function that incorrectly processes Unicode could be exploited to bypass security checks or inject malicious code.
If supplementary characters are not handled correctly, the integrity and authenticity of data can be compromised. This is particularly important in applications that require accurate processing and storage of multilingual data
Non-Compliant Code Example
This non-compliant code example attempts to trim leading letters from string
:
// Non-compliant Code Example
static function trim(string : String) : String {
var cui = 0
while (cui < string.length()) {
var cu = string.charAt(cui)
if (not isLetter(cu)) break
cui++
}
return string.substring(cui)
}
Unfortunately, the trim()
method may fail because it uses the character form of the Character.isLetter()
method. Methods that accept only a char value cannot support supplementary characters. According to the Java API [API 2014] class Character
documentation:
They treat char values from the surrogate ranges as undefined characters. For example, Character.isLetter('\uD840')
returns false
, even though this specific value, if followed by any low-surrogate value in a string, would represent a letter.
Compliant Solution
This compliant solution corrects the problem with supplementary characters using the integer form of the Character.isLetter()
method that accepts a Unicode code point as an int
argument. Java library methods that accept an int value support all Unicode characters, including supplementary characters.
// Compliant Solution
static function trim(string : String) : String {
var cpi = 0 // code point index
foreach (cp in string.codePoints().iterator()) {
if (not Character.isLetter(cp)) break
cpi++
}
return string.substring(string.offsetByCodePoints(0, cpi))
}
Risk Assessment
Incorrectly assuming that an object of type char can fully represent a Unicode code point can result in incorrect logic when testing code units from surrogate ranges.
Rule | Severity | Likelihood | Remediation Cost | Priority | Level |
---|---|---|---|---|---|
STR01-G | Medium | Likely | Medium | L8 | L2 |
Additional Resources
- Gosu Secure Coding Guidelines
- CERT Oracle Secure Coding Standard for Java
- Java SE 11 API Specification
- Secure Coding Rules for Java, Part I
- The Java Tutorials
- Unicode Standard Annex #15, Unicode Normalization Forms
- Unicode Technical Report #36, Unicode Security Considerations
- Unravelling Unicode: A Bag of Tricks for Bug Hunting
Was this page helpful?