STR00-G: String Normalization


Introduction

Input strings from untrusted sources must be validated or sanitized before they are used in any way. Normalizing a string before comparing characters or character sequences reduces the search space that validation has to examine. Normalization converts input data to the simplest known (and anticipated) form; the conversion is lossy, so the original representation cannot always be recovered from the result. Normalization is not always necessary, for example, when you are only searching for a specific character such as the Unicode exclamation mark (U+0021). When it is required, normalization and any other modifications must be performed before the data is validated, so that comparisons are accurate and different representations of the same logical string cannot be used to bypass security checks.

A typical pattern is to validate the normalized form of the string data; the original form can then be used for further processing.

[Figure: Validating normalized forms]

Non-Compliant Code Example (Unicode)

Character information in Gosu is based on the Unicode standard. Applications that accept untrusted input should normalize the input before validating it. Normalization is important because, in Unicode, the same string can have many different representations. According to Unicode Standard Annex #15, Unicode Normalization Forms [Davis 2008]:

When implementations keep strings in a normalized form, they can be assured that equivalent strings have a unique binary representation.

The Normalizer.normalize() method transforms Unicode text into the standard normalization forms described in Unicode Standard Annex #15, Unicode Normalization Forms:

| Form | Description |
| --- | --- |
| NFD | Canonical decomposition |
| NFC | Canonical decomposition, followed by canonical composition |
| NFKD | Compatibility decomposition |
| NFKC | Compatibility decomposition, followed by canonical composition |
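To make the difference between the standard normalization forms concrete, the sketch below prints the code points that each form produces for one sample string. It is written in Java because the Gosu examples in this guideline call the same java.text.Normalizer class; the class name NormalizationFormsDemo, the helper codePoints, and the sample string are illustrative assumptions, not part of the guideline.

```java
import java.text.Normalizer;
import java.text.Normalizer.Form;

// Hypothetical demo class; not part of the original guideline.
final class NormalizationFormsDemo {
    // Returns the code points of s as space-separated U+XXXX labels.
    static String codePoints(String s) {
        StringBuilder sb = new StringBuilder();
        s.codePoints().forEach(cp -> {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("U+%04X", cp));
        });
        return sb.toString();
    }

    public static void main(String[] args) {
        // Sample: precomposed e-acute (U+00E9) followed by the "fi" ligature (U+FB01).
        String sample = "\u00E9\uFB01";
        for (Form form : Form.values()) {
            String normalized = Normalizer.normalize(sample, form);
            System.out.println(form + ": " + codePoints(normalized));
        }
    }
}
```

For this sample, the canonical forms (NFD and NFC) leave the ligature U+FB01 untouched, while the compatibility forms (NFKD and NFKC) replace it with the two letters U+0066 and U+0069. That second behavior is what input validation usually relies on.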

This non-compliant code example attempts to validate that a string contains no angle bracket characters ([<>]). It returns the normalized string if no angle brackets are found; otherwise, it throws IllegalStateException.

// Non-compliant Code Example

static function validateThenNormalize(input : String) : String {
  // Validate
  var pattern : Pattern = Pattern.compile("[<>]"); // angle brackets
  var matcher : Matcher = pattern.matcher(input);
  if (matcher.find()) {
    // Found blacklisted tag
    throw new IllegalStateException();
  }
  // Normalize
  return Normalizer.normalize(input, Form.NFKC);
}

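Unfortunately, this function validates the string before normalizing it. An input that uses alternative representations of the angle brackets, such as the small less-than sign (\uFE64) and the small greater-than sign (\uFE65), passes the check and is only afterwards normalized to ordinary angle brackets. Therefore, the system accepts the invalid input.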
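The bypass can be reproduced with a short Java sketch that mirrors the validate-then-normalize order. Java is used because the Gosu code calls the same java.text.Normalizer and java.util.regex.Pattern classes; the class name ValidateThenNormalizeDemo and the crafted input are assumptions made for illustration.

```java
import java.text.Normalizer;
import java.util.regex.Pattern;

// Hypothetical demo class; not part of the original guideline.
final class ValidateThenNormalizeDemo {
    private static final Pattern ANGLE_BRACKETS = Pattern.compile("[<>]");

    // Mirrors the non-compliant order: validate first, normalize second.
    static String validateThenNormalize(String input) {
        if (ANGLE_BRACKETS.matcher(input).find()) {
            throw new IllegalStateException("angle bracket found");
        }
        return Normalizer.normalize(input, Normalizer.Form.NFKC);
    }

    public static void main(String[] args) {
        // U+FE64 and U+FE65 are the small less-than and greater-than signs.
        String crafted = "\uFE64script\uFE65";
        String result = validateThenNormalize(crafted);
        // No exception was thrown, yet the result now holds ASCII angle brackets.
        System.out.println(result.equals("<script>")); // prints true
    }
}
```

The check never sees an ASCII angle bracket, so it passes, and the normalization step then creates the very characters the check was meant to reject.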

Compliant Solution (Unicode)

This compliant solution normalizes the string before validating it. Alternative representations of the angle brackets, such as the small less-than sign (\uFE64) and the small greater-than sign (\uFE65), are normalized to the ASCII angle brackets (U+003C and U+003E). Consequently, input validation correctly detects the malicious input and throws an IllegalStateException.

// Compliant Solution

static function validateInput(input : String) : String {
  // Normalize
  var s : String = Normalizer.normalize(input, Form.NFKC);

  // Validate by checking for angle brackets
  var pattern : Pattern = Pattern.compile("[<>]");
  var matcher : Matcher = pattern.matcher(s);
  if (matcher.find()) {
    // Found blacklisted tag
    throw new IllegalStateException();
  }
  return s;
}

Non-Compliant Code Example (Unicode)

The W3C recommends using Normalization Form C (NFC) for all web content to avoid issues with characters that look the same but are encoded differently. For input validation, Normalization Form KC (NFKC) is often the better choice because it also folds compatibility-equivalent characters, such as small or fullwidth variants, into their standard forms.

However, compatibility normalization should not be applied blindly to arbitrary Unicode text: it erases formatting distinctions that can carry meaning, and it can prevent round-trip conversion to and from legacy character sets.

Here is an example of non-compliant code that normalizes to Form C before validating string data for browser rendering.

// Non-compliant Code Example

// Returns a string if it doesn't contain angle bracket characters
// Otherwise throws IllegalStateException
static function validateInput(input : String) : String {
  // Normalize
  var s : String = Normalizer.normalize(input, Form.NFC)

  // Validate
  var pattern : Pattern = Pattern.compile("[<>]") // Check for angle brackets
  var matcher : Matcher = pattern.matcher(s)
  if (matcher.find()) {
    // Found blacklisted tag
    throw new IllegalStateException();
  }
  return s;
}

Unfortunately, normalizing to Form C does not convert the small less-than sign (\uFE64) or the small greater-than sign (\uFE65).

[Figure: Normalizing]

Consequently, the string s is not rejected and may be rendered by a browser that interprets the string as a <script> tag.
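The difference between the two forms can be checked with a few lines of Java, used here because the Gosu code calls the same java.text.Normalizer class. The class name FormComparisonDemo, the helper containsAngleBracket, and the sample input are illustrative assumptions.

```java
import java.text.Normalizer;
import java.text.Normalizer.Form;
import java.util.regex.Pattern;

// Hypothetical demo class; not part of the original guideline.
final class FormComparisonDemo {
    private static final Pattern ANGLE_BRACKETS = Pattern.compile("[<>]");

    // Normalizes the input with the given form, then looks for ASCII angle brackets.
    static boolean containsAngleBracket(String input, Form form) {
        String normalized = Normalizer.normalize(input, form);
        return ANGLE_BRACKETS.matcher(normalized).find();
    }

    public static void main(String[] args) {
        // U+FE64 and U+FE65 are the small less-than and greater-than signs.
        String crafted = "\uFE64script\uFE65";

        // NFC applies only canonical mappings, so the small signs survive.
        System.out.println(containsAngleBracket(crafted, Form.NFC));  // prints false

        // NFKC also applies compatibility mappings, so they become < and >.
        System.out.println(containsAngleBracket(crafted, Form.NFKC)); // prints true
    }
}
```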

Compliant Solution (Unicode)

This compliant solution normalizes to Form KC before validating the string data. Consequently, input validation correctly rejects the string by throwing IllegalStateException.

// Compliant Solution

static function validateInput(input : String) : String {
  // Normalize
  var s : String = Normalizer.normalize(input, Form.NFKC)

  // Validate by checking for angle brackets
  var pattern : Pattern = Pattern.compile("[<>]")
  var matcher : Matcher = pattern.matcher(s)
  if (matcher.find()) {
    // Found blacklisted tag
    throw new IllegalStateException();
  }
  return s
}

Non-Compliant Code Example (Case)

This non-compliant code example defines the isValidateUserName function, which determines whether a new username is valid. The only requirement is that the username cannot be "admin" because an account with that name may be misconstrued as privileged.

// Non-compliant Code Example

static function isValidateUserName(username: String) : Boolean {
  // == compares string values in Gosu; === would compare object identity
  if (username == "admin") return false
  return true
}

However, this function fails to check for other spellings of this username, such as “Admin” or “ADMIN”, that may be similarly misconstrued, especially on systems that perform case-insensitive comparisons.

Compliant Solution (Case)

This compliant solution converts the argument to lowercase before comparing value equality with the "admin" string.

// Compliant Solution

static function isValidateUserName(username: String) : Boolean {
  var username_lc = username.toLowerCase()
  // == compares string values in Gosu; === would compare object identity
  if (username_lc == "admin") return false
  return true
}

Normalizing the username to lowercase before performing the comparison (validation) reduces the search space needed to determine the value equality of two strings in a case-insensitive manner.
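One detail goes beyond what this guideline states: the no-argument toLowerCase() on a Java String uses the default locale, and under Turkish casing rules the letter I lowercases to a dotless i (U+0131), so "ADMIN" would not become "admin". The Java sketch below shows a locale-independent variant using Locale.ROOT. It is written in Java because Gosu strings are Java strings; the class name UserNameCheckDemo and the method name isValidUserName are hypothetical.

```java
import java.util.Locale;

// Hypothetical demo class; not part of the original guideline.
final class UserNameCheckDemo {
    // Rejects every case variant of "admin" by lowercasing with a fixed locale.
    static boolean isValidUserName(String username) {
        String lower = username.toLowerCase(Locale.ROOT);
        return !lower.equals("admin");
    }

    public static void main(String[] args) {
        System.out.println(isValidUserName("ADMIN")); // prints false
        System.out.println(isValidUserName("alice")); // prints true

        // Under Turkish casing rules, 'I' lowercases to dotless i (U+0131),
        // so a locale-sensitive conversion does not produce "admin".
        Locale turkish = Locale.forLanguageTag("tr");
        System.out.println("ADMIN".toLowerCase(turkish).equals("admin")); // prints false
    }
}
```

Whether this matters depends on the locale settings of the deployment; fixing the locale simply removes that dependency from the check.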

Risk Assessment

Validating input before normalization allows attackers to bypass filters and other security mechanisms. In addition, it can result in the execution of arbitrary code.

| Rule | Severity | Likelihood | Remediation Cost | Priority | Level |
| --- | --- | --- | --- | --- | --- |
| STR00-G | High | Likely | High | L6 | L2 |

Additional Resources