INJ05-G: Use Allow Lists For Validating URLs


Introduction

Improper URL validation can lead to serious security vulnerabilities, such as server-side request forgery (SSRF), URL redirection, and host header injection. An attacker can exploit these vulnerabilities to gain unauthorized access to resources, steal sensitive data, or disrupt services. By injecting malicious data into URLs, attackers can manipulate the behaviour of web applications, potentially redirecting them to unintended locations or causing them to perform unauthorized actions.

URI (Uniform Resource Identifier) and URL (Uniform Resource Locator) are often used interchangeably, but they are not the same. A URI identifies a specific resource, such as a webpage, book, or document. A URL is a specific type of URI that identifies the resource and provides the means to access it, such as through HTTP, HTTPS, FTP, etc. For example, https://www.guidewire.com is a URL.

URI, URN, URL diagram

If the protocol (HTTPS, FTP, etc.) is either present or implied for a domain, it should be referred to as a URL, even though it is also a URI. All URLs are URIs, but not all URIs are URLs. [RFC 3986](https://tools.ietf.org/html/rfc3986), which replaces RFC 2396, defines the generic syntax for URIs as consisting of several components, as shown below:

URI, URN, URL diagram

Security Risks in URL Handling

When user input is involved in constructing URLs, there are several potential vulnerabilities that attackers can exploit:

1. URL Injection:

If the entire URL is provided by the user, an attacker can inject malicious schemes. More commonly, a trusted prefix is concatenated with user-supplied data:

"https://www.example.com" + userSuppliedRestOfTheUrl

This allows an attacker to manipulate the URL, for example by injecting an @ symbol so that the intended host is consumed as credentials and the server is redirected to another location:

"https://www.example.com@www.attacker.com"

This can trick the server into sending sensitive information to the attacker's domain.

2. Credential Injection:

Attackers can add an @ symbol to consume the intended location as a credential, add a domain suffix to a bare hostname, or even append a port using a colon (host:port).
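Because Gosu runs on the JVM, both effects are easy to demonstrate with the JDK's own java.net.URI parser. In this sketch (the hostnames are placeholders), the intended host is consumed as the userinfo component, so the request would go to the attacker's host; appending a colon injects a port the same way:

```java
import java.net.URI;

public class CredentialInjectionDemo {
    public static void main(String[] args) {
        // Intended base URL concatenated with attacker-controlled input
        String userSupplied = "@www.attacker.com";
        URI uri = URI.create("https://www.example.com" + userSupplied);

        // The original host is now parsed as the userinfo component,
        // and the actual host is the attacker's domain.
        System.out.println(uri.getUserInfo()); // www.example.com
        System.out.println(uri.getHost());     // www.attacker.com

        // Appending ":port" to a bare hostname injects a port.
        URI withPort = URI.create("https://www.example.com" + ":8443");
        System.out.println(withPort.getPort()); // 8443
    }
}
```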

3. Path Injection:

Injection after the path-starting / is problematic. For instance, a ../ sequence injected into the path may navigate up the directory structure, which might not be pre-resolved by some HTTP client libraries. Proper handling involves URI-encoding sequences so that / becomes %2F.
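The encoding approach can be sketched with the JDK's URLEncoder (the path and segment values here are placeholders). Percent-encoding an attacker-controlled segment turns the / in ../ into %2F, so the value stays inside a single path segment instead of navigating the directory structure:

```java
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;

public class PathEncodingDemo {
    public static void main(String[] args) {
        // Attacker-controlled path segment attempting directory traversal
        String segment = "../secret";

        // Percent-encoding neutralises the path separator: "/" becomes "%2F".
        // Note: URLEncoder applies form encoding, so spaces become "+".
        String encoded = URLEncoder.encode(segment, StandardCharsets.UTF_8);
        System.out.println(encoded); // ..%2Fsecret

        String url = "https://www.example.com/files/" + encoded;
        System.out.println(url);
    }
}
```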

4. Query and Fragment Injection:

Query parameters and fragments can be similarly exploited. Proper parsers take the first ? and # as the start of the query parameters and fragment, respectively. Improper handling can lead to truncation and injection attacks, especially since HTTP clients do not send fragment components, making such truncations silent.
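The silent truncation can be seen with java.net.URI (a sketch with placeholder values): everything after the first # is parsed as the fragment, which an HTTP client never transmits to the server.

```java
import java.net.URI;

public class FragmentTruncationDemo {
    public static void main(String[] args) {
        // User input meant to be a record id, with an injected fragment
        String userSupplied = "123#admin=true";
        URI uri = URI.create("https://www.example.com/report?id=" + userSupplied);

        // The first '#' starts the fragment; everything after it is
        // never sent to the server, so the query is silently truncated.
        System.out.println(uri.getQuery());    // id=123
        System.out.println(uri.getFragment()); // admin=true
    }
}
```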

An absolute URI specifies a scheme; a relative URI does not. A hierarchical URI is either an absolute URI or a relative URI (that does not specify a scheme). A hierarchical URI is subject to further parsing according to the syntax:

[scheme:][//authority][path][?query][#fragment]

The scheme-specific part of a hierarchical URI consists of the characters between the scheme and fragment components. RFC 7595 defines guidelines and registration procedures for URI schemes. The Internet Assigned Numbers Authority (IANA) maintains the registry of permanent, provisional, and historical schemes. Common schemes supported by Java include:

| Scheme | Description |
| --- | --- |
| FILE | Host-specific file names |
| FTP | File Transfer Protocol (FTP) |
| GOPHER | Gopher Protocol |
| HTTP | HyperText Transfer Protocol (HTTP) |
| HTTPS | Secure HyperText Transfer Protocol (HTTPS) |
| NEWS | USENET news |
| NNTP | USENET news using Network News Transfer Protocol (NNTP) |
| WAIS | Wide Area Information Server (WAIS) protocol |

A server-based authority parses according to the familiar syntax:

[user-info@]host[:port]

Nearly all URI schemes currently in use are server-based.
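The component breakdown above can be illustrated with java.net.URI, which is available from Gosu since it runs on the JVM (a sketch; all values are placeholders):

```java
import java.net.URI;

public class UriComponentsDemo {
    public static void main(String[] args) {
        // [scheme:][//authority][path][?query][#fragment],
        // where a server-based authority is [user-info@]host[:port]
        URI uri = URI.create("https://user:pw@www.example.com:8443/a/b?q=1#top");

        System.out.println(uri.getScheme());   // https
        System.out.println(uri.getUserInfo()); // user:pw
        System.out.println(uri.getHost());     // www.example.com
        System.out.println(uri.getPort());     // 8443
        System.out.println(uri.getPath());     // /a/b
        System.out.println(uri.getQuery());    // q=1
        System.out.println(uri.getFragment()); // top
    }
}
```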

In addition to these URL schemes, modern browsers support many pseudo-schemes such as javascript, data, and view-source. These schemes enable advanced features such as encapsulating encoded documents within URLs, providing legacy scripting features, or providing access to internal browser information and data views.

Two different URIs may identify the same resource in the view of the user or publisher of that resource. Comparing two URIs can confidently establish that they are equivalent and identify the same resource. However, it is impossible to ensure that they identify different resources. This is why "allow" lists work for validating URIs, but "deny" lists do not.

Unvalidated or improperly validated URLs may be used in a server-side request forgery attack to read or modify resources the attacker cannot otherwise access. The attacker supplies or changes a URL to which code running on the server fetches or submits data. Validation can also prevent URL redirection and host header injection exploitation.

URIs/URLs are serialized data structures and must be parsed appropriately. Existing parsers have significantly different behaviour. You may need to verify the host for some parsers, while for others, you might need to verify the path.
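As an illustration of verifying both scheme and host against an allow list with a stdlib parser, here is a Java sketch (the allowed host set and the https-only rule are assumptions for the example, not part of any API):

```java
import java.net.URI;
import java.net.URISyntaxException;
import java.util.Set;

public class AllowListUrlValidator {
    // Hypothetical allow list; replace with the hosts your application trusts.
    private static final Set<String> ALLOWED_HOSTS = Set.of("www.example.com");

    public static boolean isAllowed(String url) {
        URI uri;
        try {
            uri = new URI(url).normalize(); // collapse "." and ".." segments
        } catch (URISyntaxException e) {
            return false; // unparseable input is rejected, not guessed at
        }
        // Compare scheme and host case-insensitively against the allow list,
        // and reject any URL carrying a userinfo ("@") component.
        return "https".equalsIgnoreCase(uri.getScheme())
                && uri.getHost() != null
                && ALLOWED_HOSTS.contains(uri.getHost().toLowerCase())
                && uri.getUserInfo() == null;
    }

    public static void main(String[] args) {
        System.out.println(isAllowed("https://www.example.com/page"));       // true
        System.out.println(isAllowed("https://www.example.com@evil.test/")); // false
        System.out.println(isAllowed("ftp://www.example.com/"));             // false
    }
}
```

Note that this checks only the scheme and host; depending on the parser your HTTP client uses, you may also need to validate the path and query, as the text above explains.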

Non-compliant Code Example

This non-compliant code example attempts to prevent the use of the file, ftp, and gopher schemes:

// Noncompliant Code Example

public function urlValidatorDenyList(url : String) : boolean {
  return not (url.startsWith("file://")
      or url.startsWith("ftp://")
      or url.startsWith("gopher://"))
}

This function takes a deny list approach by testing if the URL specifies any of the three disallowed schemes. This function may appear to work but suffers from many problems and fails when:

  • the scheme uses capital letters, for example, ftp://www.example.com/
  • the URL contains leading white space characters
  • the URL uses an unexpected scheme, such as "wais", that should have been included in the deny list

In general, it is not possible to use a deny list when you cannot reliably enumerate all the schemes to deny.

Non-compliant Code Example

This non-compliant code example attempts to use an allow list approach to ensure that the URL uses only the https scheme:

// Noncompliant Code Example

public function urlValidator(url : String) : boolean {
  return url.startsWith("https://")
}

This function solves the problem of not being able to enumerate all denied schemes by specifying only allowed schemes. However, it suffers from all the other issues of the previous non-compliant code example. It is also an excellent example of rolling your own solution instead of reusing existing, robust APIs.

Compliant Solution (Apache Commons Validator)

This compliant solution validates the URL using the UrlValidator class from the Apache Commons Validator package.

// Compliant Solution

uses org.apache.commons.validator.routines.UrlValidator

public function urlValidator(url : String) : boolean {
  final var allowedSchemes : String[] = { "https" }
  var validator = new UrlValidator(allowedSchemes)
  return validator.isValid(url)
}

The Apache Commons Validator package contains several standard validation routines. The UrlValidator class validates the URL by checking the scheme, authority, path, query, and fragment. This example uses an allow list approach to validate that the https scheme is used.

The Apache Commons Validator package is not without problems. The implementation is quite old and based on the outdated RFC 2396 specification.

Risk Assessment

Failure to properly validate URLs using allow listing techniques can enable server-side request forgery and related attacks. URL validation by itself is not a complete solution for these attacks.

| Rule | Severity | Likelihood | Remediation Cost | Priority | Level |
| --- | --- | --- | --- | --- | --- |
| INJ05-G | High | Likely | Medium | P12 | L1 |

Additional resources