To ensure that your web files and pages are accessible to users on a wide range of devices and operating systems, it’s important to use valid URL characters. Unsafe characters are known to cause compatibility issues with browsers and web servers, and can even trigger false positives in web application firewalls.
In this post I’ll be summarizing OWASP best practices and the RFC 3986 specification to describe what a bad path is, why you should use valid URL characters, and how to properly encode characters to avoid problems.
Contents:
- What is a valid URL?
- What’s the difference between URLs, URIs, and Paths?
- Unsafe characters
- Reserved characters
- Safe characters
What is a valid URL?
A valid URL (Uniform Resource Locator) is a string of characters that specifies the location of a resource on the internet or on a private network (intranet).
Let’s break down the term in a bit more detail:
- Uniform – as in codified and standardized, in this case by RFC 3986 and its related documents.
- Resource – as in anything a server makes available to a network-connected client, such as an HTML page, an image, or any other file.
- Location – as in where the resource can be found. It is common for a URL to be called a URI, where the I stands for Identifier.
A valid URL may include the following components:
- A protocol, such as “http”, “https”, “ftp”, etc. A port number may also be specified, such as 443, the default port for HTTPS.
- A domain name or IP address, such as “myawesomesite.com” or “192.168.0.1”.
- An optional path, such as “/index.php” or “/myftpfolder/”.
- An optional query string, such as “?param=value”
- An optional fragment, such as “#section1”
Example of a valid URL:
https://myawesomesite.com/index.php?param1=value1
Only the protocol and a domain name or IP address are required, but whichever components are present must appear in this order.
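To make the components concrete, here is a minimal sketch that splits a URL into its parts with Python's standard `urllib.parse` module. The URL itself is a made-up example that combines the pieces listed above, including an explicit port and a fragment.

```python
from urllib.parse import urlsplit

# Hypothetical URL that uses every component described above.
url = "https://myawesomesite.com:443/index.php?param1=value1#section1"

parts = urlsplit(url)
print(parts.scheme)    # https
print(parts.hostname)  # myawesomesite.com
print(parts.port)      # 443
print(parts.path)      # /index.php
print(parts.query)     # param1=value1
print(parts.fragment)  # section1
```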
What’s the difference between URLs, URIs, and Paths?
There is quite a bit of confusion between the meaning of URLs, URIs, and Paths. So let’s take a brief moment to examine the differences between them.
URL (Uniform Resource Locator)
A URL (Uniform Resource Locator) is a specific type of URI (Uniform Resource Identifier) that is used to identify the location of a resource on the internet. It specifies the protocol to be used to access the resource, as well as the address of the resource on the internet.
For example, “https://sucuri.net/path/to/exampleresource” is a URL.
URI (Uniform Resource Identifier)
A URI (Uniform Resource Identifier) is a string of characters that identifies a name or a resource on the internet. URIs can be broken down into two types: URLs and URNs (Uniform Resource Names). A URL is a specific type of URI that specifies the location of a resource on the internet, while a URN is a type of URI that identifies the resource by name, rather than by location.
Some examples of a URI might include:
- “mailto:person@example.com” – specifies an email address
- “https://sucuri.net” – specifies the location of a resource on the internet
- “file:///path/to/some/file” – specifies the location of a file on your local computer
Path
A path is a sequence of directories or folders that specifies the location of a file or resource on a computer or network. A path can be either absolute or relative. An absolute path is a complete path that starts from the root directory and specifies the exact location of the file or resource, while a relative path is a partial path that specifies the location of the file or resource relative to the current directory.
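As a rough illustration of the difference in a URL context, the sketch below resolves an absolute path and a relative path against the same base with `urllib.parse.urljoin`. The base URL and file names are hypothetical.

```python
from urllib.parse import urljoin

# Hypothetical page that the links below appear on.
base = "https://example.com/docs/guide/index.html"

# An absolute path starts from the root of the site.
absolute = urljoin(base, "/images/logo.png")

# A relative path is resolved against the current directory (/docs/guide/).
relative = urljoin(base, "../images/logo.png")

print(absolute)  # https://example.com/images/logo.png
print(relative)  # https://example.com/docs/images/logo.png
```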
In summary:
- A URL is a specific type of URI that specifies the location of a resource on the internet
- A URI is a string of characters that identifies a name or a resource on the internet
- A path is a sequence of directories or folders that specifies the location of a file or resource on a computer or network.
For the purpose of this article, we will be using the term “path” and will not be covering the protocol or (sub)domain names, which have their own set of restrictions that you can reference at: https://www.ietf.org/rfc/rfc1034.txt
Unsafe characters in URLs
There are several characters that are not safe to allow in a path (or URI) because they can cause problems with the way the URL is interpreted by web browsers, web servers, and WAFs (web application firewalls). These characters should be encoded or left out entirely to prevent potential issues or security risks.
For example, some characters are used as delimiters to separate different parts of the URL. If your path contains characters that can be used as a delimiter, it may cause it to be interpreted incorrectly as a separator rather than part of the actual path.
In other cases, certain combinations of unsafe characters open up vulnerabilities that can allow a bad actor to launch malicious attacks on a web server. So it’s important to avoid using unsafe characters in paths whenever possible.
Which characters are unsafe?
Some characters, such as space and angle brackets, are not safe to use in a path regardless of whether they are encoded or not. These characters should be avoided when constructing a URL.
" < > # % { } | \ ^ [ ] `
Space: A space character is not safe to use in a path because it is not a valid URL character. Spaces are often encoded as %20 (or as + in query strings) when they appear in a URL, but they can still cause problems with the way the path is interpreted by web browsers, web servers, and WAFs.
Angle brackets: Angle brackets < and > are not safe to use in a path because they are used to enclose HTML tags.
Quotation marks: Quotation marks " and ' are not safe to use in a path because they are used to enclose attribute values in HTML.
Pipe: The pipe character | is not safe to use in a path because it is used as a separator in some systems.
Backslash: The backslash character \ is not safe to use in a path because it is used as an escape character in some systems and signifies a folder in some operating systems.
Curly braces: Curly braces { and } are not safe to use in a path because they are used to enclose blocks of code in some programming languages.
Square brackets: Square brackets [ and ] are not safe to use in a path because RFC 3986 reserves them for enclosing IPv6 addresses in the host portion of a URL.
As a rule of thumb, avoid using any of these unsafe characters in URLs to ensure that they’re correctly interpreted by web browsers and web servers — and to avoid causing false positive blocks in web application firewalls.
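One way to apply this rule of thumb is to scan a path for unsafe characters before publishing it. The following is only a sketch: the `find_unsafe_chars` helper is hypothetical, and the character set is an assumption based on the list above plus the space character, leaving out the tilde, which the safe-character list further down treats as safe.

```python
# Assumed set: the unsafe characters discussed above, plus a space.
UNSAFE_CHARS = set('"<>#%{}|\\^[]` ')


def find_unsafe_chars(path: str) -> list[str]:
    """Return the unsafe characters found in path, in order of first appearance."""
    found = []
    for ch in path:
        if ch in UNSAFE_CHARS and ch not in found:
            found.append(ch)
    return found


print(find_unsafe_chars("/my files/<draft>.html"))  # [' ', '<', '>']
print(find_unsafe_chars("/my-files/draft.html"))    # []
```

Because % is in the set, this check is meant for raw paths that have not been percent-encoded yet; an already encoded path would be flagged for its % signs.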
Reserved characters in URLs
Reserved characters are characters that have special meanings and are reserved for specific uses. A reserved character may be used to specify the structure of a URL or separate different parts of the URL.
RFC 3986 defines the following reserved characters:
: / ? # [ ] @ ! $ & ' ( ) * + , ; =
These reserved characters should be encoded whenever they are used in a URL or path for any purpose other than their intended use.
To properly encode a reserved character, replace it with a percent sign (%) followed by its two-digit ASCII code in hexadecimal notation. For example: /Hello_World!/ should become /Hello_World%21/ to ensure the path is interpreted correctly.
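The rule can be sketched in a few lines of Python. The `percent_encode_char` helper is a hypothetical name used for illustration, and it assumes UTF-8 for anything outside ASCII; in practice the standard library's `urllib.parse.quote` does the same job for whole strings.

```python
from urllib.parse import quote, unquote


def percent_encode_char(ch: str) -> str:
    """Encode one character as %XX for each byte of its UTF-8 form."""
    return "".join(f"%{byte:02X}" for byte in ch.encode("utf-8"))


print(percent_encode_char("!"))            # %21

# quote() encodes everything except letters, digits, "_.-~" and the
# characters passed in safe; here the slashes keep their meaning.
print(quote("/Hello_World!/", safe="/"))   # /Hello_World%21/
print(unquote("/Hello_World%21/"))         # /Hello_World!/
```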
Here is a comprehensive list of the reserved characters defined by RFC 3986 and their corresponding percent-encoded values:
Character | Encoded Value |
---|---|
! | %21 |
# | %23 |
$ | %24 |
& | %26 |
' | %27 |
( | %28 |
) | %29 |
* | %2A |
+ | %2B |
, | %2C |
/ | %2F |
: | %3A |
; | %3B |
= | %3D |
? | %3F |
@ | %40 |
[ | %5B |
] | %5D |
Reserved characters can always be used unencoded for the purpose which is intended. For example, / for a folder in a path or ? as the start of a query string.
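The flip side is that a reserved character used as plain data needs encoding. Here is a small sketch, with made-up folder and parameter values, that encodes a slash inside a single path segment and the &, = and ? characters inside a query value, while the structural / and ? stay unencoded.

```python
from urllib.parse import quote, urlencode

segment = "reports/2023"        # hypothetical name that contains a slash
value = "cats&dogs=friends?"    # hypothetical query value with reserved characters

# safe="" tells quote() to encode "/" too, since here it is data, not a separator.
encoded_segment = quote(segment, safe="")
query = urlencode({"q": value})

url_path = f"/files/{encoded_segment}?{query}"
print(url_path)  # /files/reports%2F2023?q=cats%26dogs%3Dfriends%3F
```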
Safe characters in URLs
Here is a list of characters that can safely be used as part of a path, as they are neither unsafe nor reserved and will never cause an issue. There is no need to encode these characters.
- Alphanumeric characters: A-Z, a-z, and 0-9
- Hyphen: -
- Period: .
- Underscore: _
- Tilde: ~
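A quick way to check that a file or folder name sticks to this list is a regular expression. This is a sketch only, and `is_safe_segment` is a hypothetical helper name.

```python
import re

# Letters, digits, period, underscore, tilde and hyphen only.
SAFE_SEGMENT = re.compile(r"[A-Za-z0-9._~-]+")


def is_safe_segment(segment: str) -> bool:
    """Return True if segment is non-empty and uses only safe characters."""
    return SAFE_SEGMENT.fullmatch(segment) is not None


print(is_safe_segment("my-file_v2.0~draft"))  # True
print(is_safe_segment("my file"))             # False
print(is_safe_segment("Hello_World!"))        # False
```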
Closing thoughts
If you absolutely must use unsafe characters, or unencoded reserved characters for a purpose other than their intended one, then you’ll want to allowlist the user’s IP address in any firewall or WAF to avoid false positive blocks.
For example, use this method for the Sucuri WAF: https://docs.sucuri.net/website-firewall/whitelist-and-blacklist/whitelist-an-ip-address/
But I strongly discourage this practice, as it can lead to security issues and negates the purpose of using a web application firewall in the first place.
Developers continually ignore the Internet standards track protocol set out by RFC 3986, which causes a plethora of issues, but for the record, this is where we are today.
Ironically, Google is among the biggest offenders, with the ugly tracking data they add to paths for their services. For example:
/?_gl=1*8jhbdshjds*_ga*dhcvb8kfadjbh7vshjj7.*_ga_HC78665MNW*HHHGCCPbgfx98mnxj..&_ga=2.3650025.111989854936.1899898-76365492667450.1877495563
There are multiple support and forum articles confirming that they are aware of the issue and have related issues with their servers and services:
- https://support.google.com/merchants/answer/160038?hl=en
- https://developers.google.com/search/docs/crawling-indexing/url-structure
However, you can sidestep a number of security and compatibility issues by avoiding unsafe characters and encoding reserved characters whenever they are used in URLs or paths outside of their intended purpose.