Percent-encoding: how URLs handle the characters they weren't designed for

Every URL you've ever copied and pasted has probably had a %20 or %2F in it at some point. That % followed by two hex digits is percent-encoding (sometimes called URL encoding), and it's the mechanism the web uses to squeeze arbitrary data through a channel that was originally designed for a very limited set of ASCII characters.

The basic idea is dead simple. Take a byte, write it as % plus two hexadecimal digits. A space (byte value 0x20) becomes %20. A hash sign (0x23) becomes %23. Done.

But as with most things on the web, the devil is in three decades of competing specifications, edge cases, and a form encoding that decided spaces should be plus signs instead.

The %XX mechanism

Percent-encoding represents a single octet as a triplet: the percent character % followed by two hexadecimal digits that correspond to the numeric value of that byte [1]. So the letter A (which you'd never actually need to encode) would be %41 if you did. The space character, byte value 32 in decimal or 0x20 in hex, becomes %20.

The hex digits can be uppercase or lowercase -- %2f and %2F both represent the forward slash. RFC 3986 says producers should use uppercase for consistency, but consumers must accept both [1].
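The mechanism is small enough to sketch in a few lines of JavaScript. (The helper name percentEncodeByte is mine, not from any spec.)

```javascript
// Encode a single byte (0-255) as a %XX triplet.
// RFC 3986 recommends uppercase hex digits for producers.
function percentEncodeByte(byte) {
  return "%" + byte.toString(16).toUpperCase().padStart(2, "0");
}

percentEncodeByte(0x20); // space -> "%20"
percentEncodeByte(0x23); // hash  -> "%23"
percentEncodeByte(0x2f); // slash -> "%2F"
```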

Here's what the decision process looks like for any character going through percent-encoding:

[Figure: flowchart of the percent-encoding decision process for a given character]

One thing that trips people up: percent-encoding operates on bytes, not characters. For ASCII characters, there's a 1:1 mapping between character and byte, so @ (0x40) always becomes %40. But for anything outside ASCII -- say, a Chinese character or an accented letter -- you first need to convert the character to a sequence of bytes using some character encoding (almost always UTF-8 these days), and then percent-encode each resulting byte individually.

The letter é (U+00E9), encoded in UTF-8, produces two bytes: 0xC3 and 0xA9. So it becomes %C3%A9. A character like 中 (U+4E2D) turns into three UTF-8 bytes and becomes %E4%B8%AD.
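You can watch this byte-level behavior with encodeURIComponent, which always encodes via UTF-8:

```javascript
// Non-ASCII characters are converted to UTF-8 bytes first,
// then each byte is percent-encoded individually.
encodeURIComponent("é");  // two UTF-8 bytes   -> "%C3%A9"
encodeURIComponent("中"); // three UTF-8 bytes -> "%E4%B8%AD"
encodeURIComponent("@");  // ASCII, one byte   -> "%40"
```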

RFC 3986 and the character class system

The current authoritative standard for URI syntax is RFC 3986, published in January 2005 by Tim Berners-Lee, Roy Fielding, and Larry Masinter [1]. It divides characters into clean categories:

Unreserved characters can appear anywhere in a URI without encoding:

A-Z a-z 0-9 - . _ ~

That's it. Letters, digits, hyphen, period, underscore, and tilde. These should never be percent-encoded, and if they are, a conforming implementation must treat %41 the same as A [1].
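The equivalence rule is easy to demonstrate with the standard decoder:

```javascript
// Percent-encoded unreserved characters decode back to themselves;
// a normalizer treats %41 and A as the same character.
decodeURIComponent("%41"); // -> "A"
decodeURIComponent("%7E"); // -> "~"
```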

Reserved characters have syntactic meaning within URI components:

  • General delimiters (gen-delims): : / ? # [ ] @
  • Sub-delimiters: ! $ & ' ( ) * + , ; =

Whether a reserved character needs to be encoded depends on where it appears. A / in the path component is a path separator and must not be encoded. But if the literal character / is part of a query parameter value, it should be encoded as %2F to avoid ambiguity. This context-dependence is one of the trickiest parts of the whole system -- the same character can be perfectly legal in one URI component and must be encoded in another.
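The context-dependence shows up directly in JavaScript's built-ins: the same slash survives in a path but must be escaped inside a query value. (The example URL is made up for illustration.)

```javascript
// In the path, "/" is a delimiter and stays as-is.
encodeURI("https://example.com/a/b"); // unchanged

// Inside a query parameter value, "/" is data and gets encoded.
const value = "a/b";
const url = "https://example.com/search?path=" + encodeURIComponent(value);
// -> "https://example.com/search?path=a%2Fb"
```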

Everything else -- control characters, spaces, non-ASCII bytes, and characters like { } | \ ^ and the backtick -- must always be percent-encoded.

Three decades of evolving standards

The percent-encoding mechanism didn't spring from nowhere. It evolved across several RFC revisions, each one refining the character classifications and fixing ambiguities left by its predecessor.

[Figure: timeline of URL/URI encoding standards, 1994 to the present]

RFC 1738 (1994) -- the original URL spec

The first formal URL specification, authored by Berners-Lee, Masinter, and McCahill, introduced the %HH escape mechanism [2]. It categorized characters as "safe," "unsafe," and "reserved" -- but the definitions were vague compared to what came later. Characters like ~ and | were explicitly listed as "unsafe" and had to be encoded. The tilde in particular caused years of headaches with Unix home directory URLs (http://example.com/~user/ technically needed encoding under RFC 1738).

RFC 2396 (1998) -- URI generic syntax

Berners-Lee, Fielding, and Masinter rewrote the specification, this time for URIs rather than just URLs [3]. The character sets got reshuffled: ~ moved to the unreserved set (finally); +, $, and , were added to the reserved set; and the terminology became more precise. The spec also introduced the concept that escaping should happen during URI construction, not transmission -- "a URI is always in an escaped form."

RFC 3986 (2005) -- the current standard

The big cleanup. Reserved characters were split into gen-delims and sub-delims. The unreserved set was trimmed down to just A-Z a-z 0-9 - . _ ~ (notably, !, *, ', (, and ) moved from unreserved to sub-delims). The spec was explicit about normalization: percent-encoded representations of unreserved characters should be decoded, and uppercase hex digits should be preferred [1].

One subtle but important change: RFC 3986 acknowledged that different URI components have different encoding requirements. A character that's a delimiter in one component may be perfectly safe data in another.

UTF-8 and internationalized URIs (RFC 3987)

RFC 3986 only deals with ASCII bytes. But the web is multilingual. How do you put https://example.com/café/menu in a URL?

RFC 3987, published in January 2005 by Martin Dürst and Michel Suignard, defined Internationalized Resource Identifiers (IRIs) -- URIs that can contain Unicode characters [4]. The conversion from IRI to URI is straightforward: take each non-ASCII character, encode it as UTF-8, and percent-encode the resulting bytes.

So café becomes caf%C3%A9, and https://example.com/日本語/ becomes a long string of percent-encoded UTF-8 bytes.

The spec is very firm on one point: UTF-8 is the only acceptable character encoding for this conversion [4]. No Latin-1, no Shift_JIS, no "whatever the server happens to use." This was a deliberate choice to avoid the encoding detection nightmares that plagued HTML for years. In practice, browsers had already converged on UTF-8 for URL encoding by the time RFC 3987 was published, so the spec formalized existing behavior.
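The IRI-to-URI conversion is exactly what encodeURI performs on the non-ASCII characters of a full URL:

```javascript
// Non-ASCII characters in an IRI are UTF-8 encoded, then percent-encoded;
// the ASCII structure of the URL is left untouched.
encodeURI("https://example.com/café/menu");
// -> "https://example.com/caf%C3%A9/menu"
```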

IRIs are what you see in the browser's address bar -- the nice, human-readable version. The actual HTTP request on the wire uses the percent-encoded URI form.

The form encoding oddity: application/x-www-form-urlencoded

And then there's form encoding, which does its own thing entirely.

When an HTML form submits data with method="GET" (or method="POST" with the default enctype), the browser encodes the form fields using application/x-www-form-urlencoded [5]. This format looks superficially like percent-encoding, but has one well-known deviation: spaces become + signs instead of %20.

name=John+Doe&city=New+York

This convention dates back to the very early web. The HTML 2.0 spec from 1995 defined form data encoding with + for spaces, and it stuck [6]. RFC 3986 has no concept of + meaning space -- if you see a + in a URI outside the query string, it's a literal plus sign. But in form-encoded query strings, + and %20 are both valid representations of a space.

The application/x-www-form-urlencoded format also has a much more aggressive encoding policy than RFC 3986. Per the WHATWG URL standard, it encodes everything except ASCII alphanumerics and the characters * - . _ [5]. That means characters like ! and ~, which are unreserved in RFC 3986, get percent-encoded in form data.
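The difference is visible by running the same string through both encoders:

```javascript
// RFC 3986-style: "!" and "~" are unreserved and left alone; space is %20.
encodeURIComponent("hi there!~"); // -> "hi%20there!~"

// Form encoding: space becomes "+", and "!" / "~" get percent-encoded.
new URLSearchParams({ q: "hi there!~" }).toString(); // -> "q=hi+there%21%7E"
```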

I find this split genuinely annoying in practice. You can't just write one URL-encoding function and use it everywhere -- you need to know whether you're encoding a path segment, a query parameter, or form data, because the rules are different for each.

The WHATWG URL standard and percent-encode sets

Browsers don't follow RFC 3986 exactly. The WHATWG URL Living Standard defines its own system of percent-encode sets, most of them built by extending a smaller one [7]:

  • C0 control percent-encode set -- C0 controls (0x00-0x1F) and everything above 0x7E
  • Fragment percent-encode set -- the C0 set plus space, ", <, >, and backtick
  • Query percent-encode set -- the C0 set plus space, ", #, <, and > (no backtick, so it isn't a superset of the fragment set)
  • Path percent-encode set -- the query set plus ?, backtick, {, }
  • Userinfo percent-encode set -- the path set plus /, :, ;, =, @, and more

This is how browsers actually decide what to encode in each part of the URL. If you type https://example.com/path with spaces/page?q=hello world#section into Chrome, it'll encode the spaces in the path as %20 but might leave certain characters alone in the fragment that it would encode in the path.
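Node.js implements the WHATWG URL standard, so the per-component sets can be observed directly. The backtick is a handy probe, because the query and fragment sets treat it differently:

```javascript
const u = new URL("https://example.com/a b?q=c`d#e`f");

u.pathname; // -> "/a%20b"  (space is in the path set)
u.search;   // -> "?q=c`d"  (backtick is NOT in the query set)
u.hash;     // -> "#e%60f"  (backtick IS in the fragment set)
```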

encodeURI vs encodeURIComponent in JavaScript

This is probably the most common practical question developers run into: which JavaScript function should I use?

[Figure: comparison of which characters encodeURI and encodeURIComponent encode]

encodeURI() is designed for encoding a complete URI. It leaves URI-structural characters alone -- : / ? # @ ! $ & ' ( ) * + , ; = -- and only encodes things that can't appear anywhere in a valid URI [8]. (One notorious gap: it does percent-encode [ and ], which breaks IPv6 host literals.)

encodeURIComponent() is designed for encoding a single URI component (like a query parameter value). It encodes the structural characters too, because within a component value, a / or & is data, not syntax [9].

const url = "https://example.com/search?q=coffee & tea";

encodeURI(url);
// "https://example.com/search?q=coffee%20&%20tea"
// Problem: the & is preserved, splitting the query incorrectly

// Correct approach: encode the parameter value separately
"https://example.com/search?q=" + encodeURIComponent("coffee & tea");
// "https://example.com/search?q=coffee%20%26%20tea"

The rule of thumb: use encodeURIComponent() for individual values, encodeURI() for complete URIs where the structure is already correct. In practice, I reach for encodeURIComponent() about 95% of the time.

Neither function produces application/x-www-form-urlencoded output -- they both use %20 for spaces, never +. If you need the form encoding behavior, the URLSearchParams API handles it:

const params = new URLSearchParams({ q: "coffee & tea" });
params.toString(); // "q=coffee+%26+tea"

Note how URLSearchParams uses + for the space but still percent-encodes the & -- that's the form encoding format at work.

Other languages, other quirks

Every major programming language has URL encoding functions, and they don't all agree on the details.

Python splits it into urllib.parse.quote() (RFC 3986 style, spaces become %20, forward slashes are safe by default) and urllib.parse.quote_plus() (form encoding style, spaces become +, slashes get encoded) [10]. The default safe characters for quote() include /, which makes it suitable for encoding paths but not individual query parameters.

PHP has rawurlencode() (RFC 3986 style) and urlencode() (form encoding, + for spaces). The naming isn't great -- you'd expect the "raw" one to be the simpler one, but it's actually the RFC-conformant one.

Java has URLEncoder.encode(), which does form encoding (+ for spaces), and nothing built in for pure RFC 3986 encoding -- you write your own or pull in a library. java.net.URI handles some of it, but the API is... not beloved.

Go has url.QueryEscape() (form encoding) and url.PathEscape() (path encoding). Clean separation, sensible naming. I wish other standard libraries were this clear about it.

Common pitfalls

Double encoding

If you encode a URL that already contains percent-encoded sequences, the % signs themselves get encoded: %20 becomes %2520. The result is a URL where the literal text %20 appears in the decoded value, which is wrong. RFC 3986 explicitly warns against this: "implementations must not percent-encode or decode the same string more than once" [1].

This happens most often when one layer of your stack encodes a URL and then another layer encodes it again, not knowing the first layer already did it. ORMs, HTTP client libraries, and web frameworks are frequent offenders.

Original:            hello world
After first encode:  hello%20world
After double encode: hello%2520world   (broken!)
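The bug reproduces in two lines of JavaScript:

```javascript
const once  = encodeURIComponent("hello world"); // "hello%20world"
const twice = encodeURIComponent(once);          // "hello%2520world" (broken!)

// Decoding once now yields the literal text "%20", not a space.
decodeURIComponent(twice); // -> "hello%20world"
```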

Path vs query vs fragment encoding

A / in a URL path is a delimiter. A / in a query parameter value is data. If you use encodeURI() on a query value that contains a slash, the slash won't be encoded, and your query string might break. Use encodeURIComponent() for values.

The plus sign ambiguity

In a query string produced by a form submission, + means space. In a URI path, + is a literal +. If you decode a path segment using a form-decoding function, plus signs will incorrectly become spaces. I've seen this bug in production systems more times than I'd like to admit.
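The two decoding conventions sit side by side in JavaScript's standard library:

```javascript
// decodeURIComponent follows RFC 3986: "+" is a literal plus sign.
decodeURIComponent("a+b"); // -> "a+b"

// URLSearchParams follows form decoding: "+" means space.
new URLSearchParams("q=a+b").get("q"); // -> "a b"
```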

Encoding non-UTF-8 data

RFC 3987 mandates UTF-8 for IRI-to-URI conversion, and modern browsers use UTF-8 for all URL encoding. But older systems might use Latin-1 or Windows-1252. If your server decodes percent-encoded bytes as UTF-8 but the client encoded them as Latin-1, you'll get garbled text. The character é is 0xE9 in Latin-1 (one byte, %E9) but 0xC3 0xA9 in UTF-8 (two bytes, %C3%A9). Getting this wrong silently corrupts data.
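Node's Buffer makes the byte-level mismatch concrete:

```javascript
// The same character, two different byte representations.
Buffer.from("é", "latin1"); // <Buffer e9>    -> would encode as %E9
Buffer.from("é", "utf8");   // <Buffer c3 a9> -> encodes as %C3%A9

// Decoding the UTF-8 form works; the lone Latin-1 byte is not
// a valid UTF-8 sequence, so a UTF-8 decoder rejects it.
decodeURIComponent("%C3%A9");   // -> "é"
// decodeURIComponent("%E9");   // throws URIError: URI malformed
```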

Real-world examples

A Google search URL for "what is 100%?" encodes the space, the percent sign, and the question mark in the query parameter:

https://www.google.com/search?q=what+is+100%25%3F

The %25 is the percent sign itself (byte 0x25), and %3F is the question mark. The + is a space, because Google uses form encoding in its query strings.

An API call with JSON in a query parameter gets ugly fast:

https://api.example.com/data?filter=%7B%22status%22%3A%22active%22%7D

That's {"status":"active"} percent-encoded. The curly braces (%7B, %7D), quotes (%22), and colon (%3A) all need encoding because they're not unreserved characters.
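Round-tripping the example confirms the decoding:

```javascript
const raw = "%7B%22status%22%3A%22active%22%7D";
const json = decodeURIComponent(raw); // -> '{"status":"active"}'
JSON.parse(json).status;              // -> "active"
```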

A file path with spaces on a local file URI:

file:///Users/name/My%20Documents/report%20(final).pdf

The parentheses could technically be left unencoded here (they're sub-delimiters), but most tools encode them anyway, which is harmless -- RFC 3986 says unnecessary encoding must be accepted by consumers.

Citations

  1. RFC 3986: Uniform Resource Identifier (URI): Generic Syntax. Retrieved March 1, 2026

  2. RFC 1738: Uniform Resource Locators (URL). Retrieved March 1, 2026

  3. RFC 2396: Uniform Resource Identifiers (URI): Generic Syntax. Retrieved March 1, 2026

  4. RFC 3987: Internationalized Resource Identifiers (IRIs). Retrieved March 1, 2026

  5. WHATWG: URL Living Standard -- application/x-www-form-urlencoded. Retrieved March 1, 2026

  6. RFC 1866: Hypertext Markup Language - 2.0. Section 8.2.1, Form submission. Retrieved March 1, 2026

  7. WHATWG: URL Living Standard. Retrieved March 1, 2026

  8. MDN: encodeURI(). Retrieved March 1, 2026

  9. MDN: encodeURIComponent(). Retrieved March 1, 2026

  10. Python Software Foundation: urllib.parse -- Parse URLs into components. Retrieved March 1, 2026

Updated: March 1, 2026