Punycode explained
Language revolution in the address bar
A Spanish website elpaís.es using IDN
From its inception, the internet was designed to be a global network; however, there was one notable limitation: domain names could only use Latin characters. This alphabet barrier wasn't intentional; it was a consequence of technical design decisions made at a time when the global standards for digital representations of characters were still in development. The basic building block of the entire system, the Domain Name System (DNS), was conceived in 1983 when Paul Mockapetris published the original specifications in RFC 882 and RFC 883 [1]. DNS worked exclusively with a character subset of ASCII, allowing only Latin letters (a-z), digits (0-9), and the hyphen (-) -- the so-called LDH rule (Letters, Digits, Hyphens) [2][3].
That technical choice, which initially provided functionality and universality in the early phase of development, became a significant obstacle over time. Though the content of web pages and emails could be in any language, the domain name (part of the URL) still had to be written in the Latin alphabet. Such an "alphabet barrier" contributed to the so-called "digital divide," particularly in countries where languages that don't use Latin script are dominant and English isn't widely spoken. For such a user, it was often easier to memorize a chain of numbers (an IP address) than a series of unknown glyphs [4].
Ironically, what was meant to be easy to remember for some became a cultural barrier for the rest of the world. The Unicode project -- the essential international standard for non-Latin characters -- was conceptualized in 1987 by Joe Becker at Xerox and Lee Collins and Mark Davis at Apple [5]. Becker published the initial design document "Unicode 88" in August 1988 [6], and the Unicode Consortium was formally incorporated in January 1991 [7]. The first volume of Unicode 1.0 was published in October 1991 [8]. That gap of eight years between DNS (1983) and Unicode 1.0 (1991) illustrates how design choices from an early phase of development can have unintended, global impacts. The development of Punycode began as an effort to bridge exactly this divide.
A smart solution
Efforts around "internationalization of domain names" (IDN) started to appear in the mid-1990s. After years of debates and a lot of competing proposals, the IETF produced what became the first standard framework -- IDNA 2003 (Internationalizing Domain Names in Applications), published as RFC 3490 [9]. As part of the same effort, in March 2003 the IETF approved RFC 3492, which described the Punycode algorithm [10]. Its author, Adam Costello from UC Berkeley, designed it as a neat and efficient solution able to losslessly and reversibly transform any Unicode string into a string drawn from a plain ASCII subset.
Punycode is not a brand new algorithm in itself. It's a specific instance of a more generic algorithm called Bootstring, which enables the representation of any string from a larger character set (Unicode in our case) using a smaller set of characters (a subset of ASCII in our case) [11]. The Bootstring concept was designed to be universal and functional across most scripts, while striving to be self-optimized -- it adapts its internal state to the frequency distribution of characters in a particular string, which keeps the encoded output short [10].
Origin of the name: How is it "puny"?
The name is a catchy wordplay that rhymes with Unicode and refers to three ways the algorithm is "puny":
- Small subset of characters: only lowercase letters, digits, and the hyphen are used [2][3].
- Short encoded version: encoded strings aren't much longer than the original. This isn't just elegant -- it's necessary, because DNS limits the length of each label to 63 characters [12].
- Small implementation: the reference C implementation in the RFC is surprisingly compact.
The power of Punycode lies in its "puniness" -- its simplicity. It manages to achieve maximum significance (universal across all characters, therefore applicable to all languages) with minimal requirements.
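The first two kinds of "puniness" can be checked mechanically. The following is a minimal sketch, assuming Python's built-in `punycode` codec; the helper name `ace_label` is hypothetical, and plain lowercasing stands in for the full IDNA mapping step.

```python
import re

MAX_LABEL_OCTETS = 63              # per-label DNS limit mentioned above
LDH = re.compile(r"[a-z0-9-]+")    # letters, digits, hyphen

def ace_label(label: str) -> str:
    """Return the ASCII form of a single label, or raise ValueError.

    Simplification: lowercasing replaces the real IDNA mapping rules.
    """
    label = label.lower()
    if label.isascii():
        ace = label
    else:
        ace = "xn--" + label.encode("punycode").decode("ascii")
    if not LDH.fullmatch(ace):
        raise ValueError(f"not an LDH label: {ace!r}")
    if len(ace) > MAX_LABEL_OCTETS:
        raise ValueError(
            f"{len(ace)} octets exceeds the {MAX_LABEL_OCTETS}-octet limit"
        )
    return ace

print(ace_label("münchen"))  # xn--mnchen-3ya
```

The length check runs on the encoded form, not on the Unicode input, because the 63-octet limit applies to what is actually stored in DNS.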
The algorithm: What's behind the xn--?
Punycode-encoded labels are denoted by a special prefix: "xn--", so everything encoded starts with it. The prefix was selected by IANA in February 2003 through a voting process among IESG members, where "XN" received 17 votes and won [13]. It was formalized within the ACE (ASCII-Compatible Encoding) specification in IDNA 2003 [9].
The encoding process has multiple phases:
- ASCII character separation: All ASCII characters (those that don't need to be encoded) from the input string are copied, in their original order, to the start of the output string.
- Add the hyphen separator (-): If there were any ASCII characters, a separator hyphen is added after them (e.g., for "čáslav", the ASCII characters are followed by a trailing hyphen: "slav-") [11]. The hyphen itself is an ASCII character, so hyphens in the input are copied to the output like any other ASCII character. That doesn't create any ambiguity: if the output contains a hyphen, the last one is always the separator, marking where the ASCII characters end.
- Encoding non-ASCII characters: The characters beyond ASCII are encoded using the Bootstring algorithm with parameters specific to Punycode, resulting in a sequence of a-z and 0-9 [11].
- Adding the ACE prefix (xn--): In domain names, the Punycode-encoded label is prefixed with xn-- to denote ACE (ASCII-Compatible Encoding) [9][13].
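The phases above can be sketched in code. The following is a simplified Python rendering of the encoding procedure in RFC 3492, using the Punycode parameter values from that RFC. It is meant as an illustration rather than a reference implementation: the function names are my own, and the RFC's overflow checks are omitted because Python integers do not overflow.

```python
BASE, TMIN, TMAX = 36, 1, 26
SKEW, DAMP = 38, 700
INITIAL_BIAS, INITIAL_N = 72, 128
DIGITS = "abcdefghijklmnopqrstuvwxyz0123456789"  # digit values 0..35

def adapt(delta: int, numpoints: int, first_time: bool) -> int:
    """Bias adaptation: tunes the digit thresholds to the deltas seen so far."""
    delta = delta // DAMP if first_time else delta // 2
    delta += delta // numpoints
    k = 0
    while delta > ((BASE - TMIN) * TMAX) // 2:
        delta //= BASE - TMIN
        k += BASE
    return k + (BASE - TMIN + 1) * delta // (delta + SKEW)

def threshold(k: int, bias: int) -> int:
    """Clamp k - bias into the range TMIN..TMAX."""
    return min(max(k - bias, TMIN), TMAX)

def punycode_encode(text: str) -> str:
    basic = [c for c in text if ord(c) < 128]
    out = basic[:]                      # phase 1: copy ASCII characters
    if basic:
        out.append("-")                 # phase 2: separator hyphen
    n, delta, bias = INITIAL_N, 0, INITIAL_BIAS
    handled = len(basic)
    while handled < len(text):          # phase 3: encode the rest
        m = min(ord(c) for c in text if ord(c) >= n)
        delta += (m - n) * (handled + 1)
        n = m
        for c in text:
            cp = ord(c)
            if cp < n:
                delta += 1
            elif cp == n:
                # Emit delta as a variable-length base-36 integer.
                q, k = delta, BASE
                while True:
                    t = threshold(k, bias)
                    if q < t:
                        break
                    out.append(DIGITS[t + (q - t) % (BASE - t)])
                    q = (q - t) // (BASE - t)
                    k += BASE
                out.append(DIGITS[q])
                bias = adapt(delta, handled + 1, handled == len(basic))
                delta = 0
                handled += 1
        delta += 1
        n += 1
    return "".join(out)

print("xn--" + punycode_encode("čáslav"))  # phase 4: xn--slav-4na7x
```

Each non-ASCII character is encoded as a single number (`delta`) that captures both its code point and its insertion position, which is why the output stays short.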
So, for example, if we want to encode the string "čáslav" (a Czech town):
- ASCII character separation: From "čáslav", the ASCII characters "slav" are placed at the start of the output.
- Add the hyphen separator (-): As the input contains both ASCII and non-ASCII characters, a hyphen is added, so the output is slav-
- Encoding non-ASCII characters: The č and á characters are encoded using the Bootstring algorithm into 4na7x and appended to the end of the output, so the resulting Punycode output is slav-4na7x
- Adding the ACE prefix (xn--): To denote the Punycode-encoded text in the domain name, we prepend the xn-- prefix (ACE -- ASCII-Compatible Encoding).
So the string we can use in our DNS setup is xn--slav-4na7x
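The worked example can be reproduced, and reversed, with Python's built-in `punycode` codec. This is only a quick check of the steps above; the variable names are illustrative.

```python
label = "čáslav"

# Encode: the codec yields the ASCII part, the separator, and the encoded tail.
encoded = label.encode("punycode").decode("ascii")
ascii_part, separator, tail = encoded.rpartition("-")
print(ascii_part, separator, tail)   # slav - 4na7x

# Prepend the ACE prefix to get the DNS-ready label.
ace = "xn--" + encoded
print(ace)                           # xn--slav-4na7x

# Decode: strip the prefix and run the codec in the other direction.
decoded = ace[len("xn--"):].encode("ascii").decode("punycode")
print(decoded == label)              # True
```

Splitting on the last hyphen with `rpartition` mirrors the rule that the final hyphen is the separator.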
Examples
The table below shows how different types of input are transformed. The outputs were generated with the tr46 library, using UTS #46 processing.
| Input | Nameprepped, Punycode encoded and ACE prefixed output | Description |
|---|---|---|
| hello | hello | Simple ASCII word |
| world-test | world-test | ASCII word with hyphen |
| café | xn--caf-dma | French word with é |
| naïve | xn--nave-6pa | French word with ï |
| résumé | xn--rsum-bpad | French word with é (twice) |
| Zürich | xn--zrich-kva | German city with ü (uppercase lowercased during mapping) |
| münchen | xn--mnchen-3ya | German city with ü |
| español | xn--espaol-zwa | Spanish word with ñ |
| português | xn--portugus-q1a | Portuguese word with ê |
| français | xn--franais-xxa | French word with ç |
| تست | xn--pgba0a | Arabic script |
| δοκιμή | xn--jxalpdlp | Greek |
| פרובה | xn--5dbgb3dua | Hebrew |
| गुजराती | xn--31bky1czdnc | Devanagari script (the Hindi word for "Gujarati") |
| ไทย | xn--o3cw4h | Thai word |
| 中文 | xn--fiq228c | Chinese word |
| 日本語 | xn--wgv71a119e | Japanese word |
| 한국어 | xn--3e0bk47br7k | Korean word |
| école.fr | xn--cole-9oa.fr | French school domain with é |
| bücher.de | xn--bcher-kva.de | German books domain with ü |
| niño.ws | xn--nio-8ma.ws | Spanish word domain with ñ |
| tølløse.dk | xn--tllse-vuac.dk | Danish place domain with ø |
| العربية.ws | xn--mgbcd4a2b0d2b.ws | Arabic |
| test中文 | xn--test-3f5fy05j | Mixed ASCII and Chinese |
| café123 | xn--caf123-dva | Mixed letters and numbers with é |
| á | xn--1ca | Single letter with acute accent |
| ñ | xn--ida | Single letter with tilde |
| ø | xn--pda | Single Scandinavian letter |
| 测试 | xn--0zwm56d | Chinese |
| münchen.bayern | xn--mnchen-3ya.bayern | German domain with regional TLD |
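A few of these rows can be spot-checked without any third-party library. The sketch below uses Python's built-in `idna` codec, which implements IDNA 2003 rather than the UTS #46 processing used for the table; for the simple inputs chosen here the two should agree, but that is an assumption that does not hold for every possible input.

```python
# A handful of rows copied from the table above.
rows = {
    "hello": "hello",
    "café": "xn--caf-dma",
    "Zürich": "xn--zrich-kva",
    "中文": "xn--fiq228c",
    "école.fr": "xn--cole-9oa.fr",
    "münchen.bayern": "xn--mnchen-3ya.bayern",
}

results = {}
for name, expected in rows.items():
    # The codec splits on dots and encodes each label separately.
    actual = name.encode("idna").decode("ascii")
    results[name] = actual
    status = "ok" if actual == expected else "MISMATCH"
    print(f"{name:16} {actual:24} {status}")
```

Note that ASCII-only labels such as `fr` or `bayern` pass through unchanged, and only labels containing non-ASCII characters receive the xn-- prefix.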
IDNA 2003 vs. IDNA 2008: Two generations of the standard
The original IDNA 2003 specification (RFC 3490) served its purpose, but had a fundamental design limitation: it was hardwired to Unicode 3.2 [9]. Every time the Unicode Consortium published a new version with additional scripts and characters, the IDNA spec couldn't accommodate them without a protocol revision. That's a problem when Unicode grows from around 95,000 characters in version 3.2 to over 154,000 in version 16.0 [14].
The IETF published IDNA 2008 in August 2010 as a suite of RFCs (5890 through 5895) [15][16][17]. The new framework was designed to be independent of any specific Unicode version -- it algorithmically derives which code points are allowed based on their Unicode properties, rather than relying on a static table [17].
The practical differences can bite, though. The most famous example is the German Eszett (ß) -- under IDNA 2003, faß.de got mapped to fass.de irreversibly (via Nameprep in RFC 3491 [18]). Under IDNA 2008, ß is a valid character (classified as PVALID), so faß.de and fass.de are treated as distinct domains [17]. Similarly, the Greek final sigma (ς) was mapped to regular sigma (σ) in IDNA 2003, while IDNA 2008 also classifies it as PVALID, so names that differ only in ς versus σ are distinct.
This created a real interoperability headache during the transition. A user typing faß.de in a browser using IDNA 2003 would be taken to one domain, while a browser implementing IDNA 2008 would go to a different one. The Unicode Consortium stepped in with UTS #46 (Unicode IDNA Compatibility Processing), which provides a mapping layer that allows client software to work with domains registered under either standard [19]. In practice, most browsers today use UTS #46 in nontransitional mode, which keeps ß and final sigma as distinct characters, the way IDNA 2008 does, instead of mapping them away.
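The Eszett difference is easy to observe from Python. The built-in `idna` codec implements IDNA 2003, so it applies the Nameprep folding. The second function is a hypothetical helper of my own that skips mapping entirely and only Punycode-encodes non-ASCII labels, which approximates what an IDNA 2008 lookup does for this particular name; it is not a full IDNA 2008 implementation.

```python
def ace_without_mapping(domain: str) -> str:
    """Punycode-encode non-ASCII labels as-is (no Nameprep, no validation)."""
    labels = []
    for label in domain.split("."):
        if label.isascii():
            labels.append(label)
        else:
            labels.append("xn--" + label.encode("punycode").decode("ascii"))
    return ".".join(labels)

# IDNA 2003: Nameprep folds ß to "ss" before encoding.
idna2003 = "faß.de".encode("idna").decode("ascii")
print(idna2003)                       # fass.de

# Without the mapping, ß survives and gets its own ACE form.
unmapped = ace_without_mapping("faß.de")
print(unmapped)                       # xn--fa-hia.de
```

The two outputs are different DNS names, which is exactly the interoperability problem described above.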
The smart solution created a vulnerability: IDN homograph attack
Punycode allowed using international character sets in domain names and made the internet more accessible for users from all over the world. The same technology, though, provided cybercriminals with a new attack vector. Attackers started to exploit the visual similarity of characters across different alphabets to create so-called IDN homograph attacks -- a form of exploit based on deceiving a user with a URL whose domain has one or more characters swapped for visually identical (or almost identical) but technically different characters [20][21].
The idea was first described in a 2001 paper by Evgeniy Gabrilovich and Alex Gontmakher from the Technion, Israel Institute of Technology [22]. To demonstrate the attack, they registered a homographed microsoft.com domain using Cyrillic characters c and o that are visually indistinguishable from their Latin counterparts. The paper was published in Communications of the ACM in February 2002 and laid out clearly that international domain names would inevitably be exploited for phishing.
And they were right. Over two decades later, homograph attacks remain a real and active threat.
The principle: How the homograph attack works
IDN homograph attacks exploit visually identical or very similar characters from different alphabets, so-called homoglyphs [23]. Unicode 16.0 contains 154,998 characters from various scripts [14], many of which look almost identical but have different code points [20].
Homoglyph examples:
| Latin character | Homoglyph | Origin | Unicode | Attack example |
|---|---|---|---|---|
| a (U+0061) | а (U+0430) | Cyrillic | U+0430 | аpple.com |
| o (U+006F) | о (U+043E) | Cyrillic | U+043E | gооgle.com |
| p (U+0070) | р (U+0440) | Cyrillic | U+0440 | рaypal.com |
| e (U+0065) | е (U+0435) | Cyrillic | U+0435 | еbay.com |
| c (U+0063) | с (U+0441) | Cyrillic | U+0441 | сisco.com |
| x (U+0078) | х (U+0445) | Cyrillic | U+0445 | eхpress.com |
The Cyrillic script is particularly dangerous here because a significant number of its lowercase letters are visually indistinguishable from Latin letters -- a, c, e, o, p, x, y, and several others. An attacker can register a domain that's entirely Cyrillic yet appears identical to an ASCII domain in the browser's address bar.
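A rough way to see the difference that the eye cannot is to ask Unicode which script each character belongs to. Python's standard library does not expose the Unicode Script property, so this sketch approximates it with the first word of each character's Unicode name. That is a heuristic I am assuming for illustration, not how browsers do it.

```python
import unicodedata

def scripts_in(label: str) -> set[str]:
    """Approximate the set of scripts in a label from Unicode character names."""
    scripts = set()
    for ch in label:
        if ch.isdigit() or ch == "-":
            continue  # treat digits and the hyphen as script-neutral
        name = unicodedata.name(ch, "")
        scripts.add(name.split(" ")[0] if name else "UNKNOWN")
    return scripts

genuine = "apple"
spoofed = "\u0430pple"   # first letter is CYRILLIC SMALL LETTER A (U+0430)

print(scripts_in(genuine))           # {'LATIN'}
print(sorted(scripts_in(spoofed)))   # ['CYRILLIC', 'LATIN']
print(genuine == spoofed)            # False, although both render alike
```

A label that mixes scripts is easy to flag this way. A label written entirely in Cyrillic lookalikes would report a single script and pass this check, which is why the whole-script case needs separate handling.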
The apple.com attack that changed everything
The most famous real-world demonstration came in April 2017 when security researcher Xudong Zheng registered the domain xn--80ak6aa92e.com [24]. When rendered by a browser, this Punycode string decoded to what appeared to be apple.com -- every single character was a Cyrillic lookalike. Chrome, Firefox, and Opera all displayed it as apple.com in the address bar, complete with a valid SSL certificate.
Zheng had reported the bug to Chrome in January 2017 and received a $2,000 bounty. Chrome shipped a fix in version 58 (April 2017) that added whole-script confusable detection [25]. Firefox followed with similar protections. Safari, Internet Explorer, and Edge were not affected -- they had already been blocking all-Cyrillic domains that resembled Latin text on non-Cyrillic TLDs.
This single attack is probably the reason most developers have even heard of Punycode.
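Decoding the label from that demonstration shows what was hiding behind the familiar-looking name. This is a small sketch using Python's built-in `punycode` codec and `unicodedata` module.

```python
import unicodedata

label = "xn--80ak6aa92e"

# Strip the ACE prefix and decode the remainder.
decoded = label[len("xn--"):].encode("ascii").decode("punycode")

code_points = [ord(ch) for ch in decoded]
for ch in decoded:
    print(f"U+{ord(ch):04X}  {unicodedata.name(ch)}")

# Every character belongs to the Cyrillic block; none is Latin.
all_cyrillic = all(unicodedata.name(ch).startswith("CYRILLIC") for ch in decoded)
print(all_cyrillic)  # True
```

Since the label contains no ASCII letters at all, there is no ASCII part and no separator hyphen in the encoded form, only the prefix followed by Bootstring digits.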
How browsers defend against homograph attacks
Each major browser has its own policy for deciding whether to display a domain in Unicode or fall back to raw Punycode in the address bar. The approaches differ in the details, but the core idea is the same: if a domain looks like it could be impersonating a well-known domain, show the user the xn-- Punycode instead.
Chrome uses one of the most elaborate systems [25]. For each label in a hostname, Chrome converts the ACE to Unicode via UTS #46 and then runs a series of checks. If any check fails, it displays Punycode. The checks include: mixed-script detection (a label mixing Latin and Cyrillic, for instance, is suspicious), whole-script confusable detection (all-Cyrillic labels that look like Latin are flagged unless the TLD is known to host Cyrillic domains like .ru or .ua), digit-spoof detection, and skeleton matching against a list of top domains. Despite all this, research has shown Chrome still misses roughly 40% of crafted homograph IDNs [26].
Firefox takes a slightly different approach built around the "Moderately Restrictive" profile from Unicode Technical Standard #39 [23]. All characters in a label must come from Common + Inherited + a single script, or from a small set of approved combinations (such as Latin + Han + Hiragana + Katakana, for Japanese mixed-script domains) [27]. Crucially, Firefox explicitly disallows mixing Latin with Cyrillic or Greek in the same label. For TLDs on Mozilla's whitelist -- those whose registries have their own anti-homograph policies -- Firefox displays Unicode unconditionally.
Safari blocks all IDNs that mix scripts and rejects all-Cyrillic and all-Greek labels on non-matching TLDs [26]. Apple's approach is arguably the most aggressive, which is why Safari wasn't vulnerable to the 2017 apple.com attack. The trade-off is that some legitimate IDN domains may be displayed as Punycode unnecessarily.
None of these defenses are perfect. A 2021 USENIX Security paper by Hu et al. systematically tested all major browsers and found that every single one could be bypassed with carefully chosen confusable characters from less common Unicode blocks [26]. The problem is fundamentally hard: any character that looks like another character is a potential weapon, and Unicode has hundreds of thousands of characters across dozens of scripts.
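The idea behind skeleton matching can be illustrated in a few lines. The sketch below is not any browser's implementation: the confusables map is a tiny hand-picked subset taken from the homoglyph table earlier in this article, and the list of protected domains is hypothetical. Real implementations rely on the much larger confusables data published with UTS #39.

```python
# Hand-picked Cyrillic-to-Latin lookalikes (illustrative subset only).
CONFUSABLES = {
    "\u0430": "a",  # а
    "\u043e": "o",  # о
    "\u0440": "p",  # р
    "\u0435": "e",  # е
    "\u0441": "c",  # с
    "\u0445": "x",  # х
}

# Hypothetical list of domains worth protecting.
TOP_DOMAINS = {"apple.com", "google.com", "paypal.com"}

def skeleton(host: str) -> str:
    """Replace each known lookalike with the character it imitates."""
    return "".join(CONFUSABLES.get(ch, ch) for ch in host.lower())

def looks_like_top_domain(host: str) -> bool:
    """True if host is not a top domain itself but collapses onto one."""
    return host.lower() not in TOP_DOMAINS and skeleton(host) in TOP_DOMAINS

print(looks_like_top_domain("g\u043e\u043egle.com"))  # True
print(looks_like_top_domain("google.com"))            # False
print(looks_like_top_domain("example.com"))           # False
```

The limitation is visible in the data structure itself: any lookalike missing from the map goes undetected, which is consistent with the bypasses reported for less common Unicode blocks.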
Current state of IDN adoption
Despite the security complications, internationalized domain names have been steadily growing. As of 2024, there are an estimated 4.4 million second-level IDN registrations globally, representing about 1.2% of all domain names [28]. That's not a huge number, but the distribution is interesting.
The largest IDN holdings by ccTLD are: .рф (Russia) with around 769,000 domains, .de (Germany) with 648,000 IDN domains, .cn (China) with 537,000, and .中国 (China, in Chinese characters) with about 164,000 [28]. The Chinese script dominates IDN registrations under gTLDs at 49%, followed by Latin-script IDNs (characters with diacritics, mostly) at 28%.
On the TLD level itself, ICANN's IDN ccTLD Fast Track Process (approved in October 2009 in Seoul) enabled countries to register top-level domains in their own scripts [29]. The first IDN TLDs went live on May 5, 2010 -- Egypt (.مصر), Saudi Arabia (.السعودية), and the UAE (.امارات) in Arabic -- with Russia (.рф) in Cyrillic following later that month [29]. As of June 2025, 151 TLDs have been delegated as IDNs, representing 37 languages across 23 scripts [30]. That's a meaningful infrastructure, even if the actual registration volumes remain modest.
85% of ccTLD registries and 41% of gTLD registries now support IDN registrations. The regional support rates tell a story too -- Europe at 88%, Asia at 87%, the Americas trailing at 68% [28].
Universal Acceptance: the bigger problem
Having an IDN domain is one thing. Being able to actually use it everywhere is another.
Universal Acceptance (UA) is an initiative spearheaded by ICANN's Universal Acceptance Steering Group (UASG, established in 2015) that aims to ensure all domain names and email addresses -- including those with non-ASCII characters or new TLDs -- are treated equally by all internet-connected software [31]. It sounds obvious, but it's far from reality. Try entering an email address like user@example.中国 into most web forms and you'll get a validation error. Many systems reject domain names that are "too long," don't recognize new TLDs, or simply can't handle non-ASCII characters in email addresses.
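One small step a form validator could take is to convert the domain part to its ASCII form before validating. The sketch below assumes a simplified ASCII-only pattern of the kind many forms use (the regular expression is illustrative, not a complete address grammar) and Python's built-in `idna` codec, which implements IDNA 2003. It handles only the domain. Non-ASCII local parts need mail-system support and are out of scope here.

```python
import re

# Simplified ASCII-only check, similar in spirit to many form validators.
ASCII_EMAIL = re.compile(r"[^@\s]+@[A-Za-z0-9-]+(?:\.[A-Za-z0-9-]+)+")

def domain_to_ascii(address: str) -> str:
    """Convert the domain part of an address to its ACE (xn--) form."""
    local, _, domain = address.rpartition("@")
    return local + "@" + domain.encode("idna").decode("ascii")

address = "user@example.中国"

rejected = ASCII_EMAIL.fullmatch(address) is None
converted = domain_to_ascii(address)
accepted = ASCII_EMAIL.fullmatch(converted) is not None

print(rejected)    # True: the raw address fails the ASCII-only pattern
print(converted)   # user@example.xn--fiqs8s
print(accepted)    # True: the converted address passes
```

The conversion is lossless, so an application can store or display the Unicode form and use the ASCII form only where a legacy component requires it.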
The progress has been slow. Between 2022 and 2025, the share of domains with UA-ready mail servers went from roughly 20% to about 28% [31]. Software libraries, programming frameworks, and web applications still routinely fail to handle IDN email addresses correctly. The UASG operated for 10 years and produced guidelines and toolkits, but ICANN wound down its direct funding and staff support for the group in June 2025, transitioning responsibility to a new UA Expert Working Group [32].
This is arguably the most frustrating part of the whole IDN story. The standards exist, the encoding works perfectly (Punycode is elegant and battle-tested), the domains are registered and resolving -- but the applications people use every day still treat non-ASCII domain names as second-class citizens. A Russian user with a .рф email address or a Chinese user with a .中国 address is still going to hit walls in countless online forms, login systems, and applications that assume email addresses are pure ASCII.
The technical barrier that existed in 1983 has been solved. The practical barrier, twenty-three years after Punycode was standardized, is mostly about software developers not bothering to implement what's already there.
Citations
1. RFC 882: Domain Names - Concepts and Facilities. P. Mockapetris, November 1983
2. RFC 952: DoD Internet Host Table Specification. K. Harrenstien, M. Stahl, E. Feinler, October 1985
3. RFC 1123: Requirements for Internet Hosts -- Application and Support. Section 2.1. Retrieved March 1, 2026
4. INFITT: A New Architecture for Multilingual Internet Domains. Retrieved March 1, 2026
5. Unicode Consortium: History of Unicode. Retrieved March 1, 2026
6. Joseph D. Becker: Unicode 88. Xerox Corporation, August 29, 1988
7. Unicode Consortium: About Unicode. Retrieved March 1, 2026
8. Unicode Consortium: Release and Publication Dates. Retrieved March 1, 2026
9. RFC 3490: Internationalizing Domain Names in Applications (IDNA). Retrieved March 1, 2026
10. RFC 3492: Punycode: A Bootstring encoding of Unicode for Internationalized Domain Names in Applications (IDNA). A. Costello, March 2003
11. RFC 3492: Punycode: A Bootstring encoding of Unicode for IDNA. Sections 3 and 6. Retrieved March 1, 2026
12. RFC 1035: Domain Names -- Implementation and Specification. Section 2.3.4. Retrieved March 1, 2026
13. IANA: Results of IANA Selection of IDNA Prefix (February 14, 2003). Referenced in RFC 3490, Section 5. Retrieved March 1, 2026
14. Unicode Consortium: Unicode 16.0.0. Released September 10, 2024. Retrieved March 1, 2026
15. RFC 5890: Internationalized Domain Names for Applications (IDNA): Definitions and Document Framework. Retrieved March 1, 2026
16. RFC 5891: Internationalized Domain Names in Applications (IDNA): Protocol. Retrieved March 1, 2026
17. RFC 5892: The Unicode Code Points and Internationalized Domain Names for Applications (IDNA). Retrieved March 1, 2026
18. RFC 3491: Nameprep: A Stringprep Profile for Internationalized Domain Names (IDN). Retrieved March 1, 2026
19. Unicode Technical Standard #46: Unicode IDNA Compatibility Processing. Retrieved March 1, 2026
20. Unicode Technical Report #36: Unicode Security Considerations. Retrieved March 1, 2026
21. ICANN SSAC Advisory SAC037: Display and Usage of Internationalized Registration Data. Retrieved March 1, 2026
22. Evgeniy Gabrilovich and Alex Gontmakher: The Homograph Attack. Communications of the ACM, 45(2):128, February 2002
23. Unicode Technical Standard #39: Unicode Security Mechanisms. Retrieved March 1, 2026
24. Xudong Zheng: Phishing with Unicode Domains. April 2017. Retrieved March 1, 2026
25. Chromium: Internationalized Domain Names (IDN) in Google Chrome. Retrieved March 1, 2026
26. Hang Hu, Steve T.K. Jan, Yang Wang, Gang Wang: Assessing Browser-level Defense against IDN-based Phishing. 30th USENIX Security Symposium, 2021
27. Mozilla: IDN Display Algorithm. Retrieved March 1, 2026
28. EURid / UNESCO: IDN World Report 2024. Retrieved March 1, 2026
29. ICANN: IDN ccTLD Fast Track Process. Retrieved March 1, 2026
30. ICANN: IDN Annual Report June 2025. Retrieved March 1, 2026
31. ICANN: Universal Acceptance (UA). Retrieved March 1, 2026
32. ICANN: Universal Acceptance: Aligning Resources and the Path Forward. April 30, 2025. Retrieved March 1, 2026
Updated: March 1, 2026