IDN homograph attack
An IDN homograph attack is a way of deceiving a user by registering a domain name that looks identical to a legitimate one but is actually composed of different Unicode characters. The attacker swaps one or more characters in a well-known domain -- say, apple.com -- with visually indistinguishable characters from another script, like Cyrillic. The result is a domain that appears genuine in the browser's address bar, passes casual inspection, and can even carry a valid SSL certificate with a padlock icon. The victim has almost no way to tell it apart from the real thing.
The attack is possible because of internationalized domain names (IDN), the system that lets people register domains in non-Latin scripts like Arabic, Chinese, or Cyrillic. Under the hood, these domains are encoded as ASCII via Punycode -- a scheme that translates Unicode into DNS-compatible strings prefixed with xn--. When a browser receives a Punycode domain, it decodes it back to Unicode for display. And that's where the problem begins: Unicode contains over 154,000 characters across dozens of scripts, and many of those characters are visually identical or nearly identical to Latin letters [1].
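As a rough illustration of the encoding step, the sketch below uses Python's built-in punycode codec to convert a single label to its xn-- form and back. The helper names to_ace and to_unicode are made up for this example, and the codec performs only the raw Punycode transform, not the normalization and validation a full IDNA implementation applies.

```python
def to_ace(label: str) -> str:
    """Encode one Unicode label as an ASCII-compatible 'xn--' label."""
    if label.isascii():
        return label  # plain ASCII labels are stored unchanged
    return "xn--" + label.encode("punycode").decode("ascii")


def to_unicode(label: str) -> str:
    """Decode an 'xn--' label back to Unicode, roughly as done for display."""
    if label.lower().startswith("xn--"):
        return label[4:].encode("ascii").decode("punycode")
    return label


print(to_ace("bücher"))             # xn--bcher-kva
print(to_unicode("xn--bcher-kva"))  # bücher
```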
The characters that look alike across scripts are called homoglyphs (or confusables, in Unicode Consortium terminology). The Cyrillic lowercase "a" (U+0430) and the Latin "a" (U+0061) are pixel-for-pixel identical in most fonts. So are Cyrillic "o", "c", "e", "p", "x", and "y" when compared to their Latin counterparts. An attacker who registers a domain using only these Cyrillic lookalikes gets a string that displays as a familiar Latin domain name -- but resolves to a completely different server.
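To make the distinction concrete, here is a small sketch that compares a Latin string with a lookalike whose first letter is Cyrillic. The two variable names are illustrative only; the point is that the strings render alike but are different code point sequences.

```python
import unicodedata

latin = "apple"
# Same visual word, but the first letter is U+0430 instead of U+0061.
spoof = "\u0430pple"

for ch in (latin[0], spoof[0]):
    print(f"U+{ord(ch):04X}  {unicodedata.name(ch)}")

# The comparison is on code points, not on appearance.
print(latin == spoof)  # False
```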
A brief history of the attack
The original paper (2001)
The concept was first formally described in December 2001 by Evgeniy Gabrilovich and Alex Gontmakher, two researchers at the Technion -- Israel Institute of Technology. Their paper, "The Homograph Attack," laid out the problem with precision: internationalized domain names would inevitably be exploited for phishing because different scripts contain characters that look the same [2].
To prove it wasn't just theoretical, they registered a variant of microsoft.com using Cyrillic "c" (U+0441) and "o" (U+043E) in place of their Latin equivalents. The domain looked identical in a browser. The paper was published in Communications of the ACM in February 2002, and it reads as remarkably prescient -- everything Gabrilovich and Gontmakher warned about has come to pass.
ShmooCon and the first public demonstrations (2005)
The attack stayed largely academic until February 2005, when Eric Johanson (going by "3ric Johanson") of the Shmoo Group demonstrated a practical IDN homograph exploit at the ShmooCon security conference [3]. Johanson showed that browsers supporting IDNA -- including Firefox 1.0, Safari 1.2.5, and Opera 7.54 -- would render a spoofed PayPal domain (with Cyrillic "a" replacing the Latin one) as the genuine paypal.com. Microsoft's Internet Explorer was not vulnerable, but only because it didn't support IDN display at all at the time.
The demonstration made waves. Within days, ICANN issued a public statement on February 23, 2005, acknowledging the vulnerability and calling for community input on countermeasures [4]. Mozilla quickly pushed out an update to Firefox that added a TLD whitelist -- only domains under TLDs whose registries had anti-homograph policies would be displayed in Unicode; everything else fell back to Punycode [5]. Opera and other browsers followed with their own mitigations.
For a few years, the problem seemed contained. Browser vendors had their whitelists and script-mixing rules. Registries tightened their policies. The attack faded from headlines.
The 2017 apple.com proof-of-concept
Then, in April 2017, security researcher Xudong Zheng blew the doors open again.
Zheng registered the domain xn--80ak6aa92e.com, which -- when decoded from Punycode to Unicode -- rendered as what appeared to be apple.com in the address bar [6]. Every character was a Cyrillic lookalike. The domain was entirely single-script (all Cyrillic), which meant it sailed past the mixed-script detection rules that browsers had been relying on since 2005. Chrome, Firefox, and Opera all displayed it as apple.com, complete with the Unicode rendering. A valid SSL certificate could be obtained for it (Zheng demonstrated with a Let's Encrypt DV cert), showing a reassuring padlock.
Zheng had reported the issue to Chrome's security team on January 20, 2017 and received a $2,000 bounty. Chrome shipped a fix in version 58 (late April 2017) that added whole-script confusable detection -- a check specifically designed to catch domains where all characters come from a single non-Latin script that happens to look like Latin text [7]. Firefox added similar protections. Safari and Edge were not vulnerable; they had already been blocking all-Cyrillic domains on non-Cyrillic TLDs.
This single attack is probably the reason most developers have even heard of Punycode. It was covered by every major tech publication and fundamentally changed how browsers approach IDN security.
How the attack works, step by step
The mechanics of an IDN homograph attack are straightforward. There's no exploit code, no buffer overflow, no zero-day. It's pure visual deception.
Finding confusable characters. The attacker identifies characters from non-Latin scripts that are visually identical (or near-identical) to the letters in the target domain. Cyrillic is the most commonly used source because it has the highest overlap with Latin lowercase letters, but Greek, Armenian, Cherokee, and various other scripts also contain usable homoglyphs.
Registering the domain. The attacker registers the spoofed domain through any registrar that supports IDN. Internally, the registrar converts the Unicode domain to its Punycode (ACE) representation -- something like xn--80ak6aa92e.com -- and submits it to the registry. The DNS system never sees Unicode; it only stores and resolves ASCII.
Obtaining an SSL certificate. The attacker requests a Domain Validation (DV) certificate from any certificate authority. DV certificates only verify that the applicant controls the domain -- not that the domain is legitimate or non-deceptive [8]. The CA issues the certificate for the Punycode domain. The browser will display a padlock icon.
Deploying the phishing site. The attacker sets up a convincing replica of the target website on their server. The victim, clicking a link in a phishing email or malicious advertisement, sees a URL that looks exactly like the legitimate site, with HTTPS and a padlock. Unless the victim manually inspects the SSL certificate details or copies the URL and examines it in a text editor, there's no visible indication of fraud.
A worked example
Suppose an attacker wants to impersonate epic.com. They could construct the domain using:
| Position | Displayed character | Actual character | Script | Code point |
|---|---|---|---|---|
| e | e | е | Cyrillic | U+0435 |
| p | p | р | Cyrillic | U+0440 |
| i | i | і | Cyrillic | U+0456 |
| c | c | с | Cyrillic | U+0441 |
The resulting domain еріс.com (all Cyrillic) would encode to xn--e1awd7f.com in Punycode. In a browser that doesn't flag it, it would display as epic.com.
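The table can be checked with a few lines of Python. This is a minimal sketch using the standard library's punycode codec on the bare label; the variable names are illustrative, and a registrar's real pipeline would also run IDNA validation before encoding.

```python
import unicodedata

# The four Cyrillic code points from the table, in order.
label = "\u0435\u0440\u0456\u0441"

for ch in label:
    print(f"U+{ord(ch):04X}  {unicodedata.name(ch)}")

# Punycode-encode the label and add the ACE prefix.
ace = "xn--" + label.encode("punycode").decode("ascii")
print(ace + ".com")     # xn--e1awd7f.com

# The label is not the Latin string it resembles.
print(label == "epic")  # False
```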
The confusables problem
The Unicode Consortium is well aware of the visual ambiguity issue. Unicode Technical Standard #39 (UTS #39, "Unicode Security Mechanisms") defines the formal framework for identifying and handling confusable characters [9]. The companion data file, confusables.txt, maps approximately 6,565 characters to their visual equivalents -- it's essentially a giant lookup table of "these characters look like those characters" [10].
How confusability is classified
UTS #39 distinguishes between several types of confusability:
Single-script confusables are characters within the same script that look alike. The Latin letter "l" (lowercase L) and the digit "1" are a classic example. These exist even without internationalization and have been a nuisance since the early days of computing.
Mixed-script confusables arise when a label combines characters from more than one script and the result looks like a different string. A domain might use Cyrillic "a" in an otherwise Latin label, for example. This is the easiest class to detect -- just check whether a label mixes scripts.
Whole-script confusables are the hardest to catch. The entire label uses characters from a single non-Latin script, but the result looks like a Latin string. The 2017 apple.com attack was a whole-script confusable -- every character was Cyrillic, so no script mixing was detected. The browser has to recognize that the entire Cyrillic string happens to look like a Latin word.
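A rough way to see the difference between the mixed-script and whole-script cases is to look at which scripts a label's letters come from. The sketch below is an approximation: Python's standard library does not expose the Unicode Script property, so it assumes the first word of each character's Unicode name is a good enough stand-in. The function names are hypothetical.

```python
import unicodedata


def scripts_in(label: str) -> set:
    """Approximate the scripts used by the letters of a label.

    Assumption: the first word of the character name (LATIN, CYRILLIC,
    GREEK, ...) stands in for the real Script property.
    """
    return {unicodedata.name(ch, "UNKNOWN").split()[0]
            for ch in label if ch.isalpha()}


def classify(label: str) -> str:
    scripts = scripts_in(label)
    if not scripts:
        return "no letters"
    if len(scripts) > 1:
        return "mixed-script"
    if scripts == {"LATIN"}:
        return "Latin only"
    # One non-Latin script: a candidate for a whole-script confusable,
    # which a mixing check alone cannot flag.
    return "single non-Latin script"


print(classify("paypal"))                          # Latin only
print(classify("p\u0430ypal"))                     # mixed-script
print(classify("\u0430\u0440\u0440\u04cf\u0435"))  # single non-Latin script
```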
The skeleton algorithm
UTS #39 defines a skeleton function that maps any string to a canonical form by replacing each character with a representative from its confusable set. Two strings that produce the same skeleton are considered confusable. Chrome uses this algorithm to compare domain labels against a list of top domains -- if the skeleton of a Punycode-decoded label matches the skeleton of google or paypal, the browser displays Punycode instead [7].
On top of the skeleton comparison, Chrome's implementation strips diacritics first (so göögle ends up with the same skeleton as google) and applies a few extra mappings beyond the Unicode confusables list: ł maps to l, ø maps to o, đ maps to d, and Cyrillic к, Latin ĸ (kra), and Greek κ all map to k [7].
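The idea can be sketched in a few lines. This is a toy version, not the UTS #39 algorithm or Chrome's code: the CONFUSABLES table is a tiny hand-picked excerpt that a real tool would replace with data loaded from confusables.txt, and toy_skeleton and resembles are hypothetical names.

```python
import unicodedata

# Hand-picked excerpt for illustration only (assumption: a real
# implementation loads the full confusables.txt data instead).
CONFUSABLES = {
    "\u0430": "a",  # Cyrillic a
    "\u0435": "e",  # Cyrillic ie
    "\u043e": "o",  # Cyrillic o
    "\u0440": "p",  # Cyrillic er
    "\u0441": "c",  # Cyrillic es
    "\u0456": "i",  # Cyrillic Byelorussian-Ukrainian i
    "\u04cf": "l",  # Cyrillic palochka
}


def toy_skeleton(label: str) -> str:
    # Decompose, drop combining marks (diacritics), then map lookalikes.
    decomposed = unicodedata.normalize("NFD", label)
    stripped = "".join(ch for ch in decomposed
                       if not unicodedata.combining(ch))
    return "".join(CONFUSABLES.get(ch, ch) for ch in stripped)


def resembles(label: str, protected: set) -> bool:
    """True if label is not itself protected but shares a skeleton with one."""
    skeletons = {toy_skeleton(name) for name in protected}
    return label not in protected and toy_skeleton(label) in skeletons


print(toy_skeleton("g\u00f6\u00f6gle"))                       # google
print(resembles("\u0430\u0440\u0440\u04cf\u0435", {"apple"}))  # True
print(resembles("apple", {"apple"}))                          # False
```

A browser following this approach would fall back to the xn-- form whenever resembles returns True for a label.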
Which script pairs are most dangerous?
Not all scripts pose equal risk. The threat level correlates with how many visually identical lowercase letter pairs exist between a script and Latin.
| Script pair | Confusable lowercase letters | Risk level |
|---|---|---|
| Latin-Cyrillic | a, c, e, o, p, x, y, s (as ѕ), i (as і), d (as ԁ), h (as һ) | Very high |
| Latin-Greek | o, v, and several uppercase letters (A, B, E, H, I, K, M, N, O, P, T, X, Y, Z) | Medium |
| Latin-Armenian | o (as օ), n (as ո), u (as ս), g (as ց) | Medium |
| Latin-Cherokee | Many uppercase letter pairs | Lower (Cherokee uppercase only) |
Cyrillic is by far the biggest threat. An attacker can spell entire common English words using nothing but Cyrillic characters -- ace, cope, peppy, space, and dozens more -- because enough Cyrillic lowercase letters have Latin doppelgangers.
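As a quick check of that claim, the sketch below tries to respell a word using only the Cyrillic lookalikes listed in the Latin-Cyrillic row of the table. The mapping is limited to those eleven letters and the function name is made up for illustration.

```python
from typing import Optional

# Latin letter -> Cyrillic lookalike, limited to the letters in the table.
LATIN_TO_CYRILLIC = {
    "a": "\u0430", "c": "\u0441", "e": "\u0435", "o": "\u043e",
    "p": "\u0440", "x": "\u0445", "y": "\u0443", "s": "\u0455",
    "i": "\u0456", "d": "\u0501", "h": "\u04bb",
}


def cyrillic_spoof(word: str) -> Optional[str]:
    """Return an all-Cyrillic lookalike, or None if some letter has no match."""
    if not all(ch in LATIN_TO_CYRILLIC for ch in word):
        return None
    return "".join(LATIN_TO_CYRILLIC[ch] for ch in word)


for word in ("ace", "cope", "peppy", "space", "google"):
    spoof = cyrillic_spoof(word)
    print(word, "->", "no all-Cyrillic spelling" if spoof is None else spoof)
```

With this table, the first four words can be respelled entirely in Cyrillic, while google cannot because g and l have no entry.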
Beyond these script pairs, there's a long tail of individual confusable characters from scripts like Georgian (ა looks like a), Myanmar, Lao, and others. The Unicode confusables.txt file is the authoritative source for the full set.
Browser defenses, in detail
Chrome's multi-layered algorithm
Chrome has arguably the most elaborate IDN display policy of any browser. When Chrome needs to display a domain, it runs a 13-step gauntlet of checks on each label. If any check fails, the user sees raw Punycode (xn--...) instead of Unicode [7].
The checks, roughly in order:
- UTS #46 conversion -- attempt to decode the Punycode to Unicode; if conversion fails (invalid characters, BiDi violations, leading combining marks), show Punycode.
- Identifier status -- every character must be allowed under UTS #39's identifier profile.
- Disallowed character list -- Chrome maintains its own blocklist of specific code points.
- Script mixing validation -- applies the "Highly Restrictive" profile from UTS #39. Latin mixed with Cyrillic or Greek in the same label is rejected. Latin + CJK scripts and Han + Kana + Hangul + Bopomofo are allowed.
- Numbering system check -- a label mixing digits from different numbering systems (e.g., Arabic-Indic and Western digits) is flagged.
- Invisible character detection -- repeated combining marks or invisible formatting characters trigger Punycode display.
- Contextual character checks -- characters like the Latin middle dot (U+00B7) are only allowed in specific linguistic contexts (Catalan ela geminada).
- Mixed-script confusable test -- per UTS #39.
- Whole-script confusable detection -- if all letters belong to a known confusable script (Cyrillic, Greek, etc.) and the TLD isn't associated with that script (e.g., .ru, .ua, .bg for Cyrillic), show Punycode. This is the check added after the 2017 apple.com attack.
- Digit spoof detection -- labels containing only digits and digit look-alikes are flagged.
- Dangerous pattern matching -- specific known dangerous patterns are caught.
- Skeleton matching against top domains -- the skeleton of the domain (after diacritic removal) is compared against a pre-computed list of top domains' skeletons.
- Default: show Unicode -- if nothing tripped, display the friendly Unicode form.
This system has been in place since Chrome 51 and has been incrementally refined. But it's not airtight. The 2021 Hu et al. study found that Chrome still missed a substantial fraction of crafted homograph domains [11].
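To give a feel for one of these checks, here is a simplified sketch of the whole-script confusable step for Cyrillic. It is not Chrome's code: the lookalike set and the TLD list are small assumptions drawn from examples in this article, and display_form is a hypothetical name.

```python
import unicodedata

# Cyrillic letters that pass for Latin ones (assumed subset, not Chrome's list).
LOOKALIKES = set("\u0430\u0441\u0435\u043e\u0440\u0445\u0443"
                 "\u0455\u0456\u0501\u04bb\u04cf")
# TLDs treated as Cyrillic-associated (only the three examples given above).
CYRILLIC_TLDS = {"ru", "ua", "bg"}


def display_form(unicode_label: str, tld: str) -> str:
    """Return the label as it would be shown: Unicode or xn-- fallback."""
    letters = [ch for ch in unicode_label if ch.isalpha()]
    all_cyrillic = bool(letters) and all(
        unicodedata.name(ch, "").startswith("CYRILLIC") for ch in letters)
    only_lookalikes = all(ch in LOOKALIKES for ch in letters)
    if all_cyrillic and only_lookalikes and tld not in CYRILLIC_TLDS:
        # Looks entirely Latin but is entirely Cyrillic: show Punycode.
        return "xn--" + unicode_label.encode("punycode").decode("ascii")
    return unicode_label


spoof = "\u0430\u0440\u0440\u04cf\u0435"
print(display_form(spoof, "com"))  # xn--80ak6aa92e
print(display_form(spoof, "ru"))   # shown in Unicode
```

A Cyrillic label containing at least one letter with no Latin lookalike passes this particular check, which is why Chrome layers the skeleton comparison on top.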
Firefox's approach
Firefox takes a different path, built around UTS #39 restriction profiles [12].
By default, Firefox uses the Highly Restrictive profile (configurable to Moderately Restrictive via network.IDN.restriction_profile in about:config). Under this profile, all characters in a label must come from the Common and Inherited scripts plus a single script, or from a small set of approved multi-script combinations:
- Latin + Han + Hiragana + Katakana (Japanese)
- Latin + Han + Bopomofo (Chinese Bopomofo input)
- Latin + Han + Hangul (Korean)
Crucially, Latin mixed with Cyrillic or Greek is never allowed in the same label, regardless of TLD.
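The profile can be approximated as a set-membership test. This sketch is not Firefox's implementation: it assumes the first word of a character's Unicode name is an acceptable stand-in for its script (so Han shows up as CJK), and the function name is hypothetical.

```python
import unicodedata

# The three multi-script combinations listed above, using name prefixes.
ALLOWED_COMBINATIONS = [
    {"LATIN", "CJK", "HIRAGANA", "KATAKANA"},
    {"LATIN", "CJK", "BOPOMOFO"},
    {"LATIN", "CJK", "HANGUL"},
]


def passes_highly_restrictive(label: str) -> bool:
    scripts = {unicodedata.name(ch, "UNKNOWN").split()[0]
               for ch in label if ch.isalpha()}
    if len(scripts) <= 1:
        return True  # a single script (or no letters) is always acceptable
    return any(scripts <= combo for combo in ALLOWED_COMBINATIONS)


print(passes_highly_restrictive("tokyo\u6771\u4eac"))               # True
print(passes_highly_restrictive("p\u0430ypal"))                     # False
print(passes_highly_restrictive("\u0430\u0440\u0440\u04cf\u0435"))  # True
```

The last call shows the limitation of a script-mixing rule on its own: an all-Cyrillic lookalike passes because it uses a single script.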
Firefox also maintains a TLD whitelist -- TLDs whose registries have their own anti-homograph policies (like .рф, .jp, .de) get more permissive display rules. For whitelisted TLDs, Firefox shows Unicode unconditionally. For everything else, the algorithm applies.
One notable weakness: Firefox has historically not implemented whole-script confusable detection as aggressively as Chrome. The Mozilla team has acknowledged this gap and has placed some of the responsibility on registries to prevent confusable registrations at the source [13].
Users can bypass all of this by setting network.IDN_show_punycode to true in about:config, which forces Punycode display for all IDN domains globally -- the nuclear option.
Safari
Safari's approach is the most conservative. Apple maintains a whitelist of scripts that don't contain Latin-confusable characters. Scripts not on that whitelist -- including Cyrillic, Greek, and Cherokee -- are always displayed as Punycode on non-matching TLDs [11].
This is why Safari was never vulnerable to the 2017 apple.com attack. Apple's reasoning is simple: better to occasionally show Punycode for legitimate Cyrillic domains on .com than to let a convincing phishing domain through. The trade-off is that Russian or Greek users on Safari sometimes see ugly xn-- strings for perfectly legitimate domains under generic TLDs.
Edge
Since Edge switched to the Chromium engine in January 2020, it inherits Chrome's IDN display policy essentially unchanged. The pre-Chromium EdgeHTML-based Edge had its own separate policy that was broadly similar to Safari's conservative approach.
Registry-level defenses
Browser-side detection is one layer. Preventing the confusable domain from being registered in the first place is another.
ICANN's IDN Implementation Guidelines
ICANN publishes the IDN Implementation Guidelines, most recently updated to version 4.1 in April 2025 [14]. These guidelines require registries to:
- Publish an explicit list of Unicode code points permitted for registration under each TLD.
- Restrict each domain label to characters from a single script (no mixed-script labels).
- Implement variant management -- if two or more characters are considered variants of each other (common in Chinese, Japanese, and Korean), only one variant label can be registered, and the others are either blocked or allocated to the same registrant.
- Apply the Unicode Consortium's recommendations from UTS #39 and UTS #46.
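The first requirement, an explicit table of permitted code points, amounts to a simple membership check at registration time. The sketch below uses an invented table for an imaginary registry that accepts basic Latin letters, digits, hyphens, and a few German characters; real registries publish their own tables, and the names here are hypothetical.

```python
# Invented code point table for illustration, not any real registry's policy.
PERMITTED = set("abcdefghijklmnopqrstuvwxyz0123456789-") | set("\u00e4\u00f6\u00fc\u00df")


def registrable(label: str) -> bool:
    """Accept a label only if every code point is in the registry's table."""
    if not label or label.startswith("-") or label.endswith("-"):
        return False
    return all(ch in PERMITTED for ch in label)


print(registrable("m\u00fcller"))  # True
print(registrable("p\u0430ypal"))  # False: U+0430 is not in the table
```

Because the check is on code points rather than appearance, a Cyrillic lookalike is rejected even though it renders the same as a permitted letter.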
How specific registries handle it
Different registries take different approaches.
.рф (Russia) only accepts Cyrillic characters. Since you can't mix Latin into a .рф domain, homograph attacks targeting Latin-script brands on this TLD are structurally impossible -- you can't register apple.рф with Latin characters because the registry won't accept them [15].
.de (Germany) allows a limited set of Latin characters with diacritics (umlauts, accents used in German and several neighboring languages). The allowed character table is tightly curated; Cyrillic is not on it.
Chinese TLDs (.中国, .cn) implement variant management for simplified and traditional Chinese character forms. If you register a domain with a simplified character, the traditional variant is automatically reserved (or vice versa), preventing squatters from grabbing the alternate form [14].
The gTLD registries (.com, .net, .org) have more heterogeneous policies. Verisign (operator of .com and .net) restricts IDN registrations to predefined language tables and requires single-script labels, but the sheer size of the .com namespace means enforcement can never be as tight as a small ccTLD.
ICANN also requires that no IDN TLD itself can be visually confusable with an existing TLD -- a rule applied during the new gTLD program to prevent, say, a Cyrillic .com lookalike from being delegated.
The certificate authority angle
A domain that passes registration is just a name. To be truly convincing as a phishing site, it needs HTTPS with a trusted certificate.
Domain Validation (DV) certificates are the weak link. A DV cert only proves that the applicant controls the domain -- nothing about whether the domain is legitimate, non-deceptive, or run by a trustworthy entity. Let's Encrypt, the largest DV issuer, has issued certificates for domains containing "paypal" over 14,000 times (most of them phishing sites) according to one widely cited count [16]. Let's Encrypt's official position is that CAs are "not well positioned to operate anti-phishing and anti-malware operations" and that policing domain names is the responsibility of registrars and browsers [8].
They have a point. With many certificate authorities trusted by browsers, an attacker rejected by one CA can simply go to another. And DV validation is automated -- there's no human reviewer looking at the domain name and thinking "wait, that looks like it's trying to impersonate Apple."
Organization Validation (OV) and Extended Validation (EV) certificates require manual identity verification, which would likely stop an attacker trying to impersonate a well-known company. But browsers removed the distinctive EV indicator from the address bar in 2019 (Chrome 77, Firefox 70), so even if the real site has an EV cert, the user sees the same padlock on the real site and on the phishing site [7].
Certificate Transparency (CT) logs offer a defense-in-depth angle. Every publicly trusted certificate is logged in append-only CT logs, and tools like CertStream provide real-time feeds of newly issued certificates. Security teams can monitor these feeds for certificates issued to domains that resemble their brands -- including homograph variants. Facebook, for instance, monitors CT logs and sends alerts when certificates appear for domains that look like facebook.com [17]. This isn't prevention, but it enables rapid detection and takedown.
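The matching side of such monitoring might look like the sketch below. It assumes hostnames have already been pulled from a CT feed by some other component (no feed client is shown), uses a tiny hand-written lookalike table rather than the full confusables data, and all function names are hypothetical.

```python
# Small lookalike table for illustration (assumed subset of confusables data).
LOOKALIKE_TO_LATIN = {
    "\u0430": "a", "\u0435": "e", "\u043e": "o", "\u0440": "p",
    "\u0441": "c", "\u0456": "i", "\u04cf": "l",
}


def decode_label(label: str) -> str:
    """Turn an xn-- label into Unicode; leave anything else untouched."""
    if label.lower().startswith("xn--"):
        try:
            return label[4:].encode("ascii").decode("punycode")
        except UnicodeError:
            return label  # malformed Punycode: keep the raw form
    return label


def looks_like(hostname: str, brand: str) -> bool:
    """True if the leftmost label imitates brand without being brand."""
    first = decode_label(hostname.split(".")[0])
    folded = "".join(LOOKALIKE_TO_LATIN.get(ch, ch) for ch in first)
    return first != brand and folded == brand


seen_in_feed = ["xn--80ak6aa92e.com", "apple.com", "example.org"]
alerts = [h for h in seen_in_feed if looks_like(h, "apple")]
print(alerts)  # ['xn--80ak6aa92e.com']
```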
Real-world attacks and incidents
Homograph attacks aren't just a theoretical concern demonstrated by researchers. They've been used in actual phishing campaigns.
Financial services targeting
Bitdefender's 2022 threat analysis documented a pattern of homograph attacks targeting financial institutions and cryptocurrency platforms. Martin Zugec, Bitdefender's technical solutions director, noted that cryptocurrency sites are "a perfect target" because transactions are irreversible and the attack surface (wallet addresses, exchange logins) is high-value [18]. Blockchain.com was targeted by a $27 million spoofing attack in 2019 that used domain impersonation techniques.
The Microsoft Office blind spot
In 2022, Bitdefender also discovered that all Microsoft Office applications -- Outlook, Word, Excel, OneNote, PowerPoint -- were vulnerable to IDN homograph attacks in a different way than browsers [18]. When a user hovers over or clicks a link in an Office document, the application previews the link using the Unicode-decoded form, not the Punycode. So a link to xn--80ak6aa92e.com would display as apple.com in the tooltip. Unlike browsers, Office had no confusable detection at all.
Akamai's DNS traffic analysis
Akamai published a study in November 2022 that examined real DNS query traffic over a 32-day window and found 6,670 homograph IDN domains that were actively being queried [19]. Every single one was accessed by at least one device. On average, 67 new homograph domains appeared daily -- domains that had never been seen before in DNS traffic. A total of 29,071 devices accessed at least one homograph domain during the observation period.
The access patterns were telling: most devices made only 2-5 queries to homograph domains total, suggesting the visits were unintentional (likely victims clicking phishing links) rather than repeated (which would suggest legitimate use).
Detection and prevention
For individual users
The simplest protection is awareness -- if you're logging into a sensitive site, type the URL manually rather than clicking links. But that's asking a lot, and it doesn't help when the link appears in a context that discourages scrutiny (an urgent email from "your bank," for instance).
Firefox users can set network.IDN_show_punycode to true in about:config to force all IDN domains to display as Punycode. It's ugly but effective. Chrome doesn't offer an equivalent user-facing toggle, though extensions exist that highlight or block IDN domains.
For organizations
DNS-level filtering can block known homograph domains before they reach users. Services like Cisco Umbrella, Cloudflare Gateway, and Akamai's Enterprise Threat Protector maintain databases of suspicious domains that include homograph variants.
Email gateway detection is another layer. Many email security products scan URLs in messages for homograph patterns -- checking whether the domain's Punycode decodes to something that looks like a known brand.
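A first-pass filter of this kind can be as simple as pulling hostnames out of a message and flagging any that are internationalized, leaving the brand comparison to a later stage. The sketch below is an assumption about how such a filter might be structured, not a description of any product; the regular expression is deliberately loose and idn_hosts is a made-up name.

```python
import re
from urllib.parse import urlsplit

# Loose URL matcher for illustration; real gateways parse MIME and HTML too.
URL_RE = re.compile(r"https?://[^\s<>\"']+")


def idn_hosts(message: str) -> list:
    """Return hostnames in the message that contain an IDN label."""
    flagged = []
    for url in URL_RE.findall(message):
        host = urlsplit(url).hostname or ""
        has_ace = any(part.startswith("xn--") for part in host.split("."))
        if has_ace or not host.isascii():
            flagged.append(host)
    return flagged


body = "Reset here: https://xn--80ak6aa92e.com/login or see https://example.com/"
print(idn_hosts(body))  # ['xn--80ak6aa92e.com']
```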
Certificate Transparency monitoring, as mentioned above, enables proactive detection. Tools like CertStream, SSLMate's Cert Spotter, and Facebook's CT monitoring infrastructure provide real-time alerts on newly issued certificates that match suspicious patterns.
For security researchers
Several academic tools exist for systematically detecting homograph domains in the wild. ShamFinder (2019) uses automated fuzzing to generate potential homograph domains and check whether they're registered [20]. PhishHunter (2023) applies Siamese neural networks to visually compare domain renderings, achieving 99.3% detection accuracy on IDN-based phishing domains [21]. Akamai's approach monitors DNS query patterns for newly seen domains with confusable characteristics.
The Unicode Consortium's own confusables.txt data file can be used directly -- many security tools build their detection logic by iterating through possible character substitutions from this file and checking domain registration databases.
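The substitution step might be sketched as follows. The SUBSTITUTES table is a small stand-in for data that would normally be derived from confusables.txt, the function names are hypothetical, and the registration lookup itself is left out.

```python
from itertools import product

# Each Latin letter with the alternatives to try (assumed subset).
SUBSTITUTES = {
    "a": ["a", "\u0430"],
    "c": ["c", "\u0441"],
    "e": ["e", "\u0435"],
    "o": ["o", "\u043e"],
    "p": ["p", "\u0440"],
}


def variants(name: str) -> list:
    """All strings obtained by swapping letters for lookalikes, minus name."""
    pools = [SUBSTITUTES.get(ch, [ch]) for ch in name]
    candidates = ("".join(combo) for combo in product(*pools))
    return [c for c in candidates if c != name]


def to_ace(label: str) -> str:
    """ASCII-compatible form to look up in a registration database."""
    return "xn--" + label.encode("punycode").decode("ascii")


candidates = variants("ace")
print(len(candidates))  # 7
for label in candidates:
    print(to_ace(label))
```

The candidate count grows exponentially with the number of substitutable letters, so practical tools usually cap the number of substitutions or restrict output to single-script variants, since many registries reject mixed-script labels anyway.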
The arms race that can't be won
The 2021 USENIX Security paper by Hang Hu, Steve T.K. Jan, Yang Wang, and Gang Wang from the University of Illinois at Urbana-Champaign is probably the most thorough assessment of the state of play [11]. They systematically tested Chrome, Firefox, Safari, Edge, and IE (both desktop and mobile) against a battery of crafted homograph IDNs and found that every browser could be bypassed.
Safari caught 90.3% of homograph domains. Firefox caught only 6.1%. Chrome fell somewhere in between. The researchers showed that browser defenses are not monotonically improving -- Chrome had actually reversed some rules over time to re-allow certain IDN patterns that it had previously blocked, presumably to avoid breaking legitimate domains.
The user study component was equally sobering. Even when homograph IDNs bypassed browser defenses and were displayed in Unicode, users who were warned to look for suspicious URLs performed only marginally better at identifying fakes than those who weren't warned. The odds of correct identification improved by a factor of 1.59 for users with more than 3 years of web browsing experience -- but that still left a large majority fooled.
The fundamental problem is that visual similarity is inherently subjective and context-dependent. Whether two characters "look the same" depends on the font, the rendering engine, the screen resolution, and the user's familiarity with the scripts involved. Unicode has over 154,000 characters and that number grows with each version. You can't enumerate every possible confusable pair, because the visual similarity is a property of the rendering, not the code points themselves. A new font could create new confusable pairs that no one anticipated.
This is, in a real sense, an unsolvable problem. You can make it harder for attackers -- browser heuristics, registry restrictions, CT monitoring -- but you can't eliminate the fundamental tension between wanting to display domain names in people's native scripts and wanting to prevent visual deception. The 2005 fix didn't hold. The 2017 fix was an improvement but still has gaps. And every year, Unicode adds more characters from more scripts, expanding the attack surface.
The best defense remains layered: registries that restrict confusable registrations, browsers that detect and flag them, CAs that log everything transparently, security products that monitor DNS and CT logs, and -- ultimately -- users who know that a padlock and a familiar-looking URL are not guarantees of authenticity.
Citations
1. Unicode Consortium: Unicode 16.0.0. Released September 10, 2024. Retrieved March 1, 2026.
2. Evgeniy Gabrilovich and Alex Gontmakher: The Homograph Attack. Communications of the ACM, 45(2):128, February 2002.
3. Computerworld: Experts: International domain names may pose threat. February 2005. Retrieved March 1, 2026.
4. ICANN: ICANN Statement on IDN Homograph Attacks and Request for Public Comment. February 23, 2005. Retrieved March 1, 2026.
5. Mozilla Bugzilla: Bug 279099 -- Protect against homograph attacks (spoofing using punycode IDNs). Retrieved March 1, 2026.
6. Xudong Zheng: Phishing with Unicode Domains. April 2017. Retrieved March 1, 2026.
7. Chromium: Internationalized Domain Names (IDN) in Google Chrome. Retrieved March 1, 2026.
8. Let's Encrypt: The CA's Role in Fighting Phishing and Malware. October 29, 2015. Retrieved March 1, 2026.
9. Unicode Technical Standard #39: Unicode Security Mechanisms. Retrieved March 1, 2026.
10. Unicode Consortium: confusables.txt. Retrieved March 1, 2026.
11. Hang Hu, Steve T.K. Jan, Yang Wang, Gang Wang: Assessing Browser-level Defense against IDN-based Phishing. 30th USENIX Security Symposium, 2021.
12. Mozilla Wiki: IDN Display Algorithm. Retrieved March 1, 2026.
13. Mozilla Bugzilla: Bug 1332714 -- IDN Phishing using whole-script confusables on Windows and Linux. Retrieved March 1, 2026.
14. ICANN: IDN Implementation Guidelines. Version 4.1, April 2025. Retrieved March 1, 2026.
15. ICANN: IDN ccTLD Fast Track Process. Retrieved March 1, 2026.
16. Bleeping Computer: 14,766 Let's Encrypt SSL Certificates Issued to PayPal Phishing Sites. Retrieved March 1, 2026.
17. Hardenize: Certificate Transparency Monitoring for Phishing Detection. Retrieved March 1, 2026.
18. Bitdefender: Homograph Phishing Attacks -- When User Awareness Is Not Enough. 2022. Retrieved March 1, 2026.
19. Akamai: Watch Your Step: The Prevalence of IDN Homograph Attacks. November 2022. Retrieved March 1, 2026.
20. Hiroaki Suzuki et al.: ShamFinder: An Automated Framework for Detecting IDN Homographs. 2019.
21. PhishHunter: Detecting camouflaged IDN-based phishing attacks via Siamese neural network. Computers & Security, December 2023.
Updated: March 1, 2026