AD5016: NoUnicodeSymbolsMachO¶
Summary¶
| Property | Value |
|---|---|
| ID | AD5016 |
| Name | NoUnicodeSymbolsMachO |
| Category | Correctness |
| Severity | Warning |
| Applies to | Mach-O (macOS, iOS) |
Description¶
Symbol names containing non-ASCII (Unicode) characters can be used for malicious purposes, most notably homograph attacks.
Homograph Attacks¶
An attacker can define a symbol that appears visually identical to a legitimate symbol but uses different Unicode characters:
- Cyrillic а (U+0430) vs ASCII a (U+0061)
- Greek ο (U+03BF) vs ASCII o (U+006F)
- Many other lookalike characters
This could allow:
- Malicious code to masquerade as legitimate library functions
- Confusion during security audits and code review
- Obfuscation of malicious behavior
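To make the lookalike problem concrete, the following minimal Python sketch compares an ASCII name with a spoofed one. The variable and function names are illustrative only and are not part of the rule.

```python
# Two names that render the same in many fonts but are different strings.
ascii_name = "malloc"
spoofed_name = "m\u0430lloc"  # second letter is Cyrillic U+0430, not ASCII U+0061

def code_points(name):
    """Return the code points of a name as 'U+XXXX' strings."""
    return [f"U+{ord(ch):04X}" for ch in name]

print(ascii_name == spoofed_name)    # False
print(code_points(ascii_name)[1])    # U+0061
print(code_points(spoofed_name)[1])  # U+0430
print(spoofed_name.encode("utf-8"))  # b'm\xd0\xb0lloc'
```

Because the two names are distinct byte sequences, a linker treats them as unrelated symbols even though a reviewer may read them as the same name.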
Why This Matters¶
Unicode 13.0 defines more than 143,000 characters across 154 scripts, and a number of them are visually identical or nearly identical to ASCII characters. This creates a security risk known as a "homograph attack" or "visual spoofing," where malicious code hides in plain sight by using lookalike characters.
The Trojan Source Threat¶
In 2021, researchers at the University of Cambridge published "Trojan Source" (CVE-2021-42574, with the homoglyph variant tracked as CVE-2021-42694), demonstrating how Unicode-based attacks could plant vulnerabilities in source code that pass human review unnoticed. Although the research targeted source code, the same visual-spoofing principle applies to symbol names inspected during binary analysis.
Attack Scenarios¶
- Function Impersonation: An attacker creates a function named mаlloc (using Cyrillic 'а') that looks identical to malloc in most fonts and terminals. The malicious function might:
    - Log allocation sizes and addresses for later exploitation
    - Return attacker-controlled memory regions
    - Introduce subtle memory corruption
- Supply Chain Attacks: A compromised build system or dependency could introduce symbols with Unicode characters that:
    - Hook security-critical functions while evading symbol-based detection
    - Create backdoors that audit tools miss because they search for ASCII function names
    - Persist through code reviews that don't render Unicode distinctly
- Malware Obfuscation: Malware authors use Unicode symbols to:
    - Evade signature-based detection that matches ASCII strings
    - Confuse reverse engineers examining the binary
    - Hide malicious functionality in apparently normal symbol names
Confusable Characters¶
Some particularly dangerous confusables include:
| ASCII | Lookalike | Script | Unicode |
|---|---|---|---|
| a | а | Cyrillic | U+0430 |
| e | е | Cyrillic | U+0435 |
| o | о | Cyrillic | U+043E |
| p | р | Cyrillic | U+0440 |
| c | с | Cyrillic | U+0441 |
| x | х | Cyrillic | U+0445 |
| O | Ο | Greek | U+039F |
| A | Α | Greek | U+0391 |
These characters are pixel-perfect matches in many fonts, making visual detection nearly impossible.
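One way to use a table like this is to map each lookalike back to its ASCII counterpart and check whether the result collides with a well-known name. The sketch below assumes only the eight mappings listed above, so it is deliberately incomplete; `CONFUSABLES`, `ascii_skeleton`, and `impersonates` are hypothetical names, not part of the rule's implementation.

```python
# Lookalike -> ASCII mapping copied from the table above (not exhaustive).
CONFUSABLES = {
    "\u0430": "a", "\u0435": "e", "\u043e": "o", "\u0440": "p",
    "\u0441": "c", "\u0445": "x", "\u039f": "O", "\u0391": "A",
}

def ascii_skeleton(name):
    """Replace every known lookalike character with its ASCII counterpart."""
    return "".join(CONFUSABLES.get(ch, ch) for ch in name)

def impersonates(name, known_names):
    """Return the known ASCII name that a non-ASCII symbol imitates, or None."""
    if name.isascii():
        return None  # pure ASCII names cannot be homographs in this sense
    skeleton = ascii_skeleton(name)
    return skeleton if skeleton in known_names else None

known = {"malloc", "free", "open"}
print(impersonates("m\u0430lloc", known))  # malloc
print(impersonates("malloc", known))       # None
```

A production check would more likely draw on the full Unicode confusables data rather than a hand-written table.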
Why Compiled Binaries Rarely Have Unicode¶
Legitimate C, C++, Objective-C, and Rust code almost never produces Unicode symbols because:
- Standard C and C++ historically supported only ASCII identifiers (C99 and C++11 added Unicode support, but it is rarely used)
- Objective-C method names are typically ASCII
- System libraries and frameworks use ASCII exclusively
- Build tools and linkers may not handle Unicode correctly
The presence of non-ASCII symbols in a compiled binary is therefore a strong anomaly indicator that warrants investigation.
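Because Mach-O symbol names are stored as NUL-terminated byte strings in the string table referenced by `LC_SYMTAB`, detecting the anomaly can be as simple as looking for bytes outside the ASCII range. The following is a minimal sketch under that assumption; it takes the raw string table as input rather than parsing a Mach-O file, and `find_non_ascii_symbols` is a hypothetical helper, not the tool's actual implementation.

```python
def find_non_ascii_symbols(string_table):
    """Yield (offset, raw_name) for each NUL-terminated name containing a byte >= 0x80."""
    offset = 0
    for raw in string_table.split(b"\x00"):
        if raw and any(byte >= 0x80 for byte in raw):
            yield offset, raw
        offset += len(raw) + 1  # skip the name and its NUL terminator

# Example string table: the second name contains Cyrillic U+0430 encoded as UTF-8.
table = b"\x00_malloc\x00_m\xd0\xb0lloc\x00_free\x00"
for offset, raw in find_non_ascii_symbols(table):
    print(offset, raw.decode("utf-8", errors="replace"))
```

Working on raw bytes avoids decoding failures when a name is not valid UTF-8, which is itself worth flagging.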
Performance Impact¶
This check has no runtime performance impact. It runs only during static analysis, and scanning the symbol table is a fast operation.
Resolution¶
Review Unicode Symbols¶
If this rule triggers, manually review the flagged symbols:
- Legitimate cases: Some frameworks or libraries may use Unicode for internationalization. Document and track these.
- Suspicious cases: If Unicode symbols appear to mimic standard library or system functions, investigate further.
- Malware indicators: Symbols designed to look like legitimate functions (e.g., using Cyrillic to spell "malloc") are strong malware indicators.
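Manual review is easier when each flagged symbol is listed with the exact characters that triggered the rule. The sketch below builds such a review entry and checks the symbol against a list of already documented cases; the `REVIEWED_SYMBOLS` allowlist and its single entry are invented for illustration.

```python
import unicodedata

# Hypothetical allowlist of non-ASCII symbols that were reviewed and documented.
REVIEWED_SYMBOLS = {"_gr\u00fc\u00dfen"}

def review_entry(symbol, reviewed=REVIEWED_SYMBOLS):
    """Describe the non-ASCII characters in a symbol and whether it is already documented."""
    non_ascii = [
        (index, f"U+{ord(ch):04X}", unicodedata.name(ch, "UNKNOWN"))
        for index, ch in enumerate(symbol)
        if ord(ch) > 0x7F
    ]
    status = "documented" if symbol in reviewed else "needs review"
    return {"symbol": symbol, "status": status, "non_ascii": non_ascii}

entry = review_entry("_m\u0430lloc")
print(entry["status"])     # needs review
print(entry["non_ascii"])  # [(2, 'U+0430', 'CYRILLIC SMALL LETTER A')]
```

Showing the Unicode character name next to each position makes a Cyrillic or Greek letter inside an otherwise Latin name stand out immediately.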
For Developers¶
Avoid using non-ASCII characters in symbol names:
#include <stddef.h>
// Bad - uses lookalike characters
void *mаlloc(size_t size); // 'а' is Cyrillic U+0430, so this is not malloc
// Good - ASCII only
void *custom_malloc(size_t size);
Build System¶
Consider adding linting to catch non-ASCII identifiers:
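A minimal sketch of such a lint step in Python is shown here. It flags any identifier-like token containing non-ASCII characters. It does not parse the language, so words inside comments and string literals are flagged too; the function names and the report format are assumptions made for illustration.

```python
import re

# Unicode-aware "word" pattern: a letter or underscore followed by word characters.
IDENT_RE = re.compile(r"[^\W\d]\w*")

def lint_text(text, filename="<input>"):
    """Return (filename, line, column, token) for each non-ASCII identifier-like token."""
    findings = []
    for lineno, line in enumerate(text.splitlines(), start=1):
        for match in IDENT_RE.finditer(line):
            if not match.group().isascii():
                findings.append((filename, lineno, match.start() + 1, match.group()))
    return findings

def lint_paths(paths):
    """Lint each file, print a report, and return 1 if anything was found, else 0."""
    findings = []
    for path in paths:
        with open(path, encoding="utf-8", errors="replace") as handle:
            findings.extend(lint_text(handle.read(), path))
    for filename, lineno, column, token in findings:
        print(f"{filename}:{lineno}:{column}: non-ASCII identifier-like token {token!r}")
    return 1 if findings else 0
```

Calling `lint_paths` with the changed files from a pre-commit hook or CI job, and failing the build on a non-zero result, would catch lookalike identifiers before they reach the compiler.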