
AD3021: NoUnicodeSymbols

Summary

| Property | Value |
|----------|-------|
| ID | AD3021 |
| Name | NoUnicodeSymbols |
| Category | Correctness |
| Severity | Warning |
| Applies to | ELF (Linux/Unix) |

Description

This rule checks that ELF binaries do not contain symbols with Unicode characters that could be used for "Trojan Source" attacks or other obfuscation techniques. Such characters can make a name or a line of code look different to a human reader from what the toolchain actually processes.

Why This Matters

Unicode-based attacks like "Trojan Source" can make malicious code invisible to human reviewers while remaining functional to compilers. These attacks target the supply chain and code review process, making them particularly dangerous.

The Trojan Source Attack (CVE-2021-42574)

Bidirectional Unicode characters can reorder how code displays:

# Schematic only. What a reviewer sees in an editor that applies Bidi reordering:
if access_granted:
    safe_action()
    # return evil_action()

# What the interpreter actually parses (logical character order):
if access_granted:
    # safe_action()         <- actually inside the comment
    return evil_action()    # <- actually runs

Attack Categories

| Attack Type | Technique | Risk |
|-------------|-----------|------|
| Trojan Source | Bidirectional text | Hidden malicious code |
| Homoglyph | Look-alike chars | Typosquatting, confusion |
| Zero-width | Invisible chars | Hide code differences |

Bidirectional Control Characters

| Character | Code Point | Effect |
|-----------|------------|--------|
| RLO | U+202E | Right-to-Left Override |
| LRO | U+202D | Left-to-Right Override |
| RLE | U+202B | Right-to-Left Embedding |
| LRE | U+202A | Left-to-Right Embedding |
| PDF | U+202C | Pop Directional Formatting |
| RLI / LRI | U+2067 / U+2066 | Isolate variants |
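
A minimal Python sketch of how a string could be checked for the characters in the table above. The name `find_bidi_controls` is hypothetical, and the mapping also includes FSI (U+2068) and PDI (U+2069), the two remaining isolate controls, which the table does not list. This is an illustration, not how aldur itself performs the check.

```python
# Bidirectional control characters from the table, plus FSI and PDI
# (assumption: a scanner would normally want the full isolate set).
BIDI_CONTROLS = {
    "\u202A": "LRE",
    "\u202B": "RLE",
    "\u202C": "PDF",
    "\u202D": "LRO",
    "\u202E": "RLO",
    "\u2066": "LRI",
    "\u2067": "RLI",
    "\u2068": "FSI",
    "\u2069": "PDI",
}


def find_bidi_controls(text):
    """Return (index, abbreviation, code point) for each Bidi control in text."""
    return [
        (i, BIDI_CONTROLS[ch], f"U+{ord(ch):04X}")
        for i, ch in enumerate(text)
        if ch in BIDI_CONTROLS
    ]


# The control character is written as an escape so it stays visible here.
print(find_bidi_controls("ab\u202Ecd"))  # [(2, 'RLO', 'U+202E')]
```

Writing the characters as `\u` escapes keeps the example itself free of invisible controls.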

Homoglyph Attack Example

// Legitimate function
void authenticate(char* password) { ... }

// Malicious function with Cyrillic 'а' (U+0430)
void аuthenticate(char* password) {
    log_password_to_attacker(password);
    real_authenticate(password);
}

// The compiler treats these as two distinct symbols, but a reviewer
// reading a call site cannot tell which one is being called.
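
The difference between the two names can be made visible by printing the code point and Unicode name of every non-ASCII character. The following Python sketch uses a hypothetical helper, `describe_non_ascii`, and writes the Cyrillic letter as an escape so the example cannot be misread.

```python
import unicodedata


def describe_non_ascii(identifier):
    """List (index, code point, Unicode name) for each non-ASCII character."""
    return [
        (i, f"U+{ord(ch):04X}", unicodedata.name(ch, "<unnamed>"))
        for i, ch in enumerate(identifier)
        if ord(ch) > 0x7F
    ]


latin = "authenticate"
spoofed = "\u0430uthenticate"  # first letter is CYRILLIC SMALL LETTER A

print(latin == spoofed)             # False: the names are different strings
print(describe_non_ascii(latin))    # []
print(describe_non_ascii(spoofed))  # [(0, 'U+0430', 'CYRILLIC SMALL LETTER A')]
```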

Supply Chain Implications

1. Attacker submits "innocent" pull request
2. Code review: Looks harmless to humans
3. Merge: Malicious code enters codebase
4. Build: Compiler sees real (malicious) code
5. Distribution: Malware in legitimate package

Detection Challenges

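| Environment | Visibility |
|-------------|------------|
| Most text editors | Characters are invisible or rendered confusingly |
| GitHub (updated) | Now warns on Bidi characters |
| Diff tools | Often do not display them |
| Command line | May or may not display them |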

Defense Layers

| Layer | Protection |
|-------|------------|
| Compiler warnings | -Wbidi-chars |
| Binary analysis | This rule |
| Code review tools | Unicode highlighting |
| CI/CD gates | Reject suspicious commits |

In summary, the rule matters for these reasons:

  • Trojan Source attacks: Bidirectional Unicode characters can reorder code visually
  • Homoglyph attacks: Look-alike characters can disguise malicious functions
  • Code review bypass: Unicode tricks can hide malicious code from reviewers
  • Supply chain security: Non-ASCII symbols can indicate a compromised dependency

Dangerous Unicode Categories

  1. Bidirectional control characters: RLO, LRO, RLE, LRE, PDF, etc.
  2. Homoglyphs: Characters that look like ASCII but aren't (е vs e, а vs a)
  3. Zero-width characters: ZWSP (U+200B), ZWNJ (U+200C), ZWJ (U+200D)
  4. Confusable characters: Mathematical symbols that resemble letters

How to Fix

Use ASCII-only identifiers

// Bad: Contains Cyrillic 'а' (U+0430) instead of ASCII 'a'
void dаnger() { } // Looks like "danger" but isn't

// Good: ASCII only
void danger() { }

Enable compiler warnings

# GCC 12+. For Clang toolchains, clang-tidy 14+ provides the misc-misleading-bidirectional check.
gcc -Wbidi-chars=any -Werror=bidi-chars myapp.c

Scan for suspicious characters

# Find non-ASCII characters in source files (needs GNU grep with -P / PCRE support)
grep -rP '[^\x00-\x7F]' src/
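
Where `grep -P` is unavailable, or where the report should name each character, a small script can do the same job. The Python sketch below is one possible approach; `scan_text` and `scan_tree` are hypothetical names, and files are assumed to be UTF-8 (undecodable bytes show up as U+FFFD).

```python
import os
import unicodedata


def scan_text(text):
    """Yield (line, column, code point, name) for every non-ASCII character."""
    # Split on "\n" only, so Unicode line separators are reported, not consumed.
    for line_no, line in enumerate(text.split("\n"), start=1):
        for col, ch in enumerate(line, start=1):
            if ord(ch) > 0x7F:
                name = unicodedata.name(ch, "<unnamed>")
                yield line_no, col, f"U+{ord(ch):04X}", name


def scan_tree(root):
    """Scan every file under root and return (path, line, column, cp, name)."""
    hits = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for filename in filenames:
            path = os.path.join(dirpath, filename)
            with open(path, encoding="utf-8", errors="replace") as fh:
                hits.extend((path, *hit) for hit in scan_text(fh.read()))
    return hits


# Demo on a single line; the Cyrillic letter is written as an escape.
for hit in scan_text("void d\u0430nger() { }"):
    print(hit)  # (1, 7, 'U+0430', 'CYRILLIC SMALL LETTER A')
```

Calling `scan_tree("src/")` and failing the build when the result is non-empty would turn this into a simple CI gate.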

Detection Method

aldur scans all symbol names in:

  • .symtab (symbol table)
  • .dynsym (dynamic symbols)
  • .strtab and .dynstr (string tables)
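
The core of such a scan can be sketched in a few lines of Python. This is not aldur's implementation: it assumes the raw bytes of a string table (.strtab or .dynstr) have already been extracted, relies on the ELF convention that a string table is a run of NUL-terminated names, and assumes names are UTF-8 encoded. The function name `suspicious_names` is hypothetical.

```python
import unicodedata


def suspicious_names(strtab):
    """Map each non-ASCII name in a raw string table to its offending characters.

    strtab: bytes of an ELF string table (NUL-terminated names, back to back).
    """
    report = {}
    for raw in strtab.split(b"\x00"):
        if not raw or raw.isascii():
            continue  # empty entry or plain ASCII name: nothing to report
        name = raw.decode("utf-8", errors="replace")
        report[name] = [
            (f"U+{ord(ch):04X}", unicodedata.name(ch, "<unnamed>"))
            for ch in name
            if ord(ch) > 0x7F
        ]
    return report


# Hand-built stand-in for a string table: "", "main", and "m<U+0430>in".
demo = b"\x00main\x00" + "m\u0430in".encode("utf-8") + b"\x00"
for name, chars in suspicious_names(demo).items():
    for code_point, char_name in chars:
        print(f"Symbol {name!r} contains non-ASCII character {code_point} ({char_name})")
```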

Example

Warning: Binary contains suspicious Unicode symbols

Symbol 'mаin' contains non-ASCII character U+0430 (CYRILLIC SMALL LETTER A)

Pass: All symbols use ASCII characters only

No suspicious Unicode characters in symbol names
