Generating test cases for Unicode-enabled software

September 10th, 2008 by Chris Weber

When it comes to Unicode implementations, there’s a rich set of test
cases to perform. Recognizing that is the start. Automating the tests
is the next step.

At a high level, Unicode-related security bugs can be categorized by the following root causes:

Canonicalization

  • Interpreting non-shortest form (e.g. UTF-8 encoding trickery)
  • Other decoding issues
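
As a starting point for automating the non-shortest form case, here is a minimal Python sketch that builds overlong UTF-8 encodings of an ASCII character and checks whether a decoder rejects them. The helper names (overlong_utf8, is_rejected) are hypothetical, and Python's built-in strict decoder stands in for the software under test.

```python
# Lead-byte markers for 2-, 3- and 4-byte UTF-8 sequences.
LEAD = {2: 0xC0, 3: 0xE0, 4: 0xF0}


def overlong_utf8(ch: str, length: int) -> bytes:
    """Encode an ASCII character in a non-shortest UTF-8 form of `length` bytes."""
    cp = ord(ch)
    if cp > 0x7F or length not in LEAD:
        raise ValueError("this sketch only handles ASCII input and lengths 2-4")
    tail = []
    for _ in range(length - 1):
        tail.append(0x80 | (cp & 0x3F))  # continuation byte: 10xxxxxx
        cp >>= 6
    return bytes([LEAD[length] | cp] + tail[::-1])


def is_rejected(raw: bytes) -> bool:
    """True if Python's strict UTF-8 decoder refuses the byte sequence."""
    try:
        raw.decode("utf-8")
    except UnicodeDecodeError:
        return True
    return False


# Overlong forms of "/" are a classic path-traversal test input.
for n in (2, 3, 4):
    raw = overlong_utf8("/", n)
    print(raw.hex(), "rejected" if is_rejected(raw) else "ACCEPTED")
```

To test a different decoder, replace the body of is_rejected with a call into the software under test. Any overlong form that decodes to "/" is a finding.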

Absorption (over-consumption)

  • Over-consuming invalid byte sequences or correcting rather than failing
  • E.g. when <41 C2 C3 B1 42> becomes <41 42>: the stray lead byte C2 swallows the valid sequence C3 B1
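
The absorption case lends itself to generated inputs: place a truncated multi-byte sequence directly in front of a valid character and check that the valid character survives decoding. The sketch below is one way to do that. The names absorption_cases and over_consumes are hypothetical, and Python's decoder is used only as an example of a decoder under test.

```python
# Truncated multi-byte sequences: a lead byte (plus some continuation bytes)
# with the final continuation byte missing.
TRUNCATED = [b"\xc2", b"\xe2\x82", b"\xf0\x9f\x98"]


def absorption_cases(follower: str = "\u00f1"):
    """Yield byte strings where a truncated sequence precedes a valid character."""
    for prefix in TRUNCATED:
        yield b"A" + prefix + follower.encode("utf-8") + b"B"


def over_consumes(decode, raw: bytes, follower: str = "\u00f1") -> bool:
    """True if `decode` lost the valid follower character or the trailing B."""
    text = decode(raw)
    return follower not in text or not text.endswith("B")


def python_decode(raw: bytes) -> str:
    # Stand-in for the decoder under test; invalid bytes become U+FFFD.
    return raw.decode("utf-8", errors="replace")


for raw in absorption_cases():
    verdict = "over-consumes" if over_consumes(python_decode, raw) else "ok"
    print(raw.hex(), verdict)
```

The first generated case is the byte sequence <41 C2 C3 B1 42> from the list above.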

Character deletion and swallowing

  • “deletion of noncharacters” (UTR-36)
  • E.g. when <scr[U+FEFF]ipt> becomes <script>
  • Substitute a replacement character (U+FFFD) instead of deleting!
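
A deletion test can be generated by inserting a candidate character at every interior position of a sensitive token. The following Python sketch does that and contrasts deleting with replacing. Only U+FEFF comes from the example above; the other candidate characters and all function names are assumptions chosen for illustration.

```python
# Characters that some filters silently drop. U+FEFF is from the example
# above; the rest are assumed candidates worth trying.
CANDIDATES = ["\ufeff", "\u200b", "\u00ad", "\u2060", "\ufffe"]


def deletion_cases(token: str = "script"):
    """Yield <token> with each candidate inserted at each interior position."""
    for ch in CANDIDATES:
        for i in range(1, len(token)):
            yield "<" + token[:i] + ch + token[i:] + ">"


def delete_chars(text: str) -> str:
    """The risky behavior: drop the characters, re-forming the token."""
    return "".join(c for c in text if c not in CANDIDATES)


def replace_chars(text: str) -> str:
    """The safer behavior: substitute U+FFFD so the token stays broken."""
    return "".join("\ufffd" if c in CANDIDATES else c for c in text)


for case in deletion_cases():
    assert delete_chars(case) == "<script>"   # deletion re-creates the tag
    assert replace_chars(case) != "<script>"  # replacement does not
```

In a real test run, delete_chars and replace_chars would be swapped for the filter under test, and any output equal to <script> would be flagged.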

Interpreting syntax replacements

  • White space and line feeds
  • E.g. when U+180E acts like U+0020
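
One way to generate inputs for this category is to collect every character in the Unicode separator categories, add a few known troublemakers by hand, and drop each one into a position where the parser expects a space. This is a sketch under stated assumptions: the EXTRA list and the HTML-like template are illustrative choices, not a complete set. U+180E is listed by hand because the Unicode data shipped with a given Python version may not classify it as a space separator.

```python
import sys
import unicodedata

# Hand-picked additions (assumed): U+180E from the example above, plus
# NEL, vertical tab and form feed.
EXTRA = ["\u180e", "\u0085", "\u000b", "\u000c"]


def whitespace_like():
    """Return characters that a parser might treat as white space or a line break."""
    found = {
        chr(cp)
        for cp in range(sys.maxunicode + 1)
        if unicodedata.category(chr(cp)) in ("Zs", "Zl", "Zp")
    }
    return sorted(found | set(EXTRA))


def syntax_cases(template: str = "<a{}href=x>"):
    """Substitute each candidate where the syntax expects a plain space."""
    return [template.format(ch) for ch in whitespace_like()]


for case in syntax_cases():
    print(case.encode("unicode_escape").decode("ascii"))
```

Each generated string is then fed to the parser under test. If the attribute is recognized, that character is acting like U+0020.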

Best-fit mappings

  • When σ becomes s
  • When ′ (U+2032) becomes ' (U+0027)
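
Best-fit test inputs can be built by working backwards: take a payload and swap each dangerous character for a character that a best-fit conversion might turn back into it. The sketch below uses a two-entry table based on the examples above. The ASCII apostrophe as the target of the prime, the choice of cp1252 as the target code page, and all names are assumptions. Python's own codec is strict and does not perform best-fit mapping, so it serves here as the expected behavior.

```python
# Source character -> the ASCII character a best-fit conversion may produce.
# Illustrative entries only; a real table would be much larger.
BEST_FIT_SAMPLES = {"\u03c3": "s", "\u2032": "'"}


def best_fit_cases(payload: str):
    """Yield variants of `payload` with best-fit source characters swapped in."""
    for source, target in BEST_FIT_SAMPLES.items():
        if target in payload:
            yield payload.replace(target, source)


def encodes_strictly(text: str, codec: str = "cp1252") -> bool:
    """True if the codec refuses the text instead of substituting look-alikes."""
    try:
        text.encode(codec)
    except UnicodeEncodeError:
        return True
    return False


for case in best_fit_cases("<script>"):
    print(case, "refused" if encodes_strictly(case) else "converted")
```

Against the software under test, the check is whether the converted output contains the original payload, for example <script> re-formed from a string containing σ.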

Buffer overruns

  • Incorrect assumptions about string sizes (chars vs. bytes)
  • Improper width calculations
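
Size assumptions are easy to probe with strings whose length differs depending on the unit being counted. The following Python sketch reports code points, UTF-8 bytes and UTF-16 code units for the same text, and builds boundary inputs that are exactly `limit` characters long but much larger in bytes. The sample characters and function names are assumptions for illustration.

```python
# One sample per UTF-8 length: 1, 2, 3 and 4 bytes per character.
SAMPLES = ["A", "\u00f1", "\u20ac", "\U0001F600"]


def sizes(text: str) -> dict:
    """Measure the same string in three different units."""
    return {
        "code_points": len(text),
        "utf8_bytes": len(text.encode("utf-8")),
        "utf16_units": len(text.encode("utf-16-le")) // 2,
    }


def boundary_cases(limit: int):
    """Strings that are exactly `limit` code points long."""
    return [ch * limit for ch in SAMPLES]


for case in boundary_cases(8):
    print(sizes(case))

# Case mapping can also change length: one character in, two out.
print(len("\u00df"), len("\u00df".upper()))
```

A field that accepts 8 "characters" into an 8-byte buffer is the kind of bug these inputs are meant to surface.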

Timing issues

  • Handling Unicode after security gates
  • Sometimes handling Unicode before a gate can be a problem too, e.g. BOM handling
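
The ordering problem can be shown with a toy filter. In the sketch below, NFKC normalization stands in for "handling Unicode" and a substring check stands in for the security gate. Both are assumptions chosen for illustration, as are the function names. The payload uses fullwidth angle brackets (U+FF1C, U+FF1E), which NFKC maps to ASCII < and >.

```python
import unicodedata


def gate(text: str) -> str:
    """A toy security gate: refuse anything containing an opening script tag."""
    if "<script" in text.lower():
        raise ValueError("blocked by gate")
    return text


def normalize_after_gate(text: str) -> str:
    """Risky order: the gate sees the raw text, then the text changes."""
    return unicodedata.normalize("NFKC", gate(text))


def normalize_before_gate(text: str) -> str:
    """The gate sees the same text that downstream code will see."""
    return gate(unicodedata.normalize("NFKC", text))


payload = "\uff1cscript\uff1e"
print(normalize_after_gate(payload))  # the tag re-forms after the gate

try:
    normalize_before_gate(payload)
except ValueError as exc:
    print(exc)
```

Normalizing first fixes this particular case, but as the BOM example above suggests, processing before a gate can introduce problems of its own, so both orders deserve test coverage.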
