Unicode Normalization and Android
I ran into a Unicode rendering issue recently that I wanted to go over. Check out this screenshot from an application we've recently localized into Vietnamese:
Notice how a lot of the text looks cut off (especially those blue labels). What's going on?
It turns out that the TextViews are rendering diacritics improperly. Sometimes the diacritic ends up on the wrong character; other times it ends up in a space of its own. The layout system seems to compute the width correctly, but the renderer gets it wrong, so the text is cut off when layout_width is set to wrap_content. The root of the problem is that Android is mishandling combining characters (also called combining diacritical marks) in Unicode.
The solution that we found was Unicode normalization. A character with diacritics can always be represented by a combination of multiple code points (a base letter followed by combining marks), but sometimes there is also a single precomposed code point that represents the same character. We found that by using Unicode's Normalization Form C (NFC), we could replace most of the combining sequences in our text with their precomposed equivalents (and thus sidestep this problem altogether).
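To make that concrete, here is a minimal sketch of what NFC does to a Vietnamese word written with combining marks. The class and method names (NfcDemo, toNfc) are made up for illustration; the only real API used is java.text.Normalizer.

```java
import java.text.Normalizer;

final class NfcDemo {
    /** Returns the NFC form of the input (hypothetical helper name). */
    static String toNfc(String s) {
        return Normalizer.normalize(s, Normalizer.Form.NFC);
    }

    public static void main(String[] args) {
        // "Viet" with diacritics, spelled as e + U+0323 (dot below) + U+0302 (circumflex).
        String decomposed = "Vie\u0323\u0302t";
        String composed = toNfc(decomposed);

        System.out.println(decomposed.length());           // 6 UTF-16 code units
        System.out.println(composed.length());             // 4: the sequence collapses to U+1EC7
        System.out.println(composed.equals("Vi\u1EC7t"));  // true

        // Not every sequence has a precomposed form, which is why NFC only
        // removes "most" combining marks: q + U+0301 stays as two code points.
        System.out.println(toNfc("q\u0301").length());     // 2
    }
}
```

The last case is the reason NFC is a workaround rather than a complete fix: any sequence without a precomposed code point still reaches the renderer as combining marks.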
There are two steps we're taking to normalize our text:
1. For all strings that are in our APK, we normalized them inside strings.xml. There are plenty of tools out there (like charlint) that can do the job.
2. We run all strings delivered from servers through Normalizer while parsing. It did not seem to degrade performance to do so.
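Normalizing server strings while parsing might look roughly like the sketch below. This is not our actual parser: the ServerText class, the nfc helper, and the one-label-per-line response format are all assumptions for illustration. The isNormalized check is optional; it just avoids allocating a new string when the text is already in NFC.

```java
import java.text.Normalizer;
import java.util.ArrayList;
import java.util.List;

final class ServerText {
    /** Null-safe NFC normalization; returns the input unchanged if it is already NFC. */
    static String nfc(String s) {
        if (s == null || Normalizer.isNormalized(s, Normalizer.Form.NFC)) {
            return s;
        }
        return Normalizer.normalize(s, Normalizer.Form.NFC);
    }

    /** Hypothetical parser: treats each line of the response body as one label. */
    static List<String> parseLabels(String body) {
        List<String> labels = new ArrayList<>();
        for (String line : body.split("\n")) {
            // Normalize at the parsing boundary so nothing downstream sees combining sequences.
            labels.add(nfc(line));
        }
        return labels;
    }
}
```

Doing this once at the parsing boundary means the rest of the app, including every TextView, only ever sees normalized strings.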
Step 2 may not be necessary if your app is self-contained and never gets any strings from the network. Also, you could just use Normalizer as your tool for step 1 (it's up to you).
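If you do go the Normalizer route for the bundled strings, a tiny command-line tool is enough. The sketch below is an assumption about how you might do it, not the tool we used: StringsXmlNfc and normalizeUtf8 are hypothetical names, and it assumes the file is UTF-8 and stores the text as literal characters. Character references such as &#x0323; are left untouched, and in principle a combining mark placed directly after markup could compose with it, so it is worth diffing the output.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.text.Normalizer;

final class StringsXmlNfc {
    /** Decodes UTF-8 bytes, normalizes the text to NFC, and re-encodes it as UTF-8. */
    static byte[] normalizeUtf8(byte[] utf8) {
        String text = new String(utf8, StandardCharsets.UTF_8);
        String nfc = Normalizer.normalize(text, Normalizer.Form.NFC);
        return nfc.getBytes(StandardCharsets.UTF_8);
    }

    // Usage: java StringsXmlNfc <input file> <output file>
    public static void main(String[] args) throws IOException {
        Path in = Paths.get(args[0]);
        Path out = Paths.get(args[1]);
        Files.write(out, normalizeUtf8(Files.readAllBytes(in)));
    }
}
```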
We only saw this problem on ICS (Android 4.0); it appears to work fine on Honeycomb and below. Also, if you want to know more about Unicode normalization, I highly recommend reading this article.