2011-11-17

Removing diacritics from a string

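Recently I had to write a function that removes all diacritics from a string (e.g. turning José into Jose). Searching the web, I quickly found the blog post “Stripping is an interesting job” by Michael Kaplan. His code is simple and good, but I saw some opportunities for optimization (obvious stuff). The filtered result can never be longer than the decomposed string it is built from, so we know its maximum length up front and can give the StringBuilder instance that initial capacity instead of letting it grow. Note that this is the length after normalizing to FormD, which can exceed the length of the original string. In simple cases like this one I actually prefer a char array to a StringBuilder, because it has even less overhead. Another obvious optimization is to skip the work entirely when the original string is empty. Here's my optimized version:

using System;
using System.Globalization;
using System.Text;

// Extension methods must live in a static class; the class name is arbitrary.
public static class StringExtensions
{
  public static string RemoveDiacritics(this string value)
  {
    if (value == null) throw new ArgumentNullException("value");

    if (value.Length > 0)
    {
      // Decompose first and size the buffer from the decomposed string.
      // FormD can be longer than the input even after the marks are dropped
      // (a precomposed Hangul syllable expands into two or three jamo),
      // so a buffer of the original length could overflow.
      string decomposed = value.Normalize(NormalizationForm.FormD);
      char[] chars = new char[decomposed.Length];
      int charIndex = 0;

      for (int i = 0; i < decomposed.Length; i++)
      {
        char c = decomposed[i];
        // Keep everything except non-spacing marks (the combining accents).
        if (CharUnicodeInfo.GetUnicodeCategory(c) != UnicodeCategory.NonSpacingMark)
          chars[charIndex++] = c;
      }

      // Recompose whatever survived the filter.
      return new string(chars, 0, charIndex).Normalize(NormalizationForm.FormC);
    }

    return value;
  }
}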
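
For comparison, the StringBuilder variant mentioned above might look roughly like the following. This is only a minimal sketch, not Michael Kaplan's original code: the names DiacriticsSketch and RemoveDiacriticsWithBuilder are made up for illustration, and the initial capacity is taken from the decomposed string on the assumption that the filter only ever removes characters.

```csharp
using System;
using System.Globalization;
using System.Text;

// Hypothetical class and method names, for illustration only.
public static class DiacriticsSketch
{
  public static string RemoveDiacriticsWithBuilder(string value)
  {
    if (value == null) throw new ArgumentNullException("value");
    if (value.Length == 0) return value;

    string decomposed = value.Normalize(NormalizationForm.FormD);

    // The filtered text is never longer than the decomposed string,
    // so this capacity means the builder never has to grow.
    StringBuilder builder = new StringBuilder(decomposed.Length);

    foreach (char c in decomposed)
    {
      // Drop the combining accents, keep everything else.
      if (CharUnicodeInfo.GetUnicodeCategory(c) != UnicodeCategory.NonSpacingMark)
        builder.Append(c);
    }

    return builder.ToString().Normalize(NormalizationForm.FormC);
  }
}
```

Calling DiacriticsSketch.RemoveDiacriticsWithBuilder("José") should give "Jose". Compared with the char array, the builder costs one extra object and a copy in ToString(), which is the small overhead the char-array version avoids.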
