Really good article on UTF-8 and character encodings (HTML Purifier docs, section on finding your charset) - http://htmlpurifier.org/docs/enduser-utf8.html#findcharset

Collection of i18n challenges and solutions - http://tomi.panula-ont.to/i18n/