A collection of computer systems and programming tips that you may find useful.
 
Brought to you by Craic Computing LLC, a bioinformatics consulting company.

Tuesday, February 1, 2011

Ruby 1.9 and incompatible character encodings

I ran into issues pulling remote text data into a Ruby 1.9 / Rails 3 app, which uses UTF-8 encoding by default. The problem apparently comes from non-ASCII characters in binary, or so-called ASCII-8BIT, encoded text. I don't have a proper way to translate the offending characters as yet, but my workaround is to strip them out and/or replace them with an ASCII character.
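For anyone who wants to reproduce the error, here is a minimal sketch of how I believe it arises. The variable names and sample strings are made up for illustration; the assumption is that the remote data arrives tagged as ASCII-8BIT and is then combined with a UTF-8 string that also contains non-ASCII characters.

```ruby
# Illustrative strings only, not data from the real app.
utf8   = "r\u00E9sum\u00E9"                             # UTF-8 text with non-ASCII characters
binary = "caf\xC3\xA9".dup.force_encoding('ASCII-8BIT') # same kind of bytes, tagged as binary

message = nil
begin
  # Both sides contain non-ASCII bytes and carry different encodings,
  # so Ruby refuses to concatenate them.
  combined = utf8 + binary
rescue Encoding::CompatibilityError => e
  message = e.message
end

puts message
```

If either string contained only ASCII characters, the concatenation would succeed, which is why the problem only shows up on some of the remote text.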

This regex implements the workaround. Be sure to use the 'n' modifier on the regex. It tells Ruby to ignore the encoding of the text and match byte by byte, so each byte of a multibyte character is treated separately.
    str.gsub!(/[^\x00-\x7F]/n,'?')
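Here is the same workaround wrapped up as a small helper, as a sketch rather than a finished solution. The method name ascii_only is my own invention, and I am assuming it is acceptable to relabel a copy of the string as ASCII-8BIT before matching and to tag the cleaned result as UTF-8 afterwards.

```ruby
# Hypothetical helper (the name is illustrative): replaces every
# non-ASCII byte with '?', treating the input as raw bytes.
def ascii_only(str)
  # Work on a copy so the caller's string is left alone.
  # force_encoding only relabels the bytes; it does not convert them.
  bytes = str.dup.force_encoding('ASCII-8BIT')
  cleaned = bytes.gsub(/[^\x00-\x7F]/n, '?')
  # The result is pure ASCII, so it is safe to tag it as UTF-8.
  cleaned.force_encoding('UTF-8')
end

remote = "caf\xC3\xA9"   # sample input: the final character is two bytes in UTF-8
clean  = ascii_only(remote)
puts clean
```

Note that each byte of a multibyte character gets its own replacement, so a single accented character turns into two or more question marks.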

Far from perfect, but it gets the job done for me right now.

 
