A collection of computer systems and programming tips that you may find useful.
 
Brought to you by Craic Computing LLC, a bioinformatics consulting company.

Friday, April 24, 2009

Ruby 1.9 and String Encoding

Ruby 1.9 implements a load of Internationalization features, which is great, but I just ran into one unfortunate side effect of that.

I work with large text files representing DNA sequences, patents, etc. These are typically plain ASCII text and that is how I treat them. Under Ruby 1.8 everything seemed fine. But running the same code under 1.9 on a 2GB text file, I got this error:

$ ./test.rb myfile
./test.rb:9:in `block in <main>': invalid byte sequence in US-ASCII (ArgumentError)
from ./test.rb:3:in `each_line'
from ./test.rb:3:in `<main>'

Here is the code that gave rise to that:
#!/usr/bin/env ruby
open(ARGV[0], 'r').each_line do |line|
  if line =~ />(\S+)/
    puts line
  end
end

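Somewhere in the middle of the input file is a non-ASCII character and Ruby 1.9 won't take it. It turns out that 1.9 takes a much stricter line on interpreting text. Every string now carries an encoding, and text read from a file gets the default external encoding, which Ruby takes from your locale settings. On my system that was US-ASCII, so any byte outside plain ASCII is an error as soon as you try something like a regular expression match. 1.8 just took what you gave it.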
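
You can reproduce the failure without a 2GB file. The snippet below is a minimal sketch, not the original script: the header line is made-up sample data containing the ISO-8859-1 byte for a u-umlaut (0xFC), and it is tagged as US-ASCII by hand to mimic what a US-ASCII default external encoding would do to a line read from disk.

```ruby
# Hypothetical sample line: ">Müller_seq1" with the umlaut as the single
# ISO-8859-1 byte 0xFC, tagged as US-ASCII the way a line read under a
# US-ASCII default external encoding would be.
line = ">M\xFCller_seq1\n".dup.force_encoding(Encoding::US_ASCII)

# What Ruby assumes for files when you don't say otherwise (locale dependent)
puts "default external: #{Encoding.default_external}"

puts "line encoding:    #{line.encoding}"
puts "valid encoding?   #{line.valid_encoding?}"

# Attempt the same match as the script and capture the error message, if any
error = begin
  line =~ />(\S+)/
  nil
rescue ArgumentError => e
  e.message
end

puts "match error:      #{error.inspect}"
```

The useful part is valid_encoding?, which lets you ask whether a string is acceptable in its tagged encoding before you try to match against it.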

If you know you will be reading UTF-8 or ISO-8859-1 text then you can explicitly tell your script to handle it. There are several ways to do this but in this simple example you can change the 'r' in the open statement like this:
#!/usr/bin/env ruby
open(ARGV[0], 'r:utf-8').each_line do |line|
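
The mode string can also name two encodings, "external:internal", in which case Ruby transcodes each line as it reads. Here is a small sketch of that, assuming the file really is ISO-8859-1 and that you want UTF-8 strings in your script. The temporary file and its contents are invented for the example.

```ruby
require 'tempfile'

# Hypothetical sample data: a header line with a Latin-1 u-umlaut (byte 0xFC)
sample = Tempfile.new('latin1')
sample.binmode
sample.write(">M\xFCller_seq1\nACGT\n".b)
sample.close

# 'r:iso-8859-1:utf-8' means: the bytes on disk are ISO-8859-1,
# convert each line to UTF-8 as it is read
headers = []
File.open(sample.path, 'r:iso-8859-1:utf-8') do |f|
  f.each_line do |line|
    headers << $1 if line =~ />(\S+)/
  end
end

puts headers.first
puts headers.first.encoding

sample.unlink
```

This only helps when you are right about the encoding on disk. If the guess is wrong, you either get the same kind of error or silently wrong characters.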

That's OK if you know the encoding, but in my work I see occasional non-ASCII characters, such as German umlauts, that have crept into public data files that I work with. I don't know what to expect and I don't want to clutter my code with rescue clauses to handle all possibilities.

The solution for my problem is to treat the text as binary by using the 'rb' mode in the open call. I can still process text data line by line, but Ruby now accepts non-ASCII bytes without complaint and passes them through untouched. So this version of the code takes the input data with no problems:
#!/usr/bin/env ruby
open(ARGV[0], 'rb').each_line do |line|
  if line =~ />(\S+)/
    puts line
  end
end
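
To see what binary mode actually does to the strings, here is a short sketch using a made-up temporary file in place of my real data. Lines come back tagged as ASCII-8BIT (Ruby's name for raw bytes), every byte sequence counts as valid, and the regular expression match goes through.

```ruby
require 'tempfile'

# Hypothetical sample data containing one non-ASCII byte (0xFC)
sample = Tempfile.new('mixed')
sample.binmode
sample.write(">M\xFCller_seq1\nACGT\n".b)
sample.close

matched = []
File.open(sample.path, 'rb') do |f|
  f.each_line do |line|
    # Same test as the script: keep lines that look like ">name"
    matched << line if line =~ />(\S+)/
  end
end

first = matched.first
puts first.encoding                 # the raw-bytes encoding
puts first.valid_encoding?          # binary strings are always valid
puts first.bytes.include?(0xFC)     # the odd byte is still there, unchanged

sample.unlink
```

One thing to watch for: these strings are bytes, not characters. If you later join them with UTF-8 strings that contain non-ASCII characters, Ruby can raise Encoding::CompatibilityError, so this approach suits scripts that just filter and pass lines along.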

My problem stemmed from two umlaut characters buried deep in the file. To figure out which lines were causing the problem I used this variant of the code to output the bad lines:
#!/usr/bin/env ruby
open(ARGV[0], 'r').each_line do |line|
  begin
    # The match itself raises ArgumentError on an invalid byte sequence
    line =~ />(\S+)/
  rescue ArgumentError
    # Print only the lines that Ruby could not handle
    puts line
  end
end
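
A different way to hunt for the offending lines, not the one I used above, is to ask each line whether it is valid rather than waiting for an exception. This is a sketch under a couple of assumptions: invalid_lines is a helper name made up for this example, and the file is read as US-ASCII explicitly so the result doesn't depend on your locale.

```ruby
# Hypothetical helper: return [line_number, line] pairs for every line
# that is not valid in the given encoding (US-ASCII by default).
def invalid_lines(path, encoding = 'us-ascii')
  bad = []
  File.open(path, "r:#{encoding}") do |f|
    f.each_line.with_index(1) do |line, number|
      bad << [number, line] unless line.valid_encoding?
    end
  end
  bad
end

# When run as a script with a filename, report each bad line with its number.
# inspect shows the offending bytes as \x escapes instead of printing them raw.
if __FILE__ == $0 && ARGV[0]
  invalid_lines(ARGV[0]).each do |number, line|
    puts "#{number}: #{line.inspect}"
  end
end
```

Having the line numbers makes it easy to jump straight to the problem spots in an editor.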

Look up the issue and you'll find plenty of debate on the merits or otherwise of this new feature in 1.9. It took me by surprise.


 
