Using NArray to crunch the Instagram Engineering Challenge

Instagram recently put forward a challenge they called the Instagram Engineering Challenge: The Unshredder.

There are already a number of solutions out there, and that's just in the first page of Google hits.

Here at Centro, the team all challenged each other and shared our solutions. Here's mine: https://github.com/timgaleckas/instagram_challenge.

When I solved the problem, I was pretty proud of myself until one of the wiseguys here threw this image at my code: Shredded Mona Lisa.

Of course, my original code fell over. It took 21 seconds to un-shred an image that large. It became a point of pride to see how fast I could get it running.

I tried:

  • Caching
    • This cut off 3 seconds, bringing the run down to 18 seconds.
  • Unrolling loops and skipping pixels
    • This cut the run time all the way down to 3 seconds.
    • It cost me some resolution, however.
    • It failed if I started at the second pixel instead of the first.
  • Switching to NArray
    • This ran at a comparable speed (5 seconds).
    • It does not sacrifice any of the resolution.
  • Reducing the scope of the check
    • This brought the time back down to the blazing-fast sub-3-second mark.
    • It does not sacrifice any vertical resolution.

If you take a careful look at the code and do the math, you'll see that switching to NArray let me go from checking only 100 pixels per column to checking the full 1549 pixels, while staying within one second of the method that picks and chooses pixels to check.

Seems pretty powerful. That's why I'd like to walk through how we're using NArray in this case and hopefully add an incredibly powerful tool to your Ruby toolbox.

The loop we're trying to optimize is:

def percentage_pixels_that_are_edgelike(other_column)
  @percentage_by_other_column ||= {}
  return @percentage_by_other_column[other_column] if @percentage_by_other_column[other_column]
  # this line is doing all the work
  @percentage_by_other_column[other_column] = (0...pixels.size).select{|index|pixel_is_edgelike?(index, other_column)}.size.to_f / pixels.size
end

# a pixel is edgelike when it differs more (by intensity) from the pixel
# beside it in the other column than from one of its vertical neighbors
def pixel_is_edgelike?(index, other_column)
  (self[index+1] && (self[index]-other_column[index]).intensity > (self[index]-self[index+1]).intensity) ||
  (self[index-1] && (self[index]-other_column[index]).intensity > (self[index]-self[index-1]).intensity)
end

We know that invoking that block once per pixel is inefficient, and we know that building an intermediate array via select{}.size is inefficient too. That's why my first attempt at optimization was just to unroll the iteration into a simple loop.
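
In broad strokes, the unrolled version looked something like this (a sketch, not the actual code; the real version also stepped over pixels, which is where the lost resolution came from):

def unrolled_percentage_pixels_that_are_edgelike(other_column)
  count = 0
  index = 0
  while index < pixels.size
    count += 1 if pixel_is_edgelike?(index, other_column)
    index += 1 # the real version used a larger step here to skip pixels
  end
  count.to_f / pixels.size
end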

What NArray allows us to do, on the other hand, is remove both iteration and math over arrays of data from ruby and push it down to a native extension.

So what used to be:

def add_arrays( array_1, array_2 )
  new_array = []
  array_1.each_index{ |index| new_array[index] = array_1[index] + array_2[index] }
  new_array
end

Turns into:

#we don't even need a method for this, but:
def add_narrays( narray_1, narray_2 )
  narray_1 + narray_2
end

And the fact that NArray handles math between multidimensional arrays just as easily allows us to do stuff like this:

[
  [ 1, 3, 3 ],
  [ 2, 1, 3 ],
  [ 2, 2, 1 ]
] +
[
  [ 0, 1, 2 ],
  [ 3, 4, 5 ],
  [ 6, 7, 8 ]
]
->
[
  [ 1, 4, 5 ],
  [ 5, 5, 8 ],
  [ 8, 9, 9 ]
]

and even cooler:

[
  [ 1, 3, 3 ],
  [ 2, 1, 3 ],
  [ 2, 2, 1 ]
] * 2
->
[
  [ 2, 6, 6 ],
  [ 4, 2, 6 ],
  [ 4, 4, 2 ]
]
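
The examples above use plain Ruby arrays for readability; with the real library you'd wrap the data in NArray first. A minimal sketch:

require 'narray'

a = NArray.to_na([ [ 1, 3, 3 ], [ 2, 1, 3 ], [ 2, 2, 1 ] ])
b = NArray.to_na([ [ 0, 1, 2 ], [ 3, 4, 5 ], [ 6, 7, 8 ] ])
a + b # element-wise sum, computed in the C extension
a * 2 # element-wise scalar multiply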

So instead of using RMagick's Pixel class and defining custom arithmetic operators on it, we convert the array of pixels to a multidimensional array of color values:

NArray.to_na(pixels.map{|p|[p.red.to_f,p.green.to_f,p.blue.to_f]})
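# => an NArray of shape (3, number_of_pixels); [0,true] is the reds, [1,true] the greens, [2,true] the blues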

Which allows us to find the absolute difference between pixel columns by:

(pixel_array1 - pixel_array2).abs

And we run these pixel arrays through a transform that converts them to intensity values:

def intensity_transform(narray)
  #http://www.imagemagick.org/RMagick/doc/struct.html#Pixel
  #The intensity is computed as 0.299*R+0.587*G+0.114*B.
  a = narray * [0.299,0.587,0.114]
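  # a[0,true], a[1,true], a[2,true] are the weighted R, G, and B rows, one value per pixel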
  a[0,true] + a[1,true] + a[2,true]
end

This method multiplies each red value by 0.299, each green value by 0.587, each blue value by 0.114, and then adds the resulting arrays together to get one intensity per pixel. (The weights sum to 1, so a pure-white pixel [255.0, 255.0, 255.0] comes out to exactly 255.0.)

Operating on parallel arrays works naturally when the values line up across the arrays, but our formula needs to compare each pixel to the pixels above and below it.

Given the pixel map:

C1 C2
a0 b0
a1 b1
a2 b2
a3 b3
a4 b4

The check that a2 is closer to the pixel above or below it than to the pixel next to it is (with all differences taken as absolute values): ( (a2 - b2) > (a2 - a1) ) || ( (a2 - b2) > (a2 - a3) )

To set up the data to allow us to use only operations across parallel arrays, we add shifted copies of the first array to our dataset like so:

C1 C2 C1[U] C1[D]
a1 b1 a0 a2
a2 b2 a1 a3
a3 b3 a2 a4

( Notice the removal of the top and bottom rows from C1: those pixels are missing one of their two vertical neighbors. C1[U] holds each pixel's upward neighbor and C1[D] its downward neighbor. )

So the formula for any given index i, becomes: ( ( C1[i] - C2[i] ) > ( C1[i] - C1_U[i] ) ) || ( ( C1[i] - C2[i] ) > ( C1[i] - C1_D[i] ) )

The actual NArray code looks like this:

def compute_percentage_pixels_that_are_edgelike( other_column )
  s = @pixels
  o = other_column.pixels
  # NArray[ range_for_channels, range_for_rows ]
  # all three color channels, minus the top and bottom rows
  s1 = s[0..2,1..-2]
  column_difference = intensity_transform((s1 - o[0..2,1..-2]).abs) # vs. the other column
  difference_up     = intensity_transform((s1 - s[0..2,0..-3]).abs) # vs. the pixel above
  difference_down   = intensity_transform((s1 - s[0..2,2..-1]).abs) # vs. the pixel below
  ((column_difference > difference_down).or( column_difference > difference_up )).where.total.to_f / s[0,true].total
end

Notice the 'where' method and the 'total' method.

'where' returns an NArray of the indices for which the mask is true, and 'total' returns the number of elements, the same as Ruby Array's size method ( but faster ).
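
A quick sketch of both on a toy mask:

mask = NArray.to_na([1, 5, 3]) > NArray.to_na([2, 2, 2])
mask             # a byte mask: [ 0, 1, 1 ]
mask.where       # the indices where the mask is non-zero: [ 1, 2 ]
mask.where.total # 2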

Hopefully this has been an enlightening introduction to NArray for some of you. It is a pretty cool library that can challenge the way you think about a problem and throw some pretty crazy speedups into the mix.

I'm not saying that NArray is perfect. Far from it. It has bugs (I ran into a couple just in this short exercise), it's difficult to use, lacking in documentation, and has an API that leaves a lot to be desired. But what it lacks in frills and polish, it more than makes up for in pure speed when it comes to crunching arrays and matrices of numbers.

I hope it finds its way into your Ruby toolbox.

Taking your test suite plaid

Hi, my name is Tim, and I too have a bloated and slow test suite. That’s the first step, isn’t it? Realizing you have a problem.

Well, we have a problem. So we’ve decided to make it a big priority to get our test suite running in a reasonable time frame. I’ve heard stories, perhaps only told to scare the new guys, of the full test suite taking over 35 minutes to run before I joined the team.

A key fact we just can’t get around is that we have a lot of code, both in our tests and our application. Here are the stats:

+----------------------+-------+-------+---------+---------+-----+-------+  
| Name                 | Lines |   LOC | Classes | Methods | M/C | LOC/M |  
+----------------------+-------+-------+---------+---------+-----+-------+  
| Controllers          |  4105 |  3333 |      59 |     423 |   7 |     5 |  
| Helpers              |  2073 |  1699 |       1 |     266 | 266 |     4 |  
| Models               | 15093 | 11668 |     218 |    1683 |   7 |     4 |  
| Libraries            |  4842 |  3723 |      64 |     477 |   7 |     5 |  
| Integration tests    |   676 |   506 |      17 |      29 |   1 |    15 |  
| Functional tests     | 14059 | 11496 |      92 |     547 |   5 |    19 |  
| Unit tests           | 31944 | 26450 |     195 |    1194 |   6 |    20 |  
| GoldStandard Tests   |   149 |   114 |       6 |      26 |   4 |     2 |  
+----------------------+-------+-------+---------+---------+-----+-------+  
| Total                | 72941 | 58989 |     652 |    4645 |   7 |    10 |  
+----------------------+-------+-------+---------+---------+-----+-------+  
  Code LOC: 20423     Test LOC: 38566     Code to Test Ratio: 1:1.9

There are obviously a lot of possible places to start. We didn’t want to get too deep down a rabbit hole, so we decided to make the initial work a three-day time-boxed effort inside our sprint.

I dusted off my copy of ‘Grease Your Suite’ (http://grease-your-suite.heroku.com) and started to work on it. If you haven’t read ‘Grease Your Suite’, I highly recommend reading at least the fast_context section before you continue with this post.

Cuisinart Speed

I decided to start with the unit test suite. We already use RVM, REE, and the GC environment variables to squeeze some more speed out of our suite, and we still come in at nearly 12 minutes (702 seconds) for the unit tests alone.

First things first, I wanted to see which unit tests were hurting the most. I pulled down the rbtrace gem (https://github.com/tmm1/rbtrace), ran our unit tests, and invoked rbtrace like so:

rbtrace --pid 15024 -c testunit | sort -n -k 2 -t'<' | tail -10

rbtrace comes with a predefined tracer that measures the run time of each of your test suites. I sort that list numerically and take the worst 10 to tackle first:

Test::Unit::TestSuite#run(name="ManifestSearchTest", size=18) <18.094087>  
Test::Unit::TestSuite#run(name="SiteMarketTest", size=40) <19.251209>  
Test::Unit::TestSuite#run(name="ManifestTest", size=154) <21.322834>  
Test::Unit::TestSuite#run(name="CollaborationTest", size=144) <22.447253>  
Test::Unit::TestSuite#run(name="ProposalGeneratorTest", size=35) <24.215540>  
Test::Unit::TestSuite#run(name="PublisherOrderMailerTest", size=29) <25.795780>  
Test::Unit::TestSuite#run(name="PlanTest", size=165) <32.161778>  
Test::Unit::TestSuite#run(name="PublisherOrderTest", size=69) <32.693467>  
Test::Unit::TestSuite#run(name="PlanRowTest", size=155) <33.549538>  
Test::Unit::TestSuite#run(name="PlacementTest", size=144) <120.024747>

Holy slow code Batman! We have a specific unit test suite that takes 120 seconds. Two full minutes of our unit tests are in one test suite. Wow.

With a little bit more instrumentation and liberal use of Benchmark.realtime, I narrowed down a big part of our slowdown to the setup in a couple of our parent contexts. These were being called for just about every test down the line.

context "a placement" do  
  setup do  
    #this benchmark comes in at somewhere around 1 sec just to create this factory  
    #Benchmark.realtime do  
    @placement = Factory(:placement)  
    #end  
    @placement.stubs(:ordered_placement).returns(nil)  
  end  

  should "do the right thing" do  
    assert some_code  
  end  

  context "when it has some property" do  
    setup do  
      @placement.stubs(:some_property)  
    end  

    should "test further use cases with this property" do  
      assert @placement.some_property.something  
    end  
  end  

  ...

There are roughly 170 tests in this suite. If one of the top contexts has a setup that takes about one second, and that setup gets repeated for every test underneath it, you can see how the time adds up. The reason it takes so long to create that factory is that it has a pretty deep dependency graph to build a usable object. Every test exercises only a small slice of that object, but we still need the full, valid object around.

Light Speed

It would be nice if we could just set it up once and access it in each test. This is where fast_context (https://github.com/lifo/fast_context) comes in really handy, because that is exactly what it’s supposed to do. Unfortunately, when I implemented fast_context it required a LOT of code changes. For every change we made to the placement, we had to introduce a reverse change at the end of the should (teardown doesn’t work too well in fast_context land) to put the factory-built object back into its initial state. With all those code changes, this only brought the test suite down to just under 50 seconds, and let’s be honest, a 50-second suite just isn’t going to cut it either. So I did some more instrumentation to find out where the rest of our time was going. It turns out fast_context only shares setup among tests within a single context, e.g. the following code:

fast_context 'a placement' do  
  setup { @placement = Factory(:placement) }  
  should 'be placed?' do  
    assert @placement.placed?  
  end  
  should "return its plan's name" do  
    assert_equal 'Plan One', @placement.plan_name  
    assert_equal 'Plan One', @placement.plan.name  
  end  
end

works fine and only calls the Factory(:placement) once. However, the following code:

fast_context 'a placement' do  
  setup { @placement = Factory(:placement) }  
  should 'be placed?' do  
    assert @placement.placed?  
  end  
  fast_context "plan delegation" do  
    should "return its plan's name" do  
      @placement.plan.expects(:name).returns('Plan One')  
      assert_equal 'Plan One', @placement.plan_name  
      #try to put it back to a pristine state for the next run  
      @placement.plan.unstub(:name)  
    end  
  end  
end

will run the Factory(:placement) twice. Once for ‘a placement’ and once for ‘plan delegation.’

Ridiculous Speed

The only way I could think of to fix that with fast_context was to flatten the contexts as much as possible and try to end up with a single level, as the sketch below shows. When I did that, the suite did get down to 24 seconds a run. The trade-off, however, was a steep price in readability and maintainability. The constant work of setting the factory-built object back to its initial state made for extremely verbose and error-prone code, and the flattening cost us all of the contexts we used for grouping and labeling.
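
Flattened out, the example from above ends up looking something like this (a sketch pieced together from the earlier snippets, not our actual suite):

fast_context 'a placement' do  
  setup { @placement = Factory(:placement) }  

  should 'be placed?' do  
    assert @placement.placed?  
  end  

  # formerly nested under the "plan delegation" context; the grouping is gone  
  should "return its plan's name" do  
    @placement.plan.expects(:name).returns('Plan One')  
    assert_equal 'Plan One', @placement.plan_name  
    @placement.plan.unstub(:name) # hand-rolled teardown at the end of the should  
  end  
end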

Ludicrous Speed

I started thinking about different ways to cache the placement after it was first made and hand out copies of it for each test. I came up with PureFunction, named after http://en.wikipedia.org/wiki/Pure_function. The first bullet point from the Wikipedia article is pretty much exactly what I was looking for:

  1. The function always evaluates the same result value given the same argument value(s).

I wanted a mechanism that, every time it is run in the same way, gives you the same thing. And since it’s going to give you the exact same thing, we can just cache it and return it. After all, why calculate it twice?

Here is the code (with a couple tests for good measure):

class PureFunction  
  @@cache = {}  
  def self.cache(&block)  

    #file/line number of the call to this function  
    calling_line = caller.first  

    #remove any cruft from being evaluated within a block/proc  
    calling_line = calling_line[0..(calling_line.index(':in')||0)-1]  

    #return the cache hit if it can be found  
    return Marshal.load(@@cache[calling_line]) if @@cache[calling_line]  

    #if it is unfound, run the actual code  
    ret_val = yield  

    #and store it by creating a serialized copy so that we can rehydrate  
    #it without any of the mutations it's gone through later  
    @@cache[calling_line]=Marshal.dump(ret_val)  
    return ret_val  
  end  
  def self.clear  
    @@cache = {}  
  end  
end  

if __FILE__ == $0  
  def assert(message, &block); puts "#{"%6s" % ((yield || false) ? '  PASS' : '! FAIL')} - #{message}"; end  

  assert "cache should return results" do  
    "hello" == PureFunction.cache do  
      "hello"  
    end  
  end  

  a = [1,2]  
  2.times do  
    assert "cache should return the same value whenever it's called at the same line" do  

      #we want the initial value to always come out of the block regardless of whether the object was mutated  
      [1,2] == PureFunction.cache { a }  

    end  

    #we mutate a  
    a << 3  
  end  

end

There are two really interesting parts I want to point out:

  1. You always get back an untouched copy of the return value of the block
  2. You don’t have to supply an arbitrary cache key, because I use the call site (file and line) as the cache key

So now my test code looks like this:

context 'a placement' do  
  setup do  
    @placement = PureFunction.cache do  
      Factory(:placement)
    end  
  end  

  should 'be placed?' do  
    assert @placement.placed?  
  end  

  context "plan delegation" do  
    should "return its plan's name" do  
      @placement.plan.expects(:name).returns('Plan One')  
      assert_equal 'Plan One', @placement.plan_name  
    end  
  end  
end

You can see here that now we don’t even have to use fast_context. The Factory(:placement) line gets run once and only once, no matter how many levels of context we have, and we don’t have to undo any mutations the placement has gone through in our shoulds.

A caveat: you can’t use this with unserializable objects. (We ran into instances where BackgroundJob was creating a ton of singleton classes, and cases where we tried to serialize objects that we had mocked.)
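
For example, Marshal refuses any object that carries singleton methods, which is essentially what mocking libraries graft on under the hood:

obj = Object.new  
def obj.greet; 'hi'; end # gives obj a singleton method  
Marshal.dump(obj)        # raises TypeError: singleton can't be dumped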

So that is all fine and good, but the key question is: how did we do with that 120-second test? Well, after stripping out any setup that we didn’t need and combining all the setup under as few PureFunction blocks as possible, we got it down to:

ruby -I"lib:test" test/unit/placement_test.rb   
Current PID<5494>  
Loading fast_context  
..........................................................................................................................................................................  
Finished in 7.278199 seconds.  

170 tests, 274 assertions, 0 failures, 0 errors

At the end of the three-day timebox, we had some pretty nice results. The worst unit test suite went from 120 seconds to about 7 seconds (a 94% reduction), and we have an approach to try with the rest of the unit test suite. With all the changes, the unit tests now clock in at just under 7.5 minutes (446 seconds), so there is still room for improvement as we continue into the next sprint.