Binary Search's infamous pitfall

#compsci (2)
Thumbnail

Exploring one of the most infamous pitfalls in binary search, and the importance of understanding it.

## Introduction

I was prompted to write this after watching Computerphile’s recent video on the binary search algorithm. I generally enjoy Dr. Mike Pound’s videos a lot, so I thought I’d play this one in the background, even if the topic is painfully familiar to me. All in all, Dr. Pound does a fine job at explaining the general concept of the algorithm and why it’s so ingenious but omitted what I think is a small but very significant detail.

For those of you who are unfamiliar with the binary search algorithm, I’d suggest watching the video first. For those of you who want a TL;DW, binary search is a search algorithm for finding the position of a target value within a sorted array in an optimal way. The general idea is to split the search space in half every iteration, comparing the middle point and our target value. Since we’re searching in a sorted array, one of three things can happen:

  1. The middle point is smaller than our target value, in which case we know our target value cannot possibly be anywhere to the left of the middle point.
  2. The middle point is bigger than our target value, in which case the target cannot be on the right of the middle point.
  3. The middle point happens to be our target, in which case we’re done.

The most common definition for mid is as follows:

mid=left+right2mid = \left\lfloor \frac{left + right}{2} \right\rfloor

Binary search is much better than linear search for sorted data, as it uses the fact that all elements are ordered to halve the search space at every iteration, achieving a worst-case complexity of O(log2(n))\text{O}(\log_2(n)), as opposed to linear search’s O(n)\text{O}(n) complexity. It is a brilliant yet simple algorithm that makes searching through billions and billions of elements a trivial task.

To put in perspective how much better binary search is than linear search at scale, for an array with 4 billion elements, binary search has to do only 32 iterations, whereas linear search will have to look at all 4 billion elements in the worst-case scenario. If we double the search space, i.e. if we have 8 billion elements, binary search will need only one extra iteration, which will barely affect its runtime, while the runtime of linear search will double.

Of course, keeping the data sorted is an entirely different problem, which I won’t get into in this blog post.

## The problem

Binary search is notoriously tricky to implement correctly, despite being so simple. In this blog post, we’ll look at one of the infamous pitfalls one can fall into while implementing it, in particular the one that the Computerphile video failed to mention.

Let’s look at a possible implementation of binary search. I’ve chosen C++ for this, but excluding the funky syntax for generics and references, the implementation should be easy to understand regardless of your experience with C++:

template <class T>
bool binary_search(std::vector<T> const& values, T target) {
  int left = 0, right = values.size() - 1;

  while (left <= right) {
    int mid = (left + right) / 2;
    if (values[mid] < target) {
      left = mid + 1;
    } else if (values[mid] > target) {
      right = mid - 1;
    } else {
      return true;
    }
  }

  return false;
}  

At first glance, it might seem perfectly fine, and in practice, it’ll work fine for any small example you can come up with, but I promise, there’s something wrong with it. And no, it isn’t a logic error (although I have to admit, it took me an embarrassing amount of attempts to realize I had the cases backward at first), the algorithm itself is perfectly fine; the issue is more subtle than that.

The problem is in the way mid is calculated. It might not be obvious at first, but the expression (left + right) / 2 can overflow. This can happen if left and right are large enough to overflow the int type. With a little bit of math, we can figure out that we can trigger an overflow if we manage to get left and right to both be equal to a number larger than or equal to 2302^{30}, as the maximum value a int can store is 23112^{31} - 1.

One might say that it’s unrealistic to be dealing with big enough arrays to trigger the issue, but I disagree, as the issue can be triggered in the implementation above with 231/2+1=230+12^{31}/2 + 1 = 2^{30} + 1 32-bit integers1, which take up around 4GiB4 \text{GiB} of memory, which isn’t unreasonable, especially when we consider that the real strength of binary search is searching through large search spaces.

Let’s see if we can write some code to trigger it:

int main() {
  // Least verbose C++ STL code (random number generation)
  std::random_device rnd_device;
  std::mt19937 mersenne_engine{rnd_device()};
  std::uniform_int_distribution<int> dist{0,
                                          std::numeric_limits<int>::max() - 1};

  // Allocate the sear ch array with the appropriate size
  auto n = (1 << 30) + 1;  // 2^30 + 1
  std::vector<int> values(n);

  // Generate the search array
  std::cout << "Generating " << n << " values..." << std::endl;
  std::generate(values.begin(), values.end(),
                [&]() { return dist(mersenne_engine); });

  // Sort the search array
  std::cout << "Sorting..." << std::endl;
  std::sort(values.begin(), values.end());

  // Set a target that's bigger than everything in our search array
  auto target = std::numeric_limits<int>::max();

  std::cout << "Searching..." << std::endl;
  std::cout << binary_search(values, target) << std::endl;

  return 0;
}  

I’ll also add a log in the while loop so that we can observe the algorithm’s state:

  while (left <= right) {
    int mid = (left + right) / 2;
    std::cout << "left " << left << " right " << right << " mid " << mid
              << std::endl;
    // ...
  }  

As mentioned, the code in the main function above allocates 230+12^{30} + 1 32-bit integers between 00 and 23122^{31} - 22, sorts them, then calls binary_search with 23112^{31} - 1 as a target, which we know isn’t in the array, meaning that the binary search will have to repeatedly move the left index to the right until it becomes equal to right, which is exactly when we expect to run into an overflow.

Let’s try running it:

$ g++ -o binary_search -Ofast binary_search.cpp && ./binary_search
Generating 1073741825 values...
Sorting...
(omitting irrelevant logs to save vertical space)
left 1073741809 right 1073741824 mid 1073741816
left 1073741817 right 1073741824 mid 1073741820
left 1073741821 right 1073741824 mid 1073741822
left 1073741823 right 1073741824 mid 1073741823
left 1073741824 right 1073741824 mid -1073741824
zsh: segmentation fault  ./binary_search  

It runs for quite a long time (sorting 4 GiB worth of data is no joke), then boom! Segmentation fault!

The log we added to the binary_search loop is really helpful for understanding what’s going on: we can see how left is slowly creeping up in value as we check more and more numbers that are smaller than our target, until finally, left and right both become 10737418241\,073\,741\,824, which is 231/2=2302^{31}/2 = 2^{30}. Adding 2302^{30} to itself gives us 2312^{31} -- one bigger than the largest value we can fit in a 32-bit signed integer, which shoots us over to the negative numbers, causing a segfault.

## The solution

Let’s think about the situation mathematically. We’re trying to calculate the midpoint midmid of a range defined by leftleft and rightright. We know that leftmidrightleft \leq mid \leq right, so there must be a way to calculate midmid without overflow. What we need to do is come up with an alternative formula for midmid -- one that doesn’t “go through” any large intermediate values to arrive at the final result.

The problem lies in the addition of leftleft and rightright, so we’d be in business if we could find a way to rewrite the original formula in a way that avoids it. Since we know the final result will be larger than leftleft and smaller than rightright, we should be able to arrive at the result if we add some non-negative number xx to leftleft. It should be easy to notice that xx must be sufficiently small so that left+x<=rightleft + x <= right, which in practice means that if we find a way to separate leftleft from the fraction, we’ll end up with an overflow-free expression.

Thankfully, a fairly simple transformation gets us exactly what we want (integer division is assumed to avoid the noise of adding floor everywhere):

mid=left+right2=2×leftleft+right2=left+rightleft2mid = \frac{left + right}{2} = \frac{2 \times left - left + right}{2} = left + \frac{right - left}{2}

Have I lost you? No? Good.

So, this is the “math” behind the solution to our overflow problem. But does it work in practice? Only one way to find out! Let’s modify our proof-of-concept program and check if we can still trigger a segfault:

  while (left <= right) {
    // int mid = (left + right) / 2;
    int mid = left + (right - left) / 2;
    std::cout << "left " << left << " right " << right << " mid " << mid
              << std::endl;

    // ...
  }  

This is the output we get when we try running the program now:

$ g++ -o binary_search -Ofast binary_search.cpp && ./binary_search 
Generating 1073741825 values...
Sorting...
(omitting irrelevant logs to save vertical space)
left 1073741809 right 1073741824 mid 1073741816
left 1073741817 right 1073741824 mid 1073741820
left 1073741821 right 1073741824 mid 1073741822
left 1073741823 right 1073741824 mid 1073741823
left 1073741824 right 1073741824 mid 1073741824
0  

Awesome! As you can see, our test program no longer segfaults and correctly returns false, as our target isn’t contained in the array. What if we set the last element to our target? Will our algorithm find it? Let’s see!

Let’s set the last element to target:

  // Setting a target that's bigger than everything in our search array
  auto target = std::numeric_limits<int>::max();

  values.back() = target;

  // ...  

If we re-run, this is the output we see:

$ g++ -o binary_search -Ofast binary_search.cpp && ./binary_search 
Generating 1073741825 values...
Sorting...
(omitting irrelevant logs to save vertical space)
left 1073741809 right 1073741824 mid 1073741816
left 1073741817 right 1073741824 mid 1073741820
left 1073741821 right 1073741824 mid 1073741822
left 1073741823 right 1073741824 mid 1073741823
left 1073741824 right 1073741824 mid 1073741824
1  

As expected, our binary search successfully finds the target right at the very end of the array. Problem solved! Well, at least for search spaces up to 2312^{31}

If we want to go even bigger, the int data type will no longer cut it. I kept the indices as 32-bit signed integers for simplicity’s sake. Had I used size_t (64-bit unsigned integers), the problem would’ve very much still existed (albeit not as spectacular3), but the reproduction would’ve required a much bigger search space. For completeness though, let’s switch to a bigger data type so that we can handle even bigger inputs:

template <class T>
bool binary_search(std::vector<T> const& values, T target) {
  size_t left = 0, right = values.size() - 1;

  while (left <= right) {
    size_t mid = left + (right - left) / 2;
 
    // ...
  }
}  

Alternatively, we could've solved this by adding a size check If for whatever reason you’d prefer if your binary search used 32-bit integers (I suppose performance could be a reason, but binary search is already stupidly fast, so differences would be negligible on modern hardware), checking the size of values before proceeding is an option.

Making assumptions like this is perfectly fine if we're solving some specific problem. If you knew you wouldn't be dealing with search spaces as large, fine, but please, verify your assumptions.

## The takeaways

Programming is hard. As a project grows in complexity, the surface area for bugs increases drastically. As developers, we're like architects in a digital landscape, and the integrity of our structures (software), relies heavily on the bedrock of thoughtful design and careful execution. Being aware of the implications of our design and implementation decisions is crucial because these choices are like dominoes; a single misstep can trigger a cascade of vulnerabilities, each with the potential to compromise our work and user trust.

To some of you, this error might seem extremely niche and unlikely to happen in practice, but it’s important to look at the bigger picture. Imagine this as some small, harmless-looking utility function in the context of a large project. Back when you wrote that utility function, you might not have seen anything wrong with it, just as our initial binary search implementation looked and worked completely fine. But under just the right conditions, this very function could bring down our entire app. This is why it’s important to “sweat” the details.

Also, to expand on the last paragraph of the previous section about verifying your assumptions, I can't stress how important this is. Making assumptions is fine, but not checking them is just asking for trouble. It can save you by catching errors caused by the violation of said assumptions early before they can cause any harm. As much as you want to convince yourself that the single if statement involved in doing so will be the bottleneck of your project, trust me, it won't.

Oh, and speaking of harmless-looking utility functions in the context of large projects, the Arrays#binarySearch function in Java had this very issue some 10 years ago! Look it up, I'm not kidding. I guess this is another mini-lesson about not blindly trusting libraries.

Unfortunately, there's no “silver bullet” for all of our problems (sorry, Rust shills, but rewriting it in Rust ain’t it). The only solution is this: awareness. Be aware that sometimes the trickiest errors aren't caused by complexity, but by “simplicity”. Sometimes, a single addition is enough.


1

Had I chosen a different type for left and right, e.g. unsigned int or even size_t, the issue would’ve been harder to trigger, as it would require a much larger array, but throwing bigger data types at the problem isn’t exactly a solution, especially for general implementations of binary search, where you don’t have a real constraint on the data.

Another thing to consider is that binary search is a general search algorithm that can be used to search through any search space, regardless if it’s a large amount of contiguous memory like an array. For example, one could use binary search to look through a big file of sorted data, where your only real limit is the amount of storage you have. An overflow could easily happen in a scenario like that as well.

2

I’ve chosen the range for the values somewhat arbitrarily. There’s no special meaning behind 23122^{31} - 2, I just made the upper bound one less than the target, which I chose to be 23112^{31} - 1. For this to work reliably, the target mustn’t be part of the possible values to avoid finding it earlier due to it showing up more than once.

3

The segfault comes from the fact we’re trying to access a negative index, which shoots us into memory we can’t access. If an unsigned data type was to be used, the worst that would happen is an incorrect result (given that our search space needs to be sufficiently large, an overflow of a signed data type would’ve just given us an incorrect index, but we’d still be accessing valid memory), which isn’t as crazy as a segfault.