Knowing when to hold and fold ’em: the explore/exploit dilemma
I’ve been staring at the menu for over 10 minutes. I can feel the server’s eyes boring holes into the back of my head, urging me to hurry up and pick something. Still I sit and ponder; should I get my old favorite, the California burrito? Or maybe I should try something new like a chimichanga? Or perhaps I ought to try the fish tacos…the place is called Don Carlos’s Taco Shop after all. Hmm, no, I think I’ll just go with the old stand-by: “One California burrito please!”
You’ve probably found yourself in a similar situation, wondering what to order at your favorite restaurant, or perhaps fretting over whether you should even visit that restaurant versus the new place that just opened up. Or maybe you’ve been worrying over the decision to stay in your current job or look for a new, potentially better (but also potentially worse) job? This dilemma, between sticking with what works and trying something new, is not uniquely human. Animals (and fungi, and plants) must constantly confront this issue. Imagine a bear foraging for blueberries on a mountainside. When should it switch from one blueberry bush to the next, from one mountain to another, or from foraging for blueberries to digging for roots?
The trade-off between exploitation (sticking with what works) and exploration (trying something new) is a ubiquitous problem. Exploiting can obviously be beneficial, whether you’re a bear eating blueberries or me eating a burrito. Still, if all we did was exploit we would miss out on all sorts of new and potentially much better experiences; even though I really enjoy the burrito I might be missing out on the sublime experience that is the chimichanga. On the other hand, exploration is inherently risky. Sure, we might discover something that is even better, but we might also completely strike out.
Smokey and the n-armed bandit
How we manage this trade-off between exploration and exploitation has been the focus of a great deal of research. One way the explore/exploit problem has been studied is with the so-called “n-armed bandit” task (old slot machines, which had a single lever, or “arm,” on the side, used to be called one-armed bandits for their ability to quickly relieve gamblers of their money). Here “n” refers to the number of “bandits” a subject has to choose from, so someone picking between two slot machines would be doing a 2-armed bandit task. In a particular 2-armed bandit task, experimenters changed the probability of winning and the magnitude of reward over time. This meant that subjects needed to periodically explore both options to ensure they knew which option was currently the best.
Researchers found that subjects sought to maximize expected value (basically, the probability of reward multiplied by the amount of reward). fMRI revealed that exploration activated an area of the brain known as the frontopolar cortex (fig. 2), while exploitation activated typical reward-related areas like the striatum (see this Neuwrite post for a little more on reward). In part, the researchers take these results to mean that exploration may require a bit more executive control: essentially, that the default is to exploit and that you have to exert some mental “effort” to shift into exploration in the n-armed bandit task.
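To make the bandit idea concrete, here is a minimal sketch of a 2-armed bandit whose best arm changes halfway through. Everything in it is an assumption for illustration: the win probabilities, the payouts, and the “epsilon-greedy” rule (mostly exploit the arm with the highest estimated value, occasionally explore at random) are a common textbook choice, not the model or task used in the study described above.

```python
import random

def expected_value(p_win, payout):
    """Expected value of one pull: probability of reward times reward size."""
    return p_win * payout

def run_bandit(n_trials=2000, epsilon=0.1, learning_rate=0.1, seed=0):
    """Play a 2-armed bandit; return (total reward, number of exploratory pulls)."""
    rng = random.Random(seed)
    p_win = [0.8, 0.2]      # hypothetical win probabilities for the two arms
    payout = [1.0, 1.0]     # hypothetical reward magnitudes
    estimates = [0.0, 0.0]  # the player's running estimate of each arm's value
    total_reward = 0.0
    n_explore = 0
    for t in range(n_trials):
        if t == n_trials // 2:
            p_win.reverse()  # the environment changes: the arms swap quality
        if rng.random() < epsilon:
            arm = rng.randrange(2)  # explore: pick an arm at random
            n_explore += 1
        else:
            # exploit: pick the arm that currently looks best
            arm = max(range(2), key=lambda a: estimates[a])
        reward = payout[arm] if rng.random() < p_win[arm] else 0.0
        # nudge the estimate for the chosen arm toward what was just observed
        estimates[arm] += learning_rate * (reward - estimates[arm])
        total_reward += reward
    return total_reward, n_explore

greedy_reward, _ = run_bandit(epsilon=0.0)     # never explores
curious_reward, _ = run_bandit(epsilon=0.1)    # explores on about 10% of pulls
print("pure exploitation:", greedy_reward)
print("some exploration: ", curious_reward)
```

In this toy version, the player who never explores keeps pulling the first arm long after it has gone bad, because it never finds out that the other arm improved. A little exploration costs some reward up front but lets the player notice the switch.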
Of course, the n-armed bandit task is something that we (hopefully) rarely encounter in our lives. Much more common evolutionarily is the problem of our bear friend from the introduction, who has to figure out how and when to shift between various blueberry patches. To model this foraging problem, scientists developed a task where monkeys were trained to pick food from virtual patches that would deplete over time as they continued to pick that patch. As food became increasingly sparse (i.e., it took them longer and longer to harvest food) within a specific patch, the monkeys needed to decide when to shift from exploiting their current patch to exploring and traveling to a new one. Interestingly, when researchers made the “travel” time between patches longer, they found that the monkeys would stay and exploit the current patch for longer. This makes intuitive sense. If food is hard to come by (because the next patch is far away), you should really try to harvest as much of it as you can before moving on. In contrast, if food is plentiful (because the next patch is close by), you can save energy by gathering only the easy-to-grab stuff, moving on, and again harvesting just the easy stuff.
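The intuition about travel time can be checked with a few lines of code. The sketch below is a toy forager, not the monkeys’ actual task: I assume each harvest yields one unit of food, that each successive harvest in a patch takes 20% longer than the last, and that the forager wants to maximize food per unit time over the long run (the rate-maximizing idea from optimal foraging theory). All of the numbers and function names are made up for illustration.

```python
def reward_rate(n_harvests, travel_time, first_harvest_time=1.0, slowdown=1.2):
    """Long-run food per unit time if the forager takes n_harvests in every patch.

    Each harvest yields one unit of food, but the k-th harvest in a patch
    takes first_harvest_time * slowdown**k because the patch is depleting.
    """
    time_in_patch = sum(first_harvest_time * slowdown ** k
                        for k in range(n_harvests))
    return n_harvests / (travel_time + time_in_patch)

def best_stay(travel_time, max_harvests=50):
    """Number of harvests per patch that gives the highest long-run rate."""
    return max(range(1, max_harvests + 1),
               key=lambda n: reward_rate(n, travel_time))

for travel_time in (1.0, 10.0, 30.0):
    print("travel time", travel_time, "-> harvests before leaving:",
          best_stay(travel_time))
```

Under these assumptions, the best number of harvests per patch goes up as the trip between patches gets longer, which is the same qualitative pattern the monkeys showed.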
But how does the brain manage the trade-off? When neurons fire an action potential, an electrical impulse is generated, and this can be recorded with an electrode. Researchers implanted an electrode into the dorsal anterior cingulate cortex (a brain region implicated in switching between actions) and recorded the activity of these neurons while the monkeys performed the foraging task. They found a group of neurons whose rate of firing increased as the monkeys spent more time within a patch (fig. 3), and once these neurons reached a threshold level of firing, monkeys would leave the current patch. Though the evidence is correlational, it suggests that these neurons may reflect the decision to leave a patch and switch from exploitation to exploration.
But how does the brain deal with changes in the environment? What specific mechanisms might allow these neurons to respond to changing parameters like the travel time between patches? If, for example, monkeys were performing a version of the task with shorter travel times (meaning they explore more), what (if anything) would change about the firing of these neurons? There are several possible options (fig. 4).
1) Shorter travel times to other patches could increase the rate at which these neurons increased their firing, meaning the neurons would reach the threshold for leaving the patch sooner.
2) Shorter travel times could lower the threshold that has to be reached to leave, meaning neurons would reach this lower threshold sooner.
3) Shorter travel times could increase how active these neurons are from the start, so that neurons begin at an elevated rate of firing and therefore reach the threshold sooner.
4) Some combination of the above
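The options above can be sketched with a toy “rise-to-threshold” calculation. I am assuming, purely for illustration, that firing ramps up in a straight line from a baseline rate until it hits a fixed threshold; the specific numbers (in spikes per second) are invented and are not taken from the recordings.

```python
def time_to_leave(baseline=5.0, slope=1.0, threshold=25.0):
    """Seconds until a linearly ramping firing rate reaches the leave threshold.

    baseline:  firing rate (spikes/s) when the animal arrives at the patch
    slope:     how fast the firing rate climbs (spikes/s per second in the patch)
    threshold: firing rate at which the animal leaves
    """
    return (threshold - baseline) / slope

reference = time_to_leave()                                 # long-travel-time case
option_1 = time_to_leave(slope=2.0)                         # faster rise
option_2 = time_to_leave(threshold=15.0)                    # lower threshold
option_3 = time_to_leave(baseline=15.0)                     # higher starting rate
option_4 = time_to_leave(slope=2.0, threshold=15.0)         # options 1 and 2 together

for name, t in [("reference", reference), ("option 1", option_1),
                ("option 2", option_2), ("option 3", option_3),
                ("option 4", option_4)]:
    print(name, "leaves after", t, "seconds")
```

Each change on its own shortens the time spent in the patch, and combining them shortens it further. That is why the recordings matter: behavior alone (leaving sooner) cannot tell these options apart.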
Ultimately, the researchers found that a combination of option 1 (faster rate of rise) and option 2 (lowered threshold) occurred on short travel time trials. What’s really cool about this is that, by getting into these mechanisms, the researchers may have figured out general algorithms that manage the explore/exploit trade-off in foraging. Even simple organisms confront some variant of this problem. Slime molds have been found to modulate the explore/exploit trade-off based on the quality of the food (better food = more exploitation), and plants “foraging” for water and nutrients with their roots likewise must tackle this problem. These simple organisms don’t even have a nervous system, let alone a “dorsal anterior cingulate cortex”, but they might solve this foraging dilemma using these same simple mechanisms, though perhaps implemented in a very different way.
The explore/exploit trade-off is truly fundamental. Humans, animals, plants, fungi, and single-celled organisms need resources to live, and there is an inherent difficulty in determining when it is optimal to switch between exploiting a current resource and exploring for a new one. The fundamental nature of this problem has recently come to the fore in machine learning and artificial intelligence. If you are, for instance, training a computer to play chess, you would want to ensure that it not only exploits moves and strategies that have worked in the past, but that it also has the flexibility to explore new strategies. Aside from the fundamental contribution of this trade-off to survival, some researchers have recently tied disruptions in managing the explore/exploit dilemma to psychiatric disorders. Several disorders, including addiction, Tourette’s syndrome, and obsessive-compulsive disorder, are, in part, characterized by the repetition of maladaptive behaviors (i.e., over-exploitation), perhaps indicating that the balance between exploitation and exploration is disrupted. Regardless of whether or not explore/exploit contributes to disorders, it is clear that it is an absolutely fundamental problem. So the next time you are fretting over what to order at a restaurant, recognize that you are confronting a problem just about as old as life itself.
 Daw, N. D., O’Doherty, J. P., Dayan, P., Seymour, B., & Dolan, R. J. (2006). Cortical substrates for exploratory decisions in humans. Nature, 441(7095), 876–879. https://doi.org/10.1038/nature04766
 Hayden, B. Y., Pearson, J. M., & Platt, M. L. (2011). Neuronal basis of sequential foraging decisions in a patchy environment. Nature Neuroscience, 14(7), 933–939. https://doi.org/10.1038/nn.2856
 Latty, T., & Beekman, M. (2009). Food quality affects search strategy in the acellular slime mold, Physarum polycephalum. Behavioral Ecology, 20, 1160–1167.
 McNickle, G. G., & Cahill, J. F., Jr. (2009). Plant root growth and the marginal value theorem. Proceedings of the National Academy of Sciences USA, 106, 4747–4751.
 Addicott, M. A., Pearson, J. M., Sweitzer, M. M., Barack, D. L., & Platt, M. L. (2017). A Primer on Foraging and the Explore/Exploit Trade-Off for Psychiatry Research. Neuropsychopharmacology, 42(10), 1931–1939. https://doi.org/10.1038/npp.2017.108