Building intelligence, part two.

Last week I published a post about creating a simple actor that observed its environment and used hard-coded sense/action pairs to reach a goal.

This week's objective is to have the actor learn to build its own actions library from scratch. If it's stuck, it should experiment and try something new. To get this to work we need three things: physics, experimentation, and a goal.

Physics

Physics is normally complicated and difficult to implement, but in the context of our maze it's pretty simple. The actor can move along a path. It can't move into a wall or out of bounds. That's it.
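
To make that concrete, here's a rough sketch of what those rules could look like in code - the Maze and Position types (and the canMove(to:) helper) are illustrative stand-ins, not the actual types from the project:

// A position on the grid.
struct Position: Equatable {
    var x: Int
    var y: Int
}

struct Maze {
    let walls: [[Bool]]  // true = wall, false = open path

    // A move is legal only if the target cell is inside the grid and not a wall.
    func canMove(to position: Position) -> Bool {
        guard position.y >= 0, position.y < walls.count,
              position.x >= 0, position.x < walls[position.y].count else {
            return false  // out of bounds
        }
        return !walls[position.y][position.x]
    }
}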

Whatever the rules are in an environment, our intelligence will need to learn them. It does this via two mechanisms - experimentation and mimicking. Because our intrepid actor is the first of its kind, it can't mimic quite yet. But it can experiment!

Experimentation

Experimentation involves having a set of possible actions, and trying them out. In our case, the possible actions are:

// The complete set of moves available to the actor.
enum Action: CaseIterable {
    case turnLeft     // rotate to face left
    case turnRight    // rotate to face right
    case moveForward  // step one cell in the direction the actor is facing
}
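
Since Action conforms to CaseIterable, "trying something out" is just a matter of drawing a random case:

// allCases comes for free with CaseIterable; force-unwrapping is safe
// because Action always has at least one case.
let nextAction = Action.allCases.randomElement()!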

It's pretty simple - the actor can't do jumping jacks, or lick an ice cream cone (poor actor), but it can move. How does it know which action is good? Or which is bad? That's where goals come into play.

Goals

Goals are based on homeostatic conditions. There is our current state, our ideal state, and the difference between the two. An action that reduces the difference is considered a "good" action, and one that increases it is considered a "bad" action, because increasing the difference means falling out of balance rather than tending towards a stable state.
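
As a minimal sketch, that judgement might look like this, with the "difference" collapsed into a single number (say, distance to the reward) - the actual representation is up to the actor:

// Judge an action by whether it shrank the gap between the current
// state and the ideal state. The Double "difference" is a stand-in for
// whatever measure the actor actually uses.
enum Judgement {
    case good     // the action reduced the difference
    case bad      // the action increased the difference
    case neutral  // the difference didn't change
}

func judge(differenceBefore: Double, differenceAfter: Double) -> Judgement {
    if differenceAfter < differenceBefore { return .good }
    if differenceAfter > differenceBefore { return .bad }
    return .neutral
}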

If we take a cue from reinforcement learning, we can "reward" our program when it reaches the end of the maze. I'm sure someone with actual machine learning experience could explain how the reinforcement works more rigorously, but in our case, what we'll do is pretty simple (there's a rough sketch in code right after the list):

  1. When the actor reaches the reward, we'll add the previous sensory data/action to the "remembered actions" array so the actor doesn't forget it. After all, that action led to the reward!
  2. When the actor reaches a state that already exists in the "remembered actions" array, it'll remember the previous action - because that action got us closer to something we know leads to a reward!
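
Here's a rough sketch of those two rules. SensedState, RememberedAction, and remember(...) are illustrative names rather than the project's actual API - the point is just that a state/action pair gets stored whenever it led to the reward, or to a state we already know pays off:

// A placeholder for whatever sensory snapshot the actor records.
struct SensedState: Equatable {
    let description: String
}

// A sensory snapshot paired with the action the actor took there.
struct RememberedAction: Equatable {
    let state: SensedState
    let action: Action
}

var rememberedActions: [RememberedAction] = []

func remember(previousState: SensedState, previousAction: Action,
              currentState: SensedState, reachedReward: Bool) {
    // Rule 1: the previous action led straight to the reward.
    // Rule 2: the previous action led to a state we already know pays off.
    let knownState = rememberedActions.contains { $0.state == currentState }
    if reachedReward || knownState {
        let memory = RememberedAction(state: previousState, action: previousAction)
        if !rememberedActions.contains(memory) {
            rememberedActions.append(memory)
        }
    }
}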

It'll take a while, but we'll run the program, choosing random actions until the actor gets to the reward. We'll make sure the actor can't move through walls or break any other laws. Then we'll run the program again, and again, until the actor has learned the optimal path to the reward.
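
Putting that together, a single run might look something like the sketch below. The sense, apply, and reachedReward hooks are assumptions standing in for the actor's real interface to the maze, and the learning rule is one plausible reading of the above: replay a remembered action when the current state is recognized, experiment otherwise.

// One run through the maze: act until the reward is found (or we give up),
// returning how many steps it took. Reuses rememberedActions and remember()
// from the sketch above; the closures are assumed hooks into the maze.
func runEpisode(sense: () -> SensedState,
                apply: (Action) -> Void,
                reachedReward: () -> Bool,
                maxSteps: Int = 10_000) -> Int {
    var steps = 0
    while !reachedReward() && steps < maxSteps {
        let stateBefore = sense()
        // Replay a remembered action if this state is recognized; otherwise experiment.
        let remembered = rememberedActions.first { $0.state == stateBefore }?.action
        let action = remembered ?? Action.allCases.randomElement()!
        apply(action)  // the maze physics rejects illegal moves, so the actor stays put
        // If the move paid off, record the state/action pair that led here.
        remember(previousState: stateBefore, previousAction: action,
                 currentState: sense(), reachedReward: reachedReward())
        steps += 1
    }
    return steps
}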

Ideally, for the maze that I posted previously, it should only take 10 actions (including turning actions):

 w a w w w w
 w p w w w w
 w p w w r w
 w p w w p w
 w p p p p w
 w w w w w w

So, to evaluate our learning process, we'll run the simulation 100 times and post the number of steps the actor took to get to the reward. Hopefully it gets better with time!
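
The harness for that can stay tiny. resetEnvironment() and the other hooks are, again, assumed names rather than the project's real API:

// Run the simulation 100 times and print how many steps each run took.
for run in 1...100 {
    resetEnvironment()  // assumed helper: put the actor back at the start
    let steps = runEpisode(sense: sense, apply: apply, reachedReward: reachedReward)
    print("Run \(run): \(steps) steps to the reward")
}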

The Results

[Chart: results after 10 iterations]

After running the program, our actor quickly learns how to find the reward. It experiences some random variation at the beginning and then discovers more and more optimal behavior until it has found the best possible solution. That's good!

However, there's a major flaw in the algorithm. To resolve it, we'll need to investigate how the actor sees the world - primarily, that it can get confused if it runs into an identical scenario in a different part of the map. To understand why that happens and how to fix it, check out the next part of the series!


Leave a comment on Reddit!