Reinforcement Learning is the branch of Machine Learning that allows training an agent to select the best sequence of actions, depending on the current state, in order to maximize long-term returns. This sequence of actions is called a policy. Reinforcement Learning mimics the real-life process of learning through trial and error: the agent starts in a given state, takes an action, then observes the new state of its environment and receives a reward (or punishment) for reaching this state. The goal of Reinforcement Learning is to understand the relation between the actions, states, and rewards in order to find the best sequence of actions to take in any given situation. This fits our problem perfectly, as controlling the plane can easily be described as the sequence of actions to take on thrust and pitch allowing for the best trajectories.

In our example of a virtual plane, the relationship between actions (thrust and pitch) and outcomes (reward and new state) is known, since we designed the physical model that computes them in the previous article. However, it is not known to the agent (otherwise it would become planning and not Reinforcement Learning). The agent will therefore have to understand how its environment works in order to select the best actions.

Environment definition

As mentioned earlier, we will be using Tensorforce to implement our work. This library allows us to create our custom environments fairly easily, with a nice amount of customization, and it provides an implementation of state-of-the-art agents which is easily customizable. Tensorforce's custom environment definition requires the following information (from Tensorforce's doc), as shown in the sketch after this list:

- Init is self-explanatory and initializes the values needed to create the environment (see the doc for more details).
- States defines the shape and type of the state representation. In the example, the state is represented by 8 floats (hence it has 8 continuous dimensions).
- Actions defines the shape and type of the actions made available to the agent. In the example, the agent can pick an action among 1, 2, 3, and 4, where each action is then processed by the environment to create the next step (in a grid-world game, we could imagine the mapping 1: go up, 2: go down, 3: go left, 4: go right).
- Reset puts the environment back into its starting state so that a new episode can start.
- Execute processes the action chosen by the agent and collects the new state, the reward, and whether or not the agent reached a terminal state.
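Following the example from Tensorforce's documentation, a minimal skeleton looks roughly like this; the random states and rewards are placeholders standing in for real environment dynamics:

```python
import numpy as np
from tensorforce.environments import Environment


class CustomEnvironment(Environment):
    """Minimal custom environment implementing Tensorforce's interface."""

    def __init__(self):
        super().__init__()

    def states(self):
        # State representation: 8 floats, i.e. 8 continuous dimensions.
        return dict(type='float', shape=(8,))

    def actions(self):
        # Four discrete actions (e.g. a grid world's up/down/left/right).
        return dict(type='int', num_values=4)

    def reset(self):
        # Put the environment back into its starting state for a new episode.
        state = np.random.random(size=(8,))
        return state

    def execute(self, actions):
        # Process the chosen action; return the new state, whether a terminal
        # state was reached, and the reward.
        next_state = np.random.random(size=(8,))
        terminal = False
        reward = np.random.random()
        return next_state, terminal, reward
```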
Let's now customize our environment! For our first case, we will train CaptAIn (our AI pilot) to learn how to take off, and we will use it as an example to go through all the steps in detail. The state describes the plane's position and velocity; in particular, it includes the vertical position of the plane above the runway and its horizontal position from the start of the runway.

Rewards need to be designed to encourage the agent to perform our task: the reward needs to be encountered sufficiently frequently for the agent to learn quickly, while being sufficiently precise for the agent to learn only the behavior we seek. We want our CaptAIn to take off as quickly as possible, so we want to encourage the shortest take-off distance or the shortest take-off time (the two are directly related). To achieve this, we will penalize the agent for every time step it takes to reach the objective; we can think of it as withdrawing 1 dollar from CaptAIn's bank account for every second it takes to take off. We will also reward the agent when it reaches its goal (an altitude of 25 m), with an amount based on how much runway was left when it took off; we can think of it as roughly giving it 1 dollar per meter of runway left, motivating it to use the least runway possible while also rewarding a successful take-off. We now have a full environment!
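As an illustration, here is a minimal sketch of what this customized environment could look like. The constants (`RUNWAY_LENGTH`), the action granularity, and the `_step_physics` stub are assumptions standing in for the real flight model built in the previous article:

```python
import numpy as np
from tensorforce.environments import Environment

RUNWAY_LENGTH = 1000.0   # assumed runway length in meters (illustrative)
TARGET_ALTITUDE = 25.0   # take-off goal: an altitude of 25 m


class TakeOffEnvironment(Environment):
    """Hypothetical take-off environment sketch, not the article's exact code."""

    def __init__(self):
        super().__init__()
        self.position = np.zeros(2)  # (horizontal, vertical) in meters
        self.velocity = np.zeros(2)  # (horizontal, vertical) in m/s

    def states(self):
        # Continuous state: position and velocity of the plane.
        return dict(type='float', shape=(4,))

    def actions(self):
        # Discrete thrust and pitch settings (assumed granularity of 3 levels).
        return dict(thrust=dict(type='int', num_values=3),
                    pitch=dict(type='int', num_values=3))

    def reset(self):
        # New episode: back at the start of the runway, at rest.
        self.position = np.zeros(2)
        self.velocity = np.zeros(2)
        return np.concatenate([self.position, self.velocity])

    def _step_physics(self, actions):
        # Stand-in for the physical model of the previous article: nudge the
        # plane forward and upward depending on the thrust and pitch levels.
        dt = 0.1
        accel = np.array([actions['thrust'], actions['pitch']], dtype=float)
        self.velocity = self.velocity + accel * dt
        self.position = self.position + self.velocity * dt

    def execute(self, actions):
        self._step_physics(actions)

        # Withdraw "1 dollar" per time step to encourage a fast take-off.
        reward = -1.0
        terminal = False

        if self.position[1] >= TARGET_ALTITUDE:
            # Goal reached: roughly 1 dollar per meter of runway left.
            reward += RUNWAY_LENGTH - self.position[0]
            terminal = True
        elif self.position[0] >= RUNWAY_LENGTH:
            # Ran out of runway without taking off: end the episode.
            terminal = True

        return np.concatenate([self.position, self.velocity]), terminal, reward
```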
Agent Definition

Our CaptAIn needs to be able to understand a complex relation (our transition function, which is unknown to it) between continuous states (position and velocity) and discrete actions (thrust and pitch) in order to maximize its long-term rewards. Tensorforce has multiple agents (Deep Q-Network, Dueling DQN, Actor-Critic, PPO, …) already implemented, so we don't have to implement them ourselves. In order not to overload this article, we won't go into detail on agent selection or how the agents work; the specificities and advantages of each of them could be the subject of an entire article, but we will stay focused on implementation today. After trying out the different agents, it was clear that the best fit for our task was the PPO (Proximal Policy Optimization) algorithm. For more details on how it works, please refer to this article.
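With a recent version of Tensorforce, creating the PPO agent and running one episode of the trial-and-error loop described above could look like this; `batch_size`, `learning_rate`, and `max_episode_timesteps` are illustrative values, not the tuned hyperparameters used for CaptAIn:

```python
from tensorforce.agents import Agent
from tensorforce.environments import Environment

# Wrap the custom environment defined earlier.
environment = Environment.create(
    environment=TakeOffEnvironment, max_episode_timesteps=500
)

# Create a PPO agent matched to the environment's states and actions.
agent = Agent.create(
    agent='ppo', environment=environment,
    batch_size=10, learning_rate=1e-3,
)

# One episode: act, step the environment, observe the reward.
states = environment.reset()
terminal = False
while not terminal:
    actions = agent.act(states=states)  # pick thrust and pitch
    states, terminal, reward = environment.execute(actions=actions)
    agent.observe(terminal=terminal, reward=reward)  # learn from the outcome
```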