Introduction
The goal of this project is to compare three approaches to control a Quadcopter in the task of navigating to multiple waypoints without crashing
- Human Agents: A human directly controlls the drone and drives it to the destination using keyboard input
- PID: The Quadcopter is navigated autonomously using the feedback loop mechanism of the PID controller
- Reinforcement Learning Agent: This being the crux of the project using 2 reinforcement learning algorithms DQN (Q learning and SAC). Detailed explanation about Q learning is what is being concentrated more on the project.
Quadcoptes are generally controlled either by humans or by use of algorithms derived by control theory. PID controllers are often used to control the motor power of each propeller and by sensing information the balace is maintainted using feedback loop mechanism
Using reinforcement learning learning will be the next step in the evolution of autonomous vehicles. Given the state of the vehicle and the environment, the agent will learn to take actions that will maximize the reward. The agent will learn to navigate to the destination without crashing.
Environment
Our environment uses OPENAI gym to simulate the quadcopter. The environment is a 2D space with 2 propellers. The environment is initialized with a random position and the agent should reach the destination (here balloon) without crashing. The agent is given all the details such as:
- angle_to_up : angle between the drone and the up vector (to observe gravity)
- velocity : velocity of the drone
- angle_velocity : angle of the velocity vector
- distance_to_target : distance to the target
- angle_to_target : angle between the drone and the target
- angle_target_and_velocity : angle between the to_target vector and the velocity vector
- distance_to_target : distance to the target
Hence completing the physics of the environment.
Say about openai GYM OPEN AI gym is a toolkit for developing and comparing reinforcement learning algorithms. It is a python library that provides a variety of environments to test our reinforcement learning algorithms. It is a collection of test problems that are called environments. Each environment has its own purpose. The environment used in this project is the classic cartpole environment. The environment is a 2D space with 2 propellers. We use openai gym here because the environments provides features such as rendering the environment, getting the state of the environment, taking actions, etc.
Physics
At each step of the simulation, the agents have to provide values between -0.003 and 0.083 (values chosen arbitrarily) for the thrust of their left and right rotors. The thrust is then converted to a force vector that is applied to the drone. The drone is then moved according to the force vector and the physics of the environment. The physics of the environment is as follows:
Step 1: Calculate acceleration
\[ \begin{aligned} & \ddot{x}=\frac{-\left(T_L+T_\gamma\right) \sin (\theta)}{m} \\ & \ddot{y}=\frac{-\left(T_L+T_\gamma\right) \cos (\theta)}{m} \\ & \ddot{\theta}=\frac{\left(T_r-T_L\right) \cdot L}{m} \end{aligned} \]
Equations for acceleration \(( \ddot{x}, \ddot{y}, \ddot{\theta} )\): \(\ddot{x} = \frac{T_L + T_R}{m} \cos(\theta)\) \(\ddot{y} = \frac{T_L + T_R}{m} \sin(\theta)\) \(\ddot{\theta} = \frac{l}{I} (T_R - T_L)\)
Definitions of variables:
- \(\ddot{x}, \ddot{y}\): acceleration on the x and y axes.
- \((\ddot{\theta})\): angular acceleration.
- \((T_L, T_R)\): input thrust for the left and right propellers.
- \((\theta)\): angle of the drone with respect to the z-axis.
- \((m)\): mass of the drone.
- \((g)\): acceleration due to gravity.
- \((l):\) distance between the center of mass and the propeller
Step 2: Derive the speed and the position.
\[ \begin{aligned} & \dot{x}(t-1)=\ddot{x} \cdot d t+\dot{x}(t) \\ & x(t+1)=\dot{x} \cdot d t+x(t \end{aligned} \]
Scoring
The simulation runs at 60fps for 100 seconds (can be modified). The agent should reach the destination without crashing. If the agents goes out of the field of the simulation, it is considered as a crash. The agent respawns with a penalty wait time of 3 seconds per crash (this is a bad thing in competitive environments).
Agents
Human Agent
The human agent is simulated in a near Earth environment with the thrust being constant and the human controls by using his keyboard.
The values are:
- Thrust initialization: \(T_L = 0.04 \quad \text{and} \quad T_R = 0.04\)
- UP (incrementing \(T_L\) and \(T_R\) by 0.04): \(T_L = T_L + 0.04 \quad \text{and} \quad T_R = T_R + 0.04\)
- DOWN (decrementing \(T_L\) and \(T_R\) by 0.04): \(T_L = T_L - 0.04 \quad \text{and} \quad T_R = T_R - 0.04\)
- LEFT: \(T_L = T_L - 0.003 \quad \text{and} \quad T_R = T_R + 0.04\)
- RIGHT: \(T_L = T_L + 0.004 \quad \text{and} \quad T_R = T_R - 0.03\)
PID Agent
For the PID agent we assume the error for the position from the target it has to reach and the location of the drone and move it vertically or rotate it at certain angle to move at different position
- The vertical distance from the drone to target is sent to PID which outputs a optimal vertical speed
- The horizontal distance is used to modify the thrust of each rotor to move it in the left or right direction same as the HUMAN Agent the rotation logic remains same
—- PUT IMAGE HERE ——-
Reinforcement learning Agent
The environment in which the drone trains is called as DRONEENV and we use openAI gym for creating the agent. The observations are made easy for the agent to understand. For observation the agents is given the following information (current position and the speed to move is determined by the following inputs)
- angle_to_up : angle between the drone and the up vector (to observe gravity)
- velocity : velocity of the drone
- angle_velocity : angle of the velocity vector
- distance_to_target : distance to the target
- angle_to_target : angle between the drone and the target
- angle_target_and_velocity : angle between the to_target vector and the velocity vector
- distance_to_target : distance to the target
Before going deeper into this we must know some basic concepts of Reinforcement Learning
Introduction
Reinforcement learning is an area of Machine Learning. It is about taking suitable action to maximize reward in a particular situation. It is employed by various software and machines to find the best possible behavior or path it should take in a specific situation. Reinforcement learning differs from supervised learning in a way that in supervised learning the training data has the answer key with it so the model is trained with the correct answer itself whereas in reinforcement learning, there is no answer but the reinforcement agent decides what to do to perform the given task. In the absence of a training dataset, it is bound to learn from its experience.
Reinforcement Learning (RL) is the science of decision making. It is about learning the optimal behavior in an environment to obtain maximum reward. In RL, the data is accumulated from machine learning systems that use a trial-and-error method. Data is not part of the input that we would find in supervised or unsupervised machine learning.
We have 4 main components when it comes to reinforcement learning.
Reward
- The rewards are scalar quantity which is a single number which can range from [−N,N] and sometimes it can have multiple outcomes so choosing them is the job of the agent
- The reward should be outside the control of the agent which means that the reward should come from the environment in which the agent is training this will make sure that the agent wont be able to simply add more reward
- The reward depends on the domain
- Having singular scalar value is good as its easy to estimate the model performance
- Reward should be bounded that is from -N to N
- Rewards must be given for each epoch or every single iteration
Policy
A policy is a rule that determines how an agent chooses an action in each state. It can be deterministic or stochastic, depending on whether the agent always selects the same action or randomly picks an action according to some probability distribution. Policy is a conditional probability which is given by \(\pi \left( a| s\right) =P( A_{t}= a| s_{t}= s)\) -> The following equation stands for the probablity that the agent will take action ‘A’ given that at time t the state is ‘S’
Return
Return is a measure of the total reward that an agent can expect to receive in the future, starting from a given state or state-action pair. It is often used as a target for learning the value functions that estimate how good a state or an action is. There are different ways to define return, depending on how the rewards are discounted or summed up. One common way is to use the discounted return, which is given by the formula:
\[ G_t = R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + \dots = \sum_{k=0}^\infty \gamma^k R_{t+k+1} \]
- where \(G_t\) is the return at time step \(t\), \(R_{t+k+1}\) is the reward received at time step \(t+k+1\), and \(\gamma\) is a discount factor between 0 and 1 that determines how much the agent values the future rewards.
- Agent which uses only the first term is called as near-sighted agent.
Value Function
The value function \(V^\pi(s)\) is the expected value of the return \(G_t\) when the agent follows the policy \(\pi\) and starts from the state \(s\).
\[ V^\pi(s) = \mathbb{E}_\pi \left\{ G_t \mid S_t = s \right\} \]
The expectation \(\mathbb{E}_\pi\) is taken over the distribution of the possible trajectories that the agent can experience under the policy \(\pi\). The value function represents how good or desirable a state is for the agent, and it can be used to evaluate or improve the policy.
PID
P
P-I
P-D
P-I-D
MDP
Markov Decision Process is here used to characterize the the problem the Markov chain for finding the next state transition with respect to the expected reward is given by the following equation
\[ P\left(S_{t+1}, R_{t+1} \mid S_t, A_t, S_{t-1}, A_{t-1} \ldots S_0\right) \] So for optimally finding the next state we must know everything that has happened before This formulation can be written as following: \[ P\left(S_{t+1}, R_{t+1} \mid S_t, A_t\right) \]
Formulation
In our research, we make certain assumptions to simplify the calculations and make them more static. The primary assumption is given by the following equations:
\[ \begin{aligned} & P(S_{t+1}=S^{\prime}, R_{t+1}=v \mid S_t=S, A_t=a) -> {1} \\ & P(S^{\prime}, R \mid S, a) -> {2} \end{aligned} \]
Here, \(P(S^{\prime}, R \mid S, a)\) represents the stationary assumption.
To obtain the output, we need:
- \(S\), which represents the set of states, also known as the state space.
- \(A\), which represents the set of actions.
- \(\gamma\), which is used when we are considering discounted returns.
Instead of using a single joint distribution given by the equation, we can break down the equation into two parts:
- Transition probabilities \(P\left(s^{\prime} \mid s, A\right)\)
- Expected reward \(E\left(r \mid s, a, s^{\prime}\right)\)
We write it down to break down the joint distribution. Here, only \(E\left(r \mid s, a, s^{\prime}\right)\) is enough as we are optimizing rewards.
The formulation of Markov Decision Process (MDP) is given by the Bellman equation:
\[ S, A, P\left(s^{\prime} \mid S, A\right), E\left(r \mid s, a, S^{\prime}\right) \]
The Bellman Equation
The Bellman equation is central to the theory of MDPs. It provides a recursive formulation for the value function of a policy in an MDP. The value function represents the expected return (accumulated rewards) for an agent starting in a particular state and following a specific policy thereafter.
The Bellman equation for MDPs can be expressed in two forms: one for the state-value function V(s)
and another for the action-value function Q(s, a)
:
\[ \begin{aligned} & V(s) = \max_a \left\{R(s,a) + \gamma \sum P(s'|s,a) V(s')\right\} -> {3} \\ & Q(s,a) = R(s,a) + \gamma \sum P(s'|s,a) \max_{a'} Q(s',a') -> {4} \end{aligned} \]
In these equations:
V(s)
is the value of states
under a policy.Q(s, a)
is the value of taking actiona
in states
under a policy.R(s, a)
is the immediate reward received after transitioning from states
with actiona
.P(s'|s, a)
is the transition probability of landing in states'
when actiona
is taken in states
.γ
is the discount factor which determines the present value of future rewards.
The Bellman equation essentially states that the value of a state (or state-action pair) is equal to the immediate reward plus the discounted expected value of the next state (or state-action pair). This recursive nature of the Bellman equation allows us to solve MDPs using dynamic programming methods.
Value Iteration
- Value function become useful when the Markov property holds true because we assume value to state regardless of how we came to that particular state
- If Markov property is not met how we arrived at the state will influence what will happen in future
- In the Value iteration approach we wont need the past states
Value iteration is a powerful approach that eliminates the need for past states in the computation of the value function. This method is widely employed in reinforcement learning and dynamic programming.
Theorem
Let \(v^0 \in V\) and \(\varepsilon > 0\) be given. We derive the sequence \(\{v^n\}\) from \(v^{n+1} = LV^n\), where \(LV = \max_{\pi}\{r + \gamma P^{\pi}v\}\). Here, \(r\) represents the reward, \(\gamma\) is the discount factor, and \(P^{\pi}\) is the transition probability under policy \(\pi\).
Theorem Statement
Then:
\(v^n\) converges in the norm to \(v^*\).
For all finite \(N\) at which the conditions \[ \|v^{n+1} - v^n\| < \varepsilon \times \frac{(1-r)}{2r} \] hold for all \(n > N\).
The policy \(\pi\) is defined by \[ \pi(s) = \underset{a}{\operatorname{argmax}}\{E(r \mid s, a) + \gamma \sum_{s'} p(s' \mid s, a) \cdot v^{n+1}(s')\} \].
\(\|v^{n+1} - v^*\| \leq \varepsilon / 2\) when the condition \(\|v_\pi - v^*\| = \varepsilon\) is met, where \(v_\pi\) is \(\varepsilon\)-optimal.
The sequence \(\{v^n\}\) is derived through the recursive equation \[ v^{n+1} = LV^n \], where \(L\) represents the Bellman operator. The operator \(L\) is defined as \[ LV = \max_{\pi}\{r + \gamma P^{\pi}v\} \], capturing the maximum expected sum of rewards under any policy.
In the theorem, \(\varepsilon\) is used to set a threshold for the difference between consecutive values in the sequence \(\{v^n\}\) during the iterative process of value iteration. Specifically, condition (b) states that for all finite \(N\), if \[ \|v^{n+1} - v^n\| < \varepsilon \times \frac{(1-\gamma)}{2\gamma} \] holds for all \(n > N\), then the sequence is considered to have converged.
The theorem asserts the convergence of the sequence \(v^n\) to the optimal value function \(v^*\) under the conditions outlined in points (a) through (d). The proof involves analyzing the difference between consecutive values in the sequence and establishing convergence criteria.