CMPT 419/983 Assignment 3
- Due date: Dec. 2
- Submit zip file to CourSys
1) This question guides you through implementing the policy gradient
algorithm with average reward baseline.
Preparation:
- Install gym and TensorFlow for Python. Documentation can be
found at https://gym.openai.com/ and
https://www.tensorflow.org/install.
- Replace “cartpole.py” in gym with the version provided. The
included file “cartpole_stabilize.py” contains the skeleton code for
training an agent to keep the cart’s position centred and the pole
upright.
The cartpole environment consists of a rotatable pole mounted on top of
a cart. The states of the system are the position and velocity (x, v) of the
cart, and the angular position and velocity (θ, ω) of the pole. The two
possible actions are to push the cart left or right with a constant force.
Our goal in this problem is to keep the cart’s position near zero and the
pole near upright for as long as possible. To encourage this, in the
custom environment defined in the provided “cartpole.py” file, the
cartpole system receives a reward of 1 for every time step in which its
state satisfies |x| ≤ 0.5 and |θ| ≤ 4π/180. Training episodes terminate when
the system state violates |x| ≤ 1.5 or |θ| ≤ 12π/180.
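This reward and termination rule reduces to a simple threshold check. A hypothetical sketch follows; the real logic lives in the provided “cartpole.py”, and none of these names are taken from that file:

    import math

    def reward_and_done(x, theta):
        # Reward 1 while |x| <= 0.5 and |theta| <= 4*pi/180.
        reward = 1.0 if abs(x) <= 0.5 and abs(theta) <= 4 * math.pi / 180 else 0.0
        # Episode ends once |x| > 1.5 or |theta| > 12*pi/180.
        done = abs(x) > 1.5 or abs(theta) > 12 * math.pi / 180
        return reward, done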
a) In the __init__ method of the agent class, define a policy network that takes
the state as input, has two fully connected hidden layers with ReLU
activation (the number of neurons in each is up to you), and outputs a
probability distribution over the two possible actions.
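A minimal TF1-style sketch of such a network, to be placed inside __init__; the layer widths and the names state_in and action_probs are illustrative assumptions, not taken from the skeleton:

    import tensorflow as tf

    # Inside __init__ -- a sketch only; sizes and names are assumptions.
    self.state_in = tf.placeholder(tf.float32, [None, 4])      # state (x, v, theta, omega)
    hidden1 = tf.layers.dense(self.state_in, 16, activation=tf.nn.relu)
    hidden2 = tf.layers.dense(hidden1, 16, activation=tf.nn.relu)
    logits = tf.layers.dense(hidden2, 2, activation=None)      # one logit per action
    self.action_probs = tf.nn.softmax(logits)                  # distribution over {left, right}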
b) In the __init__ method of the agent class, compute the probability that the
policy assigns to the actions in the input data.
c) In the __init__ method of the agent class, define the loss function such that
its gradient is ∇_θ J(θ).
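Continuing the sketch above, parts (b) and (c) could take roughly this shape; the placeholder names are again assumptions, and the loss shown is the usual −log π(a|s) · A surrogate, whose minimization performs gradient ascent on J(θ):

    # Still inside __init__ of the sketch above.
    self.action_in = tf.placeholder(tf.int32, [None])       # actions taken in the batch
    self.advantage_in = tf.placeholder(tf.float32, [None])  # A_t from compute_advantage
    idx = tf.stack([tf.range(tf.shape(self.action_in)[0]), self.action_in], axis=1)
    picked_prob = tf.gather_nd(self.action_probs, idx)      # pi_theta(a_t | s_t), part (b)
    self.loss = -tf.reduce_mean(tf.log(picked_prob) * self.advantage_in)  # part (c)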
d) Complete the compute_advantage function, which should compute a list of
advantage values A_t := Σ_{t′≥t} γ^(t′−t) r(s_{t′}, a_{t′}) − b for every time step across a batch
of episodes, where b = E_{τ∼p(τ;θ)}[Σ_{t≥0} γ^t r(s_t, a_t)] is the average discounted
reward across the batch of episodes. Note that the batch size is specified by the
“update_frequency” variable.
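A sketch of compute_advantage, under the assumption that the batch is stored as a list of per-episode reward lists (the signature and data layout are guesses about the skeleton):

    import numpy as np

    def compute_advantage(batch_rewards, gamma):
        all_returns = []     # discounted reward-to-go for every time step
        episode_totals = []  # total discounted return of each episode
        for episode in batch_rewards:
            g, rtg = 0.0, []
            for r in reversed(episode):
                g = r + gamma * g       # accumulate discounted return backwards
                rtg.append(g)
            rtg.reverse()               # rtg[t] = sum_{t'>=t} gamma^(t'-t) r_t'
            all_returns.extend(rtg)
            episode_totals.append(rtg[0] if rtg else 0.0)
        b = np.mean(episode_totals)     # average discounted reward baseline
        return [g - b for g in all_returns]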
e) Complete the main part of the script (fill in the provided “cartpole_stabilize.py” at lines 73-78, 104-107, and 122-124).
f) Produce several plots showing the state of the cart-pole system at different snapshots in time for a well-performing episode.
g) Produce a plot showing the sum of discounted rewards in each episode vs. episode number.
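For parts (f) and (g), something along these lines may help; env and episode_returns are assumptions about how the main loop is written:

    import matplotlib.pyplot as plt

    # Part (f): classic-control gym environments can return a frame as an array.
    frame = env.render(mode='rgb_array')            # snapshot of the current state
    plt.imshow(frame); plt.axis('off'); plt.savefig('snapshot.png')

    # Part (g): episode_returns is assumed to hold the sum of discounted
    # rewards recorded after each training episode.
    plt.figure()
    plt.plot(episode_returns)
    plt.xlabel('Episode')
    plt.ylabel('Sum of discounted rewards')
    plt.savefig('training_curve.png')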
2) EKF SLAM. Consider the Dubins car, given by the dynamics
ẋ = v cos θ
ẏ = v sin θ
θ̇ = ω
a) Suppose there are 10 landmarks, with fixed x- and y-positions. Derive a
discrete-time model for the augmented states (x, y, θ, m_1, ..., m_10) using
forward Euler.
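For reference, forward Euler with sampling period Δt gives a model of the following structure; the landmark states are static, so they carry over unchanged (this is a sketch of the expected structure, not the full derivation):

    x(k+1) = x(k) + Δt · v(k) cos θ(k)
    y(k+1) = y(k) + Δt · v(k) sin θ(k)
    θ(k+1) = θ(k) + Δt · ω(k)
    m_i(k+1) = m_i(k),   i = 1, ..., 10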
b) Derive the Jacobian of the above transition model.
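Since the landmarks are static, only the vehicle block of the Jacobian is nontrivial; differentiating the Euler model above with respect to (x, y, θ) yields

    [ 1   0   −Δt · v(k) sin θ(k) ]
    [ 0   1    Δt · v(k) cos θ(k) ]
    [ 0   0    1                  ]

with an identity block on the landmark coordinates.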
c) Starting from the provided MATLAB code, implement the one-step EKF
algorithm to estimate the augmented state. You do not need to modify any
code in the sub-folders. For your information, the vehicle_dynamics folder
contains a simple vehicle dynamics simulator, and the vehicle_sensors folder
contains a vehicle sensor simulator that noisily measures the range and
bearing of landmarks up to some maximum distance.
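For orientation, the standard one-step EKF has the predict/update structure below. The assignment code is MATLAB, so this Python sketch is purely illustrative, and the function arguments (f, F_jac, h, H_jac, Q, R) are generic placeholders rather than names from the provided code:

    import numpy as np

    def ekf_step(mu, P, u, z, f, F_jac, h, H_jac, Q, R):
        # Predict: propagate the mean through the transition model and the
        # covariance through its Jacobian.
        mu_pred = f(mu, u)
        F = F_jac(mu, u)
        P_pred = F @ P @ F.T + Q
        # Update: fold in the measurement z via the Kalman gain.
        H = H_jac(mu_pred)
        S = H @ P_pred @ H.T + R             # innovation covariance
        K = P_pred @ H.T @ np.linalg.inv(S)  # Kalman gain
        mu_new = mu_pred + K @ (z - h(mu_pred))
        P_new = (np.eye(len(mu)) - K @ H) @ P_pred
        return mu_new, P_new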