Reinforcement Learning – Adit Deshpande – CS Undergrad at UCLA

This is the 2nd installment of a new series called Deep Learning Research Review. Every couple of weeks or so, I'll be summarizing and explaining research papers in specific subfields of deep learning. This week focuses on Reinforcement Learning. Last time was Generative Adversarial Networks, ICYMI.

Introduction to Reinforcement Learning

Categories of Machine Learning

Before getting into the papers, let's first talk about what reinforcement learning is. The field of machine learning can be separated into 3 main categories:

1. Supervised Learning
2. Unsupervised Learning
3. Reinforcement Learning

The first category, supervised learning, is the one you may be most familiar with. It relies on the idea of creating a function or model based on a set of training data, which contains inputs and their corresponding labels. Convolutional Neural Networks are a great example of this: the images are the inputs and the outputs are the classifications of the images (dog, cat, etc.).

Unsupervised learning seeks to find some sort of structure within data through methods like cluster analysis. One of the most well-known ML clustering algorithms, K-Means, is an example of unsupervised learning.

Reinforcement learning is the task of learning what actions to take, given a certain situation/environment, so as to maximize a reward signal. The interesting difference between supervised and reinforcement learning is that this reward signal simply tells you whether the action (or input) that the agent takes is good or bad. It doesn't tell you anything about what the best action is. Contrast this to CNNs, where the corresponding label for each image input is a definite instruction of what the output should be for each input. Another unique component of RL is that an agent's actions affect the data it sees next. For example, an agent's action of moving left instead of right means that the agent will receive different input from the environment at the next time step. Let's look at an example to start off.

The RL Problem

So, let's first think about what we have in a reinforcement learning problem. Let's imagine a tiny robot in a small room. We haven't programmed this robot to move or walk or take any action. It's just standing there. This robot is our agent. Like we mentioned before, reinforcement learning is all about trying to understand the optimal way of making decisions/actions so that we maximize some reward R. This reward is a feedback signal that just indicates how well the agent is doing at a given time step. The action A that an agent takes at every time step is a function of both the reward (the signal telling the agent how well it's currently doing) and the state S, which is a description of the environment the agent is in. The mapping from environment states to actions is called our policy π. The policy basically defines the agent's way of behaving at a certain time, given a certain situation. Now, we also have a value function V, which is a measure of how good each position is. This is different from the reward in that the reward signal indicates what is good in the immediate sense, while the value function is more indicative of how good it is to be in this state/position in the long run. Finally, we also have a model M, which is the agent's representation of the environment. This is the agent's model of how it thinks the environment is going to behave.
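To make these pieces concrete, here is a minimal sketch of the agent-environment loop in Python. Everything in it (the GridRoom class, its reward values, and the random policy) is an illustrative stand-in, not something taken from the papers discussed later:

```python
import random

# A toy stand-in for "a tiny robot in a small room".
# Everything here (room size, reward values) is made up for illustration.
class GridRoom:
    def __init__(self, size=4):
        self.size = size
        self.state = (0, 0)  # the robot starts in the corner opposite the goal

    def step(self, action):
        x, y = self.state
        dx, dy = {"north": (0, 1), "south": (0, -1), "east": (1, 0), "west": (-1, 0)}[action]
        # Moves are clipped so the robot stays inside the room.
        self.state = (min(max(x + dx, 0), self.size - 1), min(max(y + dy, 0), self.size - 1))
        done = self.state == (self.size - 1, self.size - 1)   # reached the goal corner
        reward = 2.0 if done else -0.1                        # small step penalty: get there fast
        return self.state, reward, done

# The policy is the mapping from states to actions; here it is just uniform random.
def policy(state):
    return random.choice(["north", "south", "east", "west"])

env = GridRoom()
state, total_reward, done = env.state, 0.0, False
while not done:
    action = policy(state)                  # action A chosen in state S
    state, reward, done = env.step(action)  # environment returns next state S' and reward R
    total_reward += reward
print("return:", total_reward)
```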
Markov Decision Process

So, let's now think back to our robot (the agent) in the small room. Our reward function is dependent on what we want the agent to accomplish. Let's say that we want it to move to one of the corners of the room, where it will receive a reward. The robot will get a +2 when it reaches that corner. We basically want the robot to get to the corner as fast as possible. The actions the agent can take are moving north, south, east, or west. The agent's policy can be a simple one, where the behavior is that the agent will always move to the location with the higher value function. Makes sense, right? A position with a high value function = good to be in that position (with regards to long term reward).

Now, this whole RL environment can be described with a Markov Decision Process. For those who haven't heard the term before, an MDP is a framework for modeling an agent's decision making. It contains a finite set of states (and value functions for those states), a finite set of actions, a policy, and a reward function. Our value function can be split into 2 terms.

1. State-value function V: the expected return from starting in a state S and following a policy π. The return is the discounted sum of the rewards at every future time step: V^π(s) = E_π[ R_{t+1} + γ R_{t+2} + γ² R_{t+3} + … | S_t = s ]. (The gamma refers to a constant discount factor, which means that the reward at a time step far in the future is weighted less heavily than the reward at the next time step.)
2. Action-value function Q: the expected return from being in a state S, taking an action A, and then following a policy π: Q^π(s, a) = E_π[ R_{t+1} + γ R_{t+2} + γ² R_{t+3} + … | S_t = s, A_t = a ].

So, what do we do with this MDP? Well, we want to solve it, of course. By solving an MDP, you'll be able to find the optimal behavior (policy) that maximizes the amount of reward the agent can expect to get from any state in the environment.

Solving the MDP

We can solve an MDP and get the optimal policy through the use of dynamic programming, and specifically through the use of policy iteration (there is another technique called value iteration, but we won't go into that right now). The idea is that we take some initial policy π and evaluate the state value function under that policy. The way we do this is through the Bellman expectation equation. This equation basically says that our value function, given that we're following policy π, can be decomposed into the sum of the immediate reward and the discounted value function of the successor state. If you think about it closely, this is equivalent to the value function definition we used in the previous section. Using this equation is our policy evaluation component. In order to get a better policy, we use a policy improvement step where we simply act greedily with respect to the value function. In other words, the agent takes the action that maximizes value. Now, in order to get the optimal policy, we repeat these 2 steps, one after the other, until we converge to the optimal policy π*.
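Since the policy evaluation / policy improvement loop is only described in words above, here is a minimal policy iteration sketch on a toy grid room. The grid size, transition rules, reward values, and discount factor are all illustrative assumptions:

```python
# A minimal policy iteration sketch on a toy 4x4 grid room.
# Grid size, transition rules, rewards, and discount factor are illustrative assumptions.
N = 4
GAMMA = 0.9
GOAL = (N - 1, N - 1)
ACTIONS = {"north": (0, 1), "south": (0, -1), "east": (1, 0), "west": (-1, 0)}
STATES = [(x, y) for x in range(N) for y in range(N)]

def step(s, a):
    """Deterministic transition: move (clipped to the grid), +2 for reaching the corner."""
    if s == GOAL:
        return s, 0.0                                   # the goal is absorbing
    dx, dy = ACTIONS[a]
    ns = (min(max(s[0] + dx, 0), N - 1), min(max(s[1] + dy, 0), N - 1))
    return ns, (2.0 if ns == GOAL else -0.1)            # step penalty: reach the corner fast

policy = {s: "north" for s in STATES}                   # some initial policy
V = {s: 0.0 for s in STATES}

while True:
    # Policy evaluation: apply the Bellman expectation equation until V stops changing.
    while True:
        delta = 0.0
        for s in STATES:
            ns, r = step(s, policy[s])
            v_new = r + GAMMA * V[ns]                   # immediate reward + discounted successor value
            delta = max(delta, abs(v_new - V[s]))
            V[s] = v_new
        if delta < 1e-6:
            break
    # Policy improvement: act greedily with respect to the current value function.
    stable = True
    for s in STATES:
        best = max(ACTIONS, key=lambda a: step(s, a)[1] + GAMMA * V[step(s, a)[0]])
        if best != policy[s]:
            policy[s], stable = best, False
    if stable:
        break

print(V[(0, 0)], policy[(0, 0)])
```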
The MDP essentially tells you how the environment works, which realistically is not going to be given in real world scenarios. When we're not given an MDP, we use model free methods that go directly from the experience/interactions of the agent and the environment to the value functions and policies. We're going to be doing the same steps of policy evaluation and policy improvement, just without the information given by the MDP. The way we do this is that instead of improving our policy by optimizing over the state value function, we're going to optimize over the action value function Q. Remember how we decomposed the state value function into the sum of immediate reward and the value function of the successor state? Well, we can do the same with our Q function.

Now, we're going to go through the same process of policy evaluation and policy improvement, except we replace our state value function V with our action value function Q. Now, I'm going to skip over the details of what changes with the evaluation/improvement steps. Topics such as Monte Carlo Learning, Temporal Difference Learning, and SARSA would require whole blogs of their own to understand these MDP-free evaluation and improvement methods (if you are interested, though, please take a listen to David Silver's Lecture 4 and Lecture 5). Right now, however, I'm going to jump ahead to value function approximation and the methods discussed in the AlphaGo and Atari papers, and hopefully that should give a taste of modern RL techniques.

Value Function Approximation

The main takeaway is that we want to find the optimal policy π*. Look at the above Q equation. We're taking in a specific state S and action A, and then computing a number that basically tells us what the expected return is. Now let's imagine that our agent moves 1 millimeter to the right. This means we have a whole new state S', and now we're going to have to compute a Q value for that. In real world RL problems, there are millions and millions of states, so it's important that our value functions generalize; we don't want to store a completely separate value for every possible state. The solution is to use a Q value function approximation that is able to generalize to unknown states. So, what we want is some function, let's call it Qhat, that gives a rough approximation of the Q value given some state S and some action A. This function is going to take in S, A, and a good old weight vector W (once you see that W, you already know we're bringing in some gradient descent). It is going to compute the dot product between x (which is just a feature vector that represents S and A) and W. The way we're going to improve this function is by calculating the loss between the true Q value (let's just assume that it's given to us for now) and the output of the approximate function. After we compute the loss, we use gradient descent to find the minimum value, at which point we will have our optimal W vector. This idea of function approximation is going to be very key when taking a look at the papers a little later.
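A minimal sketch of that linear Qhat approximation and its gradient descent update might look like the following. The feature function, learning rate, and the assumption that a "true" Q target is simply handed to us are all illustrative, exactly as in the paragraph above:

```python
import numpy as np

def features(state, action):
    """Hypothetical feature vector x(S, A); in practice this is hand-designed or learned."""
    return np.array([state[0], state[1],
                     float(action == "north"), float(action == "east"), 1.0])

w = np.zeros(5)       # the weight vector W
ALPHA = 0.01          # learning rate (illustrative)

def q_hat(state, action):
    """Approximate Q value: the dot product between the feature vector x and the weights W."""
    return features(state, action) @ w

def update(state, action, q_target):
    """One gradient descent step on the squared loss (q_target - q_hat)^2."""
    global w
    x = features(state, action)
    error = q_target - x @ w
    w += ALPHA * error * x   # gradient of 0.5 * error**2 with respect to w is -error * x

# Example usage with a made-up state, action, and "true" Q target:
update(state=(0, 0), action="east", q_target=1.5)
print(q_hat((0, 0), "east"))
```

In deep RL methods like the Atari and AlphaGo work, the linear dot product is replaced by a neural network, but the idea of fitting Qhat to a target by gradient descent stays the same.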
Just One More Thing

Before getting to the papers, I just wanted to touch on one last thing. An interesting discussion within the topic of reinforcement learning is that of exploration vs exploitation. Exploitation is the agent's process of taking what it already knows, and then making the actions that it knows will produce the maximum reward. This sounds great, right? The agent will always be making the best action based on its current knowledge. However, there is a key phrase in that statement: current knowledge. If the agent hasn't explored enough of the state space, it can't possibly know whether it is really taking the best possible action. This idea of taking actions with the main purpose of exploring the state space is called exploration. This idea can be easily related to a real world example. Let's say you have a choice of what restaurant to eat at tonight. Exploitation would be going to your tried-and-true favorite, where you know exactly what you'll get. Exploration would be trying a new place, which might be worse, but which might also turn out to be better than anything you currently know about. Balancing these two is one of the central tensions in reinforcement learning.
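One common and simple way to balance the two is an epsilon-greedy rule: with a small probability the agent explores by picking a random action, otherwise it exploits what it already knows. A small sketch, assuming a q_values lookup that maps (state, action) pairs to current estimates:

```python
import random

def epsilon_greedy(state, actions, q_values, epsilon=0.1):
    """With probability epsilon take a random action (explore);
    otherwise take the action with the highest current Q estimate (exploit)."""
    if random.random() < epsilon:
        return random.choice(actions)
    return max(actions, key=lambda a: q_values.get((state, a), 0.0))

# Example usage with made-up Q estimates:
q = {((0, 0), "east"): 1.2, ((0, 0), "north"): 0.4}
print(epsilon_greedy((0, 0), ["north", "south", "east", "west"], q))
```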