Review for NeurIPS paper: Safe Reinforcement Learning via Curriculum Induction (NeurIPS 2020)

Summary and Contributions: This paper focuses on the problem of safe RL, with the goal of developing a framework for online agent training where the agent acts safely both during and after training. It develops the CISR framework, which assumes the existence of an automated "teacher" that can intervene while the agent is learning to reset its state. The teacher has a curriculum policy for how to sequence the interventions, and is itself an online learner that optimizes the curriculum policy as it interacts with students. The paper develops the details of this framework, shows that it guarantees safety in training, and shows in two experiments that optimizing the curriculum of interventions outperforms choosing a single intervention, and that the final policy is comparably effective to an agent trained without interventions (while maintaining safety).
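To make the interaction concrete, the following is a minimal sketch of how I understand the teacher/student loop: the teacher proposes a short sequence of interventions (a curriculum) for each new student, observes how that student fares, and uses the accumulated history to improve its curriculum policy for subsequent students. All class names, interfaces, and the toy "learning" dynamics below are hypothetical placeholders of mine, not the authors' implementation (which uses actual safe-RL students and, e.g., Bayesian optimization on the teacher side).

```python
# Hypothetical sketch of a CISR-style teacher/student loop; not the paper's code.
import random


class Student:
    """Stand-in for an RL learner; in the paper this is an actual safe-RL agent."""

    def __init__(self):
        self.skill = 0.0

    def train_under_intervention(self, intervention):
        # Placeholder dynamics: the student improves by a noisy amount that
        # depends on which intervention the teacher applied.
        self.skill += random.uniform(0.0, 1.0) * intervention["learning_rate"]
        return self.skill  # observed performance fed back to the teacher


class Teacher:
    """Online learner over curriculum policies (e.g., via Bayesian optimization)."""

    def __init__(self, interventions, num_phases):
        self.interventions = interventions
        self.num_phases = num_phases
        self.history = []  # (curriculum, final performance) pairs across students

    def propose_curriculum(self):
        # Placeholder: random search over intervention sequences; the paper
        # would instead use the performance history to optimize this choice.
        return [random.choice(self.interventions) for _ in range(self.num_phases)]

    def update(self, curriculum, performance):
        self.history.append((curriculum, performance))


def run_cisr_sketch(num_students=5, num_phases=2):
    # Two illustrative interventions; the paper's interventions reset the
    # student's state rather than changing a learning rate.
    interventions = [
        {"name": "soft_reset", "learning_rate": 0.5},
        {"name": "hard_reset", "learning_rate": 1.0},
    ]
    teacher = Teacher(interventions, num_phases)
    for s in range(num_students):
        student = Student()
        curriculum = teacher.propose_curriculum()
        performance = 0.0
        for intervention in curriculum:
            performance = student.train_under_intervention(intervention)
        teacher.update(curriculum, performance)
        print(f"student {s}: curriculum={[i['name'] for i in curriculum]}, "
              f"final performance={performance:.2f}")


if __name__ == "__main__":
    run_cisr_sketch()
```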
Strengths: The strengths of this work center on how the authors lay out a clear, general framework for safe RL with somewhat simpler assumptions than in much of the prior literature. The framework is relatively general, with the ability to use different algorithms to fill in for the teacher and student, and the interventions are likely to be realizable in many real applications where safe RL is desired. While the experiments are brief, they nicely illustrate how the framework can be instantiated in practice. These strengths are likely to make the work both of interest to the NeurIPS community and of impact for practitioners using these ideas in practice, especially given that while some elements of the framework have the potential to be computationally intense, less computationally intense elements can be used in the implementation (e.g., Bayesian optimization rather than a POMDP solver).

Weaknesses: There were two primary weaknesses that I noticed in the paper:

(1) The paper notes that the framework is different from prior curriculum learning work due to learning from prior learners, allowing it to be "data-driven rather than heuristic," but the consequences of that aren't explored. In particular, this may mean that many agents must be trained before arriving at a good curricular policy, which may be problematic for practitioners. This fact is somewhat hidden in the experimental evaluations because the evaluation of the optimized sequence of interventions doesn't include the performance of the first 30-100 learners. It would be helpful either to include a clearer argument for the importance of learning the curriculum policy over other approaches or to discuss the possible limitations of needing to learn from many learners (perhaps the robustness claims would mitigate this limitation to some extent). I would also have liked to see results (perhaps in the appendix) on the sensitivity of the results to the number of learners used before reaching an "optimized" point.

(2) In the experimental results, small K (K = 1 and K = 2) are used. It seems possible that a heuristic would perform very well here. I would have been interested in looking at larger K (or scenarios where larger K are needed) and comparing to a baseline of, say, switching interventions at a uniform interval through training, to understand how much the experiments are actually telling us about how well the optimization works. Understanding this would also help with understanding whether the time spent optimizing on previous learners was worth it. An additional, perhaps stronger comparison would be to a curriculum learner like [reference] that works on the same student.

Thank you to the authors for their response. I appreciate especially the inclusion of a comparison to [reference], which I believe strengthens the empirical result. The point about the number of possible curriculum sequences makes sense, but I do not believe it necessarily addresses the point about a heuristic performing relatively well. Figure (b) in the response helps to address that there is learning going on across students, but doesn't address whether there's a large proportion of the possible curricula that perform well and thus whether the learning task is relatively easy.

Correctness: To the extent I could tell, the paper and supplement are correct as written. However, I am somewhat confused by the interaction between the paper and the supplement, as the supplement notes that Proposition 1 isn't quite right and says that this will be addressed in the final version of the paper.

Clarity: The paper was clearly written, especially in the description of the CISR framework (up through the end of section 3). The experimental details were brief, although generally well explained in the appendix.