Sample efficiency is a crucial problem in deep reinforcement learning. Recent algorithms, such as REDQ and DroQ, improve sample efficiency by increasing the update-to-data (UTD) ratio to 20 critic gradient steps per environment sample. However, this comes at the expense of a greatly increased computational cost.
To reduce this computational burden, we introduce CrossQ, a lightweight algorithm that makes careful use of Batch Normalization, removes target networks, and matches or surpasses state-of-the-art sample efficiency while keeping a low UTD ratio of 1.
Instead of running separate forward passes for the current batch of state-action pairs (s, a) and the next batch (s', a'), the critic processes their concatenation in a single pass. BatchNorm therefore uses normalization moments computed from the union of both batches. These moments are not mismatched, as all inputs now belong to the same mixture distribution.
These changes take only a few lines of code.
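As an illustration, the following is a minimal PyTorch sketch of such a critic update. It is not the official implementation (which is SAC-based and uses Batch Renormalization, twin critics, and an entropy term); the network sizes, the deterministic actor call, and the batch shapes below are assumptions made for brevity.

import torch
import torch.nn as nn
import torch.nn.functional as F

class BNCritic(nn.Module):
    # Q-network with BatchNorm layers; layer sizes are illustrative, not the paper's.
    def __init__(self, obs_dim, act_dim, hidden=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim + act_dim, hidden), nn.BatchNorm1d(hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.BatchNorm1d(hidden), nn.ReLU(),
            nn.Linear(hidden, 1),
        )

    def forward(self, obs, act):
        return self.net(torch.cat([obs, act], dim=-1))

def critic_loss(critic, actor, obs, act, rew, next_obs, done, gamma=0.99):
    # rew and done are assumed to have shape [B, 1].
    with torch.no_grad():
        next_act = actor(next_obs)  # a' from the current policy (deterministic here for brevity)

    # Single joint forward pass: BatchNorm moments are computed over the union
    # of the (s, a) and (s', a') batches, i.e. over one mixture distribution.
    q_joint = critic(torch.cat([obs, next_obs], dim=0),
                     torch.cat([act, next_act], dim=0))
    q, next_q = torch.chunk(q_joint, 2, dim=0)

    # TD target from the same critic (no target network); gradient flow is cut.
    target = rew + gamma * (1.0 - done) * next_q.detach()
    return F.mse_loss(q, target)

Compared with a standard TD update, the only structural differences are the absence of a target network and the concatenation before the forward pass, which is what keeps the BatchNorm statistics consistent between the Q-value prediction and the bootstrap target.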
@inproceedings{bhatt2024crossq,
  title={CrossQ: Batch Normalization in Deep Reinforcement Learning for Greater Sample Efficiency and Simplicity},
  author={Aditya Bhatt and Daniel Palenicek and Boris Belousov and Max Argus and Artemij Amiranashvili and Thomas Brox and Jan Peters},
  booktitle={International Conference on Learning Representations},
  year={2024},
  url={https://openreview.net/forum?id=PczQtTsTIX}
}
A 2019 arXiv version of this paper was titled CrossNorm: Normalization for Off-Policy TD Reinforcement Learning.