Reinforcement Learning Docker Environment Setup

Set up a Docker environment for reinforcement learning, using the RLlib reinforcement learning library together with the TensorFlow 2 and PyTorch deep learning frameworks.

1. Install docker and nvidia-docker on the host machine

  • Install docker
# Remove old versions of docker
sudo apt-get remove docker docker-engine docker.io containerd runc 
sudo apt-get update
sudo apt-get install \
    ca-certificates \
    curl \
    gnupg \
    lsb-release

# Add Docker's official GPG key
curl -fsSL https://download.docker.com/linux/ubuntu/gpg | sudo gpg --dearmor -o /usr/share/keyrings/docker-archive-keyring.gpg

echo \
  "deb [arch=$(dpkg --print-architecture) signed-by=/usr/share/keyrings/docker-archive-keyring.gpg] https://download.docker.com/linux/ubuntu \
  $(lsb_release -cs) stable" | sudo tee /etc/apt/sources.list.d/docker.list > /dev/null

sudo apt-get update
sudo apt-get install docker-ce docker-ce-cli containerd.io

# Test the installation
sudo docker run hello-world

For other installation options, such as the nightly or test channels, or installing from a deb package, see Install Docker Engine on Ubuntu.

  • Install nvidia-docker2

nvidia-docker lets containers use the GPU, CUDA, cuDNN, and so on. With nvidia-docker2 installed, you can replace docker run with nvidia-docker run, or simply pass the --gpus flag to docker run (as in the test command below), to give the container GPU access.

# Add the GPG key and the apt repository
distribution=$(. /etc/os-release;echo $ID$VERSION_ID) \
      && curl -s -L https://nvidia.github.io/libnvidia-container/gpgkey | sudo apt-key add - \
      && curl -s -L https://nvidia.github.io/libnvidia-container/$distribution/libnvidia-container.list | sudo tee /etc/apt/sources.list.d/nvidia-container-toolkit.list

sudo apt-get update
sudo apt-get install -y nvidia-docker2

# Restart the docker daemon to complete the installation
sudo systemctl restart docker

# Test the installation
sudo docker run --rm --gpus all nvidia/cuda:11.0-base nvidia-smi

The test command above should produce output like the following:

+-----------------------------------------------------------------------------+
| NVIDIA-SMI 450.51.06    Driver Version: 450.51.06    CUDA Version: 11.0     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  Tesla T4            On   | 00000000:00:1E.0 Off |                    0 |
| N/A   34C    P8     9W /  70W |      0MiB / 15109MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+
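
As an extra sanity check, nvidia-docker2 registers an nvidia runtime with the docker daemon (which is why the daemon had to be restarted above). The JSON below is what the nvidia-docker2 package typically writes to /etc/docker/daemon.json; the exact content of your file may differ slightly.

cat /etc/docker/daemon.json
# Typical content written by the nvidia-docker2 package:
# {
#     "runtimes": {
#         "nvidia": {
#             "path": "nvidia-container-runtime",
#             "runtimeArgs": []
#         }
#     }
# }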

2. Write the Dockerfile

Filename: Dockerfile


FROM nvidia/cuda:11.1.1-cudnn8-runtime-ubuntu20.04

LABEL maintainer="z19040042@s.upc.edu.cn"

RUN apt update && apt install -y python3 \
    python3-pip \
    swig \
    && rm -rf /var/lib/apt/lists/*

COPY ./requirements.txt /

RUN pip config set global.index-url https://pypi.tuna.tsinghua.edu.cn/simple

RUN pip install -r requirements.txt

WORKDIR /

ADD test*.py /

Add the following to requirements.txt:

torch
torchvision
tensorflow
ray
ray[rllib]
ray[tune]
gym
gym[classic_control]
gym[box2d]

For Dockerfile best practices, see Best practices for writing Dockerfiles.

Write the test_gym.py, test_torch.py, and test_tf.py files.

# test_gym.py

import gym

env = gym.make('CartPole-v0')
env.reset()
for _ in range(1000):
    # render() needs a display; drop it (or use a virtual display such as xvfb) when running headless inside the container
    env.render()
    env.step(env.action_space.sample())  # take a random action
env.close()
# test_torch.py

import torch
import torchvision

print(f"{torch.__version__=}")
print(f"{torchvision.__version__=}")

print(f"{torch.cuda.is_available()=}")
# test_tf.py

import tensorflow as tf

print(f"{tf.__version__=}")
print(f"{tf.config.list_physical_devices('GPU')=}")

Put all of the files above into the same folder, named rldocker.

3. Build the image

Run the following in the rldocker directory:

sudo docker image build -t rldocker:latest .   # don't forget the . at the end; it means the Dockerfile is in the current directory
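
After the build finishes, the new image should appear in the local image list (the IMAGE ID, CREATED, and SIZE values will of course differ on your machine):

sudo docker image ls
# REPOSITORY   TAG      IMAGE ID     CREATED          SIZE
# rldocker     latest   <image id>   <time>           <size>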

4. Publish the image to Docker Hub

First, log in to Docker Hub:

docker login

Then re-tag the image. Without re-tagging, sudo docker image ls lists the image as rldocker:latest, and pushing it would target docker.io/library/rldocker:latest on Docker Hub, a namespace we have no permission to push to. The image therefore needs to be re-tagged under your own account; for my Docker Hub account, the re-tag command is:

sudo docker image tag rldocker:latest borninfreedom/rldocker:latest

Then push the image with docker push:

sudo docker image push borninfreedom/rldocker:latest

5. Use the docker image

At this point you can either build the image yourself from the Dockerfile above, or pull the image I have already pushed. To use the image from my repository:

sudo docker pull borninfreedom/rldocker:latest
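
With the image available locally, you can start a container with GPU access and run the test scripts that the Dockerfile copied to /. The commands below are a sketch using the image name from this post; adjust the mounts and flags to your own setup.

# Run the framework/GPU checks inside the container
sudo docker run --rm --gpus all borninfreedom/rldocker:latest python3 /test_torch.py
sudo docker run --rm --gpus all borninfreedom/rldocker:latest python3 /test_tf.py

# Or open an interactive shell, mounting the current directory as a workspace
sudo docker run -it --rm --gpus all -v $(pwd):/workspace -w /workspace borninfreedom/rldocker:latest bash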