Deep Learning Image Recognition Using GPUs in Amazon ECS Docker Containers — Part II

Brad Folkens · Published in CloudSight · 3 min read · Apr 30, 2018


When we published the first article on how to set up Amazon ECS (Docker) using nvidia-docker / GPUs for inference (or training) of neural networks in the cloud, we had no idea how popular the article would be. There certainly seems to be a gap in the documentation available for setting this up, so we were happy to help so many users get started. It’s been some time since then, and Docker, NVIDIA, and AWS have all improved their feature sets, allowing us to deploy to production much more easily than before.

Previously, we needed to copy all sorts of drivers from the host machine into the container at runtime, but that had a bit of a “hackish” feel to it. Thankfully, NVIDIA has since released nvidia-docker2, and Docker now lets you easily set a different runtime, simplifying the process to a bare minimum of host-level configuration and zero configuration for the container. We can now ship containers to a heterogeneous cluster without as much concern for deeper levels of compatibility as in the past, and magically, your favorite TensorFlow, Caffe, PyTorch, or other GPU-dependent image Just Works™.

The first step is preparing the underlying EC2 instance. You can do this in a variety of ways, either by booting an image with the GPU drivers pre-installed or by booting your favorite flavor and installing them manually. In our case, we install both Docker/ECS and the CUDA drivers from scratch on a custom instance (covered below), but you can vary the amount of customization on your “bare metal” to a wide degree.

The Copy-n-Paste Version

Most of the configuration comes down to the snippet below. You can add it to a user-data init script in your launch configuration, or simply paste it into a running instance to test. Once configured, and assuming ecs-agent comes back up properly after dockerd restarts, you’re free to start a task in ECS with a container based on a TensorFlow, Caffe, PyTorch, etc. image.
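
Here’s a rough sketch of that snippet. It assumes an Ubuntu host with the CUDA drivers already installed, and the commands follow NVIDIA’s published nvidia-docker2 installation steps rather than reproducing our exact production script:

# Add NVIDIA's apt repository for nvidia-docker2
curl -s -L https://nvidia.github.io/nvidia-docker/gpgkey | sudo apt-key add -
distribution=$(. /etc/os-release; echo $ID$VERSION_ID)
curl -s -L https://nvidia.github.io/nvidia-docker/$distribution/nvidia-docker.list | \
  sudo tee /etc/apt/sources.list.d/nvidia-docker.list
sudo apt-get update

# Install nvidia-docker2, which brings in the NVIDIA container runtime
sudo apt-get install -y nvidia-docker2

# Make the NVIDIA runtime the default so every container gets GPU access
sudo tee /etc/docker/daemon.json <<'EOF'
{
  "default-runtime": "nvidia",
  "runtimes": {
    "nvidia": {
      "path": "/usr/bin/nvidia-container-runtime",
      "runtimeArgs": []
    }
  }
}
EOF

# Restart dockerd so the new default runtime takes effect
# (ecs-agent should come back up and re-register afterwards)
sudo systemctl restart docker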

On an Ubuntu image with CUDA drivers installed, this will add the nvidia-docker2 package and configure Docker to use the NVIDIA runtime instead of the default. This has the fantastic side-effect that ECS will start the container without any further customization of the image or running container itself. In other words, gone are the days of embedding scripts into the image, copying drivers into the running container, and attaching volumes that expose the host.

Finally, you’ll need to ensure your task definition enables privileged mode, but that’s as easy as adding the following to your task JSON in the “containerDefinitions” block:

"privileged": true

That’s it!

A Fully Custom Example

As mentioned above, you can dial the amount of customization up or down depending on your base AMI. Given a bare Ubuntu image (for example, ami-80861296), you can hit the ground running with a user-data script in your Launch Configuration that sets everything up from scratch.
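
A sketch of such a script — installing the CUDA drivers, Docker CE, nvidia-docker2, and the ECS agent on Ubuntu 16.04 — might look like the following. The repository URLs come from the NVIDIA, Docker, and AWS installation guides, and the cluster name “gpu-cluster” is a placeholder:

#!/bin/bash
# Sketch of a user-data script for a bare Ubuntu 16.04 AMI:
# installs the CUDA drivers, Docker CE, nvidia-docker2, and the ECS agent.
set -e
export DEBIAN_FRONTEND=noninteractive

apt-get update
apt-get install -y curl apt-transport-https ca-certificates software-properties-common \
  build-essential linux-headers-$(uname -r)

# NVIDIA CUDA driver repository
curl -fsSL https://developer.download.nvidia.com/compute/cuda/repos/ubuntu1604/x86_64/7fa2af80.pub | apt-key add -
echo "deb https://developer.download.nvidia.com/compute/cuda/repos/ubuntu1604/x86_64 /" > /etc/apt/sources.list.d/cuda.list

# Docker CE repository
curl -fsSL https://download.docker.com/linux/ubuntu/gpg | apt-key add -
add-apt-repository "deb [arch=amd64] https://download.docker.com/linux/ubuntu $(lsb_release -cs) stable"

# nvidia-docker2 repository
curl -s -L https://nvidia.github.io/nvidia-docker/gpgkey | apt-key add -
distribution=$(. /etc/os-release; echo $ID$VERSION_ID)
curl -s -L https://nvidia.github.io/nvidia-docker/$distribution/nvidia-docker.list > /etc/apt/sources.list.d/nvidia-docker.list

# Install everything (nvidia-docker2 may pin a specific docker-ce version; adjust as needed)
apt-get update
apt-get install -y cuda-drivers docker-ce nvidia-docker2

# Set the NVIDIA runtime as Docker's default
cat > /etc/docker/daemon.json <<'EOF'
{
  "default-runtime": "nvidia",
  "runtimes": {
    "nvidia": {
      "path": "/usr/bin/nvidia-container-runtime",
      "runtimeArgs": []
    }
  }
}
EOF
systemctl restart docker

# Start the ECS agent and join the cluster ("gpu-cluster" is a placeholder)
mkdir -p /etc/ecs /var/log/ecs /var/lib/ecs/data
docker run --name ecs-agent --detach --restart=on-failure:10 \
  --net=host \
  --volume=/var/run/docker.sock:/var/run/docker.sock \
  --volume=/var/log/ecs:/log \
  --volume=/var/lib/ecs/data:/data \
  --env=ECS_LOGFILE=/log/ecs-agent.log \
  --env=ECS_DATADIR=/data \
  --env=ECS_CLUSTER=gpu-cluster \
  amazon/amazon-ecs-agent:latest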

It’s fantastic to see NVIDIA improving the nvidia-docker2 project and making this integration process so easy for their users. Additional details and setup instructions can be found here: https://github.com/nvidia/nvidia-docker/wiki/Installation-(version-2.0)

How do you integrate GPUs in a production Docker environment? We’re excited to hear more in the comments below…


Co-Founder of @CloudSightAPI, @CamFind, @TapTapSee. Photographer. Musician. Plant-based athlete.