By Martín Beyer

Quickly validate whether SageMaker Neo will lower latency and costs. You have your shiny new model and want it to run faster and cheaper, so let’s get up and running.


In this post we will:
  1. Save a trained Keras model
  2. Compile it with SageMaker Neo
  3. Deploy it to EC2


1. Compile model

Where you feed your model to Neo.

We will be using a pretrained InceptionV3 for this tutorial, but you can use any model you want or even your own custom one.


Note that Neo supports every popular DL framework. We will show the process for Keras, but it can be applied to TensorFlow or PyTorch with minor changes; these differences are listed at


from keras.applications.inception_v3 import InceptionV3
from keras.layers import Input

model = InceptionV3(weights='imagenet', input_tensor=Input(shape=(224, 224, 3)))
model.summary()  # prints the layer table shown below
model.save('InceptionV3.h5')


Layer (type)                    Output Shape         Param #     Connected to                     
input_1 (InputLayer)            (None, 224, 224, 3)  0                                            
conv2d_1 (Conv2D)               (None, 111, 111, 32) 864         input_1[0][0]                    


Here you should focus only on the last line: the only thing that matters from now on is that you have your model saved as an .h5 file.


Neo expects a compressed .tar.gz file with the .h5 inside, so:

tar -czvf InceptionV3.tar.gz InceptionV3.h5 
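If you would rather stay in Python than shell out to tar, the same archive can be produced with the standard library's tarfile module. This is only a minimal sketch: the helper names make_neo_archive and list_archive are ours, and it assumes the .h5 file should sit at the root of the archive, as in the command above.

```python
import tarfile
from pathlib import Path


def make_neo_archive(h5_path, archive_path):
    """Pack a single .h5 file into a gzip-compressed tar archive.

    The file is stored under its bare name (no parent directories),
    mirroring `tar -czvf InceptionV3.tar.gz InceptionV3.h5`.
    """
    h5_path = Path(h5_path)
    with tarfile.open(archive_path, "w:gz") as archive:
        archive.add(h5_path, arcname=h5_path.name)
    return Path(archive_path)


def list_archive(archive_path):
    """Return the member names stored in the archive, for a quick sanity check."""
    with tarfile.open(archive_path, "r:gz") as archive:
        return archive.getnames()
```

Calling make_neo_archive("InceptionV3.h5", "InceptionV3.tar.gz") and then list_archive("InceptionV3.tar.gz") should show a single entry, the .h5 file.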


We are almost ready to compile our model; we just need to upload it to an S3 bucket and get its path.


If you’re not familiar with AWS S3, don’t worry: it’s like Google Drive but for devs. There are plenty of resources on how to use it, and it is only needed in this step.
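For those who prefer a script over the S3 web console, the upload might look roughly like the sketch below. It assumes boto3 is installed and AWS credentials are already configured; the function names, bucket name and key are placeholders of ours, not values from this tutorial.

```python
def s3_uri(bucket, key):
    """Build the s3:// path that the compilation job will ask for."""
    return "s3://{}/{}".format(bucket, key.lstrip("/"))


def upload_model(local_path, bucket, key):
    """Upload the archive to S3 and return its s3:// path.

    boto3 is imported lazily so that s3_uri() works without it.
    Assumes AWS credentials are configured on this machine.
    """
    import boto3  # third-party: pip install boto3

    boto3.client("s3").upload_file(local_path, bucket, key)
    return s3_uri(bucket, key)


# Hypothetical usage (bucket and key are made up):
# upload_model("InceptionV3.tar.gz", "my-example-bucket", "models/InceptionV3.tar.gz")
```

The returned string is the path you will paste into the compilation job later on.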


Getting the S3 path

Actually compiling the model

This part can be done in three ways:

  • AWS Console
  • AWS CLI

We are taking the first route since the others require some setup. Go to the AWS Console, then to SageMaker, and finally to Compilation Jobs on the left bar.


Job settings:

Here you should simply give the job a name and create a role for it (this sets up permissions to S3).

For ease, choose Any S3 Bucket and then create the role.

Giving job permissions

Input configuration:

Input configuration data

In Location of model artifacts you should enter the path to the model you uploaded; it should start with s3://.

Data input configuration expects {"<input_layer_name>": [1, <number_image_channels>, <image_height>, <image_width>]}. You can get this data by calling model.summary() and looking at the first layer’s info:

Layer (type)                    Output Shape         Param #     Connected to                     
input_1 (InputLayer)            (None, 224, 224, 3)  0                                            
conv2d_1 (Conv2D)               (None, 111, 111, 32) 864         input_1[0][0]                    


Our model’s input layer is  input_1 and by looking at  (None, 224, 224, 3) we can get the rest of the numbers we need.

So in our case it would be {"input_1": [1, 3, 224, 224]}. If you look carefully you will notice that the numbers are in a different order. This is very important: the compilation won’t work otherwise.


If you are curious about why that is: Neo expects Keras model artifacts in NHWC (Number of samples, Height, Width, Number of Channels) but, weirdly enough, Data input configuration should be specified in NCHW (Number of samples, Number of Channels, Height, Width). For more info on this, including formats for PyTorch and TensorFlow, or if your model has multiple inputs, look at What input data shapes does Neo expect?
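Since the reordering is easy to get wrong by hand, here is a small sketch that derives the configuration string from the shape printed by model.summary(). The helper name neo_data_input_config is ours, and it assumes a single 4-D image input with the batch size fixed to 1, as in this tutorial.

```python
import json


def neo_data_input_config(input_name, keras_shape):
    """Turn a Keras NHWC input shape into the NCHW data input configuration.

    keras_shape is the shape printed by model.summary(),
    e.g. (None, 224, 224, 3). The batch dimension is replaced by 1.
    """
    if len(keras_shape) != 4:
        raise ValueError("expected a 4-D (batch, height, width, channels) shape")
    _, height, width, channels = keras_shape
    return json.dumps({input_name: [1, channels, height, width]})


print(neo_data_input_config("input_1", (None, 224, 224, 3)))
# prints {"input_1": [1, 3, 224, 224]}
```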


Output configuration:

Finally, in S3 output location you should type the path to an S3 folder where you want the output artifacts to be saved.

In Target device you have to specify the platform you are going to deploy this model to; this is crucial for the optimization to work properly. In this guide we are deploying on Cloud. If you are interested in deploying on Edge devices, stay tuned for Part 2.


Why is this crucial? At the surface level, Neo needs to know whether the model will run on GPU or CPU in order to optimize the computation graph. If it is going to run on GPU, it may combine multiple layers, like conv2d+bn+relu, to avoid saving unnecessary intermediate results to memory, achieving lower latency and a smaller memory footprint. At a lower level, it will actually generate platform-specific code like CUDA or LLVM IR.
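To make the idea of combining layers concrete, here is a toy illustration in plain Python. It is not how Neo does it internally: for brevity it uses a dense layer instead of conv2d, and all function names are ours. The point is only that batch norm can be folded into the preceding layer's weights and ReLU applied on the fly, so the fused version gives the same output without storing the intermediate results.

```python
import math


def dense(x, weights, bias):
    """y[j] = sum_i x[i] * weights[i][j] + bias[j]"""
    return [
        sum(x[i] * weights[i][j] for i in range(len(x))) + bias[j]
        for j in range(len(bias))
    ]


def batch_norm(y, gamma, beta, mean, var, eps=1e-3):
    """Inference-time batch norm with fixed statistics."""
    return [
        gamma[j] * (y[j] - mean[j]) / math.sqrt(var[j] + eps) + beta[j]
        for j in range(len(y))
    ]


def relu(y):
    return [max(0.0, v) for v in y]


def unfused(x, weights, bias, gamma, beta, mean, var):
    """Three separate ops, each materialising an intermediate list."""
    return relu(batch_norm(dense(x, weights, bias), gamma, beta, mean, var))


def fold_batch_norm(weights, bias, gamma, beta, mean, var, eps=1e-3):
    """Fold the batch-norm scale and shift into the dense weights and bias."""
    scale = [gamma[j] / math.sqrt(var[j] + eps) for j in range(len(bias))]
    folded_w = [[w * scale[j] for j, w in enumerate(row)] for row in weights]
    folded_b = [(bias[j] - mean[j]) * scale[j] + beta[j] for j in range(len(bias))]
    return folded_w, folded_b


def fused(x, folded_w, folded_b):
    """One pass: dense with folded parameters, ReLU applied as each value is produced."""
    return [
        max(0.0, sum(x[i] * folded_w[i][j] for i in range(len(x))) + folded_b[j])
        for j in range(len(folded_b))
    ]
```

The folding is done once, ahead of time, which is roughly the role a compiler plays; at inference time only fused() runs.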

All that is left is to hit Create, and wait for the compilation to be completed.
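For reference, the fields filled in through the console correspond to parameters of SageMaker's CreateCompilationJob API, so a scripted route could look roughly like the sketch below. It assumes boto3 and configured credentials; the helper names, the job name, role ARN, S3 paths and the 900-second timeout are placeholders of ours, not values from this tutorial.

```python
def build_compilation_request(job_name, role_arn, model_s3_uri, output_s3_uri,
                              data_input_config, target_device="ml_m5"):
    """Assemble the keyword arguments for create_compilation_job."""
    return {
        "CompilationJobName": job_name,
        "RoleArn": role_arn,
        "InputConfig": {
            "S3Uri": model_s3_uri,                 # Location of model artifacts
            "DataInputConfig": data_input_config,  # e.g. '{"input_1": [1, 3, 224, 224]}'
            "Framework": "KERAS",
        },
        "OutputConfig": {
            "S3OutputLocation": output_s3_uri,
            "TargetDevice": target_device,
        },
        # Arbitrary timeout chosen for this sketch.
        "StoppingCondition": {"MaxRuntimeInSeconds": 900},
    }


def submit_compilation_job(request):
    """Send the request to SageMaker. Needs boto3 and AWS credentials."""
    import boto3  # third-party: pip install boto3

    return boto3.client("sagemaker").create_compilation_job(**request)
```

Keeping the request construction separate from the network call makes it easy to inspect the payload before anything is sent.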


2. Deploy the model

Up to this point we have:

  1. Saved our Keras model
  2. Compiled the model

Let’s have a look at the outputs of each step. At step 1 we get a file containing all the model data, which can be used for inference by loading it with Keras’ keras.models.load_model(). However, at step 2 the artifacts Neo returns are no longer usable with Keras; they can only be run using a specific Neo runtime.

This means that you only have to install Neo’s runtime on the target device, with no need for Keras or TensorFlow, which by the way saves a ton of space and installation overhead.


This may seem odd at first, but it is needed in order to achieve the lower latency. Why? Well, when you save a model from Keras, you are actually saving a graph. This graph specifies what operations should be done and in what order. An example may be feeding the input through a Dense layer, getting the output and passing it as an input to a Softmax layer.

Then, when it is time to make an inference, Keras reads the graph and runs the operations specified to give an output.

After Neo compiles the model, the graph has operations that are not defined in Keras, like the earlier example conv2d+bn+relu. This means Keras would not know what code to run, which is one of the reasons a specific runtime is needed.

In fact, among the artifacts of the compiled model there is a file that has optimized code for every operation in the graph, which is then used by the runtime.
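A toy model of this idea, in plain Python, may help. It is not how Keras or the Neo runtime is implemented, and the op names and kernels are made up: a graph is just an ordered list of op names, and a runtime is a table mapping each name to code. A runtime that has never heard of the fused op simply cannot execute the compiled graph.

```python
def run_graph(graph, kernels, x):
    """Run each op of the graph in order, looking up its implementation by name."""
    for op_name in graph:
        if op_name not in kernels:
            raise KeyError("no implementation for op: " + op_name)
        x = kernels[op_name](x)
    return x


# Kernels a generic framework might ship with (toy scalar versions).
framework_kernels = {
    "scale": lambda v: 2.0 * v,
    "shift": lambda v: v - 3.0,
    "relu": lambda v: max(0.0, v),
}

# The runtime shipped alongside the compiled model knows the fused op.
compiled_kernels = {"scale+shift+relu": lambda v: max(0.0, 2.0 * v - 3.0)}

original_graph = ["scale", "shift", "relu"]
compiled_graph = ["scale+shift+relu"]
```

Running original_graph with framework_kernels and compiled_graph with compiled_kernels gives the same result, while running compiled_graph with framework_kernels raises a KeyError.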


So, since we chose ml_m5 as the Target device, we are going to launch an Ubuntu EC2 m5.large instance and install the necessary dependencies.


sudo apt-get update
# Assumes pip3 is not already present on the instance
sudo apt-get install -y python3-pip
pip3 install dlr numpy

Then we have to upload the compiled files to the instance.

Download InceptionV3-ml_m5.tar.gz (from the S3 output location) and extract it. To upload it to the instance you can use scp from your machine like this:


scp -i "<path to pem file>" compiled.tar ubuntu@<ec2 ip>:~/compiled_model_neo.tar


Then on the instance we do some tidying up:

Note that the artifacts should all be inside a folder, in this case inception.


tar xvf compiled_model_neo.tar
mkdir inception
mv compiled.params inception/compiled.params
mv compiled_model.json inception/compiled_model.json
mv inception/


Now all that is left to run the model is to create a .py file and run it.




import numpy as np
from dlr import DLRModel
# Load the compiled model
model = DLRModel('inception', 'cpu')
print("Input", model.get_input_names())
# Load an image stored as a numpy array
image = np.load('inception/image.npy').astype(np.float32)
input_data = {'input_1': image}
# Predict
out = model.run(input_data)
top1 = np.argmax(out[0])
prob = np.max(out)
print("Class: %s, probability: %f" % (top1, prob))
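Since the goal is to validate whether latency actually drops, it is worth timing the prediction rather than running it once. The helper below is a minimal sketch using only the standard library; the name benchmark and the warmup and run counts are our own choices, not part of DLR.

```python
import statistics
import time


def benchmark(predict, warmup=5, runs=50):
    """Time a zero-argument callable and return latency statistics in milliseconds."""
    for _ in range(warmup):
        predict()  # let caches and lazy initialisation settle
    timings_ms = []
    for _ in range(runs):
        start = time.perf_counter()
        predict()
        timings_ms.append((time.perf_counter() - start) * 1000.0)
    timings_ms.sort()
    p95_index = min(len(timings_ms) - 1, int(0.95 * len(timings_ms)))
    return {
        "mean_ms": statistics.mean(timings_ms),
        "median_ms": statistics.median(timings_ms),
        "p95_ms": timings_ms[p95_index],
    }
```

On the instance you could call benchmark(lambda: model.run(input_data)) and compare it against the same measurement for the uncompiled Keras model on the same machine and input.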



Final thoughts

Today we compiled a Keras model and deployed it to an EC2 instance. However, you may want to deploy to an edge device, like an NVIDIA Jetson or a Raspberry Pi. This is not only possible, it also benefits greatly from the optimization, since without it the model may not even run.
