How to convert any Machine Learning model for an iOS application and build it into the project

In this article, we will go through all the steps of building and training a custom machine learning model, converting it into the proper format to work with an iOS application, and then building a small iOS application using this model. We will do this using image style transfer as an example.

Steps in this article:

  • How to handle ML on iOS
  • Knowing model architecture and converting it to the proper format
  • How Style Transfer works and improvements for mobile
  • Training our own model
  • Building a simple iOS application to transform images with painting styles

How to handle ML models on iOS?

At WWDC 2017, Apple released CoreML. This is a framework dedicated to working with ML models on Apple platforms. Before that, integrating any ML model into an iOS app was relatively complicated. CoreML requires models in the .mlmodel format (which can be converted from the most popular ML frameworks with Apple's CoreML Tools). After conversion, it is very easy to use: you can just drag and drop the model into your project and it is immediately available. You can access it from code without any additional setup. If you click a model file, Xcode will show an interface where you can check the model's properties:

As you can see, it is very simple, intuitive, and user-friendly. It shows the most important properties you need to work with a model in a very clear and visual way. If you need more details, you can click the arrow next to the model name in the Model Class section. This will open a Swift file where you can inspect the generated model code directly, for example, what the prediction() function looks like.

CoreML is not the only way to work with ML models on iOS, but it is highly recommended for a few reasons:

  • Simplicity of use – as mentioned above, the workflow with CoreML is maximally simplified and easy to use
  • Flexibility – you can use any model you have trained yourself, which gives you flexibility and is probably enough to solve your problem
  • Performance – CoreML is designed to optimize model computations for Apple device architecture. The newest iPhones/iPads even have specialized chips just for accelerating ML computations
  • No need for server calls – if you run the model on the device, there is no delay as there would be if inference were performed server-side. For example, if an application has to process images, it can be very important to have an instant result. The other advantage here is data privacy – the user's data never leaves the device, and Apple is known for promoting this approach

Knowing model architecture and converting it to the proper format

Let's step into converting the model to the CoreML format. In our example, we will use the PyTorch implementation of Fast Neural Style Transfer from the official PyTorch examples. We will train our model to transfer the painting style of Frida Kahlo's self-portrait. Her painting style is very distinctive, and the story behind this woman is very empowering, inspiring, and beautiful 🙂

Self-Portrait with Thorn Necklace and Hummingbird, 1940

To perform the conversion, we need to know the network architecture, the names of the network layers, the network input and output shapes (we need to provide these to the converter), as well as the data preprocessing and deprocessing pipelines. The last part is quite important and easy to forget about, because nothing will crash if you skip it. But if you do forget it, the model won't have its original performance or may fail completely. Neural networks are very fragile if you feed them data processed differently from the data they were trained on.

This is because input images during training are transformed into tensors – multidimensional matrices of values. These tensors are then transformed in a statistical way that improves and speeds up the training process. But if you later feed this network data in a completely different form – for example, color values in the range 0 to 255 when the network was trained on values between 0 and 1 – then the output of the network will make no sense, because it was not trained to operate on a totally different statistical distribution of data.
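
As a simple illustration (not the exact pipeline from the example repository), this is what a typical training-time preprocessing step looks like in torchvision; whatever you do here has to be reproduced exactly at inference time. The file name is hypothetical.

```python
import torch
from torchvision import transforms
from PIL import Image

# Illustrative training-time preprocessing: the same transforms must be
# reproduced exactly at inference time, otherwise the model sees a
# different statistical distribution of pixel values.
preprocess = transforms.Compose([
    transforms.Resize(256),
    transforms.CenterCrop(256),
    transforms.ToTensor(),          # converts 0-255 pixel values to floats in [0, 1]
])

image = Image.open("photo.jpg")                  # hypothetical file name
tensor = preprocess(image).unsqueeze(0)          # add batch dimension: (1, 3, 256, 256)
```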

If the converter supports including preprocessing layers in the .mlmodel, you can do that; otherwise, you will have to handle it in the code of the iOS app.
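
For example, newer versions of coremltools let you attach simple scale/bias preprocessing to an image input so it is baked into the .mlmodel. This is only a sketch for the case where the direct PyTorch route works for your model; `model` is assumed to be your trained network, and the scale/bias values are placeholders that must match your own training pipeline.

```python
import torch
import coremltools as ct

example = torch.randn(1, 3, 256, 256)
traced = torch.jit.trace(model, example)   # `model` is your trained network (assumption)

# Scale/bias below are placeholders: they map 0-255 pixels to the [0, 1] range,
# which is only correct if that is what the network was trained on.
mlmodel = ct.convert(
    traced,
    inputs=[ct.ImageType(name="input", shape=example.shape, scale=1 / 255.0, bias=[0.0, 0.0, 0.0])],
)
```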

To learn the input/output names and shapes, you can use built-in functions like summary() in Keras or simply print() the model in PyTorch. This will print out all the layers, their names, and so on.
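
For instance, in PyTorch this is all it takes (VGG16 is used here only as an example model):

```python
from torchvision import models

# Print the module hierarchy to see layer types and their structure
model = models.vgg16()
print(model)

# named_modules() gives the exact layer names you may need for the converter
for name, module in model.named_modules():
    print(name, "->", type(module).__name__)
```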

Also, there are more user-friendly visual tools for exploring model structures. The most recommended one is Netron. You can check its GitHub page; it's very easy to install. This visual representation is much more readable and intuitive for human eyes:

However, this might not be enough. A visualization will not tell you everything about the model (such as pre- and post-processing). That's why it's good to use models with good documentation or to train them yourself.

How Style Transfer works and improvements for mobile

The original version of the algorithm was presented for the first time in 2015 by a team led by Leon Gatys and published in a paper. Since then, many modifications of this algorithm have appeared, and many mobile applications use those modified versions.

The algorithm takes advantage of the way convolutional neural networks work. This kind of network emerged from studies of how our brain processes visual perception and has been used since the 1980s. David H. Hubel and Torsten Wiesel did research on animals and proved that neurons in the visual cortex have a small receptive field, meaning they react only to a very small part of the visual field. They showed that some neurons have larger fields they react to and detect more complex patterns, combined from the patterns detected by lower-level neurons. They also noticed that neurons are connected in a kind of hierarchical order, where a neuron is connected only to a few neurons from the previous layer.

Their work was very impactful, and in 1981 they received the Nobel Prize in Physiology or Medicine. It also started new research on neural networks that are not fully connected – not every neuron from the previous layer connects to every neuron in the next layer. The next milestone came in 1998, when the LeNet-5 architecture was published. It introduced two new building blocks: convolutional layers and pooling layers.

Most typical CNN architecture
Source: https://www.researchgate.net/figure/Schematic-diagram-of-a-basic-convolutional-neural-network-CNN-architecture-26_fig1_336805909

Convolutional layer: neurons in the first layer are organized in such a way that they are not connected to every pixel of the input but only to the pixels inside their receptive field.

The blue layer is the image layer (convolutional networks are not used only on images, but here we will focus on this most common usage). The window that slides over it has a 3×3 shape (with stride = 1, because it moves by just one pixel at a time). We call this a 3×3 convolutional layer. The green output is called a feature map, which is later sent to another convolutional or pooling layer.
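
In PyTorch such a layer is a one-liner; the channel counts and image size below are only illustrative:

```python
import torch
import torch.nn as nn

# A 3x3 convolution with stride 1, as in the figure above
conv = nn.Conv2d(in_channels=3, out_channels=16, kernel_size=3, stride=1)

image = torch.randn(1, 3, 32, 32)        # one RGB image, 32x32 pixels
feature_maps = conv(image)
print(feature_maps.shape)                # torch.Size([1, 16, 30, 30])
```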

Pooling layer: the goal of this layer is to shrink the size of the feature map. There are a few different pooling techniques, but the most popular one is called "max pooling" and works like this:

Max pooling with a 2×2 filter and stride = 2
source: https://en.wikipedia.org/wiki/Convolutional_neural_network
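
In code, the same 2×2 / stride 2 max pooling looks like this (sizes are illustrative):

```python
import torch
import torch.nn as nn

# Max pooling with a 2x2 window and stride 2 halves each spatial dimension
pool = nn.MaxPool2d(kernel_size=2, stride=2)

feature_maps = torch.randn(1, 16, 30, 30)
print(pool(feature_maps).shape)          # torch.Size([1, 16, 15, 15])
```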

As mentioned above, the network extracts feature maps in a hierarchical way: the first layers detect low-level features like edges, dots, and very simple patterns. Going up to the higher layers, they start to recognize more and more complex patterns composed of the ones recognized in the lower layers.

There are many different methods for debugging CNNs and gaining insight into what is going on inside. Many of these techniques allow visualizing what the network is seeing and which patterns excite which neurons the most. This way we can visualize, more or less, what is going on while classifying faces:

In the original paper on Artistic Neural Style Transfer we have this schema:

So, in Neural Style Transfer we want to preserve the content of the image but apply some chosen painting style. As we can see, in a convolutional network some layers capture more of the content and others more of the style. Style transfer uses this property of the network.

The first version of the algorithm is better described as an optimization problem than as training a network. It uses an already-trained network that has learned some hierarchical representation of features. In the beginning, the network is fed a blank canvas. Then we designate some layers as content layers and others as style layers. When the blank image is fed in, the activation of every chosen layer can be measured and compared to the proper reference image (style layers against the style image, content layers against the content image). This way we build a loss function that consists of a style loss and a content loss. By minimizing the loss, we manipulate the pixels of this initially blank image so that it looks more and more the way we want.

Content loss:

L_{content} \ =\ \frac{1}{2} \ \sum _{i,j}( F_{i,j} \ -\ P_{i,j})^{2}
  • F_{i,j} – responses of the layer's feature maps on the generated image
  • P_{i,j} – responses of the layer's feature maps on the content image
  • i, j – a layer has N feature maps, each of size width × height. P_{i,j} is the activation of the i-th feature map at position j in the layer

Style loss:

Here the first step is to build a style representation. It computes correlations between the responses of different feature maps. These feature correlations are captured by a Gram matrix:

G^{l}_{ij} \ =\ \sum _{k} F^{l}_{ik} F^{l}_{kj}

Generating an image that matches the desired style is done by minimizing the mean-squared distance between the entries of the Gram matrix from the original image and the Gram matrix of the image to be generated:

E_{l} \ =\ \frac{1}{4N^{2}_{l} M^{2}_{l}}\sum _{ij}\left( G^{l}_{ij} \ -\ A^{l}_{ij}\right)^{2}
L_{style} \ =\ \sum ^{L}_{l=0} w_{l} E_{l}
  • G^{l}_{ij} – Gram matrix of the vectorised feature maps of the generated image
  • A^{l}_{ij} – Gram matrix of the vectorised feature maps of the style image
  • N – number of feature maps
  • M – the size of each feature map (width × height)
  • w_{l} – weighting factors of the contribution of each layer to the total loss (specified in the paper)

Global loss:

L_{total} \ =\ \alpha L_{content} \ +\ \beta L_{style}

The global loss looks like this. Our image is generated by minimizing this loss with a standard gradient descent algorithm, changing the pixels of the initially blank image. The \alpha and \beta weights are quite interesting: by tweaking them you can control how much content or style you want in the image, which has a big impact on the final result.
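
Here is a compact PyTorch sketch of the three losses, written directly from the formulas above. The feature maps would normally come from selected VGG layers; random tensors and the alpha/beta values are used only for illustration.

```python
import torch

def content_loss(F_gen, P_content):
    # 1/2 * sum of squared differences between feature maps of one layer
    return 0.5 * torch.sum((F_gen - P_content) ** 2)

def gram_matrix(features):
    # features: (N, M) - N vectorised feature maps, each of size M = width * height
    return features @ features.t()

def layer_style_loss(F_gen, F_style):
    N, M = F_gen.shape
    G, A = gram_matrix(F_gen), gram_matrix(F_style)
    return torch.sum((G - A) ** 2) / (4 * N ** 2 * M ** 2)

# In a real run these would be activations of chosen VGG layers
F_gen, P_content = torch.randn(64, 1024), torch.randn(64, 1024)
style_layers = [(torch.randn(64, 1024), torch.randn(64, 1024))]   # (generated, style) pairs
w = [1.0]                                                          # per-layer weights w_l

L_content = content_loss(F_gen, P_content)
L_style = sum(w_l * layer_style_loss(f, s) for w_l, (f, s) in zip(w, style_layers))

alpha, beta = 1.0, 1e3        # content/style trade-off (illustrative values)
L_total = alpha * L_content + beta * L_style
```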

To summarise…

This algorithm is beautiful and, in my opinion, brilliantly uses the properties of CNNs to create these nice and satisfying results. Here I will leave you a link to a Keras implementation of this algorithm. Feel free to experiment a little bit by yourself.

Despite its brilliance, this algorithm is based on an optimization process that takes a lot of time. If you think about mobile users, you know that they will not wait, say, half an hour to have their image processed. But since the Gatys paper, there have been many improvements to this algorithm. In our case, we will use "Fast Style Transfer" from the official PyTorch examples repository, which is based on the paper "Perceptual Losses for Real-Time Style Transfer and Super-Resolution" by Justin Johnson, Alexandre Alahi, and Li Fei-Fei.

Fast Style Transfer

In short, this approach trains a transformation network and builds the loss function using a pre-trained VGG network. The loss function is based on principles similar to the Gatys approach, but once the transformation network is trained, it requires only one forward pass to generate an image. It is vastly faster, but a trained model can generate images for only one style. There are better models that can generate images for any given style, like this one, but for mobile usage it is enough to train for just one style. The most important thing is the speed with which it stylizes images.

source: https://arxiv.org/pdf/1603.08155.pdf

So this approach builds the loss functions in a similar way to the previous optimization algorithm. By this, I mean that the "Image Transform Net" first generates an image, and then this image is processed by a trained VGG network. The responses of selected feature maps are compared with those of the style and content images. The losses and gradients are computed, and then the weights W of the Transform Network are updated according to the calculated gradients.

Let's take a look at the code:

All code for training Style Transfer network and conversion to CoreML format is in this Colab notebook: https://colab.research.google.com/drive/1kq7hXi75oSiqZjqPn2q-rIvl3Fa4st5l#scrollTo=NrqLq9tjx0Bz

The first thing is the training data. In Colab, we mount Google Drive to be able to save trained models persistently. Then the style image is downloaded. I used the famous painting "The Two Fridas" because it has very specific colors and textures (for example, the clouds). The training dataset is downloaded into Colab's storage; it is Microsoft's COCO dataset – the same one used in the original paper.
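
The exact cells are in the notebook; roughly, the setup part looks like this (the COCO download link may change, so check cocodataset.org):

```python
# In Colab: mount Google Drive so trained weights survive the session
from google.colab import drive
drive.mount('/content/drive')

# Download the training images into Colab's local storage
# (COCO train2014, around 13 GB - check cocodataset.org for the current link)
!wget http://images.cocodataset.org/zips/train2014.zip
!unzip -q train2014.zip -d dataset/
```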

Then we have the two network architectures defined:
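
The full definitions live in the notebook and the PyTorch example; below is only a heavily condensed sketch of their overall shape (the real transformation network also contains residual blocks and more layers), so you can see the two roles: a trainable image transformation network and a frozen VGG feature extractor used only for the losses.

```python
import torch
import torch.nn as nn
from torchvision import models

class TransformerNetSketch(nn.Module):
    """Simplified stand-in for the image transformation network:
    downsampling convs -> (residual blocks omitted here) -> upsampling convs."""
    def __init__(self):
        super().__init__()
        self.down = nn.Sequential(
            nn.Conv2d(3, 32, kernel_size=9, stride=1, padding=4),
            nn.InstanceNorm2d(32, affine=True), nn.ReLU(),
            nn.Conv2d(32, 64, kernel_size=3, stride=2, padding=1),
            nn.InstanceNorm2d(64, affine=True), nn.ReLU(),
        )
        self.up = nn.Sequential(
            nn.Upsample(scale_factor=2, mode='nearest'),
            nn.Conv2d(64, 3, kernel_size=9, stride=1, padding=4),
        )

    def forward(self, x):
        return self.up(self.down(x))

class VGGFeatures(nn.Module):
    """Frozen VGG16 used only to extract feature maps for the losses."""
    def __init__(self):
        super().__init__()
        self.features = models.vgg16(pretrained=True).features[:23]
        for p in self.parameters():
            p.requires_grad = False

    def forward(self, x):
        return self.features(x)
```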

Then we have the "tweakable" parameters section:
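
These are the usual knobs for fast style transfer; the values and the checkpoint path below are illustrative, not the exact ones from the notebook.

```python
# "Tweakable" training parameters (illustrative values)
EPOCHS = 2
BATCH_SIZE = 4
LEARNING_RATE = 1e-3
IMAGE_SIZE = 256                  # training crop size
CONTENT_WEIGHT = 1e5              # alpha
STYLE_WEIGHT = 1e10               # beta
CHECKPOINT_DIR = "/content/drive/My Drive/style_transfer"   # hypothetical path
```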

Next, we have a section where we instantiate our networks, datasets, dataset transformations, loss, optimization functions, etc.:
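
Continuing with the placeholder names from the sketches above, that section boils down to something like this:

```python
import torch
from torch.optim import Adam
from torchvision import datasets, transforms

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Content images with the usual resize/crop/to-tensor pipeline
transform = transforms.Compose([
    transforms.Resize(IMAGE_SIZE),
    transforms.CenterCrop(IMAGE_SIZE),
    transforms.ToTensor(),
])
train_dataset = datasets.ImageFolder("dataset", transform)
train_loader = torch.utils.data.DataLoader(train_dataset, batch_size=BATCH_SIZE, shuffle=True)

transformer = TransformerNetSketch().to(device)
vgg = VGGFeatures().to(device).eval()
optimizer = Adam(transformer.parameters(), lr=LEARNING_RATE)
mse_loss = torch.nn.MSELoss()
```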

And the training loop itself:
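
Again, this is a condensed sketch that reuses the placeholder names above; the real loop in the notebook tracks several VGG layers and adds logging and checkpointing, and the style image file name here is hypothetical.

```python
import torch
from PIL import Image

def gram(feat):                     # (B, C, H, W) -> (B, C, C), normalised Gram matrices
    b, c, h, w = feat.shape
    f = feat.view(b, c, h * w)
    return f.bmm(f.transpose(1, 2)) / (c * h * w)

# Precompute the style target once
style_image = transform(Image.open("two_fridas.jpg").convert("RGB")).unsqueeze(0).to(device)
gram_style = gram(vgg(style_image))

for epoch in range(EPOCHS):
    for batch_id, (x, _) in enumerate(train_loader):
        x = x.to(device)
        optimizer.zero_grad()

        y = transformer(x)                          # stylised batch
        feat_y, feat_x = vgg(y), vgg(x)

        content_loss = CONTENT_WEIGHT * mse_loss(feat_y, feat_x)
        style_loss = STYLE_WEIGHT * mse_loss(gram(feat_y),
                                             gram_style.expand(x.size(0), -1, -1))

        (content_loss + style_loss).backward()
        optimizer.step()

torch.save(transformer.state_dict(), CHECKPOINT_DIR + "/frida_style.pth")
```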

Converting the model

Training will probably take a few hours. Now that we have our model saved, let's convert it to the CoreML format.

CoreML Tools since version 4.x supports direct conversion from PyTorch to CoreML. But this functionality is fairly new, and in our case it turned out that one of the operations needed to convert our model is not implemented yet (the op 'reflection_pad2d'). This is CoreML Tools version 4.0b3 – not a stable release, so it has the right to be like this. But do check, because by the time you read this article that operation might already be implemented. For now, we will do it the old way: first converting the PyTorch model into the ONNX format and from there to CoreML.

PyTorch to ONNX:
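
Roughly, the export call looks like this; the file names and the 512×512 size are only examples, and `transformer` is the trained transformation network loaded from the checkpoint.

```python
import torch

# dummy_input fixes the input shape baked into the exported model
# (512x512 is only an example - pick the size you actually want to support)
dummy_input = torch.randn(1, 3, 512, 512)

torch.onnx.export(
    transformer,                    # the trained transformation network
    dummy_input,
    "frida_style.onnx",
    input_names=["input"],          # names referenced again during the CoreML conversion
    output_names=["output"],
)
```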

Here is one small note. With dummy_input we specify the input shape that will later have to be respected in the Swift code. Our PyTorch model is very flexible and works for any input shape we want. With CoreML Tools it is possible to specify flexible inputs, but by controlling the input size you also control the effect of Style Transfer. For example, you may have tuned your network to look good on smaller images, which is generally desirable if you want a visible "painting strokes" effect; applied to a high-resolution image, this effect may disappear and simply not be visible to the human eye. Another thing about the input size is that the smaller it is, the faster the model will run on the device once we build it into the app. So I really recommend specifying it. The app will display the running time of the style transfer.

ONNX to CoreML:
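
At the time, the onnx-coreml package handled this step. The arguments below come from that package, but treat this as a sketch and check it against the notebook; the file names are again only examples.

```python
from onnx_coreml import convert

# Convert the ONNX file to .mlmodel, treating both input and output as images
# so that CoreML works with CVPixelBuffers directly
mlmodel = convert(
    model="frida_style.onnx",
    image_input_names=["input"],
    image_output_names=["output"],
    minimum_ios_deployment_target="13",
)
mlmodel.save("FridaStyle.mlmodel")
```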

Building the model into the iOS app

We will not focus too much on the app code and interface. I built a very simple UI, and if you want, you can access the full code in my GitHub repo: https://github.com/marta-kr/StyleTransfer. Here we will focus only on handling the .mlmodel inside our app.

First and most important: once you have added the model file to the project (anywhere), you can access it like this:
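
Xcode generates a Swift class named after the .mlmodel file; here I assume the file is called FridaStyle.mlmodel, so the class is FridaStyle. A minimal sketch:

```swift
import CoreML

// The generated class is named after the .mlmodel file
// (assumed here to be FridaStyle.mlmodel) - no extra setup needed
func loadStyleModel() -> FridaStyle? {
    return try? FridaStyle(configuration: MLModelConfiguration())
}
```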

The second important thing is that CoreML does not work directly with UIImages. To make the model interact with our data, we need to convert the chosen image to a CVPixelBuffer. For this task, I really recommend a library called CoreMLHelpers. As the author suggests, I copied into the project only the functions I needed, namely converting images to CVPixelBuffer and resizing. In our project, they are located in the CoreMLHelpers.swift file.
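
A minimal sketch of that conversion, assuming the pixelBuffer(width:height:) extension from CoreMLHelpers has been copied into the project (exact helper names can differ between versions of the library):

```swift
import UIKit
import CoreVideo

// Turn a UIImage into the CVPixelBuffer the model expects;
// pixelBuffer(width:height:) also scales the image to that size.
func makeModelInput(from image: UIImage, side: Int = 512) -> CVPixelBuffer? {
    return image.pixelBuffer(width: side, height: side)
}
```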

From this point, the steps for handling image Style Transfer are quite clear (a sketch of how they fit together follows the list):

  • Convert the image to a CVPixelBuffer with the size that the neural network expects
  • Call model.prediction() – this runs model inference on the device
  • Convert the result back to a UIImage
  • Resize it back to the original size
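
Here is a sketch of how these steps might look together. The generated class name (FridaStyle), its input/output feature names ("input"/"output"), and the UIImage(pixelBuffer:) helper are assumptions; check them against your own .mlmodel and the repo.

```swift
import UIKit
import CoreML
import CoreVideo

func stylizeImage(_ image: UIImage) -> UIImage? {
    let side = 512   // must match the input shape the model was exported with

    // 1. Convert to a CVPixelBuffer of the size the network expects
    guard let model = try? FridaStyle(configuration: MLModelConfiguration()),
          let inputBuffer = image.pixelBuffer(width: side, height: side) else {
        return nil
    }

    // 2. Run inference on the device
    guard let output = try? model.prediction(input: inputBuffer) else {
        return nil
    }

    // 3. Convert the resulting pixel buffer back to a UIImage
    //    (UIImage(pixelBuffer:) is another CoreMLHelpers extension)
    guard let stylized = UIImage(pixelBuffer: output.output) else {
        return nil
    }

    // 4. Resize back to the original image size
    UIGraphicsBeginImageContextWithOptions(image.size, false, image.scale)
    stylized.draw(in: CGRect(origin: .zero, size: image.size))
    let result = UIGraphicsGetImageFromCurrentImageContext()
    UIGraphicsEndImageContext()
    return result
}
```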

And from the place where you invoke stylizeImage(), handle updating the UI:
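
For example, inside the view controller it could look roughly like this; originalImage, imageView, and timeLabel are placeholder names for the app's own properties (the real UI code is in the GitHub repo), and timing the call is what drives the on-screen duration mentioned earlier.

```swift
// Run stylization off the main thread, then hop back to update the UI
DispatchQueue.global(qos: .userInitiated).async {
    let start = Date()
    let stylized = stylizeImage(originalImage)     // originalImage: the user's picked UIImage
    let elapsed = Date().timeIntervalSince(start)

    DispatchQueue.main.async {
        imageView.image = stylized                 // imageView / timeLabel: assumed outlets
        timeLabel.text = String(format: "Stylized in %.2f s", elapsed)
    }
}
```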

Summary

And that's it! We trained the model, converted it into the proper format, and used it in an iOS application. From the steps we took, I hope you can see how important it is to know the architecture and the pre/post-processing pipelines in order to interact with the model in the iOS app and to convert it to the required format. In my opinion, you should either train your own model, be in contact with its author, or use a ready-made but really well-documented model.