# Adacompress: Adaptive compression for online computer vision services


## Presented by

Ahmed Hussein Salamah

## Introduction

The success of artificial intelligence rests on the combination of big data and deep learning, which in turn increases the burden on network speed, computation, and storage in many applications. Image classification is one of the most important computer vision tasks, and it depends heavily on Deep Neural Networks (DNNs) to reach high performance. Recently, image classification models have increasingly been hosted in the cloud so that computational power can be shared among users, as mentioned in this paper (e.g., SenseTime, Baidu Vision, and Google Vision). Most researchers in the literature try to improve the structure and increase the depth of DNNs to achieve better performance through how features are represented and crafted using Convolutional Neural Networks (CNNs). The best-known image classification datasets (e.g., ImageNet) are compressed with JPEG, and this compression technique is optimized for the Human Visual System (HVS) rather than for machines (i.e., DNNs). To align the compression with DNNs instead of the HVS, the authors reconfigure JPEG while maintaining the same classification accuracy. JPEG is a lossy form of compression, meaning that some information is lost in exchange for an improved compression ratio.

## Methodology

**Figure 1:** Compared to the conventional solution, the solution of the authors [1] can update the compression strategy based on feedback from the backend model.

One of the major parameters that can be changed in the JPEG pipeline is the quantization table, which is the main source of the artifacts added to the image and is what makes the compression lossy, as shown in [1, 4]. The authors were motivated to change the JPEG configuration to optimize the upload rate for different cloud computer vision services without prior knowledge of the backend model or dataset. This is in contrast to the authors of [2, 3, 5], who adjust the JPEG configuration by retraining the parameters or changing the structure of the model. They addressed the lack of a predefined quantization level that lowers the image rate and quality while the deep learning model can still recognize the image, as shown in [4]. The authors in [1] used Deep Reinforcement Learning (DRL) in an online manner to choose the quantization level at which an image is uploaded to the cloud computer vision model, and this is the only approach that designs an adaptive JPEG based on an *RL mechanism*.

The approach is built on an interactive training environment that can represent any cloud computer vision service. The authors then needed a tool to evaluate and predict how a quantization level performs on an uploaded image, so they used a deep Q network agent. The agent is fed a reward function that considers two optimization objectives, accuracy and image size, and it interacts with the environment iteratively. The environment is exposed to different images with varying amounts of visually redundant information, so each image needs an adaptive choice of compression level that suits the model. Thus, they designed an explore-exploit mechanism to train the agent on different scenery; in the deep Q agent this takes the form of an inference-estimate-retrain mechanism that controls when the training procedure is restarted. The authors verify their approach by providing analysis and insight using Grad-CAM [8], showing patterns in each image together with its corresponding quality factor. Each image draws a different response from the deep model: the model is more sensitive to compression for images with large smooth areas, while it is more robust to compression for images with complex textures.

**What is a quantization table?**

Before getting to the quantization table, first look at the basic architecture of JPEG's baseline system. It has four blocks: FDCT (Forward Discrete Cosine Transform), quantizer, statistical model, and entropy encoder. The FDCT block takes an input image separated into [math] n \times n [/math] blocks ([math] 8 \times 8 [/math] in baseline JPEG) and applies a discrete cosine transform to each block, producing DCT coefficients. These coefficients take values from a relatively large discrete set and are mapped by quantization to a smaller discrete set. This is accomplished with a quantization table at the quantizer block, which is designed to preserve low-frequency information at the cost of high-frequency information. Low frequencies are favored because losing high-frequency information has less impact on the image as perceived by the human visual system.
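The FDCT and quantizer steps described above can be sketched in a few lines of Python. This is a minimal illustration rather than the authors' pipeline: it assumes [math] 8 \times 8 [/math] blocks, uses the example luminance table from Annex K of the JPEG standard, and runs on a hypothetical gradient block chosen only for demonstration.

```python
import math

N = 8  # baseline JPEG operates on 8x8 blocks

# Example luminance quantization table from Annex K of the JPEG standard.
LUMA_Q = [
    [16, 11, 10, 16, 24, 40, 51, 61],
    [12, 12, 14, 19, 26, 58, 60, 55],
    [14, 13, 16, 24, 40, 57, 69, 56],
    [14, 17, 22, 29, 51, 87, 80, 62],
    [18, 22, 37, 56, 68, 109, 103, 77],
    [24, 35, 55, 64, 81, 104, 113, 92],
    [49, 64, 78, 87, 103, 121, 120, 101],
    [72, 92, 95, 98, 112, 100, 103, 99],
]


def dct2(block):
    """Orthonormal 2-D DCT-II of an NxN block (direct, unoptimized form)."""
    def scale(u):
        return math.sqrt(1.0 / N) if u == 0 else math.sqrt(2.0 / N)

    out = [[0.0] * N for _ in range(N)]
    for u in range(N):
        for v in range(N):
            total = 0.0
            for x in range(N):
                for y in range(N):
                    total += (block[x][y]
                              * math.cos((2 * x + 1) * u * math.pi / (2 * N))
                              * math.cos((2 * y + 1) * v * math.pi / (2 * N)))
            out[u][v] = scale(u) * scale(v) * total
    return out


def quantize(coeffs, table):
    """Divide each DCT coefficient by its table entry and round to an integer."""
    return [[round(coeffs[u][v] / table[u][v]) for v in range(N)]
            for u in range(N)]


# Hypothetical test block: a smooth horizontal gradient, level-shifted by 128.
block = [[16 * y - 128 for y in range(N)] for _ in range(N)]
quantized = quantize(dct2(block), LUMA_Q)
nonzero = sum(1 for row in quantized for q in row if q != 0)
```

Because this block varies only along one direction, every row of `quantized` after the first is zero, which is the kind of sparsity the entropy encoder exploits. Larger table entries at high frequencies push more coefficients to zero, which is the knob a lower JPEG quality setting turns.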

## Problem Formulation

The authors formulate the problem by modelling the cloud deep learning service as [math] \vec{y}_i = M(x_i)[/math], which predicts a result list [math] \vec{y}_i [/math] for an input image [math] x_i [/math]. For a reference input [math] x_{\rm ref} \in X_{\rm ref} [/math], the output is [math] \vec{y}_{\rm ref} = M(x_{\rm ref}) [/math], which is treated as the ground-truth label. Similarly, [math] \vec{y}_c = M(x_c) [/math] is the output for the compressed image [math] x_{c} [/math] with quality factor [math] c [/math].

\begin{align} \tag{1} \label{eq:accuracy}
\mathcal{A} =& \sum_{k}\min_jd(l_j, g_k) \\
& l_j \in \vec{y}_c, \quad j=1,...,5 \nonumber \\
& g_k \in \vec{y}_{\rm ref}, \quad k=1, ..., {\rm length}(\vec{y}_{\rm ref}) \nonumber \\
& d(x, y) = 1 \ \text{if} \ x=y \ \text{else} \ 0 \nonumber
\end{align}

The authors divide the datasets into contextual groups [math] X [/math] following [6], and they compare results using two quantities. The first is the compression ratio [math] \Delta s = \frac{s_c}{s_{\rm ref}} [/math], where [math]s_{c}[/math] is the compressed size and [math]s_{\rm ref}[/math] is the original size. The second is the accuracy metric [math] \mathcal{A}_c [/math], which is computed from the Hamming distance between the top-5 softmax outputs for the original and compressed images, as shown in Eq. \eqref{eq:accuracy}. In the RL design stage, the input features to the DRL agent, a Deep Q Network (DQN), are continuous numerical vectors. This approach poses two challenges: (1) the state space of the RL problem is too large to cover, so the neural network would need more layers and nodes, which makes the DRL agent hard to converge and time-consuming to train; (2) the DRL agent always starts from a random initial state and must begin converging toward high reward before training of the DQN becomes effective. The authors address these problems by using a small pre-trained model, MobileNetV2 [7], as a feature extractor [math] \mathcal{E} [/math] because it is lightweight and suited to image classification; it is kept fixed while the Q network [math] \phi [/math] is trained. The output of the last convolution layer of [math] \mathcal{E} [/math] is fed as input to the Q network [math]\phi [/math], so the RL agent's policy is updated by optimizing the parameters of [math] \phi [/math].
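The two comparison quantities can be illustrated with a small Python sketch. The compression ratio follows the definition above directly. The label-matching function is only one possible reading of Eq. \eqref{eq:accuracy}: it assumes that a reference label counts as matched when any of the top-5 labels of the compressed image equals it, which may differ from the literal aggregation over [math] j [/math] in the equation. The function names and label strings are hypothetical.

```python
def top5_matches(compressed_top5, reference_labels):
    """Count reference labels found among the compressed image's top-5 labels.

    d(x, y) in Eq. 1 is the 0/1 equality indicator. Treating a reference
    label as matched when ANY top-5 label equals it is an assumption of
    this sketch, not a statement of the paper's exact definition.
    """
    top5 = list(compressed_top5)[:5]
    return sum(1 for g in reference_labels if any(l == g for l in top5))


def compression_ratio(compressed_size, reference_size):
    """Delta s = s_c / s_ref; values below 1 mean the upload became smaller."""
    if reference_size <= 0:
        raise ValueError("reference size must be positive")
    return compressed_size / reference_size


# Hypothetical labels and byte sizes, for illustration only.
y_c = ["tabby", "tiger cat", "lynx", "fox", "sock"]
y_ref = ["tabby"]
matches = top5_matches(y_c, y_ref)
ratio = compression_ratio(40_000, 100_000)
```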

## Reinforcement learning framework

The paper [1] describes the reinforcement learning problem with [math] \{\mathcal{X}, M\} [/math] as the *emulator environment*, where [math] \mathcal{X} [/math] is the contextual information formed by the user's input images [math] x [/math] and [math] M [/math] is the backend cloud model. Each RL framework is defined by its *actions and states*. The action is one of 10 discrete quality levels ranging from 5 to 95 in steps of 10, and the state is the feature extractor's output [math] \mathcal{E}(J(\mathcal{X}, c)) [/math], where [math] J(\cdot) [/math] is the JPEG output at a specific quantization level [math] c [/math]. The optimal quantization level at time [math] t [/math] is [math] c_t = {\rm argmax}_cQ(\phi(\mathcal{E}(f_t)), c; \theta) [/math], where [math] Q(\phi(\mathcal{E}(f_t)), c; \theta) [/math] is the action-value function and [math] \theta [/math] denotes the parameters of the Q network [math] \phi [/math]. In the training stage of RL, the goal is to minimize a loss function [math] L_i(\theta_i) = \mathbb{E}_{s, c \sim \rho (\cdot)}\Big[\big(y_i - Q(s, c; \theta_i)\big)^2 \Big] [/math] that changes at each iteration [math] i [/math], where [math] s = \mathcal{E}(f_t) [/math], [math]f_t[/math] is the output of the JPEG, and [math] y_i = \mathbb{E}_{s' \sim \{\mathcal{X}, M\}} \big[ r + \gamma \max_{c'} Q(s', c'; \theta_{i-1}) \mid s, c \big] [/math] is the target. Here [math] \rho(s, c) [/math] is a probability distribution over states [math] s [/math] and quality levels [math] c [/math] at iteration [math] i [/math], and [math] r [/math] is the feedback reward.
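The action space, the greedy choice [math] c_t [/math], and the target [math] y_i [/math] can be written out as a short Python sketch. The Q-values here are plain lists standing in for the output of the Q network, and the function names are hypothetical; the expectation in the loss is replaced by a single-sample squared error.

```python
# The 10 discrete quality levels: 5, 15, ..., 95.
QUALITY_LEVELS = list(range(5, 100, 10))


def greedy_quality(q_values):
    """Return the quality level c_t = argmax_c Q(s, c) for one state."""
    best = max(range(len(q_values)), key=lambda i: q_values[i])
    return QUALITY_LEVELS[best]


def td_target(reward, next_q_values, gamma):
    """Single-sample target y = r + gamma * max_c' Q(s', c')."""
    return reward + gamma * max(next_q_values)


def squared_td_error(q_sc, target):
    """Single-sample version of the loss term (y - Q(s, c))^2."""
    return (target - q_sc) ** 2


# Hypothetical Q-values for one state, one entry per quality level.
q_values = [0.1, 0.1, 0.1, 0.9, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1]
chosen = greedy_quality(q_values)
```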

The framework obtains a more accurate estimate for a selected action as the distance between the target and the action-value function's output [math] Q(\cdot)[/math] is minimized. Because no feedback signal indicates that an episode has finished, a threshold [math] T_{\rm start} [/math] is used: training begins once [math] t \geq T_{\rm start} [/math], which guarantees that enough transitions are stored in the memory buffer [math] \mathcal{D} [/math] to train on. To create these transitions for the RL agent, random trials are collected to observe how the environment reacts. After some trials have been fetched from the environment with their corresponding rewards, the randomness is decreased as the agent is trained to minimize the loss function [math] L [/math], as shown in the algorithm below. The agent thus optimizes its actions on minibatches from [math] \mathcal{D} [/math], so that the compression level predictor [math] \phi [/math] is trained on historical optimal experience. When the trained predictor [math] \phi [/math] is deployed, the RL agent drives the compression engine with the adaptive quality factor [math] c [/math] corresponding to the input image [math] x_{i} [/math].

The reward function, which evaluates the interaction between the agent and the environment [math] \{\mathcal{X}, M\} [/math], should make the reward for the selected quality factor [math] c [/math] directly proportional to the accuracy metric [math] \mathcal{A}_c [/math] and inversely related to the compression ratio [math] \Delta s = \frac{s_c}{s_{\rm ref}} [/math]. This gives [math] R(\Delta s, \mathcal{A}) = \alpha \mathcal{A} - \Delta s + \beta[/math], where [math] \alpha [/math] and [math] \beta [/math] are the coefficients of the linear combination.
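As a small worked example, the reward can be computed as follows. The values of `alpha` and `beta` and the byte sizes are hypothetical, since this summary does not state the values the authors used.

```python
def reward(accuracy, compressed_size, reference_size, alpha, beta):
    """R(delta_s, A) = alpha * A - delta_s + beta, with delta_s = s_c / s_ref."""
    delta_s = compressed_size / reference_size
    return alpha * accuracy - delta_s + beta


# Same accuracy, smaller upload: the reward should be higher.
r_small = reward(1.0, 50_000, 100_000, alpha=2.0, beta=0.0)
r_large = reward(1.0, 100_000, 100_000, alpha=2.0, beta=0.0)
```

With equal accuracy, the smaller upload earns the larger reward, and a drop in accuracy is penalized in proportion to `alpha`.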

**Algorithm:** Training the RL agent [math] \phi [/math] in environment [math] \{\mathcal{X}, M\} [/math]
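The training procedure described above (random trials first, a memory buffer [math] \mathcal{D} [/math], updates on minibatches once [math] t \geq T_{\rm start} [/math], and decreasing randomness) can be sketched with a toy stand-in. This is not the authors' algorithm: the environment is a hypothetical two-state emulator, the Q network is replaced by a small table, and the discount factor is taken as zero so that the target reduces to the reward. All names and hyperparameters are assumptions made for illustration.

```python
import random
from collections import deque

QUALITY_LEVELS = list(range(5, 100, 10))
NUM_ACTIONS = len(QUALITY_LEVELS)


def toy_environment(state, action):
    """Hypothetical stand-in for {X, M}: reward = accuracy - delta_s.

    Each toy state has a lowest 'safe' quality index. Going below it loses
    accuracy; going above it only increases the upload size.
    """
    safe = 2 + 3 * state  # state 0 -> index 2, state 1 -> index 5
    accuracy = 1.0 if action >= safe else 0.0
    delta_s = (action + 1) / NUM_ACTIONS
    return accuracy - delta_s


def train(steps=3000, t_start=100, batch_size=16, lr=0.1, seed=0):
    rng = random.Random(seed)
    q = [[0.0] * NUM_ACTIONS for _ in range(2)]  # tabular Q(s, c)
    memory = deque(maxlen=1000)                  # memory buffer D
    for t in range(steps):
        state = rng.randrange(2)
        # Exploration probability decays from 1.0 to a floor of 0.05.
        epsilon = max(0.05, 1.0 - t / (0.5 * steps))
        if rng.random() < epsilon:
            action = rng.randrange(NUM_ACTIONS)  # random trial
        else:
            action = max(range(NUM_ACTIONS), key=lambda a: q[state][a])
        memory.append((state, action, toy_environment(state, action)))
        if t >= t_start:  # train only once D holds enough transitions
            for s, a, r in rng.sample(list(memory), batch_size):
                # gamma = 0 in this toy, so the target is just the reward.
                q[s][a] += lr * (r - q[s][a])
    return q


q_table = train()
best_levels = [QUALITY_LEVELS[max(range(NUM_ACTIONS), key=lambda a: row[a])]
               for row in q_table]
```

In this toy setting the learned greedy choice for each state is the lowest quality level that still keeps the accuracy, which mirrors the trade-off encoded in the reward.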

## Inference-Estimate-Retrain Mechanism

The authors handle changes of scenery at the inference phase, which might cause learning to diverge, by introducing the **inference-estimate-retrain mechanism**. They introduce an estimation probability [math] p_{\rm est} [/math] that changes adaptively and is compared against a generated random value [math] \xi \in (0,1) [/math]. As shown in Figure 2, AdaCompress switches between three states adaptively, as described in the following sections.

**Figure 2:** State switching policy
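The switching policy can be sketched as a small state machine. The transition conditions below are a reading of this summary, not a verified copy of Figure 2: inference moves to estimation when [math] \xi \leq p_{\rm est} [/math], estimation triggers retraining when the recent average accuracy falls below the earliest average [math] \mathcal{A}_0 [/math], and retraining ends once the recent average reward exceeds [math] r_{th} [/math]. The function and argument names are hypothetical.

```python
INFERENCE, ESTIMATE, RETRAIN = "inference", "estimate", "retrain"


def next_state(state, xi, p_est, avg_acc_recent, avg_acc_initial,
               avg_reward_recent, reward_threshold):
    """Return the next state of the inference-estimate-retrain loop.

    Assumed transitions (a sketch based on the text, not the exact figure):
      inference -> estimate   when xi <= p_est
      estimate  -> retrain    when recent average accuracy < initial average
      retrain   -> inference  when recent average reward > threshold
    """
    if state == INFERENCE:
        return ESTIMATE if xi <= p_est else INFERENCE
    if state == ESTIMATE:
        return RETRAIN if avg_acc_recent < avg_acc_initial else INFERENCE
    if state == RETRAIN:
        return INFERENCE if avg_reward_recent > reward_threshold else RETRAIN
    raise ValueError(f"unknown state: {state}")


# Hypothetical walk through one scenery change.
s1 = next_state(INFERENCE, xi=0.05, p_est=0.1, avg_acc_recent=0.9,
                avg_acc_initial=0.9, avg_reward_recent=0.0, reward_threshold=0.5)
s2 = next_state(s1, xi=0.5, p_est=0.1, avg_acc_recent=0.6,
                avg_acc_initial=0.9, avg_reward_recent=0.0, reward_threshold=0.5)
s3 = next_state(s2, xi=0.5, p_est=0.1, avg_acc_recent=0.6,
                avg_acc_initial=0.9, avg_reward_recent=0.7, reward_threshold=0.5)
```

In practice `xi` would be drawn fresh at every step, for example with `random.random()`.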

### Inference State

The **inference state** runs most of the time: the deployed, trained RL agent predicts the compression level [math] c [/math] so that the image is uploaded to the cloud with minimum upload traffic. The agent occasionally switches to the estimator state with probability [math] p_{\rm est} [/math], which makes it robust to changes in the scenery and keeps the accuracy stable. The probability [math] p_{\rm est} [/math] is fixed during the inference state but changes adaptively in the estimator state as a function of the accuracy gradient. In the **estimator state** there is a trade-off between the objective of reducing upload traffic and the risk of a scenery change, so an accuracy-aware dynamic [math] p'_{\rm est} [/math] is designed from the average accuracy [math] \bar{\mathcal{A}}_n [/math], which is computed after running for [math] N [/math] steps according to Eq. \eqref{eqn:accuracy_n}.
\begin{align} \tag{2} \label{eqn:accuracy_n}
\bar{\mathcal{A}}_n &=
\begin{cases}
\frac{1}{n}\sum_{i=N-n}^{N} \mathcal{A}_i & \text{ if } N \geq n \\
\frac{1}{n}\sum_{i=1}^{n} \mathcal{A}_i & \text{ if } N < n
\end{cases}
\end{align}

### Estimator State

The **estimator state** is entered when [math] \xi \leq p_{\rm est} [/math]. Upload traffic increases in this state because both the reference image [math] x_{\rm ref} [/math] and the compressed image [math] x_{i} [/math] are uploaded to the cloud to calculate [math] \mathcal{A}_i [/math] from [math] \vec{y}_{\rm ref} [/math] and [math] \vec{y}_i [/math]. The result is stored in the memory buffer [math] \mathcal{D} [/math] as a transition [math] (\phi_i, c_i, r_i, \mathcal{A}_i) [/math] of trial [math]i[/math]. The agent is no longer suitable for the latest [math]n[/math] steps when their average accuracy [math] \bar{\mathcal{A}}_n [/math] is lower than [math] \mathcal{A}_0 [/math], the average over the earliest [math]n[/math] steps in the memory buffer [math] \mathcal{D} [/math]. Consequently, [math] p_{\rm est} [/math] should be raised so that the estimator state occurs more frequently. It should be a function of the gradient of the average accuracy [math] \bar{\mathcal{A}}_n [/math], so that the memory buffer [math] \mathcal{D} [/math] is filled with transitions for retraining the agent when the average accuracy [math] \bar{\mathcal{A}}_n [/math] drops. The authors formulate [math] p'_{\rm est} = p_{\rm est} + \omega \nabla \bar{\mathcal{A}} [/math], where [math] \omega [/math] is a scaling factor. Starting from an initial estimation probability [math] p_0 [/math], the general form is [math]p_{\rm est} = p_0 + \omega \sum_{i=0}^{N} \nabla \bar{\mathcal{A}}_i [/math].
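A minimal sketch of the windowed average and the probability update follows. Several choices here are assumptions rather than details from the paper: the window simply uses the last `n` available samples (or all of them when fewer exist), the gradient is approximated by the difference of two consecutive window averages, `omega` is taken as negative so that a drop in accuracy raises [math] p_{\rm est} [/math], and the result is clamped to [math] [0, 1] [/math].

```python
def recent_average(accuracies, n):
    """Average of the last n accuracies, or of all of them if fewer exist."""
    if not accuracies:
        raise ValueError("need at least one accuracy sample")
    window = accuracies[-n:]
    return sum(window) / len(window)


def update_p_est(p_est, avg_prev, avg_curr, omega):
    """p'_est = p_est + omega * gradient, clamped to [0, 1].

    The gradient is approximated by a finite difference between two
    consecutive window averages (an assumption of this sketch).
    """
    gradient = avg_curr - avg_prev
    return min(1.0, max(0.0, p_est + omega * gradient))


# Hypothetical accuracy history: the scenery changes and accuracy drops.
history = [1.0, 1.0, 1.0, 1.0, 0.5, 0.5]
before = recent_average(history[:4], n=2)
after = recent_average(history, n=2)
p_new = update_p_est(0.2, before, after, omega=-0.4)
```

With a negative `omega`, the drop in windowed accuracy increases the estimation probability, so more reference images are uploaded and more transitions reach the buffer.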

### Retrain State

In the **retrain state**, the RL agent is trained on the transitions stored in the memory buffer [math] \mathcal{D} [/math] to adapt to the change in the input scenery. The retrain state finishes when the average reward [math] \bar{r}_n [/math] over the most recent [math] n [/math] steps is higher than a user-defined threshold [math] r_{th}[/math]. Afterward, the old memory buffer [math] \mathcal{D}[/math] is flushed and new transitions are saved to prepare for the next retraining stage. The authors support their compression choices for different cloud application environments with a visualization algorithm [8] applied to some images together with their corresponding quality factor [math] c [/math]. The visualization shows that the agent chooses a quantization level [math] c [/math] based on the visual textures in the different regions of the image. For instance, a low quality factor is selected when the central region is rough and surrounded by a smooth area, whereas the agent chooses a relatively higher quality when smooth regions dominate.

## Results

In Figure 3, the authors report results on three different cloud services compared to the benchmark images. The upload size is reduced by more than half while the top-5 accuracy computed with [math] \mathcal{A} [/math] is roughly preserved, with an average difference of 7%, which demonstrates the efficiency of the design. Figure 4 illustrates the **inference-estimate-retrain** mechanism: the [math]x[/math]-axis indicates steps, and the [math] \Delta [/math] mark on the [math]x[/math]-axis indicates a change in the scenery. In Figure 4, the estimation probability [math] p_{\rm est} [/math] and the accuracy move in opposite directions: when the accuracy drops below its initial value, [math] p_{\rm est} [/math] increases adaptively, because it takes into account the accuracy metric [math] \mathcal{A}_c [/math] of each action [math] c [/math] as the average accuracy decreases over the following estimations. At the red vertical line, the scenery starts to change and the [math]Q[/math] network starts to retrain so that the agent adapts to the current scenery. In the retrain stage, the output is always taken from the reference image's prediction label [math] \vec{y}_{\rm ref} [/math]. As a result, [math] p_{\rm est} [/math] and [math] \mathcal{A} [/math] are always equal to 1 in this stage, and the uploaded data is larger than in the conventional benchmark.
They also plot the scaled upload size of the proposed algorithm, together with the overhead data size relative to the benchmark. Once the average accuracy becomes stable and high, transmission is reduced by decreasing [math] p_{\rm est} [/math]. In the inference stage, the upload size is halved, as shown in both Figures 3 and 4.

**Figure 3:** Different cloud services compared in terms of average size and accuracy

**Figure 4:** Response of the AdaCompress algorithm to a scenery change

## Conclusion

Most prior research focuses on modifying the deep learning model rather than working with the models that are currently deployed. The authors succeed in choosing a compression level for each uploaded image that decreases its size while maintaining the top-5 accuracy robustly, even when the scenery changes. In my opinion, Eq. \eqref{eq:accuracy} is not well defined, as I found that it does not really affect the reward function. Also, they did not use the whole ImageNet validation set, which raises the question of what the largest file size was in the subset they considered. In addition, if the whole dataset were considered, would the mechanism show the same performance?

## Critiques

The authors used a pre-trained model as a feature extractor to select a Quality Factor (QF) for JPEG. What is missing, I think, is the distribution of the selected QFs over their range, which is important for understanding which QFs are expected to contribute most. In my video, I ran one experiment using Inception-V3 to see whether better accuracy is possible. I found that using the Inception model as the pre-trained feature extractor makes it possible to choose a lower QF; however, it is well known that mobile models are shallower than Inception models, which makes them less complex to run on edge devices. I think it is possible to achieve at least the same accuracy, or even better, if the mobile model is replaced with Inception. Another point is that the authors did not run their approach on a complete dataset such as ImageNet; they only included parts of two different datasets. I understand that they may have been limited in the datasets available for testing: datasets such as CIFAR are not really comparable in resolution, since real online computer vision services work with higher resolutions.

## Source Code

https://github.com/AhmedHussKhalifa/AdaCompress

## References

[1] Hongshan Li, Yu Guo, Zhi Wang, Shutao Xia, and Wenwu Zhu, “Adacompress: Adaptive compression for online computer vision services,” in Proceedings of the 27th ACM International Conference on Multimedia, New York, NY, USA, 2019, MM ’19, pp. 2440–2448, ACM.

[2] Zihao Liu, Tao Liu, Wujie Wen, Lei Jiang, Jie Xu, Yanzhi Wang, and Gang Quan, “DeepN-JPEG: A deep neural network favorable JPEG-based image compression framework,” in Proceedings of the 55th Annual Design Automation Conference. ACM, 2018, p. 18.

[3] Lionel Gueguen, Alex Sergeev, Ben Kadlec, Rosanne Liu, and Jason Yosinski, “Faster neural networks straight from jpeg,” in Advances in Neural Information Processing Systems, 2018, pp. 3933–3944.

[4] Kresimir Delac, Mislav Grgic, and Sonja Grgic, “Effects of jpeg and jpeg2000 compression on face recognition,” in Pattern Recognition and Image Analysis, Sameer Singh, Maneesha Singh, Chid Apte, and Petra Perner, Eds., Berlin, Heidelberg, 2005, pp. 136–145, Springer Berlin Heidelberg.

[5] Robert Torfason, Fabian Mentzer, Eirikur Agustsson, Michael Tschannen, Radu Timofte, and Luc Van Gool, “Towards image understanding from deep compression without decoding,” 2018.

[6] Seungyeop Han, Haichen Shen, Matthai Philipose, Sharad Agarwal, Alec Wolman, and Arvind Krishnamurthy, “Mcdnn: An approximation-based execution framework for deep stream processing under resource constraints,” in Proceedings of the 14th Annual International Conference on Mobile Systems, Applications, and Services, New York, NY, USA, 2016, MobiSys ’16, pp. 123–136, ACM.

[7] Mark Sandler, Andrew G. Howard, Menglong Zhu, Andrey Zhmoginov, and Liang-Chieh Chen, “Inverted residuals and linear bottlenecks: Mobile networks for classification, detection and segmentation,” CoRR, vol. abs/1801.04381, 2018.

[8] Ramprasaath R. Selvaraju, Abhishek Das, Ramakrishna Vedantam, Michael Cogswell, Devi Parikh, and Dhruv Batra, “Grad-cam: Why did you say that? visual explanations from deep networks via gradient-based localization,” CoRR, vol. abs/1610.02391, 2016.