Learned Image Compression

Information Theory (Winter 2020)
EE 276, Information Theory | Project Final Report 

Group Members

Yanpei Tian, Helena Huang, Boxiao Pan


Image compression aims to reduce the size of an image in order to lower transmission and storage costs. There are two main types of image compression: lossless and lossy. Traditional codecs include PNG (lossless) and JPEG (lossy). Modern approaches apply deep learning to this task, i.e., learned image compression. The general pipeline is an autoencoder architecture: an encoder first compresses the input image into a compact representation, which is then fed into a decoder that reconstructs the input image. The learning objective is thus to minimize the difference between the reconstructed image and the original. In this project, we explore different ways to implement this framework, as well as some implications behind the notion of “learned compression”.

Literature Review

Full Resolution Image Compression with Recurrent Neural Networks [7]: This work builds on Variable Rate Image Compression With Recurrent Neural Networks [2], which shows that a single RNN can be trained to outperform existing image compression schemes at a fixed output size. The authors incorporate an entropy encoder into the pipeline to achieve competitive compression rates on images of arbitrary size.

Lossy image compression with compressive autoencoders [3]: In this paper, the authors introduce an effective way of dealing with non-differentiability when training autoencoders for lossy image compression. They propose three alternatives to the non-differentiable quantization step: 1. replacing the gradient of quantization with the gradient of the identity function, effectively passing gradients unmodified from the decoder to the encoder; 2. additive uniform noise; 3. stochastic rounding.
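The three quantization surrogates from [3] can be sketched in a few lines of NumPy. This is an illustrative sketch, not the paper's implementation: only the forward passes are shown, and the straight-through gradient trick is described in a comment since it only takes effect inside an autodiff framework.

```python
import numpy as np

rng = np.random.default_rng(0)

def straight_through_round(x):
    # Forward pass: hard rounding. In an autodiff framework the backward
    # pass would replace the (almost-everywhere-zero) rounding gradient
    # with the identity, passing gradients through unchanged.
    return np.round(x)

def additive_uniform_noise(x, rng):
    # Training-time surrogate: add U(-0.5, 0.5) noise instead of rounding.
    return x + rng.uniform(-0.5, 0.5, size=x.shape)

def stochastic_round(x, rng):
    # Round up with probability equal to the fractional part, so the
    # expected value of the output equals the input.
    floor = np.floor(x)
    return floor + (rng.uniform(size=x.shape) < x - floor)

x = np.array([0.2, 1.7, 3.1])
print(straight_through_round(x))  # [0. 2. 3.]
```

Note that all three surrogates agree with hard rounding in expectation or in the forward pass, which is what allows a quantizer trained this way to be swapped for true rounding at test time.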

Entropy Encoding in Wavelet Image Compression [4]: The author reviews several entropy encoding schemes such as Huffman coding and arithmetic coding, providing detailed mathematical bases for the encoding schemes and their applications.


We use the MNIST dataset as the starting point. The MNIST database [6] (Modified National Institute of Standards and Technology database) is a large database of handwritten digits that is commonly used for training various image processing systems. It consists of 60,000 handwritten digits ranging from 0 to 9, each of size 28×28. The test set has 10,000 similarly constructed images.

We also use the CIFAR-10 dataset, which consists of 10 classes of 32×32 color images (airplane, automobile, bird, etc.). There are 50,000 images in the training set and 10,000 in the test set.


The training procedure is shown in Fig. 1. The loss consists of two components: a reconstruction loss between the original and reconstructed images, and a regularization loss that penalizes the compressed representation for being too large.
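The two-part objective can be sketched as below. The L1 rate penalty and the trade-off weight `lam` are illustrative assumptions, not the exact regularizer used in our training:

```python
import numpy as np

def compression_loss(x, x_hat, code, lam=0.01):
    # Reconstruction term: how far the decoder output is from the input.
    reconstruction = np.mean((x - x_hat) ** 2)
    # Rate (regularization) term: penalizes large compressed codes.
    # The L1 penalty is an illustrative choice; lam trades off
    # reconstruction quality against the size of the representation.
    rate = np.mean(np.abs(code))
    return reconstruction + lam * rate
```

Raising `lam` pushes the optimizer toward smaller codes at the cost of reconstruction quality, which is the standard rate–distortion trade-off.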


To solidify our idea, we first implement an autoencoder (without the entropy module) to compress the MNIST handwritten digit dataset. Two model architectures are used: 1. a three-layer densely connected neural network (222,384 total parameters); 2. a CNN with 3 convolutional layers followed by 2 densely connected layers (13,289 total parameters).

Model Details: 1. The densely connected network uses 3 hidden layers with 128, 64, and 32 hidden units respectively, each followed by ReLU activation; 2. The CNN model uses three convolutional layers followed by two densely connected layers, all with ReLU activation.
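Assuming the decoder mirrors the 128/64/32 encoder (an assumption, since the report only lists the encoder widths), the stated 222,384-parameter count can be reproduced by summing weights and biases over consecutive fully connected layers:

```python
def dense_autoencoder_params(layer_sizes):
    # Each fully connected layer contributes in*out weights plus out biases.
    return sum(n_in * n_out + n_out
               for n_in, n_out in zip(layer_sizes, layer_sizes[1:]))

# 784-dim flattened MNIST input, 128/64/32 encoder, mirrored decoder.
sizes = [784, 128, 64, 32, 64, 128, 784]
print(dense_autoencoder_params(sizes))  # 222384
```

The fact that this matches the reported total suggests the symmetric-decoder assumption is the configuration actually used.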

Motivated by the success of U-Net [8] on image reconstruction, we further experimented with a U-Net whose structure is shown in Figure 3, using a slightly different configuration than the specific one shown there. Concretely, we use 4 convolutional layers in the encoder (and 4 de-convolutional layers in the decoder) with kernel size 3 and stride 1, each followed by 2×2 max pooling (or upsampling in the decoder). The channel numbers of the encoder layers are 16, 32, 64, and 128, respectively (reversed for the decoder). We use ReLU as the activation function for all intermediate layers and sigmoid for the last layer, and He initialization for the weights. To reduce overfitting, we apply dropout to every layer with a dropout probability of 0.1.
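The shape progression through this encoder can be traced without any deep learning framework. The sketch below assumes 'same' padding (so a 3×3 stride-1 convolution preserves spatial size) and that each 2×2 max pooling halves it; the padding choice is our assumption, not stated in the report:

```python
def encoder_shapes(h, w, channels=(16, 32, 64, 128)):
    # Each stage: 3x3 stride-1 convolution ('same' padding assumed, so
    # spatial size is unchanged) then 2x2 max pooling, which halves it.
    shapes = []
    for c in channels:
        h, w = h // 2, w // 2
        shapes.append((h, w, c))
    return shapes

print(encoder_shapes(32, 32))  # CIFAR-10-sized input
```

For a 32×32 input this yields feature maps of 16×16×16, 8×8×32, 4×4×64, and finally 2×2×128 at the bottleneck; the decoder reverses the progression with upsampling.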

Performance Evaluation: 1. The densely connected model achieves a compression ratio of 784/32 (using a 32-dimensional vector to represent the 28×28 grayscale image); 2. The CNN model achieves a compression ratio of 784/8 (using an 8-dimensional vector). Both models reach a training loss of around 0.01 measured by L2 loss and generate fairly accurate reconstructions.


We will evaluate the performance of our project based on three metrics: compression ratio, loss function, and human perceptual rating.

Compression Ratio: Represented as bits per pixel (bpp), the number of bits needed in the compressed image to represent 1 pixel in the original image;

Loss Function: As far as our literature review goes, there is no metric that correlates well with human raters across all types of distortions. Current state-of-the-art models use L_p norms as the optimization metric.

Human Perceptual Rating: The ultimate goal for all image compression schemes is to create a faithful but smaller representation of the original image. Therefore, the reconstruction of the compressed image needs to be similar to the original image according to human perception.
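The bits-per-pixel metric above is a simple ratio. As a sketch (the 8-bits-per-latent-value figure is an assumption for illustration; the actual bit depth depends on how the code is stored):

```python
def bits_per_pixel(compressed_bits, height, width):
    # bpp: bits in the compressed representation per original pixel.
    return compressed_bits / (height * width)

# A 32-value code for a 28x28 image, assuming 8 bits per stored value:
print(bits_per_pixel(32 * 8, 28, 28))
```

An uncompressed 8-bit grayscale image is 8 bpp, so lower bpp at comparable perceptual quality means better compression.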

CIFAR-10 Result Analysis

Compression ratio: We use a symmetric architecture for the encoder and decoder; each has 4 convolutional layers. With the current implementation, we achieve a compression ratio of 6.

Loss: During the training, the loss measured by L-2 norm converges to 0.008 (with pixel values ranging from 0 to 1).

Reconstruction Result:


To further investigate how the neural model learns to do well on the compression task, we use the MNIST dataset to probe the patterns learned by the network.

It is easy to see that the data are better clustered in the compressed representations. We hypothesize that the model learns to compress by building a knowledge base inside the network; during decompression, the model leverages both this internal knowledge base and the compressed input to generate the reconstruction.

Comparing Figure 7 with Figure 6 (left) yields further insight into how the compressor works. The reconstructions shown in Figure 7 are still better clustered than the raw input data, meaning the network uses its internal categorical representation to generate the reconstructions. In the case of the MNIST dataset, the internally learned categorical knowledge base is the 0–9 clustering of the compressed images.
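Beyond visual inspection of the t-SNE plots, the clustering effect can be quantified. The scatter-ratio score below is our own illustrative measure, not something computed in the experiments: it compares between-class to within-class spread of the compressed codes, so a higher score means the codes are more tightly grouped by digit.

```python
import numpy as np

def clustering_score(codes, labels):
    # Ratio of between-class to within-class scatter of the codes.
    # Higher values mean the representation is better clustered by label.
    overall_mean = codes.mean(axis=0)
    within = between = 0.0
    for c in np.unique(labels):
        pts = codes[labels == c]
        mu = pts.mean(axis=0)
        within += ((pts - mu) ** 2).sum()
        between += len(pts) * ((mu - overall_mean) ** 2).sum()
    return between / within
```

Computing this score on the raw pixels, the compressed codes, and the reconstructions would give a numerical counterpart to the qualitative ordering seen in Figures 6 and 7.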


In this project, we implemented several neural-network-based image compression schemes on the MNIST and CIFAR-10 datasets. As a first foray into learned image compression, we achieved a reconstruction loss of around 0.01 on both datasets, measured by L2 loss. We further studied the internal representations learned by the network during training: by applying t-SNE to the raw input data, the compressed representations, and the reconstructions, we can get a sense of the “classification” effect of the neural image compression scheme.


[1] “Workshop and Challenge on Learned Image Compression.” www.compression.cc/

[2] G. Toderici, S. M. O’Malley, S. J. Hwang, D. Vincent, D. Minnen, S. Baluja, M. Covell, and R. Sukthankar. Variable rate image compression with recurrent neural networks. ICLR 2016, 2016.

[3] Lucas Theis, Wenzhe Shi, Andrew Cunningham, and Ferenc Huszár. Lossy image compression with compressive autoencoders. arXiv preprint arXiv:1703.00395, 2017.

[4] Song, M.S.: Entropy encoding in wavelet image compression. In: Representations, Wavelets and Frames A Celebration of the Mathematical Work of Lawrence Baggett, pp. 293–311 (2007)

[5] Mwiti, Derrick. “A 2019 Guide to Deep Learning-Based Image Compression.” Medium, Heartbeat, 25 Oct. 2019, heartbeat.fritz.ai/a-2019-guide-to-deep-learning-based-image-compression-2f5253b4d811.

[6] “MNIST Database.” Wikipedia, Wikimedia Foundation, 24 Feb. 2020, en.wikipedia.org/wiki/MNIST_database.

[7] G. Toderici, D. Vincent, N. Johnston, S. J. Hwang, D. Minnen, J. Shor, and M. Covell. Full Resolution Image Compression with Recurrent Neural Networks. arXiv preprint arXiv:1608.05148, 2016.

[8] Ronneberger, O., Fischer, P., and Brox, T. U-Net: Convolutional networks for biomedical image segmentation. In: International Conference on Medical Image Computing and Computer-Assisted Intervention, 2015, pp. 234–241. Springer, Cham.
