[Skim-read] Image Super-Resolution via Iterative Refinement

Overview

Problem

  • Existing approaches to image super-resolution each have limitations: autoregressive models are prohibitively expensive for high-resolution image generation, Normalizing Flows (NFs) and variational autoencoders (VAEs) often yield sub-optimal sample quality, and GANs require carefully designed regularization and optimization tricks to tame optimization instability and mode collapse.

Solution

  • The paper presents SR3, an approach to image super-resolution via repeated refinement, based on DDPM.

Results

  • The high-frequency information of the image is restored well compared to other methods.
  • Despite mediocre performance on the SSIM and PSNR metrics, visual quality and consistency are good.

Diffusion Probabilistic Models

I’ve written a blog post about diffusion probabilistic models (DPM). It covers the derivation of the basic DPM formulas as well as a simple code implementation.

Method

Fig. 1. The forward diffusion process $q$ (left to right) gradually adds
Gaussian noise to the target image. The reverse inference process
$p$ (right to left) iteratively denoises the target image conditioned
on a source image x. (Image source: Saharia et al. 2023)

SR3 is built on top of DDPM. Instead of generating images unconditionally from noise, it generates a high-resolution image conditioned on a low-resolution image. The main changes in SR3 are:

  1. The low-resolution image is upsampled to the target resolution with bicubic interpolation and concatenated channel-wise with the original input ($x_{t-1}$), giving a 6-channel tensor as the new input to the DDPM.

    We experimented with more sophisticated methods of conditioning, such as using FiLM (Perez et al. 2018), but we found that the simple concatenation yielded similar generation quality.

  2. Instead of taking $\bar{\alpha}_t$ directly from the timestep $t$ to compute the related variables and the loss, a timestep $t$ is sampled first and a continuous noise level is then drawn uniformly between the two adjacent schedule values: $\bar{\alpha} \sim p(\bar{\alpha}) = U(\bar{\alpha}_{t-1}, \bar{\alpha}_{t})$. (Section 2.4 in Saharia et al. 2023)
  3. The model receives the noise level $\bar{\alpha}$ directly instead of the timestep $t$. This allows flexibility in adjusting the noise schedule and the number of sampling steps during inference.
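The three changes above can be sketched in a few lines of plain Python. This is only an illustration under my own assumptions, not the authors' implementation: the linear beta schedule and its endpoints, the number of steps, and every function name here (`make_alpha_bar`, `sample_noise_level`, `add_noise`, `concat_channels`) are hypothetical, and images are represented as small nested lists rather than tensors.

```python
import math
import random

def make_alpha_bar(num_steps, beta_start=1e-4, beta_end=2e-2):
    """Cumulative products alpha_bar[0..T] for a linear beta schedule (assumed values)."""
    alpha_bar = [1.0]  # alpha_bar[0] = 1 means "no noise yet"
    for i in range(num_steps):
        beta = beta_start + (beta_end - beta_start) * i / max(num_steps - 1, 1)
        alpha_bar.append(alpha_bar[-1] * (1.0 - beta))
    return alpha_bar

def sample_noise_level(alpha_bar, rng):
    """Pick a timestep t, then draw a continuous level between alpha_bar[t] and alpha_bar[t-1]."""
    t = rng.randint(1, len(alpha_bar) - 1)
    low, high = alpha_bar[t], alpha_bar[t - 1]  # alpha_bar decreases as t grows
    return t, rng.uniform(low, high)

def add_noise(pixels, level, rng):
    """Forward-diffuse clean pixel values to the given noise level; also return the noise."""
    noise = [rng.gauss(0.0, 1.0) for _ in pixels]
    noisy = [math.sqrt(level) * p + math.sqrt(1.0 - level) * e
             for p, e in zip(pixels, noise)]
    return noisy, noise

def concat_channels(upsampled_lr, noisy_hr):
    """Channel-wise concatenation of two channels-first images (lists of channels)."""
    return upsampled_lr + noisy_hr

rng = random.Random(0)
alpha_bar = make_alpha_bar(num_steps=2000)
t, level = sample_noise_level(alpha_bar, rng)

clean_channel = [0.2, -0.4, 0.9]  # tiny stand-in for one flattened HR channel
noisy_channel, noise = add_noise(clean_channel, level, rng)

# Two 3-channel images of size 1x1: the upsampled LR image and the noisy HR image.
upsampled_lr = [[[0.1]], [[0.2]], [[0.3]]]
noisy_hr = [[[0.5]], [[0.6]], [[0.7]]]
model_input = concat_channels(upsampled_lr, noisy_hr)
print(t, level, len(model_input))
```

In a real training loop the network would receive `model_input` together with `level` (not `t`) and be trained to predict `noise`. Because the model only ever sees a continuous noise level, a different schedule with fewer steps can be substituted at inference time.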

Experimental Study

New metric: Consistency

As a measure of the consistency of the super-resolution outputs, we compute MSE between the downsampled outputs and the low-resolution inputs.
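A minimal sketch of this metric, assuming grayscale images stored as nested lists and a simple box-average downsampler. The downsampling kernel is my assumption (the quote above does not specify one), and `box_downsample` and `consistency_mse` are hypothetical names.

```python
def box_downsample(image, factor):
    """Downsample a 2D image (list of rows) by averaging factor x factor blocks.

    Assumes height and width are divisible by factor.
    """
    height, width = len(image), len(image[0])
    out = []
    for r in range(0, height, factor):
        row = []
        for c in range(0, width, factor):
            block = [image[r + i][c + j] for i in range(factor) for j in range(factor)]
            row.append(sum(block) / len(block))
        out.append(row)
    return out

def consistency_mse(sr_output, lr_input, factor):
    """MSE between the downsampled super-resolution output and the low-resolution input."""
    down = box_downsample(sr_output, factor)
    diffs = [(a - b) ** 2
             for row_d, row_l in zip(down, lr_input)
             for a, b in zip(row_d, row_l)]
    return sum(diffs) / len(diffs)

lr = [[0.0, 1.0], [1.0, 0.0]]
# A nearest-neighbour upsampling of lr is perfectly consistent with it.
sr = [[lr[r // 2][c // 2] for c in range(4)] for r in range(4)]
print(consistency_mse(sr, lr, factor=2))
```

A low value only says that the output agrees with the input once it is downsampled again. It says nothing about whether the added detail looks realistic.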

New metric: Classification Accuracy

In the field of low-level vision, standard metrics often do not comprehensively represent the quality of images. Therefore, the effectiveness of low-level models is often evaluated in terms of proxy tasks.

The paper mirrors the evaluation setup of Zhang et al. (2018) and applies 4$\times$ super-resolution models to 56$\times$56 center crops from the validation set of ImageNet.
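The two mechanical pieces of such a proxy evaluation are the center crop and the top-k scoring of classifier outputs. The sketch below shows both under stated assumptions: the helper names are hypothetical, the classifier itself is left out (its per-class scores are passed in as plain lists), and I report accuracy here to match the section title.

```python
def center_crop(image, size):
    """Return the central size x size window of a 2D image given as a list of rows."""
    height, width = len(image), len(image[0])
    top, left = (height - size) // 2, (width - size) // 2
    return [row[left:left + size] for row in image[top:top + size]]

def top_k_accuracy(scores_per_image, labels, k):
    """Fraction of images whose true label is among the k highest-scoring classes."""
    correct = 0
    for scores, label in zip(scores_per_image, labels):
        ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
        if label in ranked[:k]:
            correct += 1
    return correct / len(labels)

# Toy 8x8 "image" whose pixel value encodes its position.
image = [[r * 8 + c for c in range(8)] for r in range(8)]
crop = center_crop(image, 4)

# Made-up classifier scores for two images over three classes.
scores = [[0.1, 0.7, 0.2], [0.5, 0.3, 0.2]]
labels = [1, 1]
print(top_k_accuracy(scores, labels, k=1), top_k_accuracy(scores, labels, k=2))
```

In the actual setup, each low-resolution crop would first go through the super-resolution model, and the classifier would then score the super-resolved image.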

Quantitative Results

Compared to PULSE (Menon et al. 2020), FSRGAN (Chen et al. 2018), and the Regression model, SR3's results in terms of PSNR and SSIM are fairly average. This is because traditional super-resolution models are typically trained with pixel-wise regression losses that directly optimize PSNR, whereas SR3 is not, so somewhat lower scores on these metrics are to be expected. On the consistency metric, however, SR3 performs very well.
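To make the link between regression training and PSNR concrete: PSNR is a monotonically decreasing function of the pixel-wise MSE, so a model trained to minimize MSE is in effect maximizing PSNR. Using my own notation (not the paper's), with $\hat{y}$ the output, $y$ the reference, $N$ the number of pixels, and $\mathrm{MAX}$ the largest possible pixel value:

$$
\mathrm{PSNR} = 10 \log_{10} \frac{\mathrm{MAX}^2}{\mathrm{MSE}}, \qquad \mathrm{MSE} = \frac{1}{N} \sum_{i=1}^{N} (\hat{y}_i - y_i)^2
$$

A blurry output that averages over plausible details can therefore score a higher PSNR than a sharp output whose details are realistic but do not match the reference pixel by pixel.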

Table 1. PSNR & SSIM on 16$\times$16 $\to$ 128$\times$128 face super-resolution. Consistency measures MSE ($\times10^{-5}$) between the low-resolution inputs and the down-sampled super-resolution outputs. (Table source: Saharia et al. 2023 as a screenshot)

Table 2. Performance comparison between SR3 and Regression baseline on natural image super-resolution using standard metrics computed on the ImageNet validation set. (Table source: Saharia et al. 2023 as a screenshot)

Evaluation of Proxy Task

Table 3. Comparison of classification accuracy scores for 4$\times$ natural
image super-resolution on the first 1K images from the ImageNet
Validation set. (Table source: Saharia et al. 2023 as a screenshot)

Human Evaluation (2AFC)

The paper uses a 2-alternative forced-choice (2AFC) paradigm to measure how well humans can discriminate true images from those generated by a model.
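The quantity reported from such a study is the fool rate: the percentage of trials in which the subject picks the generated image over the real one. A minimal sketch, assuming each trial is recorded as the string `"generated"` or `"reference"` (this encoding and the function name are my own, not the paper's):

```python
def fool_rate(selections):
    """Percentage of 2AFC trials in which the generated image was selected as real."""
    fooled = sum(1 for choice in selections if choice == "generated")
    return 100.0 * fooled / len(selections)

# Four made-up trials: the subject was fooled in two of them.
trials = ["generated", "reference", "reference", "generated"]
print(fool_rate(trials))
```

If subjects cannot tell generated images from real ones, they are guessing, so the fool rate approaches 50%. A rate near 0% means the generated images are easy to spot.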

Fig. 2. Face super-resolution human fool rates (higher is better, photo-realistic samples yield a fool rate of 50%). Outputs of 4 models are compared against ground truth. (top) Subjects are shown low-resolution inputs. (bottom) Inputs are not shown. (Image source: Saharia et al. 2023)

Fig. 3. ImageNet super-resolution fool rates (higher is better, photo-realistic samples yield a fool rate of 50%). SR3 and Regression outputs are compared against ground truth. (top) Subjects are shown low-resolution inputs. (bottom) Inputs are not shown. (Image source: Saharia et al. 2023)

Visualization

Fig. 4. Comparison of different methods on the 16$\times$16 $\to$ 128$\times$128 face super-resolution task. Reference image has not been included because of privacy concerns. (Image source: Saharia et al. 2023)

Fig. 5. Results of an SR3 model (64$\times$64 $\to$ 512$\times$512), trained on FFHQ, and applied to images outside of the training set, along with enlarged patches to show finer details. (Image source: Saharia et al. 2023)

Fig. 5 shows that the image obtained by SR3 has more details (high-frequency information of the image) compared to the regression model.

Summary

SR3 employs a novel approach to super-resolution, distinct from previous approaches based on GANs and CNNs. It generates a high-resolution image by progressively denoising Gaussian noise with a diffusion model conditioned on the low-resolution image. In the experimental section, the PSNR and SSIM metrics show less impressive performance compared to other methods. However, it outperforms the Regression model in terms of the FID and IS metrics, a result that would be more convincing if PULSE and FSRGAN had also been evaluated. Personally, I find the consistency metric not very meaningful. Still, its remarkable performance on the proxy task compared to the Regression model is worth attention (though experimental comparisons with PULSE and FSRGAN are again missing). The approach of using diffusion models for image super-resolution is effective, and there is potential for further research in the future.

Reference

[1] Chitwan Saharia et al. “Image Super-Resolution via Iterative Refinement.” TPAMI 2023.

[2] Ethan Perez et al. “FiLM: Visual Reasoning with a General Conditioning Layer.” AAAI 2018.

[3] Yulun Zhang et al. “Image Super-Resolution Using Very Deep Residual Channel Attention Networks.” ECCV 2018.

[4] Sachit Menon et al. “PULSE: Self-Supervised Photo Upsampling via Latent Space Exploration of Generative Models.” CVPR 2020.

[5] Yu Chen et al. “FSRNet: End-to-End Learning Face Super-Resolution With Facial Priors.” CVPR 2018.

[6] Kaiming He et al. “Deep Residual Learning for Image Recognition.” CVPR 2016.