Revisiting the Perception-Distortion Trade-off with Spatial-Semantic Guided Super-Resolution

Dan Wang1   Haiyan Sun2   Shan Du3   Z. Jane Wang3
Zhaochong An1   Serge Belongie1   Xinrui Cui2
1University of Copenhagen    2University of North Texas    3University of British Columbia
[Teaser figure: interactive LR vs. HR comparison]

Perception-distortion trade-off for GAN-based methods, diffusion-based methods, and ours: GAN-based methods reduce distortion but produce blurry textures, while diffusion-based methods generate perceptually sharp yet hallucinated details. By integrating spatial-grounded textual guidance, SpaSemSR improves reconstruction fidelity (PSNR, SSIM in (a)), while semantic-enhanced visual guidance enhances perceptual quality (CLIP-IQA, MUSIQ, MANIQA in (b)), resulting in a better perception-distortion trade-off than GAN-based (c) and diffusion-based (d) models.
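Note that the fidelity metrics (PSNR, SSIM) are full-reference, while the perceptual metrics (CLIP-IQA, MUSIQ, MANIQA) are no-reference. As a minimal sketch of how both sides of the trade-off are typically measured, assuming the pyiqa (IQA-PyTorch) toolbox and illustrative file paths:

# Minimal sketch of measuring both sides of the perception-distortion
# trade-off, assuming the pyiqa (IQA-PyTorch) toolbox. File names below
# are illustrative, not files shipped with this project.
import torch
import pyiqa

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Full-reference (distortion) metrics: lower distortion = higher PSNR/SSIM.
psnr = pyiqa.create_metric("psnr", device=device)
ssim = pyiqa.create_metric("ssim", device=device)

# No-reference (perceptual) metrics: higher scores = better perceived quality.
clipiqa = pyiqa.create_metric("clipiqa", device=device)
musiq = pyiqa.create_metric("musiq", device=device)
maniqa = pyiqa.create_metric("maniqa", device=device)

sr, hr = "sr.png", "hr.png"  # hypothetical restored image and its ground truth
print("PSNR:", psnr(sr, hr).item(), "SSIM:", ssim(sr, hr).item())
print("CLIP-IQA:", clipiqa(sr).item(), "MUSIQ:", musiq(sr).item(), "MANIQA:", maniqa(sr).item())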

Abstract

Image super-resolution (SR) aims to reconstruct high-resolution images with both high perceptual quality and low distortion, but is fundamentally limited by the perception-distortion trade-off. GAN-based SR methods reduce distortion but still struggle to produce realistic fine-grained textures, whereas diffusion-based approaches synthesize rich details but often deviate from the input, hallucinating structures and degrading fidelity. This tension raises a key challenge: how to exploit the powerful generative priors of diffusion models without sacrificing fidelity. To address this, we propose SpaSemSR, a spatial-semantic guided diffusion framework with two complementary guidances. First, spatial-grounded textual guidance integrates object-level spatial cues with semantic prompts, aligning textual and visual structures to reduce distortion. Second, semantic-enhanced visual guidance, built on a multi-encoder design with semantic degradation constraints, unifies multimodal semantic priors, improving perceptual realism under severe degradations. These complementary guidances are adaptively fused into the diffusion process via spatial-semantic attention, suppressing distortion and hallucination while retaining the strengths of diffusion models. Extensive experiments on multiple benchmarks show that SpaSemSR achieves a superior perception-distortion balance, producing restorations that are both realistic and faithful.
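For readers unfamiliar with the trade-off, it can be stated formally; below is a LaTeX sketch of the standard formulation due to Blau and Michaeli (CVPR 2018), where the distortion measure and divergence are generic choices, not ones fixed by this paper:

% Perception-distortion function of Blau & Michaeli (CVPR 2018):
% the lowest achievable distortion at a given perceptual-quality level.
\begin{equation}
  D(P) \;=\; \min_{p_{\hat{X}\mid Y}} \; \mathbb{E}\!\left[\Delta(X, \hat{X})\right]
  \quad \text{s.t.} \quad d\!\left(p_X,\, p_{\hat{X}}\right) \le P
\end{equation}
% \Delta is a full-reference distortion measure (e.g., the MSE behind PSNR),
% and d is a divergence between the natural-image distribution p_X and the
% distribution of reconstructions p_{\hat{X}}. D(P) is non-increasing and
% convex, so demanding better perceptual quality (smaller P) necessarily
% raises the attainable distortion floor.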

Method

We propose SpaSemSR, a Spatial-Semantic guided SR framework that steers diffusion with two complementary forms of guidance: (a) Spatial-grounded textual guidance, which integrates semantic prompts with object-level spatial cues, improving fidelity by aligning text semantics with visual structure; (b) Semantic-enhanced visual guidance, which extracts semantic-enhanced features from degraded LR inputs under semantic degradation constraints, improving perceptual realism. These guidances are fused in (c) the proposed Spatial-Semantic ControlNet via parallel cross-attention and integrated into (d) our Spatial-Semantic guided Diffusion model, where they adaptively modulate the generative prior. By integrating spatial-grounded textual and semantic-enhanced visual guidance, SpaSemSR achieves reconstructions that are both perceptually realistic and faithful to the ground truth, effectively balancing the perception-distortion trade-off.
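To make the fusion in (c) concrete, here is a hypothetical PyTorch sketch of a parallel cross-attention block with adaptive gating; module, names, and shapes are illustrative assumptions, not the paper's released implementation:

# Hypothetical sketch of the parallel cross-attention fusion described in (c):
# two cross-attention branches attend to the textual and visual guidance
# tokens, and a learned gate adaptively mixes their outputs. All names and
# shapes are illustrative assumptions.
import torch
import torch.nn as nn

class SpatialSemanticFusion(nn.Module):
    def __init__(self, dim: int, text_dim: int, vis_dim: int, heads: int = 8):
        super().__init__()
        self.norm = nn.LayerNorm(dim)
        # Parallel cross-attention: one branch per guidance modality.
        self.attn_text = nn.MultiheadAttention(dim, heads, kdim=text_dim,
                                               vdim=text_dim, batch_first=True)
        self.attn_vis = nn.MultiheadAttention(dim, heads, kdim=vis_dim,
                                              vdim=vis_dim, batch_first=True)
        # Per-channel gate that adaptively weights the two branches.
        self.gate = nn.Sequential(nn.Linear(2 * dim, dim), nn.Sigmoid())

    def forward(self, x, text_tokens, vis_tokens):
        # x:           (B, N, dim)      latent tokens from the diffusion U-Net
        # text_tokens: (B, T, text_dim) spatial-grounded textual guidance
        # vis_tokens:  (B, V, vis_dim)  semantic-enhanced visual guidance
        h = self.norm(x)
        t_out, _ = self.attn_text(h, text_tokens, text_tokens)
        v_out, _ = self.attn_vis(h, vis_tokens, vis_tokens)
        g = self.gate(torch.cat([t_out, v_out], dim=-1))
        return x + g * t_out + (1 - g) * v_out  # residual, adaptively fused

# Toy usage with arbitrary dimensions:
# block = SpatialSemanticFusion(dim=320, text_dim=768, vis_dim=1024)
# y = block(torch.randn(2, 64, 320), torch.randn(2, 77, 768), torch.randn(2, 256, 1024))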

[Figure: SpaSemSR method overview]

Results

[Results Table 1]
[Results Table 2]

Visual Comparisons

[Image comparison sliders, two pairs per test image:]
Nikon_046: RealESRGAN (GAN-based) vs. SpaSemSR (ours); SpaSemSR (ours) vs. XPSR (diffusion-based)
Nikon_050: RealESRGAN (GAN-based) vs. SpaSemSR (ours); SpaSemSR (ours) vs. XPSR (diffusion-based)
panasonic_57: RealESRGAN (GAN-based) vs. SpaSemSR (ours); SpaSemSR (ours) vs. XPSR (diffusion-based)