Revisiting the Perception-Distortion Trade-off with Spatial-Semantic Guided Super-Resolution

Dan Wang1   Haiyan Sun2   Shan Du3   Z. Jane Wang3
Zhaochong An1   Serge Belongie1   Xinrui Cui2
1University of Copenhagen    2University of North Texas    3University of British Columbia
[Teaser figure: interactive LR vs. HR comparison]

Perception-distortion trade-off for GAN-based methods, diffusion-based methods, and ours: GAN-based methods reduce distortion but produce blurry textures, while diffusion-based methods generate perceptually sharp yet hallucinated details. By integrating spatial-grounded textual guidance, SpaSemSR improves reconstruction fidelity (PSNR, SSIM in (a)), while semantic-enhanced visual guidance enhances perceptual quality (CLIP-IQA, MUSIQ, MANIQA in (b)), resulting in a better perception-distortion trade-off than GAN-based (c) and diffusion-based (d) models.
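Note that the fidelity metrics (PSNR, SSIM) are full-reference, while the perceptual metrics (CLIP-IQA, MUSIQ, MANIQA) are no-reference. As a minimal sketch of how both sides of the trade-off are typically measured, assuming the pyiqa (IQA-PyTorch) toolbox and illustrative file paths:

# Minimal sketch of measuring both sides of the perception-distortion
# trade-off, assuming the pyiqa (IQA-PyTorch) toolbox. File names below
# are illustrative, not files shipped with this project.
import torch
import pyiqa

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Full-reference (distortion) metrics: lower distortion = higher PSNR/SSIM.
psnr = pyiqa.create_metric("psnr", device=device)
ssim = pyiqa.create_metric("ssim", device=device)

# No-reference (perceptual) metrics: higher scores = better perceived quality.
clipiqa = pyiqa.create_metric("clipiqa", device=device)
musiq = pyiqa.create_metric("musiq", device=device)
maniqa = pyiqa.create_metric("maniqa", device=device)

sr, hr = "sr.png", "hr.png"  # hypothetical restored image and its ground truth
print("PSNR:", psnr(sr, hr).item(), "SSIM:", ssim(sr, hr).item())
print("CLIP-IQA:", clipiqa(sr).item(), "MUSIQ:", musiq(sr).item(), "MANIQA:", maniqa(sr).item())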

Abstract

Image super-resolution (SR) aims to reconstruct high-resolution images with both high perceptual quality and low distortion, but is fundamentally limited by the perception-distortion trade-off. GAN-based SR methods reduce distortion but still struggle to produce realistic fine-grained textures, whereas diffusion-based approaches synthesize rich details but often deviate from the input, hallucinating structures and degrading fidelity. This tension raises a key challenge: how to exploit the powerful generative priors of diffusion models without sacrificing fidelity. To address this, we propose SpaSemSR, a spatial-semantic guided diffusion framework with two complementary guidances. First, spatial-grounded textual guidance integrates object-level spatial cues with semantic prompts, aligning textual and visual structures to reduce distortion. Second, semantic-enhanced visual guidance, built on a multi-encoder design with semantic degradation constraints, unifies multimodal semantic priors, improving perceptual realism under severe degradations. These complementary guidances are adaptively fused into the diffusion process via spatial-semantic attention, suppressing distortion and hallucination while retaining the strengths of diffusion models. Extensive experiments on multiple benchmarks show that SpaSemSR achieves a superior perception-distortion balance, producing restorations that are both realistic and faithful.
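For readers unfamiliar with the trade-off, it can be stated formally; below is a LaTeX sketch of the standard formulation due to Blau and Michaeli (CVPR 2018), where the distortion measure and divergence are generic choices, not ones fixed by this paper:

% Perception-distortion function of Blau & Michaeli (CVPR 2018):
% the lowest achievable distortion at a given perceptual-quality level.
\begin{equation}
  D(P) \;=\; \min_{p_{\hat{X}\mid Y}} \; \mathbb{E}\!\left[\Delta(X, \hat{X})\right]
  \quad \text{s.t.} \quad d\!\left(p_X,\, p_{\hat{X}}\right) \le P
\end{equation}
% \Delta is a full-reference distortion measure (e.g., the MSE behind PSNR),
% and d is a divergence between the natural-image distribution p_X and the
% distribution of reconstructions p_{\hat{X}}. D(P) is non-increasing and
% convex, so demanding better perceptual quality (smaller P) necessarily
% raises the attainable distortion floor.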

Method

We propose SpaSemSR, a Spatial-Semantic guided SR framework that steers diffusion with two complementary forms of guidance: (a) Spatial-grounded textual guidance, which integrates semantic prompts with object-level spatial cues, improving fidelity by aligning text semantics with visual structure; (b) Semantic-enhanced visual guidance, which extracts semantic-enhanced features from degraded LR inputs under semantic degradation constraints, improving perceptual realism. These guidances are fused in (c) the proposed Spatial-Semantic ControlNet via parallel cross-attention and integrated into (d) our Spatial-Semantic guided Diffusion model, where they adaptively modulate the generative prior. By integrating spatial-grounded textual and semantic-enhanced visual guidance, SpaSemSR achieves reconstructions that are both perceptually realistic and faithful to the ground truth, effectively balancing the perception-distortion trade-off.
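To make the fusion in (c) concrete, here is a hypothetical PyTorch sketch of a parallel cross-attention block with adaptive gating; module, names, and shapes are illustrative assumptions, not the paper's released implementation:

# Hypothetical sketch of the parallel cross-attention fusion described in (c):
# two cross-attention branches attend to the textual and visual guidance
# tokens, and a learned gate adaptively mixes their outputs. All names and
# shapes are illustrative assumptions.
import torch
import torch.nn as nn

class SpatialSemanticFusion(nn.Module):
    def __init__(self, dim: int, text_dim: int, vis_dim: int, heads: int = 8):
        super().__init__()
        self.norm = nn.LayerNorm(dim)
        # Parallel cross-attention: one branch per guidance modality.
        self.attn_text = nn.MultiheadAttention(dim, heads, kdim=text_dim,
                                               vdim=text_dim, batch_first=True)
        self.attn_vis = nn.MultiheadAttention(dim, heads, kdim=vis_dim,
                                              vdim=vis_dim, batch_first=True)
        # Per-channel gate that adaptively weights the two branches.
        self.gate = nn.Sequential(nn.Linear(2 * dim, dim), nn.Sigmoid())

    def forward(self, x, text_tokens, vis_tokens):
        # x:           (B, N, dim)      latent tokens from the diffusion U-Net
        # text_tokens: (B, T, text_dim) spatial-grounded textual guidance
        # vis_tokens:  (B, V, vis_dim)  semantic-enhanced visual guidance
        h = self.norm(x)
        t_out, _ = self.attn_text(h, text_tokens, text_tokens)
        v_out, _ = self.attn_vis(h, vis_tokens, vis_tokens)
        g = self.gate(torch.cat([t_out, v_out], dim=-1))
        return x + g * t_out + (1 - g) * v_out  # residual, adaptively fused

# Toy usage with arbitrary dimensions:
# block = SpatialSemanticFusion(dim=320, text_dim=768, vis_dim=1024)
# y = block(torch.randn(2, 64, 320), torch.randn(2, 77, 768), torch.randn(2, 256, 1024))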

[Figure: SpaSemSR method overview]

Results

[Results Table 1]
[Results Table 2]

Visual Comparisons

[Image comparison sliders, two pairs per test image:]
Nikon_046: RealESRGAN (GAN-based) vs. SpaSemSR (ours); SpaSemSR (ours) vs. XPSR (diffusion-based)
Nikon_050: RealESRGAN (GAN-based) vs. SpaSemSR (ours); SpaSemSR (ours) vs. XPSR (diffusion-based)
panasonic_57: RealESRGAN (GAN-based) vs. SpaSemSR (ours); SpaSemSR (ours) vs. XPSR (diffusion-based)