Image super-resolution (SR) aims to reconstruct high-resolution images with both high perceptual quality and low distortion, but is fundamentally limited by the perception-distortion trade-off.
GAN-based SR methods reduce distortion but still struggle to produce realistic fine-grained textures, whereas diffusion-based
approaches synthesize rich details but often deviate from the
input, hallucinating structures and degrading fidelity. This
tension raises a key challenge: how to exploit the powerful
generative priors of diffusion models without sacrificing
fidelity. To address this, we propose SpaSemSR, a spatial-semantic guided diffusion framework built on two complementary forms of guidance. First,
spatial-grounded textual guidance integrates object-level
spatial cues with semantic prompts, aligning textual and visual
structures to reduce distortion. Second, semantic-enhanced visual guidance, which combines a multi-encoder design with semantic degradation constraints, unifies multimodal
semantic priors, improving perceptual realism under severe
degradations. These complementary guidance signals are adaptively fused
into the diffusion process via spatial-semantic attention,
suppressing distortion and hallucination while retaining the
strengths of diffusion models. Extensive experiments on multiple
benchmarks show that SpaSemSR achieves a superior
perception-distortion balance, producing both realistic and
faithful restorations.
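To make the fusion step concrete, the following is a minimal, hypothetical PyTorch sketch of how two guidance streams could be injected into diffusion features through cross-attention with an adaptive gate. All module names, dimensions, and the gating design are illustrative assumptions for exposition, not the actual SpaSemSR architecture.

```python
import torch
import torch.nn as nn


class SpatialSemanticFusion(nn.Module):
    """Hypothetical sketch: fuse a textual-guidance embedding and a
    visual-guidance embedding into diffusion U-Net tokens via
    cross-attention, weighted by a learned per-token gate."""

    def __init__(self, dim: int, guid_dim: int, num_heads: int = 8):
        super().__init__()
        # Cross-attention from diffusion features (queries) to each guidance stream.
        self.text_attn = nn.MultiheadAttention(
            dim, num_heads, kdim=guid_dim, vdim=guid_dim, batch_first=True)
        self.vis_attn = nn.MultiheadAttention(
            dim, num_heads, kdim=guid_dim, vdim=guid_dim, batch_first=True)
        # Adaptive gate deciding how strongly each guidance stream is injected.
        self.gate = nn.Sequential(nn.Linear(dim, 2), nn.Sigmoid())
        self.norm = nn.LayerNorm(dim)

    def forward(self, feats, text_guid, vis_guid):
        # feats:     (B, N, dim)       diffusion U-Net tokens
        # text_guid: (B, Lt, guid_dim) spatial-grounded textual guidance
        # vis_guid:  (B, Lv, guid_dim) semantic-enhanced visual guidance
        q = self.norm(feats)
        t_out, _ = self.text_attn(q, text_guid, text_guid)
        v_out, _ = self.vis_attn(q, vis_guid, vis_guid)
        g = self.gate(q)  # (B, N, 2) per-token mixing weights
        return feats + g[..., :1] * t_out + g[..., 1:] * v_out


if __name__ == "__main__":
    # Toy usage with random tensors; shapes are arbitrary placeholders.
    block = SpatialSemanticFusion(dim=320, guid_dim=768)
    feats = torch.randn(2, 64, 320)
    text_g = torch.randn(2, 77, 768)
    vis_g = torch.randn(2, 256, 768)
    print(block(feats, text_g, vis_g).shape)  # torch.Size([2, 64, 320])
```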