Search is multimodal now
AI systems increasingly read the page, the image, the video transcript, the caption, the filename, and the structured data together. A useful visual with no context is almost invisible.
Treat every important asset as part of the public truth layer. The image should support the page’s claim. The video should have a transcript. The schema should connect the asset to the topic.
Images: make the visual quotable
Use descriptive filenames, accurate alt text, nearby explanatory copy, compressed files, and ImageObject schema where the image is a meaningful asset.
Do not stuff keywords into alt text. Describe what the image proves or shows. “AI visibility scorecard showing low citation coverage for a SaaS brand” is useful. “AI SEO GEO best tool ranking platform” is noise.
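Where the image is a meaningful asset, the schema connection mentioned above can be sketched with schema.org JSON-LD. This is a hedged example, not a definitive markup recipe: the URLs, filenames, and text are placeholders, reusing the scorecard description from the alt-text example.

```html
<script type="application/ld+json">
{
  "@context": "https://schema.org",
  "@type": "ImageObject",
  "contentUrl": "https://example.com/images/ai-visibility-scorecard.png",
  "name": "AI visibility scorecard",
  "description": "Scorecard showing low citation coverage for a SaaS brand.",
  "caption": "AI visibility scorecard showing low citation coverage for a SaaS brand"
}
</script>
```

Note the descriptive filename and the caption that restates what the image proves, so the markup, the alt text, and the surrounding copy all tell the same story.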
Videos: make spoken knowledge extractable
Upload full transcripts, add captions, use chapter timestamps, write a description that states the core answer, and add VideoObject schema on owned pages.
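On an owned page, VideoObject schema can carry the transcript and the core answer alongside the file itself. A minimal sketch with placeholder values (the title, date, and URLs are hypothetical); `transcript` is a standard schema.org property on VideoObject:

```html
<script type="application/ld+json">
{
  "@context": "https://schema.org",
  "@type": "VideoObject",
  "name": "How AI assistants choose sources",
  "description": "Explains why transcripts, chapters, and schema make spoken content citable.",
  "thumbnailUrl": "https://example.com/video/how-ai-chooses-sources-thumb.jpg",
  "uploadDate": "2024-05-01",
  "contentUrl": "https://example.com/video/how-ai-chooses-sources.mp4",
  "transcript": "Full transcript text goes here..."
}
</script>
```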
Short clips can work, but only when they carry context: title, overlay, caption, transcript, and a linked source page. Without those, the clip may get engagement without becoming an AI source.
The knowledge package standard
For high-value pages, pair the written answer with one explanatory visual and, where useful, a short video or webinar excerpt. The page becomes easier to cite because it offers text, proof, and media in one coherent package.
Production checklist
For every meaningful image: descriptive filename, accurate alt text, surrounding explanation, compressed file, and schema when appropriate.
For every meaningful video: transcript, captions, chapters, summary, source page, and VideoObject schema. If the spoken content matters, make it readable.
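The image side of this checklist can be spot-checked automatically. A minimal sketch using Python's standard-library `html.parser`, with a deliberately naive, hypothetical notion of what counts as a "generic" filename; a real audit would cover more cases:

```python
import re
from html.parser import HTMLParser

# Hypothetical heuristic: camera-style or placeholder filenames are "generic".
GENERIC_NAME = re.compile(
    r"(?:^|/)(?:img|image|photo|screenshot|untitled)[_-]?\d*\.\w+$", re.I
)

class ImageAudit(HTMLParser):
    """Collects checklist problems for every <img> tag encountered."""
    def __init__(self):
        super().__init__()
        self.problems = []

    def handle_starttag(self, tag, attrs):
        if tag != "img":
            return
        a = dict(attrs)
        src = a.get("src", "")
        alt = (a.get("alt") or "").strip()
        if not alt:
            self.problems.append((src, "missing or empty alt text"))
        if GENERIC_NAME.search(src):
            self.problems.append((src, "generic filename"))

def audit_images(html: str):
    """Return a list of (src, problem) pairs for images failing the checklist."""
    parser = ImageAudit()
    parser.feed(html)
    return parser.problems
```

For example, `audit_images('<img src="IMG_0042.jpg">')` flags both the missing alt text and the generic filename, while a descriptively named image with accurate alt text passes clean.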