Chapter 15 - Sound Design
Steven Spielberg
Great visuals feel hollow without sound. In AI video generation, every whisper of wind, every bass drop, every distant siren you describe in your prompt becomes part of the emotional payload that lands in the viewer's ear. This chapter gives you the vocabulary and structure to design immersive, story-driven sound prompts. Note that the following keywords target dedicated AI sound models, since only a limited number of AI video models currently support sound generation. As advanced prompt engineering techniques for sound generation [18] mature, you can expect video models to incorporate them in the near future [19].
Quick-start syntax
Structure for sound prompts:
sound [type][intensity][position][texture][temporal_shape]
The type field (the sound type) is mandatory; all other fields are optional.
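
To make the syntax concrete, here is a minimal sketch of a prompt builder in Python. The field names mirror the quick-start syntax above; the example values ("ambience", "soft", and so on) are illustrative assumptions, not keywords guaranteed to be supported by any particular sound or video model.

# Minimal sketch of a sound-prompt builder following the quick-start syntax.
# The example field values are assumptions for illustration only.

from typing import Optional


def build_sound_prompt(
    sound_type: str,                       # mandatory: the primary sound type
    intensity: Optional[str] = None,       # optional: e.g. "soft", "overwhelming"
    position: Optional[str] = None,        # optional: e.g. "distant", "close-up"
    texture: Optional[str] = None,         # optional: e.g. "gritty", "warm"
    temporal_shape: Optional[str] = None,  # optional: e.g. "slow swell", "sudden cut"
) -> str:
    """Assemble 'sound [type][intensity][position][texture][temporal_shape]'."""
    fields = [sound_type, intensity, position, texture, temporal_shape]
    return "sound " + "".join(f"[{f}]" for f in fields if f)


# Only the type is required; the other fields refine it.
print(build_sound_prompt("ambience", intensity="soft", position="distant",
                         temporal_shape="slow swell"))
# -> sound [ambience][soft][distant][slow swell]

The builder simply drops any field you leave out, which matches the rule that only the type is mandatory.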
Primary sound types
The first part of a sound prompt is the sonic anchor: the primary sound type that defines how the audience experiences the moment. The sonic anchors are dialogue, music, foley, ambience, and silence; they are sometimes called atoms. Music can serve as a background tool that defines mood, while sound effects act as punctuation marks, sharpening beats of action or surprise. Silence is its own powerful type, a deliberate negative space that heightens anticipation or forces attention onto the image. The sketch below lists the anchors with illustrative prompts.
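
As a sketch of how the anchors might be organized programmatically, the snippet below enumerates the five anchors named above and pairs each with an illustrative prompt in the quick-start syntax. Only the anchor names come from the text; the example wordings are assumptions chosen for demonstration.

# Sketch: enumerate the sonic anchors and pair each with an illustrative prompt.
# The prompt wordings are assumptions, not model-specific keywords.

from enum import Enum


class SonicAnchor(Enum):
    DIALOGUE = "dialogue"
    MUSIC = "music"
    FOLEY = "foley"
    AMBIENCE = "ambience"
    SILENCE = "silence"


EXAMPLES = {
    SonicAnchor.DIALOGUE: "sound [dialogue][whispered][close-up]",
    SonicAnchor.MUSIC: "sound [music][low strings][slow swell]",      # mood-setting background
    SonicAnchor.FOLEY: "sound [foley][sharp][sudden cut]",            # punctuation for an action beat
    SonicAnchor.AMBIENCE: "sound [ambience][distant traffic][steady]",
    SonicAnchor.SILENCE: "sound [silence][held]",                     # deliberate negative space
}

for anchor, prompt in EXAMPLES.items():
    print(f"{anchor.value:9s} -> {prompt}")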
Use the following keywords to adjust the primary sound types:

