  • NExT-GPT: Any-to-Any Multimodal LLM — Summary
    Paper Notes 2025. 7. 9. 16:39
    1. Research Goal and Problem Definition
    2. Model Architecture (Overall)
    3. Training Strategy
    4. Summary of Experimental Results
    5. Conclusion and Key Contributions
    6. Significance and Takeaways for Our Project

    1. Research Goal and Problem Definition

    Research goal

    • NExT-GPT proposes a general-purpose multimodal large language model (MM-LLM) that can take any modality as input and produce any modality as output.
    • The goal is to build an "Any-to-Any" multimodal system that can freely handle any combination of text, image, audio, and video.

    Limitations of prior work

    Structure of existing MM-LLMs

    • Most multimodal models follow an LLM-centric design with adapter connections.
    • They can perceive non-text inputs (images, audio, etc.), but their output is almost always text.
    • Examples: BLIP-2, Flamingo, LLaVA, MiniGPT-4

    Key limitations

    • Text-biased output: non-text modalities can be taken as input, but output is limited to text
    • Pipeline design: external tools are invoked in sequence, so noise and errors accumulate as information is handed over
    • Limited training: the overall system is not trained end-to-end, weakening its reasoning ability
    • Lack of flexibility: switching or combining modalities is restricted, so realistic usage scenarios are hard to cover

    The core problem to solve

    • Can we build a flexible, human-like AI system that accepts input in any modality and responds in whichever modality the instruction calls for?
    • In other words, the central task of this paper is to design an end-to-end training architecture capable of Any-to-Any multimodal understanding and generation, free in both its inputs and outputs.

    2. Model Architecture (Overall)

    Architecture overview

    NExT-GPT is an end-to-end any-to-any MM-LLM system composed of the following three stages.

    1. Multimodal Input Encoding (left)

    • Text: passed straight to the LLM without a separate encoder.
    • Image / Audio / Video:
      • Each passes through its Image / Audio / Video Encoder and is encoded into vectors
      • ❄️ blue snowflake = pretrained and frozen; not updated during training
    • An Input Projection layer then converts each modality's representation into a language-like form the LLM can understand
      • 🔥 flame icon = trainable component

    2. LLM-centric Alignment & Semantic Understanding (center)

    • The LLM (e.g., Vicuna) receives the multimodal representations, interprets their meaning, and performs reasoning
    • At the same time, it generates:
      • the text response
      • Modality Signal Tokens: signals indicating which modality to generate

    3. Instruction-following Alignment & Multimodal Output Generation (right)

    • The Modality Signal Tokens pass through an Output Projection layer and are routed to the corresponding modality decoder
    • Each decoder plays the following role:
      • Image Diffusion → image generation
      • Audio Diffusion → audio generation
      • Video Diffusion → video generation
      • All are ❄️ frozen and not trained
    • Finally, output is produced in the modality the user requested
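    The three-stage flow can be sketched as plain-Python routing pseudocode. Everything here is an illustrative stand-in (the stubbed encoder/LLM/decoders and the `[IMG_SIGNAL]` token format are assumptions, not the paper's actual API); it only shows how signal tokens route generation to frozen decoders:

```python
# Illustrative sketch of NExT-GPT's three-stage flow: encode -> reason -> decode.
# The encoders, LLM, and decoders are trivial stand-ins (assumptions).

def encode_and_project(modality, raw):
    # Stage 1: frozen encoder + trainable input projection (stubbed).
    return f"<{modality}-concept-tokens:{raw}>"

def llm(aligned_inputs, instruction):
    # Stage 2: frozen LLM reads the aligned representations and decides,
    # via modality signal tokens, which output modality to produce (stubbed).
    if "picture" in instruction:
        return "Here you go.", ["[IMG_SIGNAL]"]
    return "Sure.", []

DECODERS = {  # Stage 3: frozen diffusion decoders, keyed by signal token.
    "[IMG_SIGNAL]": lambda sig: "generated-image",
    "[AUD_SIGNAL]": lambda sig: "generated-audio",
    "[VID_SIGNAL]": lambda sig: "generated-video",
}

def next_gpt_step(instruction, inputs):
    aligned = [encode_and_project(m, x) for m, x in inputs]
    text, signals = llm(aligned, instruction)
    outputs = [DECODERS[s](s) for s in signals]  # output projection omitted
    return text, outputs

text, outputs = next_gpt_step("draw a picture of a cat", [("image", "photo.png")])
```

    The point of the sketch is the routing: the LLM never renders pixels or audio itself; it only emits signal tokens that select and condition a frozen decoder.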

    Key summary

    • Input encoding: pretrained encoders + trainable projection layers convert each modality into an LLM-readable form
    • Central processing: the LLM integrates all the information, understands its meaning, and decides what to generate
    • Output generation: diffusion decoders produce the multimodal output; only the projections are fine-tuned
    • Training footprint: across the whole model, only the projection layers marked 🔥 are trained (a lightweight design)

    📌 Why does this matter?

    Rather than merely wiring modalities together, this architecture unifies the complete Any-to-Any multimodal flow end-to-end through LLM-centric semantic processing and generation control. And since most components remain frozen, it delivers high performance and extensibility at low cost.


    3. Training Strategy

    To train a large multimodal model efficiently, NExT-GPT does not train the whole system from scratch; instead it uses the following strategies to reach high performance with minimal compute.

    🔹 1. Reuse of pretrained, high-performance encoders and decoders

    • Proven encoders and decoders such as CLIP, ImageBind, and Stable Diffusion are reused.
    • Leveraging existing components instead of training from scratch saves time and cost.
    • It also secures the possibility of extending to further modalities.

    🔹 2. Off-the-shelf parameters and avoiding the cold-start problem

    • Loading and reusing pretrained parameters removes the inefficiency of a cold start (training from freshly initialized weights).
    • This is a strategy that secures both training stability and extensibility.

    🔹 3. Fine-tuning only a minimal set of parameters

    • Of the full model, only the Input/Output Projection layers are fine-tuned; the encoders, decoders, and the LLM are all frozen.
    • Only about 1% of the total parameters are used in training, minimizing training cost.
    • The fine-tuned projection layers are responsible for cross-modal feature alignment.
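    This freeze-everything-but-the-projections recipe can be sketched in PyTorch. The module sizes below are tiny placeholders (the real model uses Vicuna plus ImageBind/diffusion components), so the exact trainable fraction is illustrative, not the paper's 1% figure:

```python
import torch.nn as nn

# Tiny placeholder modules standing in for the frozen encoder/LLM/decoder
# and the small trainable projection layers (sizes are illustrative only).
model = nn.ModuleDict({
    "encoder": nn.Linear(1024, 4096),   # frozen, pretrained
    "llm": nn.Linear(4096, 4096),       # frozen, pretrained
    "decoder": nn.Linear(4096, 1024),   # frozen, pretrained
    "input_proj": nn.Linear(64, 32),    # trainable
    "output_proj": nn.Linear(64, 32),   # trainable
})

# Freeze everything, then re-enable gradients only for the projections.
for p in model.parameters():
    p.requires_grad = False
for name in ("input_proj", "output_proj"):
    for p in model[name].parameters():
        p.requires_grad = True

trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
total = sum(p.numel() for p in model.parameters())
frac = trainable / total  # tiny fraction of the full parameter count
```

    An optimizer built from `(p for p in model.parameters() if p.requires_grad)` then touches only the projections, which is what keeps the training budget small.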

    🔹 4. Two-sided alignment learning

    Encoding-side alignment: LLM-centric Alignment

    • Patch-level features from an encoder such as ImageBind are grouped into concept-level tokens, converting them into language-like representations the LLM can understand.
    • This performs semantic alignment between the text-based LLM and visual/audio/video features.

    Flow summary

    Input (Image, Audio, Video) → Encoder → Patch Representation → Concept Tokens → LLM → caption generation → trained by cross-entropy against the ground-truth caption
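    One training step of this flow can be sketched in NumPy. All shapes (196 patches, 4 concept tokens, a 1000-word vocabulary, caption token id 42) and the single-matrix "LLM" are illustrative assumptions; only the data flow and the cross-entropy objective mirror the description above:

```python
import numpy as np

rng = np.random.default_rng(0)

# Shapes are illustrative: 196 patches of dim 512 -> 4 concept tokens of dim 64.
patches = rng.normal(size=(196, 512))        # frozen encoder output
W_proj = rng.normal(size=(512, 64)) * 0.02   # trainable input projection
assign = rng.normal(size=(196, 4))           # trainable grouping scores

# Group patches into concept tokens: soft-assign patches to 4 concept slots.
weights = np.exp(assign) / np.exp(assign).sum(axis=0, keepdims=True)
concept_tokens = weights.T @ (patches @ W_proj)   # (4, 64)

# Frozen-LLM stand-in: score one caption token over a 1000-word vocabulary.
W_llm = rng.normal(size=(64, 1000)) * 0.02        # frozen (no gradient)
logits = concept_tokens.mean(axis=0) @ W_llm

# Cross-entropy against the ground-truth caption token (id 42, illustrative);
# in real training this gradient flows back only into W_proj and assign.
log_probs = logits - np.log(np.exp(logits).sum())
loss = -log_probs[42]
```

    The key property is that the loss is computed on the caption while the only trainable tensors sit in the projection/grouping path, so minimizing it pulls the concept tokens into the LLM's linguistic space.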


    ๊ฐ ๊ตฌ์„ฑ ์š”์†Œ ์„ค๋ช…

    •  

    1. Image / Audio / Video

    • The three input modalities: image, audio, video

    2. Encoder (image/audio/video encoders)

    • Pretrained encoders (e.g., CLIP, ImageBind, HuBERT)
    • ❄️ mark = parameters are frozen

    3. Patch Representation

    • The encoder output represents each input as a grid of patch-level features.
      • e.g., image → 16×16 patches → one vector per patch

    4. Input Projection module

    • Composed of several Transformer layers and a Grouping Block 🔥
    • This module processes the patch features as follows:
      • Transformer layers: learn the relations among patches
      • Grouping Block: aggregates patches into concept-level (semantic) tokens
      • Concept Token Representation: the final concept-level representation the LLM can understand
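    The Grouping Block's patch-to-concept aggregation can be sketched as query-based attention pooling. Query-based grouping is an assumption standing in for the paper's actual block, and the sizes (196 patches, 4 concepts, dim 64) are illustrative:

```python
import numpy as np

rng = np.random.default_rng(1)

n_patches, d, n_concepts = 196, 64, 4
patch_feats = rng.normal(size=(n_patches, d))  # output of the Transformer layers

# Grouping sketch: learnable concept queries cross-attend over the patches,
# so each concept token is an attention-weighted pool of patch features.
queries = rng.normal(size=(n_concepts, d))     # trainable concept queries

scores = queries @ patch_feats.T / np.sqrt(d)  # (4, 196) attention logits
attn = np.exp(scores - scores.max(axis=1, keepdims=True))
attn = attn / attn.sum(axis=1, keepdims=True)  # softmax over patches
concept_tokens = attn @ patch_feats            # (4, 64): 196 patches -> 4 tokens
```

    Whatever its exact form, the block's job is this compression: hundreds of patch vectors become a handful of concept-level tokens short enough to feed into the LLM's context.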

    5. Concept Image/Audio/Video Representation

    • The concept-level representation vectors fed into the LLM, one set per modality

    6. LLM (e.g., Vicuna-7B)

    • ❄️ frozen
    • Generates image/audio/video captions from the incoming concept representations

    7. Caption generation → Cross-Entropy Loss

    • The generated caption is compared with the ground-truth caption for training
    • The comparison is computed as a cross-entropy loss, which updates the Input Projection layers

     

    🎯 Training objective

    • Effectively align the input modalities (image, audio, video) into the linguistic space,
    • so that the LLM can understand and respond to every modality as if it were text

    Summary

    • Input: image, audio, video
    • Intermediate representation: patch-level features → concept token representation
    • Trainable components: the 🔥 layers inside the Input Projection (Transformer + Grouping)
    • Output: the caption (textual description) generated by the LLM
    • Loss: cross-entropy (predicted caption vs. ground-truth caption)
    • Frozen modules: encoders and the LLM (Vicuna) ❄️

    This stage is the key alignment step that lets the LLM understand each modality's meaning at the concept level and express it in text. In short: it "makes the LLM perceive images/video/audio like language."


    Decoding-side alignment: Instruction-following Alignment

    • The modality-specific signal tokens produced by the LLM are aligned into conditioning inputs the decoders can understand.
    • During training, the representation distance between the signal tokens and the diffusion model's conditioning text representation is minimized.

    Flow summary

    LLM → Signal Tokens (Image, Audio, Video)

    → Output Projection (Transformer + Linear)

    → conditioning input to the diffusion model

    → content generation + loss computation (alignment + denoising)


    Component-by-component description

    1. LLM Output Representation

    • Besides the text response, the LLM also emits modality-instruction tokens:
      • Image Signal Token
      • Audio Signal Token
      • Video Signal Token
    • ❄️ = the LLM stays frozen

    2. Image Output Projection (trainable 🔥)

    • A projection module that converts the LLM's signal tokens into a representation the diffusion model can interpret
    • Structure: Transformer encoder + decoder + linear layer
    • Training target: the semantic alignment between signal tokens and the diffusion conditioning

    3. Image Diffusion

    • Uses a Stable Diffusion backbone for image generation
    • Composed of a text encoder + U-Net (❄️ frozen)
    • Information supplied as conditioning:
      • the (projected) LLM signal token representation
      • the text description

    4. Loss composition

    1. Caption-alignment loss
      • Minimizes the representation distance between the signal tokens generated by the LLM
      • and the conditioning representation produced by the diffusion model's text encoder
    2. Conditional latent denoising loss
      • Pushes the generated image toward the target image
      • by applying a denoising loss to the U-Net's latent output

    🎯 Key summary

    • Goal: alignment training so that the LLM's modality instructions (signal tokens) are faithfully reflected by the generators
    • Main components: Output Projection (Transformer-based), signal tokens, diffusion models
    • Training scope: only the Output Projection layers 🔥
    • Losses: (1) representation alignment (caption-alignment), (2) generation quality (denoising loss)
    • Generators: diffusion-based (image: Stable Diffusion, video: Zeroscope, audio: AudioLDM)

    📌 Why does this matter?

    • Where previous MM-LLM systems controlled decoders through text instructions alone, NExT-GPT generates per-modality signal tokens and connects them directly to the diffusion generators, enabling precise and flexible multimodal generation.

    Thanks to this design, NExT-GPT goes beyond simple text responses and operates as a true Any-to-Any model, carrying out image, video, and audio generation consistently and in an integrated way.


    🔹 5. Building the MosIT dataset

    • Because existing instruction-tuning data is text-centric and therefore limited, the authors built a new modality-switching dataset, MosIT.
    • It contains 5,000 high-quality dialogue examples reflecting human-level conversational flow: switches between multiple modalities, complex dialogues of 3–7 turns, explicit and implicit requests, and reasoning/planning/emotional responses.

    🔹 6. Lightweight training with LoRA

    • LoRA (Low-Rank Adaptation) is used to update only a small subset of the LLM's parameters efficiently.
    • This reduces compute while preserving the model's expressive power.
    • LoRA is applied to some internal LLM modules in addition to the projection layers.
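    The LoRA mechanism itself is simple enough to sketch in NumPy: a frozen weight W is augmented with a low-rank update (alpha/r)·BA, and only A and B are trained. The sizes and alpha below are illustrative, not NExT-GPT's actual configuration:

```python
import numpy as np

rng = np.random.default_rng(3)

d_in, d_out, r = 256, 256, 8            # illustrative sizes; rank r << d
W = rng.normal(size=(d_out, d_in))      # frozen pretrained weight
A = rng.normal(size=(r, d_in)) * 0.01   # trainable low-rank factor
B = np.zeros((d_out, r))                # trainable; zero-init makes the
                                        # adapter a no-op at the start
alpha = 16.0

def lora_forward(x):
    # y = W x + (alpha / r) * B (A x): gradients flow only into A and B.
    return W @ x + (alpha / r) * (B @ (A @ x))

x = rng.normal(size=d_in)
y0 = lora_forward(x)                    # identical to W @ x at initialization

lora_params = A.size + B.size           # 2 * r * d parameters
full_params = W.size                    # d * d parameters
```

    This is why LoRA is cheap: for r = 8 on a 256×256 layer, the adapter holds well under a tenth of the layer's parameters, and the frozen W never needs optimizer state.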

    The figure (Figure 3) visualizes the full modality-switching instruction tuning process:

    it shows how NExT-GPT learns to understand text-based instructions and generate the appropriate multimodal output accordingly.

    Flow summary

    1. Input Instructions (gray box, left)

    • User input is either text alone or a combination of text + multimodal data.
      • Examples:
        • "Show me a cat playing the piano" → text + image
        • "What is this sound?" → text + audio
        • "Describe this video" → text + video

    2. Input encoding and projection (center-left)

    • Image, audio, and video inputs go through their respective encoders for feature extraction,
    • then through the Input Projection layer into a language-based representation the LLM can understand.
    • Text-only input is fed to the LLM directly, without projection.

    3. LLM + LoRA-based instruction tuning

    • Given the instruction, the LLM produces the text output and,
    • when needed, special tokens carrying multimodal generation instructions (e.g., <IMG0>, <VID2>).
    • Only part of the LLM is lightly trained here, using LoRA.
    • (The "LoRA 🔥" attached to the LLM block in the figure indicates this.)

    4๏ธโƒฃ LLM Output vs Gold Annotation ๋น„๊ต

    • ์‹ค์ œ ์ƒ์„ฑ๋œ ํ…์ŠคํŠธ + signal token ์‹œํ€€์Šค์™€
    • ์ •๋‹ต ์‹œํ€€์Šค(Gold Annotation)๋ฅผ Cross Entropy Loss๋กœ ๋น„๊ตํ•˜์—ฌ ํ•™์Šต.

    5๏ธโƒฃ Signal Token → Output Projection → Diffusion

    • LLM์ด ์ƒ์„ฑํ•œ signal token ํ‘œํ˜„์€ ๊ฐ modality๋ณ„ Output Projection Layer๋ฅผ ๊ฑฐ์ณ,
    • Diffusion Decoder๋กœ ์ „๋‹ฌ๋œ๋‹ค.
    • ์—ฌ๊ธฐ์„œ ์‹ค์ œ ๋ฉ€ํ‹ฐ๋ชจ๋‹ฌ ์ฝ˜ํ…์ธ (์ด๋ฏธ์ง€, ์˜ค๋””์˜ค, ๋น„๋””์˜ค)๋ฅผ ์ƒ์„ฑํ•œ๋‹ค.

    6๏ธโƒฃ ์ƒ์„ฑ ๊ฒฐ๊ณผ ํ‰๊ฐ€ (Generation Loss)

    • ์ƒ์„ฑ๋œ ์ด๋ฏธ์ง€/์˜ค๋””์˜ค/๋น„๋””์˜ค์™€
    • ์ •๋‹ต multimodal caption (Gold Annotation)์„ ๋น„๊ตํ•˜์—ฌ Generation Loss๋ฅผ ๊ณ„์‚ฐํ•œ๋‹ค.

    Summary

    • Input: text + multimodal data (image, audio, video)
    • Goal: generate exactly the multimodal response the instruction asks for
    • LLM role: produce the text output plus multimodal generation instructions (signal tokens)
    • Training: LoRA-based LLM tuning + output projection alignment
    • Losses: cross-entropy + generation loss
    • Result: an MM-LLM that can generate across modalities as the user instructs

    This approach goes beyond merely "receiving and interpreting" modalities: it is the key to building an AI system that understands user instructions and actively generates text, images, audio, and video.


    4. Summary of Experimental Results

    NExT-GPT showed strong performance on both modality perception and content generation. Its effectiveness was demonstrated across a range of benchmarks and comparative experiments, along with additional ablations analyzing the impact of individual modules.


    1. Multimodal perception

    • Image understanding
      • Reached state-of-the-art-level performance on image captioning, image QA, and related tasks.
      • High accuracy on dedicated evaluation benchmarks such as MMBench and SEED-Bench.
    • Video and audio understanding
      • Strong comprehension and sentence-generation quality on WebVid-2M (video) and AudioCaps (audio) evaluations.
      • Direct LLM-based generation yields notably strong contextual expressiveness.

    2. Multimodal generation

    • Quality of text-to-image/video/audio generation
      • Uses Stable Diffusion (image), Zeroscope (video), and AudioLDM (audio).
      • Because the LLM directs generation through modality signal tokens, controllability and expressiveness are strong.
    • Compared models: GILL, Emu, UIO-2XXL, CoDi, among others
      • Supports more modality combinations than most baselines and holds up in zero-shot settings.

    3. Quantitative analysis – effect of the number of signal tokens

    • The number of signal tokens needed differs by modality:
      • roughly 4 for images, 8 for audio, and 24 or more for video.
    • Performance varies sensitively with the amount of data and the strength of the diffusion model.

    4. Component ablations

    • Effect of the grouping mechanism
      • A plain linear layer → sharp performance drop.
      • Introducing a Q-Former → some improvement.
      • NExT-GPT's grouping mechanism is the most effective.
    • Pipeline vs. end-to-end comparison
      • In human evaluations of instruction-following, plausibility, and generation quality, the end-to-end design is clearly superior.

    5. Qualitative analysis

    • Intuitive examples:
      • Detecting abnormal behavior in a video, then generating a similar image and audio.
      • Sensing the user's emotion and automatically generating a comforting video (e.g., a puppy clip).
      • Generating visual material plus summary tips when preparing a presentation.
    • Understanding implicit instructions
      • Even without an explicit request, the model infers the user's mood or intent and selects and generates the appropriate modality.

    5. Conclusion and Key Contributions

    NExT-GPT is an end-to-end, general-purpose any-to-any multimodal LLM: a powerful system that can freely use text, image, audio, and video as both input and output. It overcomes the limitations of prior pipeline approaches and offers the following strengths:

    • A modular architecture connecting diverse modalities
    • Minimal training cost through reuse of existing high-performance encoders and decoders
    • A lightweight strategy that trains only ~1% of all parameters
    • Construction and use of a high-quality instruction tuning dataset (MosIT)
    • Capable of complex cross-modal reasoning and generation

    Key contributions

    1. The first general-purpose any-to-any MM-LLM
      • Freely perceives and generates text, image, audio, and video.
      • Built on an LLM, it carries built-in reasoning ability and handles question answering in a human-like way.
    2. Lightweight alignment learning
      • Encoding side: LLM-centric multimodal alignment
      • Decoding side: instruction-following alignment
      • A highly efficient design that fine-tunes only about 1% of the whole system
    3. MosIT: a high-quality modality-switching instruction tuning dataset
      • 5,000+ multimodal dialogue samples, manually created and reviewed
      • Covers diverse topics and modality combinations, multi-turn dialogue, and even implicit commands

    6. Significance and Takeaways for Our Project

    Significance of this work

    • The first realized example of a general-purpose any-to-any multimodal LLM
    • → a first step toward human-like AI that freely combines diverse inputs and outputs
    • Modular architecture + lightweight training strategy → both extensibility and efficiency
    • → with the LLM frozen, only the projection layers are fine-tuned → reduced training cost
    • Overcomes the limits of pipeline approaches (discontinuity, error accumulation, weak reasoning)
    • → meaningful improvement through an end-to-end, jointly trained design

    Direct connection to our project

    • We are in charge of the "add a new modality to the model" experiments
    • → NExT-GPT's architecture is very well suited to adding a new modality
    • The key tasks when plugging in a new modality:
      1. Select or build an encoder for the new modality
      2. Train an input projection layer
      3. Define and train the modality's signal tokens
      4. (If needed) connect an output projection + diffusion decoder
    • 💡 In short, NExT-GPT's architecture serves as a direct guideline for designing and running our experiments
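    Those four tasks can be sketched as a small modality registry. Everything here is hypothetical (the registry shape, the "thermal" example modality, the token count); it only illustrates what adding a modality to a NExT-GPT-style system touches:

```python
# Sketch of the four steps above as a modality registry (all names hypothetical).

registry = {
    "image": {
        "encoder": "ImageBind (frozen)",
        "input_projection": "trainable",
        "signal_tokens": [f"<IMG{i}>" for i in range(4)],
        "decoder": "Stable Diffusion (frozen) + trainable output projection",
    },
}

def add_modality(name, encoder, n_signal_tokens, decoder=None):
    # Steps 1-4: pick an encoder, attach a fresh input projection,
    # define signal tokens, and (optionally) wire up an output path.
    registry[name] = {
        "encoder": encoder,
        "input_projection": "trainable (aligned via the captioning loss)",
        "signal_tokens": [f"<{name.upper()}{i}>" for i in range(n_signal_tokens)],
        "decoder": decoder,  # None = input-only modality for now
    }

# Hypothetical new modality for our experiments:
add_modality("thermal", "pretrained thermal encoder (frozen)", n_signal_tokens=4)
```

    The useful observation is that steps 1–3 suffice for an input-only modality; an output path (step 4) can be deferred until a suitable generator exists.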
