  • GroundingDINO Summary
    Paper Review 2025. 7. 9. 16:25
    1. Motivation & Background (Why is it needed?)
    2. Model Overview (Architecture and Key Ideas)
    3. Loss Functions
    4. Experiments & Results
    5. Limitations & Insights
    6. Takeaways

    1. Motivation & Background (Why is it needed?)

    What is Open-Set Object Detection?

    Problem definition: Open-Set Object Detection

    • Conventional closed-set object detection models can recognize only objects from a fixed set of categories
      • Example: a model trained on a label set with no 'elephant' class cannot detect elephants
    • The real world, however, is an open-world setting
      • New objects can appear at any time
      • Detection driven by user-defined queries (e.g., "red umbrella") is needed

    💡 Goal: a model that can detect arbitrary objects specified in natural language

    Limitations of existing models (e.g., the modality-fusion limits of GLIP and OV-DETR)

    • GLIP: fuses image and text early and trains detection as a grounding task
      • Drawback: modality fusion is confined to a single stage (the neck) → insufficient cross-modality alignment
    • OV-DETR: uses text embeddings as queries (text enters only at query initialization, as input to the head)
      • Drawback: shallow fusion depth → insufficient fine-grained alignment
    • CLIP-based approaches: trained mainly on image-text pairs → limited region-level grounding performance

    📌 Summary: existing models fuse language and visual information only loosely or at limited points

    Why Grounding DINO was proposed

    • It builds on the strengths of an existing DETR-based model (DINO):
      • The layer-by-layer structure makes it easy to design text-image fusion flexibly
    • Grounding DINO proposes a three-stage modality fusion (tight fusion):
      1. Feature Enhancer (neck): aligns text and image features
      2. Language-Guided Query Selection (query): selects queries based on the text
      3. Cross-Modality Decoder (head): integrates both modalities into the final prediction

    🎯 Key idea: tightly align (fuse) language and visual information at multiple levels


    2. Model Overview (Architecture and Key Ideas)

    Overall architecture of Grounding DINO

    Grounding DINO progressively aligns (fuses) image and text information module by module and then detects objects.

    🖼️ Image Backbone

    • Purpose: extract multi-scale features from the image
    • Model: Swin Transformer (Tiny, Large, etc.)
    • Output: vanilla image features at several resolutions (8×, 16×, 32×)

    📝 Text Backbone

    • Purpose: extract language embeddings from the input text (a query sentence or a list of category names)
    • Model: BERT (HuggingFace BERT-base)
    • Output: vanilla text features (sentence or sub-sentence level)

    🔗 Feature Enhancer

    • Purpose: align information through interaction between text and image features
    • Components:
      • Deformable Self-Attention (image only): an improved self-attention designed to keep both efficiency and accuracy when processing large, high-resolution images
      • Text Self-Attention (text only)
      • Text-to-Image & Image-to-Text Cross-Attention (bidirectional alignment)
    • Result: cross-modality aligned features
    • Key point: inspired by GLIP, the two extra modules, Image-to-Text Cross-Attention and Text-to-Image Cross-Attention, help align features across the two modalities!
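
    The bidirectional cross-attention step can be sketched in a few lines of plain Python. This is only a toy illustration under loud assumptions: there are no learned projections, a single head, no self-attention or FFN, and the function names (cross_attend, enhancer_cross_attention) are hypothetical rather than taken from any released code. It only shows that each modality is updated by reading from the other.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attend(queries, memory):
    """Add to each query a softmax-weighted mix of the memory vectors."""
    dim = len(queries[0])
    updated = []
    for q in queries:
        # Scaled dot-product score of this query against every memory vector.
        scores = [sum(a * b for a, b in zip(q, m)) / math.sqrt(dim) for m in memory]
        weights = softmax(scores)
        mixed = [sum(w * m[d] for w, m in zip(weights, memory)) for d in range(dim)]
        updated.append([qi + mi for qi, mi in zip(q, mixed)])  # residual connection
    return updated


def enhancer_cross_attention(image_tokens, text_tokens):
    """One bidirectional step; both directions read the same input features (assumption)."""
    new_image = cross_attend(image_tokens, text_tokens)  # image tokens query the text
    new_text = cross_attend(text_tokens, image_tokens)   # text tokens query the image
    return new_image, new_text
```

    Calling enhancer_cross_attention with three image tokens and two text tokens returns three updated image tokens and two updated text tokens of the same dimension, which is why the step can be stacked layer after layer.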

    🎯 Language-Guided Query Selection

    • Purpose: pick only the image features most relevant to the text and use them as queries
    • Method:
      • Compute the similarity between image and text features
      • Use the top-N image regions with the highest similarity as queries
    • Result: cross-modality queries
    • Key point: rather than treating every image location equally, Grounding DINO initializes its queries only from regions related to the text! This module improves text-image semantic alignment and is a key factor behind the gains in open-set/zero-shot detection.
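
    The selection rule above (similarity, then top-N) can be written as a short framework-free sketch. The function name select_queries and the list-of-lists feature format are assumptions for illustration; a real implementation would use batched tensor operations.

```python
def select_queries(image_feats, text_feats, num_query):
    """Return indices of the image tokens most similar to any text token."""
    # Dot-product similarity of every image token with every text token.
    logits = [
        [sum(a * b for a, b in zip(img, txt)) for txt in text_feats]
        for img in image_feats
    ]
    # For each image token keep the score of its best-matching text token.
    best_per_image_token = [max(row) for row in logits]
    # Rank image tokens by that score and keep the top num_query indices.
    ranked = sorted(
        range(len(best_per_image_token)),
        key=lambda i: best_per_image_token[i],
        reverse=True,
    )
    return ranked[:num_query]
```

    The returned indices point at the image tokens whose features would be used to initialize the decoder queries.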

    🧠 Cross-Modality Decoder

    • Purpose: update the selected queries using image + text information and predict bounding boxes
    • Components:
      • Self-Attention
      • Image Cross-Attention
      • Text Cross-Attention
      • FFN (Feed Forward Network)
    • Output: object bounding boxes matched to text (labels)
    • Key point: each Grounding DINO decoder layer has one more sub-layer than a DINO decoder layer: the added Text Cross-Attention layer. It injects text information directly into the queries so that vision-language alignment is more accurate.
    • Why it matters: DETR/DINO predict from the queries alone, with no text information. In Grounding DINO the queries themselves must carry fused image + text information → strong performance even in demanding settings such as open-set, referring, and zero-shot detection
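
    The order of the four sub-layers can be made concrete with a toy decoder layer. Everything here is an assumption for illustration: attend is a weight-free, single-head stand-in for real attention, the FFN defaults to an identity placeholder, and normalization layers are omitted. Removing step 3 gives the shape of a DINO-style layer.

```python
import math


def attend(queries, memory):
    """Toy residual attention: no learned projections, single head."""
    scale = math.sqrt(len(queries[0]))
    out = []
    for q in queries:
        scores = [sum(a * b for a, b in zip(q, m)) / scale for m in memory]
        peak = max(scores)
        exps = [math.exp(s - peak) for s in scores]
        total = sum(exps)
        mixed = [sum(e / total * m[d] for e, m in zip(exps, memory)) for d in range(len(q))]
        out.append([qi + mi for qi, mi in zip(q, mixed)])
    return out


def decoder_layer(queries, image_feats, text_feats, ffn=lambda q: q):
    queries = attend(queries, queries)      # 1. self-attention among the queries
    queries = attend(queries, image_feats)  # 2. image cross-attention
    queries = attend(queries, text_feats)   # 3. text cross-attention (the added sub-layer)
    return [ffn(q) for q in queries]        # 4. feed-forward network (identity placeholder)
```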

    Result

    • Grounding DINO takes an (Image, Text) pair and,
      • for the objects in the image that match the input text, outputs pairs of bounding box and text label.

    One-line summary

    Grounding DINO progressively and deeply aligns text and image information module by module, enabling more accurate and more general open-set object detection than previous models.


    Tight Modality Fusion (three-stage fusion: Neck / Query / Head)

    🔍 Why is it needed?

    Many earlier open-set object detectors inject text information at only one location (GLIP: neck only; OV-DETR: query initialization only).

    This keeps the language-vision alignment shallow, which can hurt generalization.

    🔧 Grounding DINO's solution: three-stage fusion (Tight Fusion)

    Grounding DINO fuses language and image information deeply across three stages to maximize cross-modality alignment.

    📌 Stage 1: Feature Fusion A (Neck)

    • Location: Image Backbone → Neck
    • Details:
      • Initial alignment of image features with text features
      • Cross-attention reshapes the image semantics according to the text
      • GLIP stops at this stage

    📌 Stage 2: Feature Fusion B (Query Initialization)

    • Location: Neck → Head
    • Details:
      • Language-Guided Query Selection
      • Only image regions with high similarity to the text are selected as queries
      • Compatible with the DETR-family query design
      • OV-DETR uses only this stage

    📌 Stage 3: Feature Fusion C (Head)

    • Location: inside the head (decoder)
    • Details:
      • Each decoder layer applies both Image Cross-Attention and Text Cross-Attention
      • The queries are continuously updated with semantics from both image and text

    🎯 Additional note: where the contrastive losses are applied

    • Loss A: image-text feature alignment at the neck
    • Loss B: alignment between predicted objects and text at the head

    💡 Summary

    With its Tight Fusion design, Grounding DINO aligns text and image across three stages (neck, query initialization, head), enabling deeper and more precise cross-modality learning than previous models.

    Splitting feature fusion into finer stages makes the model more complex, but a performance gain can be expected!

    🔍 Reference: Feature Fusion positions A / B / C in Fig. 2

    Model          | Backbone (base detector) | Fusion position | Text level   | Summary
    GLIP           | DyHead                   | A (Neck)        | word         | Early fusion, based on grounding pre-training
    OV-DETR        | Deformable DETR          | B (Query Init)  | sentence     | Text-based query initialization
    OmDet          | Sparse R-CNN             | C (Head)        | sentence     | Language information inserted at the decoder stage
    MDETR          | DETR                     | A, C            | word         | Extensible to a variety of tasks
    Grounding DINO | DINO                     | A, B, C (all)   | sub-sentence | Tight fusion across all three stages

    Sub-Sentence Text Prompt idea

    How the text is fed to and interpreted by the model affects both performance and the quality of the representation alignment, so this part is very important!

    ❓ Problems with earlier approaches

    In open-set detection, text prompts (input sentences) have been handled either by listing category names or by encoding a whole sentence as context.

    Each of these two approaches has a drawback:

    (a) Sentence Level

    • The whole sentence is represented by a single vector → too compressed
    • Information about individual words (object names) in the sentence is diluted
    • Example: "A cat sits on a table." → one vector for the whole sentence

    (b) Word Level

    • Every word gets its own representation, but when several categories are simply listed, attention arises between unrelated words and their information interferes
    • Example: in "cat, baseball glove, a table", 'cat' and 'glove' influence each other

    (c) Grounding DINO's solution: Sub-Sentence Level Prompt

    Grounding DINO introduces a sub-sentence level representation to solve the problems above.

    ✨ Key idea

    • Encode the input text at the word level, while
    • masking the attention between unrelated words
    • → i.e., "cat", "baseball glove", and "a table" do not attend to one another
    • → Prevents interference between words while preserving fine-grained semantics
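
    The masking idea can be sketched as a block-diagonal attention mask built from the prompt. The assumptions are stated up front: tokens come from a plain whitespace split (a stand-in for a real BERT tokenizer), categories are separated by a "." token, separators attend only to themselves, and the function name sub_sentence_mask is hypothetical.

```python
def sub_sentence_mask(tokens, separator="."):
    """Return mask[i][j] == True when token i is allowed to attend to token j."""
    phrase_ids = []
    current = 0
    for tok in tokens:
        if tok == separator:
            phrase_ids.append(None)  # separators belong to no phrase
            current += 1             # the next token starts a new phrase
        else:
            phrase_ids.append(current)
    n = len(tokens)
    return [
        [
            i == j or (phrase_ids[i] is not None and phrase_ids[i] == phrase_ids[j])
            for j in range(n)
        ]
        for i in range(n)
    ]


tokens = "cat . baseball glove . a table".split()
mask = sub_sentence_mask(tokens)
# "baseball" and "glove" see each other; "cat" and "glove" do not.
```

    Feeding such a mask to the text encoder's self-attention keeps per-word features while blocking attention across unrelated category names.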

    🎯 As a result

    • Accurate object-text alignment without unnecessary interference
    • Better performance especially for multi-object detection and complex sentence prompts

    📊 Performance gains

    • In the ablation study, the sub-sentence prompt:
      • → also has a positive effect on COCO zero-shot
      • → adds no model parameters (lightweight and efficient)
      • → improves LVIS by +0.5 AP

    📌 Summary

    The sub-sentence prompt blocks unnecessary attention between words, enabling more precise object-language alignment without semantic conflicts inside the text.


    3. Loss Functions

    Grounding DINO follows a DETR-like design and combines several loss functions to improve multimodal alignment and box regression at the same time. They fall into three groups:

    Classification Loss: Contrastive Loss + Focal Loss

    • Role: classifies which text (word) each predicted object corresponds to
    • Method:
      • Dot product between each decoder query (= predicted object feature) and the text tokens → logits
      • → Focal Loss is applied to these logits
    • Why Focal Loss?
      • Positive classes are few and most predictions are background (common in object detection)
      • It focuses on hard negatives and down-weights easy negatives
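
    A minimal sketch of this classification loss for a single query, assuming binary per-token targets and the commonly used focal-loss defaults alpha = 0.25 and gamma = 2.0 (these values and the function names are assumptions, not settings confirmed by the notes above):

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def focal_loss(logit, target, alpha=0.25, gamma=2.0):
    """Sigmoid focal loss for one (query, text token) logit; target is 0 or 1."""
    p = sigmoid(logit)
    p_t = p if target == 1 else 1.0 - p              # probability of the true label
    alpha_t = alpha if target == 1 else 1.0 - alpha  # class-balancing weight
    # (1 - p_t) ** gamma shrinks the loss of already well-classified examples.
    return -alpha_t * (1.0 - p_t) ** gamma * math.log(p_t)


def classification_loss(query_feat, text_feats, targets):
    """Dot-product logits against every text token, then focal loss on each logit."""
    logits = [sum(q * t for q, t in zip(query_feat, tok)) for tok in text_feats]
    return sum(focal_loss(logit, y) for logit, y in zip(logits, targets))
```

    With these defaults, an easy negative (logit -5, target 0) contributes a loss several orders of magnitude smaller than a hard negative (logit +5, target 0), which is the behaviour the bullets describe.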

    L1 Loss & GIoU for box regression

    • L1 Loss: directly measures the coordinate difference between the predicted box and the ground-truth box
    • GIoU (Generalized IoU):
      • Goes beyond plain IoU in measuring how well the boxes fit each other
      • Provides a gradient even when the boxes do not overlap
    • → Using both losses together captures accurate localization as well as consistency of box shape
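
    The two box losses can be sketched as follows. Assumptions: boxes are (x1, y1, x2, y2) tuples with positive area, and the weights 5.0 and 2.0 are the defaults commonly used in DETR-style models, used here only as placeholders.

```python
def box_area(box):
    return max(0.0, box[2] - box[0]) * max(0.0, box[3] - box[1])


def giou(a, b):
    """Generalized IoU of two (x1, y1, x2, y2) boxes with positive area."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    union = box_area(a) + box_area(b) - inter
    iou = inter / union
    # Smallest box enclosing both inputs; its empty part penalizes distant boxes.
    hull = (max(a[2], b[2]) - min(a[0], b[0])) * (max(a[3], b[3]) - min(a[1], b[1]))
    return iou - (hull - union) / hull


def box_loss(pred, gt, l1_weight=5.0, giou_weight=2.0):
    l1 = sum(abs(p - g) for p, g in zip(pred, gt))
    return l1_weight * l1 + giou_weight * (1.0 - giou(pred, gt))
```

    For two disjoint unit boxes, plain IoU is 0 no matter how far apart they are, while GIoU is -1/3 when they are one unit apart and -0.6 when they are three units apart, so the loss still tells the model which direction to move the box.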

    Auxiliary Loss

    Grounding DINO applies an auxiliary loss after every decoder layer and after the encoder outputs → better alignment learning and box prediction

    • The same losses are also computed on the intermediate output of every decoder layer
    • Purpose: avoid premature convergence during training and provide a stable gradient flow
    • A common trait of DETR-family models that gives a "deep supervision" effect
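
    In code, deep supervision amounts to applying the same criterion to every intermediate output and summing the results. The sketch below assumes equal weights for all terms and a caller-supplied criterion function; both are simplifications for illustration.

```python
def total_loss(decoder_layer_outputs, encoder_output, targets, criterion):
    """Sum the main loss with auxiliary losses from earlier layers and the encoder."""
    main = criterion(decoder_layer_outputs[-1], targets)  # final decoder layer
    aux_decoder = sum(criterion(out, targets) for out in decoder_layer_outputs[:-1])
    aux_encoder = criterion(encoder_output, targets)      # encoder-side prediction
    return main + aux_decoder + aux_encoder


# Toy usage with scalar "predictions" and an absolute-error criterion.
example = total_loss([3.0, 2.0, 1.5], 4.0, 1.0, lambda pred, tgt: abs(pred - tgt))
```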

    📌 Summary

    Grounding DINO combines a Contrastive + Focal Loss for text-object alignment, an L1 + GIoU Loss for box precision, and an Auxiliary Loss for training stability, addressing both accuracy and generalization in multimodal object detection.


    4. Experiments & Results

    Grounding DINO is compared with earlier models (GLIP, DINO, GLIPv2, etc.) on a range of benchmarks and shows strong results in zero-shot generalization, REC accuracy, and the per-module ablation analysis.

    Zero-shot COCO / LVIS / ODinW comparison (vs GLIP, DINO, GLIPv2, etc.)

    Zero-shot evaluation

    It is also important to analyze what these datasets are! → Following their annotation and label-generation process lets us build our own dataset quickly!

    📍 COCO Benchmark

    • Grounding DINO (without COCO training data): 52.5 AP (SOTA)
    • +1.8 AP over GLIP / +0.5 AP over DINO
    • With fine-tuning: 63.0 AP (→ SOTA)
    • ✔️ High performance even without COCO data = demonstrates zero-shot strength

    📍 LVIS Benchmark (includes long-tail objects)

    • Performance on rare categories is lower than GLIP (a limitation of DETR-family models)
    • However, Grounding DINO gains +1.8 AP when caption data is used → scales well with data
    • Falls behind DetCLIPv2 (which uses a much larger amount of data)

    📍 ODinW Benchmark (Object Detection in the Wild)

    • Tests across 35 diverse domains
    • 26.1 AP (zero-shot) → SOTA
    • Average performance is similar to GLIPv2, but the median is much higher (11.9 vs 8.9)
    • → Consistent performance across diverse domains
    • The model is also smaller (172M vs 232M)

    REC task: performance and limitations

    Referring Expression Comprehension (REC) Task

    🎯 Goal:

    Locate the specific object described by a given expression (e.g., "the man in the red shirt")

    • In the zero-shot setting, both GLIP and Grounding DINO perform poorly
    • → Whether REC data is included in training has a decisive effect on performance
    • When RefCOCO/+/g data is included, Grounding DINO outperforms GLIP by a large margin

    📌 Conclusion:

    Current open-set models are weak at fine-grained detection (telling similar objects apart precisely)

    REC data, larger models, or caption data are needed

    Ablation Study summary (which modules contribute to the gains)

    Grounding DINO consists of several modules, and experiments verify how each contributes to performance.

    🔧 Module                       | COCO Zero-shot | LVIS    | Notes
    Encoder Fusion (neck fusion)    | +0.8 AP        | ✔️      | Largest performance gain
    Language-guided Query Selection | ↗️             | +3.0 AP | Text-based selection of meaningful queries
    Text Cross-Attention (Head)     | +0.6 AP        | +1.8 AP | Enables decoding aligned with the text
    Sub-Sentence Prompt             | ~              | +0.5 AP | Stabilizes text alignment

    • Little effect on COCO fine-tuning performance: these modules do not change the parameters
    • Overall contribution ranking: Encoder Fusion > Text Cross-Attention > Query Selection > Prompt

    📌 Overall summary in one sentence

    Grounding DINO demonstrates strong zero-shot performance, versatility, and consistency across benchmarks, with the Tight Fusion design, Query Selection, and Sub-Sentence Prompt contributing substantially.


    5. Limitations & Insights

    No segmentation

    • Grounding DINO only goes as far as box-level object detection.
    • In contrast, some models such as GLIPv2 also support segmentation (pixel-level masks).
    • → This limits extension to downstream tasks.

    REC performance requires fine-tuning

    • For Referring Expression Comprehension (REC):
      • Zero-shot performance is low
      • → Performance drops without specialized data such as RefCOCO/+/g
    • In the experiments, performance improved substantially once REC data was included
    • ✔️ Fine-grained object recognition is hard without fine-tuning

    Some hallucination cases

    • In some experiments the model predicts objects that are not present (hallucination)
    • This can happen especially with complex or ambiguous sentences and with rare objects
    • → Reflects the imperfect language understanding and vision-language alignment of open-set models

    💡 Insights and future directions

    • Scaling up (data, model size) still matters
    • In particular, the open challenges are:
      • Pre-training techniques that strengthen text-image alignment
      • Multi-task learning to extend to REC/segmentation
      • More careful loss design to prevent hallucination

    6. Takeaways

    Summary of Grounding DINO's key contributions

    • An architecture tailored to open-set object detection: effectively adds the language modality to the DINO-based Transformer architecture
    • Tight Modality Fusion: aligns image and text across the three neck–query–head stages, improving alignment quality over earlier models (GLIP, OV-DETR, etc.)
    • Sub-sentence Text Prompt: minimizes semantic interference between words and improves fine-grained representations
    • Strong zero-shot generalization: SOTA zero-shot results on COCO and ODinW; on LVIS it trails GLIP on rare categories and DetCLIPv2 overall, but improves as caption data is added

    Well-suited tasks vs limited tasks

    Well-suited tasks                               | Limited tasks
    Open-set Object Detection (zero-shot detection) | Segmentation (pixel-level masks are not supported)
    Referring Object Detection (with fine-tuning)   | REC performance is low without fine-tuning
    Generalization across diverse domains           | Some hallucination cases

    Possible connections to our project

    Our team's goal: add a new modality to an existing model

    • Grounding DINO is a very good design reference for multimodal fusion
      • → The Tight Fusion design, Cross-Modality Decoder, Query Selection, etc. can also be applied when extending to other modalities
    • Ideas we could try:
      • Experiments that fuse a new modality (e.g., audio, depth, segmentation) at one of the neck/query/head stages
      • Extending the sub-sentence prompt concept to other forms of language alignment
      • Experiments with hallucination-reduction strategies (e.g., hard negative mining, filtering)

    📌 Closing remark

    Grounding DINO provides a strong foundation for "finding objects with text" and shows great potential as both a starting point for modality-fusion experiments and an extensible framework.
