FG-BMK

News

An extended version of our paper is now on arXiv.
Major benchmark update: we added a new dataset and expanded the suite with additional experimental validation and analysis.
Our paper has been accepted to ICLR 2026! The ICLR version is available on arXiv.
We released our FG-BMK benchmark!

Abstract

Recent advancements in Large Vision-Language Models (LVLMs) have demonstrated remarkable multimodal perception and reasoning capabilities. While numerous benchmarks have evaluated LVLMs from holistic or task-specific perspectives, their capabilities on fine-grained image tasks—fundamental to computer vision—remain insufficiently understood. To address this gap, we introduce FG-BMK, a comprehensive fine-grained evaluation benchmark containing 1.01 million questions and 0.28 million images, covering diverse scenarios from common object-centric domains to specialized domains. FG-BMK jointly evaluates dialogue-level fine-grained semantic recognition and feature-level visual discriminability through human-oriented and machine-oriented paradigms, enabling diagnostic analysis of whether LVLM failures arise from insufficient visual representations, weak visual-to-semantic grounding, or limited fine-grained knowledge. Through extensive experiments on a diverse set of representative LVLMs/VLMs, we find that current LVLMs remain inadequate fine-grained recognizers, with failures arising from intertwined bottlenecks in visual representations, semantic grounding, modality alignment, and category-level knowledge. We further analyze training-design factors for improving fine-grained capabilities and examine how visual and linguistic perturbations affect LVLM predictions—offering guidance for future data construction and model design.

Overview

FG-BMK probes fine-grained ability through two complementary evaluation paradigms—human-oriented dialogue recognition and machine-oriented feature discriminability—and turns them into a step-by-step diagnosis that pinpoints where, and why, LVLMs fall short.

Leaderboard

Hierarchical granularity recognition (human-oriented). Values are Multiple-Choice / True-or-False accuracy; CUB-200 and iNat2021 are further broken down by taxonomic level.
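As a rough illustration of how a "Multiple-Choice / True-or-False accuracy" pair could be computed from graded answers, here is a minimal sketch. The record layout (`type`, `prediction`, `answer`) and the function names are hypothetical and are not taken from the FG-BMK code; the sketch also assumes model outputs have already been parsed into a single label.

```python
from collections import defaultdict


def normalize(label):
    """Compare labels case-insensitively and ignore surrounding whitespace."""
    return str(label).strip().lower()


def accuracy_by_question_type(records):
    """Return {question_type: accuracy} for a list of graded answers.

    Each record is a dict with hypothetical keys:
      "type"       -> "multiple_choice" or "true_false"
      "prediction" -> the model's answer, already parsed to a label
      "answer"     -> the ground-truth label
    """
    correct = defaultdict(int)
    total = defaultdict(int)
    for record in records:
        total[record["type"]] += 1
        if normalize(record["prediction"]) == normalize(record["answer"]):
            correct[record["type"]] += 1
    return {qtype: correct[qtype] / total[qtype] for qtype in total}


# Toy data, invented purely for illustration.
records = [
    {"type": "multiple_choice", "prediction": "B", "answer": "B"},
    {"type": "multiple_choice", "prediction": "a", "answer": "C"},
    {"type": "true_false", "prediction": "True", "answer": "true"},
    {"type": "true_false", "prediction": "False", "answer": "false"},
]

scores = accuracy_by_question_type(records)
print(" / ".join(f"{scores[t]:.1%}" for t in ("multiple_choice", "true_false")))
# prints: 50.0% / 100.0%
```

Reporting the two numbers side by side keeps the question formats separate, since a True-or-False question has a 50% chance level while a multiple-choice question's chance level depends on the number of options.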

Legend: each band runs from the hardest category, through overall accuracy, to the easiest category. A wider band = stronger knowledge bias (less consistent across categories).
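The band described in the legend can be sketched as follows: compute accuracy per category, take the lowest and highest values as the band's ends, and read the width as the gap between them. This is only an illustration under stated assumptions: the function name, the `(category, is_correct)` input format, and the choice to compute overall accuracy over all questions (rather than averaging over categories) are not taken from the source.

```python
from collections import defaultdict


def accuracy_band(results):
    """Summarise per-category accuracy as a hardest/overall/easiest band.

    results: iterable of (category, is_correct) pairs, one per question.
    """
    hits = defaultdict(int)
    counts = defaultdict(int)
    for category, is_correct in results:
        counts[category] += 1
        hits[category] += int(is_correct)

    per_category = {c: hits[c] / counts[c] for c in counts}
    hardest = min(per_category, key=per_category.get)
    easiest = max(per_category, key=per_category.get)
    return {
        "hardest": (hardest, per_category[hardest]),
        "easiest": (easiest, per_category[easiest]),
        # Assumption: overall accuracy is taken over all questions.
        "overall": sum(hits.values()) / sum(counts.values()),
        # Width of the band: larger means less consistent across categories.
        "width": per_category[easiest] - per_category[hardest],
    }


# Toy data, invented purely for illustration.
results = (
    [("sparrow", True)] + [("sparrow", False)] * 3
    + [("eagle", True)] * 3 + [("eagle", False)]
)
band = accuracy_band(results)
print(band["hardest"], band["overall"], band["easiest"], band["width"])
# prints: ('sparrow', 0.25) 0.5 ('eagle', 0.75) 0.5
```

Two models with the same overall accuracy can have very different widths, which is why the band carries information the single overall number does not.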

Last updated · 05 / 2026

Dataset

Thirteen fine-grained datasets

Spanning multiple domains · 0.28M+ images · 1.01M+ questions.

Nature & Biology

CUB-200, Stanford Dogs, Flowers-102, iNat2021, VegFru

Industrial & Manufactured

FGVC Aircraft, Stanford Cars, Products-10K, Wine

Daily Life & Specialized

DeepFashion, Food-101, SkinCon, MTARSI

Key Findings

From assessment to diagnosis, improvement, and robustness

Machine-Oriented Results

Per-dataset visual discriminability

Seven representative vision backbones across twelve fine-grained datasets. Encoders pretrained with contrastive or self-supervised objectives (EVA-CLIP, DINOv2, InternVL) consistently lead.

Backbones (columns): EVA-CLIP, CoCa, DINOv2, BEiT3, LLaVA, InternVL, Qwen.

Datasets (rows): CUB-200, Stanford Dogs, Stanford Cars, FGVC Aircraft, Flowers-102, iNat2021, Food-101, DeepFashion, VegFru, Products-10K, SkinCon, Wine.
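This page does not state how feature-level visual discriminability is scored. One common way to probe frozen features is a nearest-neighbour check: if images of the same fine-grained class sit close together in feature space, each image's nearest neighbour should share its label. The sketch below is an assumption-laden stand-in for that idea, not the FG-BMK protocol; the function names and toy vectors are hypothetical, and real features would come from a backbone such as those listed above.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length feature vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def leave_one_out_1nn_accuracy(features, labels):
    """Fraction of samples whose most similar other sample has the same label."""
    correct = 0
    for i, query in enumerate(features):
        # Search every sample except the query itself.
        nearest = max(
            (j for j in range(len(features)) if j != i),
            key=lambda j: cosine(query, features[j]),
        )
        correct += labels[nearest] == labels[i]
    return correct / len(features)


# Toy 2-D "features", invented purely for illustration.
features = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9], [0.8, 0.2]]
labels = ["a", "a", "b", "b", "b"]

acc = leave_one_out_1nn_accuracy(features, labels)
print(acc)
# prints: 0.8
```

In the toy data the last sample is labelled "b" but lies among the "a" vectors, so its nearest neighbour disagrees and accuracy drops to 4 out of 5. A probe like this touches only the visual encoder, which is what lets a feature-level failure be told apart from a language-side one.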

Citation

BibTeX

Cite this work

@article{yu2026fgbmk,
  title   = {Benchmarking Large Vision-Language Models on Fine-Grained Image Tasks: From Evaluation to Diagnosis},
  author  = {Yu, Hong-Tao and Xie, Chen-Wei and Peng, Yuxin and Belongie, Serge and Wei, Xiu-Shen},
  journal = {arXiv preprint arXiv:2606.19053},
  year    = {2026}
}

@inproceedings{yu2026fgbmk_iclr,
  title     = {Benchmarking Large Vision-Language Models on Fine-Grained Image Tasks: A Comprehensive Evaluation},
  author    = {Yu, Hong-Tao and Peng, Yuxin and Belongie, Serge and Wei, Xiu-Shen},
  booktitle = {International Conference on Learning Representations (ICLR)},
  year      = {2026}
}