Medical-CXR-VQA

Medical Visual Question Answering (VQA) is a key task for medical multi-modal large language models (LLMs): answering clinically relevant questions about an input medical image. The technique has the potential to improve the efficiency of medical professionals while relieving the burden on public health systems, particularly in resource-poor countries. Existing medical VQA datasets, however, are small and contain only simple questions (equivalent to classification tasks) that lack semantic reasoning and clinical knowledge. The authors' previous work proposed a clinical-knowledge-driven image-difference VQA benchmark built with a rule-based approach, but at the same breadth of information coverage the rule-based label extraction shows an 85% error rate. The authors therefore trained an LLM-based method that extracts labels with 62% higher accuracy, and comprehensively evaluated the labels with two clinical experts on 100 samples to guide fine-tuning of the LLM. Using the trained LLM, they built Medical-CXR-VQA, a large-scale medical VQA dataset for LLMs focused on chest X-ray images; the questions cover detailed information such as abnormalities, locations, severity levels, and types. On top of this dataset, the authors propose a new VQA method that constructs three kinds of relationship graphs (spatial, semantic, and implicit) over image regions, questions, and semantic labels, and uses graph attention to learn the logical reasoning paths for different questions. These learned reasoning paths can in turn be used for LLM prompt engineering and chain-of-thought, which are important for further fine-tuning and training of multi-modal large language models.
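
The graph-attention idea above can be illustrated with a minimal sketch. Everything below (the class name, single-head formulation, feature shapes, and the toy adjacency matrix) is an illustrative assumption written in plain PyTorch, not the authors' released implementation; it only shows how attention weights can be restricted to the edges of a spatial, semantic, or implicit relationship graph.

```python
# Minimal single-head graph attention sketch (assumed, not the paper's code).
import torch
import torch.nn as nn
import torch.nn.functional as F

class SimpleGraphAttention(nn.Module):
    def __init__(self, in_dim: int, out_dim: int):
        super().__init__()
        self.proj = nn.Linear(in_dim, out_dim, bias=False)
        self.attn = nn.Linear(2 * out_dim, 1, bias=False)

    def forward(self, x: torch.Tensor, adj: torch.Tensor) -> torch.Tensor:
        # x:   (N, in_dim) node features, e.g. image regions, question tokens,
        #      and semantic labels fused into one graph
        # adj: (N, N) binary adjacency from a spatial/semantic/implicit graph
        h = self.proj(x)                              # (N, out_dim)
        n = h.size(0)
        hi = h.unsqueeze(1).expand(n, n, -1)          # node i repeated per column
        hj = h.unsqueeze(0).expand(n, n, -1)          # node j repeated per row
        e = F.leaky_relu(self.attn(torch.cat([hi, hj], dim=-1))).squeeze(-1)
        e = e.masked_fill(adj == 0, float("-inf"))    # attend only along graph edges
        alpha = F.softmax(e, dim=-1)                  # per-neighbor attention weights
        return alpha @ h                              # aggregated node features

# Usage sketch: 5 nodes with 64-d features and a self-loop-only adjacency.
x = torch.randn(5, 64)
adj = torch.eye(5)
out = SimpleGraphAttention(64, 32)(x, adj)
print(out.shape)  # torch.Size([5, 32])
```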

Sample Visualizations
[Visualization images 1–3]
Dataset Metadata
Dimension: 2D
Modality: X-ray (chest radiography)
Task type: Visual question answering (VQA)
Anatomical structure: Chest
Anatomical region: Chest
Number of classes: 6
Data volume: 780,014
File format: image-text pairs
File Structure
The Medical-CXR-VQA dataset is currently under review on PhysioNet; the authors will attach the link once it becomes available.
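
Until the PhysioNet release is public, the exact file layout is unknown. The sketch below assumes a single JSON file of QA records with hypothetical field names (image_path, question, question_type, answer); it only illustrates how image-text QA pairs might be read and grouped by question type (abnormality, location, level, type), not the confirmed schema.

```python
# Hypothetical loading sketch; file name and record fields are assumptions.
import json

with open("medical_cxr_vqa.json") as f:
    records = json.load(f)

# Group QA pairs by question type (e.g., abnormality, location, level, type).
by_type = {}
for r in records:
    by_type.setdefault(r["question_type"], []).append(
        (r["image_path"], r["question"], r["answer"])
    )

for qtype, items in by_type.items():
    print(qtype, len(items))
```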
Image Size Statistics
Spacing (mm) and image size statistics (minimum, median, maximum) are not reported for this dataset.
Citation
@article{HU2024103279,
title = {Interpretable medical image Visual Question Answering via multi-modal relationship graph learning},
journal = {Medical Image Analysis},
volume = {97},
pages = {103279},
year = {2024},
issn = {1361-8415},
doi = {10.1016/j.media.2024.103279},
url = {https://www.sciencedirect.com/science/article/pii/S1361841524002044},
author = {Xinyue Hu and Lin Gu and Kazuma Kobayashi and Liangchen Liu and Mengliang Zhang and Tatsuya Harada and Ronald M. Summers and Yingying Zhu},
keywords = {Visual Question Answering, Medical dataset, Graph neural network, Multi-modal large vision language model, Large Language Model, Chain of thought},
abstract = {Medical Visual Question Answering (VQA) is an important task in medical multi-modal Large Language Models (LLMs), aiming to answer clinically relevant questions regarding input medical images. This technique has the potential to improve the efficiency of medical professionals while relieving the burden on the public health system, particularly in resource-poor countries. However, existing medical VQA datasets are small and only contain simple questions (equivalent to classification tasks), which lack semantic reasoning and clinical knowledge. Our previous work proposed a clinical knowledge-driven image difference VQA benchmark using a rule-based approach (Hu et al., 2023). However, given the same breadth of information coverage, the rule-based approach shows an 85% error rate on extracted labels. We trained an LLM method to extract labels with 62% increased accuracy. We also comprehensively evaluated our labels with 2 clinical experts on 100 samples to help us fine-tune the LLM. Based on the trained LLM model, we proposed a large-scale medical VQA dataset, Medical-CXR-VQA, using LLMs focused on chest X-ray images. The questions involved detailed information, such as abnormalities, locations, levels, and types. Based on this dataset, we proposed a novel VQA method by constructing three different relationship graphs: spatial relationships, semantic relationships, and implicit relationship graphs on the image regions, questions, and semantic labels. We leveraged graph attention to learn the logical reasoning paths for different questions. These learned graph VQA reasoning paths can be further used for LLM prompt engineering and chain-of-thought, which are crucial for further fine-tuning and training multi-modal large language models. Moreover, we demonstrate that our approach has the qualities of evidence and faithfulness, which are crucial in the clinical field. The code and the dataset is available at https://github.com/Holipori/Medical-CXR-VQA.}
}
Source Information

Official website: https://github.com/Holipori/Medical-CXR-VQA (project repository given in the paper)

Download link: login required; downloading from this site needs a Knowledge Planet (知识星球) membership.

Related paper: https://www.sciencedirect.com/science/article/pii/S1361841524002044

Release date: 2024-07

Statistics

Created: 2025-09-10 10:21

Updated: 2025-09-16 15:16