In this paper, we introduce Recon3DMind, an innovative task aimed at reconstructing 3D visuals from functional Magnetic Resonance Imaging (fMRI) signals, marking a significant advancement in the fields of cognitive neuroscience and computer vision. To support this pioneering task, we present the fMRI-Shape dataset, which includes data from 14 participants and features 360-degree videos of 3D objects to enable comprehensive fMRI signal capture across various settings, thereby laying a foundation for future research. Furthermore, we propose MinD-3D, a novel and effective three-stage framework specifically designed to decode the brain's 3D visual information from fMRI signals, demonstrating the feasibility of this challenging task. The framework first extracts and aggregates features from fMRI frames through a neuro-fusion encoder, then employs a feature bridge diffusion model to generate visual features, and finally recovers the 3D object via a generative transformer decoder. We assess the performance of MinD-3D using a suite of semantic and structural metrics and analyze the correlation between the features extracted by our model and the visual regions of interest (ROIs) in fMRI signals. Our findings indicate that MinD-3D not only reconstructs 3D objects with high semantic relevance and spatial similarity but also significantly enhances our understanding of the human brain's capabilities in processing 3D visual information.
Here are some examples of the fMRI-Shape dataset across different subjects. You can download the dataset via this link: https://huggingface.co/datasets/Fudan-fMRI/fMRI-Shape.
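As a convenience, the dataset can be fetched programmatically from the Hugging Face Hub. The sketch below assumes the `huggingface_hub` package is installed (`pip install huggingface_hub`); the helper name `download_fmri_shape` and the default local directory are our own, not part of the release.

```python
def download_fmri_shape(local_dir: str = "./fMRI-Shape") -> str:
    """Download a full snapshot of the fMRI-Shape dataset.

    Assumes huggingface_hub is installed; returns the local path
    of the downloaded snapshot.
    """
    from huggingface_hub import snapshot_download

    return snapshot_download(
        repo_id="Fudan-fMRI/fMRI-Shape",  # dataset repo from the link above
        repo_type="dataset",
        local_dir=local_dir,
    )

if __name__ == "__main__":
    path = download_fmri_shape()
    print("fMRI-Shape downloaded to", path)
```

Alternatively, the repository can be cloned directly from the URL above with `git` (Git LFS required for the large files).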
The qualitative results generated by LEA-3D, fMRI-PTE-3D, and our method are presented. GT indicates the ground-truth 3D objects. All objects are rendered as 2D images for visualization.
Overview of the MinD-3D Framework. Our approach combines a Neuro-Fusion Encoder for extracting features from fMRI frames, a Feature Bridge Diffusion Model for generating visual features from these fMRI signals, and a Latent Adapted Decoder based on the Argus 3D shape generator for reconstructing 3D objects. This integrated system effectively aligns and translates brain signals into accurate 3D visual representations. Note that the CLIP encoder is used only during training; it is not involved at the inference stage.
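To make the data flow of the three stages concrete, here is a minimal NumPy sketch of the pipeline's tensor shapes. All dimensions, the frame-averaging aggregation, and the toy "denoising" loop are our own illustrative assumptions, not the actual MinD-3D architecture or its published hyperparameters.

```python
import numpy as np

# Hypothetical sizes chosen for illustration only.
N_FRAMES, N_VOXELS = 6, 4096   # fMRI frames per stimulus, voxels per frame
D_FEAT = 512                   # shared feature dimension
N_TOKENS = 256                 # token logits consumed by the 3D decoder

rng = np.random.default_rng(0)

def neuro_fusion_encoder(fmri_frames):
    """Stage 1 stand-in: project each fMRI frame to D_FEAT, then
    aggregate across frames by averaging."""
    W = rng.standard_normal((N_VOXELS, D_FEAT)) / np.sqrt(N_VOXELS)
    per_frame = fmri_frames @ W            # (N_FRAMES, D_FEAT)
    return per_frame.mean(axis=0)          # aggregated feature, (D_FEAT,)

def feature_bridge_diffusion(fmri_feat, steps=4):
    """Stage 2 stand-in: iteratively pull a noise sample toward the
    conditioning fMRI feature, mimicking a conditional denoising loop
    (not a real trained diffusion model)."""
    x = rng.standard_normal(D_FEAT)
    for _ in range(steps):
        x = x + 0.5 * (fmri_feat - x)      # move sample toward condition
    return x                               # "visual feature", (D_FEAT,)

def latent_adapted_decoder(visual_feat):
    """Stage 3 stand-in: map the visual feature to token logits, as an
    autoregressive 3D shape generator would consume."""
    W = rng.standard_normal((D_FEAT, N_TOKENS)) / np.sqrt(D_FEAT)
    return visual_feat @ W                 # (N_TOKENS,)

# End-to-end shape check on random "fMRI" input.
fmri = rng.standard_normal((N_FRAMES, N_VOXELS))
tokens = latent_adapted_decoder(feature_bridge_diffusion(neuro_fusion_encoder(fmri)))
print(tokens.shape)  # (256,)
```

The point of the sketch is only the interface: per-frame fMRI signals are fused into one feature vector, a conditional generative step bridges that vector into the visual feature space, and the decoder turns the visual feature into inputs for a 3D shape generator.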
@misc{gao2023mind3d,
      title={MinD-3D: Reconstruct High-quality 3D objects in Human Brain},
      author={Jianxiong Gao and Yuqian Fu and Yun Wang and Xuelin Qian and Jianfeng Feng and Yanwei Fu},
      year={2023},
      eprint={2312.07485},
      archivePrefix={arXiv},
      primaryClass={cs.CV}
}