
Subject title: Thematic visual grounding

Subject

Co-supervision: Sylvain Lobry (Associate Professor)

Duration: 36 months

Doctoral school: ED 130 - Informatique, Telecommunications et Electronique de Paris

Research unit and team:

LIPADE EA2517

Team: Systèmes Intelligents de Perception

Team contact details:

UFR Mathématiques et Informatique

45 rue des Saints-Pères

75006 Paris

Field: Physical Sciences and Engineering

Expected language: English

Expected language level: B2

Description

Subject description:

Visual grounding is a task that aims at locating objects in an image based on a natural language query. This task, along with image captioning, visual question answering, or content-based image retrieval, links image data with the text modality. Numerous works on visual grounding have been produced in the computer vision community over the last decade [1]–[3]. These works most often consider both modalities separately, through dedicated encoders (e.g. a convolutional neural network for images, a recurrent neural network for the text). The two encoded representations are then merged, potentially using attention mechanisms, to obtain a common latent representation. Recently, text-image foundation models such as CLIP (Contrastive Language-Image Pre-training, [4]) have changed the paradigm for visual grounding models [5]. Indeed, leveraging the shared semantics between language and images is a key element for the task.

While a great amount of work has been produced in the computer vision community on visual grounding for natural images, there is a lack of research on this task for thematic domains such as medical imaging and remote sensing. In both of these domains, there is a need to precisely locate particular objects, following precise definitions, in images. In addition, the image of a particular scene (e.g. an organ in medical imaging, a geographical area in remote sensing) can be made through several acquisitions (e.g. an MRI stack or a time series). As such, we are interested in the question:

How can visual grounding be made domain-specific?


In medical imaging, visual grounding is an important research task aiming at assisting medical doctors in navigating the huge amount of visual data, for instance in radiology or histopathology. A recent trend in the field of computational pathology is to rely on a good feature extractor, robust to stain variations or to the protocols of various hospital sites [6], and to leverage it to better assist clinical practice. In medical imaging, phrase grounding, as described in [7], paves the way towards the clinical practice envisioned for 2030 [8]. Our team has already used Transformer models for medical images in the field of histopathology [9] and would like to build on this visual architecture to integrate textual interactions.

In remote sensing, the task of visual grounding was introduced in [10]. In this work, the authors take inspiration from [11] to build a visual grounding dataset from OpenStreetMap data on remote sensing images. In addition, the authors propose a two-stream network leveraging attention to perform the visual grounding task. In [12], another dataset, RSVG, is built from a target detection dataset (DIOR [13]). Similarly, a two-stream method is proposed and evaluated on this dataset. Among others, the task of visual grounding is used in [14] to build a grounded large vision-language model for remote sensing data. What these datasets and methods have in common is that they are not specifically tuned for one particular sensor. In addition, they cannot handle other remote sensing modalities such as SAR. Finally, they cannot work on time series.
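As an illustration of the two-stream paradigm described above, the sketch below assembles a minimal visual grounding model: an image encoder and a text encoder process each modality separately, the two representations are fused with cross-attention, and a small head regresses a bounding box. This is an illustrative sketch only, not the architecture of [10] or [12]; the ResNet-18 backbone, GRU text encoder, feature dimensions, and box parameterization are all assumptions.

import torch
import torch.nn as nn
import torchvision.models as models

class TwoStreamGrounder(nn.Module):
    def __init__(self, vocab_size=10000, dim=256):
        super().__init__()
        # Visual stream: a CNN backbone producing a grid of region features (assumed ResNet-18).
        backbone = models.resnet18(weights=None)
        self.visual = nn.Sequential(*list(backbone.children())[:-2])   # (B, 512, H', W')
        self.proj_v = nn.Conv2d(512, dim, kernel_size=1)
        # Language stream: token embeddings followed by a GRU over the query.
        self.embed = nn.Embedding(vocab_size, dim)
        self.gru = nn.GRU(dim, dim, batch_first=True)
        # Fusion: the sentence vector attends over the image regions.
        self.attn = nn.MultiheadAttention(dim, num_heads=4, batch_first=True)
        # Head: predict a normalized box (cx, cy, w, h) for the queried object.
        self.box_head = nn.Sequential(nn.Linear(dim, dim), nn.ReLU(), nn.Linear(dim, 4))

    def forward(self, image, tokens):
        v = self.proj_v(self.visual(image))              # (B, dim, H', W')
        v = v.flatten(2).transpose(1, 2)                 # (B, H'*W', dim) region features
        _, h = self.gru(self.embed(tokens))              # h: (1, B, dim) sentence encoding
        q = h.transpose(0, 1)                            # (B, 1, dim) query vector
        fused, _ = self.attn(q, v, v)                    # cross-attention over regions
        return self.box_head(fused.squeeze(1)).sigmoid() # (B, 4), normalized box

if __name__ == "__main__":
    model = TwoStreamGrounder()
    img = torch.randn(2, 3, 224, 224)                    # toy image batch
    txt = torch.randint(0, 10000, (2, 12))               # toy token ids for the query
    print(model(img, txt).shape)                         # torch.Size([2, 4])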


To answer the question raised in this research proposal, we propose to consider two aspects: the specific semantics of the thematic domains and the particular domain distributions of the images. As such, we propose to decompose this research into three main research objectives:

1. Visual grounding from multi-modal data: this objective aims at simultaneously grounding an object in different data sources (e.g. optical and SAR) that may present different geometries.

2. Visual grounding on stacks of images: we propose to take into account the fact that an acquisition can be divided into several images (e.g. MRI stacks or multi-temporal data). Our objective is to perform the visual grounding task on such data, finding the best representation of a description in a stack of images.

3. Pixel-level grounding: for thematic images, there can be a need for a precise location of objects of various shapes. As such, we want to propose a new grounding task in which the output is not the bounding box of the object of interest, but a segmentation (a minimal sketch of such a head is given after this list).
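As an illustration of objective 3, the following sketch replaces the usual box-regression head with a hypothetical pixel-level decoder: image region features are conditioned on the encoded text query and decoded into a segmentation mask. The module structure, dimensions, and upsampling factors are assumptions made for illustration, not a committed design.

import torch
import torch.nn as nn

class PixelGroundingHead(nn.Module):
    # Turns region features conditioned on a text query into a segmentation mask.
    def __init__(self, dim=256):
        super().__init__()
        # Condition each image region on the encoded query via cross-attention.
        self.attn = nn.MultiheadAttention(dim, num_heads=4, batch_first=True)
        # Decode the text-conditioned feature grid into mask logits (4x upsampling here).
        self.decoder = nn.Sequential(
            nn.ConvTranspose2d(dim, dim // 2, kernel_size=2, stride=2), nn.ReLU(),
            nn.ConvTranspose2d(dim // 2, dim // 4, kernel_size=2, stride=2), nn.ReLU(),
            nn.Conv2d(dim // 4, 1, kernel_size=1),
        )

    def forward(self, region_feats, text_feat, grid_hw):
        # region_feats: (B, N, dim) flattened image regions; text_feat: (B, 1, dim).
        fused, _ = self.attn(region_feats, text_feat, text_feat)
        h, w = grid_hw
        fmap = fused.transpose(1, 2).reshape(fused.size(0), -1, h, w)
        return self.decoder(fmap)                        # (B, 1, 4h, 4w) mask logits

if __name__ == "__main__":
    head = PixelGroundingHead()
    regions = torch.randn(2, 49, 256)                    # e.g. a 7x7 feature grid
    query = torch.randn(2, 1, 256)                       # one encoded text query per image
    print(head(regions, query, (7, 7)).shape)            # torch.Size([2, 1, 28, 28])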

Required skills:

The candidate must have a strong background in Computer Science or Mathematics. A background in computer vision, image processing, or natural language processing (NLP) is welcome. Knowledge of Python, C, or C++ and of a deep learning framework is a plus.

References:

[1] A. Rohrbach, M. Rohrbach, R. Hu, T. Darrell, and B. Schiele, "Grounding of textual phrases in images by reconstruction," in Computer Vision – ECCV 2016: 14th European Conference, Amsterdam, The Netherlands, October 11–14, 2016, Proceedings, Part I. Springer, 2016, pages 817–834.

[2] A. Fukui, D. H. Park, D. Yang, A. Rohrbach, T. Darrell, and M. Rohrbach, "Multimodal compact bilinear pooling for visual question answering and visual grounding," arXiv preprint arXiv:1606.01847, 2016.

[3] C. Deng, Q. Wu, Q. Wu, F. Hu, F. Lyu, and M. Tan, "Visual grounding via accumulated attention," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018, pages 7746–7755.

[4] A. Radford, J. W. Kim, C. Hallacy, and others, "Learning transferable visual models from natural language supervision," in International Conference on Machine Learning. PMLR, 2021, pages 8748–8763.

[5] W. Jin, S. Mukherjee, Y. Cheng, and others, "Grill: Grounded vision-language pre-training via aligning text and image regions," arXiv preprint arXiv:2305.14676, 2023.

[6] G. Wolflein, D. Ferber, A. R. Meneghetti, and others, "A good feature extractor is all you need for weakly supervised learning in histopathology," 2023. eprint: 2311.11772. URL: https://github.com/georg-wolflein/histaug.

[7] Z. Chen, Y. Zhou, A. Tran, and others, "Medical phrase grounding with region-phrase context contrastive alignment," 2023. arXiv: 2303.07618.

[8] A. B. et al., "Computational pathology in 2030: A Delphi study forecasting the role of AI in pathology within the next decade," 2023. URL: https://doi.org/10.1016/j.ebiom.2022.104427.

[9] Z. Guo, Q. Wang, H. Muller, T. Palpanas, N. Lomenie, and C. Kurtz, "A hierarchical transformer encoder to improve entire neoplasm segmentation on whole slide image of hepatocellular carcinoma," 2023. arXiv: 2307.05800. URL: https://arxiv.org/abs/2307.05800.

[10] Y. Sun, S. Feng, X. Li, Y. Ye, J. Kang, and X. Huang, "Visual grounding in remote sensing images," in Proceedings of the 30th ACM International Conference on Multimedia, 2022, pages 404–412.

[11] S. Lobry, D. Marcos, J. Murray, and D. Tuia, "RSVQA: Visual question answering for remote sensing data," IEEE Transactions on Geoscience and Remote Sensing, vol. 58, no. 12, pages 8555–8566, 2020.

[12] Y. Zhan, Z. Xiong, and Y. Yuan, "RSVG: Exploring data and models for visual grounding on remote sensing data," IEEE Transactions on Geoscience and Remote Sensing, vol. 61, pages 1–13, 2023.

[13] K. Li, G. Wan, G. Cheng, L. Meng, and J. Han, "Object detection in optical remote sensing images: A survey and a new benchmark," ISPRS Journal of Photogrammetry and Remote Sensing, vol. 159, pages 296–307, 2020.

[14] K. Kuckreja, M. S. Danish, M. Naseer, A. Das, S. Khan, and F. S. Khan, "GeoChat: Grounded large vision-language model for remote sensing," arXiv preprint arXiv:2311.15826, 2023.