I am a final-year Ph.D. student in the School of Computer Science at Nanjing University, where I am fortunate to be advised by Professor Qing Gu (顾庆) and Assistant Professor Zhiwei Jiang (蒋智威). I received my B.Sc. in Network Engineering from Nantong University in 2019 and began my M.Sc. in Computer Technology at Nanjing University later that year. In 2021, I passed the examination to transfer to the doctoral program, and I am now pursuing a Ph.D. in Software Engineering. My research focuses on multimodal video understanding and analysis.
👋 If you have opportunities in industry or academia related to my research, please email me and I would be delighted to connect and explore potential collaborations!
") does not match the recommended repository name for your site ("
").
", so that your site can be accessed directly at "http://
".
However, if the current repository name is intended, you can ignore this message by removing "{% include widgets/debug_repo_name.html %}
" in index.html
.
",
which does not match the baseurl
("
") configured in _config.yml
.
baseurl
in _config.yml
to "
".
Shiping Ge, Qiang Chen, Zhiwei Jiang, Yafeng Yin, Qin Liu, Ziyao Chen, Qing Gu
Proceedings of the AAAI Conference on Artificial Intelligence (AAAI Oral, CCF-A) 2025
We propose a novel implicit location-caption alignment paradigm based on complementary masking, which addresses the lack of supervision for event localization in the weakly supervised dense video captioning (WSDVC) task.
Shiping Ge, Zhiwei Jiang, Yafeng Yin, Cong Wang, Zifeng Cheng, Qing Gu
ACM Transactions on Multimedia Computing, Communications, and Applications (TOMM, CCF-B) 2025
We propose FGAN, a novel end-to-end zero-shot cross-modal retrieval (ZS-CMR) framework that learns fine-grained, alignment-aware representations for data from different modalities.
Shiping Ge, Qiang Chen, Zhiwei Jiang, Yafeng Yin, Ziyao Chen, Qing Gu
Proceedings of the 47th International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR, CCF-A) 2024
We introduce a novel Short Video Ordering (SVO) task, curate a dedicated multimodal dataset for it, and report the performance of several benchmark methods.
Shiping Ge, Zhiwei Jiang, Yafeng Yin, Cong Wang, Zifeng Cheng, Qing Gu
Proceedings of the 31st ACM International Conference on Multimedia (ACMMM, CCF-A) 2023
We propose a new event-aware double-branch localization paradigm that utilizes event preferences for more accurate audio-visual event localization.
Shiping Ge, Zhiwei Jiang, Cong Wang, Zifeng Cheng, Yafeng Yin, Qing Gu
Proceedings of the ACM Web Conference (WWW, CCF-A) 2023
We design a simple encoder-decoder-style multimodal emotion recognition model and combine it with specially designed adversarial training strategies to learn more robust multimodal representations for multi-label emotion recognition.