Shiping Ge 葛士平
Ph.D. Student, Nanjing University

I am a final-year Ph.D. student in the School of Computer Science at Nanjing University, where I am fortunate to be supervised by Professor Qing Gu (顾庆) and Assistant Professor Zhiwei Jiang (蒋智威). I earned my B.Sc. in Network Engineering from Nantong University in 2019 and began my M.Sc. in Computer Technology at Nanjing University shortly thereafter. In 2021, I passed the examination to transfer to the doctoral program, and I am now pursuing a Ph.D. in Software Engineering. My research focuses on multimodal video understanding and analysis.

👋 If you have opportunities in industry or academia related to my research, please email me and I would be delighted to connect and explore potential collaborations!


Education
  • Nanjing University
    School of Computer Science
    Ph.D. Student
    Sep. 2021 - present
  • Nanjing University
    Department of Computer Science and Technology
    M.S. Student
    Sep. 2019 - Jun. 2021
Experience
  • Tencent WeChat
    Research Intern
    May 2023 - Jan. 2025
Honors & Awards
  • Recognition Award of 2023 Tencent Rhino-Bird Research Elite Program
    2024
  • Yingcai Scholarship of Nanjing University, First Prize
    2024
  • Huawei Scholarship of Nanjing University
    2023
  • Outstanding Postgraduate Student of Nanjing University
    2023
  • Yingcai Scholarship of Nanjing University, Second Prize
    2021
Selected Publications
Implicit Location-Caption Alignment via Complementary Masking for Weakly-Supervised Dense Video Captioning

Shiping Ge, Qiang Chen, Zhiwei Jiang, Yafeng Yin, Qin Liu, Ziyao Chen, Qing Gu

Proceedings of the AAAI Conference on Artificial Intelligence (AAAI Oral, CCF-A) 2025

We propose a novel implicit location-caption alignment paradigm based on complementary masking, which addresses the lack of supervision for event localization in the WSDVC task.

Fine-Grained Alignment Network for Zero-Shot Cross-Modal Retrieval

Shiping Ge, Zhiwei Jiang, Yafeng Yin, Cong Wang, Zifeng Cheng, Qing Gu

ACM Transactions on Multimedia Computing, Communications, and Applications (TOMM, CCF-B) 2025

We propose a novel end-to-end ZS-CMR framework, FGAN, which learns fine-grained alignment-aware representations for data of different modalities.

Short Video Ordering via Position Decoding and Successor Prediction

Shiping Ge, Qiang Chen, Zhiwei Jiang, Yafeng Yin, Ziyao Chen, Qing Gu

Proceedings of the 47th International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR, CCF-A) 2024

We introduce a novel Short Video Ordering (SVO) task, curate a dedicated multimodal dataset for it, and report the performance of several benchmark methods.

Learning Event-Specific Localization Preferences for Audio-Visual Event Localization

Shiping Ge, Zhiwei Jiang, Yafeng Yin, Cong Wang, Zifeng Cheng, Qing Gu

Proceedings of the 31st ACM International Conference on Multimedia (ACMMM, CCF-A) 2023

We propose a new event-aware double-branch localization paradigm that exploits event-specific preferences for more accurate audio-visual event localization.

Learning Robust Multi-Modal Representation for Multi-Label Emotion Recognition via Adversarial Masking and Perturbation

Shiping Ge, Zhiwei Jiang, Cong Wang, Zifeng Cheng, Yafeng Yin, Qing Gu

Proceedings of the ACM Web Conference (WWW, CCF-A) 2023

We design a simple encoder-decoder-style multimodal emotion recognition model and combine it with specially designed adversarial training strategies to learn more robust multimodal representations for multi-label emotion recognition.
