As a major application scenario of the future Internet 3.0, the metaverse has become a hot topic in many fields, including IT. Starting from basic concepts and drawing on the research progress of the national key R&D program project led by the speaker, this talk presents the speaker's understanding of and views on current opportunities in the metaverse and its state of development. It then introduces the relevant technologies and development trends of the industrial metaverse in light of the future application needs of the Industrial Internet, and further discusses how intelligent technologies can be applied in more industrial scenarios.
Video Moment Retrieval (VMR) aims to retrieve a temporal moment that semantically corresponds to a language query from an untrimmed video. Connecting computer vision and natural language, VMR has drawn significant attention from researchers in both communities. Existing solutions can be roughly divided into two categories according to whether candidate moments are generated: moment-based approaches and clip-based approaches. Both frameworks have their respective shortcomings: moment-based models suffer from heavy computation, while the performance of clip-based models is generally inferior to that of their moment-based counterparts. To this end, we design an intuitive and efficient Dual-Channel Localization Network (DCLN) to balance computational cost and retrieval performance. Meanwhile, despite their effectiveness, moment-based and clip-based methods mostly focus on aligning the query with single-level clip or moment features and ignore the different granularities present in the video itself, such as clip, moment, and video, resulting in insufficient cross-modal interaction. To this end, we also propose a Temporal Localization Network with Hierarchical Contrastive Learning (HCLNet) for the VMR task. This report will detail these two works and also share our deeper insights.
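The abstract mentions contrasting the query against features at clip, moment, and video granularity. The sketch below is a minimal illustration of that general idea, not HCLNet's actual formulation: the function names, the InfoNCE loss form, and the assumption that each granularity is pooled to a single vector per sample are all my own simplifications.

```python
import torch
import torch.nn.functional as F

def info_nce(query, keys, temperature=0.07):
    """Standard InfoNCE over a batch: the i-th query treats the i-th key as its
    positive and all other keys in the batch as negatives."""
    query = F.normalize(query, dim=-1)
    keys = F.normalize(keys, dim=-1)
    logits = query @ keys.t() / temperature            # (B, B) similarity matrix
    labels = torch.arange(query.size(0), device=query.device)
    return F.cross_entropy(logits, labels)

def hierarchical_contrastive_loss(q_feat, clip_feat, moment_feat, video_feat,
                                  weights=(1.0, 1.0, 1.0)):
    """Sum of query-to-visual contrastive terms at three granularities.
    All inputs are assumed to be (B, D) pooled features; weights are hypothetical."""
    losses = [
        info_nce(q_feat, clip_feat),    # query <-> clip-level alignment
        info_nce(q_feat, moment_feat),  # query <-> moment-level alignment
        info_nce(q_feat, video_feat),   # query <-> video-level alignment
    ]
    return sum(w * l for w, l in zip(weights, losses))

# Toy usage with random features for a batch of 4 query-video pairs.
B, D = 4, 256
loss = hierarchical_contrastive_loss(
    torch.randn(B, D), torch.randn(B, D), torch.randn(B, D), torch.randn(B, D))
print(loss.item())
```

The intent of such a multi-level objective is that the query representation stays consistent with the video at every temporal scale, rather than being aligned only with a single-level clip or moment feature.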
Host: CCF
Organizers: CCF Technical Committee on Cooperative Computing, Taiyuan University of Science and Technology