Video Moment Retrieval (VMR) aims to retrieve the temporal moment in an untrimmed video that semantically corresponds to a language query. Connecting computer vision and natural language, VMR has drawn significant attention from researchers in both communities. Existing solutions can be roughly divided into two categories according to whether candidate moments are generated: moment-based approaches and clip-based approaches. Each framework has its own shortcoming: moment-based models suffer from heavy computation, while the performance of clip-based models is generally inferior to that of their moment-based counterparts. To this end, we design an intuitive and efficient Dual-Channel Localization Network (DCLN) to balance computational cost and retrieval performance. Meanwhile, despite their effectiveness, moment-based and clip-based methods mostly focus only on aligning the query with single-level clip or moment features, and ignore the different granularities present in the video itself, such as clip, moment, and video, resulting in insufficient cross-modal interaction. To address this, we also propose a Temporal Localization Network with Hierarchical Contrastive Learning (HCLNet) for the VMR task. This report details these two works and shares our deeper insights.