
EdgeTAM: On-Device Track Anything Model

Page Information

Author: Tommie | Comments: 0 | Views: 2 | Date: 25-11-13 16:23

Body

On top of the Segment Anything Model (SAM), SAM 2 further extends its capability from image to video inputs through a memory bank mechanism and obtains remarkable performance compared with previous methods, making it a foundation model for the video segmentation task. In this paper, we aim at making SAM 2 much more efficient, so that it runs even on mobile devices while maintaining comparable performance. Despite several works optimizing SAM for better efficiency, we find they are not sufficient for SAM 2, because they all focus on compressing the image encoder, while our benchmark shows that the newly introduced memory attention blocks are also a latency bottleneck. Given this observation, we propose EdgeTAM, which leverages a novel 2D Spatial Perceiver to reduce the computational cost. In particular, the proposed 2D Spatial Perceiver encodes the densely stored frame-level memories with a lightweight Transformer that contains a fixed set of learnable queries.
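
To make the bottleneck concrete, below is a minimal sketch of the memory-attention step, in which current-frame tokens cross-attend to the densely stored memory bank. The class, shapes, and sizes are illustrative assumptions, not SAM 2's released code.

```python
import torch
import torch.nn as nn

class MemoryCrossAttention(nn.Module):
    """Current-frame tokens cross-attend to densely stored memory tokens."""
    def __init__(self, dim: int = 256, heads: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, frame_feats: torch.Tensor, memory_bank: torch.Tensor) -> torch.Tensor:
        # frame_feats:  (B, H*W, C)   tokens of the current frame
        # memory_bank:  (B, T*H*W, C) tokens from T past frames, kept dense
        fused, _ = self.attn(query=frame_feats, key=memory_bank, value=memory_bank)
        return fused
```

Because the key/value length grows as T*H*W, the cost of this cross-attention scales with the size of the memory bank rather than with the current frame alone, which is why compressing the memories is attractive.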



Given that video segmentation is a dense prediction task, we find that preserving the spatial structure of the memories is crucial, so the queries are split into global-level and patch-level groups. We also propose a distillation pipeline that further improves performance without inference overhead. EdgeTAM achieves strong results on DAVIS 2017, MOSE, SA-V val, and SA-V test, while running at 16 FPS on iPhone 15 Pro Max. SAM 2 extends SAM to handle both image and video inputs with a memory bank mechanism, and is trained on a new large-scale multi-grained video tracking dataset (SA-V). Despite achieving astonishing performance compared to previous video object segmentation (VOS) models and allowing more diverse user prompts, SAM 2, as a server-side foundation model, is not efficient for on-device inference with CPU and NPU. (Throughout the paper, we use iPhone and iPhone 15 Pro Max interchangeably for simplicity.) Prior works that optimize SAM for better efficiency only consider squeezing its image encoder, since the mask decoder is extremely lightweight; this is not sufficient for SAM 2. Specifically, SAM 2 encodes past frames with a memory encoder, and these frame-level memories, together with object-level pointers (obtained from the mask decoder), serve as the memory bank.



These memories are then fused with the features of the current frame through memory attention blocks. As the memories are densely encoded, this leads to a huge matrix multiplication during the cross-attention between current-frame features and memory features. Therefore, despite containing relatively few parameters compared to the image encoder, the memory attention is not affordable in computational complexity for on-device inference. This hypothesis is further supported by Fig. 2, where reducing the number of memory attention blocks almost linearly cuts down the overall decoding latency, and within each memory attention block, removing the cross-attention gives the most significant speed-up. To make such a video-based tracking model run on device, in EdgeTAM we look at exploiting the redundancy in videos. To do this in practice, we propose to compress the raw frame-level memories before performing memory attention. We start with naïve spatial pooling and observe a significant performance degradation, especially when using low-capacity backbones.
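
As a back-of-the-envelope illustration of why compressing the memories pays off, consider the token counts below. The feature resolution, bank size, and per-frame query budget are assumptions for illustration, not figures reported in the paper.

```python
H = W = 64            # assumed feature-map resolution per memory frame
T = 7                 # assumed number of frames in the memory bank
dim = 256

n_query = H * W                  # current-frame tokens: 4096
n_kv_dense = T * H * W           # dense memory tokens:  28672
n_kv_compressed = T * 256        # e.g. 256 queries per frame after compression

# Cross-attention mat-mul FLOPs scale with n_query * n_kv * dim.
ratio = (n_query * n_kv_dense * dim) / (n_query * n_kv_compressed * dim)
print(f"{ratio:.0f}x fewer cross-attention FLOPs")   # -> 16x under these assumptions
```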



However, naïvely incorporating a Perceiver also leads to a severe drop in performance. We hypothesize that, as a dense prediction task, video segmentation requires preserving the spatial structure of the memory bank, which a naïve Perceiver discards. Given these observations, we propose a novel lightweight module, named 2D Spatial Perceiver, that compresses frame-level memory feature maps while preserving their 2D spatial structure. Specifically, we split the learnable queries into two groups. One group functions similarly to the original Perceiver: each query performs global attention on the input features and outputs a single vector as a frame-level summarization. In the other group, the queries have 2D priors, i.e., each query is only responsible for compressing a non-overlapping local patch, so the output maintains the spatial structure while reducing the total number of tokens (see the sketch below). In addition to the architectural improvement, we further propose a distillation pipeline that transfers the knowledge of the powerful SAM 2 teacher to our student model, which improves accuracy at no cost in inference overhead.
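
The following is a hedged sketch of how such a two-group query module could look. Apart from the module name, all shapes, sizes, and implementation choices (shared patch query, separate attention per group) are assumptions rather than the paper's code.

```python
import torch
import torch.nn as nn

class SpatialPerceiver2D(nn.Module):
    def __init__(self, dim: int = 256, num_global: int = 64, patch: int = 4, heads: int = 8):
        super().__init__()
        self.patch = patch
        # Global-level group: queries that summarize the whole frame.
        self.global_q = nn.Parameter(torch.randn(num_global, dim))
        # Patch-level group: one shared query applied per non-overlapping patch.
        self.patch_q = nn.Parameter(torch.randn(1, dim))
        self.global_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.patch_attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, mem: torch.Tensor) -> torch.Tensor:
        # mem: (B, C, H, W) dense frame-level memory from the memory encoder.
        B, C, H, W = mem.shape
        tokens = mem.flatten(2).transpose(1, 2)                 # (B, H*W, C)

        # Global queries attend over the full feature map; each emits a
        # single vector as a frame-level summarization.
        g = self.global_q.unsqueeze(0).expand(B, -1, -1)
        g, _ = self.global_attn(g, tokens, tokens)              # (B, num_global, C)

        # Patch queries only see their own p x p window, so the outputs keep
        # the 2D layout at (H/p) x (W/p) resolution.
        p = self.patch
        win = mem.unfold(2, p, p).unfold(3, p, p)               # (B, C, H/p, W/p, p, p)
        win = win.reshape(B, C, -1, p * p).permute(0, 2, 3, 1)  # (B, N, p*p, C)
        N = win.shape[1]
        kv = win.reshape(B * N, p * p, C)
        pq = self.patch_q.unsqueeze(0).expand(B * N, -1, -1)    # (B*N, 1, C)
        local, _ = self.patch_attn(pq, kv, kv)
        local = local.reshape(B, N, C)                          # spatially ordered tokens

        # Compressed memory: num_global + (H/p)*(W/p) tokens instead of H*W.
        return torch.cat([g, local], dim=1)

# Example: a 64x64 memory map compresses to 64 global + 256 patch tokens.
out = SpatialPerceiver2D()(torch.randn(1, 256, 64, 64))        # (1, 320, 256)
```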



We find that in both stages, aligning the features from the image encoders of the original SAM 2 and our efficient variant benefits the performance. Besides, in the second stage we additionally align the feature output of the memory attention between the SAM 2 teacher and our student model, so that memory-related modules, in addition to the image encoder, can also receive supervision signals from the SAM 2 teacher. This improves performance on SA-V val and test by 1.3 and 3.3 points, respectively. Putting it all together, we propose EdgeTAM (Track Anything Model for Edge devices), which adopts a 2D Spatial Perceiver for efficiency and knowledge distillation for accuracy. Through a comprehensive benchmark, we reveal that the latency bottleneck lies in the memory attention module. Given this latency analysis, we propose a 2D Spatial Perceiver that significantly cuts down the computational cost of memory attention with comparable performance, and which can be integrated with any SAM 2 variant. We experiment with a distillation pipeline that performs feature-wise alignment with the original SAM 2 in both the image and video segmentation stages, and observe performance improvements without any extra cost during inference.
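
A minimal sketch of such feature-wise alignment losses is shown below, assuming hypothetical `image_encoder` and `memory_attention_output` accessors and an MSE objective; these names and the loss choice are illustrative assumptions, not the paper's exact recipe or a real SAM 2 API.

```python
import torch
import torch.nn.functional as F

def alignment_losses(student, teacher, frames):
    """Feature-wise distillation terms between a frozen teacher and the student.

    `image_encoder` and `memory_attention_output` are hypothetical accessors
    standing in for the corresponding modules of each model.
    """
    with torch.no_grad():                        # teacher provides fixed targets
        t_img = teacher.image_encoder(frames)
        t_mem = teacher.memory_attention_output(frames)

    s_img = student.image_encoder(frames)
    s_mem = student.memory_attention_output(frames)

    # Stage 1 (image segmentation pre-training): align image-encoder features.
    loss_img = F.mse_loss(s_img, t_img)
    # Stage 2 (video segmentation training): additionally align the memory-
    # attention output so memory-related modules also get teacher supervision.
    loss_mem = F.mse_loss(s_mem, t_mem)
    return loss_img, loss_mem
```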
