<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0">
  <channel>
    <title>sji</title>
    <link>https://aimaster.tistory.com/</link>
    <description></description>
    <language>ko</language>
    <pubDate>Mon, 25 May 2026 18:37:57 +0900</pubDate>
    <generator>TISTORY</generator>
    <ttl>100</ttl>
    <managingEditor>식피두</managingEditor>
    <image>
      <title>sji</title>
      <url>https://tistory1.daumcdn.net/tistory/3202161/attach/d4814609f9fe42689ce1d8480fe62b04</url>
      <link>https://aimaster.tistory.com</link>
    </image>
    <item>
      <title>Serving an application server over HTTPS on EC2 or another server</title>
      <link>https://aimaster.tistory.com/101</link>
      <description>&lt;p data-ke-size=&quot;size16&quot;&gt;Sometimes a front-end application has to be tested on a page served over HTTPS,&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;and for security reasons an HTTPS page cannot send requests to an API server that does not support HTTPS (browsers block this as mixed content).&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&amp;nbsp;&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;So the application server has to run on a host that supports HTTPS:&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;an HTTPS web server receives the request&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;and then forwards (reverse-proxies) it to the server application that is temporarily running on a specific port.&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&amp;nbsp;&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;There may be better ways,&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;but this is the approach I usually take.&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&amp;nbsp;&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;When the application server runs on EC2 (e.g. an Express application running on port 8080)&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&amp;nbsp;&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;The steps are as follows:&lt;/p&gt;
&lt;ol style=&quot;list-style-type: decimal;&quot; data-ke-list-type=&quot;decimal&quot;&gt;
&lt;li&gt;Install nginx on the EC2 instance&lt;/li&gt;
&lt;li&gt;Connect a domain through Route 53 (if you do not have one, buy one from Gabia)&lt;/li&gt;
&lt;li&gt;Once the domain is connected, set up HTTPS with certbot (including automatic renewal)&lt;/li&gt;
&lt;li&gt;Add the proxy settings to the nginx conf file (see below)&lt;/li&gt;
&lt;/ol&gt;
&lt;h3 data-ke-size=&quot;size23&quot;&gt;References&lt;/h3&gt;
&lt;figure id=&quot;og_1622364922924&quot; contenteditable=&quot;false&quot; data-ke-type=&quot;opengraph&quot; data-ke-align=&quot;alignCenter&quot; data-og-type=&quot;article&quot; data-og-title=&quot;AWS EC2와 도메인 연결 (가비아)&quot; data-og-description=&quot;&amp;nbsp;목표) AWS EC2에서 실행중인 웹 서버를 구매한 도메인과 연결하기 (가비아에서 구매한 도메인) 1. 우선 AWS의 Route53 서비스로 이동합니다. (https://console.aws.amazon.com/route53) 2. 두 버튼 중 아무거나..&quot; data-og-host=&quot;sovovy.tistory.com&quot; data-og-source-url=&quot;https://sovovy.tistory.com/37&quot; data-og-url=&quot;https://sovovy.tistory.com/37&quot; data-og-image=&quot;https://scrap.kakaocdn.net/dn/MIL3f/hyKniPyduS/QxBgbKIoFfcmUSqRflcwVk/img.png?width=800&amp;amp;height=585&amp;amp;face=0_0_800_585,https://scrap.kakaocdn.net/dn/bcHov6/hyKo4vmTU4/aT62Z3GbapxYvBLdhaOSQ0/img.png?width=800&amp;amp;height=585&amp;amp;face=0_0_800_585,https://scrap.kakaocdn.net/dn/bifusd/hyKo1ZIpMu/iX0N0MKqNqqTO6ZTZvfHQk/img.png?width=900&amp;amp;height=659&amp;amp;face=0_0_900_659&quot;&gt;&lt;a href=&quot;https://sovovy.tistory.com/37&quot; target=&quot;_blank&quot; rel=&quot;noopener&quot; data-source-url=&quot;https://sovovy.tistory.com/37&quot;&gt;
&lt;div class=&quot;og-image&quot; style=&quot;background-image: url('https://scrap.kakaocdn.net/dn/MIL3f/hyKniPyduS/QxBgbKIoFfcmUSqRflcwVk/img.png?width=800&amp;amp;height=585&amp;amp;face=0_0_800_585,https://scrap.kakaocdn.net/dn/bcHov6/hyKo4vmTU4/aT62Z3GbapxYvBLdhaOSQ0/img.png?width=800&amp;amp;height=585&amp;amp;face=0_0_800_585,https://scrap.kakaocdn.net/dn/bifusd/hyKo1ZIpMu/iX0N0MKqNqqTO6ZTZvfHQk/img.png?width=900&amp;amp;height=659&amp;amp;face=0_0_900_659');&quot;&gt;&amp;nbsp;&lt;/div&gt;
&lt;div class=&quot;og-text&quot;&gt;
&lt;p class=&quot;og-title&quot; data-ke-size=&quot;size16&quot;&gt;AWS EC2와 도메인 연결 (가비아)&lt;/p&gt;
&lt;p class=&quot;og-desc&quot; data-ke-size=&quot;size16&quot;&gt;&amp;nbsp;목표) AWS EC2에서 실행중인 웹 서버를 구매한 도메인과 연결하기 (가비아에서 구매한 도메인) 1. 우선 AWS의 Route53 서비스로 이동합니다. (https://console.aws.amazon.com/route53) 2. 두 버튼 중 아무거나..&lt;/p&gt;
&lt;p class=&quot;og-host&quot; data-ke-size=&quot;size16&quot;&gt;sovovy.tistory.com&lt;/p&gt;
&lt;/div&gt;
&lt;/a&gt;&lt;/figure&gt;
&lt;figure id=&quot;og_1622365015521&quot; contenteditable=&quot;false&quot; data-ke-type=&quot;opengraph&quot; data-ke-align=&quot;alignCenter&quot; data-og-type=&quot;article&quot; data-og-title=&quot;[Nginx] Let's Encrypt를 통해 Nginx에서 무료로 https 설정하기 - JP-HOSTING&quot; data-og-description=&quot;✅일본서버호스팅 &amp;middot; 프록시 &amp;middot; 무제한디도스방어 &amp;middot; 고객센터&quot; data-og-host=&quot;jp-hosting.jp&quot; data-og-source-url=&quot;https://jp-hosting.jp/nginx-lets-encrypt%EB%A5%BC-%ED%86%B5%ED%95%B4-nginx%EC%97%90%EC%84%9C-%EB%AC%B4%EB%A3%8C%EB%A1%9C-https-%EC%84%A4%EC%A0%95%ED%95%98%EA%B8%B0/&quot; data-og-url=&quot;https://jp-hosting.jp/nginx-lets-encrypt를-통해-nginx에서-무료로-https-설정하기/&quot; data-og-image=&quot;https://scrap.kakaocdn.net/dn/cRly85/hyKnpVs6xs/wcK11uVDTxy7hwW6RhiJck/img.png?width=1485&amp;amp;height=989&amp;amp;face=0_0_1485_989&quot;&gt;&lt;a href=&quot;https://jp-hosting.jp/nginx-lets-encrypt%EB%A5%BC-%ED%86%B5%ED%95%B4-nginx%EC%97%90%EC%84%9C-%EB%AC%B4%EB%A3%8C%EB%A1%9C-https-%EC%84%A4%EC%A0%95%ED%95%98%EA%B8%B0/&quot; target=&quot;_blank&quot; rel=&quot;noopener&quot; data-source-url=&quot;https://jp-hosting.jp/nginx-lets-encrypt%EB%A5%BC-%ED%86%B5%ED%95%B4-nginx%EC%97%90%EC%84%9C-%EB%AC%B4%EB%A3%8C%EB%A1%9C-https-%EC%84%A4%EC%A0%95%ED%95%98%EA%B8%B0/&quot;&gt;
&lt;div class=&quot;og-image&quot; style=&quot;background-image: url('https://scrap.kakaocdn.net/dn/cRly85/hyKnpVs6xs/wcK11uVDTxy7hwW6RhiJck/img.png?width=1485&amp;amp;height=989&amp;amp;face=0_0_1485_989');&quot;&gt;&amp;nbsp;&lt;/div&gt;
&lt;div class=&quot;og-text&quot;&gt;
&lt;p class=&quot;og-title&quot; data-ke-size=&quot;size16&quot;&gt;[Nginx] Let's Encrypt를 통해 Nginx에서 무료로 https 설정하기 - JP-HOSTING&lt;/p&gt;
&lt;p class=&quot;og-desc&quot; data-ke-size=&quot;size16&quot;&gt;✅일본서버호스팅 &amp;middot; 프록시 &amp;middot; 무제한디도스방어 &amp;middot; 고객센터&lt;/p&gt;
&lt;p class=&quot;og-host&quot; data-ke-size=&quot;size16&quot;&gt;jp-hosting.jp&lt;/p&gt;
&lt;/div&gt;
&lt;/a&gt;&lt;/figure&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;Configure nginx as follows.&lt;/p&gt;
&lt;pre id=&quot;code_1622365261722&quot; class=&quot;python&quot; data-ke-language=&quot;python&quot; data-ke-type=&quot;codeblock&quot;&gt;&lt;code&gt;server {
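    server_name testdomain.com;
        
    ...
        
    # Requests under this path prefix are proxied to the local app on port 8080.
    location /application_server_address/ {
        proxy_pass http://localhost:8080/;
        # HTTP/1.1 plus the Upgrade/Connection headers let WebSocket upgrades pass through.
        proxy_http_version 1.1;
        proxy_set_header Upgrade $http_upgrade;
        proxy_set_header Connection 'upgrade';
        # Preserve the original Host header for the upstream application.
        proxy_set_header Host $host;
        proxy_cache_bypass $http_upgrade;
    }
    
    ...
}
&lt;/code&gt;&lt;/pre&gt;</description>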
      <category>http&amp;server</category>
      <category>https</category>
      <category>nginx</category>
      <author>식피두</author>
      <guid isPermaLink="true">https://aimaster.tistory.com/101</guid>
      <comments>https://aimaster.tistory.com/101#entry101comment</comments>
      <pubDate>Sun, 30 May 2021 17:57:58 +0900</pubDate>
    </item>
    <item>
      <title>Miscellaneous general ML questions and answers (technical interviews)</title>
      <link>https://aimaster.tistory.com/99</link>
      <description>&lt;p data-ke-size=&quot;size18&quot;&gt;I have been collecting notes like this in earlier posts,&lt;/p&gt;
&lt;p data-ke-size=&quot;size18&quot;&gt;but here are some additional ones I put together while preparing for technical interviews.&lt;/p&gt;
&lt;p data-ke-size=&quot;size18&quot;&gt;&amp;nbsp;&lt;/p&gt;
&lt;p data-ke-size=&quot;size18&quot;&gt;&lt;span style=&quot;color: #ee2323;&quot;&gt;Some of this may be wrong, so read it with a skeptical eye.&lt;/span&gt;&lt;/p&gt;
&lt;p data-ke-size=&quot;size18&quot;&gt;&amp;nbsp;&lt;/p&gt;
&lt;h3 data-ke-size=&quot;size23&quot;&gt;&lt;span style=&quot;color: #009a87;&quot;&gt;How to compute backpropagation (some places may even ask you to work through a simple calculation)&lt;/span&gt;&lt;/h3&gt;
&lt;p data-ke-size=&quot;size18&quot;&gt;The basic rule is to multiply the local gradient by the upstream gradient.&lt;/p&gt;
&lt;p&gt;&lt;figure class=&quot;imageblock alignLeft&quot; data-origin-width=&quot;1125&quot; data-origin-height=&quot;854&quot; width=&quot;474&quot; height=&quot;360&quot; data-ke-mobilestyle=&quot;widthOrigin&quot;&gt;&lt;span data-url=&quot;https://blog.kakaocdn.net/dn/J28nd/btq4TBJPWUo/Lzm2CU4T6gxVKixXAeZ811/img.png&quot; data-phocus=&quot;https://blog.kakaocdn.net/dn/J28nd/btq4TBJPWUo/Lzm2CU4T6gxVKixXAeZ811/img.png&quot;&gt;&lt;img src=&quot;https://blog.kakaocdn.net/dn/J28nd/btq4TBJPWUo/Lzm2CU4T6gxVKixXAeZ811/img.png&quot; srcset=&quot;https://img1.daumcdn.net/thumb/R1280x0/?scode=mtistory2&amp;fname=https%3A%2F%2Fblog.kakaocdn.net%2Fdn%2FJ28nd%2Fbtq4TBJPWUo%2FLzm2CU4T6gxVKixXAeZ811%2Fimg.png&quot; data-origin-width=&quot;1125&quot; data-origin-height=&quot;854&quot; width=&quot;474&quot; height=&quot;360&quot; data-ke-mobilestyle=&quot;widthOrigin&quot; onerror=&quot;this.onerror=null; this.src='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png'; this.srcset='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png';&quot;/&gt;&lt;/span&gt;&lt;/figure&gt;
&lt;/p&gt;
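&lt;p data-ke-size=&quot;size18&quot;&gt;A minimal sketch of that rule on a toy graph, f = (x + y) * z. The graph, the input values, and the variable names are illustrative assumptions and are not taken from the figure above.&lt;/p&gt;

```python
# Toy computational graph (hypothetical, for illustration): q = x + y, f = q * z
x, y, z = -2.0, 5.0, -4.0

# Forward pass
q = x + y          # q = 3.0
f = q * z          # f = -12.0

# Backward pass: each gradient = local gradient * upstream gradient
df_df = 1.0                 # gradient of the output with respect to itself
df_dq = z * df_df           # local gradient of f = q * z w.r.t. q is z
df_dz = q * df_df           # local gradient of f = q * z w.r.t. z is q
df_dx = 1.0 * df_dq         # local gradient of q = x + y w.r.t. x is 1
df_dy = 1.0 * df_dq         # local gradient of q = x + y w.r.t. y is 1

print(df_dx, df_dy, df_dz)
```

&lt;p data-ke-size=&quot;size18&quot;&gt;Working backwards from the output, every node only needs its own local derivative and the gradient handed down from the node after it.&lt;/p&gt;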
&lt;p data-ke-size=&quot;size18&quot;&gt;&amp;nbsp;&lt;/p&gt;
&lt;h3 data-ke-size=&quot;size23&quot;&gt;&lt;span style=&quot;color: #009a87;&quot;&gt;Regression vs. Classification&lt;/span&gt;&lt;/h3&gt;
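&lt;p data-ke-size=&quot;size18&quot;&gt;It helps to be able to explain this from a probabilistic point of view:&lt;/p&gt;
&lt;p data-ke-size=&quot;size18&quot;&gt;which distribution is assumed for the target (discrete or continuous? Bernoulli or Gaussian?)&lt;/p&gt;
&lt;p data-ke-size=&quot;size18&quot;&gt;and how the loss changes with the assumed distribution.&lt;/p&gt;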
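&lt;p data-ke-size=&quot;size18&quot;&gt;One way to see the link between the assumed distribution and the loss is to write out the negative log-likelihood. The sketch below assumes a Gaussian with fixed sigma = 1 for regression and a Bernoulli for binary classification; the function names and sample values are made up for illustration.&lt;/p&gt;

```python
import numpy as np

def gaussian_nll(y, mu, sigma=1.0):
    """Negative log-likelihood of y under N(mu, sigma^2), averaged over samples."""
    return np.mean(0.5 * np.log(2 * np.pi * sigma**2) + (y - mu) ** 2 / (2 * sigma**2))

def bernoulli_nll(y, p):
    """Negative log-likelihood of binary y under Bernoulli(p), i.e. binary cross-entropy."""
    return np.mean(-(y * np.log(p) + (1 - y) * np.log(1 - p)))

# Regression: with sigma fixed at 1, the NLL is half the MSE plus a constant.
y_reg = np.array([1.0, 2.0, 3.0])
mu = np.array([1.5, 1.5, 2.0])
mse = np.mean((y_reg - mu) ** 2)
const = 0.5 * np.log(2 * np.pi)

# Classification: the Bernoulli NLL is the binary cross-entropy loss.
y_cls = np.array([1.0, 0.0, 1.0])
p = np.array([0.9, 0.2, 0.6])
bce = bernoulli_nll(y_cls, p)

print(gaussian_nll(y_reg, mu), 0.5 * mse + const, bce)
```

&lt;p data-ke-size=&quot;size18&quot;&gt;Under these assumptions, minimizing MSE is maximum likelihood with a Gaussian target, and minimizing cross-entropy is maximum likelihood with a Bernoulli target.&lt;/p&gt;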
&lt;p data-ke-size=&quot;size18&quot;&gt;&amp;nbsp;&lt;/p&gt;
&lt;h3 data-ke-size=&quot;size23&quot;&gt;&lt;span style=&quot;color: #009a87;&quot;&gt;Why use sigmoid?&lt;/span&gt;&lt;/h3&gt;
&lt;p data-ke-size=&quot;size18&quot;&gt;To map the predicted value to a probability.&lt;/p&gt;
&lt;p&gt;&lt;figure class=&quot;imageblock alignLeft&quot; data-origin-width=&quot;520&quot; data-origin-height=&quot;198&quot; data-filename=&quot;스크린샷 2021-05-14 오후 4.04.29.png&quot; width=&quot;200&quot; height=&quot;76&quot; data-ke-mobilestyle=&quot;widthOrigin&quot;&gt;&lt;span data-url=&quot;https://blog.kakaocdn.net/dn/bd4m5p/btq4VCgRxfR/ANULqxz7ViRyuQ8LGYoYLk/img.png&quot; data-phocus=&quot;https://blog.kakaocdn.net/dn/bd4m5p/btq4VCgRxfR/ANULqxz7ViRyuQ8LGYoYLk/img.png&quot;&gt;&lt;img src=&quot;https://blog.kakaocdn.net/dn/bd4m5p/btq4VCgRxfR/ANULqxz7ViRyuQ8LGYoYLk/img.png&quot; srcset=&quot;https://img1.daumcdn.net/thumb/R1280x0/?scode=mtistory2&amp;fname=https%3A%2F%2Fblog.kakaocdn.net%2Fdn%2Fbd4m5p%2Fbtq4VCgRxfR%2FANULqxz7ViRyuQ8LGYoYLk%2Fimg.png&quot; data-origin-width=&quot;520&quot; data-origin-height=&quot;198&quot; data-filename=&quot;스크린샷 2021-05-14 오후 4.04.29.png&quot; width=&quot;200&quot; height=&quot;76&quot; data-ke-mobilestyle=&quot;widthOrigin&quot; onerror=&quot;this.onerror=null; this.src='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png'; this.srcset='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png';&quot;/&gt;&lt;/span&gt;&lt;/figure&gt;
&lt;/p&gt;
&lt;p data-ke-size=&quot;size18&quot;&gt;a determines the slope of the sigmoid, and b determines where its center is.&lt;/p&gt;
&lt;p data-ke-size=&quot;size18&quot;&gt;&amp;nbsp;&lt;/p&gt;
&lt;h3 data-ke-size=&quot;size23&quot;&gt;&lt;span style=&quot;color: #009a87;&quot;&gt;Covariance vs. Correlation&lt;/span&gt;&lt;/h3&gt;
&lt;p data-ke-size=&quot;size18&quot;&gt;Covariance is a number that shows how two variables vary together.&lt;/p&gt;
&lt;p data-ke-size=&quot;size18&quot;&gt;0: no linear relationship&lt;/p&gt;
&lt;p data-ke-size=&quot;size18&quot;&gt;positive: they move in the same direction&lt;/p&gt;
&lt;p data-ke-size=&quot;size18&quot;&gt;negative: they move in opposite directions&lt;/p&gt;
&lt;p data-ke-size=&quot;size18&quot;&gt;&amp;nbsp;&lt;/p&gt;
&lt;p data-ke-size=&quot;size18&quot;&gt;Correlation is the normalized version of covariance:&lt;/p&gt;
&lt;p data-ke-size=&quot;size18&quot;&gt;divide the covariance by the product of the two variables' standard deviations.&lt;/p&gt;
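&lt;p data-ke-size=&quot;size18&quot;&gt;As a quick check of that relationship, here is a small NumPy sketch. The sample values are made up, and the n - 1 (sample) convention is an assumption; any convention works as long as it is used for both the covariance and the standard deviations.&lt;/p&gt;

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.0, 4.1, 5.9, 8.2, 9.8])

# Sample covariance (divide by n - 1)
cov_xy = np.sum((x - x.mean()) * (y - y.mean())) / (len(x) - 1)

# Correlation = covariance / (std_x * std_y), using the same n - 1 convention
corr_xy = cov_xy / (x.std(ddof=1) * y.std(ddof=1))

print(cov_xy, corr_xy)
```

&lt;p data-ke-size=&quot;size18&quot;&gt;The covariance depends on the units of x and y, while the correlation always falls between -1 and 1, which is what makes it comparable across variable pairs.&lt;/p&gt;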
&lt;p data-ke-size=&quot;size18&quot;&gt;&amp;nbsp;&lt;/p&gt;
&lt;h3 data-ke-size=&quot;size23&quot;&gt;&lt;span style=&quot;color: #009a87;&quot;&gt;Micro/macro precision, recall, and F-score in multi-label classification?&lt;/span&gt;&lt;/h3&gt;
&lt;p data-ke-size=&quot;size18&quot;&gt;Micro: compute TP, TN, FP, and FN globally, without distinguishing classes, then compute the metric.&lt;/p&gt;
&lt;p data-ke-size=&quot;size18&quot;&gt;Macro: compute the metric for each class, then average.&lt;/p&gt;
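&lt;p data-ke-size=&quot;size18&quot;&gt;A small sketch of the difference, using precision only. The label matrices below are hypothetical; recall and F-score follow the same pooling-versus-averaging pattern.&lt;/p&gt;

```python
import numpy as np

# Hypothetical multi-label data: rows are samples, columns are classes (1 = label present).
y_true = np.array([[1, 0, 1],
                   [0, 1, 0],
                   [1, 1, 0],
                   [0, 0, 1]])
y_pred = np.array([[1, 0, 0],
                   [0, 1, 1],
                   [1, 0, 0],
                   [0, 0, 1]])

# Per-class true positives and false positives
tp = np.sum((y_true == 1) * (y_pred == 1), axis=0)
fp = np.sum((y_true == 0) * (y_pred == 1), axis=0)

# Micro: pool the counts over all classes, then compute the metric once.
micro_precision = tp.sum() / (tp.sum() + fp.sum())

# Macro: compute the metric per class, then take the unweighted mean.
macro_precision = np.mean(tp / (tp + fp))

print(micro_precision, macro_precision)
```

&lt;p data-ke-size=&quot;size18&quot;&gt;Micro averaging is dominated by frequent classes, whereas macro averaging gives every class the same weight, so the two can diverge noticeably on imbalanced data.&lt;/p&gt;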
&lt;p data-ke-size=&quot;size18&quot;&gt;&amp;nbsp;&lt;/p&gt;
&lt;h3 data-ke-size=&quot;size23&quot;&gt;&lt;span style=&quot;color: #009a87;&quot;&gt;What should you suspect when the error increases while training a neural net?&lt;/span&gt;&lt;/h3&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;Gradient&lt;/li&gt;
&lt;li&gt;Feature Scaling / Data Shuffling&lt;/li&gt;
&lt;li&gt;Learning Rate&lt;/li&gt;
&lt;li&gt;Bugs, e.g. a point in the code that turns some of the values into NaN&lt;/li&gt;
&lt;/ul&gt;
&lt;p data-ke-size=&quot;size18&quot;&gt;&amp;nbsp;&lt;/p&gt;
&lt;h3 data-ke-size=&quot;size23&quot;&gt;&lt;span style=&quot;color: #009a87;&quot;&gt;Large weights can lead to overfitting&lt;/span&gt;&lt;/h3&gt;
&lt;p data-ke-size=&quot;size18&quot;&gt;Even a small change in the input is amplified into a large change in the output by the large weights.&lt;/p&gt;
&lt;p data-ke-size=&quot;size18&quot;&gt;In that case, regularization should be used to constrain the size of the weights.&lt;/p&gt;
&lt;p data-ke-size=&quot;size18&quot;&gt;&amp;nbsp;&lt;/p&gt;
&lt;h3 data-ke-size=&quot;size23&quot;&gt;&lt;span style=&quot;color: #009a87;&quot;&gt;Advantages of BatchNorm?&lt;/span&gt;&lt;/h3&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;Training has a good chance of going well regardless of which initialization scheme is used&lt;/li&gt;
&lt;li&gt;Placed after the FC layer and before the non-linearity&lt;/li&gt;
&lt;li&gt;Pushes the activations toward a Gaussian distribution (though some papers argue this is not what actually happens; see the blog post below)
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;The actual benefit is that it smooths the error surface, so training proceeds more smoothly and quickly&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;Within each batch, computes the per-dimension mean and standard deviation used for normalization, and accumulates them as running statistics&lt;/li&gt;
&lt;li&gt;Even in deep networks, keeps values from blowing up as weights compound across layers.&lt;/li&gt;
&lt;li&gt;Smoother gradient flow
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;so a larger LR can be used&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;Also has a regularization effect&lt;/li&gt;
&lt;/ul&gt;
&lt;figure id=&quot;og_1620972834300&quot; contenteditable=&quot;false&quot; data-ke-type=&quot;opengraph&quot; data-ke-align=&quot;alignCenter&quot; data-og-type=&quot;article&quot; data-og-title=&quot;[개념 정리] Batch Normalization in Deep Learning - part 2.&quot; data-og-description=&quot;논문에서 저자가 말한 것 처럼 Batch Normalization (BN)는 네트워크 레이어의 Internal Covariate Shift (ICS)문제를 해결하기 위해 나온 기법이다. BN을 이용하면 확실하게 학습 속도가 빨라지고 안정적으로 &quot; data-og-host=&quot;cvml.tistory.com&quot; data-og-source-url=&quot;https://cvml.tistory.com/6&quot; data-og-url=&quot;https://cvml.tistory.com/6&quot; data-og-image=&quot;https://scrap.kakaocdn.net/dn/bYnBBY/hyKcEEHvYe/iPQvET5xmp7dFrKLhfyjR1/img.jpg?width=800&amp;amp;height=390&amp;amp;face=0_0_800_390,https://scrap.kakaocdn.net/dn/by5bf4/hyKcLYbnuq/KvaV4BzVoRVIYkq1ONBJck/img.jpg?width=800&amp;amp;height=390&amp;amp;face=0_0_800_390,https://scrap.kakaocdn.net/dn/h5Msw/hyKcDZ7Ya7/EsyLTybkmQKsqRaViVaKX0/img.png?width=894&amp;amp;height=377&amp;amp;face=0_0_894_377&quot;&gt;&lt;a href=&quot;https://cvml.tistory.com/6&quot; target=&quot;_blank&quot; rel=&quot;noopener&quot; data-source-url=&quot;https://cvml.tistory.com/6&quot;&gt;
&lt;div class=&quot;og-image&quot; style=&quot;background-image: url('https://scrap.kakaocdn.net/dn/bYnBBY/hyKcEEHvYe/iPQvET5xmp7dFrKLhfyjR1/img.jpg?width=800&amp;amp;height=390&amp;amp;face=0_0_800_390,https://scrap.kakaocdn.net/dn/by5bf4/hyKcLYbnuq/KvaV4BzVoRVIYkq1ONBJck/img.jpg?width=800&amp;amp;height=390&amp;amp;face=0_0_800_390,https://scrap.kakaocdn.net/dn/h5Msw/hyKcDZ7Ya7/EsyLTybkmQKsqRaViVaKX0/img.png?width=894&amp;amp;height=377&amp;amp;face=0_0_894_377');&quot;&gt;&amp;nbsp;&lt;/div&gt;
&lt;div class=&quot;og-text&quot;&gt;
&lt;p class=&quot;og-title&quot; data-ke-size=&quot;size18&quot;&gt;[개념 정리] Batch Normalization in Deep Learning - part 2.&lt;/p&gt;
&lt;p class=&quot;og-desc&quot; data-ke-size=&quot;size18&quot;&gt;논문에서 저자가 말한 것 처럼 Batch Normalization (BN)는 네트워크 레이어의 Internal Covariate Shift (ICS)문제를 해결하기 위해 나온 기법이다. BN을 이용하면 확실하게 학습 속도가 빨라지고 안정적으로&lt;/p&gt;
&lt;p class=&quot;og-host&quot; data-ke-size=&quot;size18&quot;&gt;cvml.tistory.com&lt;/p&gt;
&lt;/div&gt;
&lt;/a&gt;&lt;/figure&gt;
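&lt;p data-ke-size=&quot;size18&quot;&gt;The per-dimension normalization and the running statistics mentioned above can be sketched as follows. This is a simplified forward pass for a (batch, features) array; the function name, the momentum convention, and the default values are assumptions for illustration and do not mirror any particular framework.&lt;/p&gt;

```python
import numpy as np

def batchnorm_forward(x, gamma, beta, running_mean, running_var,
                      training=True, momentum=0.9, eps=1e-5):
    """Batch normalization over a (batch, features) array.

    Returns the output and the updated running statistics.
    """
    if training:
        mean = x.mean(axis=0)          # per-dimension mean over the batch
        var = x.var(axis=0)            # per-dimension variance over the batch
        # Accumulate running statistics for use at inference time.
        running_mean = momentum * running_mean + (1 - momentum) * mean
        running_var = momentum * running_var + (1 - momentum) * var
    else:
        mean, var = running_mean, running_var
    x_hat = (x - mean) / np.sqrt(var + eps)
    # gamma and beta are the learnable scale and shift.
    return gamma * x_hat + beta, running_mean, running_var

rng = np.random.default_rng(0)
x = rng.normal(loc=5.0, scale=3.0, size=(64, 4))
out, rm, rv = batchnorm_forward(x, np.ones(4), np.zeros(4), np.zeros(4), np.ones(4))

print(out.mean(axis=0), out.std(axis=0))
```

&lt;p data-ke-size=&quot;size18&quot;&gt;In training mode each feature of the output has roughly zero mean and unit variance within the batch; at inference time the accumulated running statistics are used instead of the batch statistics.&lt;/p&gt;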
&lt;p data-ke-size=&quot;size18&quot;&gt;&amp;nbsp;&lt;/p&gt;
&lt;h3 data-ke-size=&quot;size23&quot;&gt;&lt;span style=&quot;color: #009a87;&quot;&gt;Relationship between batch size &amp;amp; learning rate?&lt;/span&gt;&lt;/h3&gt;
&lt;p data-ke-size=&quot;size18&quot;&gt;In general, &lt;b&gt;the larger the batch, the larger the LR&lt;/b&gt; you can use.&lt;/p&gt;
&lt;p data-ke-size=&quot;size18&quot;&gt;A larger batch gives more confidence in the descent direction on the error surface.&lt;/p&gt;
&lt;p data-ke-size=&quot;size18&quot;&gt;&amp;nbsp;&lt;/p&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;With the LR fixed, increasing the batch size acts like lowering the LR, and training also runs faster (as long as there is enough memory)&lt;/li&gt;
&lt;li&gt;Some also claim that smaller batches generalize better: &quot;Revisiting Small Batch Training for Deep Neural Networks&quot;&lt;/li&gt;
&lt;/ul&gt;
&lt;p data-ke-size=&quot;size18&quot;&gt;&amp;nbsp;&lt;/p&gt;
&lt;h3 data-ke-size=&quot;size23&quot;&gt;&lt;span style=&quot;color: #009a87;&quot;&gt;Why should the input be normalized?&lt;/span&gt;&lt;/h3&gt;
&lt;p data-ke-size=&quot;size18&quot;&gt;Without input normalization, the gradients of the weights tied to one feature can be much larger than the gradients of the weights tied to other features. The gradient direction during optimization then may not be smooth and can follow a zig-zag pattern, which slows convergence.&lt;/p&gt;
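&lt;p data-ke-size=&quot;size18&quot;&gt;A rough illustration of the gradient imbalance, assuming a linear model trained with mean squared error. The two synthetic features, their scales, and the helper name mse_grad are all hypothetical.&lt;/p&gt;

```python
import numpy as np

rng = np.random.default_rng(0)
# Two hypothetical features on very different scales (std 1 vs. std 1000).
X = np.column_stack([rng.normal(0.0, 1.0, 200), rng.normal(0.0, 1000.0, 200)])
w_true = np.array([1.0, 0.001])   # both features contribute about equally to y
y = X @ w_true
w = np.zeros(2)                   # starting point for gradient descent

def mse_grad(X, y, w):
    """Gradient of the mean squared error of the linear model y_hat = X @ w."""
    return 2.0 * X.T @ (X @ w - y) / len(y)

grad_raw = mse_grad(X, y, w)

# Standardize each feature to zero mean and unit variance.
X_norm = (X - X.mean(axis=0)) / X.std(axis=0)
grad_norm = mse_grad(X_norm, y, w)

print(np.abs(grad_raw))    # the large-scale feature dominates
print(np.abs(grad_norm))   # comparable magnitudes
```

&lt;p data-ke-size=&quot;size18&quot;&gt;With the raw features, a single learning rate is either too large for one weight or too small for the other, which is where the zig-zag path comes from; after standardization the two gradient components are on a similar scale.&lt;/p&gt;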
&lt;p data-ke-size=&quot;size18&quot;&gt;&amp;nbsp;&lt;/p&gt;
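&lt;h3 data-ke-size=&quot;size23&quot;&gt;&lt;span style=&quot;color: #009a87;&quot;&gt;Why does training speed up when BatchNorm makes the input distributions similar?&lt;/span&gt;&lt;/h3&gt;
&lt;p data-ke-size=&quot;size18&quot;&gt;As Internal Covariate Shift is corrected (? reportedly this is not what actually happens), the distribution of the hidden units shifts less when the input changes. For example, if the network has been training on images of cat A and then receives images of cat B, a different breed, the input distribution changes abruptly and the hidden units may have to change a lot in response; BatchNorm makes that less likely. Making the hidden units change less can also lead to better generalization.&lt;/p&gt;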
&lt;p data-ke-size=&quot;size18&quot;&gt;&amp;nbsp;&lt;/p&gt;
&lt;h3 data-ke-size=&quot;size23&quot;&gt;&lt;span style=&quot;color: #009a87;&quot;&gt;Various initialization methods&lt;/span&gt;&lt;/h3&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;The bias can simply be initialized to 0&lt;br /&gt;(&lt;a href=&quot;https://stackoverflow.com/questions/43498037/why-add-zero-bias-in-neural-networks&quot; target=&quot;_blank&quot; rel=&quot;noopener&quot;&gt;https://stackoverflow.com/questions/43498037/why-add-zero-bias-in-neural-networks&lt;/a&gt;)&lt;/li&gt;
&lt;li&gt;Xavier initialization
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;Depends on the number of nodes in the previous/next layer&lt;/li&gt;
&lt;li&gt;np.random.randn(fan_in, fan_out) / np.sqrt(fan_in)&lt;span style=&quot;color: #009a87;&quot;&gt; # scale by the number of inputs&lt;/span&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;He initialization
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;Better suited to ReLU&lt;/li&gt;
&lt;li&gt;because ReLU throws away half of the input distribution&lt;/li&gt;
&lt;li&gt;np.random.randn(fan_in, fan_out) / np.sqrt(fan_in&lt;span style=&quot;color: #ee2323;&quot;&gt;/2&lt;/span&gt;) &lt;span style=&quot;color: #009a87;&quot;&gt;# accounts for half of the inputs being dropped&lt;/span&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ul&gt;
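&lt;p data-ke-size=&quot;size18&quot;&gt;The two formulas above as runnable NumPy, in case it helps to see the resulting scales. The function names and layer sizes are illustrative; the Xavier variant here is the fan-in-only form from the list, not the version that averages fan-in and fan-out.&lt;/p&gt;

```python
import numpy as np

def xavier_init(fan_in, fan_out, rng):
    """Xavier-style initialization: scale unit Gaussians by 1 / sqrt(fan_in)."""
    return rng.standard_normal((fan_in, fan_out)) / np.sqrt(fan_in)

def he_init(fan_in, fan_out, rng):
    """He initialization: divide by sqrt(fan_in / 2) to offset ReLU zeroing half the inputs."""
    return rng.standard_normal((fan_in, fan_out)) / np.sqrt(fan_in / 2)

rng = np.random.default_rng(0)
W_xavier = xavier_init(512, 256, rng)
W_he = he_init(512, 256, rng)

print(W_xavier.std(), W_he.std())  # roughly sqrt(1/512) and sqrt(2/512)
```

&lt;p data-ke-size=&quot;size18&quot;&gt;The He weights have twice the variance of the Xavier weights, which compensates for ReLU passing only about half of the pre-activations.&lt;/p&gt;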
&lt;h3 data-ke-size=&quot;size23&quot;&gt;&amp;nbsp;&lt;/h3&gt;
&lt;h3 data-ke-size=&quot;size23&quot;&gt;&lt;span style=&quot;color: #009a87;&quot;&gt;Activations worth reviewing&lt;/span&gt;&lt;/h3&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;Sigmoid saturates at both ends (the slope approaches 0, which can cause vanishing gradients)&lt;/li&gt;
&lt;li&gt;tanh also saturates, but it is zero-centered&lt;/li&gt;
&lt;li&gt;ReLU = max(0, x); neurons can die (dead neurons)&lt;/li&gt;
&lt;li&gt;Leaky ReLU = max(0.1x, x)&lt;/li&gt;
&lt;li&gt;PReLU = max(ax, x)&lt;/li&gt;
&lt;li&gt;Beyond these, newer ones such as Swish have appeared and are worth a further look&lt;/li&gt;
&lt;/ul&gt;
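&lt;p data-ke-size=&quot;size18&quot;&gt;The activations in the list, written out in NumPy together with a quick look at saturation. The function names are my own, the 0.1 slope simply follows the list above, and the PReLU form assumes a is between 0 and 1.&lt;/p&gt;

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def relu(x):
    return np.maximum(0.0, x)

def leaky_relu(x, slope=0.1):
    # slope=0.1 follows the list above; other small slopes are also common
    return np.maximum(slope * x, x)

def prelu(x, a):
    # Same form as Leaky ReLU, but `a` is a learned parameter
    return np.maximum(a * x, x)

x = np.array([-10.0, -1.0, 0.0, 1.0, 10.0])

# Saturation: the derivatives of sigmoid and tanh are nearly zero at both ends.
sigmoid_grad = sigmoid(x) * (1.0 - sigmoid(x))
tanh_grad = 1.0 - np.tanh(x) ** 2

print(sigmoid_grad)
print(tanh_grad)
```

&lt;p data-ke-size=&quot;size18&quot;&gt;At x = -10 and x = 10 both derivatives are close to zero, so almost no gradient flows back through a saturated unit, while the sigmoid derivative peaks at 0.25 at x = 0.&lt;/p&gt;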
&lt;p data-ke-size=&quot;size18&quot;&gt;&amp;nbsp;&lt;/p&gt;
&lt;h3 data-ke-size=&quot;size23&quot;&gt;&lt;span style=&quot;color: #009a87;&quot;&gt;Feature Transform&lt;/span&gt;&lt;/h3&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;As data passes through the layers of a neural net, its distribution gradually changes into one that is well suited to solving the given task&lt;/li&gt;
&lt;li&gt;Passing through non-linear activations along the way reshapes the distribution so that input features that cannot be separated linearly become easy to classify&lt;/li&gt;
&lt;/ul&gt;
&lt;p data-ke-size=&quot;size18&quot;&gt;&amp;nbsp;&lt;/p&gt;
&lt;h3 data-ke-size=&quot;size23&quot;&gt;&lt;span style=&quot;color: #009a87;&quot;&gt;Why are activations needed?&lt;/span&gt;&lt;/h3&gt;
&lt;p data-ke-size=&quot;size18&quot;&gt;A network built only from linear + linear + linear layers&lt;/p&gt;
&lt;p data-ke-size=&quot;size18&quot;&gt;is no better than a single linear layer.&lt;/p&gt;
&lt;p data-ke-size=&quot;size18&quot;&gt;In other words, it cannot capture &lt;b&gt;complex patterns&lt;/b&gt; (non-linear relationships).&lt;/p&gt;
&lt;p data-ke-size=&quot;size18&quot;&gt;Adding non-linear activations makes it possible to crumple or stretch the input feature space (a non-linear feature transform), which eventually yields linearly separable features (for a classification problem).&lt;/p&gt;
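&lt;p data-ke-size=&quot;size18&quot;&gt;The claim that stacked linear layers collapse into one can be checked numerically. The shapes and random weights below are arbitrary choices for illustration.&lt;/p&gt;

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.standard_normal((5, 4))
W1, b1 = rng.standard_normal((4, 8)), rng.standard_normal(8)
W2, b2 = rng.standard_normal((8, 3)), rng.standard_normal(3)

# Two stacked linear layers ...
two_layers = (x @ W1 + b1) @ W2 + b2

# ... equal a single linear layer with merged weights and bias.
W_merged = W1 @ W2
b_merged = b1 @ W2 + b2
one_layer = x @ W_merged + b_merged

# With a ReLU in between, the merged layer no longer reproduces the output.
with_relu = np.maximum(0.0, x @ W1 + b1) @ W2 + b2

print(np.allclose(two_layers, one_layer), np.allclose(with_relu, one_layer))
```

&lt;p data-ke-size=&quot;size18&quot;&gt;Algebraically, (x W1 + b1) W2 + b2 = x (W1 W2) + (b1 W2 + b2), so depth adds nothing until a non-linearity sits between the layers.&lt;/p&gt;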
&lt;p data-ke-size=&quot;size18&quot;&gt;&amp;nbsp;&lt;/p&gt;
&lt;h3 data-ke-size=&quot;size23&quot;&gt;&lt;span style=&quot;color: #009a87;&quot;&gt;Why is it better for inputs to be zero-centered?&lt;/span&gt;&lt;/h3&gt;
&lt;p data-ke-size=&quot;size18&quot;&gt;(Assuming sigmoid is used as the activation)&lt;/p&gt;
&lt;p data-ke-size=&quot;size18&quot;&gt;The output of sigmoid is not zero-centered; it is always positive.&lt;/p&gt;
&lt;p data-ke-size=&quot;size18&quot;&gt;This means the gradients of all the weights fed by those outputs share one sign, which is set by the sign of the upstream gradient.&lt;/p&gt;
&lt;p data-ke-size=&quot;size18&quot;&gt;So, depending on the sign of the gradient from the next layer, optimization can end up following a zig-zag path.&lt;/p&gt;
&lt;p data-ke-size=&quot;size18&quot;&gt;&lt;span style=&quot;color: #9d9d9d;&quot;&gt;*If this does not make sense, go back and review the back-prop calculation&lt;/span&gt;&lt;/p&gt;
&lt;p data-ke-size=&quot;size18&quot;&gt;&amp;nbsp;&lt;/p&gt;
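&lt;p data-ke-size=&quot;size18&quot;&gt;ReLU's output is not zero-centered either (it is never negative), and it discards half of the input distribution.&lt;/p&gt;
&lt;p data-ke-size=&quot;size18&quot;&gt;Poor initialization or an LR that is too high can reportedly lead to dead ReLUs.&lt;/p&gt;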
&lt;p data-ke-size=&quot;size18&quot;&gt;&amp;nbsp;&lt;/p&gt;
&lt;p data-ke-size=&quot;size18&quot;&gt;&amp;nbsp;&lt;/p&gt;
&lt;h3 data-ke-size=&quot;size23&quot;&gt;&amp;nbsp;&lt;/h3&gt;</description>
      <category>DL&amp;ML/concept</category>
      <category>Interview</category>
      <category>ML</category>
      <author>식피두</author>
      <guid isPermaLink="true">https://aimaster.tistory.com/99</guid>
      <comments>https://aimaster.tistory.com/99#entry99comment</comments>
      <pubDate>Fri, 14 May 2021 15:50:22 +0900</pubDate>
    </item>
    <item>
      <title>Summary of top solutions from the Kaggle Shopee competition</title>
      <link>https://aimaster.tistory.com/98</link>
      <description>&lt;p data-ke-size=&quot;size16&quot;&gt;The &lt;b&gt;Shopee - Price Match Guarantee&lt;/b&gt; competition ended recently.&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;I joined late and unfortunately missed out on a medal, but I enjoyed working as a team over a short period.&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&amp;nbsp;&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;It was a multimodal competition: given a product image and title, find the ids of similar products.&lt;/p&gt;
&lt;p&gt;&lt;figure class=&quot;imageblock alignCenter&quot; data-origin-width=&quot;1876&quot; data-origin-height=&quot;406&quot; data-filename=&quot;스크린샷 2021-05-13 오후 11.06.57.png&quot; data-ke-mobilestyle=&quot;widthOrigin&quot;&gt;&lt;span data-url=&quot;https://blog.kakaocdn.net/dn/HeX5m/btq4OhdKOum/XejfmoLnb3I51HYpcV6Hz1/img.png&quot; data-phocus=&quot;https://blog.kakaocdn.net/dn/HeX5m/btq4OhdKOum/XejfmoLnb3I51HYpcV6Hz1/img.png&quot; data-alt=&quot;https://www.kaggle.com/c/shopee-product-matching&quot;&gt;&lt;img src=&quot;https://blog.kakaocdn.net/dn/HeX5m/btq4OhdKOum/XejfmoLnb3I51HYpcV6Hz1/img.png&quot; srcset=&quot;https://img1.daumcdn.net/thumb/R1280x0/?scode=mtistory2&amp;fname=https%3A%2F%2Fblog.kakaocdn.net%2Fdn%2FHeX5m%2Fbtq4OhdKOum%2FXejfmoLnb3I51HYpcV6Hz1%2Fimg.png&quot; data-origin-width=&quot;1876&quot; data-origin-height=&quot;406&quot; data-filename=&quot;스크린샷 2021-05-13 오후 11.06.57.png&quot; data-ke-mobilestyle=&quot;widthOrigin&quot; onerror=&quot;this.onerror=null; this.src='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png'; this.srcset='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png';&quot;/&gt;&lt;/span&gt;&lt;figcaption&gt;https://www.kaggle.com/c/shopee-product-matching&lt;/figcaption&gt;
&lt;/figure&gt;
&lt;/p&gt;
&lt;p data-ke-size=&quot;size18&quot;&gt;&amp;nbsp;&lt;/p&gt;
&lt;p data-ke-size=&quot;size18&quot;&gt;Several top solutions have been published, so I summarized some of the key ideas from the top finishers.&lt;/p&gt;
&lt;p data-ke-size=&quot;size18&quot;&gt;&amp;nbsp;&lt;/p&gt;
&lt;h3 data-ke-size=&quot;size23&quot;&gt;1st place solution (yoonsoo, a Korean Kaggler; from embeddings to matches)&lt;/h3&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;&lt;a href=&quot;https://www.kaggle.com/c/shopee-product-matching/discussion/238136&quot; target=&quot;_blank&quot; rel=&quot;noopener&quot;&gt;https://www.kaggle.com/c/shopee-product-matching/discussion/238136&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;eca_nfnet_l1, xlm-roberta-large, xlm-roberta-base, bert-base-indonesian-1.5G, indobert-large-p1, bert-base-multilingual-uncased (&lt;span style=&quot;color: #009a87;&quot;&gt;I doubted that using Indonesian models would matter, but many teams did&lt;/span&gt;)&lt;/li&gt;
&lt;li&gt;Models trained with &lt;span style=&quot;color: #ee2323;&quot;&gt;ArcFace&lt;/span&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;A sufficiently large margin was important for the quality of the embeddings&lt;/li&gt;
&lt;li&gt;But that caused convergence issues, which were solved as follows
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;Gradually increase the margin as training progresses&lt;/li&gt;
&lt;li&gt;Use a large number of warmup steps&lt;/li&gt;
&lt;li&gt;&lt;b&gt;Use a larger learning rate for the cosine head&lt;/b&gt;&lt;/li&gt;
&lt;li&gt;&lt;b&gt;Apply gradient clipping&lt;/b&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;For image models a margin of 0.8 ~ 1.0 worked well; for text models, 0.6 ~ 0.8 (&lt;span style=&quot;color: #009a87;&quot;&gt;I had no idea the margin mattered this much...&lt;/span&gt;)
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;Start at 0.2 and raise it gradually during training&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;Adopted a class-size-adaptive margin, following the Google Landmark Recognition solution (&lt;a href=&quot;https://arxiv.org/abs/2010.05350&quot; target=&quot;_blank&quot; rel=&quot;noopener&quot;&gt;https://arxiv.org/abs/2010.05350&lt;/a&gt;) (&lt;span style=&quot;color: #009a87;&quot;&gt;looking at similar past competitions matters&lt;/span&gt;)&lt;/li&gt;
&lt;li&gt;Applying BatchNorm + feature-wise normalization after the embedding (produced by Global Average Pooling or plain pooling) worked well&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;Matching with the image and text embeddings. Three approaches were tried, and the last one worked best
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;Match using text embeddings only, match using image embeddings only, then take the union of the two (what most people did)&lt;/li&gt;
&lt;li&gt;Concatenate the text and image embeddings and run a combined match (much better than the first approach)
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;Pick items whose distance under each individual embedding falls below a threshold (strong suggestion)&lt;/li&gt;
&lt;li&gt;Pick items whose combined distance falls below a somewhat looser threshold (moderate suggestion)&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;Union of all three (best)&lt;/li&gt;
&lt;li&gt;Jointly training an image + text model did not work well&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;Proposed Iterative Neighborhood Blending (refines the individual embeddings)
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;cosine distance = 1 - cosine similarity&lt;/li&gt;
&lt;li&gt;Run k-NN search (NNS, Nearest Neighbor Search) on cosine distance; only items whose distance is below the threshold count as neighbors.&lt;br /&gt;(+ adjusted so that every item gets at least 2 matches when nothing falls within the threshold)&lt;/li&gt;
&lt;li&gt;&lt;span style=&quot;color: #ee2323;&quot;&gt;Neighborhood Blending&lt;/span&gt; (mentioned in other solutions under the name Query Expansion)
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;Connect the neighbors found above with edges and treat the cosine similarity as the edge weight, forming a graph
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;Only pairs that pass the threshold are considered connected&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;Update a node's embedding (the query) with the weighted sum of its neighbors' embeddings (Query Expansion)
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;This reportedly makes &lt;b&gt;the clusters more distinct&lt;/b&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;Running NNS again on the result yields new neighbors.
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;Repeat until the evaluation metric stops improving (see the code in the linked write-up)&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;In addition, for image training, cutmix (0.1) + horizontal-flip-only augmentation reportedly worked well.&lt;/li&gt;
&lt;/ul&gt;
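&lt;p data-ke-size=&quot;size18&quot;&gt;To make the neighborhood blending step more concrete, here is a rough NumPy sketch based only on the description above. It is not the winning team's code: the function names, the 0.5 similarity threshold, the toy data, and the fixed number of iterations (instead of stopping when the evaluation metric stops improving) are all assumptions.&lt;/p&gt;

```python
import numpy as np

def l2_normalize(emb):
    """Scale every row to unit length."""
    return emb / np.linalg.norm(emb, axis=1, keepdims=True)

def neighborhood_blend(emb, threshold=0.5):
    """One blending step over an (items, dim) embedding matrix."""
    emb = l2_normalize(emb)
    sim = emb @ emb.T                                    # cosine similarity
    # Keep only edges whose similarity passes the threshold (self-edges included).
    weights = np.where(np.greater_equal(sim, threshold), sim, 0.0)
    # Each embedding becomes the similarity-weighted sum of its neighbors.
    return l2_normalize(weights @ emb)

# Hypothetical toy data: two noisy clusters of five items each.
rng = np.random.default_rng(0)
centers = np.array([[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]])
emb = np.vstack([c + 0.1 * rng.standard_normal((5, 3)) for c in centers])

blended = emb
for _ in range(3):   # fixed count; an assumption replacing the metric-based stop
    blended = neighborhood_blend(blended)
```

&lt;p data-ke-size=&quot;size18&quot;&gt;Because every item is pulled toward the weighted center of its neighbors, items in the same cluster end up closer together, which is the &quot;more distinct clusters&quot; effect described above. A real pipeline would use an approximate nearest neighbor search rather than the full similarity matrix.&lt;/p&gt;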
&lt;p data-ke-size=&quot;size18&quot;&gt;&amp;nbsp;&lt;/p&gt;
&lt;h3 data-ke-size=&quot;size23&quot;&gt;2nd place solution&lt;/h3&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;&lt;a href=&quot;https://www.kaggle.com/c/shopee-product-matching/discussion/238022&quot; target=&quot;_blank&quot; rel=&quot;noopener&quot;&gt;https://www.kaggle.com/c/shopee-product-matching/discussion/238022&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;Built a 2-stage model
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;Stage 1: obtain embeddings for image, text, and image+text&lt;/li&gt;
&lt;li&gt;Stage 2: train a&lt;span style=&quot;color: #ee2323;&quot;&gt; meta-model&lt;/span&gt; that decides whether each pair of items belongs to the same label group
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;LightGBM &amp;amp; Graph Attention Network&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;Cosine similarity using NFNet-F0 and ViT embeddings&lt;/li&gt;
&lt;li&gt;CurricularFace loss (reportedly better than ArcFace)&lt;/li&gt;
&lt;li&gt;SAM optimizer (&lt;span style=&quot;color: #009a87;&quot;&gt;need to study this&lt;/span&gt;)&lt;/li&gt;
&lt;li&gt;indonesian-BERT, multilingual-BERT, paraphrase-XLM&lt;/li&gt;
&lt;li&gt;Text Similarity / Image Similarity / Text + Image Similarity
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;In particular, the last one, the multimodal similarity, was trained by concatenating the final layers of NFNet-F0 and Indonesian BERT&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;Like the 1st place solution, they appear to use graph features, with PageRank used to update nodes at particular positions
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;+ Query Expansion: augment each embedding with a weighted average of its neighbors&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;Appears to post-process the neighbor results so that A-B and B-A agree&lt;/li&gt;
&lt;li&gt;Appears to post-process items that span multiple label groups&lt;/li&gt;
&lt;/ul&gt;
&lt;p data-ke-size=&quot;size18&quot;&gt;&amp;nbsp;&lt;/p&gt;
&lt;h3 data-ke-size=&quot;size23&quot;&gt;&lt;span style=&quot;color: #ee2323;&quot;&gt;* &lt;/span&gt;6th place solution&lt;/h3&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;&lt;a href=&quot;https://www.kaggle.com/c/shopee-product-matching/discussion/238010&quot; target=&quot;_blank&quot; rel=&quot;noopener&quot;&gt;https://www.kaggle.com/c/shopee-product-matching/discussion/238010&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;Trained a multi-modal model; what sets it apart from the others:
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;ArcFace training on the image embedding alone&lt;/li&gt;
&lt;li&gt;ArcFace training on the text embedding alone&lt;/li&gt;
&lt;li&gt;ArcFace training on the image+text embedding at the same time
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;In our case we tried only this one, and gave up because training did not go well...&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;세 개의 태스크를 동시에 학습!! (링크에 그림 참고)&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;So a single model contains both image and text backbones and outputs three embeddings!&lt;/li&gt;
&lt;li&gt;Several multi-modal models were built this way and ensembled.
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;Here they used both Euclidean distance and cosine similarity rather than only one of them (&lt;span style=&quot;color: #009a87;&quot;&gt;why did I only ever think of picking one...&lt;/span&gt;)
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;Voting is done over all of the results (with 4 models, 12 sets x 2 of predictions)&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;The per-model training parameters show that the newly added heads were given a larger learning rate than the backbone&lt;/li&gt;
&lt;li&gt;Schedulers varied from model to model...
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;linear+warmup&lt;/li&gt;
&lt;li&gt;ReduceLROnPlateau&lt;/li&gt;
&lt;li&gt;CosineAnnealingWarmRestarts&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;Adjusted the threshold so that every item ends up in a label group of at least 2 (same as the other solutions)&lt;/li&gt;
&lt;li&gt;Extracted product units (200gram, 200gr) and removed matches whose units differ&lt;/li&gt;
&lt;/ul&gt;
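&lt;p data-ke-size=&quot;size18&quot;&gt;The unit-based filtering could look something like the sketch below. The unit table, the regular expression, and the rule that both the amount and the unit must agree are assumptions made for illustration; the original write-up only says that matches with different units were removed.&lt;/p&gt;

```python
import re

# Hypothetical unit table: maps spelling variants to one canonical unit.
UNIT_ALIASES = {"gram": "g", "gr": "g", "g": "g", "kg": "kg", "ml": "ml", "l": "l"}
UNIT_PATTERN = re.compile(r"(\d+(?:\.\d+)?)\s*(gram|gr|kg|ml|g|l)\b", re.IGNORECASE)


def extract_units(title):
    """Return the set of (amount, canonical unit) pairs found in a title."""
    return {
        (float(amount), UNIT_ALIASES[unit.lower()])
        for amount, unit in UNIT_PATTERN.findall(title)
    }


def units_conflict(title_a, title_b):
    """True when both titles state units and share none of them."""
    units_a, units_b = extract_units(title_a), extract_units(title_b)
    return bool(units_a) and bool(units_b) and units_a.isdisjoint(units_b)


def filter_matches(query_title, candidate_titles):
    """Drop candidates whose units contradict the query title."""
    return [t for t in candidate_titles if not units_conflict(query_title, t)]


kept = filter_matches(
    "Kopi Bubuk 200gram",
    ["kopi bubuk 200 gr", "Kopi Bubuk 500gr", "kopi bubuk murah"],
)
```

&lt;p data-ke-size=&quot;size18&quot;&gt;Titles without any unit are kept on purpose, since the absence of a unit says nothing about whether two listings match.&lt;/p&gt;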
&lt;p data-ke-size=&quot;size18&quot;&gt;&amp;nbsp;&lt;/p&gt;
&lt;h3 data-ke-size=&quot;size23&quot;&gt;14th place solution&lt;/h3&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;&lt;a href=&quot;https://www.kaggle.com/c/shopee-product-matching/discussion/238033&quot; target=&quot;_blank&quot; rel=&quot;noopener&quot;&gt;https://www.kaggle.com/c/shopee-product-matching/discussion/238033&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;To be summarized&lt;/li&gt;
&lt;/ul&gt;
&lt;h3 data-ke-size=&quot;size23&quot;&gt;&amp;nbsp;&lt;/h3&gt;
&lt;p data-ke-size=&quot;size18&quot;&gt;&amp;nbsp;&lt;/p&gt;</description>
      <category>DL&amp;ML/code.data.tips</category>
      <category>Kaggle</category>
      <category>shopee</category>
      <author>식피두</author>
      <guid isPermaLink="true">https://aimaster.tistory.com/98</guid>
      <comments>https://aimaster.tistory.com/98#entry98comment</comments>
      <pubDate>Thu, 13 May 2021 22:58:57 +0900</pubDate>
    </item>
    <item>
      <title>When a model produces the same output for every input after training</title>
      <link>https://aimaster.tistory.com/97</link>
      <description>&lt;p data-ke-size=&quot;size18&quot;&gt;To get embeddings for similarity matching,&amp;nbsp;&lt;/p&gt;
&lt;p data-ke-size=&quot;size18&quot;&gt;I took a pretrained model and fine-tuned it to produce embeddings better suited to the task,&lt;/p&gt;
&lt;p data-ke-size=&quot;size18&quot;&gt;training with ArcFace loss, but after training the model strangely produced the same output for every input.&lt;/p&gt;
&lt;p data-ke-size=&quot;size18&quot;&gt;&amp;nbsp;&lt;/p&gt;
&lt;p data-ke-size=&quot;size18&quot;&gt;The loss was going down, so I assumed training was working...&lt;/p&gt;
&lt;p data-ke-size=&quot;size18&quot;&gt;&amp;nbsp;&lt;/p&gt;
&lt;p data-ke-size=&quot;size18&quot;&gt;Following a hint from the post below, I checked and found that the learning rate was simply too high.&lt;/p&gt;
&lt;p data-ke-size=&quot;size18&quot;&gt;&amp;nbsp;&lt;/p&gt;
&lt;p data-ke-size=&quot;size18&quot;&gt;Most likely, while giving the FC layer inside ArcFace a higher LR than the rest,&lt;/p&gt;
&lt;p data-ke-size=&quot;size18&quot;&gt;I assigned one that was far too large, and training went wrong.&lt;/p&gt;
&lt;p data-ke-size=&quot;size18&quot;&gt;&amp;nbsp;&lt;/p&gt;
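&lt;p data-ke-size=&quot;size18&quot;&gt;Since a falling loss did not reveal the problem, one cheap sanity check is to look at how much the outputs vary across a batch. The sketch below is only an illustration on plain Python lists; the helper names and the tolerance are assumptions, and in practice the same check would be run on the model's embeddings for a validation batch.&lt;/p&gt;

```python
import math
import statistics


def max_feature_std(outputs):
    """Largest per-dimension standard deviation across a batch of outputs."""
    columns = zip(*outputs)  # transpose: one tuple per output dimension
    return max(statistics.pstdev(col) for col in columns)


def is_collapsed(outputs, tol=1e-6):
    """True when every input yields (numerically) the same output vector."""
    return math.isclose(max_feature_std(outputs), 0.0, abs_tol=tol)


healthy = [[0.1, 0.9], [0.8, 0.2], [0.4, 0.5]]
collapsed = [[0.3, 0.7], [0.3, 0.7], [0.3, 0.7]]

healthy_flag = is_collapsed(healthy)
collapsed_flag = is_collapsed(collapsed)
```

&lt;p data-ke-size=&quot;size18&quot;&gt;If the check fires right after the first few hundred steps, lowering the learning rate of the newly added layer is a reasonable first thing to try.&lt;/p&gt;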
&lt;p data-ke-size=&quot;size18&quot;&gt;&lt;a href=&quot;https://discuss.pytorch.org/t/outputs-from-a-simple-dnn-are-always-the-same-whatever-the-input-is/14969&quot; target=&quot;_blank&quot; rel=&quot;noopener&quot;&gt;https://discuss.pytorch.org/t/outputs-from-a-simple-dnn-are-always-the-same-whatever-the-input-is/14969&lt;/a&gt;&lt;/p&gt;
&lt;p data-ke-size=&quot;size18&quot;&gt;&amp;nbsp;&lt;/p&gt;</description>
      <category>DL&amp;ML/code.data.tips</category>
      <author>식피두</author>
      <guid isPermaLink="true">https://aimaster.tistory.com/97</guid>
      <comments>https://aimaster.tistory.com/97#entry97comment</comments>
      <pubDate>Mon, 3 May 2021 20:53:33 +0900</pubDate>
    </item>
    <item>
      <title>Knowledge Distillation: A Survey</title>
      <link>https://aimaster.tistory.com/96</link>
      <description>&lt;p&gt;&lt;a href=&quot;https://arxiv.org/pdf/2006.05525.pdf&quot; target=&quot;_blank&quot; rel=&quot;noopener&quot;&gt;arxiv.org/pdf/2006.05525.pdf&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;&amp;nbsp;&lt;/p&gt;
&lt;p&gt;A survey paper on Knowledge Distillation (KD), a model compression technique.&lt;/p&gt;
&lt;p&gt;&amp;nbsp;&lt;/p&gt;
&lt;p&gt;This post summarizes what KD consists of and how its training works.&lt;/p&gt;
&lt;p&gt;&amp;nbsp;&lt;/p&gt;
&lt;p&gt;If you want to deploy a deep learning model to mobile devices with limited resources, making the model lightweight is a must.&lt;/p&gt;
&lt;p&gt;With KD you can not only &lt;span style=&quot;color: #006dd7;&quot;&gt;&lt;span style=&quot;color: #ee2323;&quot;&gt;compress&lt;/span&gt; the model&lt;/span&gt; but also &lt;span style=&quot;color: #006dd7;&quot;&gt;&lt;span style=&quot;color: #ee2323;&quot;&gt;accelerate&lt;/span&gt; inference&lt;/span&gt;.&lt;/p&gt;
&lt;p&gt;&amp;nbsp;&lt;/p&gt;
&lt;p&gt;Model compression techniques are also useful when building production services based on deep learning.&lt;/p&gt;
&lt;h3 data-ke-size=&quot;size23&quot;&gt;&lt;span style=&quot;color: #009a87;&quot;&gt;Model Compression / Acceleration Methods&lt;/span&gt;&lt;/h3&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;Parameter Pruning / Sharing&lt;/li&gt;
&lt;li&gt;Low-rank factorization&lt;/li&gt;
&lt;li&gt;&lt;span style=&quot;color: #ee2323;&quot;&gt;Knowledge Distillation&lt;/span&gt;&lt;/li&gt;
&lt;li&gt;and so on&lt;/li&gt;
&lt;/ul&gt;
&lt;h3 data-ke-size=&quot;size23&quot;&gt;&lt;span style=&quot;color: #009a87;&quot;&gt;Knowledge Distillation&lt;/span&gt;&lt;/h3&gt;
&lt;p&gt;KD pairs a small Student model with a large Teacher model (&lt;span style=&quot;color: #006dd7;&quot;&gt;Capacity Gap&lt;/span&gt;)&lt;/p&gt;
&lt;p&gt;and transfers the Teacher's knowledge into the Student.&lt;/p&gt;
&lt;p&gt;&amp;nbsp;&lt;/p&gt;
&lt;p&gt;&lt;figure class=&quot;imageblock alignLeft&quot; data-filename=&quot;스크린샷 2021-04-28 오전 12.51.28.png&quot; data-origin-width=&quot;1322&quot; data-origin-height=&quot;520&quot; width=&quot;461&quot; height=&quot;NaN&quot; data-ke-mobilestyle=&quot;widthContent&quot;&gt;&lt;span data-url=&quot;https://blog.kakaocdn.net/dn/kQThP/btq3CdXaMO5/lKtjVO4bW1N7pqiGOQIju1/img.png&quot; data-phocus=&quot;https://blog.kakaocdn.net/dn/kQThP/btq3CdXaMO5/lKtjVO4bW1N7pqiGOQIju1/img.png&quot;&gt;&lt;img src=&quot;https://blog.kakaocdn.net/dn/kQThP/btq3CdXaMO5/lKtjVO4bW1N7pqiGOQIju1/img.png&quot; srcset=&quot;https://img1.daumcdn.net/thumb/R1280x0/?scode=mtistory2&amp;fname=https%3A%2F%2Fblog.kakaocdn.net%2Fdn%2FkQThP%2Fbtq3CdXaMO5%2FlKtjVO4bW1N7pqiGOQIju1%2Fimg.png&quot; data-filename=&quot;스크린샷 2021-04-28 오전 12.51.28.png&quot; data-origin-width=&quot;1322&quot; data-origin-height=&quot;520&quot; width=&quot;461&quot; height=&quot;NaN&quot; data-ke-mobilestyle=&quot;widthContent&quot; onerror=&quot;this.onerror=null; this.src='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png'; this.srcset='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png';&quot;/&gt;&lt;/span&gt;&lt;/figure&gt;
&lt;/p&gt;
&lt;p&gt;&amp;nbsp;&lt;/p&gt;
&lt;p&gt;KD consists of the following three components.&lt;/p&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;Knowledge&lt;/li&gt;
&lt;li&gt;Distillation Algorithm&lt;/li&gt;
&lt;li&gt;Teacher Student Architecture&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&amp;nbsp;&lt;/p&gt;
&lt;p&gt;Below is the architecture of a vanilla KD model.&lt;/p&gt;
&lt;p&gt;The Student model learns from the ground-truth labels and, at the same time,&lt;/p&gt;
&lt;p&gt;from the soft targets generated by the Teacher. (Cross Entropy)&lt;/p&gt;
&lt;p&gt;&amp;nbsp;&lt;/p&gt;
&lt;p&gt;&lt;figure class=&quot;imageblock alignLeft&quot; data-filename=&quot;스크린샷 2021-04-28 오전 12.59.33.png&quot; data-origin-width=&quot;1400&quot; data-origin-height=&quot;380&quot; width=&quot;627&quot; height=&quot;NaN&quot; data-ke-mobilestyle=&quot;widthContent&quot;&gt;&lt;span data-url=&quot;https://blog.kakaocdn.net/dn/bvL8ZS/btq3DKGSJX7/fW5WHuUnZZBan88yTmgwh0/img.png&quot; data-phocus=&quot;https://blog.kakaocdn.net/dn/bvL8ZS/btq3DKGSJX7/fW5WHuUnZZBan88yTmgwh0/img.png&quot; data-alt=&quot;bench mark model of vanilla KD&quot;&gt;&lt;img src=&quot;https://blog.kakaocdn.net/dn/bvL8ZS/btq3DKGSJX7/fW5WHuUnZZBan88yTmgwh0/img.png&quot; srcset=&quot;https://img1.daumcdn.net/thumb/R1280x0/?scode=mtistory2&amp;fname=https%3A%2F%2Fblog.kakaocdn.net%2Fdn%2FbvL8ZS%2Fbtq3DKGSJX7%2FfW5WHuUnZZBan88yTmgwh0%2Fimg.png&quot; data-filename=&quot;스크린샷 2021-04-28 오전 12.59.33.png&quot; data-origin-width=&quot;1400&quot; data-origin-height=&quot;380&quot; width=&quot;627&quot; height=&quot;NaN&quot; data-ke-mobilestyle=&quot;widthContent&quot; onerror=&quot;this.onerror=null; this.src='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png'; this.srcset='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png';&quot;/&gt;&lt;/span&gt;&lt;figcaption&gt;bench mark model of vanilla KD&lt;/figcaption&gt;
&lt;/figure&gt;
&lt;/p&gt;
&lt;h3 data-ke-size=&quot;size23&quot;&gt;&amp;nbsp;&lt;/h3&gt;
&lt;h3 data-ke-size=&quot;size23&quot;&gt;&lt;span style=&quot;color: #009a87;&quot;&gt;Knowledge?&lt;/span&gt;&lt;/h3&gt;
&lt;p&gt;The knowledge that KD transfers to the Student falls into roughly three categories.&lt;/p&gt;
&lt;p&gt;&amp;nbsp;&lt;/p&gt;
&lt;p&gt;&lt;figure class=&quot;imageblock alignLeft&quot; width=&quot;452&quot; height=&quot;NaN&quot; data-filename=&quot;스크린샷 2021-04-28 오전 12.54.28.png&quot; data-origin-width=&quot;1036&quot; data-origin-height=&quot;654&quot; data-ke-mobilestyle=&quot;widthContent&quot;&gt;&lt;span data-url=&quot;https://blog.kakaocdn.net/dn/chUz4f/btq3DtyBdOm/rrhFQT8JAm0T3sSaTfEWU1/img.png&quot; data-phocus=&quot;https://blog.kakaocdn.net/dn/chUz4f/btq3DtyBdOm/rrhFQT8JAm0T3sSaTfEWU1/img.png&quot; data-alt=&quot;Response/Feature/Relation Based Knowledge&quot;&gt;&lt;img src=&quot;https://blog.kakaocdn.net/dn/chUz4f/btq3DtyBdOm/rrhFQT8JAm0T3sSaTfEWU1/img.png&quot; srcset=&quot;https://img1.daumcdn.net/thumb/R1280x0/?scode=mtistory2&amp;fname=https%3A%2F%2Fblog.kakaocdn.net%2Fdn%2FchUz4f%2Fbtq3DtyBdOm%2FrrhFQT8JAm0T3sSaTfEWU1%2Fimg.png&quot; width=&quot;452&quot; height=&quot;NaN&quot; data-filename=&quot;스크린샷 2021-04-28 오전 12.54.28.png&quot; data-origin-width=&quot;1036&quot; data-origin-height=&quot;654&quot; data-ke-mobilestyle=&quot;widthContent&quot; onerror=&quot;this.onerror=null; this.src='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png'; this.srcset='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png';&quot;/&gt;&lt;/span&gt;&lt;figcaption&gt;Response/Feature/Relation Based Knowledge&lt;/figcaption&gt;
&lt;/figure&gt;
&lt;/p&gt;
&lt;p&gt;&amp;nbsp;&lt;/p&gt;
&lt;p&gt;&lt;b&gt;&lt;span style=&quot;color: #006dd7;&quot;&gt;Response-Based Knowledge&lt;/span&gt;&lt;/b&gt;&lt;/p&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;The Teacher model's final predictions (logits)&lt;/li&gt;
&lt;li&gt;The Teacher's probability distribution can be used as soft targets to train the Student&lt;br /&gt;
&lt;ul style=&quot;list-style-type: disc;&quot;&gt;
&lt;li&gt;The two distributions are matched with a KL-divergence loss&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ul&gt;
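&lt;p&gt;As a rough sketch of the response-based loss described above, the snippet below softens both sets of logits and measures the KL divergence between them. The temperature parameter and the function names are assumptions for illustration (they are not taken from this post), and the usual scaling of the loss by the squared temperature is left out to keep the example short.&lt;/p&gt;

```python
import math


def softmax(logits, temperature=1.0):
    """Softmax over logits scaled by 1/temperature (numerically stable)."""
    scaled = [z / temperature for z in logits]
    peak = max(scaled)
    exps = [math.exp(z - peak) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def kl_divergence(p, q):
    """KL(p || q) for two discrete distributions with q strictly positive."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi != 0.0)


def response_kd_loss(teacher_logits, student_logits, temperature=4.0):
    """KL divergence between temperature-softened teacher and student outputs."""
    soft_targets = softmax(teacher_logits, temperature)
    student_probs = softmax(student_logits, temperature)
    return kl_divergence(soft_targets, student_probs)


teacher_logits = [6.0, 2.0, -1.0]
loss_far = response_kd_loss(teacher_logits, [0.0, 3.0, 1.0])
loss_near = response_kd_loss(teacher_logits, [5.5, 2.2, -0.8])
loss_same = response_kd_loss(teacher_logits, teacher_logits)
```

&lt;p&gt;The loss shrinks as the Student's logits approach the Teacher's and reaches zero when the two distributions coincide.&lt;/p&gt;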
&lt;p&gt;&lt;figure class=&quot;imageblock alignLeft&quot; width=&quot;416&quot; height=&quot;NaN&quot; data-filename=&quot;스크린샷 2021-04-28 오전 12.55.56.png&quot; data-origin-width=&quot;684&quot; data-origin-height=&quot;268&quot; data-ke-mobilestyle=&quot;widthContent&quot;&gt;&lt;span data-url=&quot;https://blog.kakaocdn.net/dn/9dbvs/btq3FdBOxgO/NO6HNpQgZRKtDnVHRT74aK/img.png&quot; data-phocus=&quot;https://blog.kakaocdn.net/dn/9dbvs/btq3FdBOxgO/NO6HNpQgZRKtDnVHRT74aK/img.png&quot; data-alt=&quot;Responsed-Based KD&quot;&gt;&lt;img src=&quot;https://blog.kakaocdn.net/dn/9dbvs/btq3FdBOxgO/NO6HNpQgZRKtDnVHRT74aK/img.png&quot; srcset=&quot;https://img1.daumcdn.net/thumb/R1280x0/?scode=mtistory2&amp;fname=https%3A%2F%2Fblog.kakaocdn.net%2Fdn%2F9dbvs%2Fbtq3FdBOxgO%2FNO6HNpQgZRKtDnVHRT74aK%2Fimg.png&quot; width=&quot;416&quot; height=&quot;NaN&quot; data-filename=&quot;스크린샷 2021-04-28 오전 12.55.56.png&quot; data-origin-width=&quot;684&quot; data-origin-height=&quot;268&quot; data-ke-mobilestyle=&quot;widthContent&quot; onerror=&quot;this.onerror=null; this.src='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png'; this.srcset='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png';&quot;/&gt;&lt;/span&gt;&lt;figcaption&gt;Responsed-Based KD&lt;/figcaption&gt;
&lt;/figure&gt;
&lt;/p&gt;
&lt;p&gt;&amp;nbsp;&lt;/p&gt;
&lt;p&gt;&lt;b&gt;&lt;span style=&quot;color: #006dd7;&quot;&gt;Feature-Based Knowledge&lt;/span&gt;&lt;/b&gt;&lt;/p&gt;
&lt;ul style=&quot;list-style-type: disc;&quot;&gt;
&lt;li&gt;Transfers knowledge from the intermediate layers of the neural network
&lt;ul style=&quot;list-style-type: disc;&quot;&gt;
&lt;li&gt;Also referred to as hints&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;figure class=&quot;imageblock alignLeft&quot; data-filename=&quot;스크린샷 2021-04-28 오전 1.05.46.png&quot; data-origin-width=&quot;688&quot; data-origin-height=&quot;316&quot; width=&quot;385&quot; height=&quot;NaN&quot; data-ke-mobilestyle=&quot;widthContent&quot;&gt;&lt;span data-url=&quot;https://blog.kakaocdn.net/dn/ASbAZ/btq3AQODoj9/tdlBNdrcGJtOy7nASLkYS1/img.png&quot; data-phocus=&quot;https://blog.kakaocdn.net/dn/ASbAZ/btq3AQODoj9/tdlBNdrcGJtOy7nASLkYS1/img.png&quot; data-alt=&quot;Feature-Based Knowledge&quot;&gt;&lt;img src=&quot;https://blog.kakaocdn.net/dn/ASbAZ/btq3AQODoj9/tdlBNdrcGJtOy7nASLkYS1/img.png&quot; srcset=&quot;https://img1.daumcdn.net/thumb/R1280x0/?scode=mtistory2&amp;fname=https%3A%2F%2Fblog.kakaocdn.net%2Fdn%2FASbAZ%2Fbtq3AQODoj9%2FtdlBNdrcGJtOy7nASLkYS1%2Fimg.png&quot; data-filename=&quot;스크린샷 2021-04-28 오전 1.05.46.png&quot; data-origin-width=&quot;688&quot; data-origin-height=&quot;316&quot; width=&quot;385&quot; height=&quot;NaN&quot; data-ke-mobilestyle=&quot;widthContent&quot; onerror=&quot;this.onerror=null; this.src='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png'; this.srcset='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png';&quot;/&gt;&lt;/span&gt;&lt;figcaption&gt;Feature-Based Knowledge&lt;/figcaption&gt;
&lt;/figure&gt;
&lt;/p&gt;
&lt;p&gt;&amp;nbsp;&lt;/p&gt;
&lt;p&gt;&lt;b&gt;&lt;span style=&quot;color: #006dd7;&quot;&gt;Relation-Based Knowledge&lt;/span&gt;&lt;/b&gt;&lt;/p&gt;
&lt;p&gt;&lt;span style=&quot;color: #9d9d9d;&quot;&gt;This approach is not very intuitive to me. Judging from the experiments section, it does not seem to be used often either.&lt;/span&gt;&lt;/p&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;Learns the relations between features as the knowledge&lt;/li&gt;
&lt;li&gt;e.g. the inner product between features from two layers&lt;/li&gt;
&lt;/ul&gt;
&lt;h3 data-ke-size=&quot;size23&quot;&gt;&amp;nbsp;&lt;/h3&gt;
&lt;h3 data-ke-size=&quot;size23&quot;&gt;&lt;span style=&quot;color: #009a87;&quot;&gt;Distillation Scheme&lt;/span&gt;&lt;/h3&gt;
&lt;p&gt;There are also three schemes for training the Teacher and the Student.&lt;/p&gt;
&lt;p&gt;&amp;nbsp;&lt;/p&gt;
&lt;p&gt;&lt;b&gt;&lt;span style=&quot;color: #006dd7;&quot;&gt;Offline&lt;/span&gt;&lt;/b&gt; distillation pretrains the Teacher model first and then applies KD to the Student. (2-phase)&lt;/p&gt;
&lt;p&gt;&lt;b&gt;&lt;span style=&quot;color: #006dd7;&quot;&gt;Online&lt;/span&gt;&lt;/b&gt; distillation trains the Teacher and the Student simultaneously, or alternates between them. (1-phase?)&lt;/p&gt;
&lt;p&gt;&lt;b&gt;&lt;span style=&quot;color: #006dd7;&quot;&gt;Self&lt;/span&gt;&lt;/b&gt; Distillation transfers the representations of deeper layers into shallower layers.&lt;/p&gt;
&lt;p&gt;&amp;nbsp;&lt;/p&gt;
&lt;p&gt;&lt;figure class=&quot;imageblock alignLeft&quot; data-filename=&quot;스크린샷 2021-04-28 오전 1.09.39.png&quot; data-origin-width=&quot;694&quot; data-origin-height=&quot;636&quot; width=&quot;288&quot; height=&quot;NaN&quot; data-ke-mobilestyle=&quot;widthContent&quot;&gt;&lt;span data-url=&quot;https://blog.kakaocdn.net/dn/z8yxH/btq3DssX1uA/8YqNHOI5MyY3DMr9BQReQ1/img.png&quot; data-phocus=&quot;https://blog.kakaocdn.net/dn/z8yxH/btq3DssX1uA/8YqNHOI5MyY3DMr9BQReQ1/img.png&quot;&gt;&lt;img src=&quot;https://blog.kakaocdn.net/dn/z8yxH/btq3DssX1uA/8YqNHOI5MyY3DMr9BQReQ1/img.png&quot; srcset=&quot;https://img1.daumcdn.net/thumb/R1280x0/?scode=mtistory2&amp;fname=https%3A%2F%2Fblog.kakaocdn.net%2Fdn%2Fz8yxH%2Fbtq3DssX1uA%2F8YqNHOI5MyY3DMr9BQReQ1%2Fimg.png&quot; data-filename=&quot;스크린샷 2021-04-28 오전 1.09.39.png&quot; data-origin-width=&quot;694&quot; data-origin-height=&quot;636&quot; width=&quot;288&quot; height=&quot;NaN&quot; data-ke-mobilestyle=&quot;widthContent&quot; onerror=&quot;this.onerror=null; this.src='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png'; this.srcset='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png';&quot;/&gt;&lt;/span&gt;&lt;/figure&gt;
&lt;/p&gt;
&lt;p&gt;&amp;nbsp;&lt;/p&gt;
&lt;h3 data-ke-size=&quot;size23&quot;&gt;&lt;span style=&quot;color: #009a87;&quot;&gt;Teacher-Student Architecture&lt;/span&gt;&lt;/h3&gt;
&lt;p&gt;When making the Student smaller than the Teacher,&lt;/p&gt;
&lt;p&gt;you have to decide whether to shrink the depth/width, use fewer layers,&lt;/p&gt;
&lt;p&gt;or restrict the precision (quantization), among other choices.&lt;/p&gt;
&lt;p&gt;&lt;span style=&quot;color: #9d9d9d;&quot;&gt;NAS (Neural Architecture Search) techniques are sometimes used here...&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;figure class=&quot;imageblock alignLeft&quot; data-filename=&quot;스크린샷 2021-04-28 오전 1.17.33.png&quot; data-origin-width=&quot;672&quot; data-origin-height=&quot;466&quot; width=&quot;294&quot; height=&quot;NaN&quot; data-ke-mobilestyle=&quot;widthContent&quot;&gt;&lt;span data-url=&quot;https://blog.kakaocdn.net/dn/bnffUZ/btq3yKn6Co3/2kbvf3yF1CtVjytSqNHkvK/img.png&quot; data-phocus=&quot;https://blog.kakaocdn.net/dn/bnffUZ/btq3yKn6Co3/2kbvf3yF1CtVjytSqNHkvK/img.png&quot;&gt;&lt;img src=&quot;https://blog.kakaocdn.net/dn/bnffUZ/btq3yKn6Co3/2kbvf3yF1CtVjytSqNHkvK/img.png&quot; srcset=&quot;https://img1.daumcdn.net/thumb/R1280x0/?scode=mtistory2&amp;fname=https%3A%2F%2Fblog.kakaocdn.net%2Fdn%2FbnffUZ%2Fbtq3yKn6Co3%2F2kbvf3yF1CtVjytSqNHkvK%2Fimg.png&quot; data-filename=&quot;스크린샷 2021-04-28 오전 1.17.33.png&quot; data-origin-width=&quot;672&quot; data-origin-height=&quot;466&quot; width=&quot;294&quot; height=&quot;NaN&quot; data-ke-mobilestyle=&quot;widthContent&quot; onerror=&quot;this.onerror=null; this.src='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png'; this.srcset='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png';&quot;/&gt;&lt;/span&gt;&lt;/figure&gt;
&lt;/p&gt;
&lt;p&gt;&amp;nbsp;&lt;/p&gt;
&lt;h3 data-ke-size=&quot;size23&quot;&gt;&lt;span style=&quot;color: #009a87;&quot;&gt;Distillation Algorithm&lt;/span&gt;&lt;/h3&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;The simplest approach? As already mentioned,
&lt;ul style=&quot;list-style-type: disc;&quot;&gt;
&lt;li&gt;it is to &lt;span style=&quot;color: #006dd7;&quot;&gt;train by directly matching the knowledge of the Teacher and the Student&lt;/span&gt;.
&lt;ul style=&quot;list-style-type: disc;&quot;&gt;
&lt;li&gt;Whether comparing responses or features...&lt;/li&gt;
&lt;li&gt;For a classification problem, with a CE or KL-divergence loss...&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;Many other techniques are also being explored.
&lt;ul style=&quot;list-style-type: disc;&quot;&gt;
&lt;li&gt;Generate synthetic data (hard examples) with a GAN and train on it&lt;/li&gt;
&lt;li&gt;&lt;span style=&quot;color: #006dd7;&quot;&gt;Train with multiple Teachers&lt;/span&gt;
&lt;ul style=&quot;list-style-type: disc;&quot;&gt;
&lt;li&gt;Train on each pair (T_1~N, S), or&lt;/li&gt;
&lt;li&gt;average the Teachers' outputs and train against the averaged logits&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;Data-Free Distillation: performing KD without data&lt;/li&gt;
&lt;li&gt;&lt;span style=&quot;color: #006dd7;&quot;&gt;Quantization Distillation&lt;/span&gt;
&lt;ul style=&quot;list-style-type: disc;&quot;&gt;
&lt;li&gt;Lower the precision of the network&lt;/li&gt;
&lt;li&gt;High precision teacher &amp;amp; Low precision student&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ul&gt;
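&lt;p&gt;The two multi-teacher variants listed above can be contrasted with a small sketch. Squared error on the logits is used here only as a simple stand-in for the matching loss, and all function names are hypothetical.&lt;/p&gt;

```python
def mse(a, b):
    """Mean squared error between two equally sized logit vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


def average_logits(teacher_logits):
    """Element-wise mean of the logits produced by several teachers."""
    count = len(teacher_logits)
    return [sum(column) / count for column in zip(*teacher_logits)]


def per_pair_loss(teacher_logits, student_logits):
    """Variant 1: average the loss over every (teacher, student) pair."""
    losses = [mse(t, student_logits) for t in teacher_logits]
    return sum(losses) / len(losses)


def averaged_teacher_loss(teacher_logits, student_logits):
    """Variant 2: compare the student against the averaged teacher logits."""
    return mse(average_logits(teacher_logits), student_logits)


teachers = [[4.0, 1.0, 0.0], [2.0, 3.0, 0.0], [3.0, 2.0, 0.0]]
student = [3.0, 2.0, 0.0]

averaged = average_logits(teachers)
pair_loss = per_pair_loss(teachers, student)
avg_loss = averaged_teacher_loss(teachers, student)
```

&lt;p&gt;The per-pair loss stays positive whenever the Teachers disagree with each other, while the averaged variant only asks the Student to match their consensus.&lt;/p&gt;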
&lt;h3 data-ke-size=&quot;size23&quot;&gt;&lt;span style=&quot;color: #009a87;&quot;&gt;Applications&lt;/span&gt;&lt;/h3&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;Use BERT's CLS token embedding as a hint and transfer it to the Student
&lt;ul style=&quot;list-style-type: disc;&quot;&gt;
&lt;li&gt;Can be useful for sentence classification, matching, and MRC&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;KD with a two-stage transformer
&lt;ul style=&quot;list-style-type: disc;&quot;&gt;
&lt;li&gt;TinyBERT&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;BERT =&amp;gt; BiLSTM&lt;/li&gt;
&lt;/ul&gt;</description>
      <category>DL&amp;ML/papers</category>
      <category>Knowledge Distillation</category>
      <author>식피두</author>
      <guid isPermaLink="true">https://aimaster.tistory.com/96</guid>
      <comments>https://aimaster.tistory.com/96#entry96comment</comments>
      <pubDate>Wed, 28 Apr 2021 01:30:59 +0900</pubDate>
    </item>
    <item>
      <title>Indicators for judging whether model training is going well</title>
      <link>https://aimaster.tistory.com/95</link>
      <description>&lt;p&gt;The&lt;b&gt;&lt;span style=&quot;color: #009a87;&quot;&gt; parameter norm&lt;/span&gt;&lt;/b&gt; and the &lt;b&gt;&lt;span style=&quot;color: #009a87;&quot;&gt;gradient norm&lt;/span&gt;&lt;/b&gt; can be used to check whether model training is progressing well. &lt;span style=&quot;color: #9d9d9d;&quot;&gt;(I learned this from a lecture by Kim Ki-hyun (김기현)...)&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&amp;nbsp;&lt;/p&gt;
&lt;p&gt;In general(?)&lt;/p&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;&lt;b&gt;&lt;span style=&quot;color: #ee2323;&quot;&gt;parameter norm&lt;/span&gt;(L2)&lt;/b&gt; should grow as training progresses.
&lt;ul style=&quot;list-style-type: disc;&quot;&gt;
&lt;li&gt;as the model becomes more complex...&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;&lt;b&gt;&lt;span style=&quot;color: #ee2323;&quot;&gt;gradient norm&lt;/span&gt;(L2)&lt;/b&gt; should gradually shrink.
&lt;ul style=&quot;list-style-type: disc;&quot;&gt;
&lt;li&gt;A large grad norm? It means the model is still learning a lot. It gets smaller as training progresses.&lt;/li&gt;
&lt;li&gt;Early in training the model gets more things wrong, and the more it gets wrong, the steeper the gradient.&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;pre id=&quot;code_1619454595133&quot; class=&quot;python&quot; data-ke-language=&quot;python&quot; data-ke-type=&quot;codeblock&quot;&gt;&lt;code&gt;import torch


@torch.no_grad()
def get_grad_norm(parameters, norm_type=2):
    # Only parameters that actually received a gradient contribute.
    parameters = [p for p in parameters if p.grad is not None]

    total_norm = 0

    try:
        for p in parameters:
            # abs() keeps the sum valid for odd norm_type as well.
            total_norm += (p.grad.abs() ** norm_type).sum()
        total_norm = total_norm ** (1. / norm_type)
    except Exception as e:
        print(e)

    return total_norm


@torch.no_grad()
def get_parameter_norm(parameters, norm_type=2):
    total_norm = 0

    try:
        for p in parameters:
            # Same accumulation as above, over the parameter values themselves.
            total_norm += (p.abs() ** norm_type).sum()
        total_norm = total_norm ** (1. / norm_type)
    except Exception as e:
        print(e)

    return total_norm&lt;/code&gt;&lt;/pre&gt;</description>
      <category>DL&amp;ML/code.data.tips</category>
      <author>식피두</author>
      <guid isPermaLink="true">https://aimaster.tistory.com/95</guid>
      <comments>https://aimaster.tistory.com/95#entry95comment</comments>
      <pubDate>Tue, 27 Apr 2021 01:45:32 +0900</pubDate>
    </item>
    <item>
      <title>ArcFace Loss</title>
      <link>https://aimaster.tistory.com/93</link>
      <description>&lt;p&gt;I have been working on a task of finding similar images and similar texts,&lt;/p&gt;
&lt;p&gt;and needed a way to&lt;b&gt; learn embeddings that represent the input well&lt;/b&gt;. (to use for clustering...)&lt;/p&gt;
&lt;p&gt;&amp;nbsp;&lt;/p&gt;
&lt;p&gt;I had heard of ArcFace before, but did not really know how it actually works.&lt;/p&gt;
&lt;p&gt;I had vaguely heard that it is metric learning, and the only metric learning method I knew of&lt;/p&gt;
&lt;p&gt;was triplet loss...?&lt;/p&gt;
&lt;p&gt;&amp;nbsp;&lt;/p&gt;
&lt;p&gt;Since triplet loss was all I knew, I assumed ArcFace would work and be implemented in a similar way,&lt;/p&gt;
&lt;p&gt;and that assumption made the code take a long time to understand.&lt;/p&gt;
&lt;p&gt;&amp;nbsp;&lt;/p&gt;
&lt;p&gt;As the code below shows, unlike triplet loss,&lt;/p&gt;
&lt;p&gt;it does not take several items to compare (anchor, positive, negative) as input;&lt;/p&gt;
&lt;p&gt;it &lt;span style=&quot;color: #006dd7;&quot;&gt;expects a single input (+ its ground-truth label)&lt;/span&gt;.&lt;/p&gt;
&lt;p&gt;&amp;nbsp;&lt;/p&gt;
&lt;p&gt;Code source (&lt;a href=&quot;https://github.com/wujiyang/Face_Pytorch/blob/master/margin/ArcMarginProduct.py&quot; target=&quot;_blank&quot; rel=&quot;noopener&quot;&gt;github.com/wujiyang/Face_Pytorch/blob/master/margin/ArcMarginProduct.py&lt;/a&gt;)&lt;/p&gt;
&lt;pre id=&quot;code_1619280241079&quot; class=&quot;python&quot; data-ke-language=&quot;python&quot; data-ke-type=&quot;codeblock&quot;&gt;&lt;code&gt;import math
import torch
from torch import nn
from torch.nn import Parameter
import torch.nn.functional as F

class ArcMarginProduct(nn.Module):
    def __init__(self, in_feature=128, out_feature=10575, s=32.0, m=0.50, easy_margin=False):
        super(ArcMarginProduct, self).__init__()
        self.in_feature = in_feature
        self.out_feature = out_feature
        self.s = s
        self.m = m
        self.weight = Parameter(torch.Tensor(out_feature, in_feature))
        nn.init.xavier_uniform_(self.weight)

        self.easy_margin = easy_margin
        self.cos_m = math.cos(m)
        self.sin_m = math.sin(m)

        # make the function cos(theta+m) monotonic decreasing while theta in [0&amp;deg;,180&amp;deg;]
        self.th = math.cos(math.pi - m)
        self.mm = math.sin(math.pi - m) * m

    def forward(self, x, label):
        # cos(theta)
        cosine = F.linear(F.normalize(x), F.normalize(self.weight))
        # cos(theta + m)
        sine = torch.sqrt(1.0 - torch.pow(cosine, 2))
        phi = cosine * self.cos_m - sine * self.sin_m

        if self.easy_margin:
            phi = torch.where(cosine &amp;gt; 0, phi, cosine)
        else:
            phi = torch.where((cosine - self.th) &amp;gt; 0, phi, cosine - self.mm)
        
        one_hot = torch.zeros_like(cosine)
        one_hot.scatter_(1, label.view(-1, 1), 1)
        output = (one_hot * phi) + ((1.0 - one_hot) * cosine)
        output = output * self.s

        return output&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&amp;nbsp;&lt;/p&gt;
&lt;p&gt;To state the conclusion first,&lt;/p&gt;
&lt;p&gt;ArcFace is used &lt;span style=&quot;color: #ee2323;&quot;&gt;when training a &lt;b&gt;classification&lt;/b&gt; problem&lt;/span&gt; &lt;span style=&quot;color: #9d9d9d;&quot;&gt;(&lt;b&gt;out_feature&lt;/b&gt; in the code above is the &lt;b&gt;number&lt;/b&gt; &lt;b&gt;of classes&lt;/b&gt;)&lt;/span&gt;,&lt;/p&gt;
&lt;p&gt;and as a by-product of classification training it learns &lt;span style=&quot;color: #006dd7;&quot;&gt;embeddings that are clearly separated between classes and tightly clustered within a class&lt;/span&gt;.&lt;/p&gt;
&lt;p&gt;&amp;nbsp;&lt;/p&gt;
&lt;p&gt;So, to get meaningful representations (embeddings) of face images with ArcFace,&lt;/p&gt;
&lt;p&gt;you prepare multiple images for each of the classes A, B, C ... Z (person names or IDs)&lt;/p&gt;
&lt;p&gt;and train a classification model that combines ArcFace with softmax.&lt;/p&gt;
&lt;p&gt;If only the embeddings are needed after training,&amp;nbsp;&lt;span style=&quot;color: #006dd7;&quot;&gt;detach the classification layer and use the embeddings alone&lt;/span&gt;.&lt;/p&gt;
&lt;p&gt;&amp;nbsp;&lt;/p&gt;
&lt;h3 data-ke-size=&quot;size23&quot;&gt;&lt;span style=&quot;color: #009a87;&quot;&gt;How ArcFace works&lt;/span&gt;&lt;/h3&gt;
&lt;p&gt;&lt;figure class=&quot;imageblock alignLeft&quot; data-filename=&quot;스크린샷 2021-04-25 오전 1.30.29.png&quot; data-origin-width=&quot;2392&quot; data-origin-height=&quot;822&quot; data-ke-mobilestyle=&quot;widthContent&quot;&gt;&lt;span data-url=&quot;https://blog.kakaocdn.net/dn/c9VD2d/btq3l39fFMl/z10IAqEepA3EkvkizocjS1/img.png&quot; data-phocus=&quot;https://blog.kakaocdn.net/dn/c9VD2d/btq3l39fFMl/z10IAqEepA3EkvkizocjS1/img.png&quot;&gt;&lt;img src=&quot;https://blog.kakaocdn.net/dn/c9VD2d/btq3l39fFMl/z10IAqEepA3EkvkizocjS1/img.png&quot; srcset=&quot;https://img1.daumcdn.net/thumb/R1280x0/?scode=mtistory2&amp;fname=https%3A%2F%2Fblog.kakaocdn.net%2Fdn%2Fc9VD2d%2Fbtq3l39fFMl%2Fz10IAqEepA3EkvkizocjS1%2Fimg.png&quot; data-filename=&quot;스크린샷 2021-04-25 오전 1.30.29.png&quot; data-origin-width=&quot;2392&quot; data-origin-height=&quot;822&quot; data-ke-mobilestyle=&quot;widthContent&quot; onerror=&quot;this.onerror=null; this.src='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png'; this.srcset='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png';&quot;/&gt;&lt;/span&gt;&lt;/figure&gt;
&lt;/p&gt;
&lt;p&gt;&amp;nbsp;&lt;/p&gt;
&lt;p&gt;To understand it quickly, &lt;span style=&quot;color: #ee2323;&quot;&gt;&lt;span style=&quot;color: #000000;&quot;&gt;it helps to grasp&lt;/span&gt; the following points&lt;/span&gt;.&lt;br /&gt;&lt;span style=&quot;color: #666666;&quot;&gt;(I was missing these, which is why it took me so long to understand)&lt;/span&gt;&lt;/p&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
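&lt;li&gt;There is one vector representing each class,&lt;span style=&quot;color: #006dd7;&quot;&gt; as many as the number of classes&lt;/span&gt;, and the angle is obtained from the inner product with the vector produced for an input
&lt;ul style=&quot;list-style-type: disc;&quot;&gt;
&lt;li&gt;Training minimizes the angle to the ground-truth class&lt;/li&gt;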
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;When a classification model is trained with softmax (turning the outputs into a probability distribution) combined with CrossEntropy,
&lt;ul style=&quot;list-style-type: disc;&quot;&gt;
&lt;li&gt;it tries to &lt;span style=&quot;color: #006dd7;&quot;&gt;maximize the probability at the ground-truth position&lt;/span&gt; (push it toward 1.0).
&lt;ul style=&quot;list-style-type: disc;&quot;&gt;
&lt;li&gt;The key point is that the &lt;span style=&quot;color: #006dd7;&quot;&gt;input to softmax is cosine(theta)&lt;/span&gt;
&lt;ul style=&quot;list-style-type: disc;&quot;&gt;
&lt;li&gt;As the angle (theta) approaches 0 (the prediction matches the answer exactly), the cosine grows (approaches 1)&lt;/li&gt;
&lt;li&gt;So as CrossEntropy &lt;span style=&quot;color: #ee2323;&quot;&gt;maximizes&lt;/span&gt; the ground-truth position,
&lt;ul style=&quot;list-style-type: disc;&quot;&gt;
&lt;li&gt;the angle (theta) between an input's embedding and its ground-truth class embedding is &lt;span style=&quot;color: #ee2323;&quot;&gt;minimized&lt;/span&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
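&lt;li&gt;What adding m (margin) to theta means
&lt;ul style=&quot;list-style-type: disc;&quot;&gt;
&lt;li&gt;The angle between the input's embedding and the ground-truth class embedding is treated as slightly larger than the value actually computed
&lt;ul style=&quot;list-style-type: disc;&quot;&gt;
&lt;li&gt;The enlarged angle gets minimized anyway during CrossEntropy optimization, so the embedding is pulled even closer to its class vector&lt;/li&gt;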
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&amp;nbsp;&lt;/p&gt;
&lt;p&gt;Now read the paper's&lt;b&gt; &lt;span style=&quot;color: #006dd7;&quot;&gt;figure (Figure 2) description side by side with the code&lt;/span&gt;&lt;/b&gt;, line by line, and it should be easy to follow.&lt;/p&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;&lt;span style=&quot;color: #9d9d9d;&quot;&gt;self.weight is a matrix of (input dimension x number of classes) values, i.e. one vector per class&lt;/span&gt;&lt;/li&gt;
&lt;li&gt;&lt;span style=&quot;color: #9d9d9d;&quot;&gt;input and self.weight are each normalized so that they lie on a sphere of radius 1&lt;/span&gt;&lt;/li&gt;
&lt;li&gt;&lt;span style=&quot;color: #9d9d9d;&quot;&gt;Why compute sine?&lt;/span&gt;
&lt;ul style=&quot;list-style-type: disc;&quot;&gt;
&lt;li&gt;&lt;span style=&quot;color: #9d9d9d;&quot;&gt;Angle addition formula: &lt;b&gt;cos(x + y) = cosx * cosy - sinx * siny&lt;/b&gt;&lt;/span&gt;&lt;/li&gt;
&lt;li&gt;&lt;span style=&quot;color: #9d9d9d;&quot;&gt;sine has to be computed in advance in order to get cos(theta + m)&lt;/span&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;&lt;span style=&quot;color: #9d9d9d;&quot;&gt;How is sine obtained?&lt;/span&gt;
&lt;ul style=&quot;list-style-type: disc;&quot;&gt;
&lt;li&gt;&lt;span style=&quot;color: #9d9d9d;&quot;&gt;From the Pythagorean identity &lt;b&gt;cosine squared + sine squared = 1&lt;/b&gt;, so sine = sqrt(1 - cosine squared)&lt;/span&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;&lt;span style=&quot;color: #9d9d9d;&quot;&gt;What is easy_margin?&lt;/span&gt;
&lt;ul style=&quot;list-style-type: disc;&quot;&gt;
&lt;li&gt;&lt;span style=&quot;color: #9d9d9d;&quot;&gt;&lt;a style=&quot;letter-spacing: 0px; color: #9d9d9d;&quot; href=&quot;https://github.com/ronghuaiyang/arcface-pytorch/issues/2&quot;&gt;github.com/ronghuaiyang/arcface-pytorch/issues/2&lt;/a&gt;&lt;span style=&quot;letter-spacing: 0px;&quot;&gt;&amp;nbsp;(see this issue)&lt;/span&gt;&lt;/span&gt;
&lt;ul style=&quot;list-style-type: disc;&quot;&gt;
&lt;li&gt;&lt;span style=&quot;letter-spacing: 0px; color: #9d9d9d;&quot;&gt;the purpose of the easy margin is not to consider the theta + m &amp;gt; pi&lt;/span&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
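&lt;li&gt;&lt;span style=&quot;letter-spacing: 0px; color: #9d9d9d;&quot;&gt;Note that the arccos shown in the figure does not need to appear in the actual implementation. Why there is no need to apply arccos, the inverse of cosine, to cos theta and convert it back to theta is left for you to think through.&lt;/span&gt;&lt;/li&gt;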
&lt;/ul&gt;
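&lt;p&gt;The points above can be put together into a minimal PyTorch sketch. The function name arc_margin_logits and the values s=30.0 and m=0.5 are assumptions made for illustration, not taken from the code in the screenshot, and the easy_margin handling of theta + m going past pi is left out.&lt;/p&gt;

```python
import math

import torch
import torch.nn.functional as F


def arc_margin_logits(embeddings, class_weight, labels, s=30.0, m=0.5):
    """Hypothetical helper: s * cos(theta + m) for the target class, s * cos(theta) elsewhere."""
    # cos(theta): dot product between unit-length embeddings and unit-length class vectors
    cosine = F.linear(F.normalize(embeddings), F.normalize(class_weight))
    # sin(theta) from cos^2 + sin^2 = 1 (theta lies in [0, pi], so sine is non-negative)
    sine = torch.sqrt((1.0 - cosine.pow(2)).clamp(min=0.0))
    # cos(theta + m) = cos(theta) * cos(m) - sin(theta) * sin(m)
    phi = cosine * math.cos(m) - sine * math.sin(m)
    # apply the margin only at the ground-truth position
    one_hot = F.one_hot(labels, num_classes=class_weight.size(0)).to(cosine.dtype)
    return s * (one_hot * phi + (1.0 - one_hot) * cosine)


torch.manual_seed(0)
emb = torch.randn(4, 8)        # 4 inputs, 8-dimensional embeddings
weight = torch.randn(3, 8)     # one vector per class (3 classes)
labels = torch.tensor([0, 2, 1, 0])

logits = arc_margin_logits(emb, weight, labels)
loss = F.cross_entropy(logits, labels)
```

&lt;p&gt;Only the ground-truth column receives the margin; the other columns keep the plain cos(theta), so CrossEntropy has to shrink theta further to reach the same probability.&lt;/p&gt;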
&lt;p&gt;&amp;nbsp;&lt;/p&gt;
&lt;h3 data-ke-size=&quot;size23&quot;&gt;References&lt;span style=&quot;letter-spacing: 0px;&quot;&gt;&lt;/span&gt;&lt;/h3&gt;
&lt;figure id=&quot;og_1619279927375&quot; contenteditable=&quot;false&quot; data-ke-type=&quot;opengraph&quot; data-og-type=&quot;article&quot; data-og-title=&quot;ArcFace: Additive Angular Margin Loss for Deep Face Recognition(2019) review&quot; data-og-description=&quot;Face Recognition(얼굴 인식)분야에서 사용되는 Loss인 ArcFace loss에 대한 논문이다. 얼굴 인식 분야를 공부하는 것은 아니나, 다른 논문을 읽다가 loss function으로 ArcFace loss를 활용하는 논문이 있어 해당&quot; data-og-host=&quot;cumulu-s.tistory.com&quot; data-og-source-url=&quot;https://cumulu-s.tistory.com/9&quot; data-og-url=&quot;https://cumulu-s.tistory.com/9&quot; data-og-image=&quot;https://scrap.kakaocdn.net/dn/ckq5I9/hyJY69cb2r/twM1moBZzvHhzDrmOu4D01/img.png?width=405&amp;amp;height=446&amp;amp;face=0_0_405_446,https://scrap.kakaocdn.net/dn/CFNaW/hyJY6VDH3a/4HLPGpuKNSNIzXN5XlMka1/img.png?width=405&amp;amp;height=446&amp;amp;face=0_0_405_446,https://scrap.kakaocdn.net/dn/fybkz/hyJZaKwrY2/qVAewZpGpkuyQp6WmbB9Mk/img.png?width=1493&amp;amp;height=552&amp;amp;face=0_0_1493_552&quot;&gt;&lt;a href=&quot;https://cumulu-s.tistory.com/9&quot; target=&quot;_blank&quot; rel=&quot;noopener&quot; data-source-url=&quot;https://cumulu-s.tistory.com/9&quot;&gt;
&lt;div class=&quot;og-image&quot; style=&quot;background-image: url('https://scrap.kakaocdn.net/dn/ckq5I9/hyJY69cb2r/twM1moBZzvHhzDrmOu4D01/img.png?width=405&amp;amp;height=446&amp;amp;face=0_0_405_446,https://scrap.kakaocdn.net/dn/CFNaW/hyJY6VDH3a/4HLPGpuKNSNIzXN5XlMka1/img.png?width=405&amp;amp;height=446&amp;amp;face=0_0_405_446,https://scrap.kakaocdn.net/dn/fybkz/hyJZaKwrY2/qVAewZpGpkuyQp6WmbB9Mk/img.png?width=1493&amp;amp;height=552&amp;amp;face=0_0_1493_552');&quot;&gt;&amp;nbsp;&lt;/div&gt;
&lt;div class=&quot;og-text&quot;&gt;
&lt;p class=&quot;og-title&quot;&gt;ArcFace: Additive Angular Margin Loss for Deep Face Recognition(2019) review&lt;/p&gt;
&lt;p class=&quot;og-desc&quot;&gt;Face Recognition(얼굴 인식)분야에서 사용되는 Loss인 ArcFace loss에 대한 논문이다. 얼굴 인식 분야를 공부하는 것은 아니나, 다른 논문을 읽다가 loss function으로 ArcFace loss를 활용하는 논문이 있어 해당&lt;/p&gt;
&lt;p class=&quot;og-host&quot;&gt;cumulu-s.tistory.com&lt;/p&gt;
&lt;/div&gt;
&lt;/a&gt;&lt;/figure&gt;
&lt;figure id=&quot;og_1619279933359&quot; contenteditable=&quot;false&quot; data-ke-type=&quot;opengraph&quot; data-og-type=&quot;website&quot; data-og-title=&quot;ArcFace: Additive Angular Margin Loss for Deep Face Recognition.&quot; data-og-description=&quot;각도의 경우 rotaion-invariant, scale-invarint 속성이 보장된다.&quot; data-og-host=&quot;norman3.github.io&quot; data-og-source-url=&quot;https://norman3.github.io/papers/docs/arcface.html&quot; data-og-url=&quot;https://norman3.github.io/papers/docs/arcface.html&quot; data-og-image=&quot;https://scrap.kakaocdn.net/dn/5qEVW/hyJZfSy67f/zMG9FAjAgrEyGaKI2SUHCk/img.png?width=2690&amp;amp;height=472&amp;amp;face=0_0_2690_472,https://scrap.kakaocdn.net/dn/cpmG4G/hyJZeF9gYo/cZDxirBoJKVSdpR0UZkUkK/img.png?width=1090&amp;amp;height=738&amp;amp;face=0_0_1090_738,https://scrap.kakaocdn.net/dn/DitWW/hyJZi2RY9K/RZvXJdmxTqqzDubpNh44MK/img.png?width=1194&amp;amp;height=632&amp;amp;face=0_0_1194_632&quot;&gt;&lt;a href=&quot;https://norman3.github.io/papers/docs/arcface.html&quot; target=&quot;_blank&quot; rel=&quot;noopener&quot; data-source-url=&quot;https://norman3.github.io/papers/docs/arcface.html&quot;&gt;
&lt;div class=&quot;og-image&quot; style=&quot;background-image: url('https://scrap.kakaocdn.net/dn/5qEVW/hyJZfSy67f/zMG9FAjAgrEyGaKI2SUHCk/img.png?width=2690&amp;amp;height=472&amp;amp;face=0_0_2690_472,https://scrap.kakaocdn.net/dn/cpmG4G/hyJZeF9gYo/cZDxirBoJKVSdpR0UZkUkK/img.png?width=1090&amp;amp;height=738&amp;amp;face=0_0_1090_738,https://scrap.kakaocdn.net/dn/DitWW/hyJZi2RY9K/RZvXJdmxTqqzDubpNh44MK/img.png?width=1194&amp;amp;height=632&amp;amp;face=0_0_1194_632');&quot;&gt;&amp;nbsp;&lt;/div&gt;
&lt;div class=&quot;og-text&quot;&gt;
&lt;p class=&quot;og-title&quot;&gt;ArcFace: Additive Angular Margin Loss for Deep Face Recognition.&lt;/p&gt;
&lt;p class=&quot;og-desc&quot;&gt;각도의 경우 rotaion-invariant, scale-invarint 속성이 보장된다.&lt;/p&gt;
&lt;p class=&quot;og-host&quot;&gt;norman3.github.io&lt;/p&gt;
&lt;/div&gt;
&lt;/a&gt;&lt;/figure&gt;
&lt;figure id=&quot;og_1619411932312&quot; contenteditable=&quot;false&quot; data-ke-type=&quot;opengraph&quot; data-og-type=&quot;article&quot; data-og-title=&quot;Metric Learning 이란 - 학습 방법(Loss)&quot; data-og-description=&quot;*크롬으로 보시는 걸 추천드립니다* 본 &amp;quot;Metric Learning 이란 - 학습 방법(Loss)&amp;quot;를 보시기 전에 &amp;nbsp;1) Metric Learning 이란 - 기본 &amp;nbsp;2) [논문요약] Deep Face Recognition : A Survey - ① 탄 순서로 먼저 보시..&quot; data-og-host=&quot;kmhana.tistory.com&quot; data-og-source-url=&quot;https://kmhana.tistory.com/17&quot; data-og-url=&quot;https://kmhana.tistory.com/17&quot; data-og-image=&quot;https://scrap.kakaocdn.net/dn/9Fudm/hyJ0qy8HDi/kJEeOtdc9GdiJ8zx9fzUw0/img.png?width=800&amp;amp;height=338&amp;amp;face=0_0_800_338,https://scrap.kakaocdn.net/dn/bwUSCc/hyJZe77Lkm/gvvcbwNC1FVtk8VY7xQz71/img.png?width=800&amp;amp;height=338&amp;amp;face=0_0_800_338,https://scrap.kakaocdn.net/dn/dvHXRt/hyJ0rLzs5p/fjo4kpmCrUloWbkzC6afFk/img.png?width=1184&amp;amp;height=501&amp;amp;face=0_0_1184_501&quot;&gt;&lt;a href=&quot;https://kmhana.tistory.com/17&quot; target=&quot;_blank&quot; rel=&quot;noopener&quot; data-source-url=&quot;https://kmhana.tistory.com/17&quot;&gt;
&lt;div class=&quot;og-image&quot; style=&quot;background-image: url('https://scrap.kakaocdn.net/dn/9Fudm/hyJ0qy8HDi/kJEeOtdc9GdiJ8zx9fzUw0/img.png?width=800&amp;amp;height=338&amp;amp;face=0_0_800_338,https://scrap.kakaocdn.net/dn/bwUSCc/hyJZe77Lkm/gvvcbwNC1FVtk8VY7xQz71/img.png?width=800&amp;amp;height=338&amp;amp;face=0_0_800_338,https://scrap.kakaocdn.net/dn/dvHXRt/hyJ0rLzs5p/fjo4kpmCrUloWbkzC6afFk/img.png?width=1184&amp;amp;height=501&amp;amp;face=0_0_1184_501');&quot;&gt;&amp;nbsp;&lt;/div&gt;
&lt;div class=&quot;og-text&quot;&gt;
&lt;p class=&quot;og-title&quot;&gt;Metric Learning 이란 - 학습 방법(Loss)&lt;/p&gt;
&lt;p class=&quot;og-desc&quot;&gt;*크롬으로 보시는 걸 추천드립니다* 본 &quot;Metric Learning 이란 - 학습 방법(Loss)&quot;를 보시기 전에 &amp;nbsp;1) Metric Learning 이란 - 기본 &amp;nbsp;2) [논문요약] Deep Face Recognition : A Survey - ① 탄 순서로 먼저 보시..&lt;/p&gt;
&lt;p class=&quot;og-host&quot;&gt;kmhana.tistory.com&lt;/p&gt;
&lt;/div&gt;
&lt;/a&gt;&lt;/figure&gt;
&lt;figure id=&quot;og_1619412453733&quot; contenteditable=&quot;false&quot; data-ke-type=&quot;opengraph&quot; data-og-type=&quot;article&quot; data-og-title=&quot;메트릭러닝 기반 안경 검색 서비스  개발기(2)&quot; data-og-description=&quot;본 글은 AI 가상피팅 기반 안경쇼핑앱 &amp;lsquo;라운즈&amp;rsquo;에 최근 추가된 안경 검색 서비스 &amp;lsquo;Glass Finder&amp;rsquo;의 개발기를 공유하고자 작성된 글입니다. 지난 1부에서는 메트릭 러닝 기반 안경 검색 프로젝트&quot; data-og-host=&quot;blog.est.ai&quot; data-og-source-url=&quot;https://blog.est.ai/2020/02/%EB%A9%94%ED%8A%B8%EB%A6%AD%EB%9F%AC%EB%8B%9D-%EA%B8%B0%EB%B0%98-%EC%95%88%EA%B2%BD-%EA%B2%80%EC%83%89-%EC%84%9C%EB%B9%84%EC%8A%A4-%EA%B0%9C%EB%B0%9C%EA%B8%B02/&quot; data-og-url=&quot;https://blog.est.ai/2020/02/%eb%a9%94%ed%8a%b8%eb%a6%ad%eb%9f%ac%eb%8b%9d-%ea%b8%b0%eb%b0%98-%ec%95%88%ea%b2%bd-%ea%b2%80%ec%83%89-%ec%84%9c%eb%b9%84%ec%8a%a4-%ea%b0%9c%eb%b0%9c%ea%b8%b02/&quot; data-og-image=&quot;https://scrap.kakaocdn.net/dn/32SLJ/hyJY7uaO0J/mqW24huViC9IXIi25jqPd0/img.png?width=760&amp;amp;height=428&amp;amp;face=314_162_399_254,https://scrap.kakaocdn.net/dn/CyqIX/hyJZbXFppT/sfOqhVrXDzRQq9iZv9B3gK/img.png?width=941&amp;amp;height=793&amp;amp;face=0_0_941_793,https://scrap.kakaocdn.net/dn/bhVE0p/hyJ0x5StrL/N3OJFhX2L1mUJK56dJSDVK/img.png?width=760&amp;amp;height=428&amp;amp;face=314_162_399_254&quot;&gt;&lt;a href=&quot;https://blog.est.ai/2020/02/%EB%A9%94%ED%8A%B8%EB%A6%AD%EB%9F%AC%EB%8B%9D-%EA%B8%B0%EB%B0%98-%EC%95%88%EA%B2%BD-%EA%B2%80%EC%83%89-%EC%84%9C%EB%B9%84%EC%8A%A4-%EA%B0%9C%EB%B0%9C%EA%B8%B02/&quot; data-source-url=&quot;https://blog.est.ai/2020/02/%EB%A9%94%ED%8A%B8%EB%A6%AD%EB%9F%AC%EB%8B%9D-%EA%B8%B0%EB%B0%98-%EC%95%88%EA%B2%BD-%EA%B2%80%EC%83%89-%EC%84%9C%EB%B9%84%EC%8A%A4-%EA%B0%9C%EB%B0%9C%EA%B8%B02/&quot;&gt;
&lt;div class=&quot;og-image&quot; style=&quot;background-image: url('https://scrap.kakaocdn.net/dn/32SLJ/hyJY7uaO0J/mqW24huViC9IXIi25jqPd0/img.png?width=760&amp;amp;height=428&amp;amp;face=314_162_399_254,https://scrap.kakaocdn.net/dn/CyqIX/hyJZbXFppT/sfOqhVrXDzRQq9iZv9B3gK/img.png?width=941&amp;amp;height=793&amp;amp;face=0_0_941_793,https://scrap.kakaocdn.net/dn/bhVE0p/hyJ0x5StrL/N3OJFhX2L1mUJK56dJSDVK/img.png?width=760&amp;amp;height=428&amp;amp;face=314_162_399_254');&quot;&gt;&amp;nbsp;&lt;/div&gt;
&lt;div class=&quot;og-text&quot;&gt;
&lt;p class=&quot;og-title&quot;&gt;메트릭러닝 기반 안경 검색 서비스 개발기(2)&lt;/p&gt;
&lt;p class=&quot;og-desc&quot;&gt;본 글은 AI 가상피팅 기반 안경쇼핑앱 &amp;lsquo;라운즈&amp;rsquo;에 최근 추가된 안경 검색 서비스 &amp;lsquo;Glass Finder&amp;rsquo;의 개발기를 공유하고자 작성된 글입니다. 지난 1부에서는 메트릭 러닝 기반 안경 검색 프로젝트&lt;/p&gt;
&lt;p class=&quot;og-host&quot;&gt;blog.est.ai&lt;/p&gt;
&lt;/div&gt;
&lt;/a&gt;&lt;/figure&gt;</description>
      <category>DL&amp;amp;ML/concept</category>
      <category>arcface</category>
      <category>metric learning</category>
      <author>식피두</author>
      <guid isPermaLink="true">https://aimaster.tistory.com/93</guid>
      <comments>https://aimaster.tistory.com/93#entry93comment</comments>
      <pubDate>Sun, 25 Apr 2021 01:54:02 +0900</pubDate>
    </item>
    <item>
      <title>DistilBERT (a distilled version of BERT: smaller, faster, cheaper and lighter)</title>
      <link>https://aimaster.tistory.com/92</link>
      <description>&lt;p&gt;&lt;a href=&quot;https://arxiv.org/abs/1910.01108&quot; target=&quot;_blank&quot; rel=&quot;noopener&quot;&gt;arxiv.org/abs/1910.01108&lt;/a&gt;&lt;/p&gt;
&lt;figure id=&quot;og_1618978030337&quot; contenteditable=&quot;false&quot; data-ke-type=&quot;opengraph&quot; data-og-type=&quot;website&quot; data-og-title=&quot;DistilBERT, a distilled version of BERT: smaller, faster, cheaper and lighter&quot; data-og-description=&quot;As Transfer Learning from large-scale pre-trained models becomes more prevalent in Natural Language Processing (NLP), operating these large models in on-the-edge and/or under constrained computational training or inference budgets remains challenging. In t&quot; data-og-host=&quot;arxiv.org&quot; data-og-source-url=&quot;https://arxiv.org/abs/1910.01108&quot; data-og-url=&quot;https://arxiv.org/abs/1910.01108v4&quot; data-og-image=&quot;&quot;&gt;&lt;a href=&quot;https://arxiv.org/abs/1910.01108&quot; target=&quot;_blank&quot; rel=&quot;noopener&quot; data-source-url=&quot;https://arxiv.org/abs/1910.01108&quot;&gt;
&lt;div class=&quot;og-image&quot; style=&quot;background-image: url();&quot;&gt;&amp;nbsp;&lt;/div&gt;
&lt;div class=&quot;og-text&quot;&gt;
&lt;p class=&quot;og-title&quot;&gt;DistilBERT, a distilled version of BERT: smaller, faster, cheaper and lighter&lt;/p&gt;
&lt;p class=&quot;og-desc&quot;&gt;As Transfer Learning from large-scale pre-trained models becomes more prevalent in Natural Language Processing (NLP), operating these large models in on-the-edge and/or under constrained computational training or inference budgets remains challenging. In t&lt;/p&gt;
&lt;p class=&quot;og-host&quot;&gt;arxiv.org&lt;/p&gt;
&lt;/div&gt;
&lt;/a&gt;&lt;/figure&gt;
&lt;p&gt;I have been skimming through &lt;b&gt;Knowledge Distillation&lt;/b&gt; (KD), and it turns out KD was never far away.&lt;/p&gt;
&lt;p&gt;&amp;nbsp;&lt;/p&gt;
&lt;p&gt;DistilBERT is simply KD applied to BERT...&lt;/p&gt;
&lt;p&gt;&amp;nbsp;&lt;/p&gt;
&lt;p&gt;I am going through a survey paper on KD and plan to write that up separately.&lt;/p&gt;
&lt;p&gt;Before that, I was curious how KD had been used for BERT.&lt;/p&gt;
&lt;p&gt;&amp;nbsp;&lt;/p&gt;
&lt;p&gt;The important point here is that KD is not applied for one specific task (QA model, STS, ...);&lt;/p&gt;
&lt;p&gt;it is applied from the pre-training stage onward,&lt;/p&gt;
&lt;p&gt;so that a &lt;b&gt;General Purpose Language Representation Model&lt;/b&gt; can be built!&lt;/p&gt;
&lt;p&gt;&amp;nbsp;&lt;/p&gt;
&lt;p&gt;As a result, the model is 40% smaller than the original BERT,&lt;/p&gt;
&lt;p&gt;retains 97% of its NLU capability, and is 60% faster.&lt;/p&gt;
&lt;p&gt;This makes it a useful option when building latency-sensitive services.&lt;/p&gt;
&lt;h3 data-ke-size=&quot;size23&quot;&gt;&lt;span style=&quot;color: #009a87;&quot;&gt;Training Method (Knowledge Distillation)&lt;/span&gt;&lt;/h3&gt;
&lt;p&gt;KD is one of the&lt;b&gt; model compression techniques&lt;/b&gt;:&lt;/p&gt;
&lt;p&gt;a large model (&lt;span style=&quot;color: #ee2323;&quot;&gt;Teacher&lt;/span&gt;) and a small model (&lt;span style=&quot;color: #ee2323;&quot;&gt;Student&lt;/span&gt;) are set up so that the student learns to reproduce the teacher's behavior.&lt;/p&gt;
&lt;p&gt;&lt;span style=&quot;color: #9d9d9d;&quot;&gt;(cf. ALBERT compresses the model by factorizing the embedding into smaller matrices and sharing weights across layers)&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&amp;nbsp;&lt;/p&gt;
&lt;p&gt;The student learns the teacher's '&lt;b&gt;soft target probability&lt;/b&gt;'.&amp;nbsp;&lt;/p&gt;
&lt;p&gt;&amp;nbsp;&lt;/p&gt;
&lt;p&gt;&lt;figure class=&quot;imageblock alignLeft&quot; data-filename=&quot;스크린샷 2021-04-23 오전 12.11.30.png&quot; data-origin-width=&quot;256&quot; data-origin-height=&quot;32&quot; data-ke-mobilestyle=&quot;widthContent&quot;&gt;&lt;span data-url=&quot;https://blog.kakaocdn.net/dn/UZTzJ/btq3bVXSvRh/z2aQOn2MLboUFsALrCMFv0/img.png&quot; data-phocus=&quot;https://blog.kakaocdn.net/dn/UZTzJ/btq3bVXSvRh/z2aQOn2MLboUFsALrCMFv0/img.png&quot; data-alt=&quot;t_i is the teacher's output probability distribution&quot;&gt;&lt;img src=&quot;https://blog.kakaocdn.net/dn/UZTzJ/btq3bVXSvRh/z2aQOn2MLboUFsALrCMFv0/img.png&quot; srcset=&quot;https://img1.daumcdn.net/thumb/R1280x0/?scode=mtistory2&amp;fname=https%3A%2F%2Fblog.kakaocdn.net%2Fdn%2FUZTzJ%2Fbtq3bVXSvRh%2Fz2aQOn2MLboUFsALrCMFv0%2Fimg.png&quot; data-filename=&quot;스크린샷 2021-04-23 오전 12.11.30.png&quot; data-origin-width=&quot;256&quot; data-origin-height=&quot;32&quot; data-ke-mobilestyle=&quot;widthContent&quot; onerror=&quot;this.onerror=null; this.src='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png'; this.srcset='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png';&quot;/&gt;&lt;/span&gt;&lt;figcaption&gt;t_i is the teacher's output probability distribution&lt;/figcaption&gt;
&lt;/figure&gt;
&lt;/p&gt;
&lt;p&gt;&amp;nbsp;&lt;/p&gt;
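&lt;p&gt;Conventional training with CrossEntropy&lt;/p&gt;
&lt;p&gt;fits the model's predicted probability distribution to the one-hot ground-truth distribution (maximizing the probability at the ground-truth position).&lt;/p&gt;
&lt;p&gt;A model that fits the training set well outputs a distribution in which one class has a high probability and the rest are close to zero.&lt;/p&gt;
&lt;p&gt;The paper points out that &lt;span style=&quot;color: #006dd7;&quot;&gt;the part that reflects the model's generalization ability is precisely this &lt;b&gt;'near-zero'&lt;/b&gt; part&lt;/span&gt;, since some of the near-zero probabilities are larger than others.&lt;/p&gt;
&lt;p&gt;&amp;nbsp;&lt;/p&gt;
&lt;p&gt;Therefore, when distilling BERT,&lt;/p&gt;
&lt;p&gt;by&lt;b&gt; learning the teacher's output probability distribution itself&lt;/b&gt;,&lt;/p&gt;
&lt;p&gt;the student can also pick up signals that only models more complex than itself would be able to learn.&lt;/p&gt;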
&lt;p&gt;&amp;nbsp;&lt;/p&gt;
&lt;p&gt;&lt;figure class=&quot;imageblock alignLeft&quot; data-filename=&quot;스크린샷 2021-04-23 오전 12.11.50.png&quot; data-origin-width=&quot;214&quot; data-origin-height=&quot;50&quot; data-ke-mobilestyle=&quot;widthContent&quot;&gt;&lt;span data-url=&quot;https://blog.kakaocdn.net/dn/GzZZl/btq3cTLZX43/NwyQyk5xy3228RNullpKP0/img.png&quot; data-phocus=&quot;https://blog.kakaocdn.net/dn/GzZZl/btq3cTLZX43/NwyQyk5xy3228RNullpKP0/img.png&quot; data-alt=&quot;softmax-temperature&quot;&gt;&lt;img src=&quot;https://blog.kakaocdn.net/dn/GzZZl/btq3cTLZX43/NwyQyk5xy3228RNullpKP0/img.png&quot; srcset=&quot;https://img1.daumcdn.net/thumb/R1280x0/?scode=mtistory2&amp;fname=https%3A%2F%2Fblog.kakaocdn.net%2Fdn%2FGzZZl%2Fbtq3cTLZX43%2FNwyQyk5xy3228RNullpKP0%2Fimg.png&quot; data-filename=&quot;스크린샷 2021-04-23 오전 12.11.50.png&quot; data-origin-width=&quot;214&quot; data-origin-height=&quot;50&quot; data-ke-mobilestyle=&quot;widthContent&quot; onerror=&quot;this.onerror=null; this.src='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png'; this.srcset='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png';&quot;/&gt;&lt;/span&gt;&lt;figcaption&gt;softmax-temperature&lt;/figcaption&gt;
&lt;/figure&gt;
&lt;/p&gt;
&lt;p&gt;&amp;nbsp;&lt;/p&gt;
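&lt;p&gt;Introducing the temperature T makes it possible to control the smoothness of the distribution.&amp;nbsp;&lt;/p&gt;
&lt;p&gt;The same T is applied to both the student and the teacher during training; at inference it is set to 1, which recovers the standard softmax.&lt;/p&gt;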
&lt;p&gt;&amp;nbsp;&lt;/p&gt;
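&lt;p&gt;As a small illustration of the temperature, here is a sketch in PyTorch. The helper name softmax_with_temperature and the example logits are made up for this note.&lt;/p&gt;

```python
import torch


def softmax_with_temperature(logits, T=1.0):
    # p_i = exp(z_i / T) / sum_j exp(z_j / T); T = 1 is the ordinary softmax
    return torch.softmax(logits / T, dim=-1)


logits = torch.tensor([4.0, 1.0, 0.5])
p_sharp = softmax_with_temperature(logits, T=1.0)   # peaked distribution
p_smooth = softmax_with_temperature(logits, T=4.0)  # flatter distribution, same ranking
```

&lt;p&gt;A larger T flattens the distribution, so the small probabilities of the non-target classes become visible to the student while the ranking of the classes stays the same.&lt;/p&gt;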
&lt;h3 data-ke-size=&quot;size23&quot;&gt;&lt;span style=&quot;color: #009a87;&quot;&gt;Final Training Objective&lt;/span&gt;&lt;/h3&gt;
&lt;p&gt;Putting the above together, the final training objective is defined as follows.&lt;/p&gt;
&lt;p&gt;&amp;nbsp;&lt;/p&gt;
&lt;p&gt;&lt;b&gt;Final Training Objective&lt;/b&gt; = &lt;span style=&quot;color: #ee2323;&quot;&gt;Distillation Loss(CE) + Masked Language Modeling Loss + Cosine Embedding Loss&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&amp;nbsp;&lt;/p&gt;
&lt;p&gt;(The last term, the cosine embedding loss, helps align the directions of the student's and the teacher's hidden vectors.)&lt;/p&gt;
&lt;p&gt;&amp;nbsp;&lt;/p&gt;
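&lt;p&gt;The three terms can be sketched in PyTorch roughly as below. This is only an illustration: the function name distil_objective, the tensor shapes, T=2.0, the equal weighting of the three terms, and the use of -100 as the label for unmasked positions are all assumptions, not details from the paper or its released code.&lt;/p&gt;

```python
import torch
import torch.nn.functional as F


def distil_objective(student_logits, teacher_logits, mlm_labels,
                     student_hidden, teacher_hidden, T=2.0):
    """Hypothetical sketch: distillation CE + MLM loss + cosine embedding loss."""
    vocab = student_logits.size(-1)
    s = student_logits.reshape(-1, vocab)
    t = teacher_logits.reshape(-1, vocab)

    # (1) distillation loss: cross-entropy between the teacher's soft targets t_i
    #     and the student's probabilities s_i, both softened by the temperature T
    t_prob = F.softmax(t / T, dim=-1)
    s_logprob = F.log_softmax(s / T, dim=-1)
    loss_ce = -(t_prob * s_logprob).sum(dim=-1).mean()

    # (2) masked language modeling loss on the masked positions only
    #     (assumption: unmasked positions carry the label -100 and are ignored)
    loss_mlm = F.cross_entropy(s, mlm_labels.reshape(-1), ignore_index=-100)

    # (3) cosine embedding loss: pull student and teacher hidden vectors
    #     toward the same direction (target = 1 means "should be similar")
    dim = student_hidden.size(-1)
    sh = student_hidden.reshape(-1, dim)
    th = teacher_hidden.reshape(-1, dim)
    loss_cos = F.cosine_embedding_loss(sh, th, torch.ones(sh.size(0)))

    return loss_ce + loss_mlm + loss_cos


torch.manual_seed(0)
batch, seq_len, vocab_size, hidden = 2, 5, 11, 16
student_logits = torch.randn(batch, seq_len, vocab_size)
teacher_logits = torch.randn(batch, seq_len, vocab_size)
mlm_labels = torch.full((batch, seq_len), -100, dtype=torch.long)
mlm_labels[0, 1] = 3   # pretend these two positions were masked
mlm_labels[1, 4] = 7
student_hidden = torch.randn(batch, seq_len, hidden)
teacher_hidden = torch.randn(batch, seq_len, hidden)

loss = distil_objective(student_logits, teacher_logits, mlm_labels,
                        student_hidden, teacher_hidden)
```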
&lt;h3 data-ke-size=&quot;size23&quot;&gt;&lt;span style=&quot;color: #009a87;&quot;&gt;Detail&lt;/span&gt;&lt;/h3&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;The Student is BERT with the token-type embeddings and the pooler removed, and with the number of layers halved.&lt;/li&gt;
&lt;li&gt;The Student's weights are initialized from the Teacher's weights
&lt;ul style=&quot;list-style-type: disc;&quot;&gt;
&lt;li&gt;Since the number of layers is halved, one layer out of every two of the Teacher is taken&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;Batches contain up to 4K examples, and training uses dynamic masking without the NSP objective.&lt;/li&gt;
&lt;/ul&gt;</description>
      <category>DL&amp;amp;ML/papers</category>
      <category>KD</category>
      <category>Knowledge Distillation</category>
      <author>식피두</author>
      <guid isPermaLink="true">https://aimaster.tistory.com/92</guid>
      <comments>https://aimaster.tistory.com/92#entry92comment</comments>
      <pubDate>Fri, 23 Apr 2021 00:31:27 +0900</pubDate>
    </item>
    <item>
      <title>ELECTRA: Pre-training Text Encoders as Discriminators Rather Than Generators</title>
      <link>https://aimaster.tistory.com/91</link>
      <description>&lt;p&gt;&lt;a href=&quot;https://arxiv.org/abs/2003.10555&quot; target=&quot;_blank&quot; rel=&quot;noopener&quot;&gt;arxiv.org/abs/2003.10555&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;&amp;nbsp;&lt;/p&gt;
&lt;p&gt;Let's look at how the ELECTRA model is trained.&lt;/p&gt;
&lt;h3 data-ke-size=&quot;size23&quot;&gt;&lt;span style=&quot;color: #009a87;&quot;&gt;ELECTRA?&lt;/span&gt;&lt;/h3&gt;
&lt;p&gt;Masked Language Modeling (MLM) pre-training replaces part of the input with [MASK] tokens and trains the model to restore the original tokens.&lt;/p&gt;
&lt;p&gt;&amp;nbsp;&lt;/p&gt;
&lt;p&gt;ELECTRA's idea came from asking whether this is really efficient.&lt;/p&gt;
&lt;p&gt;&amp;nbsp;&lt;/p&gt;
&lt;p&gt;About 15% of the tokens are selected for masking, and the model learns to restore them to the original; since &lt;span style=&quot;color: #006dd7;&quot;&gt;only those 15% of the tokens contribute to learning&lt;/span&gt; in each example, this is not compute-efficient.&lt;/p&gt;
&lt;p&gt;&amp;nbsp;&lt;/p&gt;
&lt;p&gt;&lt;figure class=&quot;imageblock alignLeft&quot; data-filename=&quot;스크린샷 2021-04-16 오전 1.01.56.png&quot; data-origin-width=&quot;1122&quot; data-origin-height=&quot;494&quot; width=&quot;551&quot; height=&quot;NaN&quot; data-ke-mobilestyle=&quot;widthContent&quot;&gt;&lt;span data-url=&quot;https://blog.kakaocdn.net/dn/rEsdK/btq2IPPut9U/GQXfPR1lhhGXnWooiyjELK/img.png&quot; data-phocus=&quot;https://blog.kakaocdn.net/dn/rEsdK/btq2IPPut9U/GQXfPR1lhhGXnWooiyjELK/img.png&quot;&gt;&lt;img src=&quot;https://blog.kakaocdn.net/dn/rEsdK/btq2IPPut9U/GQXfPR1lhhGXnWooiyjELK/img.png&quot; srcset=&quot;https://img1.daumcdn.net/thumb/R1280x0/?scode=mtistory2&amp;fname=https%3A%2F%2Fblog.kakaocdn.net%2Fdn%2FrEsdK%2Fbtq2IPPut9U%2FGQXfPR1lhhGXnWooiyjELK%2Fimg.png&quot; data-filename=&quot;스크린샷 2021-04-16 오전 1.01.56.png&quot; data-origin-width=&quot;1122&quot; data-origin-height=&quot;494&quot; width=&quot;551&quot; height=&quot;NaN&quot; data-ke-mobilestyle=&quot;widthContent&quot; onerror=&quot;this.onerror=null; this.src='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png'; this.srcset='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png';&quot;/&gt;&lt;/span&gt;&lt;/figure&gt;
&lt;/p&gt;
&lt;p&gt;&amp;nbsp;&lt;/p&gt;
&lt;p&gt;ELECTRA proposes a pre-training method called &lt;span style=&quot;color: #ee2323;&quot;&gt;Replaced Token Detection (RTD)&lt;/span&gt;.&lt;/p&gt;
&lt;p&gt;Instead of simply masking a token with [MASK],&amp;nbsp;&lt;span style=&quot;color: #006dd7;&quot;&gt;it is replaced with a plausible word!&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&amp;nbsp;&lt;/p&gt;
&lt;p&gt;A &lt;b&gt;small (&lt;/b&gt;&lt;span style=&quot;color: #006dd7;&quot;&gt;presumably kept small so that it makes plausible mistakes?&lt;/span&gt;&lt;b&gt;)&lt;/b&gt; MLM is used as the &lt;span style=&quot;color: #ee2323;&quot;&gt;Generator&lt;/span&gt;;&lt;/p&gt;
&lt;p&gt;given an input containing [MASK] tokens, it outputs the input with those positions filled in by plausible words,&lt;/p&gt;
&lt;p&gt;and this output is fed to the &lt;span style=&quot;color: #ee2323;&quot;&gt;Discriminator&lt;/span&gt;, which classifies,&amp;nbsp;'for every token', &lt;span style=&quot;color: #333333;&quot;&gt;whether it was replaced or not&lt;/span&gt;.&lt;/p&gt;
&lt;p&gt;&amp;nbsp;&lt;/p&gt;
&lt;p&gt;With conventional MLM, the [MASK] token used during pre-training&lt;/p&gt;
&lt;p&gt;never appears when the model is fine-tuned on downstream tasks,&lt;/p&gt;
&lt;p&gt;so there is a mismatch in the network inputs; ELECTRA does not have this problem.&lt;/p&gt;
&lt;p&gt;&amp;nbsp;&lt;/p&gt;
&lt;p&gt;In any case, the advantage of RTD is that&lt;/p&gt;
&lt;p&gt;the model &lt;span style=&quot;color: #006dd7;&quot;&gt;can learn&lt;/span&gt; knowledge (quickly and efficiently)&amp;nbsp;&lt;span style=&quot;color: #006dd7;&quot;&gt;from all input tokens&lt;/span&gt;.&lt;/p&gt;
&lt;p&gt;&amp;nbsp;&lt;/p&gt;
&lt;h3 data-ke-size=&quot;size23&quot;&gt;&lt;span style=&quot;color: #009a87;&quot;&gt;Generator&lt;/span&gt;&lt;/h3&gt;
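&lt;p&gt;The generator is trained with MLM: for a given input x=[x1, x2, ... , xn],&lt;/p&gt;
&lt;p&gt;ceil(n*0.15) positions are replaced with [MASK], and it learns to predict what the masked tokens originally were.&lt;/p&gt;
&lt;p&gt;The sentence in which the masked positions are filled in with the generator's predictions is called the corrupted example.&lt;/p&gt;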
&lt;p&gt;&lt;figure class=&quot;imageblock alignLeft&quot; data-filename=&quot;스크린샷 2021-04-16 오전 1.23.22.png&quot; data-origin-width=&quot;834&quot; data-origin-height=&quot;92&quot; width=&quot;417&quot; height=&quot;NaN&quot; data-ke-mobilestyle=&quot;widthContent&quot;&gt;&lt;span data-url=&quot;https://blog.kakaocdn.net/dn/b5Ad9K/btq2IQHDGVC/RN7oNer8uOsglQMyx1Ms91/img.png&quot; data-phocus=&quot;https://blog.kakaocdn.net/dn/b5Ad9K/btq2IQHDGVC/RN7oNer8uOsglQMyx1Ms91/img.png&quot;&gt;&lt;img src=&quot;https://blog.kakaocdn.net/dn/b5Ad9K/btq2IQHDGVC/RN7oNer8uOsglQMyx1Ms91/img.png&quot; srcset=&quot;https://img1.daumcdn.net/thumb/R1280x0/?scode=mtistory2&amp;fname=https%3A%2F%2Fblog.kakaocdn.net%2Fdn%2Fb5Ad9K%2Fbtq2IQHDGVC%2FRN7oNer8uOsglQMyx1Ms91%2Fimg.png&quot; data-filename=&quot;스크린샷 2021-04-16 오전 1.23.22.png&quot; data-origin-width=&quot;834&quot; data-origin-height=&quot;92&quot; width=&quot;417&quot; height=&quot;NaN&quot; data-ke-mobilestyle=&quot;widthContent&quot; onerror=&quot;this.onerror=null; this.src='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png'; this.srcset='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png';&quot;/&gt;&lt;/span&gt;&lt;/figure&gt;
&lt;/p&gt;
&lt;h3 data-ke-size=&quot;size23&quot;&gt;&lt;span style=&quot;color: #009a87;&quot;&gt;Discriminator&lt;/span&gt;&lt;/h3&gt;
&lt;p&gt;It takes the corrupted example produced by the generator as input and learns to classify, for each token, whether it was replaced or not.&lt;/p&gt;
&lt;p&gt;&amp;nbsp;&lt;/p&gt;
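&lt;p&gt;The data flow between the generator and the discriminator can be sketched as follows. Everything named here is hypothetical: corrupt_with_generator and toy_generator are made-up helpers, the "generator" just returns random logits, and the discriminator outputs are random stand-ins; only the masking of ceil(n*0.15) positions and the per-token replaced/original labels follow the description above.&lt;/p&gt;

```python
import math

import torch
import torch.nn.functional as F


def corrupt_with_generator(input_ids, generator_logits_fn, mask_id, mask_ratio=0.15):
    """Hypothetical helper: build the corrupted example and per-token labels for one sequence."""
    n = input_ids.size(0)
    k = math.ceil(n * mask_ratio)
    masked_positions = torch.randperm(n)[:k]          # choose k positions to mask

    masked_ids = input_ids.clone()
    masked_ids[masked_positions] = mask_id            # input seen by the generator

    logits = generator_logits_fn(masked_ids)          # shape (n, vocab)
    sampled = torch.distributions.Categorical(logits=logits[masked_positions]).sample()

    corrupted = input_ids.clone()
    corrupted[masked_positions] = sampled             # fill masks with generator samples

    # label 1 where the token differs from the original, 0 otherwise
    is_replaced = (corrupted != input_ids).long()
    return corrupted, is_replaced, masked_positions


torch.manual_seed(0)
vocab_size, mask_id = 50, 0
input_ids = torch.randint(1, vocab_size, (12,))


def toy_generator(ids):
    # stand-in for a small MLM: random logits over the vocabulary
    return torch.randn(ids.size(0), vocab_size)


corrupted, is_replaced, masked_positions = corrupt_with_generator(
    input_ids, toy_generator, mask_id
)

# The discriminator is trained on every token, not just the masked ones.
disc_logits = torch.randn(input_ids.size(0))          # stand-in discriminator outputs
disc_loss = F.binary_cross_entropy_with_logits(disc_logits, is_replaced.float())
```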
&lt;p&gt;The two loss functions are defined as follows.&lt;/p&gt;
&lt;p&gt;&lt;figure class=&quot;imageblock alignLeft&quot; data-filename=&quot;스크린샷 2021-04-16 오전 1.24.40.png&quot; data-origin-width=&quot;1146&quot; data-origin-height=&quot;182&quot; width=&quot;534&quot; data-ke-mobilestyle=&quot;widthContent&quot;&gt;&lt;span data-url=&quot;https://blog.kakaocdn.net/dn/dSelfo/btq2HtT0M3F/OBbelsFDcDO2yrXVbMDhd1/img.png&quot; data-phocus=&quot;https://blog.kakaocdn.net/dn/dSelfo/btq2HtT0M3F/OBbelsFDcDO2yrXVbMDhd1/img.png&quot;&gt;&lt;img src=&quot;https://blog.kakaocdn.net/dn/dSelfo/btq2HtT0M3F/OBbelsFDcDO2yrXVbMDhd1/img.png&quot; srcset=&quot;https://img1.daumcdn.net/thumb/R1280x0/?scode=mtistory2&amp;fname=https%3A%2F%2Fblog.kakaocdn.net%2Fdn%2FdSelfo%2Fbtq2HtT0M3F%2FOBbelsFDcDO2yrXVbMDhd1%2Fimg.png&quot; data-filename=&quot;스크린샷 2021-04-16 오전 1.24.40.png&quot; data-origin-width=&quot;1146&quot; data-origin-height=&quot;182&quot; width=&quot;534&quot; data-ke-mobilestyle=&quot;widthContent&quot; onerror=&quot;this.onerror=null; this.src='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png'; this.srcset='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png';&quot;/&gt;&lt;/span&gt;&lt;/figure&gt;
&lt;/p&gt;</description>
      <category>DL&amp;amp;ML/papers</category>
      <category>Electra</category>
      <author>식피두</author>
      <guid isPermaLink="true">https://aimaster.tistory.com/91</guid>
      <comments>https://aimaster.tistory.com/91#entry91comment</comments>
      <pubDate>Fri, 16 Apr 2021 01:28:19 +0900</pubDate>
    </item>
    <item>
      <title>Multi-Sample Dropout for Accelerated Training and Better Generalization</title>
      <link>https://aimaster.tistory.com/90</link>
      <description>&lt;p&gt;&lt;a href=&quot;https://arxiv.org/pdf/1905.09788.pdf&quot;&gt;https://arxiv.org/pdf/1905.09788.pdf&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;&amp;nbsp;&lt;/p&gt;
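&lt;p&gt;Among top-ranking Kaggle solutions for NLP competitions, a multi-sample dropout structure shows up now and then,&lt;/p&gt;
&lt;p&gt;used to improve the model's generalization ability.&lt;/p&gt;
&lt;p&gt;&amp;nbsp;&lt;/p&gt;
&lt;p&gt;There is a paper on it, so here is a summary of just the idea.&lt;/p&gt;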
&lt;p&gt;&amp;nbsp;&lt;/p&gt;
&lt;h4 data-ke-size=&quot;size20&quot;&gt;&lt;span style=&quot;color: #009a87;&quot;&gt;&lt;b&gt;A Reminder of What Dropout Does&lt;/b&gt;&lt;/span&gt;&lt;/h4&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li id=&quot;SE-c867db8e-6b34-4542-9918-d86d68d2d867&quot;&gt;&lt;span&gt;For example, 50% of the neurons are randomly dropped at every training iteration&lt;/span&gt;&lt;/li&gt;
&lt;li&gt;As a result, neurons are kept from depending on one another, which enables better generalization&lt;/li&gt;
&lt;li&gt;At inference, neurons are not randomly dropped as in training; instead, each neuron's output is multiplied by 0.5 (the keep probability).&lt;/li&gt;
&lt;/ul&gt;
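&lt;p&gt;The list above describes the original formulation, where outputs are scaled at inference. torch.nn.Dropout uses the equivalent "inverted" form: kept activations are scaled by 1 / (1 - p) during training, so nothing has to be scaled at inference. A quick check (the tensor of ones is just a toy input):&lt;/p&gt;

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
drop = nn.Dropout(p=0.5)
x = torch.ones(8)

drop.train()
y_train = drop(x)   # each element is either 0 (dropped) or 1 / (1 - 0.5) = 2.0 (kept)

drop.eval()
y_eval = drop(x)    # dropout is a no-op at inference
```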
&lt;p&gt;&amp;nbsp;&lt;/p&gt;
&lt;h3 data-ke-size=&quot;size23&quot;&gt;&lt;span style=&quot;color: #009a87;&quot;&gt;Multi-sample Dropout&lt;/span&gt;&lt;/h3&gt;
&lt;p&gt;&lt;figure class=&quot;imageblock alignLeft&quot; data-origin-width=&quot;0&quot; data-origin-height=&quot;0&quot; width=&quot;504&quot; height=&quot;NaN&quot; data-ke-mobilestyle=&quot;widthContent&quot;&gt;&lt;span data-url=&quot;https://blog.kakaocdn.net/dn/JsZGg/btq2C4nrm8c/6yudey0ffh3mkK6aVybIRK/img.png&quot; data-phocus=&quot;https://blog.kakaocdn.net/dn/JsZGg/btq2C4nrm8c/6yudey0ffh3mkK6aVybIRK/img.png&quot;&gt;&lt;img src=&quot;https://blog.kakaocdn.net/dn/JsZGg/btq2C4nrm8c/6yudey0ffh3mkK6aVybIRK/img.png&quot; srcset=&quot;https://img1.daumcdn.net/thumb/R1280x0/?scode=mtistory2&amp;fname=https%3A%2F%2Fblog.kakaocdn.net%2Fdn%2FJsZGg%2Fbtq2C4nrm8c%2F6yudey0ffh3mkK6aVybIRK%2Fimg.png&quot; data-origin-width=&quot;0&quot; data-origin-height=&quot;0&quot; width=&quot;504&quot; height=&quot;NaN&quot; data-ke-mobilestyle=&quot;widthContent&quot; onerror=&quot;this.onerror=null; this.src='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png'; this.srcset='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png';&quot;/&gt;&lt;/span&gt;&lt;/figure&gt;
&lt;/p&gt;
&lt;p&gt;&lt;span&gt;That is all there is to it. &lt;/span&gt;&lt;/p&gt;
&lt;p id=&quot;SE-b9cfef5b-0571-463a-aba4-700a8965b867&quot;&gt;&lt;span&gt;​&lt;/span&gt;&lt;/p&gt;
&lt;p id=&quot;SE-27f26595-973c-4e33-9361-ce2a26e2d6e1&quot;&gt;&lt;span&gt;Taking BERT fine-tuning as an example,&lt;/span&gt;&lt;/p&gt;
&lt;p id=&quot;SE-bea52a70-d91d-4d72-9616-4fe76b7ab05a&quot;&gt;&lt;span&gt;k dropouts are applied to BERT's output feature,&lt;/span&gt;&lt;/p&gt;
&lt;p id=&quot;SE-97fb85bf-dc7c-4509-bda8-617e97e20e14&quot;&gt;&lt;span&gt;the head for the downstream task is applied to each result to get the final outputs, a loss is computed for each of them, and the losses are averaged.&lt;/span&gt;&lt;/p&gt;
&lt;p id=&quot;SE-85d9145d-eee5-4110-a399-68bf83d3cb55&quot;&gt;&lt;span&gt;​&lt;/span&gt;&lt;/p&gt;
&lt;p id=&quot;SE-ab2889bd-a07a-4587-b36f-b7abe32cd9e9&quot;&gt;&lt;span&gt;The figure shows 2 dropout samples,&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;span&gt;but the paper tries up to 64 samples.&lt;/span&gt;&lt;/p&gt;
&lt;p id=&quot;SE-3bd96040-13b8-45bc-b593-bc6c752925d1&quot;&gt;&lt;span&gt;​&lt;/span&gt;&lt;/p&gt;
&lt;p id=&quot;SE-7ce7f8ff-a4ee-4257-b73e-323d26d7007b&quot;&gt;&lt;span&gt;Multi-sample dropout is said to &lt;span style=&quot;color: #006dd7;&quot;&gt;accelerate training&lt;/span&gt;&lt;/span&gt;&lt;span&gt; as well&lt;/span&gt;&lt;/p&gt;
&lt;p id=&quot;SE-3ff4a62f-9279-472b-b8db-9ae0d48be061&quot;&gt;&lt;span&gt;(each training iteration gets slower, but training as a whole gets faster).&lt;/span&gt;&lt;/p&gt;
&lt;p id=&quot;SE-77cb121b-b6db-42b0-923d-9e685361e264&quot;&gt;&lt;span&gt;The reason is that k samples are drawn from the same input by applying different dropout masks,&lt;/span&gt;&lt;/p&gt;
&lt;p id=&quot;SE-f83cb448-2b5e-4d18-a2c7-dacc3bae6370&quot;&gt;&lt;span&gt;which has the &lt;span style=&quot;color: #006dd7;&quot;&gt;effect of enlarging the mini-batch by a factor of k&lt;/span&gt;.&lt;/span&gt;&lt;/p&gt;
&lt;p id=&quot;SE-e17a5935-9e89-4eef-86f9-99ca005372fa&quot;&gt;&lt;span&gt;​&lt;/span&gt;&lt;/p&gt;
&lt;p id=&quot;SE-7ee089ee-bb16-4447-9d48-c5b0f55846cd&quot;&gt;&lt;span&gt;In other words, in the figure's example, &lt;span style=&quot;color: #006dd7;&quot;&gt;the inputs &amp;lt;A, B&amp;gt; are effectively trained on as the samples &amp;lt;A, A', B, B'&amp;gt;&lt;/span&gt;.&lt;/span&gt;&lt;/p&gt;
&lt;p id=&quot;SE-f493d1d3-e15f-430d-a416-597bb0ba5979&quot;&gt;&lt;span&gt;Of course, without dropout this turns into training on &amp;lt;A, A, B, B&amp;gt;:&lt;/span&gt;&lt;/p&gt;
&lt;p id=&quot;SE-5e1f07a8-3726-44ca-a4c1-932adc8389e5&quot;&gt;&lt;span&gt;the diversity between samples disappears, and applying multi-sample dropout no longer has any point.&lt;/span&gt;&lt;/p&gt;
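&lt;p&gt;&lt;span&gt;The &amp;lt;A, B&amp;gt; to &amp;lt;A, A', B, B'&amp;gt; picture can be sketched in a few lines of plain Python. The helper names and the toy vectors are hypothetical; the point is only that k masked copies behave like a k-times larger batch, and that with p=0 the copies collapse into exact duplicates.&lt;/span&gt;&lt;/p&gt;

```python
import random


def dropout(features, p, rng):
    """Inverted dropout: drop each element with probability p, rescale the rest."""
    keep = 1.0 - p
    return [x / keep if rng.random() >= p else 0.0 for x in features]


def expand_batch(batch, k, p, rng):
    """Return k independently masked copies of every example, kept adjacent."""
    return [dropout(x, p, rng) for x in batch for _ in range(k)]


rng = random.Random(0)
A = [1.0, 2.0, 3.0, 4.0]
B = [5.0, 6.0, 7.0, 8.0]

# With dropout: four masked rows that play the role of A, A', B, B'.
with_dropout = expand_batch([A, B], k=2, p=0.5, rng=rng)
# Without dropout: the copies are identical, i.e. A, A, B, B.
without_dropout = expand_batch([A, B], k=2, p=0.0, rng=rng)

print(len(with_dropout), without_dropout == [A, A, B, B])  # prints: 4 True
```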
&lt;p id=&quot;SE-7febb6b2-dce5-47ff-ba4f-0be50eb567ab&quot;&gt;&lt;span&gt;​&lt;/span&gt;&lt;/p&gt;
&lt;p id=&quot;SE-8147a1ac-5814-464e-a076-8b78a2b275ca&quot;&gt;&lt;span&gt;Intuitively, it is also said to have a &lt;/span&gt;&lt;span style=&quot;color: #ee2323;&quot;&gt;self-ensemble effect&lt;/span&gt;&lt;span&gt;.&lt;/span&gt;&lt;/p&gt;
&lt;p id=&quot;SE-ab400068-ae22-4eac-85a9-1fcc3085a832&quot;&gt;&lt;span&gt;​&lt;/span&gt;&lt;/p&gt;
&lt;p id=&quot;SE-89885797-aed2-4edb-8700-af9c0be9641f&quot;&gt;&lt;span&gt;In the experiments, a dropout sample count of &lt;b&gt;8 or 16&lt;/b&gt; comes out as a reasonable choice,&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;span&gt;though this will probably differ depending on your own setup!&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&amp;nbsp;&lt;/p&gt;
&lt;h3 data-ke-size=&quot;size23&quot;&gt;Example code&lt;/h3&gt;
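&lt;pre id=&quot;code_1618486122822&quot; class=&quot;python&quot; data-ke-language=&quot;python&quot; data-ke-type=&quot;codeblock&quot;&gt;&lt;code&gt;# Excerpt from the forward() of the model linked below.
outputs = self.roberta(
    input_ids,
    attention_mask=attention_mask,
    token_type_ids=token_type_ids,
    position_ids=position_ids,
    head_mask=head_mask,
    inputs_embeds=inputs_embeds,
)

# All hidden layers; assumes the encoder was configured to return hidden states.
hidden_layers = outputs[2]

# First-token vector of every layer, stacked along a new last axis.
cls_outputs = torch.stack(
    [self.dropout(layer[:, 0, :]) for layer in hidden_layers], dim=2
)
# Softmax-weighted sum over the layers, using learnable per-layer weights.
cls_output = (torch.softmax(self.layer_weights, dim=0) * cls_outputs).sum(-1)

# Multi-sample dropout: https://arxiv.org/abs/1905.09788
# Note: this averages the logits of 5 dropout samples; the paper averages per-sample losses.
logits = torch.mean(
    torch.stack(
        [self.classifier(self.high_dropout(cls_output)) for _ in range(5)],
        dim=0,
    ),
    dim=0,
)&lt;/code&gt;&lt;/pre&gt;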
&lt;p&gt;&lt;a href=&quot;https://github.com/oleg-yaroshevskiy/quest_qa_labeling/blob/master/step5_model3_roberta_code/model.py&quot; target=&quot;_blank&quot; rel=&quot;noopener&quot;&gt;github.com/oleg-yaroshevskiy/quest_qa_labeling/blob/master/step5_model3_roberta_code/model.py&lt;/a&gt;&lt;/p&gt;
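&lt;p&gt;&lt;span&gt;One detail worth noticing: the snippet above averages the logits of the 5 dropout samples before any loss is computed, whereas the paper averages the per-sample losses. The two are not the same number in general. Below is a small stdlib-only sketch with made-up logits (the function names are hypothetical). Since softmax cross-entropy is convex in the logits, the loss of the averaged logits can never exceed the average of the losses.&lt;/span&gt;&lt;/p&gt;

```python
import math


def cross_entropy(logits, label):
    """Numerically stable softmax cross-entropy for one example."""
    m = max(logits)
    return m + math.log(sum(math.exp(z - m) for z in logits)) - logits[label]


def mean_of_losses(logit_samples, label):
    """Paper-style: one loss per dropout sample, then average the losses."""
    return sum(cross_entropy(z, label) for z in logit_samples) / len(logit_samples)


def loss_of_mean_logits(logit_samples, label):
    """Snippet-style: average the logits across samples, then take one loss."""
    n = len(logit_samples)
    mean_logits = [sum(col) / n for col in zip(*logit_samples)]
    return cross_entropy(mean_logits, label)


# Hypothetical logits from two dropout samples of the same input (true label 0).
samples = [[2.0, 0.0], [0.0, 2.0]]
print(mean_of_losses(samples, 0))       # average of the two per-sample losses
print(loss_of_mean_logits(samples, 0))  # loss of the averaged logits [1.0, 1.0]
```

&lt;p&gt;&lt;span&gt;Which variant works better is an empirical question; the sketch only shows that they optimize different quantities.&lt;/span&gt;&lt;/p&gt;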
&lt;p&gt;&amp;nbsp;&lt;/p&gt;</description>
      <category>DL&amp;amp;ML/papers</category>
      <category>multi-sample dropout</category>
      <author>식피두</author>
      <guid isPermaLink="true">https://aimaster.tistory.com/90</guid>
      <comments>https://aimaster.tistory.com/90#entry90comment</comments>
      <pubDate>Thu, 15 Apr 2021 20:27:25 +0900</pubDate>
    </item>
  </channel>
</rss>