| 
            Research
             
              My research interests span a wide range of deep generative models (AR, flow, GAN, diffusion,
              etc.) applied to sequential data. Specifically, I am working on building multi-modal large language models
              with a focus on audio.
  
              During my Ph.D., I focused on time-domain waveform data (speech and audio) to advance generative modeling for audio.
  
              I am also broadly interested in speech and audio applications, including text-to-speech, voice conversion, music generation, neural audio codecs, and audio language models.
  
              Representative papers are highlighted.
             
           | 
         
        
       
      
        
          
          
            | 
              
              
             | 
            
              
                ETTA: Elucidating the Design Space of Text-to-Audio Models
              
               
              Sang-gil Lee*,
              Zhifeng Kong*,
              Arushi Goel,
              Sungwon Kim,
              Rafael Valle,
              Bryan Catanzaro
               
              arXiv preprint, 2024
                Project Page /
              arXiv
              
              ETTA is the first text-to-audio model with emergent abilities, capable of synthesizing entirely novel, imaginative sounds beyond the real world by leveraging large-scale synthetic audio captions (AF-Synthetic). 
             | 
           
          
            | 
              
              
             | 
            
              
                BigVGAN: A Universal Neural Vocoder with Large-Scale Training
              
               
              Sang-gil Lee,
              Wei Ping,
              Boris Ginsburg,
              Bryan Catanzaro,
              Sungroh Yoon
               
              International Conference on Learning Representations (ICLR), 2023
                Project Page /
              Model  /
              arXiv /
              Code /
              Demo
              
              BigVGAN is a universal audio synthesizer that achieves unprecedented zero-shot performance on various unseen environments using anti-aliased periodic nonlinearity and large-scale training.
             | 
           
          
            | 
              
              
             | 
            
              
                Low Frame-rate Speech Codec: a Codec Designed for Fast High-quality Speech LLM Training and Inference
              
               
              Edresson Casanova,
              Ryan Langman,
              Paarth Neekhara,
              Shehzeen Hussain,
              Jason Li,
              Subhankar Ghosh,
              Ante Jukić,
              Sang-gil Lee
               
              ICASSP, 2025
                Project Page /
              Model /
              arXiv /
              Code
              
              A neural audio codec that leverages finite scalar quantization and adversarial training with large speech language models to achieve high-quality audio compression with a 1.89 kbps bitrate and 21.5 frames per second. 
             | 
           
          
            | 
              
              
             | 
            
              
                Improving Text-To-Audio Models with Synthetic Captions
              
               
              Zhifeng Kong*,
              Sang-gil Lee*,
              Deepanway Ghosal,
              Navonil Majumder,
              Ambuj Mehrish,
              Rafael Valle,
              Soujanya Poria,
              Bryan Catanzaro
               
              Interspeech SynData4GenAI, 2024
                Dataset /
              Model /
              arXiv
              
              AF-AudioSet is a large-scale audio dataset featuring synthetic captions generated by Audio Flamingo, enabling significant improvements in text-to-audio models. 
             | 
           
          
            | 
              
              
             | 
            
              
                VoiceTailor: Lightweight Plug-In Adapter for Diffusion-Based Personalized Text-to-Speech
              
               
              Heeseung Kim,
              Sang-gil Lee,
              Jiheum Yeom,      
              Che Hyun Lee,
              Sungwon Kim,
              Sungroh Yoon
               
              INTERSPEECH, 2024
                Project Page /
              arXiv
              
              VoiceTailor is a one-shot speaker-adaptive text-to-speech model that uses low-rank adapters to perform speaker adaptation in a parameter-efficient manner.
             | 
                
        
          | 
            
            
           | 
          
            
              Edit-A-Video: Single Video Editing with Object-Aware Consistency
            
             
            Chaehun Shin*,
            Heeseung Kim*,
            Che Hyun Lee,
            Sang-gil Lee,
            Sungroh Yoon
             
            Asian Conference on Machine Learning (ACML), Best Paper Award, 2023
              Project Page /
            arXiv
            
            Edit-A-Video is a diffusion-based one-shot video editing model that solves a background inconsistency problem using a new sparse-causal mask blending method.  
           | 
         
        
          | 
            
            
           | 
          
            
              PriorGrad: Improving Conditional Denoising Diffusion Models with Data-Dependent Adaptive Prior
              
            
             
            Sang-gil Lee,
            Heeseung Kim,
            Chaehun Shin,
            Xu Tan,
            Chang Liu,
            Qi Meng,
            Tao Qin,
            Wei Chen,
            Sungroh Yoon,
            Tie-Yan Liu
             
            International Conference on Learning Representations (ICLR), 2022
              Project Page /
            arXiv /
            Code /
            Poster
            
            PriorGrad presents an efficient method for constructing a data-dependent non-standard Gaussian prior for
              training and sampling from diffusion models applied to speech synthesis.  
           | 
         
        
          | 
            
            
           | 
          
            
              NanoFlow: Scalable Normalizing Flows with Sublinear Parameter Complexity
            
             
            Sang-gil Lee,
            Sungwon Kim,
            Sungroh Yoon
             
            Neural Information Processing Systems (NeurIPS), 2020
             
            arXiv /
            Code /
            Poster
            
            NanoFlow uses a single neural network for multiple transformation stages in normalizing flows, which provides efficient compression for flow-based generative models.
           | 
         
        
          | 
            
            
           | 
          
            
              FloWaveNet: A Generative Flow for Raw Audio
            
             
            Sungwon Kim,
            Sang-gil Lee,
            Jongyoon Song,
            Jaehyeon Kim,
            Sungroh Yoon
             
            International Conference on Machine Learning (ICML), 2019
             
            arXiv /
            Code /
            Demo /
            Poster
            
            FloWaveNet is one of the first flow-based generative models for fast and parallel synthesis of audio waveforms, enabling a likelihood-based neural vocoder without any auxiliary loss. 
           | 
         
        
          | 
            
            
           | 
          
            
              One-Shot Learning for Text-to-SQL Generation
            
             
            Dongjun Lee,
            Jaesik Yoon,
            Jongyoon Song,
            Sang-gil Lee,
            Sungroh Yoon
             
            arXiv preprint, 2019
             
            arXiv
            
            A template-based one-shot text-to-SQL generative model built on a Candidate Search Network and a Pointer Network.
           | 
         
        
          | 
            
            
           | 
          
            
              Polyphonic Music Generation with Sequence Generative Adversarial Networks
              
            
             
            Sang-gil Lee,
            Uiwon Hwang,
            Seonwoo Min,
            Sungroh Yoon
             
            arXiv preprint, 2017
             
            arXiv /
            Code
            
            This work investigates an efficient musical word representation from polyphonic MIDI data for SeqGAN, simultaneously capturing chords and melodies with dynamic timings. 
           | 
         
        
          | 
            
            
           | 
          
            
              An Efficient Approach to Boosting Performance of Deep Spiking Network Training
              
            
             
            Seongsik Park,
            Sang-gil Lee,
            Hyunha Nam,
            Sungroh Yoon
             
            Neural Information Processing Systems (NIPS) Workshop on Computing with Spikes, 2016
             
            arXiv
            
            This work investigates various initialization and backward control schemes for the membrane potential when training deep spiking networks.
           | 
         
        
       
       
       
      
      
        
        
          
          
            | 
              
             | 
            
                          
                    Research Scientist @ NVIDIA
                    
               
              Jan 2024 - Current
               
              In the Applied Deep Learning Research team, I am working on building multi-modal large language models with a focus on audio.
               
              Sep 2021 - Jan 2022
               
              As a research intern, I worked on improving neural vocoders for high-quality speech and audio synthesis, advised by
              Wei Ping and
              Boris Ginsburg.
             | 
           
          
          | 
            
           | 
          
            
                    Senior Research Engineer @ Qualcomm AI Research
                    
             
            Feb 2023 - Jan 2024
             
             
            I developed a framework for Text-to-Speech (TTS) research and development, optimized for deployment on edge devices.
           | 
         
          
            | 
              
             | 
            
              
                    Research Intern @ Microsoft Research Asia
                    
               
              Dec 2020 - May 2021
               
              I worked on diffusion-based generative models for speech synthesis, advised by
              Xu Tan,
              Chang Liu,
              Qi Meng, and
              Tao Qin.
               
              Dec 2018 - Feb 2019
               
              I worked on the Antigen Map Project,
              where I applied sequence models to predict antigens from genetic sequences, advised by
              Bin Shao.
             | 
           
          
            | 
              
             | 
            
              
                    Research Intern @ Kakao Corporation
                    
               
              Jul 2019 - Sep 2019
               
               
              I worked on improving speech synthesis and voice conversion models, advised by
              Jaehyeon Kim and Jaekyong Bae.
             | 
           
          
         
         
         
        
        
          
          
            | 
              
             | 
            
              
                    Ph.D. at Seoul National University
                    
               
              
                    Electrical and Computer Engineering
                    
               
              Sep 2016 - Feb 2023
              Dissertation: Deep Generative Model for Waveform Synthesis
              Integrated M.S./Ph.D. Program.   Advisor: Sungroh Yoon.
               
              
                    Dual B.S. at Seoul National University
                    
               
              
                    Electrical and Computer Engineering / Applied Biology and Chemistry
                    
               
              Mar 2010 - Aug 2016
               
              Cum Laude
             | 
           
          
         
         
         
        
          
          
            | 
              Projects
               
                During my time at DSAIL, I collaborated with Seoul National University Hospital on a computer-aided diagnosis project for liver cancer.
                The project yielded a high-performance medical object detection model that helps reduce radiologists' errors in the early detection of liver disease.
               
             | 
           
          
         
        
          
          
            | 
              
              
             | 
            
              
                Robust End-to-End Focal Liver Lesion Detection Using Unregistered Multiphase Computed Tomography Images
                
              
               
              Sang-gil Lee*,
              Eunji Kim*,
              Jae Seok Bae*,
              Jung Hoon Kim,
              Sungroh Yoon
               
              IEEE Transactions on Emerging Topics in Computational Intelligence (TETCI), 2021
               
              arXiv /
              Code
              
              GSSD++ provides robustness to unregistered multi-phase CT images for detecting liver lesions using attention-guided multi-phase alignment with deformable convolutions.
             | 
           
          
            | 
              
              
             | 
            
              
                Liver Lesion Detection from Weakly-Labeled Multi-phase CT Volumes with a Grouped Single Shot MultiBox Detector
                
              
               
              Sang-gil Lee,
              Jae Seok Bae,
              Hyunjae Kim,
              Jung Hoon Kim,
              Sungroh Yoon
               
              International Conference on Medical Image Computing and Computer-Assisted Intervention (MICCAI), 2018
               
              arXiv /
              Code
              
              GSSD pioneers a focal liver lesion detection model for multi-phase CT images, reflecting the real-world clinical practice of radiologists.
             | 
           
          
         
         
         
        
          
          
            | 
              Invited Talks, Honors, and Awards
             | 
           
          
         
        
          
          
            - Invited Talk "Deep Generative Model for Speech and Audio", Soongsil University, 2023
            
 
            - Invited Talk "Towards Universal Neural Waveform Synthesis", Naver, 2022
            
 
            - Invited Talk "On Neural Waveform Synthesis", Supertone, 2022
 
            - Invited Talk "Prior Enhancement for Deep Generative Models", Hyundai AIRS, 2022
            
 
            - Student Conference Scholarship, Google, 2022
 
            - Invited Talk "Neural Speech Synthesis: a 2021 Landscape", NVIDIA, 2021
            
 
            - Graduate Student of the Year, DSAIL, Seoul National University, 2019
            
 
            - Best Paper Award, Hyundai AIR Lab (currently AIRS), 2019
 
            - Stars of Tomorrow (Excellent Intern), Microsoft Research Asia, 2019
            
 
            - Invited Talk "RNN Plus Alpha: Is RNN the False Prophet?", Naver CLOVA, 2018
            
 
            - Cum Laude, Seoul National University, 2016
 
            - Academic Performance Scholarship, Seoul National University, 2010 - 2016
            
 
            - Academic Scholarship (fully funded), SBS Foundation, 2010 - 2016
            
 
           
          
         
        
        
          
          
              | 
            
              I am a PC hardware enthusiast, always eager to learn about computers in my free time.
               
               
              As a hobbyist DJ, I enjoy house music. My mixes are on YouTube.
             | 
           
          
         
        
          
          
            
               
              
                Last update: Jan 2025.
               
             | 
           
          
         
    
  
 
 |