Sang-gil Lee

Email  /  CV  /  LinkedIn  /  Google Scholar  /  X (Twitter)  /  GitHub

I am a research scientist at NVIDIA. I work on deep generative models for sequences, with a particular focus on speech and audio.

I received my Ph.D. from the Data Science & AI Lab (DSAIL) at Seoul National University. During my Ph.D., I was a research intern at NVIDIA, advised by Wei Ping and Boris Ginsburg. Prior to that, I completed internships at Microsoft Research Asia, advised by Xu Tan and Tao Qin (speech) and by Bin Shao (bioinformatics). I received my B.S. in Electrical and Computer Engineering from Seoul National University.



Research

My research interests span a wide range of deep generative models (AR, flow, GAN, diffusion, etc.) applied to sequential data. Currently, I am working on building multi-modal large language models with a focus on audio. During my Ph.D., I focused on advancing generative modeling of time-domain waveform data (speech and audio). I am also broadly interested in speech and audio applications, including text-to-speech, voice conversion, music generation, neural audio codecs, and audio language models. Representative papers are highlighted.

Edit-A-Video: Single Video Editing with Object-Aware Consistency
Chaehun Shin*, Heeseung Kim*, Che Hyun Lee, Sang-gil Lee, Sungroh Yoon
Asian Conference on Machine Learning (ACML), Best Paper Award, 2023
project page / arXiv

Edit-A-Video is a diffusion-based one-shot video editing model that solves a background inconsistency problem using a new sparse-causal mask blending method.

BigVGAN: A Universal Neural Vocoder with Large-Scale Training
Sang-gil Lee, Wei Ping, Boris Ginsburg, Bryan Catanzaro, Sungroh Yoon
International Conference on Learning Representations (ICLR), 2023
project page / arXiv / code / demo

BigVGAN is a universal audio synthesizer that achieves unprecedented zero-shot performance in various unseen environments using an anti-aliased periodic nonlinearity and large-scale training.

PriorGrad: Improving Conditional Denoising Diffusion Models with Data-Dependent Adaptive Prior
Sang-gil Lee, Heeseung Kim, Chaehun Shin, Xu Tan, Chang Liu, Qi Meng, Tao Qin, Wei Chen, Sungroh Yoon, Tie-Yan Liu
International Conference on Learning Representations (ICLR), 2022
project page / arXiv / code / poster

PriorGrad presents an efficient method for constructing a data-dependent non-standard Gaussian prior for training and sampling from diffusion models applied to speech synthesis.

NanoFlow: Scalable Normalizing Flows with Sublinear Parameter Complexity
Sang-gil Lee, Sungwon Kim, Sungroh Yoon
Neural Information Processing Systems (NeurIPS), 2020
arXiv / code / poster

NanoFlow uses a single neural network for multiple transformation stages in normalizing flows, providing efficient parameter compression for flow-based generative models.

FloWaveNet: A Generative Flow for Raw Audio
Sungwon Kim, Sang-gil Lee, Jongyoon Song, Jaehyeon Kim, Sungroh Yoon
International Conference on Machine Learning (ICML), 2019
arXiv / code / demo / poster

FloWaveNet is one of the first flow-based generative models for fast and parallel synthesis of audio waveforms, enabling a likelihood-based neural vocoder without any auxiliary loss.

One-Shot Learning for Text-to-SQL Generation
Dongjun Lee, Jaesik Yoon, Jongyoon Song, Sang-gil Lee, Sungroh Yoon
arXiv preprint, 2019
arXiv

A template-based one-shot text-to-SQL generation model based on a Candidate Search Network and a Pointer Network.

Polyphonic Music Generation with Sequence Generative Adversarial Networks
Sang-gil Lee, Uiwon Hwang, Seonwoo Min, Sungroh Yoon
arXiv preprint, 2017
arXiv / code

This work investigates an efficient musical word representation from polyphonic MIDI data for SeqGAN, simultaneously capturing chords and melodies with dynamic timings.

An Efficient Approach to Boosting Performance of Deep Spiking Network Training
Seongsik Park, Sang-gil Lee, Hyunha Nam, Sungroh Yoon
Neural Information Processing Systems (NIPS) Workshop on Computing with Spikes, 2016
arXiv

This work investigates various initialization and backward control schemes for the membrane potential in training deep spiking networks.



Experience
Research Scientist @ NVIDIA
Jan 2024 - Present
On the Applied Deep Learning Research team, I am working on building multi-modal large language models with a focus on audio.
Research Intern @ NVIDIA
Sep 2021 - Jan 2022
I worked on improving neural vocoders for high-quality speech and audio synthesis, advised by Wei Ping and Boris Ginsburg.
Senior Research Engineer @ Qualcomm AI Research
Feb 2023 - Jan 2024
I developed a framework for Text-to-Speech (TTS) research and development, optimized for deployment on edge devices.
Research Intern @ Microsoft Research Asia
Dec 2020 - May 2021
I worked on diffusion-based generative models for speech synthesis, advised by Xu Tan, Chang Liu, Qi Meng, and Tao Qin.
Dec 2018 - Feb 2019
I worked on the Antigen Map Project, where I applied sequence models to predict antigens from genetic sequences, advised by Bin Shao.
Research Intern @ Kakao Corporation
Jul 2019 - Sep 2019
I worked on improving speech synthesis and voice conversion models, advised by Jaehyeon Kim and Jaekyong Bae.


Education
Ph.D. at Seoul National University
Electrical and Computer Engineering
Sep 2016 - Feb 2023
  • Dissertation: Deep Generative Model for Waveform Synthesis
  • Integrated M.S./Ph.D. program. Advisor: Sungroh Yoon.

Dual B.S. at Seoul National University
Electrical and Computer Engineering / Applied Biology and Chemistry
Mar 2010 - Aug 2016
  • Cum Laude


Projects

During my time at DSAIL, I collaborated with Seoul National University Hospital on a computer-aided diagnosis project for liver cancer. The project yielded a high-performance medical object detection model that helps radiologists reduce errors in the early detection of liver disease.

Robust End-to-End Focal Liver Lesion Detection Using Unregistered Multiphase Computed Tomography Images
Sang-gil Lee*, Eunji Kim*, Jae Seok Bae*, Jung Hoon Kim, Sungroh Yoon
IEEE Transactions on Emerging Topics in Computational Intelligence (TETCI), 2021
arXiv / code

GSSD++ robustly detects liver lesions from unregistered multi-phase CT images using attention-guided multi-phase alignment with deformable convolutions.

Liver Lesion Detection from Weakly-Labeled Multi-phase CT Volumes with a Grouped Single Shot MultiBox Detector
Sang-gil Lee, Jae Seok Bae, Hyunjae Kim, Jung Hoon Kim, Sungroh Yoon
International Conference on Medical Image Computing and Computer-Assisted Intervention (MICCAI), 2018
arXiv / code

GSSD pioneers focal liver lesion detection from multi-phase CT images, reflecting the real-world clinical practice of radiologists.



Invited Talks, Honors, and Awards

Personal
I am a PC hardware enthusiast, always eager to learn about computers in my free time.

As a hobbyist DJ, I enjoy house music. My mixes are on YouTube.

Last update: Jan 2024. Template borrowed from here.