Sang-gil Lee

Email / CV / LinkedIn / Google Scholar / X (Twitter) / GitHub

I am a research scientist at NVIDIA. I work on deep generative models for sequences, with a particular focus on speech and audio.

I received my Ph.D. from the Data Science & AI Lab (DSAIL) at Seoul National University. During my Ph.D., I was a research intern at NVIDIA, advised by Wei Ping and Boris Ginsburg. Prior to that, I completed internships at Microsoft Research Asia, where I was advised by Xu Tan and Tao Qin (speech) and Bin Shao (bioinformatics). I received my B.S. in Electrical and Computer Engineering from Seoul National University.

Research

My research interests span a wide range of deep generative models (AR, flow, GAN, diffusion, etc.) applied to sequential data. Specifically, I am working on building multi-modal large language models with a focus on audio.

During my Ph.D., I focused on time-domain waveform data (speech and audio) to advance generative modeling for audio.

I am also broadly interested in speech and audio applications, including text-to-speech, voice conversion, music generation, neural audio codecs, and audio language models.

Representative papers are highlighted.

ETTA: Elucidating the Design Space of Text-to-Audio Models
Sang-gil Lee, Zhifeng Kong, Arushi Goel, Sungwon Kim, Rafael Valle, Bryan Catanzaro
arXiv preprint, 2024
Project Page / arXiv

ETTA is the first text-to-audio model with emergent abilities, capable of synthesizing entirely novel, imaginative sounds beyond the real world by leveraging large-scale synthetic audio captions (AF-Synthetic).

BigVGAN: A Universal Neural Vocoder with Large-Scale Training
Sang-gil Lee, Wei Ping, Boris Ginsburg, Bryan Catanzaro, Sungroh Yoon
International Conference on Learning Representations (ICLR), 2023
Project Page / Model / arXiv / Code / Demo

BigVGAN is a universal audio synthesizer that achieves unprecedented zero-shot performance on various unseen environments using anti-aliased periodic nonlinearity and large-scale training.

Low Frame-rate Speech Codec: a Codec Designed for Fast High-quality Speech LLM Training and Inference
Edresson Casanova, Ryan Langman, Paarth Neekhara, Shehzeen Hussain, Jason Li, Subhankar Ghosh, Ante Jukić, Sang-gil Lee
ICASSP, 2025
Project Page / Model / arXiv / Code

A neural audio codec that leverages finite scalar quantization and adversarial training with large speech language models to achieve high-quality audio compression with a 1.89 kbps bitrate and 21.5 frames per second.

Improving Text-To-Audio Models with Synthetic Captions
Zhifeng Kong, Sang-gil Lee, Deepanway Ghosal, Navonil Majumder, Ambuj Mehrish, Rafael Valle, Soujanya Poria, Bryan Catanzaro
Interspeech SynData4GenAI, 2024
Dataset / Model / arXiv

AF-AudioSet is a large-scale audio dataset featuring synthetic captions generated by Audio Flamingo, enabling significant improvements in text-to-audio models.

VoiceTailor: Lightweight Plug-In Adapter for Diffusion-Based Personalized Text-to-Speech
Heeseung Kim, Sang-gil Lee, Jiheum Yeom, Che Hyun Lee, Sungwon Kim, Sungroh Yoon
INTERSPEECH, 2024
Project Page / arXiv

VoiceTailor is a one-shot speaker-adaptive text-to-speech model, which proposes combining low-rank adapters to perform speaker adaptation in a parameter-efficient manner.

Edit-A-Video: Single Video Editing with Object-Aware Consistency
Chaehun Shin, Heeseung Kim, Che Hyun Lee, Sang-gil Lee, Sungroh Yoon
Asian Conference on Machine Learning (ACML), Best Paper Award, 2023
Project Page / arXiv

Edit-A-Video is a diffusion-based one-shot video editing model that solves a background inconsistency problem using a new sparse-causal mask blending method.

PriorGrad: Improving Conditional Denoising Diffusion Models with Data-Dependent Adaptive Prior
Sang-gil Lee, Heeseung Kim, Chaehun Shin, Xu Tan, Chang Liu, Qi Meng, Tao Qin, Wei Chen, Sungroh Yoon, Tie-Yan Liu
International Conference on Learning Representations (ICLR), 2022
Project Page / arXiv / Code / Poster

PriorGrad presents an efficient method for constructing a data-dependent non-standard Gaussian prior for training and sampling from diffusion models applied to speech synthesis.

NanoFlow: Scalable Normalizing Flows with Sublinear Parameter Complexity
Sang-gil Lee, Sungwon Kim, Sungroh Yoon
Neural Information Processing Systems (NeurIPS), 2020
arXiv / Code / Poster

NanoFlow uses a single neural network for multiple transformation stages in normalizing flows, which provides an efficient compression for flow-based generative models.

FloWaveNet: A Generative Flow for Raw Audio
Sungwon Kim, Sang-gil Lee, Jongyoon Song, Jaehyeon Kim, Sungroh Yoon
International Conference on Machine Learning (ICML), 2019
arXiv / Code / Demo / Poster

FloWaveNet is one of the first flow-based generative models for fast and parallel synthesis of audio waveforms, enabling a likelihood-based neural vocoder without any auxiliary loss.

One-Shot Learning for Text-to-SQL Generation
Dongjun Lee, Jaesik Yoon, Jongyoon Song, Sang-gil Lee, Sungroh Yoon
arXiv preprint, 2019
arXiv

Template-based one-shot text-to-SQL generative model based on a Candidate Search Network & Pointer Network.

Polyphonic Music Generation with Sequence Generative Adversarial Networks
Sang-gil Lee, Uiwon Hwang, Seonwoo Min, Sungroh Yoon
arXiv preprint, 2017
arXiv / Code

This work investigates an efficient musical word representation from polyphonic MIDI data for SeqGAN, simultaneously capturing chords and melodies with dynamic timings.

An Efficient Approach to Boosting Performance of Deep Spiking Network Training
Seongsik Park, Sang-gil Lee, Hyunha Nam, Sungroh Yoon
Neural Information Processing Systems (NIPS) Workshop on Computing with Spikes, 2016
arXiv

Investigates various initialization and backward control schemes of the membrane potential for training deep spiking networks.

Experience

Research Scientist @ NVIDIA
Jan 2024 - Current
In the Applied Deep Learning Research team, I am working on building multi-modal large language models with a focus on audio.
Sep 2021 - Jan 2022
As a research intern, I worked on improving neural vocoders for high quality speech and audio synthesis, advised by Wei Ping and Boris Ginsburg.

Senior Research Engineer @ Qualcomm AI Research
Feb 2023 - Jan 2024
I developed a framework for Text-to-Speech (TTS) research and development, optimized for deployment on edge devices.

Research Intern @ Microsoft Research Asia
Dec 2020 - May 2021
I worked on diffusion-based generative models for speech synthesis, advised by Xu Tan, Chang Liu, Qi Meng, and Tao Qin.
Dec 2018 - Feb 2019
I worked on the Antigen Map Project, where I applied sequence models to predict antigens from genetic sequences, advised by Bin Shao.

Research Intern @ Kakao Corporation
Jul 2019 - Sep 2019
I worked on improving speech synthesis and voice conversion models, advised by Jaehyeon Kim and Jaekyong Bae.

Education

Ph.D., Seoul National University
Electrical and Computer Engineering
Sep 2016 - Feb 2023

Dissertation: Deep Generative Model for Waveform Synthesis

Integrated M.S./Ph.D. Program. Advisor: Sungroh Yoon.

Dual B.S., Seoul National University
Electrical and Computer Engineering / Applied Biology and Chemistry
Mar 2010 - Aug 2016

Cum Laude

Projects

During my time at DSAIL, I collaborated with Seoul National University Hospital on a computer-aided diagnosis project for liver cancer. The project yielded a high-performance medical object detection model that helps reduce radiologists' errors in the early detection of liver disease.

Robust End-to-End Focal Liver Lesion Detection Using Unregistered Multiphase Computed Tomography Images
Sang-gil Lee*, Eunji Kim*, Jae Seok Bae*, Jung Hoon Kim, Sungroh Yoon
IEEE Transactions on Emerging Topics in Computational Intelligence (TETCI), 2021
arXiv / Code

GSSD++ provides robustness to unregistered multi-phase CT images for detecting liver lesions using attention-guided multi-phase alignment with deformable convolutions.

Liver Lesion Detection from Weakly-Labeled Multi-phase CT Volumes with a Grouped Single Shot MultiBox Detector
Sang-gil Lee, Jae Seok Bae, Hyunjae Kim, Jung Hoon Kim, Sungroh Yoon
International Conference on Medical Image Computing and Computer-Assisted Intervention (MICCAI), 2018
arXiv / Code

GSSD pioneers a focal liver lesion detection model from multi-phase CT images, which reflects a real-world clinical practice of radiologists.

Invited Talks, Honors, and Awards

Invited Talk "Deep Generative Model for Speech and Audio", Soongsil University, 2023
Invited Talk "Towards Universal Neural Waveform Synthesis", Naver, 2022
Invited Talk "On Neural Waveform Synthesis", Supertone, 2022
Invited Talk "Prior Enhancement for Deep Generative Models", Hyundai AIRS, 2022
Student Conference Scholarship, Google, 2022
Invited Talk "Neural Speech Synthesis: a 2021 Landscape", NVIDIA, 2021
Graduate Student of the Year, DSAIL, Seoul National University, 2019
Best Paper Award, Hyundai AIR Lab (currently AIRS), 2019
Stars of Tomorrow (Excellent Intern), Microsoft Research Asia, 2019
Invited Talk "RNN Plus Alpha: Is RNN the False Prophet?", Naver CLOVA, 2018
Cum Laude, Seoul National University, 2016
Academic Performance Scholarship, Seoul National University, 2010 - 2016
Academic Scholarship (fully funded), SBS Foundation, 2010 - 2016

Personal

I am a PC hardware enthusiast, always eager to learn about computers in my free time.

As a hobbyist DJ, I enjoy house music. My mixes are on YouTube.

Last update: Jan 2025.