Analysis and synthesis of audio with AI: from neurological disease to accented speech and music
In the modern era, new technology is opening opportunities to help various groups of people around the world. In this thesis, deep learning and audio processing are utilized to target the needs of, and develop specific applications for, patients with progressive neurological diseases, speakers of non-native English accents, and amateur and leisure musicians and music enjoyers. We propose a pipeline for automated assessment of oral diadochokinesis in neurological patients and analyse acoustic features across disease type, dysarthria type, and dysarthria severity. The results confirm some of the hypotheses about the manifestation of different dysarthria and disease types in speech while showing the major effect of dysarthria severity on oral diadochokinesis performance. In Text-to-Speech, we deal with converting a speaker’s accent into a different target accent while preserving their original speaker identity. This application could benefit minorities with non-native English accents by allowing them to customize the system’s speech output for better intelligibility. In the music generation domain, we propose and demonstrate a way of enhancing and augmenting music datasets, with the introduction of MidiCaps, a large-scale captioned MIDI dataset, and MusicBench, a music audio dataset with enhanced text captions to promote fine controllability. We utilize MusicBench to build Mustango, a controllable Text-to-Music generation system with a focus on music-specific commands of chords, beats, key, and tempo. Finally, we introduce SonicMaster, a text-controllable all-in-one music restoration and mastering model that we train with our proposed SonicMaster dataset.
Speaker’s profile

Jan is a PhD candidate in the Information Systems and Technology Design pillar of SUTD. He holds both BSc and MSc degrees in the Electronics and Communications programme, with a focus on Signal Processing and Multimedia Technology, from the Czech Technical University in Prague. His research interests are in audio processing, specifically music generation, accented speech synthesis, and speech analysis for medical purposes. Jan is interested in languages, history, and music, especially heavy metal.