Voice conversion : alignment and mapping perspective (Record no. 30246)

000 -LEADER
fixed length control field nam a22 7a 4500
008 - FIXED-LENGTH DATA ELEMENTS--GENERAL INFORMATION
fixed length control field 210205b xxu||||| |||| 00| 0 eng d
082 ## - DEWEY DECIMAL CLASSIFICATION NUMBER
Classification number 621.3994
Item number SHA
100 ## - MAIN ENTRY--PERSONAL NAME
Personal name Shah, Nirmesh J.
245 ## - TITLE STATEMENT
Title Voice conversion : alignment and mapping perspective
260 ## - PUBLICATION, DISTRIBUTION, ETC. (IMPRINT)
Place of publication, distribution, etc Gandhinagar
Name of publisher, distributor, etc Dhirubhai Ambani Institute of Information and Communication Technology
Date of publication, distribution, etc 2019
300 ## - PHYSICAL DESCRIPTION
Extent xxvi, 207 p.
500 ## - GENERAL NOTE
General note Patil, Hemant A., Thesis supervisor
Student ID No. 201321009
Thesis (Ph.D.) - Dhirubhai Ambani Institute of Information and Communication Technology, Gandhinagar, 2019
520 ## - SUMMARY, ETC.
Summary, etc Understanding how a particular speaker produces speech, and mimicking that speaker's voice, is a difficult research problem due to the sophisticated mechanism involved in speech production. Voice Conversion (VC) is a technique that modifies the perceived speaker identity in a given speech utterance from a source speaker to a particular target speaker without changing the linguistic content. Building a stand-alone VC system consists of two stages, namely, training and testing. First, speaker-dependent features are extracted from both speakers' training data. These features are time-aligned and corresponding pairs are obtained. Then a mapping function is learned from these aligned feature pairs. Once training is done, during the testing stage, features are extracted from the source speaker's held-out data. These features are converted using the mapping function. The converted features are then passed through the vocoder, which produces the converted voice. Hence, there are primarily three components in building a stand-alone VC system, namely, the alignment step, the mapping function, and the speech analysis/synthesis framework. The major contributions of this thesis are towards identifying the limitations of existing techniques, improving them, and developing new approaches for the mapping and alignment stages of VC. In particular, a novel Amplitude Scaling (AS) method is proposed for frequency warping (FW)-based VC, which linearly transfers the amplitude of the frequency-warped spectrum using the knowledge of a Gaussian Mixture Model (GMM)-based converted spectrum without adding any spurious peaks. To overcome the issue of overfitting in Deep Neural Network (DNN)-based VC, the idea of pre-training is popular. However, this pre-training is time-consuming, and requires a separate network to learn the parameters of the network.
Hence, whether this additional pre-training step could be avoided by using recent advances in deep learning is investigated in this thesis. The ability of the Generative Adversarial Network (GAN) to estimate the probability density function (pdf) for generating realistic samples corresponding to the given source speaker's utterance has resulted in a significant performance improvement in the area of VC. The key limitation of the vanilla GAN-based system is that it may generate samples that do not correspond to the given source speaker's utterance. To address this issue, a Minimum Mean Squared Error (MMSE) regularized GAN (i.e., MMSE-GAN) is proposed in this thesis. Obtaining corresponding feature pairs in the context of both parallel and non-parallel VC is a challenging task. In this thesis, the strengths and limitations of the different existing alignment strategies are identified, and new alignment strategies are proposed for both parallel and non-parallel VC tasks. Wrongly aligned pairs will affect the learning of the mapping function, which in turn will deteriorate the quality of the converted voices. In order to remove such wrongly aligned pairs from the training data, an outlier removal-based pre-processing technique is proposed for parallel VC. In the case of non-parallel VC, a theoretical convergence proof is developed for the popular alignment technique, namely, Iterative combination of a Nearest Neighbor search step and a Conversion step Alignment (INCA). In addition, the use of dynamic features along with static features to calculate the Nearest Neighbor (NN) aligned pairs in the existing INCA and Temporal Context (TC) INCA is also proposed. Furthermore, a novel distance metric is learned for the NN-based search strategies, as Euclidean distance may not correlate well with the perceptual distance.
Moreover, a computationally simple Spectral Transition Measure (STM)-based phone alignment technique that does not require any a priori training data is also proposed for non-parallel VC. Both the parallel and the non-parallel alignment techniques will generate one-to-many and many-to-one feature pairs. These one-to-many and many-to-one pairs will affect the learning of the mapping function and result in the muffling and oversmoothing effect in VC. Hence, an unsupervised Vocal Tract Length Normalization (VTLN) posteriorgram and a novel inter-mixture weighted GMM posteriorgram are proposed as speaker-independent representations in the two-stage mapping network in order to remove the alignment step from the VC framework. In this thesis, an attempt has also been made to use the acoustic-to-articulatory inversion (AAI) technique for the quality assessment of voice-converted speech. Lastly, the proposed MMSE-GAN architecture is extended in the form of Discover GAN (i.e., MMSE DiscoGAN) for cross-domain VC applications (w.r.t. attributes of the speech production mechanism), namely, Non-Audible Murmur (NAM)-to-WHiSPer (NAM2WHSP) speech conversion, and WHiSPer-to-SPeeCH (WHSP2SPCH) conversion. Finally, the thesis summarizes the overall work presented and the limitations of the various approaches, along with future research directions.
650 ## - SUBJECT ADDED ENTRY--TOPICAL TERM
Topical term or geographic name as entry element Voice Conversion
Topical term or geographic name as entry element Parallel
Topical term or geographic name as entry element Nonparallel
Topical term or geographic name as entry element Alignment
Topical term or geographic name as entry element Metric Learning
Topical term or geographic name as entry element Frequency Warping
Topical term or geographic name as entry element INCA
Topical term or geographic name as entry element Pretraining
700 ## - ADDED ENTRY--PERSONAL NAME
Personal name Patil, Hemant A.
856 ## - ELECTRONIC LOCATION AND ACCESS
Uniform Resource Identifier http://drsr.daiict.ac.in/handle/123456789/893
942 ## - ADDED ENTRY ELEMENTS (KOHA)
Koha item type Thesis and Dissertations
Holdings
Withdrawn status
Lost status
Source of classification or shelving scheme
Damaged status
Not for loan
Permanent Location DAIICT
Current Location DAIICT
Date acquired 2020-03-03
Full call number 621.3994 SHA
Barcode T00832
Date last seen 2021-02-05
Koha item type Thesis and Dissertations
