Voice conversion : alignment and mapping perspective (Record no. 30246)

000 -LEADER
fixed length control field	nam a22 7a 4500
008 - FIXED-LENGTH DATA ELEMENTS--GENERAL INFORMATION
fixed length control field	210205b xxu||||| |||| 00| 0 eng d
082 ## - DEWEY DECIMAL CLASSIFICATION NUMBER
Classification number	621.3994
Item number	SHA
100 ## - MAIN ENTRY--PERSONAL NAME
Personal name	Shah, Nirmesh J.
245 ## - TITLE STATEMENT
Title	Voice conversion : alignment and mapping perspective
260 ## - PUBLICATION, DISTRIBUTION, ETC. (IMPRINT)
Place of publication, distribution, etc	Gandhinagar
Name of publisher, distributor, etc	Dhirubhai Ambani Institute of Information and Communication Technology
Date of publication, distribution, etc	2019
300 ## - PHYSICAL DESCRIPTION
Extent	xxvi, 207 p.
500 ## - GENERAL NOTE
General note	Patil, Hemant A., Thesis supervisor; Student ID No. 201321009; Thesis (Ph.D.) - Dhirubhai Ambani Institute of Information and Communication Technology, Gandhinagar, 2019
520 ## - SUMMARY, ETC.
Summary, etc	Understanding how a particular speaker produces speech, and mimicking that speaker's voice, is a difficult research problem due to the sophisticated mechanism involved in speech production. Voice Conversion (VC) is a technique that modifies the perceived speaker identity in a given speech utterance from a source speaker to a particular target speaker without changing the linguistic content. Building a stand-alone VC system consists of two stages, namely, training and testing. First, speaker-dependent features are extracted from both speakers' training data. These features are then time-aligned to obtain corresponding pairs, and a mapping function is learned from these aligned feature pairs. Once training is done, during the testing stage, features are extracted from the source speaker's held-out data and converted using the mapping function. The converted features are then passed through a vocoder, which produces the converted voice. Hence, a stand-alone VC system has three primary components, namely, the alignment step, the mapping function, and the speech analysis/synthesis framework. The major contributions of this thesis are identifying the limitations of existing techniques, improving them, and developing new approaches for the mapping and alignment stages of VC. In particular, a novel Amplitude Scaling (AS) method is proposed for frequency warping (FW)-based VC, which linearly transfers the amplitude of the frequency-warped spectrum using the knowledge of a Gaussian Mixture Model (GMM)-based converted spectrum without adding any spurious peaks. To overcome the issue of overfitting in Deep Neural Network (DNN)-based VC, pre-training is a popular approach. However, pre-training is time-consuming and requires a separate network to learn the parameters of the network. 
Hence, this thesis investigates whether this additional pre-training step can be avoided by using recent advances in deep learning. The ability of the Generative Adversarial Network (GAN) to estimate the probability density function (pdf) for generating realistic samples corresponding to the given source speaker's utterance has resulted in a significant performance improvement in the area of VC. The key limitation of the vanilla GAN-based system is that it may generate samples that do not correspond to the given source speaker's utterance. To address this issue, a Minimum Mean Squared Error (MMSE) regularized GAN (i.e., MMSE-GAN) is proposed in this thesis. Obtaining corresponding feature pairs is a challenging task in both parallel and non-parallel VC. In this thesis, the strengths and limitations of the different existing alignment strategies are identified, and new alignment strategies are proposed for both parallel and non-parallel VC tasks. Wrongly aligned pairs affect the learning of the mapping function, which in turn deteriorates the quality of the converted voices. In order to remove such wrongly aligned pairs from the training data, an outlier removal-based pre-processing technique is proposed for parallel VC. In the case of non-parallel VC, a theoretical convergence proof is developed for the popular alignment technique, namely, Iterative combination of a Nearest Neighbor search step and a Conversion step Alignment (INCA). In addition, the use of dynamic features along with static features to calculate the Nearest Neighbor (NN) aligned pairs in the existing INCA and Temporal Context (TC) INCA is also proposed. Furthermore, a novel distance metric is learned for the NN-based search strategies, as Euclidean distance may not correlate well with perceptual distance. 
Moreover, a computationally simple Spectral Transition Measure (STM)-based phone alignment technique that does not require any a priori training data is also proposed for non-parallel VC. Both the parallel and the non-parallel alignment techniques generate one-to-many and many-to-one feature pairs. These one-to-many and many-to-one pairs affect the learning of the mapping function and result in muffling and oversmoothing effects in VC. Hence, an unsupervised Vocal Tract Length Normalization (VTLN) posteriorgram and a novel inter mixture weighted GMM posteriorgram are proposed as speaker-independent representations in the two-stage mapping network in order to remove the alignment step from the VC framework. In this thesis, an attempt has also been made to use the acoustic-to-articulatory inversion (AAI) technique for the quality assessment of voice-converted speech. Lastly, the proposed MMSE-GAN architecture is extended in the form of Discover GAN (i.e., MMSE DiscoGAN) for cross-domain VC applications (w.r.t. attributes of the speech production mechanism), namely, Non-Audible Murmur (NAM)-to-WHiSPer (NAM2WHSP) speech conversion and WHiSPer-to-SPeeCH (WHSP2SPCH) conversion. Finally, the thesis summarizes the overall work presented and the limitations of the various approaches, along with future research directions.
650 ## - SUBJECT ADDED ENTRY--TOPICAL TERM
Topical term or geographic name as entry element	Voice Conversion

Topical term or geographic name as entry element	Parallel

Topical term or geographic name as entry element	Nonparallel

Topical term or geographic name as entry element	Alignment

Topical term or geographic name as entry element	Metric Learning

Topical term or geographic name as entry element	Frequency Warping

Topical term or geographic name as entry element	INCA

Topical term or geographic name as entry element	Pretraining
700 ## - ADDED ENTRY--PERSONAL NAME
Personal name	Patil, Hemant A.
856 ## - ELECTRONIC LOCATION AND ACCESS
Uniform Resource Identifier	http://drsr.daiict.ac.in/handle/123456789/893
942 ## - ADDED ENTRY ELEMENTS (KOHA)
Koha item type	Thesis and Dissertations

Holdings
Withdrawn status	Lost status	Source of classification or shelving scheme	Damaged status	Not for loan	Permanent Location	Current Location	Date acquired	Full call number	Barcode	Date last seen	Koha item type
					DAIICT	DAIICT	2020-03-03	621.3994 SHA	T00832	2021-02-05	Thesis and Dissertations
