Noise-robust Speech Recognition with 10 Minutes Unparalleled In-domain Data

Noise-robust speech recognition systems require large amounts of training data, including noisy speech and the corresponding transcripts, to achieve state-of-the-art performance across varied practical environments. However, such abundant in-domain data is not always available in the real world. In this paper, we propose a generative adversarial network that simulates a noisy spectrum from a clean spectrum (Simu-GAN), where only 10 minutes of unparalleled in-domain noisy speech data is required as labels. Furthermore, we propose a dual-path speech recognition system to improve robustness under noisy conditions. Experimental results show that the proposed speech recognition system, using noisy data simulated by Simu-GAN, achieves a 7.3% absolute improvement in word error rate (WER) over the best baseline.
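For readers unfamiliar with the metric, the following is a minimal sketch of how WER is commonly computed: the word-level edit distance between a reference transcript and a recognizer hypothesis, divided by the number of reference words. The function name `word_error_rate` and the example sentences are illustrative assumptions, not part of the paper's evaluation code.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Return (substitutions + deletions + insertions) / number of reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words processed
    # so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr[j] = min(
                prev[j] + 1,         # deletion of a reference word
                curr[j - 1] + 1,     # insertion of a hypothesis word
                prev[j - 1] + cost,  # match or substitution
            )
        prev = curr
    return prev[-1] / len(ref)


# Toy example (hypothetical sentences): one substitution and one deletion
# against a six-word reference.
print(word_error_rate("the cat sat on the mat", "the cat sit on mat"))
```

An absolute improvement is a difference in WER percentage points between two systems, as opposed to a relative reduction expressed as a fraction of the baseline WER.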


Speech Samples

The clean speech and real noisy speech are from the Robust Automatic Transcription of Speech (RATS) program [1].

The simulated noisy speech is generated by the proposed Simu-GAN.

Please use earphones when listening to these samples.

ID Clean speech Simulated noisy speech Real noisy speech
1
2
3
4
5

References

[1] D. Graff, K. Walker, S. M. Strassel, X. Ma, K. Jones, and A. Sawyer, “The RATS collection: Supporting HLT research with degraded audio data,” in LREC. Citeseer, 2014.