Overview
The goal of this project is to create the largest publicly available corpus of two-party conversations about general topics in which transcripts and speech are correctly aligned. It does so by correcting the alignment of transcripts and speech for the 1155 conversations in the Switchboard Dialog Act (SwDA) corpus. Dialogue act prediction and production are of central importance today in research, government and industry, as more and more dialogue systems are being built to interact with people for training, education, decreasing the human workload in call centers, and providing problem-solving advice. However, there are few large labeled corpora that researchers can use for model-building and analysis of general conversational speech. The SwDA corpus was created from the larger Switchboard Corpus in the late 1990s, and its transcripts and speech were originally aligned with a GMM-HMM Switchboard recognizer. The results of that alignment are very poor, making it extremely difficult to use speech and text data together to predict dialogue acts or to learn to generate them correctly: most users have found that using the aligned audio information does not improve, and sometimes worsens, their dialogue act prediction or generation scores. To that end, the project aims to re-align each side of the SwDA transcripts with the speaker’s audio, manually correcting the errors from the early automatic alignment, to make the corpus of much greater value to the dialogue research community.
Source: Spoken Language Processing Group, Columbia Engineering