{"id":614,"date":"2013-02-16T20:19:01","date_gmt":"2013-02-16T19:19:01","guid":{"rendered":"http:\/\/quantum-bits.org\/?p=614"},"modified":"2022-08-12T17:34:42","modified_gmt":"2022-08-12T16:34:42","slug":"project-jarvis-step-two-speak-to-me","status":"publish","type":"post","link":"https:\/\/www.quantum-bits.org\/?p=614","title":{"rendered":"Project &#8220;Jarvis&#8221;: step two (speak to me)"},"content":{"rendered":"<table width=\"100%\">\n<tbody>\n<tr>\n<td>In my <a href=\"http:\/\/quantum-bits.org\/?p=574\" title=\"Previous post\">previous post<\/a>, I conducted a few experiments with speech recognition via Google&#8217;s Speech API and got enough results to push the project &#8220;Jarvis&#8221; a bit further.<br \/>\nNow it is time for Jarvis to speak!<\/td>\n<td><img decoding=\"async\" src=\"http:\/\/quantum-bits.org\/wp-content\/uploads\/2013\/02\/raspberrypi-logo.png\" alt=\"\" align=\"top\"><\/td>\n<\/tr>\n<\/tbody>\n<\/table>\n<p>&nbsp;<br \/>\n<strong>Text-To-Speech engines<\/strong><\/p>\n<p>There are many &#8220;Text-To-Speech&#8221; engines already packaged for the Raspberry Pi. Namely:<\/p>\n<ul>\n<li style=\"list-style: square inside; color: #aaaaaa;\"><span style=\"color: #666666;\"><b>espeak<\/b>: <a href=\"http:\/\/espeak.sourceforge.net\/\" title=\"eSpeak\" target=\"_blank\" rel=\"noopener\">eSpeak<\/a> is a compact open source speech synthesizer (for English and other languages). It is available as a shared library and as a command-line program to speak from a file or from <code style=\"font-size:9pt\">stdin<\/code>. It can be used as a front-end to <a href=\"http:\/\/en.wikipedia.org\/wiki\/MBROLA\" title=\"MBROLA\" target=\"_blank\" rel=\"noopener\">mbrola<\/a> diphone voices. 
<\/span><\/li>\n<li style=\"list-style: square inside; color: #aaaaaa;\"><span style=\"color: #666666;\"><b>festival<\/b>: <a href=\"http:\/\/www.cstr.ed.ac.uk\/projects\/festival\/\" title=\"Festival\" target=\"_blank\" rel=\"noopener\">Festival Speech Synthesis System<\/a> is a multi-lingual open source speech synthesizer which offers Text-To-Speech capabilities through various APIs.<\/span><\/li>\n<li style=\"list-style: square inside; color: #aaaaaa;\"><span style=\"color: #666666;\"><b>flite<\/b>: <a href=\"http:\/\/www.speech.cs.cmu.edu\/flite\/\" title=\"Flite\" target=\"_blank\" rel=\"noopener\">festival-lite<\/a> is a small run-time speech synthesis engine developed at Carnegie Mellon University, derived from Festival. <\/span><\/li>\n<\/ul>\n<p>Let&#8217;s install and try these three engines:<\/p>\n<pre line=\"1\" lang=\"bash\">apt-get install espeak\napt-get install festival\napt-get install flite\n<\/pre>\n<p>Unfortunately, I ran into a set of broken packages when I tried to install <code style=\"font-size:9pt\">mbrola<\/code> voices for <code style=\"font-size:9pt\">espeak<\/code> and <code style=\"font-size:9pt\">festival<\/code>:<\/p>\n<pre lang=\"bash\">root@applepie ~ # apt-get install mbrola-en1 mbrola-fr1 mbrola-fr4  mbrola-us1 mbrola-us2 mbrola-us3 festvox-en1 festvox-us1 festvox-us2 festvox-us3\nReading package lists... Done\nBuilding dependency tree       \nReading state information... Done\nSome packages could not be installed. 
This may mean that you have\nrequested an impossible situation or if you are using the unstable\ndistribution that some required packages have not yet been created\nor been moved out of Incoming.\nThe following information may help to resolve the situation:\n\nThe following packages have unmet dependencies:\n mbrola-en1 : Depends: mbrola but it is not installable\n mbrola-fr1 : Depends: mbrola but it is not installable\n mbrola-fr4 : Depends: mbrola but it is not installable\n mbrola-us1 : Depends: mbrola but it is not installable\n mbrola-us2 : Depends: mbrola but it is not installable\n mbrola-us3 : Depends: mbrola but it is not installable\nE: Unable to correct problems, you have held broken packages.\n<\/pre>\n<p>This meant that, without the mbrola voices, the output from espeak and festival would most likely be of rather poor quality. I therefore introduced a new contender, this time an external service: the Google Text-to-Speech API.<\/p>\n<p>Here&#8217;s a little benchmark, where the speech outputs from each engine are compared, given the same quote from <a href=\"http:\/\/en.wikipedia.org\/wiki\/2001:_A_Space_Odyssey_(film)\" title=\"2001\" target=\"_blank\" rel=\"noopener\">2001: A Space Odyssey<\/a>.<\/p>\n<p><strong>Benchmark #1: espeak<\/strong><\/p>\n<p>Getting a <code style=\"font-size:9pt\">.wav<\/code> file from plain text is quite easy:<\/p>\n<pre line=\"1\" lang=\"bash\">espeak \"Look Dave, I can see you're really upset about this\" --stdout &gt; espeak.wav\n<\/pre>\n<p>Here&#8217;s the <code style=\"font-size:9pt\">.wav<\/code> output from espeak:<\/p>\n<p><center><\/p>\n<table border=\"1\">\n<tbody>\n<tr>\n<td width=\"98px\"><img decoding=\"async\" src=\"http:\/\/quantum-bits.org\/wp-content\/uploads\/2013\/02\/text2speech.png\" style=\"position:relative;top:-4px\"><\/td>\n<td width=\"300px\">\nhttp:\/\/quantum-bits.org\/wp-content\/uploads\/2013\/02\/espeak.wav<\/td>\n<td width=\"100px\" align=\"left\">\n<div 
style=\"position:relative;top:-4px;left:4px\"><strong>espeak<\/strong><\/div>\n<\/td>\n<\/tr>\n<\/tbody>\n<\/table>\n<p><\/center><\/p>\n<p>As expected, it is really bad. It reminds me of the speech synthesizer I used to play with on my <a href=\"http:\/\/en.wikipedia.org\/wiki\/Atari_ST\" title=\"Atari ST\" target=\"_blank\" rel=\"noopener\">Atari 1040STF<\/a> in the 80&#8217;s \ud83d\ude41<\/p>\n<p><strong>Benchmark #2: festival<\/strong><\/p>\n<p>Getting a <code style=\"font-size:9pt\">.wav<\/code> file from plain text is also easy:<\/p>\n<pre line=\"1\" lang=\"bash\">echo \"Look Dave, I can see you're really upset about this\" | text2wave -o festival.wav\n<\/pre>\n<p>And the resulting <code style=\"font-size:9pt\">.wav<\/code> output is:<\/p>\n<p><center><\/p>\n<table border=\"1\">\n<tbody>\n<tr>\n<td width=\"98px\"><img decoding=\"async\" src=\"http:\/\/quantum-bits.org\/wp-content\/uploads\/2013\/02\/text2speech.png\" style=\"position:relative;top:-4px\"><\/td>\n<td width=\"300px\">\nhttp:\/\/quantum-bits.org\/wp-content\/uploads\/2013\/02\/festival.wav<\/td>\n<td width=\"100px\" align=\"left\">\n<div style=\"position:relative;top:-4px;left:4px\"><strong>festival<\/strong><\/div>\n<\/td>\n<\/tr>\n<\/tbody>\n<\/table>\n<p><\/center><\/p>\n<p>Less robotic, but still very far from what I need for Jarvis \ud83d\ude41<\/p>\n<p><strong>Benchmark #3: flite<\/strong><\/p>\n<p>Getting speech output from flite is as simple as it is from espeak and festival:<\/p>\n<pre line=\"1\" lang=\"bash\">echo \"Look Dave, I can see you're really upset about this\" | flite -o flite.wav\n<\/pre>\n<p>And the resulting <code style=\"font-size:9pt\">.wav<\/code> goes like this:<\/p>\n<p><center><\/p>\n<table border=\"1\">\n<tbody>\n<tr>\n<td width=\"98px\"><img decoding=\"async\" src=\"http:\/\/quantum-bits.org\/wp-content\/uploads\/2013\/02\/text2speech.png\" style=\"position:relative;top:-4px\"><\/td>\n<td 
width=\"300px\">\nhttp:\/\/quantum-bits.org\/wp-content\/uploads\/2013\/02\/flite.wav<\/td>\n<td width=\"100px\" align=\"left\">\n<div style=\"position:relative;top:-4px;left:4px\"><strong>flite<\/strong><\/div>\n<\/td>\n<\/tr>\n<\/tbody>\n<\/table>\n<p><\/center><\/p>\n<p>Better. It&#8217;s getting <a href=\"http:\/\/en.wikipedia.org\/wiki\/HAL_9000\" target=\"_blank\" rel=\"noopener\">HAL<\/a>-like, but I really need something closer to a real human voice.<\/p>\n<p><strong>Benchmark #4: Google TTS<\/strong><\/p>\n<p>Google Text-To-Speech is a private REST API. Getting results is less straightforward but nonetheless quite manageable. Here&#8217;s a little PHP script:<\/p>\n<pre line=\"1\" lang=\"bash\">&lt;?php\n\/\/ URL-encode the sentence so it can be passed as the q parameter\n$voice = urlencode(\"Look Dave, I can see you're really upset about this\");\n\/\/ Fetch the MP3 with curl, sending a browser-like User-Agent (-A)\n$cmd = '\/usr\/bin\/curl -A \"Mozilla\" \"http:\/\/translate.google.com\/translate_tts?tl=en_gb&ie=UTF-8&q='.$voice.'\" &gt; google.mp3';\nshell_exec($cmd);\n?&gt;\n<\/pre>\n<p>And here&#8217;s the result (converted to the same .wav format):<\/p>\n<p><center><\/p>\n<table border=\"1\">\n<tbody>\n<tr>\n<td width=\"98px\"><img decoding=\"async\" src=\"http:\/\/quantum-bits.org\/wp-content\/uploads\/2013\/02\/text2speech.png\" style=\"position:relative;top:-4px\"><\/td>\n<td width=\"300px\">\nhttp:\/\/quantum-bits.org\/wp-content\/uploads\/2013\/02\/google1.wav<\/td>\n<td width=\"100px\" align=\"left\">\n<div style=\"position:relative;top:-4px;left:4px\"><strong>Google<\/strong> (en_gb)<\/div>\n<\/td>\n<\/tr>\n<\/tbody>\n<\/table>\n<p><\/center><\/p>\n<p>Much, much better \ud83d\ude0e. Maybe a little too slow. 
Let&#8217;s try to play with localizations and switch from British English to US English:<\/p>\n<p><center><\/p>\n<table border=\"1\">\n<tbody>\n<tr>\n<td width=\"98px\"><img decoding=\"async\" src=\"http:\/\/quantum-bits.org\/wp-content\/uploads\/2013\/02\/text2speech.png\" style=\"position:relative;top:-4px\"><\/td>\n<td width=\"300px\">\nhttp:\/\/quantum-bits.org\/wp-content\/uploads\/2013\/02\/google2.wav<\/td>\n<td width=\"100px\" align=\"left\">\n<div style=\"position:relative;top:-4px;left:4px\"><strong>Google<\/strong> (en_us)<\/div>\n<\/td>\n<\/tr>\n<\/tbody>\n<\/table>\n<p><\/center><\/p>\n<p>Surprisingly, the US voice is female \ud83d\ude00<br \/>\nNot bad. Now, let&#8217;s try a French version:<\/p>\n<p><center><\/p>\n<table border=\"1\">\n<tbody>\n<tr>\n<td width=\"98px\"><img decoding=\"async\" src=\"http:\/\/quantum-bits.org\/wp-content\/uploads\/2013\/02\/text2speech.png\" style=\"position:relative;top:-4px\"><\/td>\n<td width=\"300px\">\nhttp:\/\/quantum-bits.org\/wp-content\/uploads\/2013\/02\/google3.wav<\/td>\n<td width=\"100px\" align=\"left\">\n<div style=\"position:relative;top:-4px;left:4px\"><strong>Google<\/strong> (fr_fr)<\/div>\n<\/td>\n<\/tr>\n<\/tbody>\n<\/table>\n<p><\/center><\/p>\n<p>Really good. Also a female voice. It is actually very close to the synthetic voice used at SNCF (French Railroads) stations. Kind of a scary voice. It feels like &#8230; I&#8217;m gonna miss a f**king train.<\/p>\n<p>I think I&#8217;m gonna settle for the British voice from Google&#8217;s Text-To-Speech Engine.<\/p>\n<p>I&#8217;ll have to rely (once more) on an external service, but an electronic butler has to be British :p<\/p>\n","protected":false},"excerpt":{"rendered":"<p>In my previous post, I conducted a few experiments with speech recognition via Google&#8217;s Speech API and got enough results to push the project &#8220;Jarvis&#8221; a bit further. 
Now it &#8230;<\/p>\n","protected":false},"author":1,"featured_media":3853,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"ngg_post_thumbnail":0},"categories":[5,21],"tags":[],"_links":{"self":[{"href":"https:\/\/www.quantum-bits.org\/index.php?rest_route=\/wp\/v2\/posts\/614"}],"collection":[{"href":"https:\/\/www.quantum-bits.org\/index.php?rest_route=\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/www.quantum-bits.org\/index.php?rest_route=\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/www.quantum-bits.org\/index.php?rest_route=\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/www.quantum-bits.org\/index.php?rest_route=%2Fwp%2Fv2%2Fcomments&post=614"}],"version-history":[{"count":0,"href":"https:\/\/www.quantum-bits.org\/index.php?rest_route=\/wp\/v2\/posts\/614\/revisions"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/www.quantum-bits.org\/index.php?rest_route=\/wp\/v2\/media\/3853"}],"wp:attachment":[{"href":"https:\/\/www.quantum-bits.org\/index.php?rest_route=%2Fwp%2Fv2%2Fmedia&parent=614"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/www.quantum-bits.org\/index.php?rest_route=%2Fwp%2Fv2%2Fcategories&post=614"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/www.quantum-bits.org\/index.php?rest_route=%2Fwp%2Fv2%2Ftags&post=614"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}