|
Page 1 of 1 |
X_Dror
Posts: 4955
Location: Jerusalem, Israel
|
Posted: Sat, 23rd Apr 2011 16:46 Post subject: Android - Compare voice recordings and trigger a function |
|
 |
Hey,
I've got an interesting idea for a cellphone application, and since Android seems like a nice and easy platform for it, I started coding it with the Android SDK.
Anyhow, I'm looking for a solution for one of the main features of my application.
My application needs to record several voice commands from the user, and then attach each recording to a specific "command" in my application.
Like:
1. Record a command and save it in memory
2. Prompt the user to say a command
3. Compare the new command with the recorded one; if similar, call function X, if not, call function Y
etc.
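A minimal sketch of that record / compare / dispatch loop, in Python for brevity rather than Android Java. The names `CommandStore`, `similarity` and the 0.8 threshold are made up for illustration, and the similarity measure is only a placeholder (mean absolute difference of samples scaled to [-1, 1]), not a recommendation:

```python
from typing import Callable, Dict, List, Optional, Sequence


def similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Placeholder score in [0, 1]: 1 minus the mean absolute difference
    over the overlapping part of two signals scaled to [-1, 1]."""
    n = min(len(a), len(b))
    if n == 0:
        return 0.0
    diff = sum(abs(a[i] - b[i]) for i in range(n)) / n
    return max(0.0, 1.0 - diff)


class CommandStore:
    """Holds one recorded template and one action per command name."""

    def __init__(self, threshold: float = 0.8) -> None:
        self.threshold = threshold  # hypothetical "similar enough" cutoff
        self.templates: Dict[str, List[float]] = {}
        self.actions: Dict[str, Callable[[], str]] = {}

    def record(self, name: str, samples: Sequence[float],
               action: Callable[[], str]) -> None:
        # Step 1: keep the recording in memory and attach it to a command.
        self.templates[name] = list(samples)
        self.actions[name] = action

    def handle(self, samples: Sequence[float],
               on_no_match: Callable[[], str]) -> str:
        # Step 3: find the closest template, then call X or Y.
        best_name: Optional[str] = None
        best_score = 0.0
        for name, template in self.templates.items():
            score = similarity(samples, template)
            if score > best_score:
                best_name, best_score = name, score
        if best_name is not None and best_score >= self.threshold:
            return self.actions[best_name]()
        return on_no_match()
```

With about 6 templates, scanning all of them on every command is cheap; the open question is only what goes inside `similarity`.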
Basically all I want is to actually imitate the function that many phones have, when you tag a phone contact with a voice recording, and then you can call that contact if you repeat his recorded name with your voice.
All I found so far is a way-over-the-top solution with Google's voice service, which I certainly don't need and don't want.
I don't need a fancy speech-to-text functionality, and I don't want to depend on the Internet. The API needs to be strictly local.
Does a "voice comparison" function exist in Android SDK? If not, can I implement it without too much hassle?
So far no one could help me with this problem... maybe NFOHumpers won't disappoint me
Thanks a lot!
|
|
garus
VIP Member
Posts: 34200
|
Posted: Sat, 23rd Apr 2011 17:15 Post subject: |
|
 |
snip
Last edited by garus on Tue, 27th Aug 2024 21:50; edited 2 times in total
|
|
|
Posted: Sat, 23rd Apr 2011 17:19 Post subject: |
|
 |
This is quite a complicated feature you're asking for. Actually, it sounds complicated to me because I haven't done any audio programming until now, but basically I don't think there is anything like this already present.
If you are going to write it yourself (I don't know how good you are), try to downsample the keyword and store it at several quality levels. It should be easier to compare it this way.
Let's say you have several words and you store all of them at 80%, 55%, 30% and 10% accuracy. First you'd make a quick comparison against all of them at 10% accuracy. The ones that sound alike should then be checked at the higher levels. The last one remaining is probably the keyword that matches.
When doing this for images they have similar techniques. Like taking out the color of the image and reducing the resolution and checking if there are similarities.
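A rough sketch of that coarse-to-fine idea, in Python for brevity. "Accuracy" is interpreted here as block-averaging by a factor (8 = coarsest, 1 = full resolution); that interpretation, the function names and the `tolerance` value are all assumptions, and both recordings are assumed to be already trimmed and roughly aligned:

```python
def downsample(samples, factor):
    """Average consecutive blocks of `factor` samples (a crude low-quality copy)."""
    blocks = [samples[i:i + factor] for i in range(0, len(samples), factor)]
    return [sum(block) / len(block) for block in blocks]


def mean_abs_diff(a, b):
    """Mean absolute difference over the overlapping part of two signals."""
    n = min(len(a), len(b))
    if n == 0:
        return float("inf")
    return sum(abs(a[i] - b[i]) for i in range(n)) / n


def coarse_to_fine_match(query, templates, factors=(8, 4, 2, 1), tolerance=0.2):
    """Return the name of the template that survives every level, or None.

    `templates` maps a command name to its recorded samples. Candidates that
    differ too much at a coarse level are dropped before finer levels run.
    """
    survivors = [(0.0, name) for name in templates]
    for factor in factors:
        coarse_query = downsample(query, factor)
        scored = [(mean_abs_diff(coarse_query, downsample(templates[name], factor)), name)
                  for _, name in survivors]
        survivors = [(dist, name) for dist, name in scored if dist <= tolerance]
        if not survivors:
            return None
    # Smallest distance at the finest level wins.
    return min(survivors)[1]
```

Note that two very different words can look identical at the coarsest level (their block averages coincide), so the coarse pass can only rule candidates out, never confirm one.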
And you have to take into account that the user recording their voice might be in a noisy environment, so you'd need to block out all frequencies that usually do not belong to human speech (everything except ~100-~500 Hz). You'd also have to trim everything before and after the keyword, and so on...
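The trimming part can be sketched with a simple energy threshold, in Python for brevity. This only covers cutting the silence before and after the keyword, not the frequency filtering; the frame size and threshold are placeholder values that would need tuning on real recordings:

```python
def trim_silence(samples, frame=4, threshold=0.1):
    """Drop leading and trailing frames whose mean absolute amplitude is
    below `threshold`. Samples are assumed to be scaled to [-1, 1]."""
    frames = [samples[i:i + frame] for i in range(0, len(samples), frame)]
    loud = [index for index, chunk in enumerate(frames)
            if sum(abs(x) for x in chunk) / len(chunk) >= threshold]
    if not loud:
        return []  # nothing but silence
    start = loud[0] * frame
    end = min(len(samples), (loud[-1] + 1) * frame)
    return list(samples[start:end])
```

In a noisy room a fixed threshold will let background noise through, so a real version would probably estimate the noise level from the first few frames first.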
|
|
garus
VIP Member
Posts: 34200
|
Posted: Sat, 23rd Apr 2011 17:23 Post subject: |
|
 |
snip
Last edited by garus on Tue, 27th Aug 2024 21:50; edited 1 time in total
|
|
X_Dror
Posts: 4955
Location: Jerusalem, Israel
|
Posted: Sat, 23rd Apr 2011 17:49 Post subject: |
|
 |
Voice recognition can indeed be a complicated matter, but..
the thing is that I don't need speech-to-text like Google did.
I also don't need a big keywords dictionary.
The user can have about 6 unique words that he can record, and each time he gives a command it can only be a single word.
I've never written any kind of speech recognition software, and I don't have any experience with analyzing voice data, but I think matching a sound against 6 different audio files doesn't have to be that difficult.
Another reason why I don't see why this function wouldn't exist already on an Android is because many older phones had this functionality where you could put "voice tags" on phone contacts, and then automatically call them by repeating their recorded voice tag.
This is the kind of functionality I need, and not Google's advanced speech-to-text.
If this functionality doesn't exist on the Android, then maybe I could use an existing implementation that I could somehow import to the Android.
Thanks for the comments so far!
|
|
|
Posted: Sat, 23rd Apr 2011 17:56 Post subject: |
|
 |
@Garus: As you seem to know this matter -> Is sound information stored three dimensional? (Time, frequency, volume)?
|
|
garus
VIP Member
Posts: 34200
|
Posted: Sat, 23rd Apr 2011 18:01 Post subject: |
|
 |
snip
Last edited by garus on Tue, 27th Aug 2024 21:50; edited 2 times in total
|
|
|
Posted: Sat, 23rd Apr 2011 18:06 Post subject: |
|
 |
Windows 7 speech recognition does not work, and I went through all the tutorials and let that shit listen to my voice OVER and OVER! It doesn't even accept the most fundamental commands. Just stuff like "Press enter", "Press 1" and shit
|
|
|
Posted: Sat, 23rd Apr 2011 18:07 Post subject: |
|
 |
Volume must be stored too, no?
|
|
garus
VIP Member
Posts: 34200
|
Posted: Sat, 23rd Apr 2011 18:16 Post subject: |
|
 |
snip
Last edited by garus on Tue, 27th Aug 2024 21:50; edited 1 time in total
|
|
|
Posted: Sat, 23rd Apr 2011 18:31 Post subject: |
|
 |
I'm not well informed in this area but I think the approach would be:
Filter the signal by cutting off all non-voice frequencies
Filter and resample the signal with some convolution kernel, like a Gaussian, to reduce noise
Compare frequency in each sample to the original sample and based on how close/far they are, assign a score to the sample. Keep a total score count.
(Actually first you should compare smaller pieces of the input signal with the stored one, and search the entire stored signal to find a starting point. And from there you can do a proper comparison like I mention above)
Normalize the score by weighting it by frequency range and sample count
If score is above some limit you have a match, if below you don't.
That's how I would start with the issue, although I don't know shit about audio processing really.
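Part of the approach above can be sketched like this, in Python for brevity: Gaussian smoothing, sliding the input over the stored signal to find a starting point, then a score normalized by sample count and checked against a limit. It compares smoothed amplitudes in the time domain rather than frequencies, and the kernel size, the `1 / (1 + distance)` score and the 0.9 limit are arbitrary choices made for illustration:

```python
import math


def gaussian_kernel(radius=2, sigma=1.0):
    """Normalized Gaussian weights for offsets -radius..radius."""
    weights = [math.exp(-(k * k) / (2.0 * sigma * sigma))
               for k in range(-radius, radius + 1)]
    total = sum(weights)
    return [w / total for w in weights]


def smooth(samples, kernel):
    """Convolve with a symmetric kernel; indexes past the edges are clamped."""
    radius = len(kernel) // 2
    last = len(samples) - 1
    out = []
    for i in range(len(samples)):
        acc = 0.0
        for k, weight in enumerate(kernel):
            j = min(max(i + k - radius, 0), last)
            acc += weight * samples[j]
        out.append(acc)
    return out


def best_offset(query, stored):
    """Slide `query` along `stored`; return (offset, mean absolute difference)
    for the position where the two differ least."""
    best = (0, float("inf"))
    for offset in range(len(stored) - len(query) + 1):
        dist = sum(abs(query[i] - stored[offset + i])
                   for i in range(len(query))) / len(query)
        if dist < best[1]:
            best = (offset, dist)
    return best


def match_score(query, stored):
    """Score in (0, 1]; 1.0 means identical after smoothing and alignment."""
    kernel = gaussian_kernel()
    _, dist = best_offset(smooth(query, kernel), smooth(stored, kernel))
    return 1.0 / (1.0 + dist)


def is_match(query, stored, limit=0.9):
    return match_score(query, stored) >= limit
```

The sliding search is O(len(stored) * len(query)), which is fine for short, downsampled commands but would get slow on raw 44.1 kHz audio.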
|
|
|
Posted: Sat, 23rd Apr 2011 18:48 Post subject: |
|
 |
PumpAction wrote: | Volume must be stored too, no? |
From what I know, uncompressed audio is (usually) stored as a sequence of integers that are used to reconstruct the sound wave. Audio CDs store 16-bit integers 44100 times per second (16-bit, 44.1 kHz) and make a wave out of them during playback. Volume is implied by the amplitude of the wave.
I don't know whether you'd call it one or two dimensional, I'd say it's one dimensional.
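To make that concrete, here is a small Python sketch that turns raw 16-bit PCM bytes back into integers and derives a loudness figure (RMS) from the amplitudes alone. It assumes mono, signed, little-endian samples; the helper names are invented for the example:

```python
import math
import struct


def decode_pcm16(raw):
    """Interpret bytes as little-endian signed 16-bit mono samples."""
    count = len(raw) // 2
    return list(struct.unpack("<%dh" % count, raw[:count * 2]))


def rms(samples):
    """Root mean square amplitude: one number describing loudness."""
    if not samples:
        return 0.0
    return math.sqrt(sum(s * s for s in samples) / len(samples))


def sine_pcm16(freq_hz, amplitude, count=441, rate=44100):
    """Synthesize `count` samples of a sine tone as raw 16-bit PCM bytes."""
    values = [int(amplitude * math.sin(2.0 * math.pi * freq_hz * t / rate))
              for t in range(count)]
    return struct.pack("<%dh" % count, *values)
```

The same 440 Hz tone generated with amplitude 1000 and with amplitude 20000 decodes to the same number of samples, but the second has a much larger `rms`. So the stored data really is one value per time step, and volume falls out of how large those values are.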
|
|
Werelds
Special Little Man
Posts: 15098
Location: 0100111001001100
|
Posted: Sat, 23rd Apr 2011 19:06 Post subject: |
|
 |
That's the codec, has nothing to do with the actual signal. The only thing that might happen, is that the codec of choice is too inaccurate to deal with a specific signal, and the decoded signal turns out different than the original; that's what happens with the GSM codec for example, one of the best examples of the compromise between a signal and its digital representation.
What garus said is true, this you won't be able to do by yourself. You have to account for both differences in frequency (pitch) and amplitude (volume) between voices. That's the "easy" part. The hard part is wavelength, which will be incredibly different between different accents; and that is just taking English as an example. Within England itself alone (excluding the US) you already easily have 50 accents, in each of which there will be some letter pronounced radically different from others. Throw the US, Canada and India into the mix for some other "native" English speaking countries. Done with that? Fine, now you can go deal with the rest of the world.
Honestly, you're underestimating the work that goes into this, there's a reason why it's only been in recent years that there's been software that *actually* works pretty well. I haven't even begun to discuss things like intonation.
|
|
|
Posted: Sat, 23rd Apr 2011 19:09 Post subject: |
|
 |
Werelds wrote: | That's the codec, has nothing to do with the actual signal. The only thing that might happen, is that the codec of choice is too inaccurate to deal with a specific signal, and the decoded signal turns out different than the original; that's what happens with the GSM codec for example, one of the best examples of the compromise between a signal and its digital representation.
What garus said is true, this you won't be able to do by yourself. You have to account for both differences in frequency (pitch) and amplitude (volume) between voices. That's the "easy" part. The hard part is wavelength, which will be incredibly different between different accents; and that is just taking English as an example. Within England itself alone (excluding the US) you already easily have 50 accents, in each of which there will be some letter pronounced radically different from others. Throw the US, Canada and India into the mix for some other "native" English speaking countries. Done with that? Fine, now you can go deal with the rest of the world.
Honestly, you're underestimating the work that goes into this, there's a reason why it's only been in recent years that there's been software that *actually* works pretty well. I haven't even begun to discuss things like intonation. |
You don't need most of those things for just comparing two signals...which is what he wants. He doesn't need to actually recognize the words spoken.
|
|
|
Posted: Sat, 23rd Apr 2011 19:18 Post subject: |
|
 |
As bearish said, as far as I understand it, the user himself records the voices. It's not a check against any preexisting record 
|
|
Werelds
Special Little Man
Posts: 15098
Location: 0100111001001100
|
Posted: Sat, 23rd Apr 2011 19:24 Post subject: |
|
 |
Well in that case, AudioTrack and AudioRecord should do the trick.
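Record a command with AudioRecord and save the raw data to a file. Then for activation, read the new command into a byte array using AudioRecord, load the saved file into another byte array (AudioTrack only plays audio back, so you'd only need it to let the user hear the recording), and then just compare the 2 arrays.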
The hard part will still be determining a "good enough" comparison rate between the two, taking background noise and such into account.
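Comparing the two arrays byte for byte will practically never succeed, since no two recordings are identical. One common way to get a graded "how similar" number is normalized correlation, sketched here in Python on already decoded sample values. The function name is made up, and any threshold on the result is something you'd have to pick by experiment:

```python
import math


def normalized_correlation(a, b):
    """Correlation of two mean-removed signals over their overlapping length.

    Returns a value in [-1, 1]: 1 for the same shape at any volume,
    around 0 for unrelated signals, -1 for an inverted copy."""
    n = min(len(a), len(b))
    if n == 0:
        return 0.0
    mean_a = sum(a[:n]) / n
    mean_b = sum(b[:n]) / n
    da = [x - mean_a for x in a[:n]]
    db = [x - mean_b for x in b[:n]]
    norm = math.sqrt(sum(x * x for x in da) * sum(x * x for x in db))
    if norm == 0.0:
        return 0.0  # at least one signal is constant (e.g. pure silence)
    return sum(x * y for x, y in zip(da, db)) / norm
```

Because of the normalization, saying the command louder or quieter does not change the result, which takes care of the amplitude part. It does nothing about background noise or timing differences, so those still have to be handled before this step.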
|
|
garus
VIP Member
Posts: 34200
|
Posted: Sat, 23rd Apr 2011 21:04 Post subject: |
|
 |
snip
Last edited by garus on Tue, 27th Aug 2024 21:50; edited 1 time in total
|
|
[mrt]
[Admin] Code Monkey
Posts: 1338
|
Posted: Sat, 23rd Apr 2011 23:31 Post subject: |
|
 |
Maybe if i drop my cookie..
If I was going down that road and trying to compare recorded voices, I would definitely look into FFT and black out (filter) any frequencies outside roughly 1 to 5 kHz. That should be the sweet spot for human speech, although you should look up standing telephone line standards; their pass-band is optimized for human speech, so they should have the exact numbers. Then, once you've broken the signal down into its frequency components, try a "fuzzy" match against all known recordings. You will need to compare each individual sample in a recording in order to do that (the whole time domain).
I have no idea how accurate this method could be, but it's certainly better than just comparing carrots and bytes.
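A small Python sketch of the frequency-domain idea: transform both signals, keep only the bins inside a speech band, and compare the resulting magnitude profiles. It uses a naive O(n^2) DFT to stay dependency-free (a real app would use an FFT library), treats the whole recording as one block instead of sample by sample, and defaults to 300-3400 Hz, the traditional telephone voice band; swap in whatever band you settle on. Both inputs are assumed to have the same length and sample rate:

```python
import cmath
import math


def dft_magnitudes(samples):
    """Magnitudes of DFT bins 0..n/2 (naive transform, fine for short blocks)."""
    n = len(samples)
    return [abs(sum(samples[t] * cmath.exp(-2j * math.pi * k * t / n)
                    for t in range(n)))
            for k in range(n // 2 + 1)]


def band_profile(samples, rate, low_hz, high_hz):
    """Unit-length vector of the bin magnitudes that fall inside the band."""
    n = len(samples)
    band = [m for k, m in enumerate(dft_magnitudes(samples))
            if low_hz <= k * rate / n <= high_hz]
    norm = math.sqrt(sum(m * m for m in band))
    return [m / norm for m in band] if norm > 0.0 else band


def spectral_similarity(a, b, rate, low_hz=300.0, high_hz=3400.0):
    """Cosine similarity of two in-band spectra; 1.0 means the same shape."""
    pa = band_profile(a, rate, low_hz, high_hz)
    pb = band_profile(b, rate, low_hz, high_hz)
    return sum(x * y for x, y in zip(pa, pb))
```

Since only magnitudes are compared, overall volume drops out. A single spectrum per recording throws away the order of the sounds, though, so for actual words you would probably compare short consecutive blocks rather than one big transform.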
teey
|
|
Werelds
Special Little Man
Posts: 15098
Location: 0100111001001100
|
Posted: Sun, 24th Apr 2011 00:27 Post subject: |
|
 |
Well like I said X_Dror, those 2 classes should be easy enough to at least get some raw data; as long as you make sure you always record at the same codec settings, the samples will be comparable in terms of range and accuracy. The Android framework itself is piss easy to work with, so it really is down to figuring out how to handle the samples.
The hardest part will be to filter out noise, and to determine from which points to compare 2 samples 1 to 1. For the noise filtering, do what mrt suggests: look at standing codecs. GSM is a very, very easy one. G.711, 722 and 729 are the next three "easy" popular ones. All of these are what your phone uses most likely. The papers on them are free to get (if you can't find them, I still have them on my drive somewhere from my telecommunication thing), and it'll give you a lot of insight as to how they get around noise and such. Especially the limited frequency range is important; although it's not 5, but 6 KHz that's commonly used (at least in GSM and G.722 IIRC).
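For the "from which points to compare" problem, one standard trick is to try a range of shifts and keep the one where the two signals line up best (cross-correlation). A minimal Python sketch, with hypothetical function names and a `max_lag` you would choose from how sloppy the trimming is:

```python
def best_lag(a, b, max_lag):
    """Return the shift of `b` relative to `a`, within [-max_lag, max_lag],
    that maximizes the sum of a[i] * b[i + lag]."""
    best = (0, float("-inf"))
    for lag in range(-max_lag, max_lag + 1):
        total = 0.0
        for i in range(len(a)):
            j = i + lag
            if 0 <= j < len(b):
                total += a[i] * b[j]
        if total > best[1]:
            best = (lag, total)
    return best[0]


def align(a, b, max_lag):
    """Cut both signals so that index 0 of each refers to the same moment."""
    lag = best_lag(a, b, max_lag)
    if lag >= 0:
        b = b[lag:]   # b started late: drop its extra lead-in
    else:
        a = a[-lag:]  # a started late: drop its extra lead-in
    n = min(len(a), len(b))
    return a[:n], b[:n]
```

After `align`, the two lists can be compared 1 to 1. This only fixes a constant offset; if the word is spoken faster or slower the second time, a plain shift will not line the whole thing up.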
I'd have a tinker with it, but I'm quite swamped with work at the moment, if you haven't figured something out in a couple of weeks I should have some time to fiddle with it 
|
|
All times are GMT + 1 Hour |