Android - Compare voice recordings and trigger a function
X_Dror




Posts: 4955
Location: Jerusalem, Israel
PostPosted: Sat, 23rd Apr 2011 16:46    Post subject: Android - Compare voice recordings and trigger a function
Hey,

I've got an interesting idea for a cellphone application, and since Android seems like a nice and easy platform for it, I started coding it with the Android SDK.

Anyhow, I'm looking for a way to implement one of the main features of my application.

My application needs to record several voice commands from the user, and then attach each recording to a specific "command" in my application.

Like:
1. Record a command and save it in memory.
2. Prompt the user to say a command.
3. Compare the new command with the recorded one; if they are similar, call function X, if not, call function Y,
etc.
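
A minimal sketch of that flow in plain Java, assuming each recording has already been reduced to a fixed-length feature vector (how to do that is the open question here). `CommandMatcher`, `enroll`, `match` and the distance threshold are made-up names for illustration, not an Android API:

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical helper: maps enrolled recordings (as feature vectors) to command names. */
final class CommandMatcher {
    private final Map<String, double[]> templates = new LinkedHashMap<>();
    private final double maxDistance; // assumed tuning value: largest distance still counted as "similar"

    CommandMatcher(double maxDistance) {
        this.maxDistance = maxDistance;
    }

    /** Step 1: store the features of a recorded command under a name. */
    void enroll(String command, double[] features) {
        templates.put(command, features.clone());
    }

    /** Step 3: return the closest enrolled command, or null if nothing is close enough. */
    String match(double[] features) {
        String best = null;
        double bestDistance = Double.MAX_VALUE;
        for (Map.Entry<String, double[]> e : templates.entrySet()) {
            double d = distance(e.getValue(), features);
            if (d < bestDistance) {
                bestDistance = d;
                best = e.getKey();
            }
        }
        return bestDistance <= maxDistance ? best : null;
    }

    /** Euclidean distance; vectors of different lengths never match. */
    private static double distance(double[] a, double[] b) {
        if (a.length != b.length) return Double.MAX_VALUE;
        double sum = 0;
        for (int i = 0; i < a.length; i++) {
            double diff = a[i] - b[i];
            sum += diff * diff;
        }
        return Math.sqrt(sum);
    }
}
```

In the app, a non-null result would trigger function X for that command, and null would trigger function Y.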

Basically, all I want is to imitate the feature many phones have, where you tag a phone contact with a voice recording and can then call that contact by repeating the recorded name.

All I found so far is a way-over-the-top solution with Google's voice service, which I certainly don't need and don't want.
I don't need fancy speech-to-text functionality, and I don't want to depend on the Internet. The API needs to be strictly local.

Does a "voice comparison" function exist in the Android SDK? If not, can I implement it without too much hassle?

So far no one could help me with this problem... maybe NFOHumpers won't disappoint me :P

Thanks a lot!
garus
VIP Member



Posts: 34200

PostPosted: Sat, 23rd Apr 2011 17:15    Post subject:
snip


Last edited by garus on Tue, 27th Aug 2024 21:50; edited 2 times in total
PumpAction
[Schmadmin]



Posts: 26759

PostPosted: Sat, 23rd Apr 2011 17:19    Post subject:
This is quite a complicated feature you're asking for. Actually, it sounds complicated to me because I haven't done any audio programming until now, but basically I don't think there is anything like it already built in.

If you are going to write it yourself (I don't know how good you are), try downsampling the keyword and storing it at several quality levels. It should be easier to compare that way.

Let's say you have several words and you store all of them at 80%, 55%, 30% and 10% accuracy. You'd first make a quick comparison of all of them at 10% accuracy. The ones that sound alike would then be checked at the other levels. The last one remaining is probably the keyword that matches.
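
A rough sketch of that coarse-to-fine idea in plain Java. It is only one possible reading: "accuracy" is interpreted here as the number of bins a signal is averaged down to, and `CoarseToFine`, the level list and the tolerance are assumptions made for illustration:

```java
import java.util.ArrayList;
import java.util.List;

final class CoarseToFine {
    /** Reduces a signal to `bins` values by averaging the absolute amplitude in each bin. */
    static double[] reduce(double[] signal, int bins) {
        double[] out = new double[bins];
        for (int b = 0; b < bins; b++) {
            int start = (int) ((long) b * signal.length / bins);
            int end = (int) ((long) (b + 1) * signal.length / bins);
            double sum = 0;
            for (int i = start; i < end; i++) sum += Math.abs(signal[i]);
            out[b] = end > start ? sum / (end - start) : 0;
        }
        return out;
    }

    /** Mean absolute difference between two equally long vectors. */
    static double meanAbsDiff(double[] a, double[] b) {
        double sum = 0;
        for (int i = 0; i < a.length; i++) sum += Math.abs(a[i] - b[i]);
        return sum / a.length;
    }

    /**
     * Checks the input against all stored signals, coarsest level first.
     * Returns the indices of the stored signals that survive every level.
     */
    static List<Integer> match(double[] input, List<double[]> stored, int[] levels, double tolerance) {
        List<Integer> candidates = new ArrayList<>();
        for (int i = 0; i < stored.size(); i++) candidates.add(i);
        for (int bins : levels) {
            double[] reducedInput = reduce(input, bins);
            List<Integer> survivors = new ArrayList<>();
            for (int idx : candidates) {
                double diff = meanAbsDiff(reducedInput, reduce(stored.get(idx), bins));
                if (diff <= tolerance) survivors.add(idx);
            }
            candidates = survivors; // only survivors are checked at the next, finer level
        }
        return candidates;
    }
}
```

With levels such as `{2, 4}`, most stored words are thrown out after the cheap 2-bin check, so the finer checks only run on the few that still look alike.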

Similar techniques are used for images, like stripping the color from the image, reducing the resolution and then checking for similarities.

And you have to take into account that the user might be in a noisy environment while recording their voice, so you'd need to block out all frequencies that usually don't belong to human speech (roughly everything outside the ~300-~3400 Hz telephone band). You'd also have to trim everything before and after the keyword, and so on...
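
The trimming part could look something like the sketch below, assuming the samples are already available as doubles. `SilenceTrimmer`, the frame size and the energy threshold are illustrative choices, and a real noisy recording would probably need a smarter threshold than a fixed one:

```java
import java.util.Arrays;

final class SilenceTrimmer {
    /**
     * Drops leading and trailing frames whose RMS energy is below `threshold`.
     * Samples past the last full frame are ignored.
     */
    static double[] trim(double[] samples, int frameSize, double threshold) {
        int frames = samples.length / frameSize;
        int first = -1;
        int last = -1;
        for (int f = 0; f < frames; f++) {
            double sum = 0;
            for (int i = f * frameSize; i < (f + 1) * frameSize; i++) {
                sum += samples[i] * samples[i];
            }
            if (Math.sqrt(sum / frameSize) >= threshold) {
                if (first < 0) first = f; // first loud frame
                last = f;                 // most recent loud frame
            }
        }
        if (first < 0) return new double[0]; // nothing but silence
        return Arrays.copyOfRange(samples, first * frameSize, (last + 1) * frameSize);
    }
}
```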


=> NFOrce GIF plugin <= - Ryzen 3800X, 16GB DDR4-3200, Sapphire 5700XT Pulse
garus
VIP Member



Posts: 34200

PostPosted: Sat, 23rd Apr 2011 17:23    Post subject:
snip


Last edited by garus on Tue, 27th Aug 2024 21:50; edited 1 time in total
X_Dror




Posts: 4955
Location: Jerusalem, Israel
PostPosted: Sat, 23rd Apr 2011 17:49    Post subject:
Voice recognition can indeed be a complicated matter, but the thing is that I don't need speech-to-text like Google's.
I also don't need a big keyword dictionary.

The user can record about 6 unique words, and each command he gives can only be a single word.
I've never written any kind of speech recognition software, and I have no experience analyzing voice data, but I think matching a sound against 6 different audio files doesn't have to be that difficult.

Another reason I don't see why this function wouldn't already exist on Android is that many older phones had it: you could put "voice tags" on phone contacts, and then automatically call them by repeating the recorded voice tag.

This is the kind of functionality I need, and not Google's advanced speech-to-text.

If this functionality doesn't exist on Android, then maybe I could use an existing implementation and somehow port it to Android.

Thanks for the comments so far!
PumpAction
[Schmadmin]



Posts: 26759

PostPosted: Sat, 23rd Apr 2011 17:56    Post subject:
@Garus: As you seem to know this subject -> is sound information stored three-dimensionally (time, frequency, volume)?


garus
VIP Member



Posts: 34200

PostPosted: Sat, 23rd Apr 2011 18:01    Post subject:
snip


Last edited by garus on Tue, 27th Aug 2024 21:50; edited 2 times in total
PumpAction
[Schmadmin]



Posts: 26759

PostPosted: Sat, 23rd Apr 2011 18:06    Post subject:
Windows 7 speech recognition doesn't work for me, and I went through all the tutorials and let that shit listen to my voice OVER and OVER! It doesn't even accept the most fundamental commands :( Just stuff like "Press enter", "Press 1" and shit :(


PumpAction
[Schmadmin]



Posts: 26759

PostPosted: Sat, 23rd Apr 2011 18:07    Post subject:
Volume must be stored too, no?


garus
VIP Member



Posts: 34200

PostPosted: Sat, 23rd Apr 2011 18:16    Post subject:
snip


Last edited by garus on Tue, 27th Aug 2024 21:50; edited 1 time in total
BearishSun




Posts: 4484

PostPosted: Sat, 23rd Apr 2011 18:31    Post subject:
I'm not well informed in this area but I think the approach would be:

Filter the signal by cutting off all non-voice frequencies
Filter and resample the signal with some convolution kernel, such as a Gaussian, to reduce noise
Compare the frequency in each sample to the original sample and, based on how close/far they are, assign a score to the sample. Keep a total score count.
(Actually, first you should compare smaller pieces of the input signal with the stored one, and search the entire stored signal to find a starting point. From there you can do a proper comparison like I mention above)
Normalize the score by weighting it by frequency range and sample count

If score is above some limit you have a match, if below you don't.
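
A small sketch of the "find a starting point, score, normalize, apply a limit" part. It is not exactly the scheme above: it compares the values of two sequences directly (raw samples, or any per-frame feature such as an energy envelope) using normalized correlation, and it tries a range of shifts rather than searching with small pieces. `AlignedScore`, `maxLag` and `limit` are illustrative names and parameters:

```java
final class AlignedScore {
    /** Best normalized correlation of `input` against `stored` over all shifts in [-maxLag, maxLag]. */
    static double bestScore(double[] stored, double[] input, int maxLag) {
        double best = -1.0;
        for (int lag = -maxLag; lag <= maxLag; lag++) {
            best = Math.max(best, correlationAt(stored, input, lag));
        }
        return best;
    }

    /**
     * Normalized correlation, in [-1, 1], of the overlapping part when `input` is shifted by `lag`.
     * Dividing by the energies makes the score independent of overall volume.
     * Note: very small overlaps can score high by accident, so keep maxLag modest.
     */
    static double correlationAt(double[] stored, double[] input, int lag) {
        double dot = 0;
        double energyStored = 0;
        double energyInput = 0;
        for (int i = 0; i < stored.length; i++) {
            int j = i + lag;
            if (j < 0 || j >= input.length) continue; // outside the overlap
            dot += stored[i] * input[j];
            energyStored += stored[i] * stored[i];
            energyInput += input[j] * input[j];
        }
        if (energyStored == 0 || energyInput == 0) return 0;
        return dot / Math.sqrt(energyStored * energyInput);
    }

    /** A match is declared when the best aligned score reaches the limit. */
    static boolean isMatch(double[] stored, double[] input, int maxLag, double limit) {
        return bestScore(stored, input, maxLag) >= limit;
    }
}
```

Where to put the limit is something that would have to be found by experiment with real recordings.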

That's how I would start with the issue, although I don't know shit about audio processing really.
me7




Posts: 3936

PostPosted: Sat, 23rd Apr 2011 18:48    Post subject:
PumpAction wrote:
Volume must be stored too, no?


From what I know, uncompressed audio is (usually) stored as a sequence of integers that are used to reconstruct the sound wave. Audio CDs store 16-bit integers 44,100 times per second (16-bit, 44.1 kHz) and make a wave out of them during playback. Volume is implied by the amplitude of the wave.
I don't know whether you'd call it one- or two-dimensional; I'd say it's one-dimensional.
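
To make that concrete, here is a small sketch that turns raw 16-bit PCM bytes into samples and derives a "volume" figure from their amplitude. It assumes mono, signed, little-endian data (as in WAV files); `Pcm16` is just an illustrative name:

```java
final class Pcm16 {
    /** Decodes little-endian signed 16-bit mono PCM bytes into samples (a trailing odd byte is ignored). */
    static short[] decode(byte[] pcm) {
        short[] samples = new short[pcm.length / 2];
        for (int i = 0; i < samples.length; i++) {
            int lo = pcm[2 * i] & 0xFF; // low byte, treated as unsigned
            int hi = pcm[2 * i + 1];    // high byte, keeps its sign
            samples[i] = (short) ((hi << 8) | lo);
        }
        return samples;
    }

    /** RMS amplitude: a single number summarising how loud a stretch of samples is. */
    static double rms(short[] samples) {
        if (samples.length == 0) return 0;
        double sum = 0;
        for (short s : samples) sum += (double) s * s;
        return Math.sqrt(sum / samples.length);
    }
}
```

So the data itself is just one value per time step; volume and frequency content are both derived from that sequence.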
Werelds
Special Little Man



Posts: 15098
Location: 0100111001001100
PostPosted: Sat, 23rd Apr 2011 19:06    Post subject:
That's the codec, has nothing to do with the actual signal. The only thing that might happen, is that the codec of choice is too inaccurate to deal with a specific signal, and the decoded signal turns out different than the original; that's what happens with the GSM codec for example, one of the best examples of the compromise between a signal and its digital representation.

What garus said is true: you won't be able to do this by yourself. You have to account for differences in both frequency (pitch) and amplitude (volume) between voices. That's the "easy" part. The hard part is pronunciation and timing, which differ enormously between accents; and that is just taking English as an example. Within England alone (never mind the US) you easily have 50 accents, in each of which some letter is pronounced radically differently from the others. Throw the US, Canada and India into the mix as other "native" English-speaking countries. Done with that? Fine, now you can go deal with the rest of the world.

Honestly, you're underestimating the work that goes into this, there's a reason why it's only been in recent years that there's been software that *actually* works pretty well. I haven't even begun to discuss things like intonation.
BearishSun




Posts: 4484

PostPosted: Sat, 23rd Apr 2011 19:09    Post subject:
You don't need most of those things for just comparing two signals...which is what he wants. He doesn't need to actually recognize the words spoken.
PumpAction
[Schmadmin]



Posts: 26759

PostPosted: Sat, 23rd Apr 2011 19:18    Post subject:
As bearish said, as far as I understand it, the user himself records the voices. It's not a check against any preexisting recording :)


Werelds
Special Little Man



Posts: 15098
Location: 0100111001001100
PostPosted: Sat, 23rd Apr 2011 19:24    Post subject:
Well, in that case, AudioRecord should do the trick for capturing (with AudioTrack if you want to play the recordings back).

Record a command and save the raw data to a file. Then, for activation, read the new input into a byte array using AudioRecord, load the saved file into another byte array, and then compare the 2 arrays.

The hard part will still be determining a "good enough" match rate between the two, taking background noise and such into account.
X_Dror




Posts: 4955
Location: Jerusalem, Israel
PostPosted: Sat, 23rd Apr 2011 21:02    Post subject:
Thanks for all the comments guys!
At least I'm beginning to gather some useful information from this discussion.
I've been looking around, searching for more possible solutions, and I found out that this feature is in real demand.

It seems like this guy was able to pull this off quite efficiently -
http://www.androidzoom.com/android_applications/tools/voice-speed-dial_taaz.html

And it's only 400 KB.
Now I only need the algorithm :P

I assume it will be relatively easy to use the AudioTrack and AudioRecord functionality, but indeed my main problem is analyzing and matching the sound samples.

If only I could get a ready-made algorithm of some sort. I bet there are a few of them floating around the web, even if they are not in Java.
garus
VIP Member



Posts: 34200

PostPosted: Sat, 23rd Apr 2011 21:04    Post subject:
snip


Last edited by garus on Tue, 27th Aug 2024 21:50; edited 1 time in total
X_Dror




Posts: 4955
Location: Jerusalem, Israel
PostPosted: Sat, 23rd Apr 2011 21:08    Post subject:
Yea lol. :P

Though I still think it's possible for me to do something similar, I just need to find an example of some sort, maybe a more detailed technique of how to do it.

Or maybe I can pay a freelancer to do it for me. If you guys know a place where I can find someone willing to do it (for money, of course), that may also be a possible solution :)
[mrt]
[Admin] Code Monkey



Posts: 1338

PostPosted: Sat, 23rd Apr 2011 23:31    Post subject:
Maybe if i drop my cookie..

If I were going down that road and trying to compare recorded voices, I would definitely look into FFT and filter out any frequencies outside roughly 1 to 5 kHz. That should be the sweet spot for human speech, although you should look up the existing telephone line standards: their pass-band is optimized for human speech, so they should have the exact numbers. Then, once you've broken the signal down into its frequency components, try a "fuzzy" match against all known recordings. You will need to compare each individual sample in a recording in order to do that (the whole time domain).

I have no idea how accurate this method could be, but it's certainly better than just comparing carrots and bytes.
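
As a rough illustration of the frequency-domain idea: the sketch below computes the magnitude spectrum of one frame inside a chosen band and compares two spectra with cosine similarity as the "fuzzy" match. It uses a naive DFT to stay short, where a real app would use an FFT, and `BandSpectrum`, the band limits and the similarity measure are all assumptions for illustration:

```java
final class BandSpectrum {
    /** Test-signal helper: a pure sine tone. */
    static double[] sine(double freqHz, double sampleRate, int n, double amplitude) {
        double[] x = new double[n];
        for (int t = 0; t < n; t++) {
            x[t] = amplitude * Math.sin(2 * Math.PI * freqHz * t / sampleRate);
        }
        return x;
    }

    /** Magnitudes of the DFT bins of `frame` that lie between lowHz and highHz (naive O(n^2) DFT). */
    static double[] magnitudes(double[] frame, double sampleRate, double lowHz, double highHz) {
        int n = frame.length;
        int firstBin = (int) Math.ceil(lowHz * n / sampleRate);
        int lastBin = Math.min((int) Math.floor(highHz * n / sampleRate), n / 2);
        double[] mags = new double[Math.max(0, lastBin - firstBin + 1)];
        for (int k = firstBin; k <= lastBin; k++) {
            double re = 0;
            double im = 0;
            for (int t = 0; t < n; t++) {
                double angle = 2 * Math.PI * k * t / n;
                re += frame[t] * Math.cos(angle);
                im -= frame[t] * Math.sin(angle);
            }
            mags[k - firstBin] = Math.hypot(re, im);
        }
        return mags;
    }

    /** Cosine similarity of two spectra: 1 means same shape regardless of loudness, 0 means nothing in common. */
    static double cosineSimilarity(double[] a, double[] b) {
        double dot = 0;
        double normA = 0;
        double normB = 0;
        for (int i = 0; i < Math.min(a.length, b.length); i++) {
            dot += a[i] * b[i];
            normA += a[i] * a[i];
            normB += b[i] * b[i];
        }
        if (normA == 0 || normB == 0) return 0;
        return dot / Math.sqrt(normA * normB);
    }
}
```

For a whole word, this would be repeated frame by frame over the recording, and the per-frame similarities combined into one score.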


teey
Werelds
Special Little Man



Posts: 15098
Location: 0100111001001100
PostPosted: Sun, 24th Apr 2011 00:27    Post subject:
Well, like I said, X_Dror, those 2 classes should make it easy enough to at least get some raw data; as long as you make sure you always record with the same codec settings, the samples will be comparable in terms of range and accuracy. The Android framework itself is piss easy to work with, so it really is down to figuring out how to handle the samples.

The hardest part will be filtering out noise, and determining from which points to compare 2 samples 1 to 1. For the noise filtering, do what mrt suggests: look at existing codecs. GSM is a very, very easy one. G.711, G.722 and G.729 are the next three "easy" popular ones. Your phone most likely uses some of these. The papers on them are free to get (if you can't find them, I still have them on my drive somewhere from my telecommunication thing), and they'll give you a lot of insight into how they get around noise and such. The limited frequency range is especially important, although IIRC the upper limit isn't 5 kHz: narrowband codecs like GSM and G.711 pass roughly 300 Hz to 3.4 kHz, while wideband G.722 goes up to about 7 kHz.
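
Limiting the frequency range can be tried with something as simple as the sketch below: a first-order high-pass followed by a first-order low-pass. This is not how any of the codecs above do it, just a gentle band-pass for experimenting. `BandPass` is an illustrative name, and the 300 to 3400 Hz band used in the usage note is the classic telephone band, taken here as an assumption:

```java
final class BandPass {
    /** First-order (RC-style) high-pass: attenuates content below cutoffHz. */
    static double[] highPass(double[] x, double sampleRate, double cutoffHz) {
        double rc = 1.0 / (2 * Math.PI * cutoffHz);
        double dt = 1.0 / sampleRate;
        double alpha = rc / (rc + dt);
        double[] y = new double[x.length];
        if (x.length == 0) return y;
        y[0] = x[0];
        for (int i = 1; i < x.length; i++) {
            y[i] = alpha * (y[i - 1] + x[i] - x[i - 1]);
        }
        return y;
    }

    /** First-order (RC-style) low-pass: attenuates content above cutoffHz. */
    static double[] lowPass(double[] x, double sampleRate, double cutoffHz) {
        double rc = 1.0 / (2 * Math.PI * cutoffHz);
        double dt = 1.0 / sampleRate;
        double alpha = dt / (rc + dt);
        double[] y = new double[x.length];
        if (x.length == 0) return y;
        y[0] = alpha * x[0];
        for (int i = 1; i < x.length; i++) {
            y[i] = y[i - 1] + alpha * (x[i] - y[i - 1]);
        }
        return y;
    }

    /** Keeps roughly the band between lowHz and highHz. */
    static double[] bandPass(double[] x, double sampleRate, double lowHz, double highHz) {
        return lowPass(highPass(x, sampleRate, lowHz), sampleRate, highHz);
    }

    /** RMS level, handy for checking how much a filter attenuates a test tone. */
    static double rms(double[] x) {
        if (x.length == 0) return 0;
        double sum = 0;
        for (double v : x) sum += v * v;
        return Math.sqrt(sum / x.length);
    }
}
```

Calling `BandPass.bandPass(samples, 8000, 300, 3400)` on a recording should noticeably weaken low-frequency hum while leaving a 1 kHz tone mostly intact. First-order filters roll off slowly, so steeper filters would likely be needed for really noisy input.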

I'd have a tinker with it, but I'm quite swamped with work at the moment; if you haven't figured something out in a couple of weeks, I should have some time to fiddle with it :)
All times are GMT + 1 Hour
NFOHump.com Forum Index - Programmers Corner