I’ve been playing around with Whisper, “a general-purpose speech recognition model.” You can install it in one line:
pip install git+https://github.com/openai/whisper.git
It can be used as a Python library, but it’s also usable as a command-line program. The following line takes a file, music.mp3, and outputs a music.mp3.txt and a music.mp3.vtt file containing the transcription:
whisper music.mp3 --model base.en
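For the Python library route, here is a rough sketch of the same transcription. The whisper.load_model and model.transcribe calls come from the Whisper README; the helper names (vtt_timestamp, segments_to_vtt, transcribe_to_vtt) are my own, and the subtitle layout is a simplified assumption about what a .vtt file looks like, not a copy of what the command-line tool writes.

```python
def vtt_timestamp(seconds: float) -> str:
    """Format a time in seconds as a WebVTT-style HH:MM:SS.mmm timestamp."""
    ms = round(seconds * 1000)
    h, ms = divmod(ms, 3_600_000)
    m, ms = divmod(ms, 60_000)
    s, ms = divmod(ms, 1000)
    return f"{h:02d}:{m:02d}:{s:02d}.{ms:03d}"


def segments_to_vtt(segments) -> str:
    """Turn a list of {'start', 'end', 'text'} dicts into simplified WebVTT text."""
    lines = ["WEBVTT", ""]
    for seg in segments:
        lines.append(f"{vtt_timestamp(seg['start'])} --> {vtt_timestamp(seg['end'])}")
        lines.append(seg["text"].strip())
        lines.append("")  # blank line between cues
    return "\n".join(lines)


def transcribe_to_vtt(path: str, model_name: str = "base.en") -> str:
    """Transcribe an audio file with Whisper and return subtitle text."""
    # Imported here so the helpers above still work without Whisper installed.
    import whisper

    model = whisper.load_model(model_name)
    result = model.transcribe(path)
    # result["text"] holds the full transcript; result["segments"] holds timed pieces.
    return segments_to_vtt(result["segments"])
```

Calling transcribe_to_vtt("music.mp3") should then give roughly what ends up in the .vtt file, assuming the package and its dependencies are installed.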
There are several models to choose from with varying VRAM requirements, and model.en variants of most of them for better performance with English. If you have an Nvidia GPU, definitely give it a try. It’s impressive how good it is with English.
Especially exciting to me is that the model can do other languages and translate to English as well. The translation part doesn’t work the best yet, but someday this will be a valid way of getting subtitles for any media in another language:
whisper japanese_cartoons.mkv --language Japanese --task translate --model base
The translation stuff is really cool, but it’s the ability of a computer to look at audio and transcribe it into text that made me wonder: what is a language, anyway?
In particular, the whisper AI made me question whether language is something that can be quantified such that a computer can understand it perfectly. For instance, you can feed the model some people with strange accents and it will oftentimes get the transcription correct, but for fun I tried giving it an English Vocaloid song [watch?v=vW9_5giCK1I] and the model did horribly, only getting a few sentences in the whole song correct. It made me wonder if the human speech patterns the AI was trained on have some quality that the highly synthesized sound of Vocaloids is lacking.
Here’s one definition of the word language:
Language, n. Communication of thoughts and feelings through a system of arbitrary signals, such as voice sounds, gestures, or written symbols.
And here’s another:
Language, n. Such a system as used by a nation, people, or other distinct community; often contrasted with dialect.
In the first definition, the usage of the term “arbitrary” is incredibly interesting to me. Can I say uuga booga duuga, and that’s language? No one would know what it means. I would have to ascribe a meaning to the phrase, but does just me knowing the phrase’s meaning imply I’ve created my own language? The second definition implies a different understanding of language: a medium of communication officially established and widely understood by some community. This sounds like something an AI or computer would someday be capable of understanding, but I’m not sure if language as some sort of arbitrary, non-standardized method of communication can ever be understood.
-.- .- --. .- -- .. / .. ... / -... . ... - / --. .. .-. .-..
Oh, sorry, I was writing in Morse code, the standardized method of communication for telegrams. At least, I assume that’s what it is since I used a Morse code translator online to get it, a translator which uses a computer to read in English text and output the equivalent Morse code. Arbitrary symbols are easy for humans to create to represent all kinds of things.
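To show how little is going on inside a translator like that, here is a minimal sketch of one. The function names (to_morse, from_morse) are made up for illustration, and the table only covers the letters A to Z, so digits and punctuation would raise a KeyError.

```python
# International Morse code for the letters A-Z only (digits/punctuation omitted).
MORSE = {
    "A": ".-", "B": "-...", "C": "-.-.", "D": "-..", "E": ".", "F": "..-.",
    "G": "--.", "H": "....", "I": "..", "J": ".---", "K": "-.-", "L": ".-..",
    "M": "--", "N": "-.", "O": "---", "P": ".--.", "Q": "--.-", "R": ".-.",
    "S": "...", "T": "-", "U": "..-", "V": "...-", "W": ".--", "X": "-..-",
    "Y": "-.--", "Z": "--..",
}


def to_morse(text: str) -> str:
    """Encode text: letters separated by spaces, words separated by ' / '."""
    words = text.upper().split()
    return " / ".join(" ".join(MORSE[ch] for ch in word) for word in words)


def from_morse(code: str) -> str:
    """Decode the format produced by to_morse back into upper-case text."""
    reverse = {symbols: letter for letter, symbols in MORSE.items()}
    return " ".join(
        "".join(reverse[symbols] for symbols in word.split())
        for word in code.split(" / ")
    )


print(to_morse("sos"))
```

It is a dictionary lookup and some string joining. The program has no idea what the message says, which is sort of the point.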
Just look at math: ∃x ∈ {0,1} -> x=0 ∨ x=1. In English, this means “there exists x, an element of the set {0, 1}, which implies x is equal to 0 or x is equal to 1.” Been a while since I’ve written in that notation so the right side of the implication might not make any sense, but this is language, a mathematical language of getting thoughts about math across to other mathematicians.
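If the intended claim was that every element of the set {0, 1} is either 0 or 1 (that is my assumption about what the line above is getting at), one well-formed way to write it swaps the existential quantifier for a universal one:

```latex
\forall x \, \bigl( x \in \{0, 1\} \rightarrow (x = 0 \lor x = 1) \bigr)
```

Read aloud: “for all x, if x is an element of the set {0, 1}, then x is equal to 0 or x is equal to 1.”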
Programming languages are languages too. From obvious stuff like:
/* C */
#include <stdio.h>
int main() {
    printf("Hello World\n");
}
To more unfamiliar languages solving less trivial problems like:
/* Prolog */
ackermann(0, N, X) :- X is N+1, !.
ackermann(M, 0, X) :- !, M > 0, M1 is M-1, ackermann(M1, 1, X).
ackermann(M, N, X) :- N > 0, M > 0, M1 is M-1, N1 is N-1, ackermann(M, N1, Y), ackermann(M1, Y, X).
Programming languages in particular are interesting because while they are designed so that other programmers can understand the code, their initial and still primary intent is to provide instructions to computers. In contrast, it must be much harder for computers to parse the English language.
There are also languages like Elvish. J.R.R. Tolkien played around with artificially made languages all his childhood, which eventually led him to make Elvish. Elvish uses its own unique symbols, see here for more on the subject. He gave that language meaning by building the characters and words himself, and then spreading the language through his books.
That all said, whisper in its current state cannot “understand” language in the truest sense. While I understand certain Google employees who want some media publicity like to claim their AI is sentient, there is a fine line between being able to translate spoken text into its written equivalent and being able to discern the meaning of that text. The “thoughts and feelings” that language is meant to portray are lost upon a computer, and there’s no real way for that problem to be solved anytime soon.
Even with programming languages, the computer doesn’t know the end goal of what it is being instructed to do. For instance, the computer doesn’t understand that the previously mentioned C code is a hello world program; rather, the computer sees it as storing certain values in certain registers and computing certain values. In case you don’t believe me, here is the output of gcc -S hello.c, which converts the C code into assembly:
.file "hello.c"
.text
.section .rodata
.LC0:
.string "Hello World"
.text
.globl main
.type main, @function
main:
.LFB0:
.cfi_startproc
pushq %rbp
.cfi_def_cfa_offset 16
.cfi_offset 6, -16
movq %rsp, %rbp
.cfi_def_cfa_register 6
leaq .LC0(%rip), %rax
movq %rax, %rdi
call puts@PLT
movl $0, %eax
popq %rbp
.cfi_def_cfa 7, 8
ret
.cfi_endproc
.LFE0:
.size main, .-main
.ident "GCC: (GNU) 12.2.0"
.section .note.GNU-stack,"",@progbits
Beneath that main: line, you can see that the computer only gets a string from us (“Hello World”, stored at .LC0) and the location of the function it should call. That call line calls puts rather than printf: because the string is a constant ending in a newline, GCC swaps the printf call for a single call to puts, which appends the newline itself. The point being: there is no implicit understanding by the computer of the higher-level operation that we are performing. The computer just moves stuff around in memory. If an AI were to become sentient, I’d imagine it would first understand the higher-level purpose of this code before it could even consider trying to understand the thoughts and emotions of English literature or speech.
GitHub Copilot is a step closer to this, but it’s still not understanding the code, merely providing the code that is likely to come next based on its analysis of billions of other lines of code.
In fact, I might have been too generous to the computer. Really, the output of gcc -S hello.c just shows the assembly code, not what the computer understands. What the computer sees is too long to put here, but here are the first 90 lines of the compiled binary as printed by xxd:
00000000: 7f45 4c46 0201 0100 0000 0000 0000 0000 .ELF............
00000010: 0300 3e00 0100 0000 4010 0000 0000 0000 ..>.....@.......
00000020: 4000 0000 0000 0000 0847 0000 0000 0000 @........G......
00000030: 0000 0000 4000 3800 0d00 4000 2500 2400 ....@.8...@.%.$.
00000040: 0600 0000 0400 0000 4000 0000 0000 0000 ........@.......
00000050: 4000 0000 0000 0000 4000 0000 0000 0000 @.......@.......
00000060: d802 0000 0000 0000 d802 0000 0000 0000 ................
00000070: 0800 0000 0000 0000 0300 0000 0400 0000 ................
00000080: 1803 0000 0000 0000 1803 0000 0000 0000 ................
00000090: 1803 0000 0000 0000 1c00 0000 0000 0000 ................
000000a0: 1c00 0000 0000 0000 0100 0000 0000 0000 ................
000000b0: 0100 0000 0400 0000 0000 0000 0000 0000 ................
000000c0: 0000 0000 0000 0000 0000 0000 0000 0000 ................
000000d0: 3006 0000 0000 0000 3006 0000 0000 0000 0.......0.......
000000e0: 0010 0000 0000 0000 0100 0000 0500 0000 ................
000000f0: 0010 0000 0000 0000 0010 0000 0000 0000 ................
00000100: 0010 0000 0000 0000 6101 0000 0000 0000 ........a.......
00000110: 6101 0000 0000 0000 0010 0000 0000 0000 a...............
00000120: 0100 0000 0400 0000 0020 0000 0000 0000 ......... ......
00000130: 0020 0000 0000 0000 0020 0000 0000 0000 . ....... ......
00000140: b400 0000 0000 0000 b400 0000 0000 0000 ................
00000150: 0010 0000 0000 0000 0100 0000 0600 0000 ................
00000160: d02d 0000 0000 0000 d03d 0000 0000 0000 .-.......=......
00000170: d03d 0000 0000 0000 4802 0000 0000 0000 .=......H.......
00000180: 5002 0000 0000 0000 0010 0000 0000 0000 P...............
00000190: 0200 0000 0600 0000 e02d 0000 0000 0000 .........-......
000001a0: e03d 0000 0000 0000 e03d 0000 0000 0000 .=.......=......
000001b0: e001 0000 0000 0000 e001 0000 0000 0000 ................
000001c0: 0800 0000 0000 0000 0400 0000 0400 0000 ................
000001d0: 3803 0000 0000 0000 3803 0000 0000 0000 8.......8.......
000001e0: 3803 0000 0000 0000 4000 0000 0000 0000 8.......@.......
000001f0: 4000 0000 0000 0000 0800 0000 0000 0000 @...............
00000200: 0400 0000 0400 0000 7803 0000 0000 0000 ........x.......
00000210: 7803 0000 0000 0000 7803 0000 0000 0000 x.......x.......
00000220: 4400 0000 0000 0000 4400 0000 0000 0000 D.......D.......
00000230: 0400 0000 0000 0000 53e5 7464 0400 0000 ........S.td....
00000240: 3803 0000 0000 0000 3803 0000 0000 0000 8.......8.......
00000250: 3803 0000 0000 0000 4000 0000 0000 0000 8.......@.......
00000260: 4000 0000 0000 0000 0800 0000 0000 0000 @...............
00000270: 50e5 7464 0400 0000 1020 0000 0000 0000 P.td..... ......
00000280: 1020 0000 0000 0000 1020 0000 0000 0000 . ....... ......
00000290: 2400 0000 0000 0000 2400 0000 0000 0000 $.......$.......
000002a0: 0400 0000 0000 0000 51e5 7464 0600 0000 ........Q.td....
000002b0: 0000 0000 0000 0000 0000 0000 0000 0000 ................
000002c0: 0000 0000 0000 0000 0000 0000 0000 0000 ................
000002d0: 0000 0000 0000 0000 1000 0000 0000 0000 ................
000002e0: 52e5 7464 0400 0000 d02d 0000 0000 0000 R.td.....-......
000002f0: d03d 0000 0000 0000 d03d 0000 0000 0000 .=.......=......
00000300: 3002 0000 0000 0000 3002 0000 0000 0000 0.......0.......
00000310: 0100 0000 0000 0000 2f6c 6962 3634 2f6c ......../lib64/l
00000320: 642d 6c69 6e75 782d 7838 362d 3634 2e73 d-linux-x86-64.s
00000330: 6f2e 3200 0000 0000 0400 0000 3000 0000 o.2.........0...
00000340: 0500 0000 474e 5500 0280 00c0 0400 0000 ....GNU.........
00000350: 0100 0000 0000 0000 0100 01c0 0400 0000 ................
00000360: 0100 0000 0000 0000 0200 01c0 0400 0000 ................
00000370: 0000 0000 0000 0000 0400 0000 1400 0000 ................
00000380: 0300 0000 474e 5500 0585 b53f f43b 4d7c ....GNU....?.;M|
00000390: 511e 3f20 3460 c080 9314 6302 0400 0000 Q.? 4`....c.....
000003a0: 1000 0000 0100 0000 474e 5500 0000 0000 ........GNU.....
000003b0: 0400 0000 0400 0000 0000 0000 0000 0000 ................
000003c0: 0100 0000 0100 0000 0100 0000 0000 0000 ................
000003d0: 0000 0000 0000 0000 0000 0000 0000 0000 ................
000003e0: 0000 0000 0000 0000 0000 0000 0000 0000 ................
000003f0: 0000 0000 0000 0000 0600 0000 1200 0000 ................
00000400: 0000 0000 0000 0000 0000 0000 0000 0000 ................
00000410: 4800 0000 2000 0000 0000 0000 0000 0000 H... ...........
00000420: 0000 0000 0000 0000 0100 0000 1200 0000 ................
00000430: 0000 0000 0000 0000 0000 0000 0000 0000 ................
00000440: 6400 0000 2000 0000 0000 0000 0000 0000 d... ...........
00000450: 0000 0000 0000 0000 7300 0000 2000 0000 ........s... ...
00000460: 0000 0000 0000 0000 0000 0000 0000 0000 ................
00000470: 1800 0000 2200 0000 0000 0000 0000 0000 ...."...........
00000480: 0000 0000 0000 0000 0070 7574 7300 5f5f .........puts.__
00000490: 6c69 6263 5f73 7461 7274 5f6d 6169 6e00 libc_start_main.
000004a0: 5f5f 6378 615f 6669 6e61 6c69 7a65 006c __cxa_finalize.l
000004b0: 6962 632e 736f 2e36 0047 4c49 4243 5f32 ibc.so.6.GLIBC_2
000004c0: 2e32 2e35 0047 4c49 4243 5f32 2e33 3400 .2.5.GLIBC_2.34.
000004d0: 5f49 544d 5f64 6572 6567 6973 7465 7254 _ITM_deregisterT
000004e0: 4d43 6c6f 6e65 5461 626c 6500 5f5f 676d MCloneTable.__gm
000004f0: 6f6e 5f73 7461 7274 5f5f 005f 4954 4d5f on_start__._ITM_
00000500: 7265 6769 7374 6572 544d 436c 6f6e 6554 registerTMCloneT
00000510: 6162 6c65 0000 0000 0200 0100 0300 0100 able............
00000520: 0100 0300 0000 0000 0100 0200 2700 0000 ............'...
00000530: 1000 0000 0000 0000 751a 6909 0000 0300 ........u.i.....
00000540: 3100 0000 1000 0000 b491 9606 0000 0200 1...............
00000550: 3d00 0000 0000 0000 d03d 0000 0000 0000 =........=......
00000560: 0800 0000 0000 0000 3011 0000 0000 0000 ........0.......
00000570: d83d 0000 0000 0000 0800 0000 0000 0000 .=..............
00000580: e010 0000 0000 0000 1040 0000 0000 0000 .........@......
00000590: 0800 0000 0000 0000 1040 0000 0000 0000 .........@......
The far right column is xxd showing the same bytes as ASCII text (with a dot for anything unprintable) and the far left column is the byte offset into the file, so it’s really the middle columns which show what the computer sees. As we can see, it isn’t something we can really call understanding. It is much later in the file that we actually find where our string is stored:
00002000: 0100 0200 4865 6c6c 6f20 576f 726c 6400 ....Hello World.
So, I for one do not believe 0100 0200 4865 6c6c 6f20 576f 726c 6400 shows that the computer understands this program is printing out hello world, or that the computer even understands what the string “Hello World” means, but you’re welcome to interpret this as you please.
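If you want to check that those hex digits really are our string, here is a small sketch that decodes them by hand. The printable helper is my own rough imitation of xxd’s right-hand column, not xxd’s actual code.

```python
# The middle columns of the xxd line at offset 00002000, copied verbatim.
hex_columns = "0100 0200 4865 6c6c 6f20 576f 726c 6400"

# Two hex digits make one byte, so this gives 16 raw bytes.
raw = bytes.fromhex(hex_columns.replace(" ", ""))


def printable(data: bytes) -> str:
    """Show printable ASCII bytes as characters and everything else as '.'."""
    return "".join(chr(b) if 32 <= b < 127 else "." for b in data)


# The string starts after the first four bytes and runs until the zero byte
# that C uses to mark the end of a string.
end = raw.index(0, 4)
message = raw[4:end].decode("ascii")

print(printable(raw))
print(message)
```

The “text” is just numbers: 0x48 is H, 0x65 is e, and so on, with a zero byte to say where it stops.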
Overall, perhaps it would be more accurate to say that computers are capable of parsing a subset of languages (e.g. English or Spanish), and even then they would only be able to understand a smaller subset (a sub-subset, if you will) of the dialects in each language. As we’ve learned through testing, the whisper AI is not able to understand the Hatsune Miku dialect of English, but it seems serviceable for Appalachian accents or a New York accent.
I hope I’ve made clear the difference between transcribing a language and understanding a language, and helped dispel the “AI is sentient” craze that has been taking over mass media the past several months. I don’t believe we’ll see a computer that truly understands human language in our lifetimes, but I’m hopeful the transcription and translation get good enough that I can watch untranslated anime without needing to learn Japanese first.
Site last updated: 2024-10-29