TED Talks Download Subtitles
Date: January 5th, 2010 | Category: Internet | 13 Comments
UPDATE: There is now an online version.
This is what I’ve been working on today. It’s a simple console-based script to download subtitles for TED Talks, since I haven’t found a way to download them directly from the web in a compatible format (I generally use ‘.srt’ subtitles). Here is the script, written in Python: TEDTalkSubtitles.py
Key parts of the program:
A simple function to format a value in milliseconds as something like "00:34:32,334":

def getFormatedTime(intvalue):
    # intvalue is a time in milliseconds; // keeps every division integral
    mils = intvalue % 1000
    segs = (intvalue // 1000) % 60
    mins = (intvalue // 60000) % 60
    hors = intvalue // 3600000
    return "%02d:%02d:%02d,%03d" % (hors, mins, segs, mils)
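As a quick check of the arithmetic, here is a minimal sketch of the same conversion written with divmod. The name format_srt_time is made up for this example and is not part of the script; it assumes a non-negative integer number of milliseconds.

```python
def format_srt_time(total_ms):
    """Format a non-negative millisecond count as an SRT timestamp."""
    # Peel off milliseconds, then seconds, then minutes; hours keep the rest.
    seconds, millis = divmod(total_ms, 1000)
    minutes, seconds = divmod(seconds, 60)
    hours, minutes = divmod(minutes, 60)
    return "%02d:%02d:%02d,%03d" % (hours, minutes, seconds, millis)


# 34 minutes, 32 seconds and 334 milliseconds
print(format_srt_time(34 * 60000 + 32 * 1000 + 334))  # prints 00:34:32,334
```

Note that hours are not wrapped at 24, which matches the function above: a very long offset simply produces a larger hour field.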
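This recursive function fetches the languages available for the talk:

import re

def availableSubs(subs):
    # Find the next "LanguageCode" marker; stop when there are none left
    a = subs.find("LanguageCode")
    if a == -1:
        return []
    subs = subs[a+len("LanguageCode"):]
    # The code is the first %22-quoted value (%22 is a URL-encoded quote)
    return [re.search("%22([^A-Z]+)%22", subs).group(1)] + availableSubs(subs)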
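The same scan can be done with a loop instead of recursion, which avoids re-slicing the string on every call. This is only a sketch: the name available_subs and the sample string are invented for illustration, and the sample merely assumes the page embeds URL-encoded JSON with uppercase percent escapes. Unlike the function above, it skips a marker with no quoted value after it instead of raising an error.

```python
import re

# Same pattern as above: a %22-quoted value with no uppercase letters in it.
CODE_RE = re.compile(r"%22([^A-Z]+)%22")
MARKER = "LanguageCode"


def available_subs(subs):
    """Collect the quoted value that follows each "LanguageCode" marker."""
    codes = []
    pos = subs.find(MARKER)
    while pos != -1:
        start = pos + len(MARKER)
        match = CODE_RE.search(subs, start)
        if match:
            codes.append(match.group(1))
        pos = subs.find(MARKER, start)
    return codes


# Hypothetical input, not copied from a real TED page.
sample = "%7B%22LanguageCode%22%3A%22en%22%2C%22LanguageCode%22%3A%22pt-br%22%7D"
print(available_subs(sample))  # prints ['en', 'pt-br']
```

The pattern relies on the uppercase letters in escapes such as %3A and %2C to stop the match, so it would behave differently on input that uses lowercase escapes.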
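Get information about the video:

import re
import urllib

def getVideoParameters(urldirection):
    # Python 2 API; on Python 3 the equivalent call is urllib.request.urlopen
    ht = urllib.urlopen(urldirection).read()
    # Grab everything between the braces of the flashVars object
    var = re.search(r'flashVars = \{\n([^}]+)\}', ht)
    if var:
        var = var.group(1)
    else:
        return None
    var = [a.replace('\t', '') for a in var.split('\n')]
    for a in range(len(var)):
        # Cut each line at its last comma (the trailing separator), if it has one
        coma = var[a].rfind(',')
        if coma != -1:
            var[a] = var[a][:coma]
    resultado = []
    for a in var:
        # Split each "key:value" line at the first colon
        l = a.find(':')
        if l != -1:
            resultado.append((a[:l], a[l+1:]))
    return dict(resultado)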
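To try the parsing step without touching the network, the sketch below applies the same idea to a string. The function name parse_flash_vars and the keys in the sample (talkId, introMs) are hypothetical; the real page may use different names. It also strips surrounding whitespace and trailing commas rather than cutting at the last comma, so treat it as an approximation of the function above, not a drop-in replacement.

```python
import re


def parse_flash_vars(html):
    """Return the key/value pairs of a flashVars block, or None if absent."""
    match = re.search(r"flashVars = \{\n([^}]+)\}", html)
    if match is None:
        return None
    params = {}
    for line in match.group(1).split("\n"):
        # Remove indentation and trailing commas.
        line = line.strip().rstrip(",")
        # Split at the first colon only, so values may contain colons.
        key, sep, value = line.partition(":")
        if sep:
            params[key] = value
    return params


# Hypothetical page fragment, invented for this example.
sample_html = '<script>\nvar flashVars = {\n\ttalkId:"1234",\n\tintroMs:16500\n}\n</script>'
print(parse_flash_vars(sample_html))
```

As in the original, values are kept as raw text, so a quoted value still carries its quotes and numbers come back as strings.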
Putting it all together:

import simplejson
import urllib

def downloadSub(idtalk, lang, timeIntro):
    print("Downloading subtitles for language %s" % lang)
    # Python 2 API; on Python 3 the equivalent call is urllib.request.urlopen
    c = simplejson.load(urllib.urlopen('http://www.ted.com/talks/subtitles/id/%s/lang/%s' % (idtalk, lang)))
    salida = open('subs_%s_%s.srt' % (idtalk, lang), 'w')
    conta = 1
    c = c['captions']
    for linea in c:
        # Each SRT entry: a counter, a "start --> end" line, the text, a blank line
        salida.write("%d\n" % conta)
        conta += 1
        # timeIntro shifts every caption by the length of the video intro
        salida.write("%s --> %s\n" % (getFormatedTime(timeIntro+linea['startTime']), getFormatedTime(timeIntro+linea['startTime']+linea['duration'])))
        salida.write("%s\n\n" % (linea['content'].encode('utf-8')))
    salida.close()
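The formatting part can be tested separately from the download. The sketch below builds the SRT text in memory from a list shaped like the 'captions' list used above (keys startTime, duration, content). The function names and the sample captions are invented for this example, and it returns a string instead of writing a file.

```python
def srt_timestamp(ms):
    """Format milliseconds as HH:MM:SS,mmm."""
    seconds, millis = divmod(ms, 1000)
    minutes, seconds = divmod(seconds, 60)
    hours, minutes = divmod(minutes, 60)
    return "%02d:%02d:%02d,%03d" % (hours, minutes, seconds, millis)


def captions_to_srt(captions, intro_ms=0):
    """Render captions as SRT text, shifting every time by intro_ms."""
    blocks = []
    for number, caption in enumerate(captions, start=1):
        start = intro_ms + caption["startTime"]
        end = start + caption["duration"]
        blocks.append("%d\n%s --> %s\n%s\n\n" % (
            number, srt_timestamp(start), srt_timestamp(end), caption["content"]))
    return "".join(blocks)


# Made-up captions; times are in milliseconds.
sample = [
    {"startTime": 0, "duration": 3000, "content": "First caption"},
    {"startTime": 3000, "duration": 2000, "content": "Second caption"},
]
srt_text = captions_to_srt(sample, intro_ms=16500)
print(srt_text)
```

Keeping the rendering separate from the file and network code makes it easy to check the output format with a couple of sample captions.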

Very interesting && useful script!
But I have to tell you that there’s a funny discordance between the language of the post && the language of the source code
greetings
Thank you for your effort and sharing it! I was just pondering on downloading and converting some ted subtitles and your contribution made it straightforward.
Give a try to my web edition:
http://diyism.com/?action=tool.ted_srt
@diyism: I was planning on doing the same, a web edition to download the subtitles. I’ll have it up and running as soon as I get a python hosting.
Thanks for the comments, @fern17 and @Petri!
Thanks, I was looking for a way to watch some talks offline with ST. Great way of doing it.
eordano,
Your script is almost perfect, but it’s lacking an important thing: translineation (line breaks).
Subtitles are coming like this:
1 00:00:16,500 --> 00:00:19,500 I’d like to talk to you today about the human brain, 2 00:00:19,500 --> 00:00:21,500 which is what we do research on at the University of California. 3 00:00:21,500 --> 00:00:23,500 Just think about this problem for a second.
Whereas the ideal would be to come like this:
1
00:00:16,500 --> 00:00:19,500
I’d like to talk to you today about the human brain,
2
00:00:19,500 --> 00:00:21,500
which is what we do research on at the University of California.
3
00:00:21,500 --> 00:00:23,500
Just think about this problem for a second.
Is it possible to make this little improvement?
Thanks in advance.
Ghuana,
The subtitles are saved with *nix end-of-lines. If you test them, you’ll see that any subtitle-capable media player displays them correctly.
If you are on Windows and want to edit them manually, I suggest you use Notepad++ and select the option “Unix End of Line” to see the lines correctly.
If you downloaded the script and are not using the online version, find all the occurrences of “\n” and replace them with “\r\n”.
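For a file that has already been downloaded, the same replacement can be applied to its text afterwards. A minimal sketch, with to_crlf as a made-up helper name:

```python
def to_crlf(text):
    """Convert Unix line endings to Windows line endings."""
    # Normalise first so existing "\r\n" pairs do not become "\r\r\n".
    return text.replace("\r\n", "\n").replace("\n", "\r\n")


unix_text = "1\n00:00:16,500 --> 00:00:19,500\nHello\n\n"
windows_text = to_crlf(unix_text)
print(repr(windows_text))
```

When writing the result back to disk, open the file in binary mode (or, on Python 3, pass newline="" to open) so the line endings are not translated a second time.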
Yes, eordano.
It’s working perfectly. I was too hasty. Your tool is amazing.
Thanks
Hey, great script ! I was just going to try and write the same thing, you saved me some hours of coding.
Since I need to make a few changes (minor, like adding the language on the command line, specifying an output file, etc), is there a license your code is put under ? BSD, GPL, other ?
Thanks !
Aurélien, have you checked out the online version I made? http://tedtalksubtitledownload.appspot.com
I had never thought about what license to put my code under. Thanks for asking. I’ve checked and I like the MIT license, so I’m going to state somewhere that all code in this blog is under that license.
Yeah, I’ve seen the online version, but I’ve written a script to read the TED RSS feed, convert the videos to my portable reader size, and put back the URLs in the feed.
I wanted to include the subtitles, so I need the standalone version, which works perfectly.
By the way, if you’re interested, my script is here: http://gitorious.org/abompard-scripts/abompard-scripts/blobs/master/podcast-transcode.py
It’s not specific to TED, although the subtitles feature is, for now. I’m using it for various video podcasts.
Thanks again for your script, and for putting it under the MIT license !
Wonderful script! It’s working!