TED Talks Download Subtitles

Date: January 5th, 2010 | Category: Internet | 13 Comments

UPDATE: There is now an online version.
This is what I’ve been working on today. It’s a simple console-based script to download subtitles for TED Talks, since I haven’t found a way to download them directly from the web in a compatible format (I generally use ‘.srt’ subtitles). Here is the script, written in Python: TEDTalkSubtitles.py

Key parts of the program:

A simple function to convert a value in milliseconds into something like “00:34:32,334”:

    def getFormatedTime(intvalue):
        # Split a duration in milliseconds into the SRT fields HH:MM:SS,mmm.
        mils = intvalue % 1000
        segs = (intvalue // 1000) % 60
        mins = (intvalue // 60000) % 60
        hors = intvalue // 3600000
        return "%02d:%02d:%02d,%03d" % (hors, mins, segs, mils)
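As a quick check of the arithmetic, here is a minimal sketch of the same conversion written with divmod. The name format_srt_time is only for illustration and is not part of the script.

```python
def format_srt_time(ms):
    """Format a millisecond count as an SRT timestamp, HH:MM:SS,mmm."""
    seconds, millis = divmod(ms, 1000)
    minutes, seconds = divmod(seconds, 60)
    hours, minutes = divmod(minutes, 60)
    return "%02d:%02d:%02d,%03d" % (hours, minutes, seconds, millis)

# 34 minutes, 32 seconds and 334 milliseconds
print(format_srt_time(2072334))  # 00:34:32,334
```

Each divmod call peels off one unit and passes the remainder of the work to the next, so the result should match the modulo version above for any non-negative input.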

This recursive function fetches the languages available for the talk:

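    import re

    def availableSubs(subs):
        # Find the next "LanguageCode" marker; stop when none is left.
        a = subs.find("LanguageCode")
        if a == -1:
            return []
        subs = subs[a+len("LanguageCode"):]
        # The code is the first value without uppercase letters between URL-encoded quotes (%22).
        return [re.search("%22([^A-Z]+)%22", subs).group(1)] + availableSubs(subs)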
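The same scan can also be written as a loop. The sketch below keeps the marker and the regular expression from the function above, and simply stops if a marker has no quoted value after it. The sample string is made up to look like URL-encoded JSON; the real page content may be laid out differently.

```python
import re

def available_subs(subs):
    """Collect the value that follows each 'LanguageCode' marker."""
    marker = "LanguageCode"
    codes = []
    pos = subs.find(marker)
    while pos != -1:
        # Continue scanning right after the marker just found.
        subs = subs[pos + len(marker):]
        match = re.search("%22([^A-Z]+)%22", subs)
        if match is None:
            break  # marker without a quoted value: nothing more to read
        codes.append(match.group(1))
        pos = subs.find(marker)
    return codes

# Hypothetical input, for illustration only.
sample = ("%7B%22LanguageCode%22%3A%22en%22%2C%22Name%22%3A%22English%22%7D"
          "%2C%7B%22LanguageCode%22%3A%22es%22%7D")
print(available_subs(sample))  # ['en', 'es']
```

The uppercase exclusion in the pattern is what skips the encoded colon (%3A) that sits between the marker and the language code.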

Getting information about the video:

    import re
    import urllib

    def getVideoParameters(urldirection):
        # Download the talk page (urllib.urlopen is the Python 2 API).
        ht = urllib.urlopen(urldirection).read()
        # Grab everything between the braces of the flashVars block.
        var = re.search('flashVars = {\n([^}]+)}', ht)
        if var:
            var = var.group(1)
        else:
            return None
        var = [a.replace('\t', '') for a in var.split('\n')]
        for a in range(len(var)):
            if var[a]:
                # Cut at the last comma; this assumes every entry ends with one.
                var[a] = var[a][:var[a].rfind(',')]
        resultado = []
        for a in var:
            # Split each "key:value" entry at the first colon.
            l = a.find(':')
            if l != -1:
                resultado.append((a[:l], a[l+1:]))
        return dict(resultado)
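To see what the parsing step produces without touching the network, here is a rough sketch that runs the same idea on a string. The page fragment and the key names in it (ti, introDuration, languages) are invented for the example, and unlike the function above it only strips a comma when one is actually present.

```python
import re

def parse_flash_vars(html):
    """Return the key/value pairs of a 'flashVars = {...}' block, or None."""
    match = re.search(r"flashVars = \{\n([^}]+)\}", html)
    if match is None:
        return None
    params = {}
    for line in match.group(1).split("\n"):
        # Remove indentation tabs and the comma that ends an entry, if any.
        line = line.replace("\t", "").rstrip(",")
        key, sep, value = line.partition(":")
        if sep:  # skip lines that hold no key:value pair
            params[key] = value
    return params

# Made-up page fragment; the real page layout and key names may differ.
sample_page = ('var flashVars = {\n\tti:"453",\n\tintroDuration:16500,\n'
               '\tlanguages:"%5B%5D"\n}')
print(parse_flash_vars(sample_page))
```

The values come back as raw strings, quotes included, so anything numeric still needs an int() call before it is used for timing.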

Putting it all together:

    import simplejson
    import urllib

    def downloadSub(idtalk, lang, timeIntro):
        print("Downloading subtitles for language %s" % lang)
        # Fetch the captions as JSON and open the output .srt file.
        c = simplejson.load(urllib.urlopen('http://www.ted.com/talks/subtitles/id/%s/lang/%s' % (idtalk, lang)))
        salida = open('subs_%s_%s.srt' % (idtalk, lang), 'w')
        conta = 1
        c = c['captions']
        for linea in c:
            # Each SRT entry: index, "start --> end", text, blank line.
            salida.write("%d\n" % conta)
            conta += 1
            salida.write("%s --> %s\n" % (getFormatedTime(timeIntro+linea['startTime']), getFormatedTime(timeIntro+linea['startTime']+linea['duration'])))
            salida.write("%s\n\n" % (linea['content'].encode('utf-8')))
        salida.close()
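The conversion itself can be tried offline. The sketch below builds the SRT text in memory from a list of captions shaped like the ones the loop above reads (startTime, duration, content). The function names, the newline parameter and the sample values are assumptions made for the example, not part of the script.

```python
def format_srt_time(ms):
    """Format a millisecond count as HH:MM:SS,mmm."""
    seconds, millis = divmod(ms, 1000)
    minutes, seconds = divmod(seconds, 60)
    hours, minutes = divmod(minutes, 60)
    return "%02d:%02d:%02d,%03d" % (hours, minutes, seconds, millis)

def captions_to_srt(captions, intro_ms=0, newline="\n"):
    """Turn caption dicts into SRT text, shifting times by the intro length."""
    lines = []
    for index, caption in enumerate(captions, start=1):
        start = intro_ms + caption["startTime"]
        end = start + caption["duration"]
        lines.append(str(index))
        lines.append("%s --> %s" % (format_srt_time(start), format_srt_time(end)))
        lines.append(caption["content"])
        lines.append("")  # blank line closes the entry
    return "".join(line + newline for line in lines)

# Hypothetical captions; times are in milliseconds.
sample_captions = [
    {"startTime": 0, "duration": 3000, "content": "First line."},
    {"startTime": 3000, "duration": 2000, "content": "Second line."},
]
print(captions_to_srt(sample_captions, intro_ms=16500))
```

Passing newline="\r\n" gives Windows-style line endings for editors that do not handle Unix ones.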



13 Comments on “TED Talks Download Subtitles”

  1. fern17 said at 8:01 am on January 11th, 2010:

    Very interesting && useful script!
    But I have to tell you that there’s a funny discordance between the language of the post && the language of the source code

    greetings

  2. Petri said at 4:51 pm on January 23rd, 2010:

    Thank you for your effort and sharing it! I was just pondering on downloading and converting some ted subtitles and your contribution made it straightforward.

  3. diyism said at 5:00 am on February 2nd, 2010:

    Give a try to my web edition:
    http://diyism.com/?action=tool.ted_srt

  4. eordano said at 1:27 pm on February 2nd, 2010:

    @diyism: I was planning on doing the same, a web edition to download the subtitles. I’ll have it up and running as soon as I get Python hosting.

    Thanks for the comments, @fern17 and @Petri!

  5. Ted Talks Subtitles Downloader | estebanordano.com.ar said at 4:17 am on February 5th, 2010:

    [...] TED Talks Download Subtitles [...]

  6. Bernd said at 7:55 am on February 16th, 2010:

    Thanks, I was looking for a way to watch some talks offline with ST. Great way of doing it.

  7. Ghuana said at 8:20 pm on March 1st, 2010:

    eordano,

    Your script is almost perfect, but it’s missing one important thing: line breaks.

    Subtitles are coming like this:

    1 00:00:16,500 --> 00:00:19,500 I’d like to talk to you today about the human brain, 2 00:00:19,500 --> 00:00:21,500 which is what we do research on at the University of California. 3 00:00:21,500 --> 00:00:23,500 Just think about this problem for a second.

    Whereas the ideal would be to come like this:

    1
    00:00:16,500 --> 00:00:19,500
    I’d like to talk to you today about the human brain,

    2
    00:00:19,500 --> 00:00:21,500
    which is what we do research on at the University of California.

    3
    00:00:21,500 --> 00:00:23,500
    Just think about this problem for a second.

    Is it possible to make this little improvement?

    Thanks in advance.

  8. eordano said at 8:28 pm on March 2nd, 2010:

    Ghuana,

    The subtitles are saved with *nix end-of-lines. If you test them, you’ll see that any subtitle-capable media player displays them correctly.

    If you are on Windows and want to edit them manually, I suggest you use Notepad++ and select the option “Unix End of Line” to see the lines correctly.

    If you downloaded the script and are not using the online version, look for all the occurrences of “\n” and replace them with “\r\n”.

  9. Ghuana said at 1:20 pm on March 3rd, 2010:

    Yes, eordano.

    It’s working perfectly. I was too hasty. Your tool is amazing.

    Thanks

  10. Aurélien said at 4:01 am on May 9th, 2010:

    Hey, great script! I was just going to try and write the same thing, you saved me some hours of coding.

    Since I need to make a few changes (minor, like adding the language on the command line, specifying an output file, etc.), is there a license your code is released under? BSD, GPL, other?

    Thanks!

  11. eordano said at 11:02 am on May 9th, 2010:

    Aurélien, have you checked out the online version I made? http://tedtalksubtitledownload.appspot.com

    I had never thought about what license to put my code under. Thanks for asking. I’ve checked and I like the MIT license, so I’m going to state somewhere that all code in this blog is under that license.

  12. Aurélien said at 11:16 am on May 9th, 2010:

    Yeah, I’ve seen the online version, but I’ve written a script to read the TED RSS feed, convert the videos to my portable reader size, and put back the URLs in the feed.
    I wanted to include the subtitles, so I need the standalone version, which works perfectly.

    By the way, if you’re interested, my script is here: http://gitorious.org/abompard-scripts/abompard-scripts/blobs/master/podcast-transcode.py
    It’s not specific to TED, although the subtitles feature is, for now. I’m using it for various video podcasts.

    Thanks again for your script, and for putting it under the MIT license!

  13. Charles said at 11:40 am on July 18th, 2010:

    Wonderful script! It’s working!

