Video transcription project
This page describes the video transcription project for the elinux wiki.
The purpose of this project is to create transcripts of embedded Linux talks, as pages on the elinux wiki, via crowdsourcing. Transcripts make the material in the talks much more accessible: the text is searchable, and a talk can be read instead of watched (which is time-consuming). This essentially makes the talk accessible via random access instead of sequential access. A lot of very good content is preserved in the videos that have been made over the years, and the goal of this project is to make that content more accessible and usable.
Instructions for Volunteers
There are two main tasks involved in this project, and as a volunteer you can participate in either or both of these tasks:
- creating session pages for embedded Linux talks
- creating transcriptions for the talks (on the session pages).
There is a single master page holding all the talks we would like to transcribe, located at:
This will be the master list that we build out over time. It is intended to have a link for all talks or sessions that would be interesting to embedded Linux developers, from the last several years.
Creating a session page
The transcript for each presentation, talk or session is held on a session page for that talk. All of the session pages on the wiki are linked to from the master list:
To create a session page for a talk, please do the following:
- Find a talk (that has video available on the Internet)
- find at least the following information about the talk (needed for the list page):
- event where the talk occurred
- name of presenter and their organization
- title of the presentation
- In one browser tab, open the Session:Template for editing (so you can copy and paste its contents)
- Edit the List of Embedded Linux Presentations
- Create an entry for the talk in the table of presentations on the list page. Follow the style of linking that is used for other sessions.
- Please select a unique name for the talk (possibly adding a date if needed to distinguish the talk from others with similar names)
- The session page name should always start with the prefix: "Session:"
- Save the List page.
- Click on the link, to create the session page for the talk
- Fill in the information for the talk with as much information as you can
- Make sure that you include a link to the slides (if available separately) and the video
- Save the session page
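The naming rules above (a unique name, the "Session:" prefix, and a date added only when needed) could be sketched roughly as follows. This is only an illustration: the function name, the `existing_names` set and the date format are assumptions, not part of the wiki or its template.

```python
import re

SESSION_PREFIX = "Session:"  # prefix required for every session page name


def session_page_name(title, existing_names, date=None):
    """Build a session page name, adding the date only when needed for uniqueness.

    `existing_names` is a hypothetical set of page names already on the list page.
    """
    # Drop characters that MediaWiki does not allow in page titles.
    clean = re.sub(r"[#<>\[\]{}|]", "", title)
    # Collapse runs of whitespace left over from the title or the removal above.
    clean = re.sub(r"\s+", " ", clean).strip()
    name = SESSION_PREFIX + clean
    if name in existing_names and date:
        # Distinguish this talk from another one with the same title.
        name = f"{name} {date}"
    if name in existing_names:
        raise ValueError(f"page name already in use: {name}")
    return name
```

For example, `session_page_name("Keynote", {"Session:Keynote"}, "2012")` gives "Session:Keynote 2012", while a title that is not yet in use is returned with only the prefix added.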
Adding to a transcription
By our estimate, it takes about 10 hours for a human to transcribe a one-hour talk. This is based on an estimate of about 10 minutes of transcription work for each minute of video. In order to make this easy and scalable, we have broken down the transcript into 1-minute intervals.
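As a rough illustration of the 1-minute breakdown, the sketch below generates an empty transcript skeleton with one wiki section per minute of video. The heading text ("17:00 - 18:00") is an assumed format for illustration only; the real layout is whatever the Session:Template uses.

```python
def transcript_skeleton(duration_minutes):
    """Return wikitext with one empty section per minute of video."""
    lines = []
    for minute in range(duration_minutes):
        # Hypothetical heading format, e.g. "== 17:00 - 18:00 ==".
        lines.append(f"== {minute}:00 - {minute + 1}:00 ==")
        # Blank line where the transcribed text for this minute would go.
        lines.append("")
    return "\n".join(lines)
```

A volunteer who picks minute 17 then only needs to fill in the section that starts at 17:00.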
To add to a transcript, please do the following:
1) select a talk from the list - click on the link to get to the session page for that talk
2) click on the video link for that talk
3) scroll to an area of the talk that is not transcribed yet (e.g. minute 17)
4) edit the session page (specifically, the transcription area of the page)
5) listen to the talk, and add the words from the talk to the transcription
6) click on "Save page" to save your changes.
You can also add yourself to the list of transcribers.
It's that easy! Thanks for your help.
Verifying a transcription
To verify a transcript, roughly follow the instructions above, but instead of adding to the transcript, listen to a section which is not yet marked as verified. When you have listened to and verified a particular minute of the talk, please mark the minute-heading for that minute with a 'vx' marker, replacing x with your verifier number.
Eventually, this page will list all embedded Linux talks for the last several years. It is intended that
Process and methodology
In order to crowdsource the effort, a structure and method will be developed for:
- keeping track of the completion status of each talk
- managing the transcripts and keeping them in a uniform and useful form
- keeping track of who has worked on each talk (transcript credits)
- reviewing the transcripts for accuracy (and vandalism)
Random idea: Maybe use a different wiki for the talk pages, that allows database operations on the pages (like websed/tbwiki tables).
Process per page
1. check if page is listed in global presentation list
2. create a presentation page for the presentation (using a template?)
3. add a link to the talk in the global presentation list
4. add a link to the talk in the global topic list
Here's a link to the List of Embedded Linux Presentations
- add a status column to each conference's talks page, indicating the transcript status and linking to the talk page.
- add a page for each talk, with the presentation's information (title, description, author, transcript, link to video, etc.)
- make a template for a presentation page
- solicit participation in the project
- make a call for volunteer transcribers on celinux-dev and at ELC Europe
Historical work, other references
See All Topics - work done by Devin Flake to categorize past presentation material, from 2008
- need to investigate a way to have the presentation pages be consistent. MoinMoin has template pages; what does MediaWiki support?
- investigate MediaWiki page creation
- This seems to be supported, in a very limited way, and only with plugins, via a "preloading" option.
- MoinMoin is much better here, suggesting at page creation time any page that has a template with a prefix or suffix that matches the to-be-created page.
- I guess we'll have to settle for just trying to be consistent
- Maybe the page creation could be automated
- Should I write a script to create a presentation page, and associated link on the 'list of presentations' page?
- investigate MediaWiki page creation
- need to create global presentation list (working on it, decided to add sessions incrementally)
- have a separate page for presentations by topic?
The first session I tried was the Samsung fragmentation page from ELC - unfortunately, there was no video for this. The second session I tried was for Mike Anderson's keynote from ELC 2012 - setting up the page was a bit of a pain. Very clear instructions are needed for this. It should be as mechanical as possible, with little room for error.
How much time does it take, per minute of video?
Format of YouTube caption file: (See http://support.google.com/youtube/bin/static.py?hl=en&topic=2734694&guide=2734661&page=guide.cs)
0:00:03.490,0:00:07.430
>> FISHER: All right. So, let's begin. This session is: Going Social
0:00:07.430,0:00:11.600
with the YouTube APIs. I am Jeff Fisher,
0:00:11.600,0:00:14.009
and this is Johann Hartmann, we're presenting today.
0:00:14.009,0:00:15.889
[pause]
They use square brackets for comments like [laughter], [music], or [pause], and double angle brackets (>>) to indicate a speaker change.
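If existing caption files are ever used as a starting point for transcripts, a parser along these lines might help. It is a minimal sketch that assumes each "start,end" timing pair sits on its own line and is followed by one or more lines of caption text, as in the sample above; the function names are made up for illustration.

```python
import re

# Matches a timing line such as "0:00:03.490,0:00:07.430".
TIMING = re.compile(
    r"^(\d+):(\d\d):(\d\d)\.(\d{3}),(\d+):(\d\d):(\d\d)\.(\d{3})$"
)


def _millis(hours, minutes, seconds, millis):
    """Convert one timestamp's fields to integer milliseconds."""
    return ((int(hours) * 60 + int(minutes)) * 60 + int(seconds)) * 1000 + int(millis)


def parse_captions(text):
    """Return a list of (start_ms, end_ms, caption_text) tuples."""
    captions = []
    current = None
    for line in text.splitlines():
        line = line.strip()
        match = TIMING.match(line)
        if match:
            fields = match.groups()
            # Start a new caption; its text lines are collected below.
            current = [_millis(*fields[:4]), _millis(*fields[4:]), []]
            captions.append(current)
        elif line and current is not None:
            current[2].append(line)
    return [(start, end, " ".join(parts)) for start, end, parts in captions]
```

Dividing `start_ms` by 60000 (integer division) gives the minute of the talk a caption belongs to, which lines up with the 1-minute transcript intervals.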
Starting and stopping the video is painstaking. It is difficult to back up the video to a precise point.
For Mike's talk, it took me about 6 minutes of transcribing for each minute of video. I was trying to go as fast as possible and timing myself. I think a reasonable estimate for each minute of video is about 10 minutes of time (set up, actual transcription, corrections, saving). This is when done out of context (a single minute session).
At this rate, 6 minutes of video would take an hour to transcribe, so a 60-minute video would take about 10 hours. Give another 2 hours for a second pass, and we're talking about 12 hours per video. That's a lot, and clearly outside the scope of what one person would be expected to do.
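The arithmetic behind this estimate can be written out as a small calculation. This is just a sketch: the function name is made up, and treating the second pass as a flat 2 hours regardless of video length is an assumption (the figure above was given only for a 60-minute video).

```python
def transcription_hours(video_minutes, work_per_video_minute=10, second_pass_hours=2):
    """Estimate the total hours needed to transcribe one video.

    work_per_video_minute: minutes of work per minute of video (estimated at 10).
    second_pass_hours: flat allowance for a review pass (assumed, see above).
    """
    first_pass_hours = video_minutes * work_per_video_minute / 60
    return first_pass_hours + second_pass_hours
```

With the defaults, `transcription_hours(60)` gives 12 hours; using the measured rate of 6 minutes of work per minute of video and no second pass, `transcription_hours(60, 6, 0)` gives 6 hours.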
Note that Mike's talk had very good audio, and Mike speaks slowly and clearly with no accent, so this is probably a best-case scenario.
Popcorn Maker: allows you to augment a video on the web?
Video mashup tools (JayCut, JumpCut, MovieMasher, Mix and Mash,
YouTube supports video annotations: http://www.youtube.com/watch?v=UGeQKMJIHx8