This post chronicles my ultimately failed attempt to extract subtitles with ffmpeg / avconv from a DVB-S .ts video file recorded from live television.
I started by searching for anyone else who had tried this and found many pointers on Stack Overflow and similar sites, but all of them referred to .srt subtitles. It seems a lot of people are transcoding, remuxing and repackaging their anime, but only very few try to extract subs from broadcast video streams. The ffmpeg docs, for instance, provide this example.
My ffmpeg can decode and encode all we'd need, right?
Wrong! When you do a naive conversion with ffmpeg, extracting the dvb_sub subtitle stream from the .ts file and sending it to, let's say, an .srt file, you'll get this dreaded ffmpeg / avconv error: "Error while opening encoder for output stream #0:0 - maybe incorrect parameters such as bit_rate, rate, width or height"
The explanation is that there are basically two kinds of subtitle formats, image-based and text-based. Most subs discussed on the web are text-based, like .srt, .stl or .vtt (WebVTT), but here we are facing an image-based format! (The user who asked "Is it possible to extract SubRip (SRT) subtitles from an MP4 video with ffmpeg?" went through the same learning curve.)
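To make the image-versus-text split concrete, here is a minimal sketch that sorts ffmpeg subtitle codec names into the two families. The function name sub_kind is made up for illustration, and the list of codec names is my own non-exhaustive selection, not something taken from the ffmpeg docs quoted above.

```shell
# Hypothetical helper: classify an ffmpeg subtitle codec name.
# The lists are a non-exhaustive assumption; extend them for your own files.
sub_kind() {
    case "$1" in
        # bitmap formats: every subtitle is a small picture
        dvb_subtitle|dvd_subtitle|hdmv_pgs_subtitle|xsub) echo bitmap ;;
        # text formats: every subtitle is a string plus timing
        subrip|srt|ass|ssa|webvtt|mov_text|text) echo text ;;
        # teletext is its own case: pages can be rendered as bitmap or as text
        dvb_teletext) echo teletext ;;
        *) echo unknown ;;
    esac
}

# Usage idea (not executed here; needs ffprobe and a real file):
#   ffprobe -v error -select_streams s -show_entries stream=codec_name \
#           -of csv=p=0 000.ts | while read -r c; do sub_kind "$c"; done
```

A stream reported as bitmap can only be copied or transcoded to another bitmap format without OCR, which is why the .srt attempt above fails.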
The MPEG legacy has brought us here, I think: on DVDs and evidently in .ts broadcast streams, subs come in image-based formats such as VobSub, dvb_sub and dvd_sub.
To see what I'm talking about, refer to vlc's comparison of subtitle formats, as Wikipedia doesn't have one.
For example, try something like this with a similar file:
ffmpeg -i 000.ts -map 0:0 -vcodec copy -acodec copy -map 0:s:1 -scodec dvdsub test.mkv
Just swap dvdsub for srt, for example, and you'll get the above-mentioned error. What the command does do is transcode the subtitles found in the second subtitle stream (the indices are zero-based, so "1" is the second; in this file that is stream 0:6, while 0:5, the first subtitle stream, carries teletext). It's beyond me what the actual difference between dvb_subtitles and dvd_subtitles is. But when you watch the result in vlc (mplayer has problems displaying them), the quality of the subs has degraded. (This only as a side note: I think it comes from the palette. DVB subs can use colour tables with up to 256 entries, while a DVD subpicture is limited to four colours, so the transcode loses the finer colour and alpha gradations.)
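The mapping between a relative specifier like 0:s:1 and an absolute one like 0:6 is easy to get wrong, so here is a small sketch that prints it. It assumes the input is the "index,codec_name" CSV that ffprobe prints for subtitle streams; the function name number_sub_streams is hypothetical.

```shell
# Hypothetical helper: read "index,codec_name" lines on stdin (one per
# subtitle stream, in file order) and print the matching 0:s:N specifier.
number_sub_streams() {
    awk -F, '{ printf "0:s:%d -> 0:%s (%s)\n", NR - 1, $1, $2 }'
}

# Usage idea (not executed here; needs ffprobe and a real file):
#   ffprobe -v error -select_streams s \
#           -show_entries stream=index,codec_name -of csv=p=0 000.ts \
#     | number_sub_streams
```

For a file laid out like mine, with teletext at 0:5 and DVB subs at 0:6, the second line of output would read "0:s:1 -> 0:6 (dvb_subtitle)".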
With that learned and at least one working ffmpeg command, I stumbled over this mailing-list post (Can extract DVB-Sub, cannot extract DVB-T), where someone had issues dumping subtitles from a .ts file. It gave me this command, which actually worked in dumping the raw subtitles-only stream to a file:
ffmpeg -i file1.ts -vn -an -scodec copy -f rawvideo dvbsub.dat
This post here discusses something similar. My variation of it was
ffmpeg -i 000.ts -map 0:0 -vcodec copy -acodec copy -map 0:s:1 -scodec copy -f rawvideo sub.data
But looking at the file in a hex editor, with me being a hex noob, turned up nothing resembling a BMP or PGM file, nor any headers or structures I'd recognise.
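In hindsight there is a reason no BMP header shows up: DVB subtitles are not stored as image files but as a sequence of "segments" defined in ETSI EN 300 743, each starting with the sync byte 0x0F, followed by a type byte, a two-byte page id and a two-byte length. The sketch below walks such a dump and lists the segments. It assumes the file contains only subtitle packets (no video or audio mapped into the same output) and that the data starts at or near a sync byte; the function name list_dvb_segments is made up, and I have not verified it against a real broadcast recording.

```shell
# Hypothetical helper: list DVB subtitling segments in a raw dump.
# Assumed layout per segment (ETSI EN 300 743):
#   sync_byte 0x0F | segment_type (1 byte) | page_id (2) | segment_length (2) | data
# The whole file is loaded into memory, so use it on small dumps only.
list_dvb_segments() {
    od -An -v -tu1 "$1" | tr -s ' ' '\n' | awk '
        NF { b[n++] = $1 + 0 }                      # one decimal byte value per line
        END {
            i = 0
            while (i + 6 <= n) {
                if (b[i] != 15) { i++; continue }   # not a sync byte: resync
                type = b[i+1]
                page = b[i+2] * 256 + b[i+3]
                len  = b[i+4] * 256 + b[i+5]
                printf "offset=%d type=0x%02x page=%d length=%d\n", i, type, page, len
                i += 6 + len                        # jump over header and payload
            }
        }'
}

# Usage idea: list_dvb_segments dvbsub.dat | head
```

Segment types you would hope to see are 0x10 (page composition), 0x11 (region composition), 0x12 (CLUT definition), 0x13 (object data, the run-length coded pixels) and 0x80 (end of display set). If the listing is mostly nonsense, the dump probably contains something other than a clean subtitle stream.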
So how can I extract image-based subtitles with ffmpeg? First I tried whether writing the subtitle stream to an image format would work:
ffmpeg -i 000.ts -map 0:0 -vn -an -map 0:s:1 -scodec copy -f image2 sub_%03d.bmp
and this actually wrote many images, but all were unusable. I don't know how ffmpeg chopped the data into files; perhaps one file per subtitle packet, since with -scodec copy nothing gets decoded into an actual bitmap. Finally giving up on ffmpeg, I searched for other tools able to extract image-based subtitles as rendered pictures from the video. Some claimed mencoder could do that, and I actually found example commands, but none worked for me, and all examples centered on DVD and VobSub workflows, like writing .idx and .sub files from a DVD.
This post then, although it discusses a VOB workflow, points to the only feasible way of converting an image-based subtitle stream into something text-based or into raw text. There the author uses mencoder and a tool called vobsub2pgm and finally sends the resulting character images to an OCR solution. This post does something similar and uses tesseract for OCR. ffmpeg can't do that; so far only the other way round, rendering text as images, has been mentioned for ffmpeg, in this ticket to add rasterization for sub transcoding.
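Whatever tool does the OCR, the last step of such a workflow is writing the recognised text back out as .srt cues. A minimal sketch of that step follows, assuming you already have start and end times in milliseconds and the OCR'd text for each subtitle picture; both function names are hypothetical, and where the timings come from is left open.

```shell
# Hypothetical helpers for the final "write an .srt" step of an OCR workflow.

# Convert milliseconds to the SRT timestamp form HH:MM:SS,mmm.
ms_to_srt_time() {
    printf '%02d:%02d:%02d,%03d' \
        $(( $1 / 3600000 )) $(( $1 / 60000 % 60 )) \
        $(( $1 / 1000 % 60 )) $(( $1 % 1000 ))
}

# Print one SRT cue: srt_cue INDEX START_MS END_MS TEXT
srt_cue() {
    printf '%d\n%s --> %s\n%s\n\n' \
        "$1" "$(ms_to_srt_time "$2")" "$(ms_to_srt_time "$3")" "$4"
}

# Usage idea, with the text coming from an OCR run on one subtitle image
# (not executed here; needs tesseract and a real image):
#   srt_cue 1 1000 2500 "$(tesseract sub_001.png stdout)" >> subs.srt
```

The OCR output will still need proofreading, so treat the generated file as a first draft.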
Just before giving up, as I wasn't inclined to go down a painful console-based path of trial and error with multiple tools just to extract some subtitles, I found that AviDemux offers a GUI tool to do just that: OCR of image-based subs! Found in the "Tools" menu, the older routine is called "OCR (VOBSub -> srt)", and more recent avidemux builds have "OCR (TS->srt)".
Running this on my .ts file didn't work. If you're interested, this page has screenshots of the workflow, which is a bit cumbersome, as OCR is not perfect and you eventually have to edit what's being recognised.
All I got was a weird error, "backdoor) >> 16", and something that led me to this thread, which once more mentioned a DVB tool called ProjectX. Despite the generic name, it's a very specialised tool, focused on inspecting and demuxing DVB-style .ts files as broadcast by European stations. Users on the forums say it can extract subtitles from the video mux.
And although it processed my .ts file and printed all sorts of very involved-looking things about packets, streams and elements found in the stream, I was not successful in properly targeting and extracting the "sub-picture" teletext subtitle stream found in my MPEG transport stream file. And that's the end of it. Post a comment if you have tried something similar and can provide pointers.