MPEG-4, what's it good for?

This article was written by Alistair Jackson of EditHouse, and was first published in Broadcast Engineering News magazine in December 2002.

The initial MPEG-4 specification was finalised in 1998 and yet we have seen remarkably little of it implemented. There's been plenty of hype but little product, and now the marketing people for Microsoft's Windows Media Player are trying to tell us that MPEG-4 is already outdated. What is the true state of affairs?

When the MPEG-4 committee were first put to the task in 1993, they were working under the title of "very low bit rate audio-visual coding". As they progressed, their raison d'être evolved into "coding of audio-visual objects". I can't help but feel that this change opened an irreconcilable rift in their purpose. At one end of the scale you have technology aimed at video conferencing, but as time has progressed the scope has expanded all the way up to HDTV. If it works, this all-encompassing aim should simplify audio and video standards considerably, but only if manufacturers are willing to get on the bandwagon.

MPEG-2 set the precedent by overstepping its brief (making MPEG-3 redundant before its release), and MPEG-4 has taken this to a far greater extreme, covering bit rates from a few kilobits per second through to gigabits per second. There is support for progressive and interlaced video, multiple video objects, surround sound, and pretty well anything else you can think of. The full collection of profiles covers transmission and storage data rates for everything from HD post-production requirements to Internet streaming and wireless devices. Areas as diverse as cinema distributors and mobile phone companies are looking at MPEG-4 as their enabling technology.

As has been reported many times over the last year or two, the real holdups with MPEG-4 haven't been over the technology, but rather over administrating the myriad of individual patents that are incorporated into the final product. Around 18 patented technologies are required to get the most basic versions of MPEG-4 working. For the full toolbox of profiles and levels, there are many, many more to be considered.

The MPEG-4 committee rightly or wrongly stood back from the legal administration side of things and simply developed an incredibly thorough and multi-purpose standard. They've left it to others to figure out how the technology can be used without infringing patents.

A separate but related group called the MPEG-4 Industry Forum (M4IF) have been charged with the task of playing middleman between the MPEG-4 committee and the patent holders. M4IF are a not-for-profit organisation assigned the duty of promoting MPEG-4. They are also tasked with certifying MPEG-4 products through interoperability testing and with suggesting solutions to the patent licensing difficulties. To this end, they have promoted the founding of patent pools. The main example is the company MPEG LA, who have put together licensing packages for both MPEG-2 and MPEG-4.

As a licensing administrator, MPEG LA are a private company who have secured licensing rights from the many patent holders who can lay claim to the enabling technologies used by certain profiles of MPEG-2 and MPEG-4. While they have copped a lot of flak for their initial rights package for MPEG-4, it is important to realise that they are a totally separate body from MPEG. There is nothing to stop a developer from negotiating their own deal with the individual patent holders. However, this would no doubt be a very expensive administrative and legal process, which is why M4IF are encouraging licensing administrators.

The different MPEG committees define lists of profiles to suit all their members. When most people refer to MPEG-2 they are talking about the MP@ML (main profile at main level) used by DVB and DVD or the 422P@ML used by Sony for their SX and IMX formats. Presumably, MPEG-4 will follow a similar pattern with some profiles becoming commonplace and others never leaving the test bench.

The big difference is that, as opposed to MPEG-2's seven video profiles, MPEG-4 has dozens, with future additions on the drawing board. Any implementation of MPEG-4 is defined by choosing the desired profiles for each of the media types (sound, vision and scene descriptors). Furthermore, many of these profiles are broken up into levels, which define the minimum decoder complexity required for that profile at that particular level.
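In programming terms, a profile@level pair behaves like a capability contract between encoder and decoder: the encoder promises to stay within the stated limits, and any decoder claiming that combination must cope with them. A rough sketch of the idea follows; the profile names echo the article, but the numeric limits are invented placeholders, not values from the MPEG-4 specification.

```python
# Capability table for a hypothetical decoder. The limits below are
# illustrative only -- the real spec defines many more parameters
# (buffer sizes, macroblock rates, etc.) per profile and level.
LIMITS = {
    ("Simple", 1): {"max_objects": 4,  "max_kbps": 64},
    ("Simple", 2): {"max_objects": 4,  "max_kbps": 128},
    ("Main",   2): {"max_objects": 16, "max_kbps": 2000},
}

def conforms(profile, level, n_objects, kbps):
    """True if a stream with these properties fits the profile@level."""
    caps = LIMITS.get((profile, level))
    if caps is None:
        return False  # this decoder doesn't implement that combination
    return n_objects <= caps["max_objects"] and kbps <= caps["max_kbps"]

print(conforms("Simple", 1, 2, 48))   # fits within Simple@L1
print(conforms("Simple", 1, 8, 48))   # too many objects for Simple@L1
```

The point of the lookup-table shape is that a decoder only ever has to advertise the combinations it implements; anything else is simply rejected.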

All MPEG-4 content is made up of Media Objects. These objects can be sound, vision or mixed audiovisual. The term natural is used to refer to material that was captured from the real world, with a camera or microphone, and the term synthetic refers to content that was created on a computer.

In its simplest form, there is just one rectangular media object, which is of course a straightforward video frame. It is encoded in a similar way to MPEG-1 and MPEG-2 frames, and you end up with similar results, although with somewhat better compression ratios. Where MPEG-4 departs from this is with its more complex compound media objects, which group primitive media objects together. They allow different objects to be encoded with different spatial and temporal resolutions. Thus a foreground actor can be compressed at high quality and with a high refresh rate, while the background scenery can have less detail and be refreshed less often.

Not only do you save bandwidth from the different compression levels, but as the whole background exists in memory, when the foreground object moves around, the missing bits revealed behind it don't have to be re-transmitted.
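The saving can be sketched with some toy arithmetic. Assuming, purely for illustration, a background object sent once and cached, against a small foreground object refreshed every frame:

```python
# Toy model of the compound-object saving. The byte counts are
# arbitrary illustrative numbers, not real MPEG-4 figures.
BG_BYTES    = 50_000   # full background, sent once and cached by the decoder
FG_BYTES    = 2_000    # small foreground object, re-sent every frame
FRAME_BYTES = 60_000   # what a conventional full-frame codec might send per frame

def object_stream_bytes(frames):
    """Object-based coding: background once, foreground per frame."""
    return BG_BYTES + frames * FG_BYTES

def full_frame_bytes(frames):
    """Conventional coding: the whole frame every time."""
    return frames * FRAME_BYTES

frames = 100
print(object_stream_bytes(frames))   # 250000
print(full_frame_bytes(frames))      # 6000000
```

Even with generous assumptions for the conventional codec, the cached background does most of the work; only the moving object costs ongoing bandwidth.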

MPEG has defined a binary language for scene description called Binary Format for Scenes or BIFS. Audio and Visual objects are positioned in both space and time. These spatial and temporal positions are defined through a local coordinate system. More advanced BIFS can be used for 2-D and 3-D modelling, body animation and sound environment modelling.

Media objects can be arbitrarily shaped, with transparency information defining their edges. They can include such things as 2-D and 3-D vector graphics with texture maps, face and body animation on avatars, and synthesised audio. This audio includes a MIDI-like system and a text-to-speech engine. Under certain profiles, media objects can be interactively manipulated by the viewer. Once again, it's a matter of what profile and level you're dealing with.

For example, an existing television program that uses a chroma-keyed host or a virtual studio could easily be re-jigged to work in a multipurpose MPEG world. The main feed could remain MPEG-2 or be made MPEG-4. Broadband viewers could receive a scalable stream that would adapt itself to their connection speed. MPEG-4 could make data savings by sending the keyed host's image as a separate object to the background, with a different refresh rate. Each MPEG-4 decoder would perform the actual compositing of the picture.

For very low bandwidth users, such as mobile phone connections, the talking head could actually be replaced with a "synthetic" head from MPEG-4's face animation profile. An appropriate system would need to be found to match the head to the real newsreader's expressions and mouth movements. With this could be sent the newsreader's real voice, heavily compressed, or a synthetic transcript, which would use the text-to-speech profile. If captioning is being entered live, then that could be used as a source for the synthetic speech. The same text can be used to set parameters that animate the synthetic face.

With certain implementations, it is possible for the viewer to control the positioning of these media objects. Thus with the correct MPEG-4 authoring, a viewer can move or delete an object, such as a person, from the compound image to see what's behind them. As with all the advanced features of MPEG-4, the standard simply supports the ability to transfer the images in this manner; creating the tools and the content to make such options possible is somebody else's problem.

To allow this sort of thing to be done, MPEG-4 includes the ability to send Java applications within the data stream. This system is called MPEG-J (not to be confused with M-JPEG, which is of course something entirely different). Among its many uses, Java can query a terminal, set a viewing environment or provide user interaction either locally, or through a back channel if one exists. In an atypical burst of frivolity, the committee has seen fit to name these Java applications "MPEGlets".

Through an optional Intellectual Property Identification (IPI) data set, each individual media object can be assigned a unique tag to identify the current rights holder. These can be issued through international numbering system organisations. Alternatively, a key/value pair can be used, such as "designer/Betty Rubble". Through this method, all the elements making up a compound media object could have a different copyright owner. The music, voice over, background still, animated purple fluffy creature and John Howard impersonator could all be created separately, hold separate copyright and all end up combined on the screen through compound media objects. As each elementary data stream can be stored and transmitted separately, it leaves you wondering whether this technology will lead to individually created and owned characters meeting on virtual sets in cyber space.
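In outline, the key/value tagging might look like the following. The object names and fields are invented for illustration (echoing the article's example); the IPI data set itself is a compact binary descriptor, not anything this readable.

```python
# Each elementary stream carries optional rights metadata; a compound
# media object simply aggregates the tags of its child streams.
streams = [
    {"object": "music",      "ipi": {"designer": "Betty Rubble"}},
    {"object": "voice_over", "ipi": {"designer": "Barney Rubble"}},
    {"object": "background", "ipi": {"designer": "Betty Rubble"}},
]

def rights_holders(compound):
    """Distinct rights holders across a compound media object."""
    return sorted({s["ipi"]["designer"] for s in compound})

print(rights_holders(streams))   # ['Barney Rubble', 'Betty Rubble']
```

Because each stream keeps its own tag all the way to the decoder, the compound object never needs a single combined copyright owner.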

The MPEG-4 encoder has to go to considerable lengths to make sure the bit stream it creates will be compliant with all decoders of the appropriate profile and level. A complex Video Buffer Verification mechanism is used to ensure that the individual media objects and final compound object don't overtax decoders in terms of either memory or computational requirements.
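The buffer side of that check amounts to a leaky-bucket simulation: bits arrive at the channel rate, each decoded object drains its coded size from the buffer, and the stream fails if an object hasn't fully arrived when it is due. A much simplified sketch of the idea, with invented sizes (the real verifier also tracks per-object memory and computational load):

```python
# Simplified buffer-verifier idea: the buffer fills at the channel rate
# (the channel stalls when the buffer is full, hence the min) and drains
# by each coded object's size. The stream conforms only if no object is
# due before it has fully arrived (underflow).
def buffer_ok(object_sizes, rate_per_tick, buffer_cap, initial_fill):
    fill = initial_fill
    for size in object_sizes:
        fill = min(fill + rate_per_tick, buffer_cap)  # channel tops up buffer
        if size > fill:
            return False          # underflow: object not fully arrived in time
        fill -= size              # decoder removes the object from the buffer
    return True

# A smooth stream passes; a burst bigger than the buffer can absorb fails.
print(buffer_ok([100, 100, 100], rate_per_tick=100, buffer_cap=400, initial_fill=100))
print(buffer_ok([100, 600, 100], rate_per_tick=100, buffer_cap=400, initial_fill=100))
```

An encoder runs a model like this as it codes, re-spending or withholding bits so the output always passes.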

MPEG-4 is in the process of adding the impressive H.264 codec to its ever-increasing list of profiles. Officially known as MPEG-4 Part 10, "Advanced Video Coding", the technology is said to give DVD-quality results at less than 1Mbps. MPEG LA have already put out a call for relevant patent holders to register their interest in the upcoming licensing package, which they intend to release in mid-2003.

(Fig. 1) The full list of profiles is ever expanding, but some common areas are shown below.

Media profiles -
  Audio Profiles: Main; Scaleable; Synthetic; Speech; MAUI; High Quality (AAC); Low Delay Audio
  Visual Profiles: Simple; Advanced Simple; Main; Core; Hybrid; Face/Body Animation; Simple/Core Scaleable; Simple/Core Studio
  Graphics Profiles: Simple 2D; Complete 2D; Complete (2D & 3D)
Scene Description Profiles: Audio; Simple 2D; Complete 2D; Complete
MPEG-J Profiles: Main; Personal

There are a couple of major hurdles for MPEG-4 to deal with. The huge number of profiles continues to cause confusion as to just what the primary purpose of MPEG-4 is. Computer people have heard the term bandied around for a while now: at one point championed by Microsoft, and now trashed by them. Other people make the not entirely unreasonable assumption that MPEG-4 must be an upgrade from MPEG-2.

The other major issue is the less-than-brilliant performance of its biggest "patent clearing house", MPEG LA, who made something of a PR mess of their first MPEG-4 offering. The latest version is an improvement, including no patent royalties payable for content owners with fewer than 50,000 subscribers.

MPEG-4 is an open standard, and history has shown that such standards are more successful than the latest bells-and-whistles proprietary system. Once a standard has been established and accepted, most consumers are unwilling to move on to the next big thing; they'd prefer something they can rely on. MPEG-4's success or failure will come down to whether or not it can gain enough of a following before the next major standard comes along.

The author has made every effort to confirm the accuracy of the information in this article, however, it is his opinion only and comes with no guarantees. Please consult your family technologist if in doubt.