History of Web Multimedia#
- PC Era: Playback relied on plugins such as Flash and on rich clients.
- Mobile Internet Era: Flash was gradually phased out and HTML5 emerged, but HTML5 natively supports only a limited set of video formats.
- Media Source Extensions (MSE): Lets scripts feed media data to the player, so many more video formats can be played.
Basic Knowledge#
Encoding Formats#
Basic Concepts of Images#
- Image Resolution: The number of pixels in the horizontal and vertical directions of an image (for example, 1920 × 1080); it determines how many pixels make up the image.
- Image Depth: The number of bits used to store each pixel. It determines how many colors, or how many grayscale levels, each pixel can take.
  - For example, a color image that stores the R, G, and B components of each pixel with 8 bits each has a pixel depth of 24 bits and can represent 2^24 = 16,777,216 colors.
  - A grayscale image that stores each pixel with 8 bits has a pixel depth of 8 bits and at most 2^8 = 256 gray levels.
- Image resolution and image depth together determine the uncompressed size of the image.
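The relationship between resolution, depth, and size can be checked with a few lines of arithmetic. This is a minimal sketch; the function names and the 800 × 600 example are illustrative assumptions, not part of the original lesson.

```javascript
// Number of distinct values one pixel can take at a given depth.
function colorCount(bitsPerPixel) {
  return 2 ** bitsPerPixel;
}

// Uncompressed image size in bytes: width * height * bitsPerPixel / 8.
function rawImageBytes(width, height, bitsPerPixel) {
  return (width * height * bitsPerPixel) / 8;
}

console.log(colorCount(24)); // 16777216 colors for 24-bit RGB
console.log(colorCount(8)); // 256 gray levels for 8-bit grayscale
console.log(rawImageBytes(800, 600, 24)); // 1440000 bytes
```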
Basic Concepts of Video#
- Resolution: The resolution of each frame (image) in the video.
- Frame Rate: The number of video frames per unit of time, usually expressed in frames per second (FPS).
- Bit Rate: The amount of video data transmitted per unit of time, usually expressed in kbps (kilobits per second).
- Resolution, frame rate, and bit rate together determine the size of the video: for an encoded video the size is roughly bit rate × duration, and higher resolutions and frame rates generally need a higher bit rate.
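For an encoded stream, a rough size estimate needs only the bit rate and the duration. The sketch below assumes that kbps means 1000 bits per second and ignores audio and container overhead; the function name is hypothetical.

```javascript
// Approximate size of an encoded video stream in bytes.
// Assumes 1 kbps = 1000 bits per second; audio and container overhead are ignored.
function encodedSizeBytes(bitrateKbps, durationSeconds) {
  return (bitrateKbps * 1000 * durationSeconds) / 8;
}

// One minute of video at 2000 kbps:
console.log(encodedSizeBytes(2000, 60)); // 15000000 bytes, i.e. about 15 MB
```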
Classification of Video Frames#
I-frames, P-frames, B-frames
I-frame (Intra-coded frame): A self-contained frame that carries all of its own picture information and can be decoded without reference to any other frame.
P-frame (Predictive-coded frame): Encoded as a difference from a preceding I-frame or P-frame, so it can only be decoded after that reference frame.
B-frame (Bidirectionally predictive-coded frame): Depends on both preceding and following frames and stores the difference between itself and those frames.
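DTS (Decode Time Stamp): Tells the player when to send a frame's data to the decoder.
PTS (Presentation Time Stamp): Tells the player when to display the decoded frame.
Without B-frames, frames are decoded and displayed in the same order, so DTS and PTS follow the same order. With B-frames, the later reference frame must be decoded before the B-frames that depend on it, so decode order and display order differ.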
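To see how DTS and PTS can diverge once B-frames are present, consider a tiny group of four frames. The timestamps below are made-up illustrative values, not taken from a real stream.

```javascript
// Hypothetical frames: the P-frame is decoded early (DTS 1) but shown last (PTS 4),
// because the two B-frames need it as a reference.
const frames = [
  { type: "I", dts: 0, pts: 1 },
  { type: "P", dts: 1, pts: 4 },
  { type: "B", dts: 2, pts: 2 },
  { type: "B", dts: 3, pts: 3 },
];

// Returns the frame types ordered by the given timestamp field.
function orderBy(list, key) {
  return [...list]
    .sort((a, b) => a[key] - b[key])
    .map((frame) => frame.type)
    .join(" ");
}

console.log(orderBy(frames, "dts")); // I P B B  (decode order)
console.log(orderBy(frames, "pts")); // I B B P  (display order)
```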
GOP (Group of Pictures)#
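A GOP is the group of frames from one I-frame up to the next. The interval between two I-frames is usually 2 to 4 seconds.
The more I-frames a video contains (that is, the shorter the GOP), the larger the video, because I-frames are typically the largest frames.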
Why Encode?#
Video resolution: 1920 × 1080, with 24 bits per pixel
Size of one uncompressed frame: 1920 × 1080 × 24 / 8 = 6,220,800 bytes (about 5.9 MiB)
A 90-minute video at 30 FPS would therefore occupy 6,220,800 × 30 × 5,400 = 1,007,769,600,000 bytes, about 939 GiB (roughly 1 TB). Far too large!
Not to mention 60 FPS, which doubles that...
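The same calculation can be written as a small function, which makes it easy to try other resolutions and frame rates. This is a sketch of the arithmetic only; the function name is an assumption.

```javascript
// Size in bytes of uncompressed video:
// bytes per frame * frames per second * duration in seconds.
function rawVideoBytes(width, height, bitsPerPixel, fps, seconds) {
  const bytesPerFrame = (width * height * bitsPerPixel) / 8;
  return bytesPerFrame * fps * seconds;
}

const GiB = 1024 ** 3;
const bytes = rawVideoBytes(1920, 1080, 24, 30, 90 * 60);

console.log(bytes); // 1007769600000
console.log((bytes / GiB).toFixed(1)); // 938.6
```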
What does encoding compress?
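- Spatial redundancy: Neighboring pixels within the same frame are often identical or very similar.
- Temporal redundancy: Consecutive frames are often nearly identical; for example, only the position of a ball changes while everything else stays the same.
- Encoding redundancy: Symbols can be stored with fewer bits than a naive encoding uses. For example, in an image that contains only blue and white, blue can be represented by 1 and white by 0 (as a Huffman-style code would do).
- Visual redundancy: Details that the human visual system can hardly perceive.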
Encoding Data Processing Flow#
Prediction -> Transform -> Quantization -> Entropy encoding
- Prediction removes spatial and temporal redundancy.
- Transform removes spatial redundancy.
- Quantization removes visual redundancy: it discards detail that the visual system cannot easily perceive.
- Entropy encoding removes encoding redundancy: symbols that appear more often are given shorter codes.
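The idea behind entropy encoding can be illustrated with Huffman coding, one common entropy coding method. The sketch below only computes the code length of each symbol in a short string. It is a toy illustration of the principle, not how a real video codec implements its entropy coder.

```javascript
// Computes the Huffman code length (in bits) of each symbol in `text`.
// Frequent symbols end up with shorter codes.
function huffmanCodeLengths(text) {
  const counts = new Map();
  for (const symbol of text) counts.set(symbol, (counts.get(symbol) || 0) + 1);

  const lengths = new Map([...counts.keys()].map((symbol) => [symbol, 0]));
  let nodes = [...counts].map(([symbol, weight]) => ({ weight, symbols: [symbol] }));

  // A single distinct symbol still needs one bit.
  if (nodes.length === 1) {
    lengths.set(nodes[0].symbols[0], 1);
    return lengths;
  }

  // Repeatedly merge the two lightest nodes; every merge adds one bit
  // to the code of each symbol underneath the merged node.
  while (nodes.length > 1) {
    nodes.sort((a, b) => a.weight - b.weight);
    const [first, second, ...rest] = nodes;
    const merged = {
      weight: first.weight + second.weight,
      symbols: [...first.symbols, ...second.symbols],
    };
    for (const symbol of merged.symbols) lengths.set(symbol, lengths.get(symbol) + 1);
    nodes = [merged, ...rest];
  }
  return lengths;
}

const lengths = huffmanCodeLengths("aaaabbc");
console.log(lengths.get("a"), lengths.get("b"), lengths.get("c")); // 1 2 2
```

Here `a` appears four times and gets a 1-bit code, while the rarer `b` and `c` get 2-bit codes.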
Encapsulation Formats#
The video encoding described above produces only a pure video stream.
Encapsulation format: A container (for example, MP4 or FLV) that stores video together with audio, images, or subtitle information.
Multimedia Elements and Extended APIs#
video & audio#
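The <video> element embeds a media player in an HTML or XHTML document to support video playback within the document.
<!DOCTYPE html>
<html>
<body>
    <!-- Option 1: point the src attribute directly at the file -->
    <video src="./video.mp4" muted autoplay controls width="600" height="300"></video>
    <!-- Option 2: list one or more <source> children; the browser uses the first one it can play -->
    <video muted autoplay controls width="600" height="300">
        <source src="./video.mp4" type="video/mp4">
    </video>
</body>
</html>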
The <audio> element embeds audio content in the document. Unlike <video>, it has no width or height attributes.
<!DOCTYPE html>
<html>
<body>
    <!-- Option 1: point the src attribute directly at the file -->
    <audio src="./audio.mp3" muted autoplay controls></audio>
    <!-- Option 2: list one or more <source> children -->
    <audio muted autoplay controls>
        <source src="./audio.mp3" type="audio/mpeg">
    </audio>
</body>
</html>
Commonly used methods:

| Method | Description |
|---|---|
| play() | Starts playing the audio/video; returns a Promise that resolves once playback has started |
| pause() | Pauses the currently playing audio/video |
| load() | Resets the audio/video element and reloads its media resource |
| canPlayType() | Reports whether the browser can likely play the specified audio/video type (returns "", "maybe", or "probably") |
| addTextTrack() | Adds a new text track to the audio/video |

Commonly used properties:

| Property | Description |
|---|---|
| autoplay | Sets or returns whether the audio/video should start playing as soon as it is loaded |
| controls | Sets or returns whether the audio/video controls (play/pause, etc.) are displayed |
| currentTime | Sets or returns the current playback position in the audio/video (in seconds) |
| duration | Returns the length of the current audio/video (in seconds) |
| src | Sets or returns the current source of the audio/video element |
| volume | Sets or returns the volume of the audio/video (from 0.0 to 1.0) |
| buffered | Returns a TimeRanges object representing the buffered portions of the audio/video |
| playbackRate | Sets or returns the speed at which the audio/video is played |
| error | Returns a MediaError object representing the error state of the audio/video |
| readyState | Returns the current ready state of the audio/video |
| ... | ... |

Commonly used events:

| Event | Description |
|---|---|
| loadedmetadata | Triggered when the browser has loaded the metadata of the audio/video |
| canplay | Triggered when the browser can start playing the audio/video |
| play | Triggered when the audio/video has been started or is no longer paused |
| playing | Triggered when playback is ready to start after having been paused or stalled for buffering |
| pause | Triggered when the audio/video has been paused |
| timeupdate | Triggered when the current playback position has changed |
| seeking | Triggered when the user starts moving/jumping to a new position in the audio/video |
| seeked | Triggered when the user has finished moving/jumping to a new position in the audio/video |
| waiting | Triggered when playback stops because the next frame needs to be buffered |
| ended | Triggered when playback has reached the end of the audio/video |
| ... | ... |
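A few of the methods, properties, and events listed above can be combined as follows. This is a minimal sketch: the helper names `pickPlayableSource` and `startPlayback` and the file paths are hypothetical, and the commented-out usage assumes a page that already contains a <video> element.

```javascript
// Returns the first candidate the element reports it can play, or null.
// canPlayType() returns "", "maybe" or "probably".
function pickPlayableSource(video, candidates) {
  return candidates.find((c) => video.canPlayType(c.type) !== "") ?? null;
}

// Logs a few media events and starts playback.
function startPlayback(video, source, log = console.log) {
  video.addEventListener("loadedmetadata", () => log(`duration: ${video.duration}s`));
  video.addEventListener("timeupdate", () => log(`position: ${video.currentTime}s`));
  video.addEventListener("ended", () => log("finished"));
  video.src = source.src;
  // play() returns a Promise; browsers may reject it, e.g. when autoplay is blocked.
  return video.play().catch((err) => log(`play() rejected: ${err.name}`));
}

// Browser usage (hypothetical file names):
// const video = document.querySelector("video");
// const source = pickPlayableSource(video, [
//   { src: "./video.webm", type: "video/webm" },
//   { src: "./video.mp4", type: "video/mp4" },
// ]);
// if (source) startPlayback(video, source);
```

Attaching the listeners before setting `src` avoids missing early events such as `loadedmetadata`.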
Limitations#
- In most browsers, <audio> and <video> cannot directly play formats such as HLS and FLV.
- Requests for and loading of video resources cannot be controlled from code, so the following cannot be achieved:
  - Segmented loading (to save bandwidth)
  - Seamless quality switching
  - Precise preloading
MSE (Extended API)#
Media Source Extensions (MSE)
- Plays streaming media on the web without plugins.
- Supports playback of video formats such as HLS, FLV, and MP4.
- Enables segmented loading of video, seamless quality switching, adaptive bitrate, precise preloading, and more.
- Supported by major browsers, except for Safari on iOS.
Basic usage flow:
- Create a MediaSource instance.
- Create a URL pointing to the MediaSource.
- Listen for the sourceopen event.
- Create a SourceBuffer.
- Append data to the SourceBuffer.
- Listen for the updateend event.

[Figure: player playback flow]
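The MSE usage steps listed above can be sketched roughly as follows. Several things here are assumptions made for illustration: the function name, the `env` parameter (which defaults to the browser globals and exists only so the function can be exercised with stand-ins), the segment URLs, and the requirement that the segments are in a format the SourceBuffer accepts for the given MIME type (for MP4, that means fragmented MP4).

```javascript
// Feeds a list of media segments to a <video> element through MSE.
// Resolves once every segment has been appended.
function playWithMediaSource(video, mimeType, segmentUrls, env = globalThis) {
  // 1. Create a MediaSource instance.
  const mediaSource = new env.MediaSource();
  // 2. Create a URL pointing to the MediaSource and hand it to the element.
  video.src = env.URL.createObjectURL(mediaSource);

  return new Promise((resolve, reject) => {
    // 3. Listen for the sourceopen event.
    mediaSource.addEventListener("sourceopen", async () => {
      try {
        // 4. Create a SourceBuffer for the given MIME type.
        const sourceBuffer = mediaSource.addSourceBuffer(mimeType);
        for (const url of segmentUrls) {
          const response = await env.fetch(url);
          const data = await response.arrayBuffer();
          await new Promise((done) => {
            // 6. updateend signals that the buffer is ready for the next append.
            sourceBuffer.addEventListener("updateend", done, { once: true });
            // 5. Append data to the SourceBuffer.
            sourceBuffer.appendBuffer(data);
          });
        }
        mediaSource.endOfStream();
        resolve();
      } catch (err) {
        reject(err);
      }
    }, { once: true });
  });
}

// Browser usage (hypothetical URLs and codec string):
// const type = 'video/mp4; codecs="avc1.42E01E, mp4a.40.2"';
// if (MediaSource.isTypeSupported(type)) {
//   playWithMediaSource(document.querySelector("video"), type, ["./init.mp4", "./seg1.m4s"]);
// }
```

Because the code decides which segment to fetch next, this is where segmented loading, quality switching, and preloading strategies would plug in.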
Streaming Media Protocols#
HLS (HTTP Live Streaming) is an HTTP-based media streaming protocol proposed by Apple for real-time audio and video streaming. It is now widely used for both video on demand and live streaming.
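In HLS, a stream is described by a text playlist (.m3u8) that lists short media segments, which the player downloads over HTTP one after another. As a rough illustration, the sketch below extracts the segment list from a small, made-up media playlist. It handles only the `#EXTINF` tag and is not a complete HLS parser; the function name and segment file names are hypothetical.

```javascript
// Extracts { uri, duration } entries from the text of an HLS media playlist.
function parseMediaPlaylist(text) {
  const segments = [];
  let pendingDuration = null;
  for (const rawLine of text.split(/\r?\n/)) {
    const line = rawLine.trim();
    if (line.startsWith("#EXTINF:")) {
      // Format: "#EXTINF:<duration>,<optional title>"
      pendingDuration = parseFloat(line.slice("#EXTINF:".length));
    } else if (line !== "" && !line.startsWith("#")) {
      // Any non-empty line that is not a tag or comment is a segment URI.
      segments.push({ uri: line, duration: pendingDuration });
      pendingDuration = null;
    }
  }
  return segments;
}

// A made-up playlist with two segments.
const playlist = `#EXTM3U
#EXT-X-VERSION:3
#EXT-X-TARGETDURATION:10
#EXTINF:9.5,
segment0.ts
#EXTINF:10.0,
segment1.ts
#EXT-X-ENDLIST`;

console.log(parseMediaPlaylist(playlist));
```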
Application Scenarios#
- Video on demand / live streaming -> video upload -> video transcoding
- Images -> support for newer image formats
- Cloud gaming -> no bulky client to download; the game runs on a remote server, with the video stream sent to the player and the player's input sent back (so latency requirements are strict)
Summary and Reflections#
In this lesson, the teacher introduced the basic concepts of Web multimedia technology, such as encoding formats, encapsulation formats, multimedia elements, and streaming protocols, and walked through the main application scenarios of Web multimedia.
Most of the content in this article comes from Teacher Liu Liguo's class and MDN.