PROJET AUTOBLOG ~ Streisand Effect

Site original :

⇐ retour index

Mise à jour

Mise à jour de la base de données, veuillez patienter...

Scrcpy 2.0, with audio

Sunday 12 March 2023 à 02:30

I am thrilled to announce the release of scrcpy 2.0. Now, you can mirror (and record) your Android 11+ devices in real-time with audio forwarded to your computer!

This new version also includes an option to select the video and audio codecs. The device screen can now be encoded in H.265, or even AV1 if your device supports AV1 encoding (though this is unlikely).

The application is free and open source. Follow the instructions to install it and run it on your computer.

If you like scrcpy, you can support my open source work.


Audio usage

Audio forwarding is supported for devices with Android 11 or higher, and it is enabled by default:

You can disable audio with:

scrcpy --no-audio

If audio is enabled, it is also recorded:

scrcpy --record=file.mkv

Unlike video, audio requires some buffering even in real-time. The buffer size needs to be small enough to maintain acceptable latency, but large enough to minimize buffer underrun, which causes audio glitches. The default buffer size is set to 50ms, but it can be adjusted:

scrcpy --audio-buffer=30

To improve playback smoothness, you may deliberately increase the latency:

scrcpy --audio-buffer=200

This is useful, for example, to project your personal videos on a bigger screen:

scrcpy --video-codec=h265 --display-buffer=200 --audio-buffer=200

You can also select the audio codec and bit rate (default is Opus at 128Kbps). As a side note, I’m particularly impressed by the Opus codec at very low bit rate:

scrcpy --audio-codec=opus --audio-bit-rate=16k
scrcpy --audio-codec=aac --audio-bit-rate=16k

See the audio documentation page for more details.


The first version of scrcpy was released 5 years ago. Since then, audio forwarding has been one of the most requested features (see issue #14).

I made a first experimentation and developed USBaudio as a solution, but it worked poorly and the feature it relied on was deprecated in Android 8.

With the introduction of a new API to capture audio from an Android app in Android 10, I made a prototype called sndcpy. However, there were several issues. Firstly, it required to be invoked from an Android app (the scrcpy server is not an Android app, but a Java executable run with shell permissions). Most importantly, this API lets apps decide whether they can be captured or not, meaning many apps simply could not be captured, causing confusion for users.

By the end of January, @yume-chan (a scrcpy user), provided a proof-of-concept to capture the device audio with shell permissions and also proposed a working workaround for Android 11.

Since then, I have been working on a proper integration into scrcpy (my evenings and weekends have been quite busy 🙂). I added encoding, recording, buffering and playback with clock drift compensation to prevent audio delay from drifting.

Below are more technical details.

Audio capture

On the device, audio is captured by an AudioRecord with REMOTE_SUBMIX as the audio source.

The API is straightforward to use, but not very low-latency friendly. It is possible to read a number of requested bytes in one of two modes:

However, the most useful mode, which is a blocking read that may return less data than requested (like the read() system call), is missing.

Since the amount of data available is unknown beforehand, in READ_BLOCKING mode, scrcpy might wait for too long. Conversely, in READ_NON_BLOCKING mode, scrcpy would read in a live-loop, burning CPU while the function returns 0 most of the time.

I decided to use READ_BLOCKING with a size of 5ms (960 bytes).

Anyway, in practice, on the devices I tested on, audio blocks are produced only every 20ms, introducing a latency of 20ms. This is not a limiting factor though, since default OPUS and AAC encoders implementations on Android produce frame sizes of 960 samples (20ms) and 1024 samples (21.33ms) respectively (and they are not configurable).

In these conditions, scrcpy reads successively 4 blocks of 5 ms every 20ms. Although the number of requested bytes could be increased to 20ms (3840 bytes), in theory some devices might capture audio faster.

With the missing blocking mode (READ_BLOCKING_THE_REAL_ONE), it would be possible to request a read with a larger buffer (e.g. 500ms) in one call, and the AudioRecord would return as much data as possible whenever it is available.

Audio encoding

The captured audio samples are then encoded by MediaCodec, which offers both synchronous and asynchronous APIs.

For our purpose, we need to execute two actions in parallel:

Therefore, the asynchronous API is more suitable than the synchronous one.

Here is how it is documented:

MediaCodec codec = MediaCodec.createByCodecName(name);
codec.setCallback(new MediaCodec.Callback() {
    void onInputBufferAvailable(MediaCodec mc, int inputBufferId) {
        ByteBuffer inputBuffer = codec.getInputBuffer(inputBufferId);
        // fill inputBuffer with valid data
        codec.queueInputBuffer(inputBufferId, );

    void onOutputBufferAvailable(MediaCodec mc, int outputBufferId, ) {
        ByteBuffer outputBuffer = codec.getOutputBuffer(outputBufferId);
        // outputBuffer is ready to be processed or rendered.
        codec.releaseOutputBuffer(outputBufferId, );


However, there is a catch: the callbacks (onInputBufferAvailable() and onOutputBufferAvailable()) are called from the same thread and cannot run in parallel.

Filling an input buffer requires a blocking call to read from the AudioRecord, while processing the output buffers involves a blocking call to send the data to the client over a socket.

If we were to process the buffers directly from the callbacks, the processing of an output buffer would be delayed until the blocking call to completes (which may be up to 20ms as described in the previous section). This would result in additional latency.

To address this issue, the callback only submits tasks to input and output queues, which are processed by dedicated threads:

// simplified
codec.setCallback(new MediaCodec.Callback() {
    void onInputBufferAvailable(MediaCodec mc, int inputBufferId) {
        inputTasks.put(new InputTask(index));

    void onOutputBufferAvailable(MediaCodec mc, int outputBufferId,
                                 MediaCodec.BufferInfo bufferInfo) {
        outputTasks.put(new OutputTask(index, bufferInfo);


Client architecture

Here is an overview of the client architecture for the video and audio streams:

                                                 V4L2 sink
                                     /         \
        VIDEO -------------> demuxer             display
        AUDIO -------------> demuxer
                                       decoder --- audio player

The video and audio are captured and encoded on the device, and the resulting packets are sent via separate sockets over an adb tunnel using a custom protocol. This protocol transmits the raw encoded packets with packet headers that provide early information about packet boundaries (useful to reduce video latency) and PTS (used for recording).

Video and audio streams are then demuxed into packets by a demuxer.

If recording is enabled, the recorder asynchronously muxes the elementary streams into MP4 or MKV. Thus, the packets are encoded on the device side, but muxed on the client side (it’s the division of labour!).

If a display or V4L2 is enabled, then the video packets must be decoded by a decoder into video frames to be displayed or sent to V4L2.

If audio playback is enabled (currently when a display is enabled), the audio packets are decoded into audio frames (blocks of samples) and played by the audio player.

Audio player

This is the last component I implemented (I wrote recording before playback), because it is the trickiest, especially to compensate for the following:

While scrcpy displays the latest received video frame without buffering, this isn’t possible for audio. Playing the latest received audio sample would be meaningless.

As input, the player regularly receives AVFrames of decoded audio samples. As output, a callback regularly requests audio samples to be played. In between, an audio buffer stores produced samples that have yet to be consumed.

The player aims to feed the audio output with as little latency as possible while avoiding buffer underrun. To achieve this, it attempts to maintain the average buffering (the number of samples present in the buffer) around a target value. If this target buffering is too low, then buffer underrun will occur frequently. If it is too high, then latency becomes unacceptable. This target value is configured using the scrcpy option --audio-buffer.

The playback relies only on buffer filling, the PTS are not used at all by the audio player (just as they are not used for video mirroring, unless video buffering is enabled). PTS are only used for recording.

The player cannot adjust the sample input rate (it receives samples produced in real-time) or the sample output rate (it must provide samples as requested by the audio output callback). Therefore, it may only apply compensation by resampling (converting m input samples to n output samples).

The compensation itself is applied by swresample (FFmpeg). It is configured using swr_set_compensation(). An important work for the player is to estimate the compensation value regularly and apply it.

The estimated buffering level is the result of averaging the “natural” buffering (samples are produced and consumed by blocks, so it must be smoothed), and making instant adjustments resulting of its own actions (explicit compensation and silence insertion on underflow), which are not smoothed.

Buffer underflow events can occur when packets arrive too late. In that case, the player inserts silence. Once the packets finally arrive (late), one strategy could be to drop the samples that were replaced by silence, in order to keep a minimal latency. However, dropping samples in case of buffer underflow is inadvisable, as it would temporarily increase the underflow even more and cause very noticeable audio glitches.

Therefore, the player doesn’t drop any sample on underflow. The compensation mechanism will absorb the delay introduced by the inserted silence.


I’m delighted that scrcpy now supports audio forwarding after much effort.

While I expect that the audio player will require some fine-tuning in the future to better handle edge cases, it currently performs quite well.

I would like to extend a huge thank you to @yume-chan for his initial proof-of-concept, which made this feature possible.

Happy mirroring!

Audio forwarding on Android 10

Tuesday 9 June 2020 à 21:50

Audio forwarding is one of the most requested features in scrcpy (see issue #14).

Last year, I published a small experiment (USBaudio) to forward audio over USB, using the AOA2 protocol. Unfortunately, this Android feature was unreliable, and has been deprecated in Android 8.

Here is a new tool I developed to play the device audio output on the computer, using the Playback Capture API introduced in Android 10: sndcpy.

The name follows the same logic:

This is a quick proof-of-concept, composed of:

The long-term goal is to implement this feature properly in scrcpy.

How to use sndcpy

You could either download the release or build the app.

VLC must be installed on the computer.

Plug an Android 10 device with USB debugging enabled, and execute:


If several devices are connected (listed by adb devices):

./sndcpy <serial>  # replace <serial> by the device serial

(omit ./ on Windows)

It will install the app on the device, and request permission to start audio capture:


Once you clicked on START NOW, press Enter in the console to start playing on the computer. Press Ctrl+c in the terminal to stop (except on Windows, just disconnect the device or stop capture from the device notifications).

The sound continues to be played on the device. The volume can be adjusted independently on the device and on the computer.

Apps restrictions

sndcpy may only forward audio from apps which do not prevent audio capture. The rules are detailed in §capture policy:

  • By default, apps that target versions up to and including to Android 9.0 do not permit playback capture. To enable it, include android:allowAudioPlaybackCapture="true" in the app’s manifest.xml file.
  • By default, apps that target Android 10 (API level 29) or higher allow their audio to be captured. To disable playback capture, include android:allowAudioPlaybackCapture="false" in the app’s manifest.xml file.

So some apps might need to be updated to support audio capture.

Integration in scrcpy

Ideally, I would like scrcpy to support audio forwarding directly. However, this will require quite a lot of work.

In particular, scrcpy does not use an Android app (required for capturing audio), it currently only runs a Java main as shell (required to inject events and capture the screen without asking).

And it will require to implement audio playback (done by VLC in this PoC), but also audio recording (for scrcpy --record file.mkv), encoding and decoding to transmit a compressed stream, handle audio-video synchronization…

Since I develop scrcpy on my free time, this feature will probably not be integrated very soon. Therefore, I prefer to release a working proof-of-concept which does the job.

Introducing USBaudio

Thursday 20 June 2019 à 09:40

Forwarding audio from Android devices

In order to support audio forwarding in scrcpy, I first implemented an experimentation on a separate branch (see issue #14). But it was too hacky and fragile to be merged (and it does not work on all platforms).

So I decided to write a separate tool: USBaudio.

It works on Linux with PulseAudio.

How to use USBaudio

First, you need to build it (follow the instructions).

Plug an Android device. If USB debugging is enabled, just execute:


If USB debugging is disabled (or if multiple devices are connected), you need to specify a device, either by their serial or vendor id and product_id (as printed by lsusb):

usbaudio -s 0123456789abcdef
usbaudio -d 18d1:4ee2

The audio should be played on the computer.

If it’s stuttering, try increasing the live caching value (at the cost of a higher latency):

# default is 50ms
usbaudio --live-caching=100

Note that it can also be directly captured by OBS:


How does it work?

USBaudio executes 3 steps successively:

  1. It enables audio accessory on the device (by sending AOA requests via libusb), so that the audio is forwarded over USB. If it works, PulseAudio (or ALSA) on the computer should detect a new audio input source.
  2. It retrieves the PulseAudio input source id associated to the Android device (via libpulse).
  3. It execs VLC to play audio from this input source.

Note that enabling audio accessory changes the USB device product id, so it will close any adb connection (and scrcpy). Therefore, you should enable audio forwarding before running scrcpy.


To only enable audio accessory without playing:

usbaudio -n
usbaudio --no-play

The audio input sources can be listed by:

pactl list short sources

For example:

$ pactl list short sources
5   alsa_input.usb-LGE_Nexus_5_05f5e60a0ae518e5-01.analog-stereo     module-alsa-card.c  s16le 2ch 44100Hz   RUNNING

Use the number (here 5) to play it with VLC:

vlc -Idummy --live-caching=50 pulse://5

Alternatively, you can use ALSA directly:

cat /proc/asound/cards

For example:

$ cat /proc/asound/cards
 1 [N5             ]: USB-Audio - Nexus 5
                      LGE Nexus 5 at usb-0000:00:14.0-4, high speed

Use the device number (here 1) as follow:

vlc -Idummy --live-caching=50 alsa://hw:1

If it works manually but not automatically (without -n), then please open an issue.


It does not work on all devices, it seems that audio accessory is not always well supported. But it’s better than nothing.

Android Q added a new playback capture API. Hopefully, scrcpy could use it to forward audio in the future (but only for Android Q devices).

A new core playlist for VLC 4

Tuesday 21 May 2019 à 09:25

The core playlist in VLC was started a long time ago. Since then, it has grown to handle too many different things, to the point it became a kind of god object.

In practice, the playlist was also controlling playback (start, stop, change volume…), configuring audio and video outputs, storing media detected by discovery

For VLC 4, we wanted a new playlist API, containing a simple list of items (instead of a tree), acting as a media provider for a player, without unrelated responsibilities.

I wrote it several months ago (at Videolabs). Now that the old one has been removed, it’s time to give some technical details.



One major design goal is to expose what UI frameworks need. Several user interfaces, like Qt, Mac OS and Android1, will use this API to display and interact with the main VLC playlist.

The playlist must be performant for common use cases and usable from multiple threads.

Indeed, in VLC, user interfaces are implemented as modules loaded dynamically. In general, there is exactly one user interface, but there may be none or (in theory) several. Thus, the playlist may not be bound to the event loop of some specific user interface. Moreover, the playlist may be modified from a player thread; for example, playing a zip archive will replace the item by its content automatically.

As a consequence, the playlist will use a mutex; to avoid ToCToU issues, it will also expose public functions to lock and unlock it. But as we will see later, there will be other consequences.

Data structure

User interfaces need random access to the playlist items, so a vector is the most natural structure to store the items. A vector is provided by the standard library of many languages (vector in C++, Vec in Rust, ArrayList in Java…). But here, we’re in C, so there is nothing.

In the playlist, we only need a vector of pointers, so I first proposed improvements to an existing type, vlc_array_t, which only supports void * as items. But it was considered useless (1, 2) because it is too limited and not type-safe.

Therefore, I wrote vlc_vector. It is implemented using macros so that it’s generic over its item type. For example, we can use a vector of ints as follow:

// declare and initialize a vector of int

// append 0, 10, 20, 30 and 40
for (int i = 0; i < 5; ++i) {
    if (!vlc_vector_push(&vec, 10 * i)) {
        // allocation failure...

// remove item at index 2

// the vector now contains [0, 10, 30, 40]

int first =[0]; // 0
int last =[vec.size - 1]; // 40

// free resources

Internally, the playlist uses a vector of playlist items:

typedef struct VLC_VECTOR(struct vlc_playlist_item *) playlist_item_vector_t;

struct vlc_playlist {
    playlist_item_vector_t items;
    // ...

Interaction with UI

UI frameworks typically use list models to bind items to a list view component. A list model must provide:

In addition, the model must notify its view when items are inserted, removed, moved or updated, and when the model is reset (the whole content should be invalidated).

For example, Qt list views use QAbstractItemModel/QAbstractListModel and the Android recycler view uses RecyclerView.Adapter.

The playlist API exposes the functions and callbacks providing these features.


However, the core playlist may not be used as a direct data source for a list model. In other words, the functions of a list model must not delegate the calls to the core playlist.

To understand why, let’s consider a typical sequence of calls executed by a view on its model, from the UI thread:


If we implemented count() and get(index) by delegating to the playlist, we would have to lock each call individually:

// in some artificial UI framework in C++

int MyModel::count() {
    // don't do this
    int count = vlc_playlist_Count();
    return count;

vlc_playlist_item_t *MyModel::get(int index) {
    // don't do this
    vlc_playlist_item_t *item = vlc_playlist_Get(playlist, index);
    return item;

Note that locking and unlocking from the UI thread for every playlist item is not a good idea for responsiveness, but this is a minor issue here.

The real problem is that locking is not sufficient to guarantee correctness: the list view expects its model to return consistent values. Our implementation can break this assumption, because the playlist content could change asynchronously between calls. Here is an example:

// the playlist initially contains 5 items: [A, B, C, D, E]
model.count(); // 5
model.get(0);  // A
model.get(1);  // B
                    // the first playlist item is removed from another thread:
                    //     vlc_playlist_RemoveOne(playlist, 0);
                    // the playlist now contains [B, C, D, E]
model.get(2);  // D
model.get(3);  // E
model.get(4);  // out-of-range, undefined behavior (probably segfault)

The view could not process any notification of the item removal before the end of the current execution in its event loop… that is, at least after model.get(4). To avoid this problem, the data provided by view models must always live in the UI thread.

This implies that the UI has to manage a copy of the playlist content. The UI playlist should be considered as a remote out-of-sync view of the core playlist.

Note that the copy must not be limited to the list of pointers to playlist items: the content which is displayed and susceptible to change asynchronously (media metadata, like title or duration) must also be copied. The UI needs a deep copy; otherwise, the content could change (and be exposed) before the list view was notified… which, again, would break assumptions about the model.


The core playlist and the UI playlist are out-of-sync. So we need to “synchronize” them:

Core to UI

The core playlist is the source of truth.

Every change to the UI playlist must occur in the UI thread, yet the core playlist notification handlers are executed from any thread. Therefore, playlist callback handlers must retrieve appropriate data from the playlist, then post an event to the UI event loop2, which will be handled from the UI thread. From there, the core playlist will be out-of-sync, so it would be incorrect to access it.

The order of events forwarded to the UI thread must be preserved. That way, the indices notified by the core playlist are necessarily valid within the context of the list model in the UI thread. The core playlist events can be understood as a sequence of “patches” that the UI playlist must apply to its own copy.

This only works if only the core playlist callbacks modify the list model content.

UI to core

Since the list model can only be modified by the core playlist callbacks, it is incorrect to modify it on user actions. As a consequence, the changes must be requested to the core playlist, which will, in turn, notify the actual changes.

The synchronization is more tricky in that direction.

To understand why, suppose the user selects items 10 to 20, then drag & drop to move them to index 42. Once the user releases the mouse button to “drop” the items, we need to lock the core playlist to apply the changes.

The problem is that, before we successfully acquired the lock, another client may have modified the playlist: it may have cleared it, or shuffled it, or removed items 5 to 15… As a consequence, we cannot apply the “move” request as is, because it was created from a previous playlist state.

To solve the issue, we need to adapt the request to make it fit the current playlist state. In other words, resolve conflicts: find the items if they had been moved, ignore the items not found for removal…

For that purpose, in addition to functions modifying the content directly, the playlist exposes functions to request “desynchronized” changes, which automatically resolve conflicts and generate an appropriate sequence of events to notify the clients of the actual changes.

Let’s take an example. Initially, our playlist contains 10 items:

[A, B, C, D, E, F, G, H, I, J]

The user selects [C, D, E, F, G] and press the Del key to remove the items. To apply the change, we need to lock the core playlist.

But at that time, another thread was holding the lock to apply some other changes. It removed F and I, and shuffled the playlist:

[E, B, D, J, C, G, H, A]

Once the other thread unlocks the playlist, our lock finally succeeds. Then, we call request_remove([C, D, E, F, G]) (this is pseudo-code, the real function is vlc_playlist_RequestRemove).

Internally, it triggers several calls:

// [E, B, D, J, C, G, H, A]
remove(index = 4, count = 2)   // remove [C, G]
// [E, B, D, J, H, A]
remove(index = 2, count = 1)   // remove [D]
// [E, B, J, H, A]
remove(index = 0, count = 1)   // remove [E]
// [B, J, H, A]

Thus, every client (including the UI from which the user requested to remove the items), will receive a sequence of 3 events on_items_removed, corresponding to each removed slice.

The slices are removed in descending order for both optimization (it minimizes the number of shifts) and simplicity (the index of a removal does not depend on previous removals).

In practice, it is very likely that the request will apply exactly to the current state of the playlist. To avoid unnecessary linear searches to find the items, these functions accept an additional index_hint parameter, giving the index of the items when the request was created. It should (hopefully) almost always be the same as the index in the current playlist state.

Random playback

Contrary to shuffle, random playback does not move the items within the playlist; instead, it does not play them sequentially.

To select the next item to play, we could just pick one at random.

But this is not ideal: some items will be selected several times (possibly in a row) while some others will not be selected at all. And if loop is disabled, when should we stop? After all n items have been selected at least once or after n playbacks?

Instead, we would like some desirable properties that work both with loop enabled and disabled:

In addition, if loop is enabled:


I wrote a randomizer to select items “randomly” within all these constraints.

To get an idea of the results, here is a sequence produced for a playlist containing 5 items (A, B, C, D and E), with loop enabled (so that it continues indefinitely):


Here is how it works.

The randomizer stores a single vector containing all the items of the playlist. This vector is not shuffled at once. Instead, steps of the Fisher-Yates algorithm are executed one-by-one on demand. This has several advantages:

It also maintains 3 indexes:

0              next  head          history       size
 <------------------->               <----------->
  determinated range                 history range

Let’s reuse the example I wrote in the documentation.

Here is the initial state with our 5 items:

                next                     |
                head                     |
                |                        |
                A    B    C    D    E

The playlist calls Next() to retrieve the next random item. The randomizer picks one item (say, D), and swaps it with the current head (A). Next() returns D.

                     next                |
                     head                |
                     |                   |
                D    B    C    A    E
           determinated range

The playlist calls Next() one more time. The randomizer selects one item outside the determinated range (say, E). Next() returns E.

                          next           |
                          head           |
                          |              |
                D    E    C    A    B
           determinated range

The playlist calls Next() one more time. The randomizer selects C (already in place). Next() returns C.

                               next      |
                               head      |
                               |         |
                D    E    C    A    B
            determinated range

The playlist then calls Prev(). Since the “current” item is C, the previous one is E, so Prev() returns E, and next moves back.

                          next           |
                          |    head      |
                          |    |         |
                D    E    C    A    B
            determinated range

The playlist calls Next(), which returns C, as expected.

                               next      |
                               head      |
                               |         |
                D    E    C    A    B
            determinated range

The playlist calls Next(), the randomizer selects B, and returns it.

                                    next |
                                    head |
                                    |    |
                D    E    C    B    A
               determinated range

The playlist calls Next(), the randomizer selects the last item (it has no choice). next and head now point one item past the end (their value is the vector size).

                D    E    C    B    A
                 determinated range

At this point, if loop is disabled, it is not possible to call Next() anymore (HasNext() returns false). So let’s enable it by calling SetLoop(), then let’s call Next() again.

This will start a new loop cycle. Firstly, next and head are reset, and the whole vector belongs to the last cycle history.

                 D    E    C    B    A
                    history range

Secondly, to avoid selecting A twice in a row (as the last item of the previous cycle and the first item of the new one), the randomizer will immediately determine another item in the vector (say C) to be the first of the new cycle. The items that belong to the history are kept in order. head and history move forward.

                next |
                |    head
                |    |
                C    D    E    B    A
      determinated     history range

Finally, it will actually select and return the first item (C).

                C    D    E    B    A
      determinated     history range

Then, the user adds an item to the playlist (F). This item is added in front of history.

                     next |
                     head |
                     |    |
                C    F    D    E    B    A
              <--->     <------------------>
      determinated          history range

The playlist calls Next(), the randomizer randomly selects E. E “disappears” from the history of the last cycle. This is a general property: each item may not appear more than once in the “history” (both from the last and the new cycle). The history order is preserved.

                          next |
                          head |
                          |    |
                C    E    F    D    B    A
              <-------->     <-------------->
             determinated     history range

The playlist then calls Prev() 3 times, that yields C, then A, then B. next is decremented (modulo size) on each call.

                               |    next
                          head |    |
                          |    |    |
                C    E    F    D    B    A
              <-------->     <-------------->
             determinated     history range

Hopefully, the resulting randomness will match what people expect in practice.


The playlist can be sorted by an ordered list of criteria (a key and a order).

The implementation is complicated by the fact that items metadata can change asynchronously (for example if the player is parsing it), making the comparison function inconsistent.

To avoid the problem, a first pass builds a list of metadata for all items, then this list is sorted, and finally the resulting order is applied back to the playlist.

As a benefit, the items are locked only once to retrieved their metadata.

Interaction with the player

For VLC 4, Thomas wrote a new player API.

A player can be used without a playlist: we can set its current media and the player can request, when necessary, the next media to play from a media provider.

A playlist, on the other hand, needs a player, and registers itself as its media provider. They are tightly coupled:

To keep them synchronized:

This poses a lock-order inversion problem: for example, if thread A locks the playlist then waits for the player lock, while thread B locks the player then waits for the playlist lock, then thread A and B are deadlocked.

To avoid the problem, the player and the playlist share the same lock. Concretely, vlc_playlist_Lock() delegates to vlc_player_Lock(). In practice, the lock should be held only for short periods of time.

Media source

A separate API (media source and media tree) was necessary to expose what is called services discovery (used to detect media from various sources like Samba or MTP), which were previously managed by the old playlist.

Thus, we could kill the old playlist.


The new playlist and player API should help to implement UI properly (spoiler: a new modern UI is being developed), to avoid racy bugs and to implement new features (spoiler: gapless).

Discuss on reddit and Hacker News.

  1. Actually, the Android app will maybe continue to implement its own playlist in Java/Kotlin, to avoid additional layers (Java/JNI and LibVLC). 

  2. Even in the case where a core playlist callback is executed from the UI thread, the event must be posted to the event queue, to avoid breaking the order. Concretely, in Qt, this means connecting signals to slots using Qt::QueuedConnection instead of the default Qt::AutoConnection

  3. We use next instead of current so that all indexes are unsigned, while current could be -1

Implementing tile encoding in rav1e

Thursday 25 April 2019 à 10:15

During the last few months at Videolabs, I added support for tile encoding in rav1e (a Rust AV1 Encoder).

What is this?

AV1 is an open and royalty-free video coding format, concurrent with HEVC (H.265).

Rav1e is an encoder written in Rust, developped by Mozilla/Xiph. As such, it takes an input video and encodes it to produce a valid AV1 bitstream.

Tile encoding

Tile encoding consists in splitting video frames into tiles that can be encoded and decoded independently in parallel (to use several CPUs), at the cost of a small loss in compression efficiency.

This speeds up encoding and increases decoding frame rate.

8 tiles (4 colums × 2 rows)

Preliminary work

To prepare for tiling, some refactoring was necessary.

A frame contains 3 planes (one for each YUV component, possibly subsampled). Each plane is stored in a contiguous array, rows after rows.

To illustrate it, here is a mini-plane containing 6×3 pixels. Padding is added for alignment (and other details), so its physical size is 8×4 pixels:


In memory, it is stored in a single array:

plane memory

The number of array items separating one pixel to the one below is called the stride. Here, the stride is 8.

The encoder often needs to process rectangular regions. For that purpose, many functions received a slice of the plane array and the stride value:

pub fn write_forty_two(slice: &mut [u16], stride: usize) {
  for y in 0..2 {
    for x in 0..4 {
      slice[y * stride + x] = 42;

This works fine, but the plane slice spans multiple rows.

Let’s split our planes into 4 tiles (2 columns × 2 rows):

plane regions

In memory, the resulting plane regions are not contiguous:

plane regions memory

In Rust, it is not sufficient not to read/write the same memory from several threads, it must be impossible to write (safe) code that could do it. More precisely, a mutable reference may not alias any other reference to the same memory.

As a consequence, passing a mutable slice (&mut [u16]) spanning multiple rows is incompatible with tiling. Instead, we need some structure, implemented with unsafe code, providing a view of the authorized region of the underlying plane.

As a first step, I replaced every piece of code which used a raw slice and the stride value by the existing PlaneSlice and PlaneMutSlice structures (which first required to make planes generic after improving the Pixel trait).

After these changes, our function could be rewritten as follow:

pub fn write_forty_two<T: Pixel>(slice: &mut PlaneMutSlice<'_, T>) {
  for y in 0..2 {
    for x in 0..4 {
      slice[y][x] = 42;

Tiling structures

So now, all the code using a raw slice and a stride value has been replaced. But if we look at the definition of PlaneMutSlice, we see that it still borrows the whole plane:

pub struct PlaneMutSlice<'a, T: Pixel> {
  pub plane: &'a mut Plane<T>,
  pub x: isize,
  pub y: isize

So the refactoring, in itself, does not solves the problem.

What is needed now is a structure that exposes a bounded region of the plane.

Minimal example

For illustration purpose, let’s consider a minimal example, solving a similar problem: split a matrix into columns.

2D array

In memory, the matrix is stored in a single array:

2D array memory

To do so, let’s define a ColumnMut type, and split the raw array into columns:

use std::marker::PhantomData;
use std::ops::{Index, IndexMut};

pub struct ColumnMut<'a> {
    data: *mut u8,
    cols: usize,
    rows: usize,
    phantom: PhantomData<&'a mut u8>,

impl Index<usize> for ColumnMut<'_> {
    type Output = u8;
    fn index(&self, index: usize) -> &Self::Output {
        assert!(index < self.rows);
        unsafe { &* * self.cols) }

impl IndexMut<usize> for ColumnMut<'_> {
    fn index_mut(&mut self, index: usize) -> &mut Self::Output {
        assert!(index < self.rows);
        unsafe { &mut * * self.cols) }

pub fn columns(
    slice: &mut [u8],
    cols: usize,
) -> impl Iterator<Item = ColumnMut<'_>> {
    assert!(slice.len() % cols == 0);
    let rows = slice.len() / cols;
    (0..cols).map(move |col| ColumnMut {
        data: &mut slice[col],
        phantom: PhantomData,

The PhantomData is necessary to bind the lifetime (in practice, when we store a raw pointer, we often need a PhantomData).

We implemented Index and IndexMut traits to provide operator []:

// via Index trait
let value = column[y];

// via IndexMut trait
column[y] = value;

The iterator returned by columns() yields a different column every time, so the borrowing rules are respected.

Now, we can read from and write to a matrix via temporary column views:

fn main() {
    let mut data = [1, 5, 3, 2,
                    4, 2, 1, 7,
                    0, 0, 0, 0];

    // for each column, write the sum
    columns(&mut data, 4).for_each(|mut col| col[2] = col[0] + col[1]);

    assert_eq!(data, [1, 5, 3, 2,
                      4, 2, 1, 7,
                      5, 7, 4, 9]);

Even if the columns are interlaced in memory, from a ColumnMut instance, it is not possible to access data belonging to another column.

Note that cols and rows fields must be kept private, otherwise they could be changed from safe code in such a way that breaks boundaries and violates borrowing rules.

In rav1e

A plane is split in a similar way, except that it provides plane regions instead of colums.

The split is recursive. For example, a Frame contains 3 Planes, so a Tile contains 3 PlaneRegions, using the same underlying memory.

In practice, more structures related to the encoding state are split into tiles, provided both in const and mut versions, so there is a whole hierarchy of tiling structures:

 +- FrameState → TileState
 |  +- Frame → Tile
 |  |  +- Plane → PlaneRegion
 |  +  RestorationState → TileRestorationState
 |  |  +- RestorationPlane → TileRestorationPlane
 |  |     +- FrameRestorationUnits → TileRestorationUnits
 |  +  FrameMotionVectors → TileMotionVectors
 +- FrameBlocks → TileBlocks

The split is done by a separate component (see, which yields a tile context containing an instance of the hierarchy of tiling views for each tile.

Relative offsets

A priori, there are mainly two possibilities to express offsets during tile encoding:

The usage of tiling views strongly favors the first choice. For example, it would be confusing if a bounded region could not be indexed from 0:

// region starting at (64, 64)
let row = &region[0]; // panic, out-of-bounds
let row = &region[64]; // ok :-/

Worse, this would not be possible at all for the second dimension:

// region starting at (64, 64)
let first_row = &region[64];
let first_column = row[64]; // wrong, a raw slice necessarily starts at 0

Therefore, offsets used in tiling views are relative to the tile (contrary to libaom and AV1 specification).

Tile encoding

Encoding a frame first involves frame-wise accesses (initialization), then tile-wise accesses (to encode tiles in parallel), then frame-wise accesses using the results of tile-encoding (deblocking, CDEF, loop restoration, …).

All the frame-level structures have been replaced by tiling views where necessary.

The tiling views exist only temporarily, during the calls to encode_tile(). While they are alive, it is not possible to access frame-level structures (the borrow checker statically prevents it).

Then the tiling structures vanish, and frame-level processing can continue.

This schema gives an overview:

      +----------------+         |
      |                |         |
      |                |         |  Frame-wise accesses
      |                |          >
      |                |         |   - FrameState<T>
      |                |         |   - Frame<T>
      +----------------+         |   - Plane<T>
                                /    - ...

              ||   tiling views
  +---+  +---+  +---+  +---+     |
  |   |  |   |  |   |  |   |     |  Tile encoding (possibly in parallel)
  +---+  +---+  +---+  +---+     |
  +---+  +---+  +---+  +---+     |  Tile-wise accesses
  |   |  |   |  |   |  |   |      >
  +---+  +---+  +---+  +---+     |   - TileStateMut<'_, T>
                                 |   - TileMut<'_, T>
  +---+  +---+  +---+  +---+     |   - PlaneRegionMut<'_, T>
  |   |  |   |  |   |  |   |     |
  +---+  +---+  +---+  +---+     |

              ||   vanishing of tiling views
      +----------------+         |
      |                |         |
      |                |         |  Frame-wise accesses
      |                |          >
      |                |         |  (deblocking, CDEF, ...)
      |                |         |
      +----------------+         |


To enable tile encoding, parameters have been added to pass the (log2) number of tiles --tile-cols-log2 and --tile-rows-log2. For example, to request 2x2 tiles:

rav1e video.y4m -o video.ivf --tile-cols-log2 1 --tile-rows-log2 1

Currently, we need to pass the log2 of the number of tiles (like in libaom, even if the aomenc options are called --tile-columns and --tile-rows), to avoid any confusion. Maybe we could find a better option which is both correct, non-confusing and user-friendly later.


Now that we can encode tiles, we must write them according to the AV1 bitstream specification, so that decoders can read the resulting file correctly.

Before tile encoding (i.e. with a single tile), rav1e produced a correct bitstream. Several changes were necessary to write multiple tiles.

Tile info

According to Tile info syntax, the frame header signals the number of columns and rows of tiles (it always signaled a single tile before).

In addition, when there are several tiles, it signals two more values, described below.

CDF update

For entropy coding, the encoder maintain and update a CDF (Cumulative Distribution Function), representing the probabilities of symbols.

After a frame is encoded, the current CDF state is saved to be possibly used as a starting state for future frames.

But with tile encoding, each tile finishes with its own CDF state, so which one should we associate to the reference frame? The answer is: any of them. But we must signal the one we choose, in context_update_tile_id; the decoder needs it to decode the bitstream.

In practice, we keep the CDF from the biggest tile.

Size of tiles size

The size of an encoded tile, in bytes, is variable (of course). Therefore, we will need to signal the size of each tile.

To gain a few bytes, the number of bytes used to store the size itself is also variable, and signaled by 2 bits in the frame header (tile_size_bytes_minus_1).

Concretely, we must choose the smallest size that is sufficient to encode all the tile sizes for the frame.

Tile group

According to General tile group OBU syntax, we need to signal two values when there are more than 1 tile:

The tile size (minus 1) is written in little endian, and use the number of bytes we signaled in the frame header.

That’s all. This is sufficient to produce a correct bitstream with multiple tiles.


Thanks to Rayon, parallelizing tile encoding is as easy as replacing iter_mut() by par_iter_mut().

I tested on my laptop (8 CPUs) several encodings to compare encoding performance (this is not a good benchmark, but it gives an idea, you are encouraged to run your own tests). Here are the results:

 tiles     time      speedup
   1    7mn02,336s    1.00×
   2    3mn53,578s    1.81×
   4    2mn12,995s    3.05×
   8*   1mn57,533s    3.59×

Speedups are quite good for 2 and 4 tiles.

*The reason why the speedup is lower than expected for 8 tiles is that my CPU has actually only 4 physical cores. See this reddit comment and this other one.


Why not 2×, 4× and 8× speedup? Mainly because of Amdahl’s law.

Tile encoding parallelizes only a part of the whole process: there are still single-threaded processings at frame-level.

Suppose that a proportion p (between 0 and 1) of a given task can be parallelized. Then its theoretical speedup is 1 / ((p/n) + (1-p)), where n is the number of threads.

 tiles   speedup   speedup   speedup
         (p=0.9)   (p=0.95)  (p=0.98)
   2      1.82×     1.90×     1.96×
   4      3.07×     3.48×     3.77×
   8      4.71×     5.93×     7.02×

Maybe counterintuitively, to increase the speedup brought by parallelization, non-parallelized code must be optimized (the more threads are used, the more the non-parallelized code represents a significant part).

The (not-so-reliable) benchmark results for 2 and 4 tiles suggest that tile encoding represents ~90% of the whole encoding process.

Fixing bugs

Not everything worked the first time.

The most common source of errors while implementing tile encoding was related to offsets.

When there was only one tile, all offsets were relative to the frame. With several tiles, some offsets are relative to the current tile, but some others are still relative to the whole frame. For example, during motion estimation, a motion vector can point outside tile boundaries in the reference frame, so we must take care to convert offsets accordingly.

The most obvious errors were catched by plane regions (which prevent access outside boundaries), but some others were more subtle.

Such errors could produce interesting images. For example, here is a screenshot of my first tiled video:


One of these offsets confusions had been quickly catched by barrbrain in intra-prediction. I then fixed a similar problem in inter-prediction.

But the final boss bug was way more sneaky: it corrupted the bitstream (so the decoder was unable to decode), but not always, and never the first frame. When an inter-frame could be decoded, it was sometimes visually corrupted, but only for some videos and for some encoding parameters.

After more than one week of investigations, I finally found it. \o/


Implementing this feature was an awesome journey. I learned a lot, both about Rust and video encoding (I didn’t even know what a tile was before I started).

Big thanks to the Mozilla/Xiph/Daala team, who has been very welcoming and helpful, and who does amazing work!

Discuss on r/programming, r/rust, r/AV1 and Hacker News.

Can't retrieve feed: file_get_contents( failed to open stream: Connection timed out