Background#
In 2019, Zhixun (Accsoon) released Yingmou, its first-generation wireless image transmission product. In an era of rapidly growing self-media, camera accessories of every kind exploded onto the market: three-axis gimbals, image transmission systems, microphones, and more. With countless manufacturers large and small involved, plenty of rough, hastily developed products naturally appeared. As Zhixun's first image transmission product, Yingmou does show some signs of small-workshop engineering, but the positives outweigh them: the design is not flashy but very practical, and the user experience is quite good. Latency is low enough to pull focus while watching the feed, and it runs for 3-4 hours without external power. Combined with a launch price of 799 yuan and a later second-hand price under 300 yuan, it remains practical to this day.
From the perspective of the transmission scheme, Yingmou is a Wi-Fi image transmission system and does not come with a dedicated receiver. The camera or other HDMI source connects to the transmitter, and reception is done by connecting a mobile device to the transmitter's Wi-Fi and viewing the image remotely in a companion app. Wi-Fi is a suitable choice for low-cost transmission: ordinary Wi-Fi network cards are far easier and cheaper to design in than dedicated wireless solutions, and apart from Wi-Fi there are very few low-cost options that can sustain the 5Mbps or more needed for a 1080P video stream. For a low-cost product, omitting a dedicated receiver saves nearly half of the hardware cost and lets existing phone and tablet screens serve as displays, so no extra monitor needs to be purchased. This lowers the barrier to buying and using it and makes Yingmou well suited to individuals and small teams, at the cost of giving up the ability to connect to large monitors or feed a broadcast control room.
My research on Yingmou initially aimed to understand how it achieves a good Wi-Fi image transmission experience. Once I could correctly parse the data stream it sends, I began trying to "customize" a receiving end for it, to work around the inability to connect large screens or use it for live streaming due to the missing receiver. I implemented receiver prototypes based on the Allwinner D1 and an OBS Studio plugin. Along the way I found that Yingmou was an easy target: the Hisilicon hardware platform, the Wi-Fi transmission method, and the rudimentary early version of the receiving app all offered plenty of angles for learning and practice. The following is a record of the whole process.
Hardware#
After removing the screws on the front, the Yingmou shell can be opened to reveal the hardware solution. Below are the main chips and basic information:
- Main Chip: Hisilicon Hi3516ARBCV100 SoC
- Interface Chip: ITE IT6801FN HDMI to BT.1120 bridge
- RAM: Hynix H5TC2G63GFR DDR3L
- Flash: Macronix MX25L12835F 16MB SPI Flash
- Network Card: Olink 8121N-UH module, Qualcomm Atheros AR1021X chip, 2x2 802.11 a/n 5G, USB 2.0 interface
- MCU: ST STM32F030C8T6
The main chip, Hisilicon Hi3516A, has a single-core Cortex-A7 processor, an official SDK based on the Linux 3.4 kernel, and, crucially, H.264/H.265 video hardware codecs. Officially positioned as "a professional HD IP camera SoC with an integrated next-generation ISP", it is indeed widely used in surveillance, which reveals the essence of low-cost image transmission: take a surveillance design, replace the image sensor input with HDMI input, and it becomes an image transmission device. Surveillance solutions ship in huge volumes, and the SoC omits peripherals, such as display output, that surveillance does not need, so costs can be kept very low. The only slightly special problem is converting HDMI input into an interface the SoC already provides, such as MIPI CSI or BT.1120, but off-the-shelf ICs exist for that. Products built on the same idea include HDMI encoders that also use Hisilicon solutions; unlike Wi-Fi image transmission, which needs an external Wi-Fi network card, they can use the SoC's built-in GMAC and only need an Ethernet PHY chip for a wired connection, capturing HDMI and streaming it stably.
What initially surprised me about Yingmou's hardware was the mere 16MB of Flash: a device running Linux gets by with only 16MB of storage. On reflection, though, many routers running OpenWRT also make do with 16MB or even 4MB of Flash; video processing mainly demands RAM, not storage. As long as one is willing to forgo the rich packages of a distribution like Debian and trim everything down, a Linux built for a specific task can occupy very little space.
Board Environment#
Boot#
With a Hisilicon chip, development basically follows the official SDK. After soldering wires to the three pads marked R, T, G (UART RX, TX, GND) on the board and powering on, the U-Boot and HiLinux boot logs appear, confirming this really is a Hisilicon solution.
Executing printenv under U-Boot shows the command used to boot the kernel and the parameters passed to it. From the output, the SPI Flash layout is 1M (boot), 3M (kernel), 12M (rootfs), with rootfs being a 12MB jffs2 file system. The boot process first probes the SPI Flash (sf) device to obtain Flash information, then reads 0x300000 (3MB) of kernel from Flash starting at offset 0x100000 (1MB) into memory address 0x82000000, and finally boots the kernel from memory with the bootm command.
bootfile="uImage"
bootcmd=sf probe 0;sf read 0x82000000 0x100000 0x300000;bootm 0x82000000
bootargs=mem=128M console=ttyAMA0,115200 root=/dev/mtdblock2 rootfstype=jffs2 mtdparts=hi_sfc:1M(boot),3M(kernel),12M(rootfs)
After System Boot#
Once in the system, how do we find the image transmission program running on the board? According to the Hisilicon development environment user guide, programs that should run automatically after the system starts are added to /etc/init.d/rcS, so I opened /etc/init.d/rcS to take a look.
The main contents of rcS are as follows.
First, the kernel network buffer is modified, with the write buffer set to 0x200000 (2MB) and the read buffer set to 0x80000 (512KB):
#sys conf
sysctl -w net.core.wmem_max=2097152
sysctl -w net.core.wmem_default=2097152
sysctl -w net.core.rmem_max=524288
sysctl -w net.core.rmem_default=524288
Loading the wireless network card driver
insmod /ko/wifi/ath6kl/compat.ko
insmod /ko/wifi/ath6kl/cfg80211.ko
insmod /ko/wifi/ath6kl/ath6kl_usb.ko reg_domain=0x8349
IP/DHCP configuration
ifconfig wlan0 10.0.0.1 netmask 255.255.255.0 up
echo udhcpd
udhcpd /etc/wifi/udhcpd.conf &
#echo hostapd
#hostapd /etc/wifi/hostap.conf &
Loading MPP driver, consistent with the SDK documentation
cd /ko
./load3516a -i -sensor bt1120 -osmem 128 -online
Loading MPP mainly performs initialization and inserts a number of kernel modules; the output log is as follows:
Hisilicon Media Memory Zone Manager
Module himedia: init ok
hi3516a_base: module license 'Proprietary' taints kernel.
Disabling lock debugging due to kernel taint
load sys.ko for Hi3516A...OK!
load tde.ko ...OK!
load region.ko ....OK!
load vgs.ko for Hi3516A...OK!
ISP Mod init!
load viu.ko for Hi3516A...OK!
load vpss.ko ....OK!
load vou.ko ....OK!
load hifb.ko OK!
load rc.ko for Hi3516A...OK!
load venc.ko for Hi3516A...OK!
load chnl.ko for Hi3516A...OK!
load h264e.ko for Hi3516A...OK!
load h265e.ko for Hi3516A...OK!
load jpege.ko for Hi3516A...OK!
load vda.ko ....OK!
load ive.ko for Hi3516A...OK!
==== Your input Sensor type is bt1120 ====
acodec inited!
insert audio
==== Your input Sensor type is bt1120 ====
mipi_init
init phy power successful!
load hi_mipi driver successful!
After this, a program named RtMonitor runs; all of the image transmission business logic is implemented in it.
Exploring the board environment went surprisingly smoothly, without any obstacles; there were even debugging comments left in the scripts. In fact, the Hisilicon SDK documentation describes protective measures for every stage, such as disabling the serial port and setting a root account password; applying any of them would have caused considerable trouble.
Transmission#
Packet Capture#
First, I tried to capture packets via Wi-Fi to see what was being transmitted. Since iOS apps can be installed on ARM architecture Macs, I ran the Accsoon app and opened Wireshark to start capturing packets. I found three types of packets:
- Image Transmission → Receiver UDP: Large data volume, presumed to be the image transmission data stream;
- Receiver → Image Transmission UDP: Very short, presumed to be data acknowledgment packets;
- Receiver → Image Transmission TCP: Sent when opening the image transmission interface, triggering the aforementioned UDP transmission, approximately one packet every 0.5-1 second, presumed to be heartbeat keep-alive, containing the text "ACCSOON".
Additionally, I conducted a capture in monitor mode. It was found that when multiple devices are connected, due to the lack of high-speed multicast/broadcast mechanisms in Wi-Fi, data needs to be sent separately to each device, significantly increasing channel pressure.
Another disadvantage of Wi-Fi image transmission is that, as long as the 802.11 protocol is followed and no unfair advantage in channel contention is gained by shrinking the interframe space or the random backoff, the image transmission has no higher transmission priority than any other Wi-Fi device. When many other Wi-Fi devices are active on the channel, stuttering is inevitable. Fortunately, 5GHz channels are generally less congested than 2.4GHz.
Decompiling Android APK#
It was still difficult to see the specific content of the data packets through packet capture, especially the meanings of the various fields in the header. Therefore, I attempted to analyze the logic within the Zhixun Android app. Since the updated version added more code to support other devices, the older version was more conducive to analysis. I downloaded an older version that supports Yingmou image transmission from apkpure (Accsoon 1.2.5 Android APK). Using Jadx to decompile the apk, I mainly looked for the following content:
- UDP video stream data packet composition, to correctly parse the video stream;
- TCP control command content and sending logic, to correctly trigger the device to start sending functionality.
Analysis of the key logic in the Java code:
- MediaCodecUtil Class
  - Encapsulates operations on Android's native codec interface MediaCodec.
  - In the constructor, MediaCodec is initialized; from the initialization parameters it can be inferred that the decoder used is "video/avc", i.e. the transmitted video stream is H.264 encoded.
  - When initializing MediaCodec, the MediaCodec.configure method is passed a Surface; MediaCodec outputs the decoded video frames directly to that Surface's BufferQueue and calls back onFrameAvailable().
  - The putDataToInputBuffer method corresponds to MediaCodec's input buffer: it requests an empty buffer from the buffer queue, copies the data to be decoded into it, and places it into the input buffer queue.
  - The renderOutputBuffer method corresponds to MediaCodec's output buffer: it retrieves decoded data from the output buffer queue and then releases that buffer.
- MediaServerClass Class
  - The Start() method calls MediaRtms.Start() and TcpLinkClass.StartMediaStream(), starting UDP and TCP respectively. H264FrameReceiveHandle is passed as a callback when instantiating MediaRtms; when it is called, it ultimately invokes putDataToInputBuffer and renderOutputBuffer in MediaCodecUtil.
- MediaRtms Class
  - A thin wrapper around the rtmsBase class.
  - The Start() method creates a DatagramSocket and starts a udpRxThread thread, which continuously receives data; once data of a certain length has been received, it parses the packet header and, if the payload is video, calls the H264FrameReceiveHandle callback.
- TcpLinkClass Class
  - After StartMediaStream() is called, a KeepAliveThread thread is started, which calls the StaOp method of TcpLinkClass at 1-second intervals; StaOp implements the TCP connect, send-heartbeat, and disconnect sequence.
- SurfaceRender Class
  - Video is displayed on a GLSurfaceView control. In VideoMainActivity, the setRenderer method sets SurfaceRender as the renderer for the GLSurfaceView.
  - The onSurfaceCreated method creates a SurfaceTexture (mSurfaceTexture) bound to an OpenGL texture (mTextureId) to receive the video frames decoded by MediaCodec. It also creates the framebuffer object and texture needed for off-screen rendering, in preparation for effect processing.
  - The onDrawFrame method draws the current frame. It calls updateTexImage to update the latest image frame from the SurfaceTexture into the bound OpenGL texture. It then switches to off-screen rendering and uses a shader program to combine the video frame texture with a LUT texture, applying a 3D LUT; it then switches back to normal rendering and uses the texture produced off-screen to implement zebra stripes, black-and-white, and other effects, and displays the result; elements such as the center line and aspect-ratio frame are drawn separately.
Let's take a closer look at how long the packet header is and what information it contains.
TCP Frame:
UDP Frame:
Each Message carries one frame of the encoded stream, and each Message is preceded by a Message Header:
Each Message is split into several Frame segments for sending, and each Frame is preceded by a Frame Header:
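The concrete field layouts are those shown in the Message Header and Frame Header above and are not reproduced here. Purely to illustrate the reassembly idea, here is a sketch in C with hypothetical field names (msg_id, frame_index, frame_count, payload_len are my own placeholders, not the device's actual fields):

```c
/* Illustrative only: the real Message/Frame header layouts are the ones
 * shown in the figures above; these field names and widths are hypothetical. */
#include <stddef.h>
#include <stdint.h>
#include <string.h>

struct frame_header {        /* hypothetical header prepended to every UDP segment */
    uint32_t msg_id;         /* which Message this segment belongs to   */
    uint16_t frame_index;    /* position of this segment in the Message */
    uint16_t frame_count;    /* total number of segments in the Message */
    uint32_t payload_len;    /* payload bytes following this header     */
};

struct message_buf {
    uint32_t msg_id;
    uint16_t received;       /* segments collected so far               */
    uint16_t expected;       /* frame_count of the current Message      */
    size_t   size;           /* bytes assembled so far                  */
    uint8_t  data[512 * 1024];
};

/* Append one segment; returns 1 when the Message is complete.
 * For brevity this assumes segments arrive in order. */
static int feed_segment(struct message_buf *m, const struct frame_header *h,
                        const uint8_t *payload)
{
    if (h->msg_id != m->msg_id) {        /* new Message: reset state */
        m->msg_id   = h->msg_id;
        m->received = 0;
        m->expected = h->frame_count;
        m->size     = 0;
    }
    memcpy(m->data + m->size, payload, h->payload_len);
    m->size += h->payload_len;
    m->received++;
    return m->received == m->expected;
}
```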
H.264 Stream Extraction#
Knowing the structure of the data packets, we can start parsing. After reassembling each Message from its Frame segments, the Message content turns out to start with the fixed prefix 0x000001, the signature of a NALU (Network Abstraction Layer Unit): a one-byte NALU header follows, whose nal_unit_type field indicates the type of content in the payload. In theory, all that remains is to feed the Message contents into a decoder one by one to decode the video stream. Note, however, that the decoder relies on parameters stored in the SPS and PPS, such as profile, level, width, height, and deblocking filter settings, to decode correctly, and these must reach the decoder before the first I-frame. It is therefore best to watch nal_unit_type, wait for the SPS and PPS, and make sure they are fed to the decoder first.
NAL Header:
NALU Type:
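As a minimal sketch of that SPS/PPS gatekeeping (assuming each Message payload is a single Annex-B NALU beginning with the 0x000001 start code; the type values 5/7/8 are the standard IDR/SPS/PPS codes from the H.264 spec):

```c
#include <stddef.h>
#include <stdint.h>

enum { NALU_IDR = 5, NALU_SPS = 7, NALU_PPS = 8 };

/* nal_unit_type is the low 5 bits of the byte right after the start code. */
static int nalu_type(const uint8_t *msg, size_t len)
{
    if (len < 4 || msg[0] != 0 || msg[1] != 0 || msg[2] != 1)
        return -1;
    return msg[3] & 0x1F;
}

static int seen_sps, seen_pps;

/* Only start feeding the decoder once SPS and PPS have been seen. */
static int should_feed(const uint8_t *msg, size_t len)
{
    int type = nalu_type(msg, len);
    if (type == NALU_SPS) seen_sps = 1;
    if (type == NALU_PPS) seen_pps = 1;
    return seen_sps && seen_pps;
}
```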
Receiver Design#
Receiving Solution 1 - Computer Reception#
Once the data packet structure is clear, all that is needed is to receive the packets correctly, extract the H.264 stream, and feed it to a decoder. For efficient development and debugging I started on a computer. With a multimedia framework such as FFmpeg (the libav* libraries) or GStreamer, decoding is easy to implement. I tried FFmpeg first, which mainly involves the following steps:
- Decoder Initialization: Use avcodec_find_decoder() to find the H.264 decoder, avcodec_alloc_context3() to create a context, and avcodec_open2() to open the decoder.
- Data Decoding: Use av_packet_from_data() to wrap the data in an AVPacket, then avcodec_send_packet() to send it to the decoder, and avcodec_receive_frame() to retrieve the decoded data as an AVFrame (see the sketch after this list).
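A compact sketch of that FFmpeg path, with error handling omitted; the input is assumed to be one reassembled Annex-B message in a buffer allocated with av_malloc() (av_packet_from_data() takes ownership and expects AV_INPUT_BUFFER_PADDING_SIZE extra bytes):

```c
#include <libavcodec/avcodec.h>
#include <libavutil/mem.h>

static AVCodecContext *dec_ctx;
static AVFrame *frame;

static int decoder_init(void)
{
    const AVCodec *codec = avcodec_find_decoder(AV_CODEC_ID_H264);
    dec_ctx = avcodec_alloc_context3(codec);
    frame   = av_frame_alloc();
    return avcodec_open2(dec_ctx, codec, NULL);
}

/* 'data' is one reassembled Annex-B message, av_malloc()'d with
 * AV_INPUT_BUFFER_PADDING_SIZE padding, ownership passed to the packet. */
static void decode_message(uint8_t *data, int size)
{
    AVPacket *pkt = av_packet_alloc();
    av_packet_from_data(pkt, data, size);
    avcodec_send_packet(dec_ctx, pkt);
    while (avcodec_receive_frame(dec_ctx, frame) == 0) {
        /* frame->data / frame->linesize now hold a decoded YUV picture,
         * ready to be handed to the SDL display path. */
    }
    av_packet_free(&pkt);
}
```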
The overall logic of the program is roughly as follows:
- Main Thread: Initializes the FFmpeg decoder and SDL display, starts UDP and TCP threads. It then loops waiting for available data signals to decode and display the data.
- UDP Thread: Receives packets, collects all segments corresponding to each msg_id, combines them into a complete message, places the content into shared memory, and signals the main thread.
- TCP Thread: Sends heartbeat packets at regular intervals (a sketch follows this list).
Connecting to the image transmission's Wi-Fi, I ran the software. The image transmission was connected to an RX0 compact camera, which filmed a stopwatch on a phone for a rough latency test. The left side shows the image after the chain: phone screen display → RX0 filming the screen and outputting HDMI → HDMI into the image transmission → computer receiving wirelessly and displaying. End-to-end latency was roughly around 200ms.
Receiving Solution 2 - Development Board#
Having a program running on the computer, I hoped to port it to embedded hardware. I had previously acquired a MangoPi MQ-Pro D1 with Allwinner D1, which has HDMI output, an H.264 hardware decoder, and provides a complete SDK and documentation, meeting most of the requirements for creating a receiving end. The downside was that the Wi-Fi network card only supported 2.4GHz, so I replaced it with an RTL8821CS network card that supports the 5GHz band and compiled the corresponding driver to connect to the Yingmou hotspot.
Allwinner provides the Tina Linux SDK for the D1. Tina's selling point is that it is built on the Linux kernel plus the OpenWRT build system, making it lighter-weight for AIoT products, especially smart speakers; OpenWRT is, after all, better known for running on routers with very limited memory and storage. The marketing claims that a system which once needed 1GB DDR + 8GB eMMC can run on just 64MB DDR + 128MB NAND Flash with Tina Linux.
The D1 chip has an H.264 hardware decoder, and Tina supports libcedar's OpenMAX interface, so GStreamer can use the omxh264dec plugin to invoke libcedar for hardware video decoding. Tina also provides the sunxifbsink plugin, which drives the DE (display engine) to convert YV12 to RGB for output. Using GStreamer for decoding and display was therefore the obvious choice. After configuring the SDK according to this article and resolving various compilation issues, I had a GStreamer build with the plugins above, and application development could begin.
Although I did not think of Solution 2 while working on Solution 1 and used FFmpeg there, the TCP control command and UDP data acquisition parts can be reused. The core of using GStreamer lies in constructing a pipeline composed of elements. To feed the frame data obtained from UDP into the pipeline, we can use GStreamer's appsrc, which provides an API for pushing data into a GStreamer pipeline from the application. appsrc has two modes, push and pull: in pull mode, appsrc asks the application for data through a specified interface whenever it needs more; in push mode, the application actively pushes data into the pipeline. Adopting the push approach lets us actively "send" data into appsrc from the UDP receiving thread. We therefore create a pipeline as follows:
- Create the elements:

      appsrc = gst_element_factory_make("appsrc", "source");
      parse = gst_element_factory_make("h264parse", "parse");
      decoder = gst_element_factory_make("omxh264dec", "decoder");
      sink = gst_element_factory_make("sunxifbsink", "videosink");

  Each element's properties are set with g_object_set(); among them, caps describes the format and properties of the data stream so that each element can handle the data correctly and negotiate with its neighbors. In this application, the caps of appsrc are the most important; without them, downstream elements would not know what format the incoming content has. The caps configuration for appsrc is as follows:

      GstCaps *caps = gst_caps_new_simple("video/x-h264",
          "width", G_TYPE_INT, 1920,
          "height", G_TYPE_INT, 1080,
          "framerate", GST_TYPE_FRACTION, 30, 1,
          "alignment", G_TYPE_STRING, "nal",
          "stream-format", G_TYPE_STRING, "byte-stream",
          NULL);
      g_object_set(appsrc, "caps", caps, NULL);

- Create the pipeline, then add and link the elements:

      pipeline = gst_pipeline_new("test-pipeline");
      gst_bin_add_many(GST_BIN(pipeline), appsrc, parse, decoder, sink, NULL);
      gst_element_link_many(appsrc, parse, decoder, sink, NULL);

  This forms the pipeline appsrc → h264parse → omxh264dec → sunxifbsink, which then only needs to be set to PLAYING (a minimal start-up sketch follows this list).
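For completeness, a start-up sketch: the pipeline is set to PLAYING and a GLib main loop keeps the process alive while the UDP/TCP threads feed appsrc in the background. The run_pipeline helper name and the error handling here are just illustrative:

```c
#include <gst/gst.h>

/* Set the assembled pipeline to PLAYING and spin a main loop while the
 * UDP/TCP threads feed appsrc in the background. */
int run_pipeline(GstElement *pipeline)
{
    if (gst_element_set_state(pipeline, GST_STATE_PLAYING) ==
        GST_STATE_CHANGE_FAILURE) {
        g_printerr("failed to start pipeline\n");
        return -1;
    }

    GMainLoop *loop = g_main_loop_new(NULL, FALSE);
    g_main_loop_run(loop);          /* runs until g_main_loop_quit() */

    gst_element_set_state(pipeline, GST_STATE_NULL);
    g_main_loop_unref(loop);
    return 0;
}
```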
In the UDP thread, we still loop receiving packets, collect all the segments corresponding to each msg_id, and combine them into a complete message. The content is placed into a gst_buffer, and the buffer is pushed into appsrc with g_signal_emit_by_name(appsrc, "push-buffer", gst_buffer, &ret). Besides the frame data itself, dts, pts, and duration are important timing parameters to carry in the gst_buffer. Setting the do-timestamp property of appsrc to TRUE makes appsrc stamp timestamps automatically when it receives a buffer, but duration has to be set according to the frame rate: without it there was an indescribable "stuttering" feeling, most likely because the missing duration made the playback pacing unstable. Although it may introduce a little extra latency, it is still worth setting for a good viewing experience.
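Putting that together, the push step in the UDP thread can be sketched as follows (assuming do-timestamp is enabled on appsrc and a 30 fps stream, matching the caps above):

```c
#include <gst/gst.h>

/* Push one reassembled message into the pipeline. 'appsrc' is the element
 * created earlier with do-timestamp=TRUE; the 30 fps duration matches the
 * caps set above. */
static void push_message(GstElement *appsrc, const guint8 *data, gsize size)
{
    GstFlowReturn ret;
    GstBuffer *buf = gst_buffer_new_allocate(NULL, size, NULL);

    gst_buffer_fill(buf, 0, data, size);
    GST_BUFFER_DURATION(buf) = GST_SECOND / 30;   /* keeps playback pacing stable */

    /* do-timestamp=TRUE lets appsrc stamp pts/dts when the buffer arrives. */
    g_signal_emit_by_name(appsrc, "push-buffer", buf, &ret);
    gst_buffer_unref(buf);

    if (ret != GST_FLOW_OK)
        g_printerr("push-buffer failed: %d\n", ret);
}
```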
To compile the completed code, a Makefile can be written so that our code is compiled as a software package for OpenWRT and included in the rootfs build.
Running the software on the board, I again connected the image transmission to the RX0 camera and filmed a stopwatch on the screen for a rough latency test. The chain was: left screen display → RX0 filming the screen and outputting HDMI → HDMI into the image transmission → development board receiving wirelessly and outputting HDMI → HDMI into the right display. End-to-end latency was roughly 200-300ms, which is not low; on the positive side, video played on the screen and viewed through the image transmission was relatively smooth.
Testing the camera directly connected to the display, the process was left screen display → RX0 filming the screen and outputting via HDMI → HDMI input to the right display, with latency around 70ms. Therefore, the latency of the image transmission itself is roughly between 130ms-230ms.
With this receiving end, it is possible to connect various monitors via HDMI, allowing Yingmou to not be limited to monitoring via phones and tablets.
Receiving Solution 3 - OBS Studio Plugin#
The first two receiving solutions make it possible to monitor on a computer or an HDMI display when using Yingmou, but they still do not meet the need for low-latency live streaming. Slightly modifying the receiving program from Solution 1 to forward the H.264 stream to OBS Studio over UDP on localhost shows that enabling buffering gives smooth playback but high latency, while disabling it gives low latency but frequent stuttering that is absent when simply monitoring. Solution 2 could feed a capture card from its HDMI output, but decoding, output, and the capture card on the development board would all add latency. To reduce latency, developing an OBS plugin directly is pretty much the best choice.
OBS Studio supports extending its functionality through plugins (see "Plugins" in the OBS Studio 30.0.0 documentation on obsproject.com). According to the documentation, a Source-type plugin lets a video source be integrated into OBS. Source plugins come in two flavors: synchronous video sources and asynchronous video sources. A synchronous video source, like Image Source, is synchronized with OBS's rendering loop; OBS actively calls the source's render function to obtain frame data, which suits graphics rendering or effects processing. An asynchronous video source can run in its own worker thread, independent of OBS's rendering loop, and actively pushes frame data to OBS, which is better suited to network streams and camera inputs.
Based on the provided plugin template obs-plugintemplate, I set up the project, prepared the environment, and completed the logic by referencing the source code of the existing image_source plugin in OBS. Few changes were needed, as most of the code could be reused from Solution 2. The difference is that decoded frames cannot be sent straight to a display element; they have to be pulled out through appsink and handed to OBS Studio with obs_source_output_video().
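As a sketch of that hand-off, assuming the appsink has emit-signals enabled and the decoder delivers I420 frames (the OBS side uses the standard libobs asynchronous-video API; the plugin's own obs_source_t is passed as the signal's user data):

```c
#include <gst/gst.h>
#include <gst/app/gstappsink.h>
#include <gst/video/video.h>
#include <obs-module.h>
#include <util/platform.h>

/* appsink "new-sample" callback: pull the decoded frame and hand it to OBS.
 * Assumes the appsink has emit-signals=TRUE and the decoder outputs I420. */
static GstFlowReturn on_new_sample(GstElement *sink, gpointer user_data)
{
    obs_source_t *source = user_data;
    GstSample *sample = gst_app_sink_pull_sample(GST_APP_SINK(sink));
    if (!sample)
        return GST_FLOW_ERROR;

    GstVideoInfo info;
    gst_video_info_from_caps(&info, gst_sample_get_caps(sample));

    GstVideoFrame vframe;
    if (gst_video_frame_map(&vframe, &info, gst_sample_get_buffer(sample),
                            GST_MAP_READ)) {
        struct obs_source_frame frame = {
            .width     = GST_VIDEO_INFO_WIDTH(&info),
            .height    = GST_VIDEO_INFO_HEIGHT(&info),
            .format    = VIDEO_FORMAT_I420,
            .timestamp = os_gettime_ns(),
        };
        for (guint i = 0; i < GST_VIDEO_INFO_N_PLANES(&info); i++) {
            frame.data[i]     = GST_VIDEO_FRAME_PLANE_DATA(&vframe, i);
            frame.linesize[i] = GST_VIDEO_FRAME_PLANE_STRIDE(&vframe, i);
        }
        obs_source_output_video(source, &frame);   /* async source hand-off */
        gst_video_frame_unmap(&vframe);
    }
    gst_sample_unref(sample);
    return GST_FLOW_OK;
}
```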
After a successful build, I copied the .so file from the build directory to the OBS Studio plugin directory (e.g. /usr/local/lib/obs-plugins/) and started OBS Studio to test. The same latency test was run, with the chain being: left screen display → RX0 filming the screen and outputting HDMI → HDMI into the image transmission → computer OBS plugin receiving wirelessly and displaying. End-to-end latency was again roughly 200ms, and video played on the screen and watched through the image transmission was smooth and coherent.
OBS Studio plugin latency test: left screen display → RX0 filming the screen and outputting via HDMI → HDMI input to the image transmission → computer OBS plugin wirelessly receiving and displaying.
With all three receiving solutions connected at the same time, an increase in stuttering is noticeable, but all of them still maintain reasonably acceptable latency.
Summary#
To some extent, with powerful codecs and mature wireless technology, achieving real-time video transmission is not that difficult, and the software logic can be very simple and direct. The widespread use of Hisilicon chips with hardware encoders in surveillance and the popularity of 5GHz Wi-Fi have almost incidentally enabled products like Wi-Fi wireless image transmission to deliver a good real-time video experience at extremely low cost, and the booming development of large-screen devices lowers the barrier further. Unfortunately, the ceiling is also rather low: enjoying the Wi-Fi ecosystem means enduring its congestion, and enjoying mature hardware encoders largely means giving up the freedom to modify them. None of this prevents Yingmou itself from being a rather complete product; it does its intended job well and has no serious shortcomings. Even with many newer products on the market, it still covers basic image transmission needs well. Zhixun once described Yingmou's development process in a live stream, and their ability to grasp user pain points accurately early on and carefully build such a polished product is still admirable.