An Analysis of How cuTree Works in x264 and x265

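mbtree is an innovative technique introduced in x264 that effectively improves both subjective and objective quality (see Table 1 at the end of this article). x265 inherited the algorithm and renamed it cuTree. The implementation is fairly complex, so this article first explains the principle behind cuTree and then walks through the code to analyze the implementation details.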

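Both cuTree and mbtree adjust qpOffset according to how heavily the current block is referenced. Finding out how heavily a block is referenced clearly requires working backward through the frames, from the blocks that reference it to the block itself.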

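For inter prediction, the quality of the reference frame directly affects the quality of the current frame. The cost of a reference block must therefore cover not only its own coding cost but also its influence on the future blocks that reference it. When analyzing the cost of each block, cuTree introduces the notion of PropagateInCost: a block's cost is its own coding cost plus the cost for which later blocks depend on it, and this extra cost is called PropagateInCost. The key question is how to determine PropagateInCost.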

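Consider a simplified case: block B references block A entirely, and the intra and inter prediction costs of B are IntraCostB and InterCostB. If IntraCostB and InterCostB are about equal, B obtains very little information from A. Conversely, if IntraCostB is much larger than InterCostB, most of B's information can be obtained from A. Following this idea, the amount of information that B itself obtains from A can be written as (IntraCostB - InterCostB). Going one step further, B is in turn referenced by other blocks, so B's cost also includes PropagateInCostB. Putting this together, the cost by which B depends on A is: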

(IntraCostB - InterCostB) + PropagateInCostB * (IntraCostB - InterCostB) / IntraCostB
    = (IntraCostB + PropagateInCostB) * (IntraCostB - InterCostB) / IntraCostB

Here (IntraCostB - InterCostB) / IntraCostB is the fraction of B's PropagateInCostB that is passed on to A.

The x265 function that computes the propagate cost is shown below. Its basic idea is exactly the one described above.

    /* Estimate the total amount of influence on future quality that could be had if we
     * were to improve the reference samples used to inter predict any given CU. */
    static void estimateCUPropagateCost(int* dst, const uint16_t* propagateIn, const int32_t* intraCosts,
                                        const uint16_t* interCosts, const int32_t* invQscales,
                                        const double* fpsFactor, int len)
    {
        double fps = *fpsFactor / 256; // range[0.01, 1.00]

        for (int i = 0; i < len; i++)
        {
            int intraCost = intraCosts[i];
            // The inter cost is capped at the intra cost, so the propagated fraction is never negative.
            int interCost = X265_MIN(intraCosts[i], interCosts[i] & LOWRES_COST_MASK);
            double propagateIntra = intraCost * invQscales[i]; // Q16 x Q8.8 = Q24.8
            double propagateAmount = (double)propagateIn[i] + propagateIntra * fps; // Q16.0 + Q24.8 x Q0.x = Q25.0
            double propagateNum = (double)(intraCost - interCost); // Q32 - Q32 = Q33.0
            double propagateDenom = (double)intraCost; // Q32
            dst[i] = (int)(propagateAmount * propagateNum / propagateDenom + 0.5);
        }
    }

As stated above, when B references A entirely, A receives PropagateInCostA = (IntraCostB + PropagateInCostB) * (IntraCostB - InterCostB) / IntraCostB.

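In the more general case, an MV rarely points at exactly one whole coding block, so the propagate cost of B has to be added proportionally to each of the reference-frame blocks that the referenced area overlaps. The x265 cuTree function is as follows: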

    void Lookahead::estimateCUPropagate(Lowres **frames, double averageDuration, int p0, int p1, int b, int referenced)
    {
        uint16_t *refCosts[2] = { frames[p0]->propagateCost, frames[p1]->propagateCost };
        int32_t distScaleFactor = (((b - p0) << 8) + ((p1 - p0) >> 1)) / (p1 - p0);
        int32_t bipredWeight = m_param->bEnableWeightedBiPred ? 64 - (distScaleFactor >> 2) : 32;
        int32_t bipredWeights[2] = { bipredWeight, 64 - bipredWeight };
        int listDist[2] = { b - p0 - 1, p1 - b - 1 };

        memset(m_scratch, 0, m_8x8Width * sizeof(int));

        uint16_t *propagateCost = frames[b]->propagateCost;

        x265_emms();
        double fpsFactor = CLIP_DURATION((double)m_param->fpsDenom / m_param->fpsNum) / CLIP_DURATION(averageDuration);

        /* For non-referred frames the source costs are always zero, so just memset one row and re-use it. */
        if (!referenced)
            memset(frames[b]->propagateCost, 0, m_8x8Width * sizeof(uint16_t));

        int32_t strideInCU = m_8x8Width;
        for (uint16_t blocky = 0; blocky < m_8x8Height; blocky++)
        {
            int cuIndex = blocky * strideInCU;

            // Compute the propagate cost of every block in this row of frames[b]; the result goes to m_scratch.
            if (m_param->rc.qgSize == 8)
                primitives.propagateCost(m_scratch, propagateCost,
                                         frames[b]->intraCost + cuIndex, frames[b]->lowresCosts[b - p0][p1 - b] + cuIndex,
                                         frames[b]->invQscaleFactor8x8 + cuIndex, &fpsFactor, m_8x8Width);
            else
                primitives.propagateCost(m_scratch, propagateCost,
                                         frames[b]->intraCost + cuIndex, frames[b]->lowresCosts[b - p0][p1 - b] + cuIndex,
                                         frames[b]->invQscaleFactor + cuIndex, &fpsFactor, m_8x8Width);

            if (referenced)
                propagateCost += m_8x8Width;

            // Add the propagate cost of each block of frames[b] proportionally to the blocks of the reference frame(s).
            for (uint16_t blockx = 0; blockx < m_8x8Width; blockx++, cuIndex++)
            {
                int32_t propagate_amount = m_scratch[blockx];
                /* Don't propagate for an intra block. */
                if (propagate_amount > 0)
                {
                    /* Access width-2 bitfield. */
                    int32_t lists_used = frames[b]->lowresCosts[b - p0][p1 - b][cuIndex] >> LOWRES_COST_SHIFT;
                    /* Follow the MVs to the previous frame(s). */
                    for (uint16_t list = 0; list < 2; list++)
                    {
                        if ((lists_used >> list) & 1)
                        {
    #define CLIP_ADD(s, x) (s) = (uint16_t)X265_MIN((s) + (x), (1 << 16) - 1)
                            int32_t listamount = propagate_amount;
                            /* Apply bipred weighting. */
                            if (lists_used == 3)
                                listamount = (listamount * bipredWeights[list] + 32) >> 6;

                            MV *mvs = frames[b]->lowresMvs[list][listDist[list]];

                            /* Early termination for simple case of mv0. */
                            // MV (0, 0): add directly to the co-located entry of the reference frame's propagateCost array.
                            if (!mvs[cuIndex].word)
                            {
                                CLIP_ADD(refCosts[list][cuIndex], listamount);
                                continue;
                            }

                            // MV != (0, 0): the referenced area overlaps four blocks, idx0..idx3,
                            // which receive the shares idx0weight..idx3weight.
                            int32_t x = mvs[cuIndex].x;
                            int32_t y = mvs[cuIndex].y;
                            int32_t cux = (x >> 5) + blockx;
                            int32_t cuy = (y >> 5) + blocky;
                            int32_t idx0 = cux + cuy * strideInCU;
                            int32_t idx1 = idx0 + 1;
                            int32_t idx2 = idx0 + strideInCU;
                            int32_t idx3 = idx0 + strideInCU + 1;
                            x &= 31;
                            y &= 31;
                            int32_t idx0weight = (32 - y) * (32 - x);
                            int32_t idx1weight = (32 - y) * x;
                            int32_t idx2weight = y * (32 - x);
                            int32_t idx3weight = y * x;

                            /* We could just clip the MVs, but pixels that lie outside the frame probably shouldn't
                             * be counted. */
                            if (cux < m_8x8Width - 1 && cuy < m_8x8Height - 1 && cux >= 0 && cuy >= 0)
                            {
                                CLIP_ADD(refCosts[list][idx0], (listamount * idx0weight + 512) >> 10);
                                CLIP_ADD(refCosts[list][idx1], (listamount * idx1weight + 512) >> 10);
                                CLIP_ADD(refCosts[list][idx2], (listamount * idx2weight + 512) >> 10);
                                CLIP_ADD(refCosts[list][idx3], (listamount * idx3weight + 512) >> 10);
                            }
                            else /* Check offsets individually */
                            {
                                if (cux < m_8x8Width && cuy < m_8x8Height && cux >= 0 && cuy >= 0)
                                    CLIP_ADD(refCosts[list][idx0], (listamount * idx0weight + 512) >> 10);
                                if (cux + 1 < m_8x8Width && cuy < m_8x8Height && cux + 1 >= 0 && cuy >= 0)
                                    CLIP_ADD(refCosts[list][idx1], (listamount * idx1weight + 512) >> 10);
                                if (cux < m_8x8Width && cuy + 1 < m_8x8Height && cux >= 0 && cuy + 1 >= 0)
                                    CLIP_ADD(refCosts[list][idx2], (listamount * idx2weight + 512) >> 10);
                                if (cux + 1 < m_8x8Width && cuy + 1 < m_8x8Height && cux + 1 >= 0 && cuy + 1 >= 0)
                                    CLIP_ADD(refCosts[list][idx3], (listamount * idx3weight + 512) >> 10);
                            }
                        }
                    }
                }
            }
        }

        if (m_param->rc.vbvBufferSize && m_param->lookaheadDepth && referenced)
            cuTreeFinish(frames[b], averageDuration, b == p1 ? b - p0 : 0);
    }
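The four-way split in the function above may be easier to follow in isolation. The sketch below assumes, as the shifts by 5 and masks with 31 suggest, that lowres MVs are stored in quarter-pel units, so that one 8x8 block spans 32 units. `SplitTarget` and `splitByMv` are hypothetical names introduced for illustration, and the frame-boundary checks and 16-bit saturation of the real code are left out.

```cpp
#include <array>
#include <cstdint>

// Hypothetical illustration type, not part of x265.
struct SplitTarget
{
    int32_t cux, cuy; // coordinates of the 8x8 block in the reference frame
    int32_t weight;   // share of the referenced area, out of 1024 (= 32 * 32)
};

// Split the area referenced by MV (x, y) of block (blockx, blocky) over the
// four 8x8 blocks it overlaps in the reference frame.
std::array<SplitTarget, 4> splitByMv(int32_t blockx, int32_t blocky, int32_t x, int32_t y)
{
    // Whole-block part of the displacement (arithmetic shift rounds toward
    // negative infinity on the usual two's-complement targets).
    int32_t cux = (x >> 5) + blockx;
    int32_t cuy = (y >> 5) + blocky;

    // Remaining offset inside a block, always in 0..31.
    x &= 31;
    y &= 31;

    return {{
        { cux,     cuy,     (32 - y) * (32 - x) }, // idx0: top-left
        { cux + 1, cuy,     (32 - y) * x },        // idx1: top-right
        { cux,     cuy + 1, y * (32 - x) },        // idx2: bottom-left
        { cux + 1, cuy + 1, y * x },               // idx3: bottom-right
    }};
}
```

For block (3, 2) with MV (40, -8), the top-left target is block (4, 1) and the four weights are 192, 64, 576 and 192, which sum to 1024. This is why the x265 code divides by 1024 (`>> 10`) after multiplying by a weight.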

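Finally, the QPOffset of the current CU naturally depends on PropagateInCost: the larger PropagateInCost is, the lower the CU's QP should be, so QPOffset is negative and becomes more negative. In x265, the cuTree QPOffset = -strength * log2(1 + PropagateInCost / IntraCost), applied on top of the AQ offset. The actual code is in the function cuTreeFinish, shown below.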

    void Lookahead::cuTreeFinish(Lowres *frame, double averageDuration, int ref0Distance)
    {
        int fpsFactor = (int)(CLIP_DURATION(averageDuration) / CLIP_DURATION((double)m_param->fpsDenom / m_param->fpsNum) * 256);
        double weightdelta = 0.0;

        if (ref0Distance && frame->weightedCostDelta[ref0Distance - 1] > 0)
            weightdelta = (1.0 - frame->weightedCostDelta[ref0Distance - 1]);

        frame->qpAvgFrmCuTreeOffset = 0.0;
        for (int cuIndex = 0; cuIndex < m_cuCount; cuIndex++)
        {
            int intracost = (frame->intraCost[cuIndex] * frame->invQscaleFactor[cuIndex] + 128) >> 8;
            if (intracost)
            {
                int propagateCost = (frame->propagateCost[cuIndex] * fpsFactor + 128) >> 8;
                // log2((intracost + propagateCost) / intracost) = log2(1 + propagateCost / intracost)
                double log2_ratio = X265_LOG2(intracost + propagateCost) - X265_LOG2(intracost) + weightdelta;
                frame->qpCuTreeOffset[cuIndex] = frame->qpAqOffset[cuIndex] - m_cuTreeStrength * log2_ratio;
                frame->qpAvgFrmCuTreeOffset += frame->qpCuTreeOffset[cuIndex];
            }
        }

        frame->qpAvgFrmCuTreeOffset /= m_cuCount;
    }
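To get a feel for the size of the adjustment, the log2 term can be evaluated on its own. The sketch below is a simplification: `cuTreeQpDelta` is a hypothetical helper, it ignores the AQ offset, the fixed-point scaling and `weightdelta`, and the strength value used in the example is just an illustrative number rather than a setting taken from the text.

```cpp
#include <cmath>

// Hypothetical helper (not part of x265): the QP change contributed by
// cuTree alone, i.e. -strength * log2(1 + propagateCost / intraCost).
double cuTreeQpDelta(double intraCost, double propagateCost, double strength)
{
    if (intraCost <= 0.0)
        return 0.0; // mirrors the "if (intracost)" guard: no adjustment
    return -strength * std::log2(1.0 + propagateCost / intraCost);
}
```

With a strength of 2.0, a block whose propagate cost is three times its intra cost gets a delta of -2 * log2(4) = -4, while a block that nothing references (propagate cost 0) gets 0.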

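Tables 1 and 2 below show the effect of cuTree on objective coding quality in x265 v2.4. The encoder settings are preset=medium, rate control=ABR, BFrames=3 (or 0) and aq-mode off; the test sequences are the HEVC Class B (1080p) sequences. With BFrames=3, enabling cuTree saves 6.52% of the bitrate for Y, 15.38% for U and 15.56% for V, a very clear gain in compression efficiency. With BFrames=0, enabling cuTree increases the Y bitrate by 0.72% and saves 2.4% for U and 1.4% for V, so compression efficiency barely improves. The reason is that with BFrames=3 cuTree lowers the QP of I and P frames by a large amount, lowers the QP of B-Ref frames moderately, and leaves the QP of B-Non-Ref frames unchanged, which is essentially similar to the hierarchical QP used in HM. With BFrames=0, the QP of every P frame is lowered by roughly the same amount, which is effectively the same as not adjusting QP at all.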

Table 1. Bitrate savings from cuTree in x265 (BFrames=3)

Sequence          BD-Rate Y   BD-Rate U   BD-Rate V
BasketballDrive     -4.8%       -8.5%       -3.5%
BQTerrace           -3.5%      -19.2%      -21.3%
Cactus              -9.6%      -15.9%      -13.9%
Kimono              -3.3%      -13.4%      -17.1%
ParkScene          -11.4%      -19.9%      -22.0%
Average             -6.52%     -15.38%     -15.56%


Table 2. Bitrate savings from cuTree in x265 (BFrames=0)

Sequence          BD-Rate Y   BD-Rate U   BD-Rate V
BasketballDrive      2.4%        2.3%        4.9%
BQTerrace            4.1%       -0.9%        0.8%
Cactus              -1.8%       -4.6%       -2.3%
Kimono               2.3%       -0.6%       -2.7%
ParkScene           -3.4%       -8.2%       -7.7%
Average              0.72%      -2.4%       -1.4%

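One point deserves attention. As described above, cuTree derives qpOffset by working from later frames back to earlier ones. When rate control is enabled, x264 and x265 use a lookahead mechanism: the encoder looks at the frames that follow the current frame and uses them to choose a suitable QP, a suitable frame type and so on for the current frame. In the code, the number of frames cuTree looks ahead equals lookahead_num. For example, with the x265 medium preset lookahead_num defaults to 20, so cuTree starts at the 20th frame after the current frame and propagates backward until it reaches the current frame, which yields qpOffset. lookahead_num therefore directly affects the cuTree result: different values give slightly different QPOffset values, although the effect is not large.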

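In addition, both the frame-type decision and the computation of the cuTree QPOffset in x265 work on one MiniGOP at a time, i.e. each pass determines the coding parameters for one MiniGOP. With 3 B-frames, the cuTree pass is therefore invoked once every 4 frames (bBbP), whereas with 0 B-frames it is invoked for every P frame. Each pass propagates back over 20 (lookahead_num) frames, which is a considerable amount of computation. This is the likely reason why, when encoding at very high resolutions, 0 B-frames is sometimes slower than 3 B-frames.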
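The argument above amounts to a simple ratio, sketched below. This is a back-of-the-envelope model rather than a measurement: `cuTreeStepsPerFrame` is a hypothetical helper, and it assumes that every cuTree pass walks the full lookahead window and that one pass is made per MiniGOP of bframes + 1 frames.

```cpp
// Hypothetical estimate (not taken from x265): frame-level propagation steps
// that cuTree performs per encoded frame, on average.
double cuTreeStepsPerFrame(int lookaheadDepth, int bframes)
{
    int miniGopSize = bframes + 1; // e.g. 3 B-frames + 1 P-frame = 4 frames
    return static_cast<double>(lookaheadDepth) / miniGopSize;
}
```

Under these assumptions, a lookahead of 20 costs about 5 propagation steps per frame with 3 B-frames and 20 steps per frame with 0 B-frames, a fourfold difference in cuTree work.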
