This fallback function is used if external MMX is available,
while inline MMX and intrinsics for emitting emms are unavailable.
It is implemented as an avpriv function, which has several
drawbacks for shared builds:
1. The function is so small (3 bytes; 16 with padding)
that the overhead of exporting and importing it dwarfs
the gains from code deduplication.
2. A call to an external library has more overhead than
a library-internal one.
3. It may cause linking failures when a libavutil not exporting
avpriv_emms_asm() is paired with a library needing it
(if inline assembly and intrinsics were unavailable when building
the dependent library). I am not aware of this ever happening.
4. We would be forced to keep avpriv_emms_asm() around for ABI stability
even after it is no longer needed.
This commit therefore uses the STLIBOBJS/SHLIBOBJS approach to
duplicate it into each library that needs it.
Signed-off-by: Andreas Rheinhardt <andreas.rheinhardt@outlook.com>
This updates uops_macros.h and the graph.c implementation in lockstep,
otherwise we'd have an intermediate commit with a bunch of broken formats.
Overall speedup=1.008x faster, min=0.144x max=5.550x
The min/max numbers are mostly measurement noise, but the real speedup for
affected formats is anywhere from 0.9x to around 2x-3x.
It's worth noting that the formats which currently regress do so because
we don't yet refcopy the planes, but I have another series in the works
which will take care of this soon.
Sponsored-by: Sovereign Tech Fund
Signed-off-by: Niklas Haas <git@haasn.dev>
This already helps performance as-is, but will help performance massively
once we add the ability for the memcpy backend to do a refcopy instead of
an actual copy.
Sponsored-by: Sovereign Tech Fund
Signed-off-by: Niklas Haas <git@haasn.dev>
Can be used to extract a reduced subset of operations affecting only certain
output planes, e.g. splitting an op list into a "memcpy" and a "non-memcpy"
part, or splitting apart op lists for independent or subsampled planes.
Sponsored-by: Sovereign Tech Fund
Signed-off-by: Niklas Haas <git@haasn.dev>
If the filter cannot actually be optimized into the read (for whatever
reason), this code would previously loop infinitely. Bail out cleanly
instead.
The FFSWAP is there to make the error message print the remainder (the one
containing unsplittable ops), rather than the noop list.
Sponsored-by: Sovereign Tech Fund
Signed-off-by: Niklas Haas <git@haasn.dev>
Instead of a loop with fixed structure, this function now recursively calls
itself as many times as needed to satisfy all criteria.
This is absolutely needed for the upcoming refactor which will allow for
also splitting apart ops lists as needed to e.g. handle partially subsampled
ops lists, which may need a complex sequence of filtering and merge steps
to be fully satisfied.
This does slightly modify the way in which subpasses are compiled, in that
each new subpass is first tried again un-split, rather than a single split
resulting in all subsequent passes being split as well. This is mostly a
benign change, though it might matter one day.
Sponsored-by: Sovereign Tech Fund
Signed-off-by: Niklas Haas <git@haasn.dev>
We already have the unoptimized reference ops; printing each intermediate
stage here is just noise that makes this file harder to scroll through IMO.
Sponsored-by: Sovereign Tech Fund
Signed-off-by: Niklas Haas <git@haasn.dev>
This will make it easier to keep passing around these parameters in helper
functions in the upcoming refactor.
Take the opportunity to also rename the plain `compile` function to
`compile_single`.
Sponsored-by: Sovereign Tech Fund
Signed-off-by: Niklas Haas <git@haasn.dev>
Otherwise, this will yield a false negative if the redundant operations
haven't been optimized away yet, resulting in unnecessary memcpy operations.
Fixes: a534156083
Sponsored-by: Sovereign Tech Fund
Signed-off-by: Niklas Haas <git@haasn.dev>
Makes ff_sws_compile_pass() more robust; will be needed for plane splitting.
Besides, it's perfectly valid to have an operation list that starts with
e.g. SWS_OP_CLEAR.
Sponsored-by: Sovereign Tech Fund
Signed-off-by: Niklas Haas <git@haasn.dev>
This prevents the generation of a few more duplicate functions (where
there would be both f32 and u32 functions).
Sponsored-by: Sovereign Tech Fund
Signed-off-by: Ramiro Polla <ramiro.polla@gmail.com>
There is no easy optimization that can be triggered by knowing that the
offset is exactly 1. This led to identical functions being instantiated
for different params.
Also simplified the AVRational comparisons a bit.
Sponsored-by: Sovereign Tech Fund
Signed-off-by: Ramiro Polla <ramiro.polla@gmail.com>
The mask for swizzle ops assumed that merely having a component assigned
to itself was enough to detect whether the swizzle was needed for that
component, but that wasn't correct. We should also take into account
whether the component is needed for the next operation or not.
Additionally, prevent duplicate functions from being generated by
clearing the swizzle index for unused components.
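
A minimal sketch of the rule described above, not the actual patch. The
helper name, the bitmask layout and the choice of 0 as the "cleared" index
are all assumptions made for illustration:

```c
#include <stdint.h>

/*
 * Hypothetical helper (not the libswscale API).
 * idx[i]  : source component feeding output component i
 * needed  : bitmask of output components consumed by the next operation
 *
 * Returns a bitmask of the components that actually have to be moved.
 * Indices of unused components are reset to 0 (an assumed canonical
 * value), so that swizzles differing only in unused components become
 * identical.
 */
static unsigned swizzle_mask(uint8_t idx[4], unsigned needed)
{
    unsigned mask = 0;
    for (int i = 0; i < 4; i++) {
        if (!(needed & (1u << i))) {
            idx[i] = 0; /* unused: clear the index */
            continue;
        }
        if (idx[i] != i)
            mask |= 1u << i; /* needed and not assigned to itself */
    }
    return mask;
}
```

With this canonicalization, two swizzles that only differ in a component
nobody reads end up with the same indices and the same mask.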
Sponsored-by: Sovereign Tech Fund
Signed-off-by: Ramiro Polla <ramiro.polla@gmail.com>
These functions are essentially the same as single-component planar
read/write, and are actually never instantiated. This was left over
from the initial implementation.
Sponsored-by: Sovereign Tech Fund
Signed-off-by: Ramiro Polla <ramiro.polla@gmail.com>
The AVX2 version is a fairly straightforward vpgatherdd + 4x4 transpose. The
SSE4 fallback is an unrolled scalar loop, for lack of anything better to do.
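
For orientation, a scalar palette read can be sketched roughly as follows.
This is only an illustration of the operation being vectorized; the
function name, signature and the planar x/y/z/w output layout are
assumptions, not the actual reference implementation:

```c
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical scalar palette read. Each 8-bit index selects one 4-byte
 * palette entry, whose bytes become the components x, y, z and w of the
 * corresponding pixel.
 */
static void read_palette_xyzw(const uint8_t *idx, const uint8_t *pal,
                              uint8_t *x, uint8_t *y,
                              uint8_t *z, uint8_t *w, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        const uint8_t *entry = pal + 4 * (size_t) idx[i];
        x[i] = entry[0];
        y[i] = entry[1];
        z[i] = entry[2];
        w[i] = entry[3];
    }
}
```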
checkasm:
- CPU: AMD Ryzen 9 9950X3D 16-Core Processor (00B40F40)
- Timing source: x86 (rdtsc)
- Bench duration: 10000 µs per function (45898205 cycles)
- Random seed: 2518020648
Benchmark results:
name cycles (vs ref)
u8_read_palette_xyzw_c: 2877.5
u8_read_palette_xyzw_x86_sse4: 1951.9 ( 1.47x)
u8_read_palette_xyzw_x86_avx2: 1051.6 ( 2.74x)
Sponsored-by: Sovereign Tech Fund
Signed-off-by: Niklas Haas <git@haasn.dev>
This is handled using the new SWS_RW_PALETTE read op mode. We need to be a bit
careful to use the correct pixfmt descriptor downstream, because the descriptor
for PAL8 itself merely describes the *index*, rather than the actual data
values.
Accomplish this by introducing a new function to map the palette format to the
resulting pixel format after applying the palette (explicitly documented as
AV_PIX_FMT_RGB32).
+pal8 16x16 -> rgb24 16x16:
+ [ u8 +++X] SWS_OP_READ : 4 elem(s) palette >> 0
+ min: {0 0 0 _}, max: {255 255 255 _}
+ [ u8 +++X] SWS_OP_SWIZZLE : 2103
+ min: {0 0 0 _}, max: {255 255 255 _}
+ [ u8 XXXX] SWS_OP_WRITE : 3 elem(s) packed >> 0
+ (X = unused, z = byteswapped, + = exact, 0 = zero)
+ translated micro-ops:
+ u8_read_palette_xyzw
+ u8_permute_xz_zx
+ u8_write_packed_xyz
...
Sponsored-by: Sovereign Tech Fund
Signed-off-by: Niklas Haas <git@haasn.dev>
This does not actually generate any code yet as the macro is still empty,
but that will change once I add support for generated palette reads to
the format handling code. This logic merely needs to be in place first
to avoid introducing broken intermediate states where palette uops are
generated but not implemented by the reference backend.
Sponsored-by: Sovereign Tech Fund
Signed-off-by: Niklas Haas <git@haasn.dev>
This commit only adds the uop itself; it does not yet add any implementations.
Sponsored-by: Sovereign Tech Fund
Signed-off-by: Niklas Haas <git@haasn.dev>
This requires some tiny bit of extra setup work from the dispatch layer.
Specifically, we need to arrange for the palette data pointer to end up in
exec.in[1], and to disable the pointer advancement logic for this plane (this
can be accomplished by just setting the stride and bump to 0).
We also want to disable the tail buffer / overflow pixel copying logic for
the palette, which can be accomplished by ensuring that p->planes_in only
contains the number of *data* planes, excluding the fixed palette.
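
To illustrate why a stride of 0 is enough to pin the palette pointer, here
is a simplified sketch. The struct and function below are hypothetical
stand-ins, not the real dispatch layer types:

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical per-line state; not the actual libswscale structs. */
typedef struct LineExec {
    const uint8_t *in[2]; /* in[0]: index plane, in[1]: palette */
    ptrdiff_t stride[2];  /* bytes to advance each pointer per line */
} LineExec;

/* Advance all input pointers to the next line. */
static void advance_line(LineExec *e)
{
    for (int i = 0; i < 2; i++)
        e->in[i] += e->stride[i];
}
```

With stride[1] set to 0, in[1] keeps pointing at the start of the palette
no matter how many lines are processed, while in[0] walks the index plane
as usual.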
Sponsored-by: Sovereign Tech Fund
Signed-off-by: Niklas Haas <git@haasn.dev>
In theory, we could learn to handle them internally, using the same
systematic palette trick, but I'll defer this for now, as vf_scale already
handles this internally.
Sponsored-by: Sovereign Tech Fund
Signed-off-by: Niklas Haas <git@haasn.dev>
I decided to model this as a separate read/write type, rather than as a
separate operation (e.g. SWS_OP_PALETTE), because it makes the semantics
surrounding the read value range, plane pointer setup, etc. much cleaner.
(This will become evident in the upcoming changes to the dispatch layer)
Sponsored-by: Sovereign Tech Fund
Signed-off-by: Niklas Haas <git@haasn.dev>
We also drop the useless/unused mask from the permute ops.
Avoids a bunch of otherwise duplicate permute ops. Now that this is
handled by SWS_UOP_MOVE for x86, there is no downside to this.
The FATE change is a pure rename of the uops dumps.
Sponsored-by: Sovereign Tech Fund
Signed-off-by: Niklas Haas <git@haasn.dev>
This should be matching against the *chroma* scaler, not the main scaler.
Of course, under normal circumstances, scaler_sub matches scaler, but this
allows users to explicitly override this default by setting e.g.
-scaler none -scaler_sub bicubic
Sponsored-by: Sovereign Tech Fund
Signed-off-by: Niklas Haas <git@haasn.dev>
Odd-size luma planes are not exact multiples of the chroma plane, but the
sample grid is still matched as though they were. We need to account for this
when translating a luma sample to the corresponding chroma sample coordinates.
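
One way to picture the issue, as a sketch under assumptions rather than the
code in this patch: the chroma plane size rounds up (the same rounding as
AV_CEIL_RSHIFT), while the coordinate mapping follows the fixed sample grid
instead of the ratio of the two plane sizes. The helper names below are
hypothetical:

```c
/*
 * Chroma plane size for a given non-negative luma size. Rounds up, so an
 * odd luma size still gets a chroma sample covering the last column/row.
 */
static int chroma_size(int luma_size, int log2_sub)
{
    return (luma_size + (1 << log2_sub) - 1) >> log2_sub;
}

/*
 * Chroma sample index for a luma coordinate. This follows the sample
 * grid (a fixed ratio of 1 << log2_sub) and deliberately does not depend
 * on the plane sizes.
 */
static int chroma_coord(int luma_coord, int log2_sub)
{
    return luma_coord >> log2_sub;
}
```

For a 5-pixel-wide luma plane with 2x subsampling, the chroma plane is 3
samples wide, yet the grid ratio is still exactly 2, not 5/3.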
Sponsored-by: Sovereign Tech Fund
Signed-off-by: Niklas Haas <git@haasn.dev>
This is needed for chroma subsampling, which requires a different filter
offset for chroma subsamples (according to the frame's chroma location).
Sponsored-by: Sovereign Tech Fund
Signed-off-by: Niklas Haas <git@haasn.dev>
This metadata is needed to compute the correct chroma sampling offsets.
We previously stored this in graph->field, but that's a bad place for it,
because it doesn't survive the translation to the ops abstraction layer.
Sponsored-by: Sovereign Tech Fund
Signed-off-by: Niklas Haas <git@haasn.dev>
I can't say I remember why this logic was written this way, but I can't
think of any good reason why we should exclude comparing the image
dimensions here - the intent is obviously to allow passthrough / noop.
Signed-off-by: Niklas Haas <git@haasn.dev>
This allows adding passes which will be dispatched over a reduced number of
lines, without affecting the allocated buffer dimensions - e.g. for passes
which purely write to subsampled chroma planes.
A few hard-coded references to pass->width/height need to be replaced by
the corresponding output frame references, but it's not a huge deal.
Sponsored-by: Sovereign Tech Fund
Signed-off-by: Niklas Haas <git@haasn.dev>
This is not only wasteful but also serves no real purpose. Looping over
the correct number of lines is trivial; there is far less point in vertical
padding than horizontal padding.
Furthermore, this might actually introduce issues when linking output buffers,
since the extra padding depends on the pass's alignment and threading
requirements, which may differ from pass to pass.
Sponsored-by: Sovereign Tech Fund
Signed-off-by: Niklas Haas <git@haasn.dev>
Replaces a few "nan" value ranges by real values, and drops a bunch of
redundant non-FMA variants that resulted from this bug.
Sponsored-by: Sovereign Tech Fund
Signed-off-by: Niklas Haas <git@haasn.dev>
We can still pre-fill the prev array here; ff_sws_apply_op_q() is a no-op.
Sponsored-by: Sovereign Tech Fund
Signed-off-by: Niklas Haas <git@haasn.dev>