This updates uops_macros.h and the graph.c implementation in lockstep;
otherwise we'd have an intermediate commit with a bunch of broken formats.
Overall speedup=1.008x faster, min=0.144x max=5.550x
The min/max numbers are mostly measurement noise, but the real speedup for
affected formats is anywhere from 0.9x to around 2x-3x.
It's worth noting that the formats which currently regress do so because
we don't yet refcopy the planes, but I have another series in the works
which will take care of this soon.
Sponsored-by: Sovereign Tech Fund
Signed-off-by: Niklas Haas <git@haasn.dev>
This commit only adds the uop itself; it does not yet add any implementations.
Sponsored-by: Sovereign Tech Fund
Signed-off-by: Niklas Haas <git@haasn.dev>
I decided to model this as a separate read/write type, rather than as a
separate operation (e.g. SWS_OP_PALETTE), because it makes the semantics
surrounding the read value range, plane pointer setup, etc. much cleaner.
(This will become evident in the upcoming changes to the dispatch layer)
Sponsored-by: Sovereign Tech Fund
Signed-off-by: Niklas Haas <git@haasn.dev>
We also drop the useless/unused mask from the permute ops.
Avoids a bunch of otherwise duplicate permute ops. Now that this is
handled by SWS_UOP_MOVE for x86, there is no downside to this.
The FATE change is a pure rename of the uops dumps.
Sponsored-by: Sovereign Tech Fund
Signed-off-by: Niklas Haas <git@haasn.dev>
The first computation in a linear row doesn't have anything to
accumulate to, so a multiply-accumulate instruction won't be used
either way. This led to identical functions being instantiated for
different params.
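To illustrate why the duplicates arise, here is a minimal sketch. It assumes
a hypothetical encoding with one "may use FMA" bit per matrix entry and
assumes that column 0 is the first product of each row; neither the bit
layout nor the names are taken from the actual uop definition. Since the
first product has no accumulator, its bit cannot change the generated code,
so clearing it maps equivalent parameters onto a single kernel.

```c
#include <stdint.h>

#define ROW_COLS 5 /* hypothetical: 4 coefficients + 1 offset per row */
#define NUM_ROWS 4

/* Hypothetical layout: bit (row * ROW_COLS + col) marks an entry that may
 * be computed with a fused multiply-add. */
static uint32_t canonical_fma_mask(uint32_t mask)
{
    for (int row = 0; row < NUM_ROWS; row++) {
        /* Column 0 is assumed to be the first product of the row. It is a
         * plain multiply either way, so its FMA bit carries no information. */
        mask &= ~(UINT32_C(1) << (row * ROW_COLS));
    }
    return mask;
}
```

With this layout, masks 0x2 and 0x3 both canonicalize to 0x2 and would
therefore select the same function.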
There is no easy optimization that can be triggered by knowing that the
offset is exactly 1. This led to identical functions being instantiated
for different params.
I want to start adding more data layouts, like semiplanar formats (nv12), or
palette formats. I made an effort to distinguish existing checks for rw.packed
into "mode != PLANAR" and "mode == PACKED", based on the intent of the
surrounding code, in anticipation of these new layouts.
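As a rough sketch of the distinction (the enum and helper names here are
hypothetical; only "PLANAR" and "PACKED" come from the text above), the two
kinds of checks agree for today's layouts but diverge once more modes exist:

```c
#include <stdbool.h>

/* Hypothetical names, for illustration only. */
typedef enum RWMode {
    RW_MODE_PLANAR,
    RW_MODE_PACKED,
    RW_MODE_SEMIPLANAR, /* e.g. nv12, a possible future layout */
    RW_MODE_PALETTE,    /* a possible future layout */
} RWMode;

/* For code that only cares that data is not stored one component per plane. */
static bool mode_is_not_planar(RWMode mode)
{
    return mode != RW_MODE_PLANAR;
}

/* For code that relies on fully interleaved (packed) data specifically. */
static bool mode_is_packed(RWMode mode)
{
    return mode == RW_MODE_PACKED;
}
```

A semiplanar mode would satisfy the first predicate but not the second,
which is why a single boolean can no longer express both intents.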
Sponsored-by: Sovereign Tech Fund
Signed-off-by: Niklas Haas <git@haasn.dev>
Instead of hard-coding SWS_PIXEL_F32 here. This is not really useful
yet, but I wanted to clean up the semantics here regardless.
Sponsored-by: Sovereign Tech Fund
Signed-off-by: Niklas Haas <git@haasn.dev>
This is a minor cosmetic improvement that allows me to use more
convenient names for the filter-related metadata fields, without
confusion.
Sponsored-by: Sovereign Tech Fund
Signed-off-by: Niklas Haas <git@haasn.dev>
This decomposes a swizzle mask into a series of optimal register-register
moves, using at most two temporary scratch registers.
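For readers unfamiliar with the problem, the general idea can be sketched as
follows. This is not the code from this commit: it is a textbook
sequentialization of a parallel move, it reuses a single scratch register
rather than two, and it makes no claim of producing an optimal sequence. All
names (Move, TMP_REG, decompose_swizzle, apply_moves) are hypothetical.

```c
#define NUM_REGS 4
#define TMP_REG  NUM_REGS /* hypothetical scratch register index */

typedef struct Move { int dst, src; } Move;

/* Turn the parallel assignment reg[i] = old_reg[mask[i]] into a sequence of
 * register-register moves. Writes at most NUM_REGS + NUM_REGS / 2 moves to
 * out and returns how many were written. */
static int decompose_swizzle(const int mask[NUM_REGS], Move *out)
{
    int src[NUM_REGS]; /* pending source of each destination, -1 if done */
    int pending = 0, n = 0;

    for (int i = 0; i < NUM_REGS; i++) {
        src[i] = mask[i] == i ? -1 : mask[i]; /* identity needs no move */
        pending += src[i] >= 0;
    }

    while (pending) {
        int progress = 0;
        for (int d = 0; d < NUM_REGS; d++) {
            int still_read = 0;
            if (src[d] < 0)
                continue;
            /* d may only be overwritten once no pending move reads it */
            for (int k = 0; k < NUM_REGS; k++)
                still_read |= src[k] == d;
            if (still_read)
                continue;
            out[n++] = (Move) { d, src[d] };
            src[d] = -1;
            pending--;
            progress = 1;
        }
        if (!progress) {
            /* Only cycles remain: save one member to the scratch register
             * and redirect its readers there, which breaks the cycle. */
            int d = 0;
            while (src[d] < 0)
                d++;
            out[n++] = (Move) { TMP_REG, d };
            for (int k = 0; k < NUM_REGS; k++)
                if (src[k] == d)
                    src[k] = TMP_REG;
        }
    }
    return n;
}

/* Execute a move list on a register file with NUM_REGS + 1 entries. */
static void apply_moves(int *regs, const Move *moves, int n)
{
    for (int i = 0; i < n; i++)
        regs[moves[i].dst] = regs[moves[i].src];
}
```

Broadcast-style masks such as {0, 0, 0, 0} need no scratch register at all,
while each cycle in the mask costs one extra move through TMP_REG.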
This is a better match for ASM-style backends than the existing PERMUTE/COPY
uops that are designed for the needs of the C backend (or other backends which
either apply the swizzle mask directly or permute pointers).
I originally had logic equivalent to this written in NASM macros, but it was
just such a complicated mess that I think it's better to rewrite it in C and
have the resulting metadata be an explicit part of the uop definition.
This commit only adds the uop; I'll update the x86 implementation in the
next step.
Co-authored-by: Ramiro Polla <ramiro.polla@gmail.com>
Signed-off-by: Niklas Haas <git@haasn.dev>
The old x86 backend was the only backend that actually mutated the ops list.
With this gone, we can constify this parameter.
Signed-off-by: Niklas Haas <git@haasn.dev>
The ops.h infrastructure currently hard-codes this as SWS_PIXEL_F32,
but I want to at least properly parametrize this in case we ever
decide to revisit this decision in the future. In particular, it
may become relevant for trivial kernels or kernels whose intermediates
are bounded, exact integers (which could possibly be output directly
as e.g. U16 or U32).
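As a purely illustrative sketch of the "bounded, exact integers" case, and
not something this commit implements, one could pick the narrowest
intermediate type from a worst-case bound. The AccType enum and
pick_acc_type() are hypothetical, and the sketch assumes non-negative
integer weights:

```c
#include <stddef.h>
#include <stdint.h>

typedef enum AccType { ACC_U16, ACC_U32, ACC_F32 } AccType; /* hypothetical */

/* Pick the narrowest type that can hold the worst-case sum of
 * weights[i] * max_input exactly, falling back to float otherwise. */
static AccType pick_acc_type(const uint32_t *weights, size_t n,
                             uint32_t max_input)
{
    uint64_t bound = 0;
    for (size_t i = 0; i < n; i++) {
        bound += (uint64_t) weights[i] * max_input;
        if (bound > UINT32_MAX)
            return ACC_F32; /* too large for an exact 32-bit integer */
    }
    return bound <= UINT16_MAX ? ACC_U16 : ACC_U32;
}
```

For example, taps {1, 2, 1} on 8-bit input are bounded by 4 * 255 = 1020,
which fits comfortably in 16 bits.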
The FATE change is just because the filter op names gained a suffix.
Sponsored-by: Sovereign Tech Fund
Signed-off-by: Niklas Haas <git@haasn.dev>
Analog of SWS_UOP_READ_PLANAR_FV for FMA-enabled backends.
The logic for determining when we can safely use FMA is maybe a bit
obtuse, given that a `return type == SWS_PIXEL_U8` would have just done
the trick as well, but better to be safe than sorry, if we ever decide to
tune this constant in the future.
Signed-off-by: Niklas Haas <git@haasn.dev>
This is like SWS_UOP_LINEAR but parametrized by which matrix entries can use
FMA instead of bitexact IEEE mul/add instructions.
I decided to make these a separate uop to avoid bogging down the reference
backend with arch-specific details like FMA. However, I think FMA ops are quite
common/universal so I pre-emptively split it into its own separate flag rather
than defining something like SWS_UOP_FLAG_X86.
Signed-off-by: Niklas Haas <git@haasn.dev>
And SWS_BITEXACT|SWS_ACCURATE_RND, for completeness. This roughly doubles
the runtime of the uops macros generation. Let's hope it doesn't explode
further.
Signed-off-by: Niklas Haas <git@haasn.dev>
This list is currently empty but will be expanded by the following commit.
I briefly tested whether it would be worth avoiding the free/realloc on
the uops array, but found the performance difference to be negligible.
Signed-off-by: Niklas Haas <git@haasn.dev>
This follows the same approach as is used currently by ops_entries_aarch64,
except I decided to have the generation logic live directly in uops.c
to allow re-using internal helpers and move it closer to the other helpers
that depend on the exact set of uops and their fields.
Unlike libswscale/tests/sws_ops.c, we make an effort to actually test all
relevant flag combinations, since these can affect the generated op lists.
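Enumerating flag combinations amounts to walking every subset of a flag
list. A minimal sketch, assuming placeholder flag bits (FLAG_A and FLAG_B
are not the real SWS_* values) and a hypothetical helper name:

```c
#include <stddef.h>

/* Placeholder flag bits for illustration only. */
enum { FLAG_A = 1 << 0, FLAG_B = 1 << 1 };

/* Write every subset of flag_list (OR-ed together) to out, which must have
 * room for 2^num_flags entries. Returns the number of combinations. */
static size_t enumerate_flag_combos(const unsigned *flag_list,
                                    size_t num_flags, unsigned *out)
{
    size_t count = 0;
    for (unsigned subset = 0; subset < (1u << num_flags); subset++) {
        unsigned flags = 0;
        for (size_t i = 0; i < num_flags; i++) {
            if (subset & (1u << i))
                flags |= flag_list[i];
        }
        out[count++] = flags;
    }
    return count;
}
```

The number of combinations doubles with each flag added to the list, so the
generation time grows accordingly.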
I will use these macros to auto-generate both the C template-based kernels,
as well as the entire x86 backend, in the near future, hence their excessive
flexibility.
Re-use the libswscale/tests/sws_ops.c that we already compile. We could put it
in its own file but this is just as convenient, and it's easily moved anyways.
Having it be a FATE test ensures that it is always up-to-date.
Signed-off-by: Niklas Haas <git@haasn.dev>
This will replace the fuzzy matching logic in op_match() that is used by the
C and x86 implementations, as well as the translation to AARCH64_OP_* that is
used by the NEON asmgen backend.
Down the line, this function will also take a set of flags to enable
backend-specific kernels like FMA variants, but I decided to keep it
simple initially to ease the transition.
Signed-off-by: Niklas Haas <git@haasn.dev>
Taken from AARCH64_OP_*, but generalized/simplified a bit and updated to add
missing op types, especially for special cases that already have dedicated
implementations on x86.
This initial definition is kept intentionally simple and close to SwsOp, to
make it easier to port the existing ops backends to the new infrastructure.
However, in the future, this will be refactored dramatically - distinctions
like convert vs expand will cease to exist on the SwsOp level, and will
instead be introduced by separate optimization passes on the uops level.
SWS_UOP_LINEAR in particular will most likely be broken up into multiple
uops. I also took this opportunity to redefine the mask in a more useful way.
I decided to split up SWS_OP_CONVERT as well, because it was making x86
codegen unnecessarily difficult due to the strong interaction between exact
pixel sizes.
Signed-off-by: Niklas Haas <git@haasn.dev>