MitchAlsup1 wrote:
So, you have identified the problem:: 8-bits contains insufficient
exponent and fraction widths to be considered standard format. Thus, in
order to utilize 8-bit FP one needs several incarnations.
This just points back at the problem:: FP needs at least 10 bits.
I agree that fp10 is probably the shortest sane/useful version, but
1:3:4 does in fact contain enough exponent and mantissa bits to be
considered an ieee754 format.
On 8/3/2024 12:32 AM, Lawrence D'Oliveiro wrote:
But, what I am saying is, a lot of stuff doesn't need raytracing.
Movies and similar often use raytracing, but a lot of stuff doesn't need
it. One can often tell visually what is or is not raytracing.
Also, sometimes raytracing doesn't work out ideally either:
https://m.media-amazon.com/images/M/MV5BNmZhZGU2Y2QtYjc2NC00MzE5LWE3OGItMTFmOGU4NzVjM2YyXkEyXkFqcGdeQXRyYW5zY29kZS13b3JrZmxvdw@@._V1_.jpg
https://m.media-amazon.com/images/S/pv-target-images/ccbc22354235eb619e2ad63c56a7c89bdee327fb1e7a2cf0b965c91b2e5336cc.jpg
There is now RTX, which can do hardware raytracing.
It takes a bit more than cel-shading to get a good 2D look with 3D
software.
<https://www.youtube.com/watch?v=pKmSdY56VtY>
Possibly, but this is not typical from what I have seen.
I think compute shaders can be used for some raytracing...
On 8/3/2024 2:20 PM, Lawrence D'Oliveiro wrote:
On Sat, 3 Aug 2024 03:01:16 -0500, BGB wrote:
On 8/3/2024 12:32 AM, Lawrence D'Oliveiro wrote:
But, what I am saying is, a lot of stuff doesn't need raytracing.
Like I said, there are non-raytraced renderers inspired by video-game
engines. They don’t use OpenGL.
They are becoming few and far between wrt the big titles, so to speak.
DirectX 12 and Vulkan.
For mainstream games, there has been a move away from both OpenGL and Direct3D towards Vulkan.
Arguably, in some ways Vulkan is "better" for high-end games on modern
GPUs, but is a lot harder to use (the amount of work needed to get even
basic rendering working is very high).
Similarly, some targets, such as Android and Raspbian, were using GLES2 rather than OpenGL proper.
In contrast, OpenGL 1.x has a lower barrier to entry; and makes some sense
as a more general purpose graphics API (can also be used for GUI and
other things).
One can argue, though, that OpenGL is a heavyweight option for
general GUI rendering. An intermediate option could be an API more
focused on 2D UI drawing tasks ...
Nothing is stopping them from being used for offline rendering.
On Sat, 3 Aug 2024 11:40:23 +0200, Terje Mathisen wrote:
MitchAlsup1 wrote:
So, you have identified the problem:: 8-bits contains insufficient
exponent and fraction widths to be considered standard format. Thus, in
order to utilize 8-bit FP one needs several incarnations.
This just points back at the problem:: FP needs at least 10 bits.
I agree that fp10 is probably the shortest sane/useful version, but
1:3:4 does in fact contain enough exponent and mantissa bits to be
considered an ieee754 format.
The AI folks are quite happy with 8-bit floats for many applications. In fact, they prefer more exponent bits and fewer in the mantissa.
On 8/5/2024 11:24 AM, George Neuner wrote:
On Sat, 3 Aug 2024 21:09:43 -0000 (UTC), Lawrence D'Oliveiro
<ldo@nz.invalid> wrote:
On Sat, 3 Aug 2024 11:40:23 +0200, Terje Mathisen wrote:
MitchAlsup1 wrote:
So, you have identified the problem:: 8-bits contains insufficient
exponent and fraction widths to be considered standard format. Thus, in order to utilize 8-bit FP one needs several incarnations.
This just points back at the problem:: FP needs at least 10 bits.
I agree that fp10 is probably the shortest sane/useful version, but
1:3:4 does in fact contain enough exponent and mantissa bits to be
considered an ieee754 format.
The AI folks are quite happy with 8-bit floats for many applications. In fact, they prefer more exponent bits and fewer in the mantissa.
Insufficient precision is one of the many reasons that ANNs are prone
to hallucinate.
Also likely depends on the type of NN as well.
As noted, for some of the stuff I had tried doing, there was a
noticeable detrimental effect with fewer than around 8 to 10 bits in the mantissa for the accumulator. Weights and biases could use fewer bits
(as could the inputs/outputs between layers), but not so much the accumulator.
Whereas, large exponent ranges tended to be much less of a factor
(though with training via genetic algos, it was needed to detect and
handle some cases where values went outside of a "reasonable" exponent
range, such as E+14 or so).
One other thing I had found was that it was possible to DC-bias the
inputs (before multiplying against the weight), but the gains were small.
So, say, for each input:
(In+InBias)*Weight
Then, output:
OutFunc(Accum*OutGain+OutBias)
Though, OutGain is also debatable (as is InBias), but both seem to help slightly. Theoretically, they are unnecessary as far as the math goes
(and what gains they offer are more likely a product of numerical
precision and the training process).
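A minimal Python sketch of the neuron form described above. The function and parameter names are made up for illustration, and it assumes one InBias value per input (the post does not say whether the bias is per-input or shared):

```python
def relu(x):
    """ReLU transfer function: (x>0)?x:0."""
    return x if x > 0 else 0.0

def neuron(inputs, weights, in_bias, out_gain, out_bias, out_func=relu):
    """Accumulate (In+InBias)*Weight, then apply OutFunc(Accum*OutGain+OutBias)."""
    accum = 0.0
    for x, w, b in zip(inputs, weights, in_bias):
        accum += (x + b) * w          # DC-bias the input before the multiply
    return out_func(accum * out_gain + out_bias)
```

With InBias set to 0 and OutGain set to 1 this reduces to the conventional weighted sum plus bias, which is why the extra terms are mathematically redundant.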
Will note that for transfer functions, I have tended to use one of:
SQRT: (x>0)?sqrt(x):0
ReLU: (x>0)?x:0
SSQRT: (x>0)?sqrt(x):-sqrt(-x)
Heaviside: (x>0)?1:0
While tanh is traditionally popular, it had little obvious advantage
over SSQRT and lacks a cheap approximation (and numerical accuracy
doesn't really matter here).
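The four transfer functions listed above, written out as a small Python sketch (the function names are chosen here for illustration; the formulas are taken directly from the list):

```python
import math

def sqrt_pos(x):
    """SQRT: (x>0)?sqrt(x):0"""
    return math.sqrt(x) if x > 0 else 0.0

def relu(x):
    """ReLU: (x>0)?x:0"""
    return x if x > 0 else 0.0

def ssqrt(x):
    """SSQRT (signed square root): (x>0)?sqrt(x):-sqrt(-x)"""
    return math.sqrt(x) if x > 0 else -math.sqrt(-x)

def heaviside(x):
    """Heaviside step: (x>0)?1:0"""
    return 1.0 if x > 0 else 0.0
```

Like tanh, SSQRT is odd-symmetric and compresses large inputs, but unlike tanh it is unbounded.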
...
So, say, we have common formats:
Binary64, S.E11.F52, Common Use
Binary32, S.E8.F23, Common Use
Binary16, S.E5.F10, Less Common Use
But, things get funky below this:
A-Law: S.E3.F4 (Bias=8)
FP8: S.E4.F3 (Bias=7) (E4M3 in NVIDIA terms)
FP8U: E4.F4 (Bias=7)
FP8S: E4.F3.S (Bias=7)
Semi-absent in my case:
BFloat16: S.E8.F7
Can be faked in software in my case using Shuffle ops.
NVIDIA E5M2 (S.E5.F2)
Could be faked using RGBA32 pack/unpack ops.
No immediate plans to add these later cases as (usually) I have a need
for more precision than more exponent range. The main seeming merit of
these formats being that they are truncated forms of the wider formats.
No need to elaborate on the use-cases for Binary32 and Binary64, wide
and varied.
Binary16 is useful for graphics and audio processing.
Seemingly IEEE specifies it mostly for storage and not for computation, but for these
cases it is good enough for computation as well.
Binary16 is mostly sufficient for 3D model geometry, and for small 3D
scenes, but not really for 3D computations or larger scenes (using it
for transform or projection matrices or matrix multiply does not give acceptable results).
Does work well for fast sin/cos lookup tables (if supported natively),
say, because the error of storing an angle as 1/256 of a circle is
larger than the error introduced by the 10 bit mantissa.
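A rough Python check of the lookup-table claim above. It assumes a 256-entry sine table indexed by angle in 1/256ths of a circle, with entries rounded to Binary16 via the `struct` module's `e` format; the names are illustrative only:

```python
import math
import struct

def to_half(x):
    """Round a Python float to the nearest Binary16 value."""
    return struct.unpack("<e", struct.pack("<e", x))[0]

# 256-entry sine table, one entry per 1/256 of a circle, stored as Binary16.
SIN_TABLE = [to_half(math.sin(2 * math.pi * i / 256)) for i in range(256)]

def fast_sin(angle_index):
    """Table lookup; angle_index is in units of 1/256 of a circle."""
    return SIN_TABLE[angle_index & 0xFF]

# Error from storing each entry in Binary16.
storage_err = max(abs(SIN_TABLE[i] - math.sin(2 * math.pi * i / 256))
                  for i in range(256))

# Error from snapping an angle to the table index (sampled at a half-step offset).
angle_err = max(abs(math.sin(2 * math.pi * (i + 0.5) / 256)
                    - math.sin(2 * math.pi * i / 256))
                for i in range(256))
```

Here `storage_err` comes out well below `angle_err`, which is consistent with the angle quantization dominating the 10-bit mantissa error.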
I had also used it as the computational format in a lot of my neural-net experiments.
The 8-bit formats get a bit more niche; main use-cases mostly to save
memory.
FP8S originally existed because it was cheaper to encode/decode alongside FP8U, vs traditional FP8. Originally, FP8S replaced FP8, but now FP8 has
been re-added. I couldn't simply entirely replace FP8S back with FP8,
partly as it seems my existing binaries depend on FP8S in a few places,
and so simply replacing it would have broken my existing binaries.
So, options were to either add some separate ops for FP8, or just live
with using my wonky/non-standard FP8S format (or break my existing
binaries). Ended up deciding to re-add FP8.
FP8 is used apparently by NVIDIA GPUs, also apparently by PyTorch and a
few other things. The variant used in my case is seemingly fairly
similar to that used by NVIDIA and PyTorch.
Unlike the minifloat format described on Wikipedia (which had defined it
as following IEEE 754 rules), it differs from IEEE rules in the handling
of large and small values. No separate Inf/NaN range, rather the largest value serves as an implicit combined Inf/NaN, with the smallest value understood as 0.
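A hedged Python sketch of a decoder for the FP8 layout described here (S.E4.F3, Bias=7). It assumes that "smallest value understood as 0" means the all-zero exponent/mantissa code decodes as zero with no subnormal range, and that the all-ones code is the combined Inf/NaN (returned as infinity here). Both are readings of the post, not a confirmed specification:

```python
import math

def decode_fp8(code):
    """Decode an S.E4.F3 (Bias=7) byte to a Python float."""
    sign = -1.0 if code & 0x80 else 1.0
    exp = (code >> 3) & 0xF
    frac = code & 0x7
    if exp == 0 and frac == 0:
        return sign * 0.0             # smallest code is understood as zero
    if exp == 0xF and frac == 0x7:
        return sign * math.inf        # largest code: implicit combined Inf/NaN
    # Everything else is treated as a normal value with a hidden leading 1.
    return sign * (1.0 + frac / 8.0) * 2.0 ** (exp - 7)
```

Under these assumptions the largest finite value is 448 (code 0x7E) and the smallest nonzero magnitude is 1.125 * 2^-7 (code 0x01).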
The main difference here between FP8 and FP8S being the location of the
sign bit (putting it in the LSB initially allowed avoiding some MUX'ing
when paired with FP8U).
The re-added FP8 was instead overlapped with the unpack logic used for
A-Law (even with the obvious difference...).
The encoder-side logic for FP8 can be implemented by taking the FP8S
output and reordering the bits (in an "assign"). Though, doing this on
the decoder input side would not likely have saved anything (attempts to
MUX on the input side seemingly tend to result in duplicating any LUTs
that follows afterwards).
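The bit reordering between the two layouts can be sketched in Python as follows, assuming FP8S is laid out as E4.F3.S (sign in bit 0) and FP8 as S.E4.F3 (sign in bit 7), as in the format list earlier in the post. Function names are illustrative:

```python
def fp8s_to_fp8(code):
    """E4.F3.S -> S.E4.F3: move the sign from bit 0 up to bit 7."""
    return ((code & 0x01) << 7) | ((code & 0xFF) >> 1)

def fp8_to_fp8s(code):
    """S.E4.F3 -> E4.F3.S: move the sign from bit 7 down to bit 0."""
    return ((code << 1) & 0xFE) | ((code >> 7) & 0x01)
```

For example, 1.0 is 0x70 in FP8S and 0x38 in FP8; the conversion is a pure rotation of the byte, so no arithmetic is involved.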
Though, one could almost argue for combining all 4 cases into shared encoder/decoder modules (well, since at least 3/4 of the formats have
the mantissa and exponent bits in the same place, FP8 being the odd one
out; and A-Law being off-by-1 in terms of Bias).
This appears to be similar to what NV and PyTorch used, and also
overlaps with my handling of A-Law (though, the largest possible value
of A-Law is understood as ~ 0.969).
Where, A-Law has slightly higher precision, but is normally limited to
unit range. Main use-case is in representing audio, but was sometimes
also used when a small unit-range format was needed and precision wasn't
a priority.
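A sketch of decoding the A-Law variant as a minifloat (S.E3.F4, Bias=8). The sign polarity (bit 7 set means negative) and the handling of the E=0 range as subnormals are assumptions; the post does not spell these out, and this is not the G.711 wire encoding:

```python
def decode_alaw(code):
    """Decode an S.E3.F4 (Bias=8) byte to a unit-range Python float."""
    sign = -1.0 if code & 0x80 else 1.0
    exp = (code >> 4) & 0x7
    frac = code & 0xF
    if exp == 0:
        # Assumption: E=0 is a linear (subnormal) segment with no hidden 1.
        return sign * (frac / 16.0) * 2.0 ** (1 - 8)
    return sign * (1.0 + frac / 16.0) * 2.0 ** (exp - 8)
```

The largest code decodes to 0.96875, matching the "~ 0.969" figure quoted above.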
For example, with slight fudging, it can be used to store
unit-quaternions, among other things. It is basically accurate enough to store things like object orientations and 3D camera rotations. Though, generally, it is needed to normalize the quaternion after unpacking it.
Ironically, for A-Law, my implementation and typical use differs from
how it is usually stored in WAV files, in that in WAV files it is
generally XOR'ed with 0x55, but this is an easy enough fix when loading
audio data or similar.
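The fix-up on load amounts to a byte-wise XOR. A minimal Python sketch (the function name is made up; any other differences between the two conventions, such as sign-bit polarity, are not handled here):

```python
def toggle_wav_alaw_mask(data):
    """XOR every A-Law byte with 0x55; applying it twice restores the input."""
    return bytes(b ^ 0x55 for b in data)
```

Because XOR is its own inverse, the same function converts in either direction.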
There is also u-Law, but u-Law isn't really a minifloat format.
These formats can also be used for pixel data; though FP8U often made
more sense for RGBA values (generally, negative RGBA isn't really a
thing).
However, pixel values may go outside unit range, so A-Law doesn't work
for HDR pixel data. The use of FP8 or FP8S works, but gives lower
quality than FP8U. Here, FP8U gives slightly better quality than RGB555
over LDR range, whereas FP8 or FP8S is slightly worse for bright values
(1 bit less accuracy between 0.5 and 1.0).
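The quality comparison above can be checked by counting how many distinct levels each format has in the range 0.5 to 1.0. A small Python sketch, assuming hidden-bit decoding with Bias=7 for both FP8U (E4.F4) and FP8 (S.E4.F3, positive half only) and RGB555 levels of k/31; special codes are ignored since they fall outside this range:

```python
def levels_in_range(values, lo=0.5, hi=1.0):
    """Count distinct values v with lo <= v < hi."""
    return len({v for v in values if lo <= v < hi})

fp8u = [(1 + f / 16) * 2.0 ** (e - 7) for e in range(16) for f in range(16)]
fp8 = [(1 + f / 8) * 2.0 ** (e - 7) for e in range(16) for f in range(8)]
rgb555 = [k / 31 for k in range(32)]

fp8u_levels = levels_in_range(fp8u)      # 16
fp8_levels = levels_in_range(fp8)        # 8
rgb555_levels = levels_in_range(rgb555)  # 15
```

So FP8U has marginally more levels than a 5-bit channel in the bright half of the range, while FP8 and FP8S have half as many as FP8U, i.e. one bit less.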
For normal bitmap graphics, I am mostly using RGB555 at present though.
There isn't yet a fast conversion path between RGB555 and floating-point formats, but, say:
RGB5UPCK64 //Unpack RGB555 to 4x WORD
PCVTUW2H //Packed Word to Half (1.0 .. 2.0)
PADD.H //To adjust DC bias to 0.0 .. 1.0.
? PSTCM8UH //to FP8U (typical option for HDR RGBA pixel data)
? PSTCF8H //to FP8 (newly added)
But, the crufty Word<->Half SIMD conversions exist mostly because it
would have been more expensive to support "better" SIMD converters (the
DC bias offset allowed doing the conversions via repacking the bits;
whereas unit-range conversions would have required the more expensive
path of adding the format conversion logic to the SIMD FADD units).
Note that most of the SIMD format converters exist as applied use of bit-twiddling (and generally no rounding or similar, as rounding would
add considerable amounts of cost here...).
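A Python sketch of the bit-repacking idea behind the Word to Half conversion. It assumes the top 10 bits of the unsigned 16-bit word are dropped into the Binary16 mantissa under a fixed exponent of 2^0 (giving 1.0 .. 2.0), with truncation rather than rounding; which bits the actual instruction uses is not stated in the post:

```python
import struct

def word_to_half_1_2(word):
    """Repack an unsigned 16-bit word as a Binary16 value in [1.0, 2.0)."""
    bits = 0x3C00 | ((word & 0xFFFF) >> 6)   # exponent for 1.0, top 10 bits as mantissa
    return struct.unpack("<e", struct.pack("<H", bits))[0]

def word_to_unit(word):
    """Apply the DC-bias adjustment (the PADD.H step) to reach [0.0, 1.0)."""
    return word_to_half_1_2(word) - 1.0
```

No normalization or exponent computation is needed, which is what makes this cheaper than a true unit-range converter.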
Though, cases needing fast conversion of pixel data between RGB555 and floating-point forms have been uncommon (most pixel math starting from
RGB555 tends to remain on the integer side of things).
If TKRA-GL were using HDR, most likely option here is:
If HDR is used;
The program binds an LDR texture.
The GL backend can internally quietly generate an HDR version of the
texture and use this instead; as opposed to trying to dynamically
transform RGB555 or UTX2 into HDR during texel load.
Though, another option would be to base it on the GL context:
If the OpenGL framebuffer is HDR;
All uploaded textures get converted to HDR formats as well.
So, RGB555/RGBA8888/... -> FP8U, and DXT1/DXT5/BC6H/BC7 -> UTX3.
....
For things like streaming PCM audio to the audio hardware, say:
2x PSHUF.W+MOVxxD //Shuffle from Stereo to 1x/2x Mono
PCVTSW2H //Packed Word to Half (2.0 .. 4.0)
PADD.H //To adjust DC bias to -1.0 .. 1.0.
PCVTH2AL //Convert Half to A-Law
Where, the programs and API use 16-bit stereo PCM, and my audio hardware generally uses separate Left/Right A-Law for the loop buffers.
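The signed Word to Half step can be sketched the same way in Python. This assumes the sign bit of the PCM sample is flipped to make it unsigned, and the top 10 bits are then placed in the mantissa under a fixed exponent of 2^1 (giving 2.0 .. 4.0); subtracting 3.0 then recentres it. The exact bit selection is an assumption, and the final Half to A-Law step is not shown:

```python
import struct

def pcm16_to_half_2_4(sample):
    """Repack a signed 16-bit sample as a Binary16 value in [2.0, 4.0)."""
    unsigned = (sample & 0xFFFF) ^ 0x8000    # offset-binary form of the sample
    bits = 0x4000 | (unsigned >> 6)          # exponent for 2.0, top 10 bits as mantissa
    return struct.unpack("<e", struct.pack("<H", bits))[0]

def pcm16_to_unit(sample):
    """Apply the DC-bias adjustment (the PADD.H step) to reach [-1.0, 1.0)."""
    return pcm16_to_half_2_4(sample) - 3.0
```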
A-Law was used mostly because:
8-bit linear PCM tends to sound like garbage;
16-bit PCM needs twice the Block-RAM (relative to sample rate);
A-Law quality is closer to 16-bit PCM, at the same size as 8-bit PCM.
So, I ended up designing the audio hardware to use A-Law.
But, on a 50MHz CPU, the CPU is slow enough that one has reason to care
about how many clock-cycles are used by the audio encoding (doing it in software was slow; so ended up doing it via SIMD).
Generally, most audio mixing code has tended to use 16-bit PCM, as using Binary16 or Binary32 for front-end audio mixing is a bit of a novelty. Wouldn't be that hard to support in theory, would just need to be
expressed via the WAVEFORMATEX structure (and, assuming the backend code
was added to support raw floating-point PCM).
The API does also support 8-bit PCM, but this is the worst case quality
wise (combining both the initial poorness of 8-bit PCM with some
additional information loss in the conversion to A-Law).
Though, 8-bit PCM is still acceptable for use in sound-effects and
similar. When mixed into a PCM buffer, typically amplitude and DC bias
are all over the place.
Had (early on) experimented with possible "small block" audio
compression (and an ADPCM variant) for the audio hardware, but couldn't really get acceptable results. A-Law seemed to be the most reasonable compromise (in terms of encoding cost and "didn't sound like crap").
While ADPCM can give OK quality relative to size, it was a rather poor
fit for the use-case (it is much better as an "offline" audio storage format).
These 8-bit floating-point formats are generally too poor in terms of
quality to be used for direct computation in SIMD operations.
Some stuff online implies that FP8 could be used as an intermediate computation format in neural nets, but my own past experiments in these
areas implied that FP8 was insufficient (for good results, one seemingly needs around 7 or 8 bits for the mantissa).
Granted, this was mostly with trying to use NN's for image processing
tasks (which likely have higher accuracy requirements than, say, LLMs).
However, FP8 can work OK for weights. Some experiments had used A-Law,
but I can note that A-Law requires adding an extra scaling step before
adding the bias and invoking an activation function (this could be
avoided with FP8).
For image-filtering NNs, seems to be better to work primarily using
Binary16 and ReLU activation or similar.
Though, the "approximate ssqrt" can work OK (where approximate ssqrt is roughly comparable to "tanh", but cheaper to calculate). The
"approximate" part being that, by usual definition, one can leave off
the Newton-Raphson iteration stages.
Well, in a similar way to how, in graphics processing, it can sometimes
be useful to redefine Binary16 divide "A/B" as roughly "A*(0x7800-B)"
(if the speed of the divide matters more than the accuracy of the
result).
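Both approximations can be emulated in Python on Binary16 bit patterns. The reciprocal follows the "0x7800-B" form quoted above. The square-root initial guess (halve the bit pattern, then add 0x1E00) is a common trick assumed here for illustration; the post only says the Newton-Raphson stages are left off. Names are illustrative, and inputs are assumed to be normal, finite values:

```python
import struct

def half_bits(x):
    """Binary16 bit pattern of a Python float."""
    return struct.unpack("<H", struct.pack("<e", x))[0]

def bits_half(bits):
    """Python float for a Binary16 bit pattern."""
    return struct.unpack("<e", struct.pack("<H", bits & 0xFFFF))[0]

def approx_div(a, b):
    """A/B as A*(0x7800-B); assumes B is positive and normal."""
    return a * bits_half(0x7800 - half_bits(b))

def approx_ssqrt(x):
    """Signed sqrt from a bit-level initial guess, with no refinement step."""
    if x == 0:
        return 0.0
    root = bits_half((half_bits(abs(x)) >> 1) + 0x1E00)
    return root if x > 0 else -root
```

Both are exact for powers of two (even powers, in the square-root case) and drift in between: `approx_div(1.0, 1.5)` gives 0.75 rather than 0.667, and `approx_ssqrt(2.0)` gives 1.5 rather than 1.414.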
Though, generally makes sense to train the net with the same range and precision intended to run it (so, if it is going to be run as Binary16
with approximate operators, it also needs to be trained using Binary16
and approximate operators).
Though, moderately annoying for "normal C on a desktop PC", as both
Binary16 and FP8 are absent and will need to be faked in software.
Ended up going with FP8 for a "packed multiply expanding" instruction:
PMUL.F8H Rs, Rt, Rn
Where, each FP8 in Rs and Rt is multiplied, and the result expands to a Binary16 element in Rn.
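A software emulation of the described behavior, as a hedged Python sketch. It assumes 4 FP8 lanes in the low 32 bits of each source register expanding to 4 Binary16 lanes in a 64-bit result, treats only the all-zero code specially, and saturates out-of-range products to infinity. The lane count, lane order, and special-case handling are all assumptions:

```python
import struct

def fp8_to_float(code):
    """S.E4.F3 (Bias=7), hidden bit, all-zero exponent/mantissa decodes as 0."""
    sign = -1.0 if code & 0x80 else 1.0
    exp, frac = (code >> 3) & 0xF, code & 0x7
    if exp == 0 and frac == 0:
        return sign * 0.0
    return sign * (1.0 + frac / 8.0) * 2.0 ** (exp - 7)

def float_to_half_bits(x):
    """Binary16 bit pattern of x, saturating to +/-Inf on overflow."""
    try:
        return struct.unpack("<H", struct.pack("<e", x))[0]
    except OverflowError:
        return 0xFC00 if x < 0 else 0x7C00

def pmul_f8h(rs, rt, lanes=4):
    """Multiply FP8 lanes of rs and rt; each product fills a Binary16 lane of rn."""
    rn = 0
    for i in range(lanes):
        a = fp8_to_float((rs >> (8 * i)) & 0xFF)
        b = fp8_to_float((rt >> (8 * i)) & 0xFF)
        rn |= float_to_half_bits(a * b) << (16 * i)
    return rn
```

Since each operand has a 4-bit significand (hidden bit included), every in-range product has at most 8 significant bits and fits exactly in Binary16's 11-bit significand, so no rounding is needed.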
Ended up not going with FMAC, as it is likely the cost and latency would
have been a bit higher than I would like (and possibly higher than the "inline shuffle" experiment).
The "PMUL.F8H" instruction was added with a 2-cycle latency, and seems
to have a moderately low cost (no obvious impact on overall LUT costs). However, its logic is still complicated enough that I wouldn't want to
try adding it as a 1-cycle operation.
As one merit of using FP8, the 3-bit mantissa is small enough that the
pair of mantissas can directly use LUT6 lookups (and most of the cost is likely along the exponent path).
But, I don't know if this would have much use beyond potentially being
useful for neural-net code.
But, any thoughts?...
Binary16 is useful for graphics and audio processing.
The 8-bit formats get a bit more niche; main use-cases mostly to save
memory.
On Wed, 31 Jul 2024 23:31:35 +0000, BGB wrote:
No need to elaborate on the use-cases for Binary32 and Binary64, wide
and varied.
There is a growing clamor for 128-bit FP, too.
IME, typically OpenGL HDR framebuffers used 4x Binary16.
MitchAlsup1 wrote:
On Wed, 31 Jul 2024 23:31:35 +0000, BGB wrote:
No need to elaborate on the use-cases for Binary32 and Binary64, wide
and varied.
There is a growing clamor for 128-bit FP, too.
Looking in a PDP-11 handbook from 1981, for the F (32-bit) and D (64-bit)
formats there was a significant difference in operation speed for MUL and DIV.
But since the 8087, people have got used to no significant difference
(likely due to its internal use of FP80 for everything).
With FP128 will there again be a significant difference in speed to
FP64 or FP32 (including transcendentals)? Seems there would be because
not every HW implementation is going to implement a full width multiplier.
There are shadows, but given there are not any obvious gradients in
the shadows; they could likely have been pulled off using the
stencil buffer.
Slightly older, same company: https://www.youtube.com/watch?v=MfT2MaaukrQ&list=PLV4Ztn9euy7TJJrTgY5jgswEo_GY0ZHqj
The show is ReBoot...
On 8/1/2024 6:09 PM, Lawrence D'Oliveiro wrote:
On Thu, 1 Aug 2024 04:49:50 -0500, BGB wrote:
IME, typically OpenGL HDR framebuffers used 4x Binary16.
OpenGL is only for on-screen stuff.
You can use it for offline rendering as well ...
With FP128 will there again be a significant difference in speed to
FP64 or FP32 (including transcendentals)? Seems there would be because not every HW implementation is going to implement a full width multiplier.
For batch rendering, it doesn't really need to be efficient though.
General strategy would be to setup a context, render to the context, and
then fetch the rendered image using glReadPixels or similar.
The "glReadPixels()" function is not exactly new...
Taking a screen shot? glReadPixels is okay, right?
Around the time of OpenGL 3.0, they added FBO's (FrameBuffer Objects),
which allowed rendering directly into a texture without needing to use glReadPixels followed by glTexImage2D.
On 8/2/2024 8:35 PM, Lawrence D'Oliveiro wrote:
On Fri, 2 Aug 2024 17:18:17 -0700, Chris M. Thomasson wrote:
Taking a screen shot? glReadPixels is okay, right?
That means grabbing the whole screen, regardless of what else it might
be showing, assuming your window is not occluded, and it further
assumes you have a screen to grab from. This kind of precludes running
the renderer as a batch, background process.
The glReadPixels call doesn't grab an image from the OS desktop, but
rather from one of the internal framebuffers associated with the OpenGL context.
On 8/2/2024 7:07 PM, Lawrence D'Oliveiro wrote:
On Fri, 2 Aug 2024 03:56:35 -0500, BGB wrote:
You don't do raytracing in OpenGL...
For batch rendering, it doesn't really need to be efficient though.
Of course it does. Ray-traced renderers can compute hundreds or
thousands of rays per pixel, taking anywhere from minutes to hours per
frame. This is how you produce those cinema-quality 4K (or even 8K)
frames. Anything that can shave a little bit off the time for a single
ray computation can very quickly add up.
That is not the point of using it.
Similarly, not everything needs raytracing.
For many uses, rasterization is fine.
For a lot of shows, you don't even really need 3D.
One might instead go the other direction, and use a 3D renderer with a cel-shading filter to try to make it fit better with the anime
characters ...
On Wed, 31 Jul 2024 18:31:35 -0500, BGB wrote:
Binary16 is useful for graphics and audio processing.
The common format for CG work is OpenEXR, and that allows for 32-bit and
even 64-bit floats, per pixel component. So for example R-G-B-Alpha is 4 components.
The 8-bit formats get a bit more niche; main use-cases mostly to save
memory.
Heavily used in AI work.
On Wed, 31 Jul 2024 23:31:35 +0000, BGB wrote:
So, say, we have common formats:
Binary64, S.E11.F52, Common Use
Binary32, S.E8.F23, Common Use
Binary16, S.E5.F10, Less Common Use
But, things get funky below this:
A-Law: S.E3.F4 (Bias=8)
FP8: S.E4.F3 (Bias=7) (E4M3 in NVIDIA terms)
FP8U: E4.F4 (Bias=7)
FP8S: E4.F3.S (Bias=7)
Semi-absent in my case:
BFloat16: S.E8.F7
Can be faked in software in my case using Shuffle ops.
NVIDIA E5M2 (S.E5.F2)
Could be faked using RGBA32 pack/unpack ops.
So, you have identified the problem:: 8-bits contains insufficient
exponent and fraction widths to be considered standard format.
Thus, in order to utilize 8-bit FP one needs several incarnations.
This just points back at the problem:: FP needs at least 10 bits.