• Re: Misc: Applications of small floating point formats.

    From Lawrence D'Oliveiro@21:1/5 to Terje Mathisen on Sat Aug 3 21:09:43 2024
    On Sat, 3 Aug 2024 11:40:23 +0200, Terje Mathisen wrote:

    MitchAlsup1 wrote:

    So, you have identified the problem:: 8-bits contains insufficient
    exponent and fraction widths to be considered standard format. Thus, in
    order to utilize 8-bit FP one needs several incarnations.
    This just points back at the problem:: FP needs at least 10 bits.

    I agree that fp10 is probably the shortest sane/useful version, but
    1:3:4 does in fact contain enough exponent and mantissa bits to be
    considered an ieee754 format.
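
A minimal sketch of what "1:3:4 as an IEEE 754-style format" could look like, assuming the usual IEEE pattern (bias of 2^(w-1)-1, so 3 here, with subnormals, Inf and NaN). The function name and defaults are illustrative and not from the thread.

```python
def decode_minifloat(code, exp_bits=3, frac_bits=4):
    """Decode a sign:exponent:fraction minifloat using IEEE 754-style rules."""
    bias = (1 << (exp_bits - 1)) - 1              # 3 for a 3-bit exponent
    sign = -1.0 if (code >> (exp_bits + frac_bits)) & 1 else 1.0
    exp = (code >> frac_bits) & ((1 << exp_bits) - 1)
    frac = code & ((1 << frac_bits) - 1)
    if exp == (1 << exp_bits) - 1:                # all-ones exponent: Inf or NaN
        return sign * float("inf") if frac == 0 else float("nan")
    if exp == 0:                                  # subnormal: no implicit leading 1
        return sign * (frac / (1 << frac_bits)) * 2.0 ** (1 - bias)
    return sign * (1 + frac / (1 << frac_bits)) * 2.0 ** (exp - bias)

print(decode_minifloat(0x30))   # exponent field equals the bias, fraction 0
print(decode_minifloat(0x6F))   # largest finite value under these rules
print(decode_minifloat(0x01))   # smallest subnormal
```

Under these assumed rules the finite range is small, which is the trade-off being argued about: the all-ones exponent is spent on Inf/NaN, leaving six usable exponent values.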

    The AI folks are quite happy with 8-bit floats for many applications. In
    fact, they prefer more exponent bits and fewer in the mantissa.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Lawrence D'Oliveiro@21:1/5 to BGB on Sat Aug 3 21:20:25 2024
    On Sat, 3 Aug 2024 03:01:16 -0500, BGB wrote:

    On 8/3/2024 12:32 AM, Lawrence D'Oliveiro wrote:

    But, what I am saying is, a lot of stuff doesn't need raytracing.

    Like I said, there are non-raytraced renderers inspired by video-game
    engines. They don’t use OpenGL. And it does take extra tricks (e.g.
    things called “light probes”) to get realistic-looking lighting with
    them.

    Movies and similar often use raytracing, but a lot of stuff doesn't need
    it. One can often tell visually what is or is not raytracing.

    In the hands of skilled practitioners, it’s very likely you can’t. You probably can’t even tell which scenes, or parts of scenes, are CG and
    which are actual physical sets.

    Also, sometimes raytracing doesn't work out ideal either:

    https://m.media-amazon.com/images/M/MV5BNmZhZGU2Y2QtYjc2NC00MzE5LWE3OGItMTFmOGU4NzVjM2YyXkEyXkFqcGdeQXRyYW5zY29kZS13b3JrZmxvdw@@._V1_.jpg

    https://m.media-amazon.com/images/S/pv-target-images/ccbc22354235eb619e2ad63c56a7c89bdee327fb1e7a2cf0b965c91b2e5336cc.jpg

    The usual ray-tracing artifacts are called “fireflies” (random bright
    dots in the image). I don’t see them there. Are you complaining about
    the images being deliberately grungy?

    There is now RTX, which can do hardware raytracing.

    RTX is a joke. It does maybe one or two rays per pixel, and then
    covers up the artifacts with heavy use of denoising.

    It takes a bit more than cel-shading to get a good 2D look with 3D
    software.

    <https://www.youtube.com/watch?v=pKmSdY56VtY>

    Possibly, but this is not typical from what I have seen.

    Even more “not typical”:
    <https://www.youtube.com/watch?v=hzqD4xcbEuE>. Check out that Dédouze
    guy; his work is just amazing.

  • From Lawrence D'Oliveiro@21:1/5 to Chris M. Thomasson on Sun Aug 4 04:12:40 2024
    On Sat, 3 Aug 2024 14:33:01 -0700, Chris M. Thomasson wrote:

    I think compute shaders can be used for some raytracing...

    And compute shaders were the origin of the idea that became OpenCL/SYCL,
    getting rid of the graphics-drawing part and just keeping the
    SIMD-oriented computing part.

  • From Lawrence D'Oliveiro@21:1/5 to Chris M. Thomasson on Sun Aug 4 04:14:13 2024
    On Sat, 3 Aug 2024 14:38:11 -0700, Chris M. Thomasson wrote:

    On 8/3/2024 2:20 PM, Lawrence D'Oliveiro wrote:

    On Sat, 3 Aug 2024 03:01:16 -0500, BGB wrote:

    On 8/3/2024 12:32 AM, Lawrence D'Oliveiro wrote:

    But, what I am saying is, a lot of stuff doesn't need raytracing.

    Like I said, there are non-raytraced renderers inspired by video-game
    engines. They don’t use OpenGL.

    They are becoming few and far between wrt the big titles, so to speak.

    No they aren’t. A well-known example is the “Eevee” renderer built into Blender (alongside the ray-traced “Cycles” renderer).

    DirectX 12 and Vulkan.

    Those are strictly for on-screen use, like OpenGL before them.

  • From Lawrence D'Oliveiro@21:1/5 to BGB on Mon Aug 5 01:20:22 2024
    On Sun, 4 Aug 2024 16:50:50 -0500, BGB wrote:

    For mainstream games, there has been a move away from both OpenGL and Direct3D towards Vulkan.

    Arguably, in some ways Vulkan is "better" for high-end games on modern
    GPUs, but is a lot harder to use (the amount of work needed to get even
    basic rendering working is very high).

    Similarly, some targets, such as Android and Raspbian, were using GLES2 rather than OpenGL proper.

    There is a definite line of evolution OpenGL → OpenGL ES → Vulkan.

    The original OpenGL had a fixed-function pipeline with a fixed number of
    lights and material characteristics and so on.

    Then later “shaders” were introduced, written in a special language
    (GLSL), so that you could define your lighting and materials how you
    liked. (I think the idea of a shader language originated with Pixar’s RenderMan.)

    Then OpenGL ES got rid of the old fixed-function pipeline, and required
    you to use the shader functionality. Interesting that GL ES 2.0 (and
    later) was not backward-compatible with GL ES 1.x in this regard.

    And of course Vulkan takes the whole idea to its logical conclusion.

    On the downside, it makes things harder for the newbie wanting to learn
    about all this stuff. Luckily, the old OpenGL APIs haven’t completely gone away (yet), so you can still start your learning on the fixed-function pipeline, then add some shaders, then once you are proficient in those,
    you can drop the fixed-function training wheels and take flight.

    Contrast, OpenGL 1.x has a lower barrier to entry; and makes some sense
    as a more general purpose graphics API (can also be used for GUI and
    other things).

    Blender does its entire GUI with OpenGL, and the minimum version it
    requires is 4.3 nowadays.

    One can argue though that OpenGL is arguably a heavyweight option for
    general GUI rendering. An intermediate option could be an API more
    focused on 2D UI drawing tasks ...

    That’s how things used to be done, back in the 1990s or so. Then it was realized that the graphics card vendors really only needed to worry about
    3D acceleration, because the 3D APIs worked perfectly well for 2D work. I
    think it was Apple that pioneered this idea in OS X, though it was very
    quickly adopted by other platforms.

    Nothing is stopping them from being used for offline rendering.

    They are not really best thought of as “rendering” APIs, they are just “drawing” APIs.

  • From George Neuner@21:1/5 to ldo@nz.invalid on Mon Aug 5 12:24:04 2024
    On Sat, 3 Aug 2024 21:09:43 -0000 (UTC), Lawrence D'Oliveiro
    <ldo@nz.invalid> wrote:

    On Sat, 3 Aug 2024 11:40:23 +0200, Terje Mathisen wrote:

    MitchAlsup1 wrote:

    So, you have identified the problem:: 8-bits contains insufficient
    exponent and fraction widths to be considered standard format. Thus, in
    order to utilize 8-bit FP one needs several incarnations.
    This just points back at the problem:: FP needs at least 10 bits.

    I agree that fp10 is probably the shortest sane/useful version, but
    1:3:4 does in fact contain enough exponent and mantissa bits to be
    considered an ieee754 format.

    The AI folks are quite happy with 8-bit floats for many applications. In
    fact, they prefer more exponent bits and fewer in the mantissa.

    Insufficient precision is one of the many reasons that ANNs are prone
    to hallucinate.

  • From George Neuner@21:1/5 to bohannonindustriesllc@gmail.com on Tue Aug 6 11:51:57 2024
    On Mon, 5 Aug 2024 17:35:22 -0500, BGB-Alt
    <bohannonindustriesllc@gmail.com> wrote:

    On 8/5/2024 11:24 AM, George Neuner wrote:
    On Sat, 3 Aug 2024 21:09:43 -0000 (UTC), Lawrence D'Oliveiro
    <ldo@nz.invalid> wrote:

    On Sat, 3 Aug 2024 11:40:23 +0200, Terje Mathisen wrote:

    MitchAlsup1 wrote:

    So, you have identified the problem:: 8-bits contains insufficient
    exponent and fraction widths to be considered standard format. Thus, in
    order to utilize 8-bit FP one needs several incarnations.
    This just points back at the problem:: FP needs at least 10 bits.

    I agree that fp10 is probably the shortest sane/useful version, but
    1:3:4 does in fact contain enough exponent and mantissa bits to be
    considered an ieee754 format.

    The AI folks are quite happy with 8-bit floats for many applications. In
    fact, they prefer more exponent bits and fewer in the mantissa.

    Insufficient precision is one of the many reasons that ANNs are prone
    to hallucinate.

    Also likely depends on the type of NN as well.

    As noted, for some of the stuff I had tried doing, there was a
    noticeable detrimental effect with fewer than around 8 to 10 bits in the
    mantissa for the accumulator. Weights and biases could use fewer bits
    (as could the inputs/outputs between layers), but not so much the
    accumulator.
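
A small illustration of why the accumulator is the sensitive part, assuming Binary16 stands in for a narrow accumulator (this is not the setup used in the experiments described above). Python's `struct` module rounds to Binary16 with round-to-nearest-even; the helper names are made up for this sketch.

```python
import struct

def to_half(x):
    """Round a Python float to the nearest Binary16 value."""
    return struct.unpack("<e", struct.pack("<e", x))[0]

def accumulate_half(values):
    """Sum values, rounding the running total to Binary16 after every add."""
    acc = 0.0
    for v in values:
        acc = to_half(acc + v)
    return acc

ones = [1.0] * 4096
print(accumulate_half(ones))   # narrow accumulator: stops growing partway through
print(sum(ones))               # double-precision accumulator for comparison
```

Once the running total is large enough that the spacing between adjacent Binary16 values exceeds the addend, further adds are rounded away, however precise the individual inputs are.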

    Whereas, large exponent ranges tended to be much less of a factor
    (though with training via genetic algos, it was needed to detect and
    handle some cases where values went outside of a "reasonable" exponent
    range, such as E+14 or so).

    You can use more precision in the mantissa, or more range in the
    exponent ... generally you don't need both ;-) ... but in either you
    do need *enough* bits.

    The problem with 8-bit reals is they have neither enough precision nor
    enough range - they too easily can be saturated during training, and
    even if the values are (re)normalized afterward, the damage already
    has been done.

    16-bit values seem to be enough for many uses. It does not matter much
    how the bits are split mantissa vs exponent ... what matters is having
    enough relevant (to the algorithm) bits to avoid values being
    saturated during training.


    One other thing I had found was that it was possible to DC-bias the
    inputs (before multiplying against the weight), but the gains were small.


    So, say, for each input:
    (In+InBias)*Weight
    Then, output:
    OutFunc(Accum*OutGain+OutBias)

    Though, OutGain is also debatable (as is InBias), but both seem to help
    slightly. Theoretically, they are unnecessary as far as the math goes
    (and what gains they offer are more likely a product of numerical
    precision and the training process).
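
The neuron described above can be sketched as follows, assuming one InBias per input and a single OutGain/OutBias per neuron (the post does not say how these are shared). All names and the sample values are illustrative.

```python
def neuron(inputs, weights, in_bias, out_gain, out_bias, out_func):
    """Evaluate OutFunc(Accum*OutGain + OutBias) with a DC bias on each input."""
    accum = 0.0
    for x, w, b in zip(inputs, weights, in_bias):
        accum += (x + b) * w              # (In+InBias)*Weight
    return out_func(accum * out_gain + out_bias)

def relu(x):
    return x if x > 0 else 0.0

y = neuron([0.5, -1.0], [2.0, 0.25], [0.0, 1.0],
           out_gain=1.0, out_bias=-0.5, out_func=relu)
print(y)
```

In exact arithmetic `(x + b) * w` equals `x*w + b*w`, so the `b*w` terms could be folded into OutBias and OutGain into the weights. That is why the extra terms are "unnecessary as far as the math goes"; any benefit would have to come from rounding or from how training explores the parameters.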

    Will note that for transfer functions, I have tended to use one of:
    SQRT: (x>0)?sqrt(x):0
    ReLU: (x>0)?x:0
    SSQRT: (x>0)?sqrt(x):-sqrt(-x)
    Heaviside: (x>0)?1:0
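
The four transfer functions listed above, written out as a direct transcription of the C-style expressions. The Python function names are chosen for this sketch.

```python
import math

def sqrt_pos(x):
    """SQRT: sqrt(x) for positive x, otherwise 0."""
    return math.sqrt(x) if x > 0 else 0.0

def relu(x):
    """ReLU: x for positive x, otherwise 0."""
    return x if x > 0 else 0.0

def ssqrt(x):
    """SSQRT: signed square root, odd-symmetric around 0."""
    return math.sqrt(x) if x > 0 else -math.sqrt(-x)

def heaviside(x):
    """Heaviside step: 1 for positive x, otherwise 0."""
    return 1.0 if x > 0 else 0.0

for x in (-4.0, -1.0, 0.0, 1.0, 4.0):
    print(x, sqrt_pos(x), relu(x), ssqrt(x), heaviside(x), math.tanh(x))
```

Unlike tanh, SSQRT does not saturate; it only compresses large magnitudes.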

    While tanh is traditionally popular, it had little obvious advantage
    over SSQRT and lacks a cheap approximation (and numerical accuracy
    doesn't really matter here).

    ...


  • From MitchAlsup1@21:1/5 to BGB on Thu Aug 1 00:31:51 2024
    On Wed, 31 Jul 2024 23:31:35 +0000, BGB wrote:

    So, say, we have common formats:
    Binary64, S.E11.F52, Common Use
    Binary32, S.E8.F23, Common Use
    Binary16, S.E5.F10, Less Common Use

    But, things get funky below this:
    A-Law: S.E3.F4 (Bias=8)
    FP8: S.E4.F3 (Bias=7) (E4M3 in NVIDIA terms)
    FP8U: E4.F4 (Bias=7)
    FP8S: E4.F3.S (Bias=7)


    Semi-absent in my case:
    BFloat16: S.E8.F7
    Can be faked in software in my case using Shuffle ops.
    NVIDIA E5M2 (S.E5.F2)
    Could be faked using RGBA32 pack/unpack ops.

    So, you have identified the problem:: 8-bits contains insufficient
    exponent and fraction widths to be considered standard format.
    Thus, in order to utilize 8-bit FP one needs several incarnations.
    This just points back at the problem:: FP needs at least 10 bits.


    No immediate plans to add these later cases as (usually) I have a need
    for more precision than more exponent range. The main seeming merit of
    these formats being that they are truncated forms of the wider formats.


    No need to elaborate on the use-cases for Binary32 and Binary64, wide
    and varied.

    There is a growing clamor for 128-bit FP, too.


    Binary16 is useful for graphics
    probably,
    and audio processing.

    Insufficient data width as high quality Audio has gone to 24-bits
    (120 dBA S/N).

    You can call MP3 and other "phone" formats Audio, but please restrict
    yourself from using the term High Quality when doing so.

    Seemingly IEEE specifies it mostly for storage and not for computation, but for these
    cases it is good enough for computation as well.

    Binary16 is mostly sufficient for 3D model geometry, and for small 3D
    scenes, but not really for 3D computations or larger scenes (using it
    for transform or projection matrices or matrix multiply does not give acceptable results).

    Does work well for fast sin/cos lookup tables (if supported natively),
    say, because the error of storing an angle as 1/256 of a circle is
    larger than the error introduced by the 10 bit mantissa.
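
A quick check of the lookup-table claim, assuming a 256-entry sine table whose entries are rounded to Binary16 via Python's `struct` module. The variable names are illustrative.

```python
import math
import struct

def to_half(x):
    """Round a Python float to the nearest Binary16 value."""
    return struct.unpack("<e", struct.pack("<e", x))[0]

N = 256
exact = [math.sin(2 * math.pi * i / N) for i in range(N)]
table = [to_half(v) for v in exact]

# Worst error from storing a table entry in Binary16.
storage_err = max(abs(t - e) for t, e in zip(table, exact))

# Worst error from snapping an angle to the nearest 1/256 of a circle:
# half a step of angle, taken where sine changes fastest (around 0).
angle_err = math.sin(math.pi / N)

print(storage_err, angle_err)
```

The angle quantization dominates, so widening the table entries beyond Binary16 would not help unless the angle resolution were increased as well.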

    I had also used it as the computational format in a lot of my neural-net experiments.

    I have seen NNs use compressed FP formats where 0 uses 1 bit and
    1.0 uses but 2 bits. ...

    The 8-bit formats get a bit more niche; main use-cases mostly to save
    memory.

    Sometimes power also.

    FP8S originally existed because it was cheaper to encode/decode alongside
    FP8U, vs traditional FP8. Originally, FP8S replaced FP8, but now FP8 has
    been re-added. I couldn't simply entirely replace FP8S back with FP8,
    partly as it seems my existing binaries depend on FP8S in a few places,
    and so simply replacing it would have broken my existing binaries.

    So, options were to either add some separate ops for FP8, or just live
    with using my wonky/non-standard FP8S format (or break my existing
    binaries). Ended up deciding to re-add FP8.

    Or don't do it that way.

    FP8 is used apparently by NVIDIA GPUs, also apparently by PyTorch and a
    few other things. The variant used in my case is seemingly fairly
    similar to that used by NVIDIA and PyTorch.

    If you are going to do an F8 make it compatible with OpenGL.

    Unlike the minifloat format described on Wikipedia (which had defined it
    as following IEEE 754 rules), it differs from IEEE rules in the handling
    of large and small values. No separate Inf/NaN range, rather the largest value serves as an implicit combined Inf/NaN, with the smallest value understood as 0.
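
One possible reading of the FP8 rules described above, as a sketch. It assumes that "largest value" means the all-ones code, that only the all-zeros code is zero, and that there are no subnormals; none of this is spelled out in the post. FP8S is handled by moving the sign from the LSB to the MSB first.

```python
def decode_fp8(code):
    """Decode S.E4.F3 (bias 7) with no separate Inf/NaN exponent range."""
    sign = -1.0 if code & 0x80 else 1.0
    exp = (code >> 3) & 0x0F
    frac = code & 0x07
    if exp == 0 and frac == 0:
        return sign * 0.0                 # smallest code is understood as zero
    if exp == 15 and frac == 7:
        return sign * float("inf")        # stand-in for the combined Inf/NaN
    return sign * (1 + frac / 8) * 2.0 ** (exp - 7)

def decode_fp8s(code):
    """Decode E4.F3.S: same fields as FP8, but with the sign in the LSB."""
    return decode_fp8(((code & 1) << 7) | (code >> 1))

print(decode_fp8(0x38), decode_fp8(0x7E), decode_fp8s(0x70))
```

Under these assumptions the all-ones exponent carries ordinary values, which gives one more binade of range than an IEEE-style layout with the same field widths.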

    The main difference here between FP8 and FP8S being the location of the
    sign bit (putting it in the LSB initially allowed avoiding some MUX'ing
    when paired with FP8U).


    The re-added FP8 was instead overlapped with the unpack logic used for
    A-Law (even with the obvious difference...).

    The encoder-side logic for FP8 can be implemented by taking the FP8S
    output and reordering the bits (in an "assign"). Though, doing this on
    the decoder input side would not likely have saved anything (attempts to
    MUX on the input side seemingly tend to result in duplicating any LUTs
    that follows afterwards).

    Though, one could almost argue for combining all 4 cases into shared encoder/decoder modules (well, since at least 3/4 of the formats have
    the mantissa and exponent bits in the same place, FP8 being the odd one
    out; and A-Law being off-by-1 in terms of Bias).

    That combination is well served with a single 10-bit FP format.

    This appears to be similar to what NV and PyTorch used, and also
    overlaps with my handling of A-Law (though, the largest possible value
    of A-Law is understood as ~ 0.969).

    Where, A-Law has slightly higher precision, but is normally limited to
    unit range. Main use-case is in representing audio, but was sometimes
    also used when a small unit-range format was needed and precision wasn't
    a priority.

    For example, with slight fudging, it can be used to store
    unit-quaternions, among other things. It is basically accurate enough to store things like object orientations and 3D camera rotations. Though, generally, it is needed to normalize the quaternion after unpacking it.
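
A sketch of the quaternion use, assuming the S.E3.F4 (bias 8) layout given earlier with exponent 0 treated as a linear segment (that detail is an assumption). The encoder is a brute-force nearest-code search, which is fine for illustration since there are only 256 codes; all function names are hypothetical.

```python
import math

def decode_alaw(code):
    """Decode S.E3.F4 with bias 8; exponent 0 is assumed to be a linear segment."""
    sign = -1.0 if code & 0x80 else 1.0
    exp = (code >> 4) & 0x07
    frac = code & 0x0F
    if exp == 0:
        return sign * (frac / 16) * 2.0 ** (1 - 8)
    return sign * (1 + frac / 16) * 2.0 ** (exp - 8)

_TABLE = [decode_alaw(c) for c in range(256)]

def encode_alaw(x):
    """Pick the code whose decoded value is nearest to x (clamps out-of-range x)."""
    return min(range(256), key=lambda c: abs(_TABLE[c] - x))

def pack_quat(q):
    """Store four quaternion components as four bytes."""
    return bytes(encode_alaw(v) for v in q)

def unpack_quat(data):
    """Decode four bytes and renormalize to unit length."""
    q = [decode_alaw(c) for c in data]
    norm = math.sqrt(sum(v * v for v in q))
    return [v / norm for v in q]

print(decode_alaw(0x7F))                            # largest magnitude
print(unpack_quat(pack_quat([1.0, 0.0, 0.0, 0.0])))
```

The renormalization step is what absorbs the "slight fudging": a component of exactly 1.0 cannot be stored, but after dividing by the length the direction is recovered.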


    Ironically, for A-Law, my implementation and typical use differs from
    how it is usually stored in WAV files, in that in WAV files it is
    generally XOR'ed with 0x55, but this is an easy enough fix when loading
    audio data or similar.
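
The fix-up on load amounts to one XOR per byte. A minimal sketch, with a hypothetical function name:

```python
def toggle_alaw_wav_mask(data: bytes) -> bytes:
    """XOR every A-Law byte with 0x55; applying it twice restores the input."""
    return bytes(b ^ 0x55 for b in data)

wav_bytes = bytes([0xD5, 0x55, 0x2A])
raw = toggle_alaw_wav_mask(wav_bytes)
print(raw.hex(), toggle_alaw_wav_mask(raw).hex())
```

Because XOR is its own inverse, the same routine converts in both directions.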

    There is also u-Law, but u-Law isn't really a minifloat format.



    These formats can also be used for pixel data; though FP8U often made
    more sense for RGBA values (generally, negative RGBA isn't really a
    thing).

    However, pixel values may go outside unit range, so A-Law doesn't work
    for HDR pixel data. The use of FP8 or FP8S works, but gives lower
    quality than FP8U. Here, FP8U gives slightly better quality than RGB555
    over LDR range, whereas FP8 or FP8S is slightly worse for bright values
    (1 bit less accuracy between 0.5 and 1.0).
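
The "1 bit less accuracy between 0.5 and 1.0" remark can be checked by counting representable values in that interval. This sketch uses the field layouts given earlier and, as a simplification, treats every code as a normal number; the helper names are made up.

```python
def fp8u_values():
    """All FP8U values (E4.F4, bias 7, unsigned)."""
    return [(1 + (c & 0x0F) / 16) * 2.0 ** ((c >> 4) - 7) for c in range(256)]

def fp8_values():
    """Non-negative FP8 values (S.E4.F3, bias 7)."""
    return [(1 + (c & 0x07) / 8) * 2.0 ** ((c >> 3) - 7) for c in range(128)]

def rgb555_values():
    """One 5-bit channel mapped linearly onto 0..1."""
    return [k / 31 for k in range(32)]

def count_in(values, lo=0.5, hi=1.0):
    """Count values in the half-open interval [lo, hi)."""
    return sum(1 for v in values if lo <= v < hi)

print(count_in(fp8u_values()), count_in(fp8_values()), count_in(rgb555_values()))
```

Below 0.5 the floating-point formats keep the same number of steps per binade while the linear 5-bit channel has progressively fewer, which is where FP8U gains over RGB555.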


    For normal bitmap graphics, I am mostly using RGB555 at present though.

    There isn't yet a fast conversion path between RGB555 and floating-point formats, but, say:
    RGB5UPCK64 //Unpack RGB555 to 4x WORD
    PCVTUW2H //Packed Word to Half (1.0 .. 2.0)
    PADD.H //To adjust DC bias to 0.0 .. 1.0.
    ? PSTCM8UH //to FP8U (typical option for HDR RGBA pixel data)
    ? PSTCF8H //to FP8 (newly added)
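
A sketch of how a Word-to-Half conversion landing in 1.0 .. 2.0 can be pure bit repacking, followed by the DC-bias adjustment. The exact bit placement used by PCVTUW2H is not given in the post; taking the top 10 bits of the word as the Binary16 fraction is an assumption, as are the function names.

```python
import struct

def half_from_bits(bits):
    """Reinterpret a 16-bit pattern as a Binary16 value."""
    return struct.unpack("<e", struct.pack("<H", bits))[0]

def word_to_half_biased(word):
    """Unsigned 16-bit word -> half in 1.0..2.0: fixed exponent, top 10 bits as fraction."""
    return half_from_bits(0x3C00 | (word >> 6))

def word_to_unit(word):
    """Remove the DC bias so the result lies in 0.0..1.0 (the PADD.H step)."""
    return word_to_half_biased(word) - 1.0

print(word_to_unit(0x0000), word_to_unit(0x8000), word_to_unit(0xFFFF))
```

No rounding or normalization logic is involved: the converter only ORs a constant exponent onto shifted input bits, and the following add does the normalization.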


    But, the crufty Word<->Half SIMD conversions exist mostly because it
    would have been more expensive to support "better" SIMD converters (the
    DC bias offset allowed doing the conversions via repacking the bits;
    whereas unit-range conversions would have required the more expensive
    path of adding the format conversion logic to the SIMD FADD units).

    Note that most of the SIMD format converters exist as applied use of bit-twiddling (and generally no rounding or similar, as rounding would
    add considerable amounts of cost here...).


    Though, cases needing fast conversion of pixel data between RGB555 and floating-point forms have been uncommon (most pixel math starting from
    RGB555 tends to remain on the integer side of things).


    If TKRA-GL were using HDR, most likely option here is:
    If HDR is used;
    The program binds an LDR texture.

    The GL backend can internally quietly generate an HDR version of the
    texture and use this instead; as opposed to trying to dynamically
    transform RGB555 or UTX2 into HDR during texel load.

    Though, another option would be to base it on the GL context:
    If the OpenGL framebuffer is HDR;
    All uploaded textures get converted to HDR formats as well.
    So, RGB555/RGBA8888/... -> FP8U, and DXT1/DXT5/BC6H/BC7 -> UTX3.

    ....



    For things like streaming PCM audio to the audio hardware, say:
    2x PSHUF.W+MOVxxD //Shuffle from Stereo to 1x/2x Mono
    PCVTSW2H //Packed Word to Half (2.0 .. 4.0)
    PADD.H //To adjust DC bias to -1.0 .. 1.0.
    PCVTH2AL //Convert Half to A-Law

    Where, the programs and API use 16-bit stereo PCM, and my audio hardware generally uses separate Left/Right A-Law for the loop buffers.
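
The signed case can be sketched the same way, assuming the conversion flips the sign bit and drops the top 10 bits into the fraction of a half with a fixed exponent, giving 2.0 .. 4.0, after which subtracting 3.0 recenters to -1.0 .. 1.0. The bit placement and function names are assumptions for illustration.

```python
import struct

def half_from_bits(bits):
    """Reinterpret a 16-bit pattern as a Binary16 value."""
    return struct.unpack("<e", struct.pack("<H", bits))[0]

def pcm16_to_half_biased(sample):
    """Signed 16-bit sample -> half in 2.0..4.0 by bit repacking."""
    u = (sample + 0x8000) & 0xFFFF        # flip the sign bit: -32768..32767 -> 0..65535
    return half_from_bits(0x4000 | (u >> 6))

def pcm16_to_unit(sample):
    """Remove the DC bias so the result lies in -1.0..1.0 (the PADD.H step)."""
    return pcm16_to_half_biased(sample) - 3.0

print(pcm16_to_unit(-32768), pcm16_to_unit(0), pcm16_to_unit(32767))
```

The 2.0 .. 4.0 binade spans a width of 2, which matches the width of the -1.0 .. 1.0 target range, so a single add is enough to recenter it.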

    A-Law was used mostly because:
    8-bit linear PCM tends to sound like garbage;

    It sounds more like computer automated speaking to me--Oh Wait--that
    does sound like Garbage:: Sorry !!

    16-bit PCM needs twice the Block-RAM (relative to sample rate);

    16-bit Audio is so 1990.....

    A-Law quality is closer to 16-bit PCM, at the same size as 8-bit PCM.
    So, I ended up designing the audio hardware to use A-Law.
    But, on a 50MHz CPU, the CPU is slow enough that one has reason to care
    about how many clock-cycles are used by the audio encoding (doing it in software was slow; so ended up doing it via SIMD).

    Generally, most audio mixing code has tended to use 16-bit PCM, as using Binary16 or Binary32 for front-end audio mixing is a bit of a novelty. Wouldn't be that hard to support in theory, would just need to be
    expressed via the WAVEFORMATEX structure (and, assuming the backend code
    was added to support raw floating-point PCM).

    The API does also support 8-bit PCM, but this is the worst case quality
    wise (combining both the initial poorness of 8-bit PCM with some
    additional information loss in the conversion to A-Law).
    Though, 8-bit PCM is still acceptable for use in sound-effects and
    similar. When mixed into a PCM buffer, typically amplitude and DC bias
    is all over the place.


    Had (early on) experimented with possible "small block" audio
    compression (and an ADPCM variant) for the audio hardware, but couldn't really get acceptable results. A-Law seemed to be the most reasonable compromise (in terms of encoding cost and "didn't sound like crap").

    Audio is supposed to sound like you were there listening to it live in
    a building designed for its acoustics.....but alas...

    While ADPCM can give OK quality relative to size, it was a rather poor
    fit for the use-case (it is much better as an "offline" audio storage format).



    These 8-bit floating-point formats are generally too poor in terms of
    quality to be used for direct computation in SIMD operations.

    So why support them ?

    Some stuff online implies that FP8 could be used as an intermediate computation format in neural nets, but my own past experiments in these
    areas implied that FP8 was insufficient (for good results, one seemingly needs around 7 or 8 bits for the mantissa).

    What several NN architectures do is to use a 256-bit word and then
    decode it into multiple F8 or F10 or F12 components using a Huffman
    coding scheme. 0 takes 1-bit 1.0 takes 2, leaving lots of bits for
    other mantissas. This was done to save memory BW: not particularly
    size, but raw aggregate BW.

    Granted, this was mostly with trying to use NN's for image processing
    tasks (which likely have higher accuracy requirements than, say, LLMs).

    However, FP8 can work OK for weights. Some experiments had used A-Law,
    but I can note that A-Law requires to add an extra scaling step before
    adding the bias and invoking an activation function (this could be
    avoided with FP8).

    For image-filtering NNs, seems to be better to work primarily using
    Binary16 and ReLU activation or similar.

    Though, the "approximate ssqrt" can work OK (where approximate ssqrt is roughly comparable to "tanh", but cheaper to calculate). The
    "approximate" part being that, by usual definition, one can leave off
    the Newton-Raphson iteration stages.
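
A sketch of an "approximate ssqrt" with no Newton-Raphson refinement. The post does not say how the initial estimate is formed; the shift-and-add on the Binary16 bit pattern below is a common trick and is assumed here. The constant 0x1E00 is half of the bit pattern of 1.0 (0x3C00).

```python
import struct

def half_bits(x):
    """Bit pattern of x rounded to Binary16."""
    return struct.unpack("<H", struct.pack("<e", x))[0]

def half_from_bits(bits):
    """Reinterpret a 16-bit pattern as a Binary16 value."""
    return struct.unpack("<e", struct.pack("<H", bits))[0]

def approx_sqrt_half(x):
    """Rough sqrt of a positive Binary16 value: halve the exponent via a shift."""
    return half_from_bits((half_bits(x) >> 1) + 0x1E00)

def approx_ssqrt_half(x):
    """Signed variant; zero is special-cased because the bit trick maps it to a nonzero value."""
    if x == 0:
        return 0.0
    mag = approx_sqrt_half(abs(x))
    return mag if x > 0 else -mag

for x in (1.0, 2.0, 4.0, -4.0, 16.0):
    print(x, approx_ssqrt_half(x))
```

The estimate is exact at even powers of two and off by several percent in between, which is tolerable for an activation function if the net is trained with the same operator.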

    Well, in a similar way to how, in graphics processing, it can sometimes
    be useful to redefine Binary16 divide "A/B" as roughly "A*(0x7800-B)"
    (if the speed of the divide matters more than the accuracy of the
    result).
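
The "A*(0x7800-B)" idea can be sketched by treating B as its Binary16 bit pattern, subtracting it from 0x7800 as an integer, and reinterpreting the result as a half. This models positive B below 2^15 only; the function names are made up.

```python
import struct

def half_bits(x):
    """Bit pattern of x rounded to Binary16."""
    return struct.unpack("<H", struct.pack("<e", x))[0]

def half_from_bits(bits):
    """Reinterpret a 16-bit pattern as a Binary16 value."""
    return struct.unpack("<e", struct.pack("<H", bits))[0]

def to_half(x):
    """Round a Python float to the nearest Binary16 value."""
    return struct.unpack("<e", struct.pack("<e", x))[0]

def approx_recip_half(b):
    """Rough 1/b: integer-subtract the bit pattern of b from 0x7800."""
    return half_from_bits(0x7800 - half_bits(b))

def approx_div_half(a, b):
    """Rough a/b as a * approx(1/b), rounded back to Binary16."""
    return to_half(a * approx_recip_half(b))

for b in (1.0, 1.5, 2.0, 3.0):
    print(b, approx_recip_half(b), 1.0 / b)
```

The constant 0x7800 is twice the bit pattern of 1.0, so powers of two come out exact and the error peaks in the middle of each binade (1.5 maps to 0.75 rather than 0.667).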

    Though, generally makes sense to train the net with the same range and precision intended to run it (so, if it is going to be run as Binary16
    with approximate operators, it also needs to be trained using Binary16
    and approximate operators).


    Though, moderately annoying for "normal C on a desktop PC", as both
    Binary16 and FP8 are absent and will need to be faked in software.

    Run it as a GPGPU


    Ended up going with FP8 for a "packed multiply expanding" instruction:
    PMUL.F8H Rs, Rt, Rn
    Where, each FP8 in Rs and Rt is multiplied, and the result expands to a Binary16 element in Rn.

    Stuff like this falls out "for free" under VVM.

    Ended up not going with FMAC, as it is likely the cost and latency would
    have been a bit higher than I would like (and possibly higher than the "inline shuffle" experiment).

    The "PMUL.F8H" instruction was added with a 2-cycle latency, and seems
    to have a moderately low cost (no obvious impact on overall LUT costs). However, its logic is still complicated enough that I wouldn't want to
    try adding it as a 1-cycle operation.

    As one merit of using FP8, the 3-bit mantissa is small enough that the
    pair of mantissas can directly use LUT6 lookups (and most of the cost is likely along the exponent path).
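
A software model of the idea, not the actual hardware: two 3-bit fractions form a 6-bit index, so the significand product can come from a 64-entry table while the exponents are added separately. Zero, the Inf/NaN code and overflow of the Binary16 result are not modelled, and the lane count is a parameter because the register width is not stated here.

```python
# 64-entry table indexed by the two 3-bit fractions: (8+fa)*(8+fb), range 64..225.
MANT_LUT = [(8 + (i >> 3)) * (8 + (i & 7)) for i in range(64)]

def fp8_mul(a, b):
    """Multiply two FP8 codes (S.E4.F3, bias 7) and return the exact product."""
    sign = -1.0 if (a ^ b) & 0x80 else 1.0
    ea, eb = (a >> 3) & 0x0F, (b >> 3) & 0x0F
    fa, fb = a & 0x07, b & 0x07
    mant = MANT_LUT[(fa << 3) | fb]                    # mantissa path: one lookup
    return sign * mant / 64 * 2.0 ** (ea + eb - 14)    # exponent path: add, remove both biases

def pmul_f8h(rs, rt, lanes=4):
    """Lane-wise multiply of FP8 bytes packed into two integers."""
    return [fp8_mul((rs >> (8 * i)) & 0xFF, (rt >> (8 * i)) & 0xFF)
            for i in range(lanes)]

print(pmul_f8h(0x3C38, 0x4038, lanes=2))
```

The significand product needs at most 8 bits, so it fits in Binary16's fraction without rounding; the remaining work is the exponent add and range handling, consistent with the remark that most of the cost is on the exponent path.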

    But, don't know if this would have much use out of potentially being
    useful for neural-net code.



    But, any thoughts?...

    Architecture is as much about what gets left out as what gets put in.

  • From Lawrence D'Oliveiro@21:1/5 to BGB on Thu Aug 1 00:44:11 2024
    On Wed, 31 Jul 2024 18:31:35 -0500, BGB wrote:

    Binary16 is useful for graphics and audio processing.

    The common format for CG work is OpenEXR, and that allows for 32-bit and
    even 64-bit floats, per pixel component. So for example R-G-B-Alpha is 4 components.

    The 8-bit formats get a bit more niche; main use-cases mostly to save
    memory.

    Heavily used in AI work.

  • From EricP@21:1/5 to All on Thu Aug 1 17:55:10 2024
    MitchAlsup1 wrote:
    On Wed, 31 Jul 2024 23:31:35 +0000, BGB wrote:

    No need to elaborate on the use-cases for Binary32 and Binary64, wide
    and varied.

    There is a growing clamor for 128-bit FP, too.

    Looking in a PDP-11 handbook from 1981, for the F (32-bit) and D (64-bit)
    formats there was a significant difference in operation speed for MUL and
    DIV. But since the 8087, people have become used to no significant
    difference (likely due to its internal use of FP80 for everything).

    With FP128 will there again be a significant difference in speed to
    FP64 or FP32 (including transcendentals)? Seems there would be because
    not every HW implementation is going to implement a full width multiplier.

  • From Lawrence D'Oliveiro@21:1/5 to BGB on Thu Aug 1 23:09:19 2024
    On Thu, 1 Aug 2024 04:49:50 -0500, BGB wrote:

    IME, typically OpenGL HDR framebuffers used 4x Binary16.

    OpenGL is only for on-screen stuff.

  • From MitchAlsup1@21:1/5 to EricP on Thu Aug 1 23:31:38 2024
    On Thu, 1 Aug 2024 21:55:10 +0000, EricP wrote:

    MitchAlsup1 wrote:
    On Wed, 31 Jul 2024 23:31:35 +0000, BGB wrote:

    No need to elaborate on the use-cases for Binary32 and Binary64, wide
    and varied.

    There is a growing clamor for 128-bit FP, too.

    Looking in a PDP-11 handbook from 1981, for the F (32-bit) and D (64-bit)
    formats there was a significant difference in operation speed for MUL and
    DIV. But since the 8087, people have become used to no significant
    difference (likely due to its internal use of FP80 for everything).

    With FP128 will there again be a significant difference in speed to
    FP64 or FP32 (including transcendentals)? Seems there would be because
    not every HW implementation is going to implement a full width multiplier.

    One would expect FMUL and FDIV to be 4× slower in 128 than in 64, or
    to consume 4 64-bit FMACs simultaneously to not be 4× slower; here
    they would probably still be 1.5× slower. FADD would be 1 cycle
    later.

    One would expect transcendentals to be 5×-6× slower; 2× when using
    4 FMAC units simultaneously.
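
One way to read the 4× figure (an interpretation, not something stated in the post): doubling the operand width quadruples the number of partial products. The sketch below uses 64-bit halves of an integer purely to show the scaling; real FP64 units have 53-bit significand multipliers, so the actual decomposition would differ.

```python
MASK64 = (1 << 64) - 1

def mul128_via_64(a, b):
    """Full product of two 128-bit integers built from four 64x64 partial products."""
    a_hi, a_lo = a >> 64, a & MASK64
    b_hi, b_lo = b >> 64, b & MASK64
    hh = a_hi * b_hi              # weight 2^128
    hl = a_hi * b_lo              # weight 2^64
    lh = a_lo * b_hi              # weight 2^64
    ll = a_lo * b_lo              # weight 2^0
    return (hh << 128) + ((hl + lh) << 64) + ll

a = (1 << 112) | 0x0123456789ABCDEF
b = (1 << 112) | 0x0FEDCBA987654321
print(mul128_via_64(a, b) == a * b)
```

Done serially on one multiplier this is four passes; done on four multipliers at once it is one pass plus the extra additions, which fits the shape of the estimates above.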

  • From Lawrence D'Oliveiro@21:1/5 to BGB on Fri Aug 2 06:43:25 2024
    On Fri, 2 Aug 2024 00:47:56 -0500, BGB wrote:

    There are shadows, but given there are not any obvious gradients in
    the shadows; they could likely have been pulled off using the
    stencil buffer.

    There are offline renderers that are built on similar principles to
    video-game engines. That’s not the same as using OpenGL for offscreen rendering.

    I know because I did some experiments along these lines when I was
    learning Android programming. OpenGL engines tend not to prioritize
    efficient retrieval of rendered images back into program-local
    offscreen buffers.

    Slightly older, same company: https://www.youtube.com/watch?v=MfT2MaaukrQ&list=PLV4Ztn9euy7TJJrTgY5jgswEo_GY0ZHqj

    The show is ReBoot...

    I think, back in those early days, OpenGL was seriously considered for
    use as an offline rendering engine.

    Not any more, though.

  • From Lawrence D'Oliveiro@21:1/5 to BGB on Fri Aug 2 02:28:09 2024
    On Thu, 1 Aug 2024 19:17:00 -0500, BGB wrote:

    On 8/1/2024 6:09 PM, Lawrence D'Oliveiro wrote:

    On Thu, 1 Aug 2024 04:49:50 -0500, BGB wrote:

    IME, typically OpenGL HDR framebuffers used 4x Binary16.

    OpenGL is only for on-screen stuff.

    You can use it for offline rendering as well ...

    Not for serious use. Unless you can give some examples?

  • From Thomas Koenig@21:1/5 to EricP on Fri Aug 2 05:53:48 2024
    EricP <ThatWouldBeTelling@thevillage.com> schrieb:

    With FP128 will there again be a significant difference in speed to
    FP64 or FP32 (including transcendentals)? Seems there would be because
    not every HW implementation is going to implement a full width multiplier.

    The only major architecture I'm aware of that uses FP128, POWER,
    chose to use their decimal FP unit to do it on the side.

    This makes multiplication _really_ slow, unfortunately.

  • From Lawrence D'Oliveiro@21:1/5 to BGB on Sat Aug 3 00:07:15 2024
    On Fri, 2 Aug 2024 03:56:35 -0500, BGB wrote:

    For batch rendering, it doesn't really need to be efficient though.

    Of course it does. Ray-traced renderers can compute hundreds or thousands
    of rays per pixel, taking anywhere from minutes to hours per frame. This
    is how you produce those cinema-quality 4K (or even 8K) frames. Anything
    that can shave a little bit off the time for a single ray computation can
    very quickly add up.

    General strategy would be to setup a context, render to the context, and
    then fetch the rendered image using glReadPixels or similar.

    The "glReadPixels()" function is not exactly new...

    Almost verging on useless, in my (albeit limited) tests. OpenGL, Vulkan
    and the like are targeted towards on-screen rendering, and that’s what you should stick to using them for.

    Don’t confuse this with the fact that programmable GPUs are adaptable to
    run SIMD-oriented compute APIs like OpenCL/SYCL (and their proprietary
    rivals), and that these are commonly used to implement high-quality
    offline renderers like those I mentioned above. These have nothing to do
    with OpenGL.

  • From Lawrence D'Oliveiro@21:1/5 to Chris M. Thomasson on Sat Aug 3 01:35:26 2024
    On Fri, 2 Aug 2024 17:18:17 -0700, Chris M. Thomasson wrote:

    Taking a screen shot? glReadPixels is okay, right?

    That means grabbing the whole screen, regardless of what else it might be
    showing, assuming your window is not occluded, and it further assumes you
    have a screen to grab from. This kind of precludes running the renderer as
    a batch, background process.

  • From Lawrence D'Oliveiro@21:1/5 to BGB on Sat Aug 3 05:23:38 2024
    On Fri, 2 Aug 2024 21:23:36 -0500, BGB wrote:

    Around the time of OpenGL 3.0, they added FBO's (FrameBuffer Objects),
    which allowed rendering directly into a texture without needing to use glReadPixels followed by glTexImage2D.

    That doesn’t help, because those FBOs are intended to be used as textures
    for further rendering on-screen.

  • From Lawrence D'Oliveiro@21:1/5 to BGB on Sat Aug 3 05:33:01 2024
    On Fri, 2 Aug 2024 21:14:33 -0500, BGB wrote:

    On 8/2/2024 8:35 PM, Lawrence D'Oliveiro wrote:

    On Fri, 2 Aug 2024 17:18:17 -0700, Chris M. Thomasson wrote:

    Taking a screen shot? glReadPixels is okay, right?

    That means grabbing the whole screen, regardless of what else it might
    be showing, assuming your window is not occluded, and it further
    assumes you have a screen to grab from. This kind of precludes running
    the renderer as a batch, background process.

    The glReadPixels call doesn't grab an image from the OS desktop, but
    rather from one of the internal framebuffers associated with the OpenGL
    context.

    Still lousy performance, though. That is simply not considered a serious
    usage scenario for OpenGL. And not for any other on-screen 3D API, as far
    as I’m aware.

  • From Lawrence D'Oliveiro@21:1/5 to BGB on Sat Aug 3 05:32:15 2024
    On Fri, 2 Aug 2024 20:54:47 -0500, BGB wrote:

    On 8/2/2024 7:07 PM, Lawrence D'Oliveiro wrote:

    On Fri, 2 Aug 2024 03:56:35 -0500, BGB wrote:

    For batch rendering, it doesn't really need to be efficient though.

    Of course it does. Ray-traced renderers can compute hundreds or
    thousands of rays per pixel, taking anywhere from minutes to hours per
    frame. This is how you produce those cinema-quality 4K (or even 8K)
    frames. Anything that can shave a little bit off the time for a single
    ray computation can very quickly add up.

    You don't do raytracing in OpenGL...

    That is not the point of using it.

    Precisely what I have been trying to say.

    Similarly, not everything needs raytracing.
    For many uses, rasterization is fine.

    There are renderers that are based on video-game engines. And they can
    indeed use GPU-based compute APIs, as I mentioned previously. But they
    still don’t use OpenGL for rendering.

    For a lot of shows, you don't even really need 3D.

    Funny, that. It is quite common to create 2D-style animations with 3D
    software nowadays.

    One might instead go the other direction, and use a 3D renderer with a
    cel-shading filter to try to make it fit better with the anime
    characters ...

    It takes a bit more than cel-shading to get a good 2D look with 3D
    software.

    <https://www.youtube.com/watch?v=pKmSdY56VtY>

  • From Terje Mathisen@21:1/5 to Lawrence D'Oliveiro on Sat Aug 3 11:47:01 2024
    Lawrence D'Oliveiro wrote:
    On Wed, 31 Jul 2024 18:31:35 -0500, BGB wrote:

    Binary16 is useful for graphics and audio processing.

    The common format for CG work is OpenEXR, and that allows for 32-bit and
    even 64-bit floats, per pixel component. So for example R-G-B-Alpha is 4
    components.

    The 8-bit formats get a bit more niche; main use-cases mostly to save
    memory.

    Heavily used in AI work.

    The nicest property of fp8, as seen from a GPU's point of view, is that
    arbitrary operations can be seen as texture map lookups. I don't think
    that's how they are implemented, but an 8x8->16 FMUL would only need a
    few very small lookup tables, probably doable even on a regular CPU with
    16-element permute operations.
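
    A minimal sketch of the table idea, assuming an IEEE-style S.E4.F3 layout
    with bias 7 and handling normal inputs only (zeros, subnormals, Inf and
    NaN are left out). The names SIG_PRODUCT, fmul_normal and fits_binary16
    are made up for this example; it is not how any particular GPU does it.

```python
import math
import struct

FRAC_BITS = 3
BIAS = 7

# 8x8 table of significand products. Each significand 1.fff is held as the
# integer 8..15, so every entry is at most 15 * 15 = 225 and fits in a byte.
SIG_PRODUCT = [[(8 + a) * (8 + b) for b in range(8)] for a in range(8)]


def split(x):
    """Split an 8-bit encoding into (sign, exponent field, fraction field)."""
    return x >> 7, (x >> FRAC_BITS) & 0xF, x & 0x7


def decode_normal(x):
    """Reference decode of a normal fp8 value, used to check the table path."""
    s, e, f = split(x)
    v = math.ldexp(8 + f, e - BIAS - FRAC_BITS)
    return -v if s else v


def fmul_normal(x, y):
    """Exact product of two normal fp8 values via one table lookup."""
    sx, ex, fx = split(x)
    sy, ey, fy = split(y)
    if not (1 <= ex <= 14 and 1 <= ey <= 14):
        raise ValueError("sketch handles normal inputs only")
    sig = SIG_PRODUCT[fx][fy]  # significand product, scaled by 2**6
    value = math.ldexp(sig, (ex - BIAS) + (ey - BIAS) - 2 * FRAC_BITS)
    return -value if sx ^ sy else value


def fits_binary16(v):
    """True if v survives a round trip through IEEE binary16 unchanged."""
    return struct.unpack("<e", struct.pack("<e", v))[0] == v


if __name__ == "__main__":
    # 0x3C encodes 1.5 and 0x40 encodes 2.0 under the assumed layout.
    print(fmul_normal(0x3C, 0x40))
```

    The sign is an XOR and the exponents just add, so the only real lookup is
    the 64-entry significand table, whose 8-entry rows are small enough for a
    permute-style lookup. Under the assumed layout every product of two
    normal inputs also round-trips through binary16 without rounding.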

    Terje

    --
    - <Terje.Mathisen at tmsw.no>
    "almost all programming can be viewed as an exercise in caching"

  • From Terje Mathisen@21:1/5 to All on Sat Aug 3 11:40:23 2024
    MitchAlsup1 wrote:
    On Wed, 31 Jul 2024 23:31:35 +0000, BGB wrote:

    So, say, we have common formats:
       Binary64, S.E11.F52, Common Use
       Binary32, S.E8.F23, Common Use
       Binary16, S.E5.F10, Less Common Use

    But, things get funky below this:
       A-Law: S.E3.F4 (Bias=8)
       FP8: S.E4.F3 (Bias=7) (E4M3 in NVIDIA terms)
       FP8U: E4.F4 (Bias=7)
       FP8S: E4.F3.S (Bias=7)


    Semi-absent in my case:
       BFloat16: S.E8.F7
         Can be faked in software in my case using Shuffle ops.
       NVIDIA E5M2 (S.E5.F2)
         Could be faked using RGBA32 pack/unpack ops.
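
    For comparing the S.E.F layouts listed above, here is a small sketch that
    derives the range of each one. It assumes plain IEEE-754 conventions
    (default bias 2**(E-1)-1, all-ones exponent reserved for Inf/NaN,
    all-zeros for zero/subnormals); format_params is a hypothetical helper.
    A-Law and the unsigned/sign-last FP8 variants are left out because they
    use other conventions.

```python
from fractions import Fraction


def format_params(exp_bits, frac_bits, bias=None):
    """Return (bias, min_subnormal, min_normal, max_finite) as exact values."""
    if bias is None:
        bias = (1 << (exp_bits - 1)) - 1
    emin = 1 - bias                       # exponent of the smallest normal
    emax = (1 << exp_bits) - 2 - bias     # top exponent code is Inf/NaN
    min_subnormal = Fraction(2) ** (emin - frac_bits)
    min_normal = Fraction(2) ** emin
    max_finite = (2 - Fraction(1, 1 << frac_bits)) * Fraction(2) ** emax
    return bias, min_subnormal, min_normal, max_finite


FORMATS = {
    "Binary64": (11, 52),
    "Binary32": (8, 23),
    "Binary16": (5, 10),
    "BFloat16": (8, 7),
    "S.E4.F3": (4, 3),
    "S.E5.F2": (5, 2),
}

if __name__ == "__main__":
    for name, (e, f) in FORMATS.items():
        bias, sub, norm, top = format_params(e, f)
        print(f"{name:9} bias={bias:5} max={float(top):.6g} "
              f"min_normal={float(norm):.6g} min_subnormal={float(sub):.6g}")
```

    Note that NVIDIA's E4M3 reclaims most of the top exponent code for finite
    values, so its largest finite value (448) is higher than the 240 this
    IEEE-style calculation gives for S.E4.F3.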

    So, you have identified the problem:: 8-bits contains insufficient
    exponent and fraction widths to be considered standard format.
    Thus, in order to utilize 8-bit FP one needs several incarnations.
    This just points back at the problem:: FP needs at least 10 bits.

    I agree that fp10 is probably the shortest sane/useful version, but
    1:3:4 does in fact contain enough exponent and mantissa bits to be
    considered an IEEE 754 format.

    3 exp bits means that you have 6 steps for regular/normal numbers, which
    is enough to give some range.

    4 mantissa bits (with hidden bit of course) handle
    zero/subnormal/normal/infinity/QNaN/SNaN.

    AFAIR the absolute limit is two mantissa bits, in order to differentiate
    between Inf/QNaN and SNaN, as well as two exp bits, so fp5 (1:2:2).
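
    The counting argument can be checked by brute force. This sketch
    classifies every encoding of a 1:E:F format under IEEE-754 rules, with
    the assumption (common, but labelled here) that the top fraction bit
    marks a quiet NaN. The function names are made up for this example.

```python
from collections import Counter


def classify(bits, exp_bits, frac_bits):
    """Classify one encoding of a sign:exp_bits:frac_bits format."""
    frac = bits & ((1 << frac_bits) - 1)
    exp = (bits >> frac_bits) & ((1 << exp_bits) - 1)
    if exp == 0:
        return "zero" if frac == 0 else "subnormal"
    if exp == (1 << exp_bits) - 1:
        if frac == 0:
            return "inf"
        # Assumption: top fraction bit set means quiet NaN.
        return "qnan" if frac >> (frac_bits - 1) else "snan"
    return "normal"


def census(exp_bits, frac_bits):
    """Count how many encodings fall into each class."""
    total = 1 << (1 + exp_bits + frac_bits)
    return Counter(classify(b, exp_bits, frac_bits) for b in range(total))


def normal_exponents(exp_bits):
    """Exponent codes left for normal numbers (all-zeros and all-ones excluded)."""
    return (1 << exp_bits) - 2


if __name__ == "__main__":
    for e, f in [(3, 4), (2, 2), (2, 1)]:
        print(f"1:{e}:{f}", normal_exponents(e), dict(census(e, f)))
```

    For 1:3:4 this gives 6 normal exponent steps. For 1:2:2 every class is
    still populated, while 1:2:1 has no encoding left for a signalling NaN,
    which matches the two-mantissa-bit limit.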

    Terje

    --
    - <Terje.Mathisen at tmsw.no>
    "almost all programming can be viewed as an exercise in caching"
