@ghostlander said:
@Wolf0 350 * 32 / 28 = 400KH/s
That’s what I have now. Maybe a little bit more. The primary reason for downvolting and downclocking is power consumption indeed. HD7990 with the default 1000/1500 clocks @ 1.2V eats too much power (400W+) and air cooling cannot keep it within 85C. Now it’s 250W @ 1.0V and gets within 70C just fine.
Modern high-end GPUs have excessive memory bandwidth for NeoScrypt with quality kernel optimisations. That's not a problem. SHA-256d ASICs don't consume it at all, while Scrypt ASICs may also be compute bound depending on eDRAM size/speed and the number of execution units.
I can patch in xintensity and/or rawintensity, though I doubt it’s going to make a big difference.
It’s not quite that there’s excessive memory bandwidth - the bandwidth isn’t used much - ESPECIALLY not in your kernel (at least the last one I saw). The reason is simple: you can only really run a limited number of work-items before the CUs can’t execute them in parallel, and the extra wavefronts get queued up for execution after the current ones complete. Because of this, only the waves currently in flight are going to be accessing memory at all. NeoScrypt is NOT memory intensive - you can’t even exhaust GPU memory before you exhaust compute if you run NeoScrypt with parallel SMix() calls - at least not on a 7950 (R9 280) or 7970 (R9 280X). I know because I tried.

Now, keeping in mind you can only run as many work-items in parallel as your compute resources will support, factor in that NeoScrypt doesn’t really do a ton of lookups compared to the level of computation, AND that said lookups are random and not sequential, and you get the result that memory bandwidth isn’t going to help you really at all.

In your kernel particularly - not only do you have few waves in flight, but you’re running NeoScrypt in such a way that the SMix() calls are sequential! (Again, this is as of the last time I looked, and GitHub is currently down.) Because of this, each wavefront is actually doing half the memory lookups it could be (at a time), because it does the ChaCha SMix() and then the Salsa SMix() sequentially - they aren’t done in parallel, so the memory bus is less utilized.
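To illustrate the call-shape point - this is a minimal structural sketch, NOT ghostlander’s kernel: the step functions are trivial stand-ins for the real ChaCha/Salsa core mixing, per-work-item scratchpad offsets are omitted, and every name here is hypothetical. Only the loop structure matters:

```c
#define SMIX_N 128  /* NeoScrypt's N */

/* Trivial stand-ins: a store plus a random-ish dependent lookup. */
void smix_chacha_step(uint16 *X, __global uint16 *V, uint i) {
    V[i] = *X;
    (*X).s0 ^= V[(*X).s0 & (SMIX_N - 1)].s0;
}

void smix_salsa_step(uint16 *Z, __global uint16 *W, uint i) {
    W[i] = *Z;
    (*Z).s0 ^= W[(*Z).s0 & (SMIX_N - 1)].s0;
}

/* Both shapes shown back-to-back purely for illustration. */
__kernel void smix_shapes(__global uint16 *V, __global uint16 *W) {
    uint16 X = (uint16)((uint)get_global_id(0));
    uint16 Z = X;

    /* Sequential shape: the bus only ever sees one lookup stream. */
    for (uint i = 0; i < SMIX_N; ++i) smix_chacha_step(&X, V, i);
    for (uint i = 0; i < SMIX_N; ++i) smix_salsa_step(&Z, W, i);

    /* Interleaved shape: two independent lookup streams per
       wavefront, roughly doubling outstanding memory requests. */
    for (uint i = 0; i < SMIX_N; ++i) {
        smix_chacha_step(&X, V, i);
        smix_salsa_step(&Z, W, i);
    }
}
```

The interleaved loop is where the “half the memory lookups” figure above comes from: with the sequential shape, a wavefront’s Salsa lookups can’t even be issued until all of its ChaCha lookups are done.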
A smaller issue you have is code size - weighing in at 121,264 bytes, that’s fully 3.7x the GCN code cache. Those three extra fetches will hurt you - they hurt me pretty badly in my main NeoScrypt loop until I finally coaxed the compiler into not duplicating code in it.

Additionally, your XORs in FastKDF aren’t doing you any favors - the design of FastKDF makes it a bitch to do aligned copies, but not impossible. For now, I’ve just done the XOR into the destination in an aligned fashion, as I’ve yet to make the non-byte-wise copies and other XORs play nice and FINALLY coax the AMD OpenCL compiler into doing away with the scratch registers, which I believe are costing me quite a bit.

Your XZ variable is also one ulong16 larger than it needs to be for the main function - the extra space is only needed inside FastKDF. I haven’t checked the disassembly to see if this hurts you, but it could be causing undue register usage: because it isn’t local to the FastKDF function (and the AMD OpenCL compiler is very, very stupid), it might just be holding those registers instead of reusing them in the main portion of NeoScrypt.
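On the aligned-XOR point, here’s roughly the shape of the difference - a hedged sketch where the buffer names, the 256-byte size, and the two-destination layout are all illustrative, not FastKDF’s actual buffer layout:

```c
__kernel void xor_shapes(__global uchar *dst_bytes,
                         __global uchar *dst_vec,
                         __global const uchar *src) {
    /* Byte-wise XOR: 256 single-byte operations. */
    for (uint i = 0; i < 256; ++i)
        dst_bytes[i] ^= src[i];

    /* Aligned XOR: the same 256 bytes as sixteen 16-byte vector
       ops - valid only when the pointers are 16-byte aligned. */
    __global uint4 *d = (__global uint4 *)dst_vec;
    __global const uint4 *s = (__global const uint4 *)src;
    for (uint i = 0; i < 16; ++i)
        d[i] ^= s[i];
}
```

FastKDF’s sliding buffer offset is exactly what makes the alignment precondition of the second loop hard to guarantee - which is why I said it’s a bitch, but not impossible.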
There are other smaller bits, like your little if/else branch where you XOR into the password, which isn’t needed (and you know how GPUs hate branching, I’m sure) - but I don’t think those are hurting you too much, except maybe cumulatively.
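For what it’s worth, the usual way to drop a branch like that is a mask - a minimal sketch, with cond/password/mix as hypothetical names rather than your kernel’s actual variables:

```c
__kernel void branchless_xor(__global uint *password,
                             __global const uint *mix,
                             __global const uint *cond) {
    uint gid = get_global_id(0);
    /* All-ones mask when cond is nonzero, zero otherwise - no jump. */
    uint mask = 0U - (cond[gid] != 0U);
    /* The XOR executes unconditionally; the mask decides its effect. */
    password[gid] ^= mix[gid] & mask;
}
```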
EDIT: Oh, almost forgot to respond to the comment about xintensity and rawintensity (hereafter xI and rI, respectively). The options intensity and xI are really convenience options - the only one you REALLY need is rI, as any value passed to the others has an rI equivalent.

Now, the reason I believe xI and/or rI will make a small, yet significant difference is scheduling. If you load up the compute units of your card manually, it takes some of the load off the scheduler, which MAY otherwise do something sub-optimal. For example, 2^14 is 16,384 work-items - at 64 work-items per wavefront and 28 CUs per GPU on a 7950 (and R9 280) - if the host code doesn’t enqueue fast enough, or for some reason there’s some kind of stall, you could end up running 16,384 / (28 * 64) = 16,384 / 1,792 ≈ 9.143, i.e., 9 full passes with the remaining ~0.143 of a pass scheduled to run alone - leaving most of the CUs idle for that run. It’s not like this is extremely likely to occur often, but I’m thinking it likely does occur, because I’ve seen improvements by using xI instead of regular intensity with NeoScrypt before now. Because of this, while I do not believe the difference will be major, I do believe it will be significant.
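The arithmetic is easy to check on the host side. A small sketch, assuming sgminer-style semantics - intensity I enqueues 2^I work-items while rI sets the global size directly (and xI scales it off the shader count); if your fork differs, adjust accordingly:

```c
#include <stdio.h>

int main(void) {
    const unsigned cus = 28, wavefront = 64;   /* 7950 / R9 280      */
    const unsigned per_pass = cus * wavefront; /* 1,792 work-items   */

    unsigned from_intensity = 1u << 14;        /* intensity 14       */
    unsigned from_ri = per_pass * 9;           /* rI sized to fit    */

    printf("intensity 14 (%u): %u passes, %u stragglers\n",
           from_intensity, from_intensity / per_pass,
           from_intensity % per_pass);
    printf("rI %u: %u passes, %u stragglers\n",
           from_ri, from_ri / per_pass, from_ri % per_pass);
    return 0;
}
```

With intensity 14 you get 9 full passes plus 256 straggler work-items (4 lonely wavefronts); an rI of 16,128 divides evenly, which is the whole argument for sizing the global work size to the hardware by hand.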