@ghostlander said:
@Wolf0 I get it. I’ve also rewritten it. The code quoted is plain bytewise, though old VLIW GPUs like it for some arcane reason.
Odd. I got my 6970 today, so I should be able to work on Cayman in a while.
@ghostlander said:
@Wolf0 https://github.com/ghostlander/nsgminer/blob/692e2ef2946229cf057dd006c8e85c8674f0342f/neoscrypt.cl#L713
It’s executed 64 times per hash. The final XOR outside the loop is less important.
@Wolf0 said:
Unless you mean something you've not pushed, in which case never mind. If you have, then nice - my trick with aligning the XOR worked out for you.
Well, I added it to my beta 10 days ago. You mentioned doing the bytewise XOR in uints; I have vectorised it, which is also fine. Not uploaded to GitHub yet, but quite a few people use it right now. It's well improved over the previous release in performance and compatibility. I see only a 5% decrease while switching from 14.6 to 15.7 drivers. It was much worse before (https://bitcointalk.org/index.php?topic=712650.msg13585416#msg13585416).
OH, lol, yes, that is good, but that was not what I meant! This line:
[code]
neoscrypt_bxor(&Bb[bufptr], &T[0], 32);
[/code]
I’m saying I did this operation using uints.
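Roughly, something along these lines (a minimal sketch only - it assumes you've already arranged for &Bb[bufptr] and &T[0] to be uint-aligned, which is the annoying part in FastKDF):
[code]
/* Sketch: XOR the 32 bytes as eight uints instead of 32 single bytes.
 * Both pointers are assumed uint-aligned at this point. */
uint *dst = (uint *) &Bb[bufptr];
uint *src = (uint *) &T[0];
#pragma unroll
for(uint i = 0; i < 8; i++)
    dst[i] ^= src[i];
[/code]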
@ghostlander said:
@Wolf0 I have optimised the most important XOR in FastKDF already. It was a bottleneck to do it bytewise on GCN. 120K kernel size isn’t very large because Salsa and ChaCha separately fit the code cache and FastKDF has more important issues like memory alignment. I’ll try to optimise it better.
Which XOR would that be? I feel like I'm derping and missing something obvious, but I see the ending XOR with the if/else branch outside the loop, and the XOR inside the loop which is done with a call to neoscrypt_bxor()… I just looked at your current git again, double-checked this, then read the neoscrypt_bxor() function again - it's still bytewise. Unless you mean something you've not pushed, in which case never mind. If you have, then nice - my trick with aligning the XOR worked out for you.
Anyways, you seem to be working from the outside in, rather than from the inside out, when it comes to the optimization of the code - the “outside” being the portions with less time spent, and the “inside” being the opposite. You really might want to look into SMix() - that’s where you really can gain hashrate.
@wrapper said:
I like the idea on optimising on “power efficiency”, not “speed”. ;)
They are almost always one and the same in the GPU arena. If I have shitty, slow code, it leaves portions of the GPU unused, or at least under-utilized, causing the lower power consumption people notice. However - if those resources are used well, then the hashrate goes up far more than power does. I actually have records from my really old X11 optimizations to show this, as well as exact percentages taken from runs of the (then) stock X11 shipping with SGMiner and mine on Freya.
I still haven’t patched in more precise intensity, but I have managed to improve upon my 01/17/2016 record by around 2.657% - 425kh/s on 7950 at 1050/1500 now. To compare with the 7990, I also ran a test at 1000/1500 - 410kh/s to 411kh/s. I’ll do some tests on power draw later.
My two cents: any AMD driver between 15.7 and 16.1 is TERRIBLE for mining (and probably ANYTHING except gaming). It might be okay with certain algos, I don't know, but the ones I tried? The OpenCL compiler just butchered them. Using bin files, or GCN assembly to generate your own bin files, should be okay. 16.x I haven't evaluated yet.
@ghostlander said:
@Wolf0 350 * 32 / 28 = 400KH/s
That’s what I have now. Maybe a little bit more. The primary reason for downvolting and downclocking is power consumption indeed. HD7990 with the default 1000/1500 clocks @ 1.2V eats too much power (400W+) and air cooling cannot keep it within 85C. Now it’s 250W @ 1.0V and gets within 70C just fine.
Modern high end GPUs have excessive memory bandwidth for NeoScrypt with quality kernel optimisations. That’s not a problem. SHA-256d ASICs don’t consume it at all while Scrypt ASICs may also be compute bound depending on eDRAM size/speed and number of execution units.
I can patch in xintensity and/or rawintensity, though I doubt it’s going to make a big difference.
It's not quite the excessive memory bandwidth - the bandwidth isn't used much, ESPECIALLY not in your kernel (at least the last one I saw.) The reason is simple - you can only really run a limited number of work-items before the CUs can't execute them in parallel and the extra wavefronts get queued up for execution after the current ones complete. Because of this, only the waves currently in flight are going to be accessing memory at all.

NeoScrypt is NOT memory intensive - you can't even exhaust GPU memory before you exhaust compute if you run NeoScrypt with parallel SMix() calls - at least not on 7950 (R9 280) or 7970 (R9 280X). I know because I tried. Now, keeping in mind you can only run as many work-items in parallel as your compute resources will support, factor in the fact that NeoScrypt doesn't really do a ton of lookups compared to the level of computation, AND that said lookups are random and not sequential, and you get the result that memory bandwidth really isn't going to help you at all.

In your kernel particularly - not only do you have few waves in flight, but you're running NeoScrypt in such a way that the SMix() calls are sequential! (Again, this is last I looked, and GitHub is currently down.) Because of this, each wavefront is actually doing half the memory lookups it could be doing (at a time), because it runs the ChaCha SMix() and the Salsa SMix() one after the other - they aren't done in parallel, so the memory bus is less utilized.
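To make the structural point concrete, a throwaway sketch (NOT your kernel - dummy_mix() is just a stand-in for a full SMix() pass; only the shape of the two kernels matters here):
[code]
/* Throwaway sketch: dummy_mix() stands in for a full SMix() pass. */
void dummy_mix(__global uint *scratch, uint *state)
{
    for(uint i = 0; i < 128; i++)
    {
        scratch[i] = state[i & 15];              /* stand-in scratchpad write */
        state[(i + 1) & 15] ^= scratch[i >> 1];  /* stand-in scratchpad read  */
    }
}

/* Sequential form: one work-item runs the "ChaCha" pass, and only then the
 * "Salsa" pass - the second mixer's memory traffic never overlaps the first. */
__kernel void sequential_form(__global uint *scratch)
{
    __global uint *my = scratch + get_global_id(0) * 256;
    uint a[16] = { 0 }, b[16] = { 0 };
    dummy_mix(my, a);
    dummy_mix(my + 128, b);
}

/* Parallel form: a second NDRange dimension (size 2) picks the mixer, so both
 * sets of scratchpad accesses are in flight on the memory bus at once. */
__kernel void parallel_form(__global uint *scratch)
{
    __global uint *my = scratch + (get_global_id(0) * 2 + get_global_id(1)) * 128;
    uint s[16] = { 0 };
    dummy_mix(my, s);
}
[/code]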
A smaller issue you have is code size - weighing in at 121,264 bytes, that's fully 3.7x the GCN code cache. Those three extra fetches will hurt you - it hurt me pretty badly in my main NeoScrypt loop until I finally coaxed the compiler into not duplicating code in it.

Additionally, your XORs in FastKDF aren't doing you any favors - the design of FastKDF makes it a bitch to do aligned copies, but not impossible. For now, however, I've just done the XOR into the destination in an aligned fashion, as I've yet to make the non-bytewise copies and other XORs play nice and FINALLY coax the AMD OpenCL compiler into doing away with the scratch registers, which I believe are costing me quite a bit.

Your XZ variable is also one ulong16 larger than it needs to be for the main function - it's only needed inside FastKDF. I haven't checked the disassembly to see if this hurts you, but it could be causing undue register usage: because it isn't local to the FastKDF function (and the AMD OpenCL compiler is very, very stupid), it might just be holding those registers and not reusing them in the main portion of NeoScrypt.
There are other smaller bits - like the fact that your little if/else branch where you XOR into the password isn't needed (and you know how GPUs hate branching, I'm sure) - but I don't think those are hurting you too much, except maybe cumulatively.
EDIT: Oh, almost forgot to respond to the comment about xintensity and rawintensity (hereafter xI and rI, respectively). The options intensity and xI are really convenience options - the only one you REALLY need is rI, as any value passed to the others has an equivalent in rI.

Now, the reason I believe xI and/or rI will make a small yet significant difference is scheduling. If you load up the compute units of your card manually, it takes a bit of the load off the scheduler, which MAY do something sub-optimal. For example, 2^14 is 16,384 work-items; with 64 work-items per wavefront and 28 CUs per GPU on a 7950 (and R9 280), if the host code doesn't enqueue fast enough, or for some reason there's some kind of stall, you could end up running 16,384 / (28 * 64) = 16,384 / 1,792 = 9 full waves with the remaining ~0.143 scheduled to run alone - leaving most of the CUs idle for that run. It's not like this is extremely likely to occur often, but I'm thinking it likely does occur, because I've seen improvements from using xI instead of regular intensity with NeoScrypt before now. Because of this, while I don't believe the difference will be major, I do believe it will be worthwhile.
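The same numbers worked through, just to show where the idle CUs come from (plain arithmetic, nothing kernel-specific):
[code]
unsigned work_items = 1u << 14;    /* classic intensity 14 -> 16,384 work-items       */
unsigned per_round  = 28u * 64u;   /* 28 CUs * 64 work-items per wavefront = 1,792    */
unsigned full       = work_items / per_round;  /* 9 full rounds                       */
unsigned tail       = work_items % per_round;  /* 256 left over - only 4 wavefronts,  */
                                               /* so 24 of the 28 CUs idle that round */
[/code]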
@wrapper said:
Don’t worry about it Wolf, it’s a specialised software, so unlikely to be targeted any way.
As you know with security, just being able to cause a crash, maybe via spear phishing or whatever, could let someone in. That's unlikely to work if the miner is on separate secure PCs. But we are talking about decentralised "mining money", so it should be considered a target.
As you probably understand, my main aim is to make the software open, not accuse you of inserting a trojan…
P.S. Thanks for the other info, which areas of hardware are secured and what the GPU can and can’t do is an interesting area…
Not a problem; and don’t worry - I didn’t take it personally. I’m just saying I personally would feel comfortable (and I don’t see why anyone wouldn’t) in running (for example) Kachur’s X11 GPU binaries with my own host code - even though it was found his miner was exploiting pools to look faster in the case of XMR (I think it was XMR.)
On the other paw, I disassembled and modified the GCN assembly, then reassembled it before running it - but this wasn’t for security purposes; I simply wanted to improve the speed of the X11 code.
I'm rather surprised - but not totally confused - by the results of my test. Clocks of 850/1250 result in a hashrate over 350kh/s on a 7950 with my code. I wondered if there was a reason besides the heat (and therefore probably downvolting) that caused Ghostlander to pick those rather low clocks.
Still not quite the fastest - I can reach just under (sometimes slightly above) 430kh/s on a 7950. Ghostlander’s latest result (his v7 Beta kernel) that he posted here: https://bitcointalk.org/index.php?topic=712650.msg13611456#msg13611456 is 400kh/s on a 7990.
Now, there IS a big discrepancy in our clock speeds - I'm running 1100/1500 while he's running 850/1250 - but there's also a discrepancy in compute unit count: a 7950's core has 28 compute units (CUs), while a 7970 core (the 7990 has two of them) has 32 CUs. Because of this, and the fact that NeoScrypt as used in Feathercoin's proof-of-work is compute-bound on current GPUs - NOT memory-bound, as is a common misconception, or at least a good implementation should not be - on a 7970 (or R9 280X), NeoScrypt should have a 14.286% boost in hashrate over a 7950 (or R9 280).
For fun, before I lower my clocks and test again, I’ll estimate based on core clock speeds: My kernel should lose 29.412% speed at 850 core (I’ll drop the memclock, too, but I don’t think it’ll have much of an effect at all), which means my new hashrate on 7950 should turn out to be 303kh/s.
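For reference, the back-of-the-envelope arithmetic behind those figures (just the linear scaling I'm assuming, nothing measured yet):
[code]
double cu_gain    = 32.0 / 28.0 - 1.0;          /* 0.14286 -> ~14.286% more CUs on a 7970/280X */
double clock_loss = (1100.0 - 850.0) / 850.0;   /* 0.29412 -> the ~29.412% figure above        */
double est_850    = 430.0 * (1.0 - clock_loss); /* ~303 kh/s estimated on a 7950 at 850 core   */
[/code]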
@ghostlander said:
I recall SGMiner recommends using xintensity. It isn't guaranteed to deliver power-of-2 thread counts, which is a must for my kernel. The classic intensity results in 2^intensity threads, which is fine.
I highly disagree that this is “fine” - assuming that means optimal. However, I haven’t patched in any xintensity support yet, so I’m still forced to use the very coarse-grained classic intensity as well.
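To make the granularity point concrete (assuming rI maps straight onto the global work size, which is how I read it):
[code]
/* Classic intensity only moves in powers of two...                          */
size_t i13 = (size_t)1 << 13;   /*  8,192 work-items                         */
size_t i14 = (size_t)1 << 14;   /* 16,384 work-items - the next step DOUBLES */
/* ...while a raw work size can land exactly on a multiple of full occupancy,
 * e.g. 28 CUs * 64 work-items = 1,792 on a 7950:                            */
size_t raw = 1792 * 8;          /* 14,336 work-items, no partial tail        */
[/code]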
@wrapper said:
I don't want to give any ideas, but you are perhaps being over-ambitious - the NSA might just gradually leak a key?
You are correct that it would be harder to exploit than a plain binary, but I have seen and run other code on the GPU.
I have a really hard time posting on this new forum - often the little pop-up for creating/editing a message doesn’t come up. I don’t know about Nvidia, but I know AMD, and specifically AMD GCN, in quite deep detail. Leak what key? What kind of key? The only thing you could possibly leak off the GPU would be the winning nonce, and even that has zero value - the open source miner code will submit it to the pool or to the network (in the case of solo mining) immediately.
But let's pretend I could access every possible bit of the GPU - I have DMA access to memory, MAYBE, over the PCI-E bus. That's about it - now, ehhh… maybe I could clobber some important OS structures like the EPROCESS (executive process block) doubly linked list in Windows, but that's IF I can (ab)use DMA to read/write to kernel mode, which I kinda doubt, ever since that vuln was found that abused DMA with FireWire devices (I think) to access privileged memory. This would allow me to crash the OS - basically a denial of service - but it could be fixed with a reboot, and then they wouldn't run the miner again.
Now, this is probably impossible since I can’t access the PCI-E capabilities of the card from a compute kernel - but even if I could, I wouldn’t have to in order to crash Windows. The AMD driver is so bad that I could lock up or outright cause a bug check (BSoD) in Windows simply by writing all over invalid GPU memory from the kernel - and this will happen on Linux too; these users are not exempt. I’ve done it by accident.
In any of these cases, the user reboots, and they don’t run that miner or GPU binary again, because it produces quite obvious undesired operation that interferes with their normal use of the system. There’s no wiggle room for malice here.
I get it - you're using a (usually valid) point, and you don't see how it doesn't quite apply the same way here.
To understand why it's not really much of a risk at all requires understanding the difference between the host and device when it comes to GPU acceleration. The host is your PC, the device its GPU. For all intents and purposes, they are completely separate computers connected over the PCI-e bus. Now, if you ran a binary-only executable on the host, you're absolutely correct in that this is a security risk and it can abuse your system. In the device's case, however… well, let's pretend I'm malicious:
I want to do something bad, but here’s how it has to happen - the kernel(s) called must be called from an open source application. The memory it uses must be allocated by the same, and passed to the kernel(s) that need it by the same. It also is expected to give a certain correct output (finding correct nonces) - if it does not, the game is up; obviously no miner will sit and run a kernel that doesn’t find any shares! Now, maybe I compute something extra for me on the side, eh? How do I get the data off the GPU? If I try using the open source application, it’s obvious. I have no access to the PCI-e shit through compute, and even if I did, the worst I could do is maybe DMA it to somewhere in memory… which I then cannot retrieve undetected because of the open source host code. I have no options.
@wrapper said:
Just a note on the future of "improving mining efficiency": I have seen that a new hardware device type called a "Crypto-Accelerator" has just come out.
It is used to accelerate encryption and decryption of messages, so companies can check all the encrypted traffic going in or out of the company network, or to speed up big sites where everything is now SSL…
It seems obvious that such a device has the potential to be leveraged as a programmable generic ASIC and would only need software (sgminer) to send and receive the encryption/decryption results to a crypto-accelerator instead of what is currently handled by the neoscrypt.cl GPU kernel.
Not the case. The algos used in Neo are NOT likely to end up in a crypto-accelerator, and on top of that, it's not magic - you're never going to get it to do NeoScrypt as a whole; you'd have to have it do the pieces of NeoScrypt and then do the rest on CPU/GPU. For example, have the accelerator do Blake - but even if you offloaded the core algos, I don't think it would have much of an effect; you've still got the memory usage to contend with.
I'd love to hear the rationale behind bins being a security risk - they're not "obviously" a risk. As a matter of fact, it's a LOT better for the end user - because plenty of end users either don't know how to find and install specific versions of AMD drivers, or install only the latest. Every other or maybe every third driver, AMD manages to utterly destroy what little quality their OpenCL compiler has - and sometimes fails to fix it for a few versions. Now, the binary helps because it was compiled with the CODER'S driver, meaning less rests on the user.
Oh, also, because of aforementioned terrible OpenCL driver, I now often drop to AMD GCN assembly language - what source would you have me release?
…but I’ve not been here in a while, either.
There isn’t a standard location where libraries go.
Yes, there is - default search path goes to system32 if it can’t find the DLL anywhere else.
Hmm. It is a library of sorts. Unfortunately Windows is a joke and does not even have libraries you can install separately from binaries the right way. Dammit.
Yes it does, it’s called a DLL.
Well, it is free, but you need people to donate a card, so that's not really free :) Now I guess you understand the reason for a pledge…
Regarding Nvidia cards, sure I support their products. They have nice cards; some are clearly overpriced, but not all. The 750 Ti was pretty cheap and pretty good for mining (and the forthcoming GTX 960 shouldn't be too expensive, with performance probably better than the 750 Ti).
Regarding OpenCL: Wolf0, who developed the NeoScrypt kernel for sgminer 5, claims he has a (private) kernel doing 600kh/s on a 290X - maybe you could try to reach that speed with yours and release it for free; I guess you already have the material (Wolf0 won't be happy if he reads this ;) )
I don’t just have claims, I has proof :P
But anyway, I’m not unhappy - I enjoy competition, it keeps me sharp.
Nice to know that the interest in improving the kernel didn’t go away when Wolf said he was keeping his improvements to himself…
I’m not in any way faulting him for that decision, but that doesn’t mean that I have to like it either… ;)
BTW, I got into X11 mining, and guess who showed up as the top kernel writer… lol
Haha, yeah, my bins leaked.
Also, this:
[code]
switch(mod % 4)
[/code]
is just bad.
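To spell out why (a guess at the intent, since the surrounding switch body isn't quoted here): for an unsigned value the compiler folds % 4 into a mask anyway, but the data-dependent switch itself still compiles to divergent branches on a GPU, and that's the part worth designing out.
[code]
uint sel = mod & 3;  /* same value as mod % 4 when mod is unsigned, minus the modulo */
/* ...but the real cost is the switch(sel) that follows: each case is a
 * divergent branch per work-item, which GCN handles poorly. */
[/code]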
Can't wait to get it. Currently I'm getting just 40k out of my GeForce 750 Ti.
It does 3x that, I think.
FPGA before anyone even makes a CUDA miner for NeoScrypt? It’s possible but doubtful.
CUDA miner is done but unreleased. Not by me, I don’t have it.