ccminer icon indicating copy to clipboard operation
ccminer copied to clipboard

lyra2v2 on titan V (the new volta card)

Open eruditej opened this issue 7 years ago • 45 comments

I am unsure what the issue is exactly. tpruvot 2.2.3 gives me twice the hashrate of klaust even though you are building sm_70 w/ cuda 9.1

using the latest drivers, tpruvot gives me about 90-100 MH/s klaust gives me 40-42 MH/s

The skew is so large, im suspecting something simple around the profile of the card specific code.

Also, thank you for making your code so much cleaner. I can actually load it in VS and build vs tpru where I have to redo the entire compat system to make it build on anything newer than VS13.

eruditej avatar Dec 22 '17 20:12 eruditej

That's wierd. I would have expected at least 1080Ti speed, which is 60-65 or something. Did you manually set the intensity?

KlausT avatar Dec 22 '17 20:12 KlausT

Tried your stock 8.17: avg 41MH/s Built one as well with the same results. Same results on your 8.16. Wondered if cuda9 was the issue, tried cuda8 builds, same results. Wondered if the changes for Titan cards were the culprit, went back to 8.15. Same results.

Default options only, no modifications to the code.

I am thinking it is maybe missing a match somewhere and doing too few threads in lyra2v2, I am reading the output from here: https://www.reddit.com/r/nvidia/comments/7iq5tk/just_got_a_titan_v_this_morning_ask_me_to/dr1y7x1/

So over a 1000 more cores than a 1080ti. Looking at code now to figure out if threads per block is getting underset due to something something CUDA_ARCH 700 or something causing an issue.

Is there a way to see TPB at runtime printed out?

eruditej avatar Dec 22 '17 22:12 eruditej

You can actually see the threads per block in the code. When you have a kernel call like this for example: lyra2v2_gpu_hash_32_2 <<< grid4, block4, 0, gpustream[thr_id] >>> (threads); then it's the block4 value (32 in this case).

Or you could add two lines of code to each kernel:

if(thread == 0)
  printf("%d %d\n", blockDim.x, blockDim.y);

KlausT avatar Dec 22 '17 22:12 KlausT

In the lyra code, I wanted to make a quick bisection. I tried to force the 1080 code in lyra2v2.cu: /** #if CUDA_ARCH >= 500 */ then i commented out the corresponding else and modified lyra2v2_cpu_hash_32 accordingly.

It failed to build, there isn't any condition around #include cuda_lyra2v2_sm3.cuh that keeps it from not being included.

After I commented out the sm3.cuh header include, build succeeded with the forced >=500 code on.

So I suspect something is wonky in the macro processing and this has been defaulting to lower than 128 TPB for everything.

I tested the new binary, but still at 40MH/s, so this wasn't the issue. But I am still looking.

The ccminer_nanashi code results in i=19 default and 524288 threads at runtime for 99-100MH/s.

I am trying to find the printfs where that code displays that info and add that to my local build of klaust.

eruditej avatar Dec 22 '17 23:12 eruditej

after adding some applog code CUDA_SAFE_CALL(cudaMalloc(&d_hash2, 16ULL * 4 * 4 * sizeof(uint64_t) * throughputmax)); CUDA_SAFE_CALL(cudaMalloc(&d_hash, 8ULL * sizeof(uint32_t) * throughputmax));

	+/** Add diagnostic info like the other ccminer variants */
	+applog(LOG_INFO, "%u cuda threads", thr_id, throughput);

C:\miners\ccminer-817-cuda91-x64>ccminer.exe -a lyra2v2 --benchmark ccminer 8.17-KlausT (64bit) for nVidia GPUs Compiled with Visual Studio 2015 using Nvidia CUDA Toolkit 9.1

Based on pooler cpuminer 2.3.2 and the tpruvot@github fork CUDA support by Christian Buchner, Christian H. and DJM34 Includes optimizations implemented by sp-hash, klaust, tpruvot and tsiv.

[2017-12-22 17:55:02] NVML GPU monitoring enabled. [2017-12-22 17:55:02] 2 miner threads started, using 'lyra2v2' algorithm. [2017-12-22 17:55:02] 524288 cuda threads [2017-12-22 17:55:03] 524288 cuda threads [2017-12-22 17:55:03] GPU #1 Found nonce 804c4e29 [2017-12-22 17:55:03] GPU #1 Found nonce 8065d03e [2017-12-22 17:55:03] GPU #1: TITAN V, 41.24 MH/s [2017-12-22 17:55:03] Total: 41.24 MH/s

So, it says I have as many threads as the nanashi build, I'll have to keep poking.

eruditej avatar Dec 23 '17 00:12 eruditej

By the way, if you can figure out how to use the Nvidia Visual Profiler, then we can maybe find out which CUDA kernel is making problems, if any. With my GTX1070 it looks like this:

42.1% cubehash256_gpu_hash_32
35.7% lyra2v2_gpu_hash_32_2
 7.2% blakekeccak256_gpu_hash_80
 5.5% skein256_gpu_hash_32
 4.7% lyra2v2_gpu_hash_32_1
 2.8% lyra2v2_gpu_hash_32_3
 1.8% bmw256_gpu_hash_32

KlausT avatar Dec 23 '17 01:12 KlausT

I fully merged this change into your code for cubehash256 https://github.com/Myrinia/ccminer/commit/9e1c1739d5d9d3fb6ff4dca76fdbeee0a9dcf047

Result: slightly faster (and cleaner) but still almost less than half as fast as tprot. Looking at the profiler but these tools are new to me.

So, by order of elimination, it isn't cubehash256

eruditej avatar Dec 23 '17 04:12 eruditej

I am still working on this, but this is all new code to me. Myrinia looks to be cleaning up the lyra2 (not v2) code with the same style used to clean cubehash. I'm still assuming that at half hash rates, it has to be something simple and easily overlooked. I started looking at the TBx/gpu_hash in lyra2v2. But none of my tests made a difference so far.

You go with 128 while tpru defaults to 32, but you calculate the _2 differently and you hard code 32 as the launch bound for _2. I tried redoing it his way, but it made no difference.

Everytime I try profiling I get out of memory errors, so I may have to take an hour to get used to the profiler.

That, or I may need to add more applog calls to see what is happening when/where.

I want a good clean codebase that was 9.1 clean (like yours) and builds in VS15/17 so I could test some Tensor core matrix math to see if sped up anything.

eruditej avatar Dec 23 '17 19:12 eruditej

There's also a command line profiling tool: nvprof --timeout 60 D:\code\ccminer\Release\x64\ccminer -a lyra2v2 --benchmark

KlausT avatar Dec 23 '17 19:12 KlausT

==7252== Profiling result:
            Type  Time(%)      Time     Calls       Avg       Min       Max  Name
 GPU activities:   68.11%  38.9521s      4602  8.4642ms  8.4298ms  10.407ms  lyra2v2_gpu_hash_32_2(unsigned int)
                   18.10%  10.3545s      9204  1.1250ms  1.1234ms  1.1280ms  cubehash256_gpu_hash_32(unsigned int, unsigned 
int, uint2*)
                    3.89%  2.22525s      4602  483.54us  482.76us  501.64us  blakeKeccak256_gpu_hash_80(unsigned int, unsigned 
int, unsigned int*)
                    2.63%  1.50461s      4602  326.95us  320.93us  336.16us  skein256_gpu_hash_32(unsigned int, unsigned int, 
unsigned __int64*)
                    2.44%  1.39550s      4602  303.24us  301.16us  307.30us  lyra2v2_gpu_hash_32_1(unsigned int, uint2*)
                    2.43%  1.39045s      4716  294.83us  1.0560us  23.325ms  [CUDA memcpy DtoH]
                    1.53%  876.38ms      4602  190.43us  180.29us  202.08us  lyra2v2_gpu_hash_32_3(unsigned int, uint2*)
                    0.85%  488.74ms      4602  106.20us  103.84us  112.93us  bmw256_gpu_hash_32(unsigned int, unsigned int, 

That gpu_hash32_2! I was close in my previous post just going through the bisection process.

eruditej avatar Dec 23 '17 20:12 eruditej

[2017-12-23 16:14:40] GPU #0 Found nonce 3b5436b7 [2017-12-23 16:14:40] GPU #0: TITAN V, 85.48 MH/s [2017-12-23 16:14:40] GPU #0 Found nonce 3d85beef [2017-12-23 16:14:40] GPU #0: TITAN V, 88.85 MH/s [2017-12-23 16:14:40] GPU #1 Found nonce c032dcf5 [2017-12-23 16:14:40] GPU #1: TITAN V, 91.23 MH/s [2017-12-23 16:14:40] Total: 178.68 MH/s

Closer! went from 88MH/s to 178.68.

Change: added this to lyra2v2_cpu_hash_32: cudaFuncSetAttribute(lyra2v2_gpu_hash_32_2, cudaFuncAttributePreferredSharedMemoryCarveout, 50);

ref: http://docs.nvidia.com/cuda/volta-tuning-guide/index.html#l1-cache ref2: http://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#shared-memory-7-x

Need more testing, and setting it at 50 was just first try. No conditional code either to check for <sm_70

eruditej avatar Dec 23 '17 22:12 eruditej

[2017-12-23 16:26:58] GPU #0 Found nonce 2d0158f0 [2017-12-23 16:26:58] GPU #0: TITAN V, 99.20 MH/s [2017-12-23 16:26:58] GPU #0 Found nonce 2f40fff8 [2017-12-23 16:26:58] GPU #0: TITAN V, 99.89 MH/s [2017-12-23 16:26:58] GPU #1 Found nonce b00ddbf9 [2017-12-23 16:26:58] GPU #1: TITAN V, 100.83 MH/s [2017-12-23 16:26:58] Total: 196.27 MH/s

With the carveout set to 100, it is now on parity with ccminer2.2.3.

eruditej avatar Dec 23 '17 22:12 eruditej

at -i22: [2017-12-23 16:33:58] GPU #0 Found second nonce 1394a30e [2017-12-23 16:33:58] GPU #0 Found nonce 13b2563c [2017-12-23 16:33:58] GPU #0: TITAN V, 109.13 MH/s [2017-12-23 16:33:58] GPU #0 Found nonce 154b0036 [2017-12-23 16:33:58] GPU #0: TITAN V, 110.38 MH/s [2017-12-23 16:33:58] GPU #1 Found nonce 96746b65 [2017-12-23 16:33:58] GPU #1: TITAN V, 112.31 MH/s [2017-12-23 16:33:58] Total: 221.52 MH/s

wooooooooooooooo

eruditej avatar Dec 23 '17 22:12 eruditej

Nice. I have added the line. Would you like to test the binary? ccminer.zip

You are probably the only one using a Titan V, but you never know :-)

KlausT avatar Dec 23 '17 22:12 KlausT

Your ccminer is the only one really leveraging shared memory, and the volta is a shared memory speed beast.

Still doing some testing on my own.

probably needs to be wrapped in: if (cuda_arch[thr_id] >= 700)

eruditej avatar Dec 23 '17 23:12 eruditej

Looks like my 1070 doesn't care. I think I willl have to fix the cryptonight miner too, since there's shared memory too.

KlausT avatar Dec 23 '17 23:12 KlausT

Then this issue would be resolved.

eruditej avatar Dec 24 '17 02:12 eruditej

But since I was already in cubehash256:

__global__
void cubehash256_gpu_hash_32(uint32_t threads, uint32_t startNounce, uint2 *g_hash)
{
	uint32_t index = blockIdx.x * blockDim.x + threadIdx.x;
	uint32_t stride = blockDim.x * gridDim.x;
	for (int thread = index; thread < threads; thread += stride)
	{

		uint2 Hash[4];

		Hash[0] = __ldg(&g_hash[thread]);
		Hash[1] = __ldg(&g_hash[thread + 1 * threads]);
		Hash[2] = __ldg(&g_hash[thread + 2 * threads]);
		Hash[3] = __ldg(&g_hash[thread + 3 * threads]);

		uint32_t x[2][2][2][2][2] =
		{
			0xEA2BD4B4, 0xCCD6F29F, 0x63117E71, 0x35481EAE,
			0x22512D5B, 0xE5D94E63, 0x7E624131, 0xF4CC12BE,
			0xC2D0B696, 0x42AF2070, 0xD0720C35, 0x3361DA8C,
			0x28CCECA4, 0x8EF8AD83, 0x4680AC00, 0x40E5FBAB,
			0xD89041C3, 0x6107FBD5, 0x6C859D41, 0xF0B26679,
			0x09392549, 0x5FA25603, 0x65C892FD, 0x93CB6285,
			0x2AF2B5AE, 0x9E4B4E60, 0x774ABFDD, 0x85254725,
			0x15815AEB, 0x4AB6AAD6, 0x9CDAF8AF, 0xD6032C0A
		};

		x[0][0][0][0][0] ^= Hash[0].x;
		x[0][0][0][0][1] ^= Hash[0].y;
		x[0][0][0][1][0] ^= Hash[1].x;
		x[0][0][0][1][1] ^= Hash[1].y;
		x[0][0][1][0][0] ^= Hash[2].x;
		x[0][0][1][0][1] ^= Hash[2].y;
		x[0][0][1][1][0] ^= Hash[3].x;
		x[0][0][1][1][1] ^= Hash[3].y;

		rrounds(x);
		x[0][0][0][0][0] ^= 0x80U;
		rrounds(x);

		Final(x, (uint32_t*)Hash);

		g_hash[thread] = Hash[0];
		g_hash[1 * threads + thread] = Hash[1];
		g_hash[2 * threads + thread] = Hash[2];
		g_hash[3 * threads + thread] = Hash[3];
	}
}

__host__
void cubehash256_cpu_hash_32(int thr_id, uint32_t threads, uint32_t startNounce, uint64_t *d_hash)
{
	int blockSize;
	int minGridSize;
	int gridSize;

	cudaOccupancyMaxPotentialBlockSize(&minGridSize, &blockSize,
		(void*)cubehash256_gpu_hash_32, 0, threads);
	gridSize = (threads + blockSize - 1) / blockSize;

	cubehash256_gpu_hash_32<<<gridSize, blockSize, 0, gpustream[thr_id]>>>(threads, startNounce, (uint2*)d_hash);
	CUDA_SAFE_CALL(cudaGetLastError());
	if(opt_debug)
		CUDA_SAFE_CALL(cudaDeviceSynchronize());
}

ref: http://docs.nvidia.com/cuda/cuda-c-programming-guide/#occupancy-calculator

This is just a demonstration, for performance it is currently a no-op versus the hard coded values since they match. However, this gets rid of ugly invariants and makes the code more future-proof.

I had deleted the arch<500 code for troubleshooting, not for the purpose of this test. Here I just wanted to make sure our guesses about TPB/threads/etc were correct, and found that we were hard coding things cuda provides already.

Sadly, I was looking at the cuda9 feature cooperative groups and trying to figure out if it added anything, stumbled onto grid-stride (which this includes), and found : https://devblogs.nvidia.com/parallelforall/even-easier-introduction-cuda/

Then I realized none of this cubehash code looked like modern deterministic cuda.

eruditej avatar Dec 24 '17 03:12 eruditej

You are probably the only one using a Titan V, but you never know :-)

I can also contribute testing on Titan V + Linux(I'm installing Titan V on Ubuntu 16.04)

I have tested cuda9 branch on Ubuntu 16.04 but faced this issue:

https://github.com/KlausT/ccminer/issues/84

(It sill works when I use --no-cpu-verify but hashrate is not good. 87 MH/s)

Also doing this

added this to lyra2v2_cpu_hash_32: cudaFuncSetAttribute(lyra2v2_gpu_hash_32_2,cudaFuncAttributePreferredSharedMemoryCarveout, 50);

resulted in wrong computation(shares rejected).

syoyo avatar Jan 03 '18 15:01 syoyo

I am terribly sorry for this novice question, but I cannot get ccminer to work properly. I have downloaded it from github, and most of the locations didn't come with the ccminer.exe file, which is located in this thread. However even after doing so and moving a few other files around, it still doesn't launch any algorithm. I assume it has to do with there not being a "config" file, but I am not sure what it needs to look like or copying and pasting a similar file into that folder would work.

I have a Titan V as well, I am familiar enough with software and reading code, but this is clearly beyond my realm of knowledge. I am trying to maximize the usage of my Titan V but for instance lyra2z algorithm simply causes my computer to freeze, neoscrypt has high output as well but doesn't work when it's ccminer or NVIDIA based. What I am not understanding is why the "excavator" versions of those scripts are working, but none of the rest? I believe my issue lies in the solution you found in the ccminer.exe file you added in comments, which brings me to my first question: what am I doing wrong with the setup/installation?

Sephiroth4269 avatar Feb 19 '18 17:02 Sephiroth4269

You are using Windows, right? Here is a version with the latest changes included: ccminer-820-Feb19-cuda91-x64.zip You can start mining neoscrypt like this:

ccminer.exe -a neoscrypt -i 18 -o stratum+tcp://hub.miningpoolhub.com:12012 -u username -p password Replace the 18 with the highest value that still works.

KlausT avatar Feb 19 '18 18:02 KlausT

I am once again sorry for this novice question, you have already gone above and beyond what I would expect, or have hoped. I wrote start.bat file with that information, correctly changing username/password to the proper information, and it still didn't run. Is it a .cmd file, or what?

I keep getting a stratum error.

stratum error

Sephiroth4269 avatar Feb 19 '18 19:02 Sephiroth4269

Looks like username or password are not correct. What pool do you want to use?

KlausT avatar Feb 19 '18 19:02 KlausT

I think I figured it out. Now here's another question. I was using the multipoolhub algorithm and software. There are apparently different coins that can be "mined" using the same algorithm, as well as different iterations of the same algorithm: ccminer.nanashi, nvidia.nanashi for example.

According to the readout one is exponentially more valuable than the other. How does ccminer determine that?

Sephiroth4269 avatar Feb 20 '18 06:02 Sephiroth4269

what does this mean

Alright I'm just lost now...this happened after I ran on neoscrypt...

Sephiroth4269 avatar Feb 20 '18 07:02 Sephiroth4269

Something's wrong there. I don't know if it's my code, or your card. Maybe I will have to make one or two small changes.

KlausT avatar Feb 20 '18 15:02 KlausT

I'm not sure, what I am noticing is that tuning algorithms with the titan v doesn't seem to work at all. Whether I change threads, blocks, difficulty divider, etc. Not a DAMN thing changes...

I know the titan v is new technology, but with current mining scripts it is leaving a whole lot of power on the table. In theory older scyrpts using older CUDA makes it look like the titan v can do exponentially more than during its 9.1v.

There has to be a solution somewhere in the middle that is correct. im thinking the issues im having and mentioned in this thread are a result of underutilization of the cores...

Now another issue is that, on the same program it freezes on neoscrypt when lyra2z dips... What would cause my computer to freeze dyring mining? I never had this issue with nicehash2, and from what I can tell only with neoscrypt palgin.

Sephiroth4269 avatar Feb 20 '18 15:02 Sephiroth4269

Would you please test if you are still getting validation errors with neoscrypt with this version? ccminer-820-Feb20-cuda91-x64.zip

on the same program it freezes on neoscrypt when lyra2z dips

What do you mean with this? Are you trying to run two mining programs at the same time?

KlausT avatar Feb 20 '18 18:02 KlausT

I am still getting the nonce error. I am not running two at a time what happens is the software determines the most profitable algorithm at the time and switches to it. I think I'm still struggling with the issue that the hash rates that the program says it gives, versus what the pool receives are way out of whack. Secondly it's to the point where Nice hash 2 won't even run daggermoto pascal algorithm...

I'm trying to find a reason to keep the Titan V but I keep having ridiculous issues that are way above my skill set to troubleshoot. The Titan V should be able to mine .001+ BTC worth of crypto a day. None of the current algorithms that "function" are doing so. I can use claymore and it works but my cards get really hot. Yet it doesn't work via minergate when I try and use that program? I'm a little lost as too the nature of my problems which seem to be solely compatibility issues.

What is the correct software, the correct algorithm, and correct settings for my Titan V to maximize it's usability in mining because so far I have hit a brick wall.

Sephiroth4269 avatar Feb 20 '18 18:02 Sephiroth4269

same issue whats the problem I'm thinking it could be the port in this particular instance, the mining pool port. Can you perhaps share with me a known correct mining pool address, I'll make an account if need be and see if that is the issue. Otherwise maybe Neoscrypt simply isn't compatible with titan V?

Sephiroth4269 avatar Feb 20 '18 18:02 Sephiroth4269