Lesson 5 - Optimizing GPU Programs

01 - Optimizing GPU Programs

Today we're going to talk about optimizing GPU programs. The whole reason we want to use a parallel processor like the GPU is to solve problems faster. And in turn, the reason we want to solve problems faster might simply be that: we want the same problems solved sooner. More often, it's because we want to solve bigger problems, or solve more problems. The good news is that the first, initial port of a problem often gets a speedup, assuming you had a parallel problem to begin with. In my experience, this is actually when a lot of GPU programmers get hooked: over the weekend, out of curiosity, they try porting some piece of their existing CPU code to the GPU, and they get a nice speedup, say 5x or 8x. That's what gets them hooked and makes them realize they could put more effort into this and get a bigger speedup. So if you have a naturally parallel problem, it's often the case that the first, initial CUDA port will get you a good speedup, and that's great. But by definition, GPU programmers care about performance; that's why they're using the GPU. That means they often want to spend additional effort to maximize the speedup beyond that first attempt. So in this unit we're going to talk about how to optimize GPU programs.

Cast your mind back to Unit 2, when we talked a little bit about some basic principles of efficient GPU programming. Check which of these principles accurately correspond to things we talked about in Unit 2:

- Do we want to decrease arithmetic intensity?
- Do we want to decrease the time spent on memory operations per thread?
- Do we want to coalesce global memory accesses?
- Do we want to do fewer memory operations per thread?
- Do we want to avoid thread divergence?
- Do we want to move all data to shared memory?
02 - Optimizing GPU Programs

Arithmetic intensity is basically defined as the amount of math per memory access. To be a little more precise, it's the amount of useful work, essentially math operations, that we do per unit of time spent on memory accesses. So we want to maximize arithmetic intensity, not minimize it, and spending less time on memory operations clearly helps. The single most important way to do that is to coalesce your global memory accesses. Now, doing fewer memory operations per thread may or may not help. For example, we might want to move data into fast shared memory, do some operations on it there, and then move it back. That is more total memory operations, but the overall time spent accessing memory goes down, because the frequent accesses happen within shared memory. So this one is not necessarily correct. By the same token, we don't necessarily want to move all data to shared memory, because that could be an unnecessary memory operation; what we really care about is moving frequently accessed data to shared memory. So these three are not correct. Finally, the GPU does run most efficiently when adjacent threads take the same path through the code, which means it's good to avoid thread divergence. In this unit we'll give a lot more explanation, examples, and specifics for all three of these general guidelines.
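To make that last point concrete, here is a minimal sketch, not from the lecture, of staging frequently reused data in shared memory. The 3-point-average kernel, its name, and the fixed block size of 128 are all illustrative assumptions:

    #include <cuda_runtime.h>

    // Hypothetical kernel (not from the lecture): a 3-point average where each
    // input element is read by three different threads, so we stage it once in
    // shared memory. Assumes blockDim.x == 128 and n is a multiple of 128.
    __global__ void blur3_shared(const float *in, float *out, int n)
    {
        __shared__ float tile[128 + 2];          // one block's data plus a halo cell on each side
        int gid = blockIdx.x * blockDim.x + threadIdx.x;
        int lid = threadIdx.x + 1;               // shift past the left halo slot

        tile[lid] = in[gid];                     // one coalesced global load per thread
        if (threadIdx.x == 0)                    // first thread also loads the left halo
            tile[0] = (gid == 0) ? 0.0f : in[gid - 1];
        if (threadIdx.x == blockDim.x - 1)       // last thread loads the right halo
            tile[lid + 1] = (gid == n - 1) ? 0.0f : in[gid + 1];
        __syncthreads();                         // all loads finish before anyone reads the tile

        // The three reads per thread now hit fast shared memory, not DRAM.
        out[gid] = (tile[lid - 1] + tile[lid] + tile[lid + 1]) / 3.0f;
    }

    // Example launch, assuming n is a multiple of 128:
    // blur3_shared<<<n / 128, 128>>>(d_in, d_out, n);

The pattern is one coalesced global load per element, followed by several cheap shared-memory reads, which is exactly the frequently-accessed-data case the quiz is getting at.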
03 - Levels of Optimization Part 1

Now, just as on a CPU, there are different levels of optimization on the GPU. There's picking good algorithms in the first place; there are basic principles for writing efficient code; there are architecture-specific detailed optimizations; and there's bit-twiddling micro-optimization at the instruction level. Let's use a quiz to explore a few CPU examples. There's the use of vector registers such as SSE and AVX. There's the use of something like mergesort, which runs in order n log n time, versus insertion sort, which runs in order n squared time. Next is writing cache-aware code, meaning code that is likely to make efficient use of the cache; for example, it's generally faster to traverse across the rows of a large 2D array than down the columns, because, assuming the array is laid out in row order, you'll get better cache performance that way (a quick sketch follows).
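Here is that traversal-order point as a sketch in host C++; it's illustrative, not from the lecture, and the function names are assumptions:

    #include <vector>

    // Illustrative only: sum a row-major N x N array two ways.
    // Element (row, col) lives at a[row * N + col].
    float sum_by_rows(const std::vector<float>& a, int N)
    {
        float sum = 0.0f;
        for (int row = 0; row < N; ++row)        // fast: consecutive iterations touch
            for (int col = 0; col < N; ++col)    // adjacent addresses, so each cache
                sum += a[row * N + col];         // line fetched from DRAM is fully used
        return sum;
    }

    float sum_by_cols(const std::vector<float>& a, int N)
    {
        float sum = 0.0f;
        for (int col = 0; col < N; ++col)        // slow: consecutive iterations are
            for (int row = 0; row < N; ++row)    // N * sizeof(float) bytes apart, so
                sum += a[row * N + col];         // most accesses miss the cache
        return sum;
    }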
Or you can do explicit cache blocking, for example for the L1 cache in a CPU core. Or, finally, you could approximate the inverse square root of a floating-point number by shifting its bits right by one and subtracting the result from the magic integer constant 0x5f3759df.
04 - Levels of Optimization Part 1

All right, let's get through these. Picking good algorithms is obviously the most important level of optimization, and when sorting a large random list, an order n log n algorithm like mergesort is just intrinsically going to be faster than an order n squared algorithm like insertion sort. Clearly this is a case of choosing the right algorithm. Much of high-performance CPU programming relies on making efficient use of the cache. That's the number one thing to keep in mind when you're trying to write efficient CPU code: there's a big cache hierarchy and you want to write code that makes good use of it. So I would consider that an example of optimization level 2; it's a basic principle of efficiency on a CPU. On the other hand, blocking for the L1 cache, which means carefully sizing your working set to fit exactly in the per-core cache on the CPU, is a detailed optimization. It depends on the exact CPU model you're using, because different CPU cores have different-sized L1 caches. I would also put the use of vector registers like SSE or AVX intrinsics into this category: not every CPU has SSE, some have AVX, some have neither. So the use of vector registers, and the choice of blocking for the L1 cache, I would consider architecture-specific detailed optimizations.

These kinds of architecture-specific optimizations tend to be the domain of what I call ninja programmers. We'll touch on a few ninja-style GPU topics, like shared memory bank conflicts, and I'll try to highlight these ninja topics with a little icon. The idea is that these are not needed, or necessarily accessible, to every programmer; they tend to matter when you're chasing the last few percent of speedup. I think this is one of the big differences between CPU and GPU programming. On a CPU, these ninja-level optimizations can make a really big difference: if you ignore the existence of SSE registers, or AVX registers on more modern processors, you're only getting a fourth or an eighth of the power of each core on your CPU, so you're taking a huge hit in potential performance. On the GPU, as a rule of thumb, the speedups to be gained from these ninja optimizations are usually comparatively minor; we're talking more like a 30% or an 80% speedup from the techniques we'll mark with the ninja icon. Compare that to the speedup you might hope to get from picking the right algorithm in the first place, or from just obeying the basic principles of efficiency on a GPU, such as coalescing your global memory accesses: these often make a difference of three times, ten times, sometimes more. So doing a good job on the higher-level principles of optimization matters more on the GPU. The ninja-level optimizations definitely help, and sometimes you can get a real speedup from them, but they're not the vital step for extracting maximum performance the way they are on a CPU, where if you don't do something to make sure you're using your vector registers, you take a huge hit in the efficiency you get out of a modern CPU.

Finally, the last one I really just threw in for fun. Reinterpreting a floating-point number's bits as an integer, shifting right by one, and subtracting the result from a magic constant is firmly in the category of micro-optimization at the bit-twiddling instruction level. This is an infamous hack that's been used in many places, most famously in the video game Quake 3, and if you're curious, go to Wikipedia and look up fast inverse square root. Of course, tricks like this are firmly in the ninja category, and we won't be talking about them at all today or in the rest of the course.
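For the curious, here is that hack essentially as it circulates on the Wikipedia page for fast inverse square root, reproduced purely as a curiosity:

    // Fast inverse square root, as popularized by Quake 3.
    // Historic code: it assumes 32-bit long and type-punned pointers, which is
    // not strictly portable, and it's never worth it on modern hardware that
    // has rsqrt instructions.
    float Q_rsqrt(float number)
    {
        long i;
        float x2, y;
        const float threehalfs = 1.5F;

        x2 = number * 0.5F;
        y  = number;
        i  = *(long *)&y;                       // reinterpret the float bits as an integer
        i  = 0x5f3759df - (i >> 1);             // shift right one, subtract from the magic constant
        y  = *(float *)&i;                      // reinterpret back to a float
        y  = y * (threehalfs - (x2 * y * y));   // one Newton-Raphson iteration refines the guess
        return y;
    }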
05 - Levels of Optimization Part 2

Let me give a few examples on the GPU. On the GPU, picking good algorithms really means picking algorithms that are fundamentally parallel. My favorite example is related to the CPU example where I compared mergesort, which is order n log n, with insertion sort, which is order n squared. In fact, most CPU algorithms texts would probably tell you that heapsort, which is also order n log n, is a better choice than either of these; it tends to run a little better on the CPU. But it turns out that heapsort is all about updating a shared data structure called the heap, and that's a very hard thing to do in parallel. So in practice, mergesort is a much better choice on a parallel machine like the GPU. In terms of basic principles for efficiency, we've already discussed some of these: a big one is coalescing global memory accesses, and another is using shared memory, which is faster, for frequently accessed data within a thread block. We'll just touch on a few GPU ninja topics, like optimizing away shared memory bank conflicts and optimizing the use of registers. And I've certainly seen my share of GPU bit-twiddling micro-optimization; one example that comes to mind is the use of floating-point denormals to abuse the floating-point arithmetic units for additional integer math performance. That kind of stuff is firmly in the ninja category: it's almost never worth the time or the code obfuscation that results, and we won't be talking about it at all today.
06 - APOD Part 1

You can find entire books and entire courses on how to go about analyzing and optimizing code. The real point is simply that you should have a systematic optimization process. So I'm going to introduce one systematic optimization process called APOD, which stands for Analyze, Parallelize, Optimize, and Deploy. The real point of this process is that it's a cycle. There's nothing really new here; this is just good software engineering discipline. But the point is that we want to break the process of optimization down into several steps, and to remember not to short-circuit a really important one: actually deploying the optimizations you've made into the real world, getting some feedback, and seeing how they perform on real-world datasets. It's too easy to spend all your time circling around the parallelize and optimize steps, optimizing in a vacuum, forgetting that you might get a bigger speedup by thinking harder about what you're trying to accomplish, or by taking a partial optimization into the real world, seeing how it performs, and getting guidance from that. So this is really nothing more than good software engineering discipline. Let's take these steps one at a time.
07 - APOD Part 2

The analyze stage is really all about profiling the whole application: looking not just at the kernels you intend to parallelize, but at the whole thing, and trying to figure out where the application can benefit from parallelization, and how much benefit you can expect. Once you've decided that there's a region of the code you need to parallelize, there are different approaches to doing it. Certainly no code is better than the code you don't have to write at all, so if you can find a parallel library for the GPU that already does exactly what you want, then great, you're done; you can simply call out to that library. Sometimes you have a lot of code and you want to do a minimal amount of work to get a little bit of parallelization, and there's an approach for this called directives. The most well-known directive standard on the CPU is OpenMP. There's another cross-platform standard called OpenACC that has emerged, where ACC stands for accelerator; it's essentially an extension of the OpenMP idea to encompass accelerators like the GPU. So if you're looking at GPU programming, OpenACC is a really lightweight way to experiment with it; a minimal sketch follows.
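Here is a small illustration of the directives approach; the SAXPY loop and the particular data clauses are illustrative assumptions, but #pragma acc parallel loop is the core OpenACC construct:

    // Illustrative OpenACC example (not from the lecture): the directive asks
    // the compiler to offload the loop to an accelerator and manage the data
    // movement described by the copyin/copy clauses.
    void saxpy(int n, float a, const float *x, float *y)
    {
        #pragma acc parallel loop copyin(x[0:n]) copy(y[0:n])
        for (int i = 0; i < n; ++i)
            y[i] = a * x[i] + y[i];
    }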
But of course, sometimes what you want to do is actually go in and write a parallel routine yourself, and naturally that's what we've been focusing on throughout this course using CUDA C++, so that's where we'll focus here. Assuming you're going to be coding something up from scratch, the next choice is how to pick an algorithm. This is a huge deal; it's your real chance to make a huge improvement. All the bit-twiddling optimization in the world won't gain you nearly as much as choosing the right, fundamentally parallel-friendly algorithm. I'll give a couple of simple examples in this unit, and you'll see more examples in Unit 6. There's no recipe for picking the right algorithm; what you need to do is think deeply about what the parallelism in your problem is. But we'll try to give you a few examples so that you get a little practice.

Once you've decided how to parallelize your algorithm, you want to optimize the code, and there will be a cycle between these two steps. As you'll see in the example, you try a parallelization, you study how well it does, and what you learn may suggest changes, or even a different way to approach the parallelization. So we're really talking about profile-driven optimization, by which I mean: measure it. Measure, measure, measure how fast things are going, and base your decisions on that. Don't just guess at what's going to work well and what isn't.

And finally, the deploy step. In this class we do these little homeworks, and they're fairly self-contained so you can get them done in a reasonable amount of time, so the process of actually deploying GPU-accelerated code into real use is not going to come up a lot. So consider this free software engineering advice: when you're working on real code, it's a really, really bad idea to optimize in a vacuum. What I mean is that you can easily spend tons of time adding tons of unnecessary complexity to your code, speeding up a kernel way past the point where it's no longer a bottleneck. It's fundamentally a good idea to push any improvements through to the end code; you're making it real by doing that, and making it real as soon as possible ensures that you're running, profiling, parallelizing, and optimizing real workloads. The other thing to remember is that even small improvements are useful. If you make your code 50% faster, or two times faster, or four times faster, that's useful: you can push it out to the users of your code and start deploying it, even if you think there's a factor-of-20 speedup still in the wings. So the advice here, again, is: be disciplined. Analyze what you're doing. Decide how you're going to parallelize it, and decide how you're going to optimize it, by studying the code and measuring it, using profile-driven optimization. And finally, be sure to deploy frequently: deploy early and often. Remember that this whole thing is intended to be a cycle.
08 - Weak Vs Strong Scaling

It's also important to understand what you're trying to accomplish by parallelizing the code. Say you have a serial program that solves a certain problem of size P in time T; maybe you're simulating how a certain protein molecule folds, and that simulation takes an hour to run. Weak scaling is what we refer to when we use parallel computing to run a larger problem, or perhaps to run more problems. So if you wanted to fold a bigger protein in your simulation and still take an hour, or you wanted to fold many small proteins in the same hour, that would be an example of weak scaling. Strong scaling means using parallel computing to run a single, fixed-size problem faster. So if you wanted to fold the same protein you were folding in an hour, but fold it in a minute, that would be an example of strong scaling. More formally, weak scaling describes how the solution time varies as you add processors with a fixed problem size per processor, while strong scaling describes how the solution time varies as you add processors to tackle a fixed total problem size.
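Stated with symbols, these are the standard definitions (not from the lecture), with t_1 the time on one processor and t_N the time on N processors:

    \text{strong scaling speedup: } S(N) = \frac{t_1}{t_N}
        \quad (\text{total problem size fixed; ideally } S(N) = N)

    \text{weak scaling efficiency: } E(N) = \frac{t_1}{t_N}
        \quad (\text{problem size per processor fixed; ideally } E(N) = 1)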
09 - Analyze

Now, there are really two components to the analyze stage: understanding where your application spends its time, and understanding what you want to do with the additional parallelism. Understanding hotspots is pretty straightforward. Often the programmer will have some intuition for this, some idea of where the time is being spent in their code. But intuition can be wrong, and there's no substitute for data, so don't rely on intuition. Run your favorite profiler, whether that's gprof or VTune or the brilliantly named Very Sleepy profiler, and look at how much time the various functions, or lines of code, take. You'll get some output back telling you how much time each function takes: this function might take 10% of the total time, that one 20%. In this case, the program is spending 50% of its time in this one function, so there's a clear hotspot. Maybe your intuition would have told you that this is where the time is going; often you do know that first hotspot, but sometimes the second and third hotspots surprise you, and they're not what you think. This gives you a starting point for where to begin parallelizing the code, and also an indication of how much speedup you can expect from doing so. Let's try that with a quiz. In the example we've shown here, the question is: what is the maximum speedup possible that you could hope for if you parallelized this top function, the function which takes the most time? And then: what if you parallelized the top three functions, all of these?
10 - Analyze

The top function takes 50% of the runtime, which means that even if you made it take zero time, the program would only go twice as fast. So you get a 2x speedup. The top three functions between them take 50%, 20%, and 10%, or 80% of the total time. So if you were to make those all drop to zero, the program would take 20% of the time it used to take, and you'd get a 5x speedup.
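Written out as arithmetic, and anticipating Amdahl's law from the next section: if a fraction p of the runtime is sped up by a factor s and everything else is untouched, then

    \text{speedup}(p, s) \;=\; \frac{1}{(1 - p) + p/s} \;\le\; \frac{1}{1 - p}

so in the limit of infinite s, p = 0.5 gives at most 1/(1 - 0.5) = 2x, and p = 0.8 gives at most 1/(1 - 0.8) = 5x, matching the quiz answers.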
11 - Understanding Hotspots

This is an example of a really key principle in parallel programming, and indeed in all of computer science. We call that principle Amdahl's law: the total speedup you get is limited by the portion of time spent doing something that can be parallelized. In the limit of enough parallelism, that is, doing a good enough job with parallelism on a parallel enough machine, your maximum speedup goes to 1/(1 - p), where p is defined as the fraction of time spent on the parallelizable activity. It's really important to know these limits. For example, if your application spends 50% of its time on I/O, reading and writing to the disk, that implies you can't possibly do better than a 2x speedup. Unless, that is, you can come up with a way to refactor the computation; for example, maybe you can read and write larger batches of data at once, so the GPU has more work to do on a single batch. And I want to say that this situation comes up pretty often in practice: GPUs are so fast that often porting the hotspot to CUDA makes that hotspot so much faster that it's no longer the bottleneck. That's why I made the point earlier that if you have a bunch of functions, you might need to do more than port a single hotspot to the GPU; you might find that after you've ported that first hotspot, it's gotten so much faster that you now need to focus somewhere else. And remember, once you've crossed that point, there's little point in continuing to optimize the original hotspot: the total time of your application is not going to go down very much if you keep working on it. You need to shift your attention to the new bottlenecks.
12 - Parallelize

Okay, enough about analyze; let's talk about parallelize. We're going to use a running example to illustrate the concepts we'll cover through the rest of this unit, and that example is matrix transpose. It's a very simple algorithm that still has enough meat in it to illustrate a lot of the points we want to bring up. Matrix transpose just takes some matrix and exchanges its rows and columns. So a particular element in this row, for example, might have coordinates (a, b): its i coordinate is a and its j coordinate is b. In the transposed matrix, its new i coordinate will be b and its new j coordinate will be a. Pretty simple. To make our code even simpler, we're actually going to restrict ourselves to square matrices. And you'll remember that this came up all the way back in Unit 2 as a really common parallel communication pattern: transposing a matrix, or doing the equivalent operation on an array, an image, or a structure of arrays, comes up literally all the time. So even though it's a simple example, as you'll see, there's a lot of opportunity to optimize it, and it's a really important one.
13 - Transpose Code Example Part 1

Okay, let's work through some code. In this series of coding exercises, I'm going to be transposing a matrix, and the coordinates of this matrix are i and j: i refers to the column of an element, and j refers to the row of an element. The matrix is laid out in row-major order, meaning that in memory, all the elements of a given row are adjacent, followed by the next row, and then the next, and so on. And remember that our goal is to take every element (i, j) and switch it with element (j, i); that's what the code is going to do. Let me slide the picture over here as a reference and bring in the code window. Here's how you would code transpose on a serial machine like a CPU; it's really simple code. I pass in two matrices, in and out, and I walk through a nested for loop: the outer loop runs over the rows, j, and within it the inner loop runs over the columns, i. And this is where you can tell that the matrix is laid out in row-major order: element (i, j) of the input matrix maps to element (j, i) of the output matrix. Really simple. And here's how you would call that code. I allocate a few matrices, an input matrix and an output matrix, and I call a little utility routine called fill_matrix, which just puts a bunch of consecutive numbers into the matrix. I call the transpose routine and put the result in a matrix I call gold, meaning that this is our golden reference: we know this code is correct, and we're going to compare the various GPU versions that we write against this golden reference. Then I print it. A sketch of this serial version follows. Okay, let's compile and run this code.
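The course ships its own version of this file; what follows is a minimal sketch consistent with the description above. The fill loop inlines the lecture's fill_matrix helper, and the printing loop stands in for its printing utility; both are assumptions:

    #include <cstdio>
    #include <cstdlib>

    const int N = 8;    // small to start with, so we can eyeball the output

    // Serial CPU transpose. Row-major layout: element (row j, col i) lives at
    // index i + j*N, so this sets out(row i, col j) = in(row j, col i).
    void transpose_CPU(float in[], float out[])
    {
        for (int j = 0; j < N; j++)
            for (int i = 0; i < N; i++)
                out[j + i * N] = in[i + j * N];
    }

    int main()
    {
        int numbytes = N * N * sizeof(float);
        float *in   = (float *) malloc(numbytes);
        float *gold = (float *) malloc(numbytes);

        for (int k = 0; k < N * N; k++)      // consecutive numbers, like fill_matrix
            in[k] = (float) k;

        transpose_CPU(in, gold);             // golden reference for the GPU versions

        for (int j = 0; j < N; j++) {        // print the result to check it by eye
            for (int i = 0; i < N; i++)
                printf("%4.0f ", gold[i + j * N]);
            printf("\n");
        }
        free(in);
        free(gold);
        return 0;
    }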
14 - Transpose Code Example Part 2

We compile it and run it. I made a really small matrix to begin with, so we can just see that it's behaving correctly. As you see, our input matrix goes from 0 to 63, and the transposed version runs through the same values down the columns instead of across the rows. So we've written correct code, as we expected. We'll keep this routine for generating the transpose in the gold matrix and use it for comparison, so we don't have to keep printing out these matrices; for the rest of the lecture, I'll leave a smaller window so that we can see the results without actually printing the matrices. Okay, so that's how you code transpose on a serial machine like a CPU, and that's how you call it. Let's do our first CUDA version. We copy the serial function, paste it, make it a kernel, and rename it. Now we've got a kernel that will operate on the GPU, and we call it. We need to do a few things: we need to allocate space for the input and output matrices on the device, and we need to copy the input matrix onto the device. Then we launch the CUDA kernel. What you see here is that I've taken that same serial loop, made it a kernel, and called it on a single thread running in a single block, passing in d_in and d_out. Then I copy back my result, and we're done. See, that was quick; GPU programming is easy!

I'm actually making an important point here. This code is going to run correctly, and it was really easy to write. The problem, of course, is that it won't perform very well: it's using a single thread, which is a really small fraction of the horsepower of a GPU, especially a high-end GPU that can run tens of thousands of threads concurrently. So let's go ahead and time it. I've added some timer code: we start the timer and measure the kernel, and I've also added a helper function called compare_matrices, which just compares the output to the golden matrix we calculated, to make sure we did the transpose correctly. We compile again and run again. Great: we performed the transpose successfully, and it took some very small fraction of a millisecond. But this was a really small problem. Let's make the problem bigger, say a 1024 by 1024 matrix. Now we've got roughly a million elements to touch, and we're going to do this in a single thread. Now it's taking almost half a second; that's a long time.
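Here's a sketch of that first CUDA version, consistent with the lecture's description; the timer and compare_matrices helpers from the course code are omitted:

    #include <cstdlib>
    #include <cuda_runtime.h>

    const int N = 1024;

    // Same loop as the serial version, now marked __global__ so it runs on the
    // GPU. Launched as <<<1, 1>>>, a single thread still does all the work.
    __global__ void transpose_serial(float in[], float out[])
    {
        for (int j = 0; j < N; j++)
            for (int i = 0; i < N; i++)
                out[j + i * N] = in[i + j * N];
    }

    int main()
    {
        int numbytes = N * N * sizeof(float);
        float *in  = (float *) malloc(numbytes);
        float *out = (float *) malloc(numbytes);
        for (int k = 0; k < N * N; k++) in[k] = (float) k;

        float *d_in, *d_out;
        cudaMalloc(&d_in,  numbytes);                // device-side input and output
        cudaMalloc(&d_out, numbytes);
        cudaMemcpy(d_in, in, numbytes, cudaMemcpyHostToDevice);

        transpose_serial<<<1, 1>>>(d_in, d_out);     // one thread in one block

        cudaMemcpy(out, d_out, numbytes, cudaMemcpyDeviceToHost);
        cudaFree(d_in); cudaFree(d_out);
        free(in); free(out);
        return 0;
    }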
15 - Transpose Code Example Part 3

So, let's keep track of what we're doing. Version 1 is serial, in a single thread, and it took about 466 milliseconds. That's pretty slow, but sometimes that's okay. For code that's only going to be executed once, for code that's not performance-critical at all, or for code that runs on a really small dataset, like that 8 by 8 matrix we started with, it's just not worthwhile to optimize the heck out of it. So even though this simple serial kernel may seem very naive, sometimes it really is the right thing to do. Keep this in mind when you're optimizing: think about what you need to optimize, and whether it's important. Now, let's assume that performance is in fact critical in this section; that's why we're optimizing it. Let's go back to the code and see what we can do.

An easy step would be to launch one thread for each row. Here's the code that does that: we launch one thread per row of the output matrix, so the value of i is fixed by the thread index, and every thread executes just the outer loop of the code we saw before; the inner loop is essentially handed off to run across many different threads. These two codes are almost identical, and the only difference is that we're launching threads instead of looping over values of i (see the sketch after this section). Let's time it. Here's the code for calling our new function: we launch transpose_parallel_per_row as a kernel, in a single thread block consisting of N threads; remember, N is the size of our matrix, currently 1,024. We pass in the input matrix, get back the output matrix, copy it, and then print out the timing and verify the result. Let's compile and run this code. Transpose serial ran in 484 milliseconds, roughly what we saw before. Transpose parallel per row runs in 4.7 milliseconds, so obviously we're making a huge improvement just by parallelizing across the threads of a single thread block. Let's note that down: 4.7 milliseconds, roughly a 100x improvement.
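A sketch of the per-row version, consistent with the description above (N, d_in, and d_out as in the earlier sketch):

    // One thread per row of the output: thread i copies input column i into
    // output row i. Only the outer loop remains, and it runs once per thread.
    __global__ void transpose_parallel_per_row(float in[], float out[])
    {
        int i = threadIdx.x;               // i is fixed by the thread index

        for (int j = 0; j < N; j++)        // each thread still loops over j
            out[j + i * N] = in[i + j * N];
    }

    // Launch: a single block of N threads (N = 1024, the block-size limit).
    // transpose_parallel_per_row<<<1, N>>>(d_in, d_out);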
16 - Transpose Code Example Part 4

So, for our next version, let's parallelize this per element: we essentially replaced the inner loop with a thread launch, so now let's replace the outer loop as well. I'm going to leave this as a programming exercise for you. We'll give you the same code we've been using here, with a little more instrumentation that should be pretty self-explanatory. Go ahead and run the code and verify the timings; they might be different on whatever GPU and system you're running. Then add a new kernel that performs the transpose on a per-element basis, and see how it times out. Now, one thing to be aware of when you're doing this exercise is that you can only launch up to 1,024 threads in a thread block. So I suggest you organize your code to use multiple thread blocks, each of dimension K by K, and use a value like 16 for K for now. We're going to play around with this value a little bit later.
17 - Transpose Code Example Part 4

Okay, so here's my solution to that. I started by adding a constant, K equals 16, like we talked about. So now, if you think about this, we're going to have thread blocks, each of which is responsible for a tile of the matrix. Let's draw that: we'll be processing this matrix in blocks called tiles, each tile corresponding to a single thread block. Here's the new kernel I wrote. It's very similar to the previous one; the difference is that now I need to compute the j values, not just the i values, and that computation involves using the block index as well as the thread index. The block index times K, plus the thread index, in x gives my i value, and in the same way I get my j value from the y components. So the threads in my grid have an overall index (x, y) within the grid, as well as an index within the thread block, which is the thread index. I compute that x and y, and that corresponds to the i and j of the output element; that's the calculation going on here. And finally, I do the same final assignment I've done all along: I grab an element from the input matrix at coordinates (i, j), and I write it into the output matrix at the location corresponding to element (j, i). And here's how I'm going to call it. I need to define the number of blocks and the number of threads: the grid is N over K by N over K blocks. I'm being a little lazy here, you can tell; I'm assuming that N is a multiple of K. And there are K by K threads in each block. So now, instead of launching a single thread, I launch N over K by N over K blocks, and instead of N threads I launch K by K threads per block (a sketch follows). Let's compile and run this.
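A sketch of that per-element kernel and its launch, consistent with the description, with K, N, d_in, and d_out as in the earlier sketches:

    const int K = 16;    // tile width: each thread block handles a K x K tile

    // One thread per element. Block and thread indices together give the
    // element's overall (i, j) coordinates in the matrix.
    __global__ void transpose_parallel_per_element(float in[], float out[])
    {
        int i = blockIdx.x * K + threadIdx.x;   // column of the input element
        int j = blockIdx.y * K + threadIdx.y;   // row of the input element

        out[j + i * N] = in[i + j * N];         // out(j,i) = in(i,j)
    }

    // Launch: an (N/K) x (N/K) grid of K x K blocks; assumes N is a multiple of K.
    // dim3 blocks(N / K, N / K);
    // dim3 threads(K, K);
    // transpose_parallel_per_element<<<blocks, threads>>>(d_in, d_out);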
Okay, now we're talking: we're down to about 0.67 milliseconds. Let's make a note of that. Thinking about what we've done here: we've now extracted the maximum amount of parallelism from the application. We started with serial code, one thread doing all the work. We switched to code that used one thread per row of the matrix, and now we're down to something that uses one thread per element. There's really no more parallelism that we can extract here, and you can see that increasing the parallelism has really helped us.
18 - Maximum Parallelism Is Not Always Best For Performance

I want to make a point here that is a bit of a ninja topic. It turns out that exploiting every last bit of parallelism isn't always the very best-performing code; sometimes it helps to do more work per thread. This leads to an advanced optimization technique called granularity coarsening that we'll talk about later. That said, the first problem is almost always to find enough parallelism. So, keeping that in mind, are we done? Is this 0.67 milliseconds the fastest we can transpose this matrix on this GPU? Let's reason that out. Two things can limit your performance on any code: time spent fetching and storing data from and to memory, or time spent performing compute operations on that data. Now, the transpose code has almost no computation at all; it's entirely about moving data around. So let's ignore compute for the moment and focus on memory. My question is: are we moving that data around efficiently? How can we tell? There's a handy utility called deviceQuery that's included in the CUDA SDK; let's run it. deviceQuery spits out an enormous amount of information, most of which you really don't need to know right now, but buried in here are a few things I want to point out. The GPU clock rate is how fast the actual processors in the GPU run. The memory clock rate shows how fast the memory in the GPU operates. And the memory bus width describes how many bits of memory are actually transferred on each of those clock cycles. From this we can figure out the maximum speed of the memory, the maximum bandwidth: the maximum amount of data we can transfer in a second.
19 - Theoretical Peak Bandwidth

So, here are the two important things from that whole deviceQuery output: memory clock, 2,508 megahertz; memory bus, 128 bits wide. As a quick quiz, let's ask: what's the theoretical peak bandwidth? And I'd like the answer in gigabytes per second, because that's how we'll usually hear it described.
20 - Theoretical Peak Bandwidth

So, 2,508 megahertz is 2,508 times 10 to the 6th clocks per second, and a memory bus of 128 bits is equal to 16 bytes per clock. If you multiply this out, the maximum theoretical peak bandwidth of the memory system is just over 40 gigabytes per second; the arithmetic is written out below.
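    2508 \times 10^{6}\ \tfrac{\text{clocks}}{\text{s}}
    \;\times\; 128\ \tfrac{\text{bits}}{\text{clock}}
    \;\times\; \tfrac{1\ \text{byte}}{8\ \text{bits}}
    \;=\; 2508 \times 10^{6} \times 16\ \tfrac{\text{bytes}}{\text{s}}
    \;\approx\; 40.1\ \text{GB/s}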
As a rough rule of thumb, any time we achieve something like 40 to 60% of the memory bandwidth, we'd say that's doing okay: it's not great, and there's probably room for improvement. If you can get into the 60 to 75% range, that's doing pretty well, and you might not be able to improve on it. And any time you get over 75%, we'd consider that excellent. You'll never achieve this theoretical peak bandwidth on any real, substantial code; this number is literally just what you get from multiplying out the clock rate and the memory bus width, and real code has additional overhead. So if you can get over 75% of it, you're doing really well, and you probably don't need to optimize the memory further. So, how well are we doing in this code?
21 - What Bandwidth Does Our Kernel Achieve

Let's make that another quiz: what bandwidth does our kernel achieve? Remember, the last version I showed you ran with a value of N equals 1024, and I told you it took 0.67 milliseconds. Once again, I'd like the answer in gigabytes per second.
22 - What Bandwidth Does Our Kernel Achieve

Our transpose code touches 1024 by 1024 elements. Each element is 4 bytes, and we are both reading and writing every element, so we have to transfer those values across the bus twice; that's the total amount of memory traffic. It takes 0.67 milliseconds, which works out to about 1.25 times 10 to the 10th bytes per second, or, scaled to gigabytes, about 12.5 gigabytes per second; the arithmetic is below.
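    \frac{1024 \times 1024\ \text{elements} \times 4\ \tfrac{\text{bytes}}{\text{element}} \times 2}
         {0.67 \times 10^{-3}\ \text{s}}
    \;\approx\; 1.25 \times 10^{10}\ \tfrac{\text{bytes}}{\text{s}}
    \;\approx\; 12.5\ \text{GB/s}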
So we're not really coming that close to the theoretical peak bandwidth; we can probably do better.
PK #B9k
Y Lesson_5_-_Optimizing_GPU_Programs/23-Measuring_Memory_Usage_of_Our_Transpose_Code.en.srt1
00:00:00,012 --> 00:00:04,556
So, if we go through this analysis for all of these kernels, we'll see that our
2
00:00:04,556 --> 00:00:09,423
parallel per element version of the code achieves 12.5 gigabytes per second. Our
3
00:00:09,423 --> 00:00:14,022
parallel per row version of the code gets about 1.8 gigabytes per second. And
4
00:00:14,022 --> 00:00:18,708
our serial version in the code gets an abysmal 0.018 gigabytes per second, this
5
00:00:18,708 --> 00:00:23,320
is roughly the speed of a carrier pigeon. And a better way to think about this
6
00:00:23,320 --> 00:00:28,524
perhaps, is not in absolute numbers. That is a percentage of what the particular
7
00:00:28,524 --> 00:00:33,386
GPU we're using can achieve. So, if we were to work out the percentages we're
8
00:00:33,386 --> 00:00:38,174
achieving, it's something like 31% of theoretical peak bandwidth with our
9
00:00:38,174 --> 00:00:43,485
highest performing kernel. 4.5% peak bandwidth with our per row kernel, and less
10
00:00:43,485 --> 00:00:48,406
than a 10th of percent with our serial kernel. So, back to the question. Why is
11
00:00:48,406 --> 00:00:53,300
this number so low? Well, we can take a pretty shrewd guess that whenever you
12
00:00:53,300 --> 00:00:58,118
see really low DRAM utilization, really low percentage bandwidth, your first
13
00:00:58,118 --> 00:01:03,552
guess is always coalescing. A way to think about coalescing is that the GPU is
14
00:01:03,552 --> 00:01:09,113
always accessing global memory, accessing the DRAM in pretty larges chunks, 32
15
00:01:09,113 --> 00:01:13,937
or 128 bytes at a time. And this means that we are going to need the fewest
16
00:01:13,937 --> 00:01:19,377
total memory transactions when the threads in a warp access contiguous adjacent
17
00:01:19,377 --> 00:01:24,611
memory locations. So, this is an example of good coalescing. Every thread is
18
00:01:24,611 --> 00:01:29,190
either reading or writing an adjacent memory location. And clearly, if the
19
00:01:29,190 --> 00:01:33,406
threads in a warp are reading and writing completely random locations and
20
00:01:33,406 --> 00:01:38,186
memory, then you're going to get poor coalescing, right? So, if these accesses
21
00:01:38,186 --> 00:01:43,037
are spread out all over the memory, then the total number of chunks of memory
22
00:01:43,037 --> 00:01:48,152
that we have to read could be as large as the number of threads in the warp. So,
23
00:01:48,152 --> 00:01:52,615
random access pattern clearly leads to bad coalescing. So, a much more common
24
00:01:52,615 --> 00:01:56,893
access pattern is what's called Strided, and this is where threads access
25
00:01:56,893 --> 00:02:01,792
allocation memory that's a function of their thread ID times some, some stride.
26
00:02:01,792 --> 00:02:07,453
So for example, thread 0 might access location 0, thread 1 location 2, thread 2
27
00:02:07,453 --> 00:02:12,758
location 4, thread 3 location 6, and so on. In that case, that would be a stride
28
00:02:12,758 --> 00:02:17,864
of 2, because there's two elements between thread accesses. And strided access
29
00:02:17,864 --> 00:02:22,139
is range from, okay, like in this case where with the stride of, of two
30
00:02:22,139 --> 00:02:27,166
elements, and really only doubling the number of memory transactions. So, I'm
31
00:02:27,166 --> 00:02:32,017
sort of having the quality of my, my coalescing all the way to really, really
32
00:02:32,017 --> 00:02:36,819
bad, right? So, you can imagine that at the stride between elements is large
33
00:02:36,819 --> 00:02:41,759
enough than every thread in the warp is accessing a completely different 32 or
34
00:02:41,759 --> 00:02:46,020
128 byte chunk of memory. And then, you're, you're guaranteed to get bad
35
00:02:46,020 --> 00:02:50,990
behavior. Guaranteed to be maximizing the number of memory transactions that you
36
00:02:50,990 --> 00:02:55,386
have to do. So, let's look at the code for our kernels. Here's where we're
37
00:02:55,386 --> 00:02:59,871
reading from the input matrix. And this actually works out pretty well. Every
38
00:02:59,871 --> 00:03:04,551
thread is reading a value in memory which is equal to some large offset, j times
39
00:03:04,551 --> 00:03:09,029
N plus i. And if you look at i, i is really the thread index plus some offset.
40
00:03:09,029 --> 00:03:13,256
So, adjacent threads, you know, threads with adjacent thread in the season x are
41
00:03:13,256 --> 00:03:17,594
reading adjacent values of the input matrix. That's exactly what we want. So,
42
00:03:17,594 --> 00:03:21,873
this is good coalescing. On the other hand, when we write the output matrix,
43
00:03:21,873 --> 00:03:26,082
adjacent threads, threads with adjacent values of i, are writing to places
44
00:03:26,082 --> 00:03:31,040
separated in memory by N, right? And N was like 1024. So, adjacent threads are
45
00:03:31,040 --> 00:03:36,247
writing memory locations that are 1024 elements away from each other. This is
46
00:03:36,247 --> 00:03:41,471
clearly bad, this is bad coalescing. Bad, bad, bad, bad. This is, in fact, the
47
00:03:41,471 --> 00:03:42,881
root of our problem.
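
For reference, here is a sketch of the two kernels being inspected, assuming a
square, row-major N x N matrix with N = 1024 and a K x K thread block for the
per-element version; the constants and names here are assumptions, not
necessarily the course's verbatim code:

    const int N = 1024;   // matrix width (assumed)
    const int K = 16;     // per-element block is K x K threads (assumed)

    // One thread per row.
    __global__ void transpose_parallel_per_row(float in[], float out[])
    {
        int i = threadIdx.x;
        for (int j = 0; j < N; j++)
            out[j + i * N] = in[i + j * N];
    }

    // One thread per element.
    __global__ void transpose_parallel_per_element(float in[], float out[])
    {
        int i = blockIdx.x * K + threadIdx.x;   // column
        int j = blockIdx.y * K + threadIdx.y;   // row

        // Read in[i + j*N]: adjacent threads (adjacent i) read adjacent
        // addresses -- coalesced. Write out[j + i*N]: adjacent threads
        // write addresses N (1024) elements apart -- badly coalesced.
        out[j + i * N] = in[i + j * N];
    }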
Lesson_5_-_Optimizing_GPU_Programs/24-Most_GPU_Programs_Are_Memory_Limited.en.srt
1
00:00:00,012 --> 00:00:04,274
This is a really important optimization pattern. So, let me emphasize it. In
2
00:00:04,274 --> 00:00:08,807
practice, most well-tuned GPU codes are memory limited. I'll repeat that. Most,
3
00:00:08,807 --> 00:00:13,140
not all, but most GPU codes are memory limited. So, always start by measuring
4
00:00:13,140 --> 00:00:17,567
your achieved bandwidth to see if you're using memory efficiently. And if not,
5
00:00:17,567 --> 00:00:18,911
ask yourself, why not?
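
As a concrete sketch of that measurement, you can time a kernel with CUDA
events and compare the implied bandwidth against your GPU's spec sheet; the
40 GB/s peak below is an assumed figure, and blocks, threads, d_in, and d_out
are assumed to be set up as in the surrounding program:

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    cudaEventRecord(start);
    transpose_parallel_per_element<<<blocks, threads>>>(d_in, d_out);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);

    // A transpose reads and writes each element once: 2 * N*N floats moved.
    double gb       = 2.0 * N * N * sizeof(float) / 1e9;
    double achieved = gb / (ms / 1000.0);    // GB/s
    double peak     = 40.0;                  // assumed peak bandwidth, GB/s
    printf("%.2f GB/s achieved, %.1f%% of peak\n",
           achieved, 100.0 * achieved / peak);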
Lesson_5_-_Optimizing_GPU_Programs/25-Tools_to_Measure_Bandwidth_Utitlization.en.srt
1
00:00:00,012 --> 00:00:03,596
It's important to be able to reason about this the way that I just described to
2
00:00:03,596 --> 00:00:07,222
you, right? So, we sort of walked our way through. We figured out what kind of
3
00:00:07,222 --> 00:00:11,001
bandwidth we were getting, and what percentage of theoretical peak that was. We
4
00:00:11,001 --> 00:00:14,364
saw that it was really quite low, and we said, why would we be getting low
5
00:00:14,364 --> 00:00:18,002
bandwidth to global memory? Well, the first thing you always look at there is
6
00:00:18,002 --> 00:00:21,607
coalescing. And then we inspected the code and convinced ourselves that yes,
7
00:00:21,607 --> 00:00:25,798
there's bad coalescing happening when we write to the output matrix. But, I also
8
00:00:25,798 --> 00:00:30,141
want to make the point that you don't have to do this from scratch every time.
9
00:00:30,141 --> 00:00:33,965
Right? Doing all these calculations is a little bit like rubbing two sticks
10
00:00:33,965 --> 00:00:37,805
together to start a fire, it's good to know how, but there are tools to help you
11
00:00:37,805 --> 00:00:41,568
do this. The tool that we're going to be using is called nSight. This is an
12
00:00:41,568 --> 00:00:46,329
Nvidia product; there are also third-party products, maybe I'll give some links to
13
00:00:46,329 --> 00:00:50,879
those in supplementary material. And if you're using Linux or a Mac, like I'm
14
00:00:50,879 --> 00:00:55,315
using, then you'll be using the nSight Eclipse edition. If you were using
15
00:00:55,315 --> 00:01:00,017
Windows, you'd be using nSight Visual Studio edition. These are integrated
16
00:01:00,017 --> 00:01:05,474
debuggers and profilers; they're full-blown development environments. The
17
00:01:05,474 --> 00:01:10,356
part that we're going to use is called the Nvidia Visual Profiler, or NVVP.
18
00:01:10,356 --> 00:01:11,925
Let's fire that up now.
Lesson_5_-_Optimizing_GPU_Programs/26-Using_NVVP_Part1.en.srt
1
00:00:00,012 --> 00:00:03,396
So, this is NVVP. It's a very powerful tool. It has a lot of functionality.
2
00:00:03,396 --> 00:00:06,728
We're only going to touch on it today. My real point is simply to point out to
3
00:00:06,728 --> 00:00:10,028
you that these tools exist and that even though in this class we're going to
4
00:00:10,028 --> 00:00:13,768
focus on understanding the concepts, and being able to reason about these things
5
00:00:13,768 --> 00:00:17,343
from scratch, I do want you to understand that you don't always have to do
6
00:00:17,343 --> 00:00:21,312
this from scratch, and that you should use the tools that are available. So, the
7
00:00:21,312 --> 00:00:25,257
first thing we're going to do is pick an executable to run on. And so, this is
8
00:00:25,257 --> 00:00:29,411
where we tell it what executable to run, we're going to run our transpose code.
9
00:00:29,411 --> 00:00:33,638
Before we do that, I'm actually going to go into the transpose code and comment
10
00:00:33,638 --> 00:00:38,328
out the section of it that is running the transpose serial kernel, that first
11
00:00:38,328 --> 00:00:42,517
kernel that we figured out. That one is so slow that I don't really want to
12
00:00:42,517 --> 00:00:46,972
spend any more time on that. And as you'll see, the profiler takes a little time
13
00:00:46,972 --> 00:00:51,503
to run the code several times and acquire statistics about it. So, I comment
14
00:00:51,503 --> 00:00:57,044
that out and save, then I go back and I recompile my code. And now, I'm ready to
15
00:00:57,044 --> 00:01:05,410
go back to NVVP, tell it what executable we're going to run, and here we go. It
16
00:01:05,410 --> 00:01:13,889
runs the program, it measures all of the calls to the CUDA API, things like
17
00:01:13,889 --> 00:01:20,069
malloc and memcpy. And it measures the time taken for every kernel. Here's the
18
00:01:20,069 --> 00:01:25,515
transpose parallel per-row kernel. Let me zoom in on that. You can see that the
19
00:01:25,515 --> 00:01:30,456
parallel per-row kernel started 110 milliseconds into the execution of the
20
00:01:30,456 --> 00:01:36,342
program, finished at 119 milliseconds near the end of the program. It ran with
21
00:01:36,342 --> 00:01:41,527
a grid size of 1x1x1, so it launched a single block, and that block had 1024
22
00:01:41,527 --> 00:01:47,185
threads. You can get a bunch of other statistics from this. If we scroll to the
23
00:01:47,185 --> 00:01:52,248
right, we can see the much shorter time taken by our transpose per-element
24
00:01:52,248 --> 00:01:57,270
kernel. I'll zoom in even further. And this one, as you see, ran in a
25
00:01:57,270 --> 00:02:02,063
larger grid of 64 by 64 blocks, each of which had 16 by 16 threads. Now, I'll
26
00:02:02,063 --> 00:02:07,249
make the point that we measured a shorter time than that, and that the times that
27
00:02:07,249 --> 00:02:11,848
you see in the profiler are not going to match up exactly with the timings
28
00:02:11,848 --> 00:02:16,666
that we measured a little bit ago, when we were outside the profiler, and that's
29
00:02:16,666 --> 00:02:17,655
to be expected.
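
To make those launch configurations concrete, the two kernels would have been
launched along these lines, a sketch assuming N = 1024 and K = 16 as before:

    // Grid of 1x1x1: a single block of 1024 threads, one thread per row.
    transpose_parallel_per_row<<<1, N>>>(d_in, d_out);

    // Grid of 64 x 64 blocks, each 16 x 16 threads, one thread per element.
    dim3 blocks(N / K, N / K);   // 64 x 64 when N = 1024, K = 16
    dim3 threads(K, K);          // 16 x 16
    transpose_parallel_per_element<<<blocks, threads>>>(d_in, d_out);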
Lesson_5_-_Optimizing_GPU_Programs/27-Using_NVVP_Part2.en.srt
1
00:00:00,012 --> 00:00:04,989
But the interesting thing that we can do in something like NVVP is we can go in
2
00:00:04,989 --> 00:00:10,304
and we can analyze the program. So this is going to run the program many times,
3
00:00:10,304 --> 00:00:15,309
it's collecting a bunch of statistics about the program, it's averaging them
4
00:00:15,309 --> 00:00:20,341
together. And now that we've done this analysis we have a lot more information
5
00:00:20,341 --> 00:00:25,329
about these kernels. So if I click on this you see, I have more statistics over
6
00:00:25,329 --> 00:00:29,432
here. And the one that I want to highlight for you is the global load
7
00:00:29,432 --> 00:00:34,541
efficiency, which tells us how efficient our global loads were. In other words,
8
00:00:34,541 --> 00:00:39,213
of all of the bytes that we fetched with each memory transaction, how many of
9
00:00:39,213 --> 00:00:43,744
them are actually useful? 100%, looks pretty good. That's what we would expect
10
00:00:43,744 --> 00:00:47,964
from fully coalesced accesses. The global store efficiency, so now our
11
00:00:47,964 --> 00:00:51,924
stores to global memory, which in our case is writing the output matrix,
12
00:00:51,924 --> 00:00:56,404
achieved 12.5%, and that's pretty wretched. Right, and that's again what we
13
00:00:56,404 --> 00:01:00,948
would expect from having inspected the code. Our total DRAM utilization is down
14
00:01:00,948 --> 00:01:05,236
at 7.6%, again remember this is not going to exactly match what we calculated
15
00:01:05,236 --> 00:01:09,672
outside the profiler, but we can tell there's a problem.
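Incidentally, that 12.5% store efficiency is just what a back-of-the-envelope
check predicts for stride-1024 writes, assuming 32-byte store transactions:
each transaction moves 32 bytes, but only one 4-byte float of it is useful,
and 4 / 32 = 12.5%. That was the parallel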
16
00:01:09,672 --> 00:01:14,274
per-row kernel, which had only a single thread block. We can see that the parallel
17
00:01:14,274 --> 00:01:19,484
per-element kernel doesn't do much better. It has slightly higher global store
18
00:01:19,484 --> 00:01:24,111
efficiency, and very slightly higher DRAM utilization. So, we've still got this
19
00:01:24,111 --> 00:01:28,794
problem. Our problem is clearly that our ability to write to the output
20
00:01:28,794 --> 00:01:33,748
matrix is hampered. We're not achieving the bandwidth that we ought to. As I
21
00:01:33,748 --> 00:01:38,356
said before, I'm not intending to go through all the many, many things that you
22
00:01:38,356 --> 00:01:43,007
can analyze in NVVP. I'll pull it up once or twice again in the course of this
23
00:01:43,007 --> 00:01:47,977
lecture to just illustrate that there are tools to help you figure out what's
24
00:01:47,977 --> 00:01:52,232
going on. You can see, in fact, that down here it's analyzing the program
25
00:01:52,232 --> 00:01:55,851
for you and giving you some suggestions. You know, saying, look, the
26
00:01:55,851 --> 00:02:00,011
multiprocessors in your program are mostly idle, you're not getting a lot of
27
00:02:00,011 --> 00:02:04,483
work done for the total amount of time this program runs. And your total compute
28
00:02:04,483 --> 00:02:08,885
to memcpy efficiency is low; in other words, you're not doing a lot of
29
00:02:08,885 --> 00:02:13,646
computation, given the amount of time that you spend doing mem copies. Here's
30
00:02:13,646 --> 00:02:18,434
the mem copies in our timeline. You can see that we spent, you know, 2 point,
31
00:02:18,434 --> 00:02:23,131
let's see, 2.6 milliseconds copying information in and then, you know, 8
32
00:02:23,131 --> 00:02:28,391
milliseconds processing it and then another 2.6 milliseconds copying it out. So
33
00:02:28,391 --> 00:02:33,332
it warns us, hey, the total compute to memcpy efficiency is low. So these are
34
00:02:33,332 --> 00:02:37,598
really useful tools, you should use them. You should know about them. But we're
35
00:02:37,598 --> 00:02:41,438
not going to rely on them too much in this class because our point is to teach
36
00:02:41,438 --> 00:02:45,576
you how to reason about things from, from first principles. So our problem is
37
00:02:45,576 --> 00:02:49,601
bad coalescing on the write to the output matrix. What can we do about that?
Lesson_5_-_Optimizing_GPU_Programs/28-Tiling.en.srt
1
00:00:00,012 --> 00:00:03,386
So, our problem is that we're achieving coalesced reads but scattered writes.
2
00:00:03,386 --> 00:00:06,638
And our goal is to achieve coalesced reads and coalesced writes. Clearly,
3
00:00:06,638 --> 00:00:09,851
swapping the order of the reads and writes wouldn't help, because then we'd
4
00:00:09,851 --> 00:00:13,312
simply have scattered reads and coalesced writes. So, how do we achieve both of
5
00:00:13,312 --> 00:00:16,789
these? The solution is going to be a really important general strategy, so
6
00:00:16,789 --> 00:00:21,006
I'll spend a little time on it. The idea is that we're going to take a tile of
7
00:00:21,006 --> 00:00:26,062
the input matrix at a time, and we're going to transpose it and copy it into its
8
00:00:26,062 --> 00:00:30,890
transposed location in the output matrix. And this is going to be the job of a
9
00:00:30,890 --> 00:00:35,526
single thread block, so the threads in this thread block are going to work
10
00:00:35,526 --> 00:00:40,618
cooperatively together to perform this copy and transpose of a tile of elements
11
00:00:40,618 --> 00:00:44,682
at a time. The threads in the thread block are going to copy the tile into
12
00:00:44,682 --> 00:00:49,128
shared memory, the shared memory belonging to that thread block. They're going to
13
00:00:49,128 --> 00:00:53,549
perform the transpose. So now, the transpose of these elements, the elements in
14
00:00:53,549 --> 00:00:58,070
this tile is happening in shared memory where you don't pay that tremendous cost
15
00:00:58,070 --> 00:01:02,182
that you see in global memory to do a scattered write or read. And
16
00:01:02,182 --> 00:01:08,882
finally, the threads in the thread block will work together to copy the elements
17
00:01:08,882 --> 00:01:14,882
out. And the key is this: if our tile is large enough, say a K by K tile
18
00:01:14,882 --> 00:01:20,921
where K is 32, then each warp will copy a chunk of 32 elements
19
00:01:20,921 --> 00:01:26,262
at a time into shared memory. And because all 32 threads in that warp are
20
00:01:26,262 --> 00:01:31,430
reading and writing in adjacent locations in memory, you'll get good coalescing
21
00:01:31,430 --> 00:01:36,066
and then, when you copy the transposed matrix to its new location in global
22
00:01:36,066 --> 00:01:40,984
memory, you can once again do it in a coalesced fashion. So, let's try this as a
23
00:01:40,984 --> 00:01:46,032
programming exercise. We'll give you the start of the tile transpose kernel and
24
00:01:46,032 --> 00:01:50,372
the code that calls it, and you should modify the kernel code to declare an
25
00:01:50,372 --> 00:01:54,852
array in shared memory and copy appropriate elements in and out of the shared
26
00:01:54,852 --> 00:01:59,675
memory, so that the final elements are written out in transposed fashion. And don't
27
00:01:59,675 --> 00:02:04,148
forget to add any syncthreads barriers that you need to make sure that you get
28
00:02:04,148 --> 00:02:05,365
the correct result.
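
The starter code isn't reproduced here, but its shape is roughly the following
sketch, a reconstruction from this description rather than the verbatim
exercise code, with N = 1024 as before and K = 32 as suggested above:

    const int N = 1024, K = 32;

    __global__ void transpose_parallel_per_element_tiled(float in[], float out[])
    {
        // TODO: declare a K x K __shared__ tile, copy the right elements of
        // in[] into it, and write them back out to out[] in transposed,
        // coalesced fashion, with __syncthreads() wherever it's needed.
    }

    // Called with one thread block per tile:
    dim3 blocks(N / K, N / K), threads(K, K);
    transpose_parallel_per_element_tiled<<<blocks, threads>>>(d_in, d_out);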
Lesson_5_-_Optimizing_GPU_Programs/29-Tiling.en.srt
1
00:00:00,012 --> 00:00:04,832
So here is my code for this. I'll start by setting K equal to 32, and here is the
2
00:00:04,832 --> 00:00:09,916
actual code. I begin by figuring out the locations of the tile corners, okay.
3
00:00:09,916 --> 00:00:14,584
This is going to tell me where I need to start writing in the output and start
4
00:00:14,584 --> 00:00:19,912
reading from the input, so just a little bookkeeping and giving things variable
5
00:00:19,912 --> 00:00:25,390
names that mean something to me. But as you can see, the place where I start
6
00:00:25,390 --> 00:00:31,186
reading, the i value, is a function of which block we're in times the width of the
7
00:00:31,186 --> 00:00:36,730
tile, because each block is responsible for one tile, and the
8
00:00:36,730 --> 00:00:41,938
j value is the same but in y instead of x, and the output simply inverts y
9
00:00:41,938 --> 00:00:47,207
and x. Okay, so now that I know where I need to read and write my tile, I'm going
10
00:00:47,207 --> 00:00:52,060
to want to know which element of the tile to read and write. And just as a
11
00:00:52,060 --> 00:00:57,167
shorthand to make the code a little more readable, I'm going to set x to thread
12
00:00:57,167 --> 00:01:02,370
index.x, and y to thread index.y. So now, the code itself is really simple. I
13
00:01:02,370 --> 00:01:07,578
declare my floating-point array in shared memory, a K by K tile array. And I
14
00:01:07,578 --> 00:01:12,744
read from global memory, and write the result into shared memory. So here's my
15
00:01:12,744 --> 00:01:17,704
read from global memory, and it's a function of where the
16
00:01:17,704 --> 00:01:22,984
tile starts plus which element this thread is
17
00:01:22,984 --> 00:01:27,147
responsible for in the tile. To avoid an extra syncthreads, I'm going to go
18
00:01:27,147 --> 00:01:31,298
ahead and write this into shared memory in transposed fashion. So it's not tile
19
00:01:31,298 --> 00:01:35,371
x y, it's tile y x. Okay, that saves one of these syncthreads barriers.
20
00:01:35,371 --> 00:01:39,255
Now I've got the tile sitting in shared memory. It's already been
21
00:01:39,255 --> 00:01:42,976
transposed, and I want to write it out to global memory, and I want to write it
22
00:01:42,976 --> 00:01:46,880
out in coalesced fashion, so adjacent threads write adjacent locations in
23
00:01:46,880 --> 00:01:51,308
memory. In other words, adjacent threads are varying by x the way I've set this
24
00:01:51,308 --> 00:01:55,874
out. So, here's my write to global memory after my read from shared memory.
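
Assembled, the kernel being described looks roughly like this, a sketch
reconstructed from the narration with N and K as before:

    __global__ void transpose_parallel_per_element_tiled(float in[], float out[])
    {
        // Tile corners; the output corner swaps the roles of blockIdx.x/.y.
        int in_corner_i  = blockIdx.x * K, in_corner_j  = blockIdx.y * K;
        int out_corner_i = blockIdx.y * K, out_corner_j = blockIdx.x * K;

        int x = threadIdx.x, y = threadIdx.y;   // this thread's tile element

        __shared__ float tile[K][K];

        // Coalesced read from global memory. Writing tile[y][x] here and
        // reading tile[x][y] below performs the transpose in shared memory,
        // which is what saves the second barrier.
        tile[y][x] = in[(in_corner_i + x) + (in_corner_j + y) * N];
        __syncthreads();

        // Coalesced write: adjacent threads (adjacent x) store to adjacent
        // global addresses.
        out[(out_corner_i + x) + (out_corner_j + y) * N] = tile[x][y];
    }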
25
00:01:55,874 --> 00:02:00,161
You could have done this with two syncthreads. You could have read this into
26
00:02:00,161 --> 00:02:04,717
shared memory, performed the transpose, and written it out to global memory, and you
27
00:02:04,717 --> 00:02:09,137
would have needed a syncthreads after reading it into shared memory and again after
28
00:02:09,137 --> 00:02:13,985
performing the transpose. So, if you did that, I encourage you to go back and
29
00:02:13,985 --> 00:02:19,050
convert it to this single-syncthreads version and see how much faster it goes. Let's go
30
00:02:19,050 --> 00:02:24,345
ahead and run this on my laptop. Okay, there are two interesting things to note
31
00:02:24,345 --> 00:02:29,894
here. One is that the amount of time taken by the parallel per-element code, the
32
00:02:29,894 --> 00:02:35,263
kernel that we had before, actually went up. It's almost twice as slow now as it
33
00:02:35,263 --> 00:02:40,268
was before. And if you think about it, this code didn't change at all, except that
34
00:02:40,268 --> 00:02:44,930
we changed the value of K. We changed the size of the thread block that's being
35
00:02:44,930 --> 00:02:48,984
used in this code. We're going to come back to that. That's going to give us a
36
00:02:48,984 --> 00:02:51,969
hint as to a further optimization. In the meantime,
37
00:02:51,969 --> 00:02:56,937
transpose_parallel_per_element_tiled, our new version, is a little bit faster,
38
00:02:56,937 --> 00:03:01,549
not a lot faster, and that's kind of disturbing. We should have gone to perfectly
39
00:03:01,549 --> 00:03:06,003
coalesced loads and stores, which should have made a difference. Let's go ahead
40
00:03:06,003 --> 00:03:08,538
and fire up NVVP again and see what happened.