MooresLawIsDead



Presentations from Hot Chips 2023 w/ KarbinCry - Die Shrink Telegrams

The next episode of Die Shrink will see KarbinCry join yet again to discuss all of the presentations, announcements, and technology discussed from Hot Chips this year. Presentations and subjects that stood out (to us) include:

What stood out to you? What do you have questions about? You have ~16 hours (until tomorrow afternoon, US Central Time) to submit questions, and all of these presentations are available at the following link to help you think of some:
https://drive.google.com/drive/folders/1k1duE4Ps6tJpbwHtrWGYu6s4frnLGpDH

Comments

I disagree that it was undoubtedly the coolest, but you know I'm an odd duck ;)

KarbinCry

I hope Intel are putting more effort into Sierra Forest than their copy-paste presentation. I think it has a good chance of being Intel's most power efficient server processor by a long way, but that's quite a low bar to clear. I doubt it'll beat Bergamo on a per-socket basis in performance but maybe it can get close on perf-per-watt. I don't see it beating Bergamo on cost either. And Turin-Dense might double performance again.

Chris Rijk

How much energy does the AIE in Phoenix use? Its 10 TOPS of INT8 doesn't seem particularly impressive compared to the integrated GPU's 8.6 TFLOPS of FP16, but maybe it's just way more efficient?

Cache Miss

Another thing I thought was really interesting was the completely opposite positions taken by NVIDIA (and most other AI presenters) and Qualcomm. Everyone else is moving towards ever-increasing sparsity, whereas Qualcomm claims that sparsity is essentially a worse form of quantization. When I asked after the presentation, the speaker said that sparsification is like quantization except that everything around zero is compressed down to zero, so around zero you get roughly half a bit of precision. When I asked about overparametrization in models, he responded that modern designs are becoming ever more efficient in how they use their parameters, so I think he sees sparsification as a temporary solution that might just go away. I'm not really qualified to speak on this stuff, but I think it's interesting how strongly those opinions diverge. Then again, NVIDIA also kind of contradicted itself, as they presented on a logarithmic number format, which seems designed to preserve even more accuracy around zero. (Why has this not been a thing till now? It seems so clean and obvious. I guess addition is a bit hard, but you still have to solve that problem in FP.) There was a question about that seeming dichotomy, and I don't think the answer given was good. I'm a bit confused on that one. Maybe the two talks were simply aimed at different applications: models that care about precision around zero use the log format, and other models are better off sparsified?

Cache Miss
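
For readers who want to poke at the "sparsity is quantization with everything near zero squashed to zero" argument reported above, here is a minimal Python sketch. It is not Qualcomm's or NVIDIA's method: the weights are random Gaussian toy values, and `magnitude_prune`, `uniform_quantize`, and the 0.5 threshold are hypothetical names and numbers chosen for illustration. The quantizer step is set to twice the pruning threshold so both schemes share the same "dead zone" around zero.

```python
import random


def uniform_quantize(w, step):
    """Round w to the nearest multiple of step (a plain uniform quantizer)."""
    return round(w / step) * step


def magnitude_prune(w, threshold):
    """Zero out a weight whose magnitude is below threshold; keep it exact otherwise."""
    return 0.0 if abs(w) < threshold else w


def mean_squared_error(original, approx):
    """Average squared difference between two equally long lists."""
    return sum((a - b) ** 2 for a, b in zip(original, approx)) / len(original)


random.seed(0)
# Toy weights drawn from a unit Gaussian; an assumption, not data from any real model.
weights = [random.gauss(0.0, 1.0) for _ in range(10_000)]

threshold = 0.5
pruned = [magnitude_prune(w, threshold) for w in weights]
# Step of 2 * threshold gives the quantizer the same zero bin as the pruner.
quantized = [uniform_quantize(w, 2 * threshold) for w in weights]

sparsity = sum(1 for w in pruned if w == 0.0) / len(pruned)
print(f"fraction pruned to zero: {sparsity:.2f}")
print(f"MSE after pruning:       {mean_squared_error(weights, pruned):.4f}")
print(f"MSE after quantizing:    {mean_squared_error(weights, quantized):.4f}")
```

Inside the zero bin the two schemes make identical errors; outside it, pruning keeps full precision while the quantizer keeps rounding. The sketch says nothing about storage cost, which is where the real argument between the two camps lies.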

I thought the AMD (Xilinx) FPGA talk was pretty interesting just because of the comparison to Sapphire Rapids. Both are a 2x2 package, and both have two types of dies (iirc from the presentation). There was a lot of talk on the Discord about how Intel's design is overly complicated, where they had to create two types of dies to build this chip while AMD just needs one. Basically, if only Intel didn't overcomplicate their chips, maybe they could come out on time. Yet here is Xilinx basically saying, "yeah, no, to get enough bandwidth/latency we kinda need to do that. Yes it's hard, but it's necessary." The funny thing though is that Xilinx and Intel have taken their paths in completely opposite order. Xilinx went from a kind of linear 1x4 tiling to a 2x2, whereas Intel went from 2x2 to 1x3 with Granite Rapids. Guess Intel's just not as good at packaging as Xilinx lol (but a big part of it may also be that Intel has those separate IO dies, whereas Xilinx still integrates IO, so Intel achieved a similar IO profile to Xilinx, with IO at the top and bottom, but with this linear tiling).

Cache Miss

Is there any technical or business reason Zen 4c (or 5c) area efficiency focused chiplets cannot be combined with 3D V-cache? I believe in your Zen 5 leaks, you named cost as the leading factor for why AMD would choose not to do that. As costs come down over time, is this something that might arrive later? Do they have something working in the lab?

Nicholas Buckner

Undoubtedly the coolest presentation was the one from Fabric8Labs, who applied some nice lateral thinking to come up with a new manufacturing technique for creating complex 3D structures. I don't expect it to be particularly cheap but let's hope for some crazy PC cooling designs.

Chris Rijk

I'm sure DRAM makers would like to be able to sell premium DRAM chips that include processing as well. It's early days still but do they have a chance? I think they could find a niche with very large models that have simple processing but I'm not sure about other scenarios.

Chris Rijk

AI is generally split into training and inference. My overall sense from the AI talks is that cost effective and efficient inference is where most of the focus is now. Nvidia seems likely to dominate in training but is inference more of an open race?

Chris Rijk

Hi Tom and KarbinCry. I've been thinking about APUs and how there's a penalty for working with DDR memory instead of GDDR. I've heard the upgrade to DDR5 has helped with this a bit, but I wonder if 3D stacking could be used to further alleviate the issue. So far we have only seen V-Cache on CPUs, but is there any reason AMD couldn't do something similar for GPUs too? Would adding a big pool of cache potentially offer bigger benefits to a large APU like Strix Halo than to a dGPU? Or is the 32MB of Infinity Cache and 256-bit bus on Strix Halo already sufficient? Thanks and keep up the good work.

Deadeyes

Hi Tom and Karbin, do you think processing in memory will see widespread use in the next couple generations of HPC or even client products? It seems like the best of all worlds for AI processing.

Gach

I will answer here because a question as long as this won't get in, at least not in full. We will probably discuss this talk, but in less detail.
Graph processors have been a thing in supercomputing for decades. In that sense this isn't something new or revolutionary as a uarch. CPU prefetch on x86 designs is rather primitive, I agree with that. But look at what ARM, Apple, and now even some players in RISC-V are doing with prefetch depth and branch prediction. Because they want wide designs with shallow pipelines, they need to look ahead much further to get decent clocks. And that includes much more complex and adaptive prefetch schemes. There is no reason x86 designs couldn't implement these features. It's just that they already evolved their pipelines to be long, yet effective, so they haven't needed to rely as much on prefetch and branch prediction, yet.
A GPU is a shitton of threads, but for a comparison you should look at how many waves can be scheduled in a SIMD. RDNA3, for example, has I believe 16 waves that can be scheduled. That's sort of like "SMT16" in the way GPUs use these "threads", and in the way the Intel chip in question is using them. I know, the terminology is weird since a GPU thread is something completely different :) For a less confusing example, look at Larrabee and its descendant, Xeon Phi. There, the Atom cores have SMT4. Not to actually execute simultaneously, but to hide latency by tagging in threads that are ready while others wait for data. It is similar here.
Most CPUs use 64B cache lines. 128B is very rare; I think Apple does use those, at least on their M-class silicon. The issue with 8B lines is that they will severely impact effective BW. Memory is built around, and gets its BW from, massive internal parallelism, which is difficult to exploit. As a result, that DDR5-4400 may, in effective BW, be closer to, say, 2000. But because the workload here is so specific, the effective BW in that workload can still be improved with 8B lines.
I hope the explanation makes some sense. Locality challenges in graphs are a big deal. That's why many graph accelerators are designed to let the compiler deal with those, not HW caching. For the most extreme option there, look at Groq's silicon (presented at last year's HC), or this year's NorthPole from IBM. Or... this Intel chip, which relies on large scratchpads, i.e. SW-managed caches. As for efficiency, it's a combination of a best-case workload and extreme focus on that very small sliver of workloads.
The optics are interesting, but not for efficiency, not until we are talking about thousands of nodes. Where the co-packaged optics are interesting to me is that they extend the NoC directly. Not like UPI, where traffic goes through a single link, but like how SPR XCC is four tiles sharing a seamless mesh NoC carried over EMIB. Only here, that mesh is seamlessly extended via fiber out to distances in the meter range. The one issue there is the latency of Tx and Rx, of moving from an electronic to a photonic signal. Otherwise, I think they said a meter of length adds just nanoseconds. And that Tx/Rx latency is likely a solvable issue, since we also saw Hummingbird, where photonic waveguides are used directly as a NoC for electronic components. In that case it's silicon waveguides and not fiber, but I do think there is a lot of room to crush down the latency even in fiber applications like this.

KarbinCry
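
Two of the points in the reply above can be put into rough numbers. The Python sketch below is a back-of-envelope model, not anything shown at Hot Chips: `core_utilization` is a hypothetical, idealized model of a core that swaps in a ready thread whenever the running one stalls (zero switch cost, identical threads), and the 10-cycle / 300-cycle figures are assumed purely for illustration. `fiber_delay_ns` uses the vacuum speed of light and an assumed typical refractive index of about 1.47 for silica fiber.

```python
def core_utilization(contexts, compute_cycles, stall_cycles):
    """Fraction of time a core stays busy if it switches to another ready
    thread whenever the current one stalls on memory.

    Idealized assumptions: switching is free and all threads behave alike.
    """
    return min(1.0, contexts * compute_cycles / (compute_cycles + stall_cycles))


def fiber_delay_ns(length_m, refractive_index=1.47):
    """One-way propagation delay through optical fiber, in nanoseconds.

    refractive_index=1.47 is an assumed typical value for silica fiber.
    """
    speed_of_light_m_per_s = 299_792_458
    return length_m * refractive_index / speed_of_light_m_per_s * 1e9


# Assumed workload: 10 cycles of work between 300-cycle memory stalls.
for contexts in (1, 4, 16):
    busy = core_utilization(contexts, compute_cycles=10, stall_cycles=300)
    print(f"{contexts:>2} thread contexts -> core busy {busy:.0%} of the time")

print(f"1 m of fiber adds about {fiber_delay_ns(1.0):.1f} ns one way")
```

The first function shows why a latency-bound workload wants many simple thread contexts per core rather than one fast one; the second covers only time of flight, so the Tx/Rx conversion latency mentioned above would come on top of it.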

It's been years since KarbinCry wrote his article on Intel's interconnect problem. How do things stand today? Has the interconnect problem been solved or has it just been surpassed by bigger problems?

qhfreddy

When do you think we will really start to feel the bandwidth crunch in the server market? Maybe we are already past that point? AMD has already made good money selling parts with only a few cores per CCX active, which gives not only a lot of cache per core but also extra bandwidth per core. Memory channel counts have also been going up, moving from 6 to 12 far quicker than from 4 to 6, which could just be because 2 DIMMs per channel is effectively useless for DDR5, but it also pushes a lot more bandwidth into the package.

qhfreddy
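
To put the channel-count growth in the question above into rough numbers, peak DRAM bandwidth per socket can be estimated as channels times transfer rate times bytes per transfer. This is a small Python sketch; the three configurations are assumed examples chosen to match the 4, 6 and 12 channel steps mentioned, not figures quoted in the thread, and `peak_bandwidth_gb_s` is a hypothetical helper name.

```python
def peak_bandwidth_gb_s(channels, mega_transfers_per_s, bytes_per_transfer=8):
    """Theoretical peak DRAM bandwidth in GB/s.

    Assumes a 64-bit (8-byte) data path per channel, ignoring ECC bits.
    """
    return channels * mega_transfers_per_s * 1e6 * bytes_per_transfer / 1e9


# Illustrative configurations only (assumed, not taken from the discussion).
configs = [
    ("4 channels, DDR4-2133", 4, 2133),
    ("6 channels, DDR4-2666", 6, 2666),
    ("12 channels, DDR5-4800", 12, 4800),
]

for label, channels, rate in configs:
    print(f"{label}: {peak_bandwidth_gb_s(channels, rate):.1f} GB/s peak")
```

These are theoretical peaks; sustained bandwidth is lower, and dividing by the number of active cores gives the per-core figure the question is really about.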

Hello Tom and Karbin. I know this is not a Hot Chips topic, but since we are talking about chips and architectures I might as well ask. What is the difference between Nvidia's and AMD's ray tracing approaches? I read on the internet that AMD's ray accelerators only take some of the load and the rest falls on the CUs. Nvidia, on the other hand, has hardware that's focused on taking all of the ray tracing load but can't do much else. So my questions are: what are the differences? Which do you think is the better design strategy for the future? And how do you think AMD can improve ray tracing in RDNA4?

QuickJumper

KarbinCry, how is your hand? I tried scrolling the epically long hot-chips-35 discussion in reverse and saw the bandaged pinky and for a split second my mind went to Emacs pinky. I hope you are doing well.

Nicholas Buckner

TLDR: I thought "The First Direct Mesh-to-Mesh Photonic Fabric" was a pretty cool talk, and I was impressed by how simple an approach they took to a problem that is extremely inefficient on modern CPUs.
This was my first time attending Hot Chips (and I was there in person!), so that was definitely cool. I'm not sure this was the most interesting presentation, and I definitely want to mention some others, but I found it pretty interesting. Honestly, part of the reason is simply how, well, simple the architecture is.
Graph processing is something that works terribly on modern processors, as they've invested all their points into optimizing for (somewhat) "regular" structures, whereas graphs are anything but regular. CPUs have all this logic to hide RAM latency by looking ahead in the instruction stream, but that is kind of useless when you have very long chains of dependent instructions, which is very possible in complex graphs. Additionally, CPUs have kind of a dumb form of prefetching: every time a CPU needs something from memory, it doesn't fetch just that bit of memory but a full "cacheline," in the hope that nearby memory will be accessed later in the computation. If all you're doing is following a graph, then you often don't need that much memory; you just want a pointer for where to go next. Thus, graphs kind of, well, suck.
In summary, graph processing is heavily latency constrained, and thus all these fancy features of modern CPUs that burn lots of energy to hide memory latency are pretty much wasted, because the chain of dependencies is just too long for the CPU to look ahead of it. Therefore, what you really want is just a metric buttload of threads, each very simple and fetching very small chunks of memory at a time, just enough to know where to go next on the graph.
You might say, "well, a GPU is just a shitton of threads," but the problem is that GPU threads are (mostly) not independent, while here each thread may need to do somewhat different things. Additionally, the cacheline problem is even bigger on GPUs, where cachelines are 128 bytes (or so I heard at one point). Therefore, what Intel did was to undo basically all of these regularization/fancy features that try to make processors seem faster than they are. They simply created a chip with a bunch of threads that independently do pointer chasing. Each group of 16 super simple threads comes with a stronger, single-thread-oriented thread, as there are some pre/post computation steps that actually need single-threaded performance. Additionally, they reduced the cacheline size to just eight bytes, which in practice means they can make 8x better use of memory bandwidth. I wonder how difficult it is to make a memory controller that suddenly has to handle 8x more requests, but they were able to run DDR5-4400 memory, which is pretty impressive, and this is just a prototype for now, so I wouldn't be surprised if it gets faster.
However, this in itself is really not enough to do graph processing properly. The thing with graph processing is that these graphs can be highly nonlocal. I'd imagine typical graphs do have some locality properties, but these are very weak, and you might need to access a part of the graph in a completely different location. When the graph is too large to fit in local memory, this becomes a huge issue. Therefore, the second part of the equation is some really, really fast, low-latency interconnects, which essentially mean you can (almost) just access memory on any other processor in a large cluster of processors.
It's just in a lab for now, but, upon being asked, they claimed that they can achieve somewhere on the order of 1000x power efficiency gains over traditional CPUs, which is pretty cool. These efficiencies only really apply to massive datasets (where stuff doesn't fit in L3/HBM cache and interconnects are a big issue), and so really these are projected numbers, as they only have like 2 chips running in a lab. I wonder how much of this efficiency comes from the optical interconnect rather than from the chip.

Cache Miss
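
The cacheline argument in the comment above boils down to simple arithmetic: if a pointer-chasing step only needs one 8-byte pointer, most of a 64-byte or 128-byte line is fetched for nothing. Here is a small Python sketch of that calculation; the 8-byte pointer size is an assumption and `useful_fraction` is a hypothetical helper, not something from the talk.

```python
def useful_fraction(bytes_needed, line_bytes):
    """Share of the fetched bytes that the program actually wanted,
    given that memory is always fetched in whole lines."""
    lines_fetched = -(-bytes_needed // line_bytes)  # ceiling division
    return bytes_needed / (lines_fetched * line_bytes)


POINTER_BYTES = 8  # assumed size of one "where to go next" pointer

for line_bytes in (128, 64, 8):
    fraction = useful_fraction(POINTER_BYTES, line_bytes)
    print(f"{line_bytes:>3} B line: {fraction:.1%} of fetched bytes are useful")

gain = useful_fraction(POINTER_BYTES, 8) / useful_fraction(POINTER_BYTES, 64)
print(f"8 B lines vs 64 B lines: {gain:.0f}x more useful data per byte moved")
```

This counts only wasted bytes per access. It ignores how DRAM delivers data in bursts, which, as noted elsewhere in this thread, reduces the effective bandwidth available with such small lines.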

Greetings Tom and KarbinCry. KarbinCry, great job with the HC coverage this year, even if 90% of it flew over my head. My question to you both is: what was the most interesting or surprising thing you saw or heard in the presentations? Cheers!

Xavbeat03

Hello there Tom and KarbinCry. XTX owner here. I just have a question for KarbinCry: where or how have you amassed so much knowledge on these topics? In my opinion the best part of the MLID Discord is the amount of knowledge and info that flows in there, and KarbinCry is one of the pillars of it all, if not the pillar. Half the time the information and knowledge is so far above me that I’m too scared or embarrassed to even read the discussion. Where would you recommend someone start, so I have something else besides my XTX ownership to be proud of?

Sad XTX 999

What would you like to see, and what do you think is going to happen, with the extra R&D capital from the Arm IPO?

paragon

