Saturday, August 19, 2023

What I want out of computer hardware reviewers

Since this is apparently becoming an increasingly sporadic and PC building-focused blog, I feel compelled to comment on the recent controversy surrounding LTT and Linus Media Group's hardware reviews and other practices. GamersNexus' video lays it all out nicely.

First, some quick takes on the controversy:

  • Full disclosure: I'm a huge admirer of what LMG has built and, in general, the way they've grown and run their business. Building what Linus and his team have built is no insignificant achievement and the rising tide they've created in the tech YouTuber space has lifted a lot of boats.
  • While I may not agree with every position he takes or decision he makes, I believe Linus to be a highly ethical person who operates from a strong personal moral compass. Again, his compass and mine don't align 100% of the time, but I'm saying I think he is a scrupulous dude.
  • That being said, I do think LMG's 'quantity over quality' approach is leading to many of the errors and questionable behavior that Steve is talking about. As the LMG team says for themselves, that strategy probably made sense as LMG was growing, but it's not clear that it's necessary or optimal now that the company is worth over $100mm.
  • Being that big creates an obligation for LMG to recognize that its actions and mistakes can have a massive impact on smaller partners, businesses and other creators. This is the focus of GN's criticisms in the second part of the video and the part that resonates most deeply with me.
  • Parenthetically, this sort of takedown piece is very on-brand for GN. There's a lot GN does that I find valuable, but the 'self-appointed guardian of ethics in the PC hardware community' shtick wears thin sometimes.
What I find more interesting is the thread of the discussion (addressed in the GN video, LMG's reply and this one from Hardware Unboxed) about hardware review and testing practices. GN and Hardware Unboxed, among many others, trade on the accuracy and rigor of their testing. LTT Labs is an attempt to do the same thing and bring LMG into that space. These outlets develop elaborate testing practices. They conduct dozens of benchmarks and hundreds of test runs for significant hardware releases. They have strong opinions about testing methodology and boast of their experience.

The LMG controversy has me wondering how valuable that work is, though, even to PC building enthusiasts. It's got me thinking about what I actually care about when, say, a new generation of CPUs or GPUs comes out; or when an interesting new piece of hardware is released.

Specifically, I'm talking about what sort of testing is useful, particularly in the context of day-one reviews. This kind of coverage and testing isn't what I personally gravitate towards in this space: that would be the more entertaining, wacky, oddball stuff that I think nobody covers better than LMG at its best. I'm talking about the kind of coverage that major component categories get around new launches: CPUs, GPUs, cases, CPU coolers and, to a lesser extent, power supplies.

The day-one review context is also important, because it imposes certain constraints on the coverage, some of which limit the possibilities and, frankly, the value of rigorous testing:

  • The reviewer is typically working with only one review sample of the product;
  • That review sample is provided by the manufacturer relatively close to the product launch, limiting the time the reviewer has to test and evaluate the product;
  • The reviewer is under an NDA and embargo (usually lasting until the product launch date), limiting the reviewers' ability to share data and conclusions with each other during the narrow window the day-one reviewers have to test.

First of all, for all these component categories, I'd like to know if the product suffers from a fatal flaw. This might be either a fatal design flaw that is apparent from the spec sheet (e.g. the limited memory bandwidth of lower-end 40-series GPUs) or something that is only uncovered through observation (e.g. 12-volt high-power connectors causing fires on higher-end 40-series GPUs).

The thing is, though, neither of those types of flaws is identified through rigorous day-one testing. The design flaws are sometimes apparent just from the spec sheet. In other cases, the spec sheet might raise a suspicion and some testing -- perhaps a customized regimen specific to ferreting out the suspicion -- is needed. And often, some level of expertise is required to explain the flaw. These are all valuable services these tech reviewers provide, but they are, by and large, not about rigorous testing.

Flaws that can only be detected through observation are rarely uncovered through the kind of rigorous testing these outlets do (and I don't think these outlets would claim differently). The typical pattern is that the product hits the market; users buy and use it; and some of them start to notice the flaw (or its effects). Then one or more of these outlets gets wind of it and does a rigorous investigation. This is an extremely valuable service these outlets provide (and also where GN really shines) but, again, you don't typically find it in a day-one review and it's not uncovered through testing.

I'm also looking for what I'll call 'spec sheet contextualization and validation.' I want to know what the manufacturer claims of the product in terms of features and performance. To the extent there's interesting, new, innovative or just unfamiliar stuff on the spec sheet, I'd love for it to be explained and contextualized. And I obviously want to know if the claims are to be believed. (There are also useful derivatives of the contextualization and validation that these reviewers often present and explain, for instance generational improvement, price-to-performance-ratio data and comparisons to competing products.)
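As a rough illustration of those derived metrics, here is a minimal Python sketch. The function names, prices and frame rates are hypothetical placeholders, not data from any review.

```python
def generational_uplift(new_fps: float, old_fps: float) -> float:
    """Fractional performance gain of the new part over its predecessor."""
    return new_fps / old_fps - 1.0


def cost_per_frame(price_usd: float, avg_fps: float) -> float:
    """Dollars paid per average frame per second (lower is better)."""
    return price_usd / avg_fps


# Hypothetical numbers, for illustration only.
old_card = {"price_usd": 400.0, "avg_fps": 80.0}
new_card = {"price_usd": 500.0, "avg_fps": 100.0}

uplift = generational_uplift(new_card["avg_fps"], old_card["avg_fps"])
print(f"Generational uplift: {uplift:.0%}")
print(f"Old card: ${cost_per_frame(**old_card):.2f} per fps")
print(f"New card: ${cost_per_frame(**new_card):.2f} per fps")
```

In this made-up case the new card is 25% faster but costs exactly the same per frame, which is the kind of context a raw benchmark chart doesn't give you on its own.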

Some amount of testing is sometimes helpful for that contextualization and more or less required for validation. And particularly in the case of validation, some degree of well-designed-and-executed, standardized benchmarking is required. It makes sense to me, for example, for an individual reviewer to have a standardized test suite for new GPUs that uses a standardized hardware test bench and ~6-8 games that represent different performance scenarios (e.g. GPU-intensive, compute-intensive, etc.).

Things start to get questionable for me when outlets start to go much beyond this level though. The prime example of this is the component type where these outlets tend to emphasize the importance of their testing rigor the most: CPU coolers. To their credit, the reputable outlets recognize that getting accurate, apples-to-apples data about the relative performance of different coolers requires procedures and test setups that accommodate difficult-to-achieve controls for multiple variables: ambient temperature, case airflow, thermal compound application/quality, mount quality, noise levels and both the choice and thermal output of the system's heat-generating components, to name a few.

But the thing is, the multitude of factors to be controlled for under laboratory conditions undermines the applicability of those laboratory test results to actual use under non-laboratory conditions, possibly to the point of irrelevance. Hypothetically, let's assume that Cooler A (the more expensive cooler) keeps a given high-TDP CPU on average 3 degrees C cooler than Cooler B in a noise-normalized, properly controlled and perfectly conducted test. Here are a few of the factors that make it potentially difficult to translate that laboratory result to the real world:

  • Component selection: Though Cooler A outperforms Cooler B on the high-TDP CPUs reviewers typically use for controlled testing, the advantage might disappear with a lower-TDP CPU that both coolers can cool adequately. Alternatively, as we've seen with recent high-TDP CPUs, the limiting factor in the cooling chain tends not to be anything about the cooler (assuming it's rated for that CPU and TDP) but rather the heat transfer capacity of the CPU's IHS. I recently switched from an NH-U12S (with two 120mm fans) to an NH-D15 (with an extra fin stack and two 140mm fans) in my 5800X3D system and saw no improvement in idle thermals with the fans in both setups at 100% load, I suspect because of this very issue.
  • Mount quality: CPU coolers vary greatly in ease of installation. So even if Cooler A outperforms Cooler B when mounted properly, if Cooler A's mounting mechanism is significantly more error-prone (especially in the hands of an inexperienced user), that advantage may be lost. In fact, if Cooler B's mounting mechanism is significantly easier to use or less error-prone, it might actually outperform Cooler A for the majority of users because more of them will achieve a good mount. The same applies to...
  • Thermal compound application: Not only might a given user apply too much or too little thermal compound (where a reviewer is more likely to get it right), but, more deeply, the quality of the application and spread pattern can vary substantially between installation attempts, even among experienced builders, including, I would add, professional reviewers. Anyone who has built multiple PCs has had the experience of having poor CPU thermals, changing nothing about their setup other than remounting the CPU cooler (seemingly doing nothing differently) and seeing a multi-degree improvement in thermals. Outlets like GN providing contact heatmaps as part of their rigorous testing is a nod to this issue, but they typically only show the heatmaps for two different mounting attempts (at least in the videos), and that seems like too small a sample size to be meaningful. This brings up the issue of...
  • Manufacturing variance from one unit of the same product to another: At most, these outlets are testing two different physical units of the same product, and frequently just one. I don't know this, but I suspect that because good contact between the CPU heat spreader and cooler coldplate is such a key factor in performance, the quality and smoothness of the coldplate matters a lot, and is exactly the kind of thing that could vary from one unit to another due to manufacturing variance. All other things being equal, a better brand/sku of cooler will have less unit-to-unit variance, but the only way to determine this would be to test with far more than one or two units, which none of these reviewers does (and, indeed, none can do with just one review sample provided by the manufacturer). Absent that data, it's very similar to the silicon lottery with chips: your real-world mileage may vary if you happen to win (or lose) the luck-of-manufacturing draw.
  • Ambient temperature and environmental heat dissipation: Proper laboratory conditions for cooler testing involve controlling the ambient environmental temperature. That means keeping it constant throughout the test, which means that the test environment must have enough capacity to eliminate the heat the test bench is putting out (along with any other heat introduced into the test environment from the outside during the test period, like from the sun shining through the windows during the test). If the user's real-world environment also has this capacity, the test results are more likely to be applicable. If, on the other hand, the real-world environment can't eliminate the heat being introduced (say it lacks air conditioning, is poorly ventilated or has lots of heat being introduced from other sources), it changes the whole picture. Fundamentally, ambient temperature is a factor a responsible reviewer must control for in a scientific test. However, it is almost never controlled for in real-world conditions. And, arguably, the impact of uncontrolled ambient temperature is one of the most significant factors affecting quality of life in the real world (the other being noise, on which see below). From a certain point of view, PC cooling is about finding a balance where you get heat away from your components fast enough that they don't thermal throttle (or exhibit other negative effects of heat) but slow enough that you don't overwhelm the surrounding environment's ability to dissipate that heat. If the PC system outputs heat faster than the outside environment can dissipate it, the outside environment gets hotter, which sucks for your quality of life if you're also in that environment and trying to keep cool. This is why, considering only this issue, a custom water cooling solution with lots of fluid volume would yield a higher quality of life for most users than, e.g., a single tower air cooler. The greater thermal mass and conductivity of the fluid vs. the air cooler's heat pipes and fin stack allow more heat to get away from the components quickly but remain internal to the system and then be transferred into the environment over time, which is a better match for the primary ways we cool our environments (like air conditioning), which are better at dissipating relatively even, rather than spiky, heat loads.
  • Case and case airflow: I think this is by far the most significant factor in the real world. Any relative performance difference between Coolers A and B under laboratory conditions can easily be wiped out or reversed when either cooler is placed in a particular setup with particular airflow characteristics. Both coolers might perform great in a case with stellar airflow and perform poorly in one that is starved for airflow. But, more deeply, certain cooler designs perform better under certain case airflow conditions than others. An AIO where the radiator's fans can't create enough static pressure to overcome the case's airflow restrictions won't realize its full performance potential. Reviewers (rightly) try to create consistent test conditions that are fair to all the products being tested, but your setup probably isn't.
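To put rough numbers on the factors above, here is a toy Monte Carlo sketch in Python. It assumes (purely for illustration) that everything the lab holds constant, such as mount quality, paste spread, unit variance and airflow, adds an independent, normally distributed offset to each cooler's temperature in a given build. The function name, the 3 degree C lab gap and the spread values are all hypothetical.

```python
import random


def share_of_builds_where_b_wins(lab_advantage_c, per_build_sd_c,
                                 builds=10_000, seed=0):
    """Estimate how often Cooler B ends up cooler than Cooler A in practice.

    lab_advantage_c: Cooler A's edge in the controlled test (degrees C).
    per_build_sd_c:  assumed standard deviation of everything the lab holds
                     constant (mount, paste, unit variance, airflow), in degrees C.
    """
    rng = random.Random(seed)  # fixed seed so the run is repeatable
    b_wins = 0
    for _ in range(builds):
        # Temperatures relative to Cooler B's lab result (0 = B's lab number).
        temp_a = -lab_advantage_c + rng.gauss(0.0, per_build_sd_c)
        temp_b = rng.gauss(0.0, per_build_sd_c)
        if temp_b < temp_a:
            b_wins += 1
    return b_wins / builds


for sd in (0.0, 1.0, 3.0):
    share = share_of_builds_where_b_wins(lab_advantage_c=3.0, per_build_sd_c=sd)
    print(f"per-build spread {sd:.0f} C: B wins in {share:.1%} of simulated builds")
```

With no spread, Cooler A always wins, just as in the lab. Under this toy model, once the assumed per-build spread is comparable to the lab gap, the "worse" cooler comes out ahead in somewhere around a quarter of simulated builds. The real distributions are unknown, which is rather the point.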
For these reasons, I regard relative performance data about different coolers under laboratory conditions as basically worthless, however rigorously collected it is. If I'm evaluating a cooler, what I actually care about are:

  • The compatibility and rated performance of the cooler for a given CPU and case/mobo. This is spec sheet stuff, though some level of testing validation is valuable.
  • How easy and foolproof the mounting mechanism is, which is best surfaced through an on-camera build demonstration, not rigorous testing. Here, I find build videos far more valuable than product reviews, because if you see an experienced YouTuber struggling to mount a dang cooler, it should at least give you pause. I'd also note that build videos are inherently more entertaining than product reviews, because it's compelling to watch people struggle and overcome adversity, and even more fun when they do so in a humorous and good-natured way, which is a big part of the secret sauce of folks like Linus and PC Centric.
  • The noise level of any included fans when run at, say, 30%, 50%, 80% and 100% load. This might be idiosyncratic to me (though I suspect not), but I'm particularly sensitive to fan noise. Given that the cooler can fit in my case and can handle the output of my CPU, what matters to me is how noisy my whole system is going to be, both at idle and under load. With any cooler, I assume I'm going to have to tune the curves of both its fan(s) and my case fans to find the best balance of noise and cooling across different workloads (e.g. idle, gaming load, full load). I can't possibly know how this will end up in my build in advance, and rigorous testing under laboratory conditions doesn't help me. So the best I can hope for from a reviewer is to give me a sense of how much noise the cooler's fans will contribute to overall system noise at various RPM levels. (This is the primary reason I favor Noctua fans and coolers and am willing to pay a premium for them: they are super quiet relative to virtually all competitors at either a given RPM or thermal dissipation level. And it's the primary advantage of switching to the D15 in my current setup, since the larger fans and dual tower design mean it can dissipate more heat with less noise than the U12S.)
Stated another way: If a cooler is rated for my CPU, fits in my case and can be mounted properly with a minimum of fuss, I assume it can adequately cool my PC at some noise level. The only question is how noisy, and that's a function of how fast I need to run my system's fans (including the cooler fan(s), but also every other fan) to achieve adequate cooling, and of the noise level of my fans (again, including, but not limited to, the cooler fan) when run at those speeds. No amount of laboratory testing can answer that question, however rigorous (unless it were conducted on a test bench identical to my system, which is unlikely).
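For a sense of how one fan's noise figures into the whole system, here is a minimal Python sketch of the usual rule for combining sound levels from independent sources: convert each decibel figure to relative power, add, and convert back. It assumes the sources are uncorrelated and were measured at the same position, which a real case only approximates. The function name and the fan figures are hypothetical.

```python
import math


def combined_noise_db(levels_db):
    """Combine sound levels (in dB) from independent, uncorrelated sources."""
    total_power = sum(10.0 ** (level / 10.0) for level in levels_db)
    return 10.0 * math.log10(total_power)


# Hypothetical figures: three case fans plus one cooler fan.
case_fans = [28.0, 28.0, 28.0]
for cooler_fan in (22.0, 28.0, 36.0):
    total = combined_noise_db(case_fans + [cooler_fan])
    print(f"cooler fan at {cooler_fan:.0f} dB -> system at about {total:.1f} dB")
```

Two equally loud sources add only about 3 dB, so a cooler fan that is much quieter than the rest of the system barely registers, while one that is much louder dominates the total. That is why per-RPM noise figures for the cooler's own fans are useful even though the final system number depends on everything else in the case.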

I've been throwing the word 'rigorous' around, and it's worth decomposing what it means and why 'rigorous' testing is (or isn't) valuable to the consumer. One aspect of it is just that the test is conducted properly and free of human error (and that the rigor of the process makes it easy to identify when human error is committed and correct it). Another aspect is that the testing methodology itself is well-designed insofar as it provides accurate and useful information to the consumer. My main concern with 'rigorous' testing in many of these product categories (especially with CPU coolers) is that the rigorous, laboratory testing methodologies don't yield especially useful information that consumers can apply outside of laboratory conditions.

Another aspect of rigor is repetition/replicability. Again, there are different dimensions to this. Certainly, a rigorous reviewer ought to conduct multiple trials of the same test to see if their results are consistent. But the thing is, this is more of a check on other aspects of the rigor of the individual reviewer's own methodology and work than anything else. If a reviewer does, say, 50 trials (which is, realistically, way more than any of these outlets are doing for PC components) and finds that 5 trials are significant outliers, it suggests one of three things:

  1. The tester committed human error on the outlier test runs, in which case they should try to track it down and correct it, or else throw out the results of those five trials.
  2. The testing methodology fails to account for some confounding factor that was present in those five cases and not the others, in which case the reviewer ought to track that down and control for it if possible.
  3. The individual unit being tested (remember, these reviewers are typically testing only one unit of the product being evaluated) exhibits weird behavior. Technically, this is an instance of (2) because something must be causing the particular unit to behave oddly, it's just that the reviewer hasn't been able to control for that something. And given time constraints on day-one reviews especially, this is when an individual reviewer is most likely to say 'I don't know... maybe I have a defective unit here, but I can't be sure and don't have the time or resources to investigate further.'
So, again, an individual reviewer doing multiple trials is valuable, but primarily because it helps that reviewer identify problems with their own execution, methodology or the individual unit being tested. A consumer should have more confidence in the data from a reviewer who performs 'rigorous' testing in this sense, but only to the extent their methodology is basically sound and with the understanding that any one reviewer's results have limited value in extrapolating to how a different unit of that product will perform for you, even under otherwise identical conditions.
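As a sketch of what spotting outlier trials could look like in practice, here is one common approach in Python: flag runs whose distance from the median is large relative to the median absolute deviation (a 'modified z-score'). This is one technique among several, not any particular outlet's method. The function name, the 3.5 cutoff and the temperatures are illustrative assumptions.

```python
from statistics import median


def flag_outlier_trials(trials, threshold=3.5):
    """Return indices of trials that sit unusually far from the median.

    Uses the modified z-score: 0.6745 * |x - median| / MAD, where MAD is the
    median absolute deviation. The 3.5 cutoff is a common rule of thumb.
    """
    med = median(trials)
    mad = median([abs(x - med) for x in trials])
    if mad == 0:
        return []  # at least half the runs are identical; nothing to scale by
    return [
        i for i, x in enumerate(trials)
        if 0.6745 * abs(x - med) / mad > threshold
    ]


# Hypothetical CPU temperatures (degrees C) from repeated runs of one test.
runs = [72.0, 72.4, 71.8, 72.1, 79.5, 72.2, 71.9]
print(flag_outlier_trials(runs))  # indices of runs worth re-checking
```

Flagging a run doesn't say which of the three explanations applies. It only tells the reviewer where to start looking.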

The kind of rigor that does help a consumer to have confidence that the results will apply to them is the kind that comes from the repetition and replicability achieved by multiple reviewers reviewing different units of the same product using the same (basically) sound methodology. This is the kind of rigor that modern laboratory science provides (e.g. a scientist achieves a certain result, publishes his methodology and findings and then other scientists follow the same methodology with different people and materials and see if the result is replicated). It's also why it's important to consider multiple reviews from multiple reviewers when evaluating a product as a consumer. Consistent results across multiple reviewers make it more likely that you will achieve a similar result if you buy the product. Inconsistent results suggest manufacturing variances, problems with quality control, design flaws that lead to inconsistent behavior across units, etc.
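A consumer can do a crude version of this cross-check by hand. The Python sketch below assumes several reviewers report the same metric for the same product (here, degrees C by which Cooler A beats Cooler B) and simply looks at how far the numbers spread. The reviewer names, values and the 1 degree tolerance are hypothetical.

```python
from statistics import mean, stdev


def summarize_across_reviewers(results_by_reviewer, max_spread=1.0):
    """Summarize one metric reported by several reviewers (needs at least two).

    'spread' is the sample standard deviation across reviewers; 'consistent'
    is True when that spread is within the chosen tolerance.
    """
    values = list(results_by_reviewer.values())
    spread = stdev(values)
    return {
        "mean": mean(values),
        "spread": spread,
        "consistent": spread <= max_spread,
    }


# Hypothetical: Cooler A's advantage over Cooler B as reported by three outlets.
agreeing = {"reviewer_1": 3.0, "reviewer_2": 3.2, "reviewer_3": 2.8}
disagreeing = {"reviewer_1": 3.0, "reviewer_2": -1.0, "reviewer_3": 6.0}

print(summarize_across_reviewers(agreeing))
print(summarize_across_reviewers(disagreeing))
```

A tight spread across independent units and test benches says more about what you are likely to see than any number of extra trials on a single unit.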

So what really would be of greatest value to consumers in the space isn't more elaborate methodology and more trials (i.e. more rigor) by any one reviewer. It's many reviewers following (basically) the same methodology, where that methodology does the minimum necessary to produce meaningful results, can be easily replicated, and incorporates the minimum number of trials an individual reviewer needs to reasonably identify and correct their own errors. If folks are interested in raising the quality of the PC hardware review sector (and discourse), figuring out how to achieve that is what they should be striving for.
