Graphics and System Readings

- (14 min read)

Blogs and articles that I found interesting.

[[toc]]

The Mozilla Observatory scans your site to find vulnerabilities.

https://observatory.mozilla.org/

"After Working at Google, I’ll Never Let Myself Love a Job"

https://web.archive.org/web/20210731125322/https://www.nytimes.com/2021/04/07/opinion/google-job-harassment.html

I spent six months doing two internships at Google. This article reminds me how much I still love the time I spent there. Most of the reason is that Google takes good care of everything and you only need to focus on your work. I got great food, an easy commute, and all kinds of social/physical benefits (I ran full-court basketball three times a week on weekdays and spent every weekend hiking with my co-workers LOL). It indeed "structured my life around my job".

Now I work for a company that pays less and provides far fewer benefits. The only better part is that I have more interesting things to work on than I did at Google. But I still enjoy the work no less. In the end, it is the work that matters. The other benefits are actually easy to get in normal life, for example by having a family. By cutting my life off from my work, I think I have a much healthier work experience than in the Google days.

Query Engines: Push vs. Pull

http://justinjaffray.com/query-engines-push-vs.-pull/

I spent some time working on a query optimizer while doing database systems research. At that time, everyone in the lab believed push-based execution was a clear win. Beyond cache efficiency, the data can even be kept entirely in registers if the operator footprint is small, as mentioned in Thomas Neumann's paper.

Still, the shortcomings of push-based execution mentioned in the article are valid. In a merge join, the push-based model may need to materialize more data than necessary. Though I fail to see how LIMIT alone can create problems for the push-based model. I may need to spend some time reading Shaikhha's paper :)

Branch predictor: How many "if"s are too many? Including x86 and M1 benchmarks!

https://blog.cloudflare.com/branch-predictor/

Got to know the performance impact of the Branch Target Buffer for the first time. Great writeup, with good experiments and analysis.

How I Learned To Love Tail Calls in C

https://blog.reverberate.org/2021/04/21/musttail-efficient-interpreters.html

Theoretically, this control flow graph paired with a profile should give the compiler all of the information it needs to generate the most optimal code. In practice, when a function is this big and connected, we often find ourselves fighting the compiler. It spills an important variable when we want it to keep it in a register. It hoists stack frame manipulation that we want to shrink wrap around a fallback function invocation. It merges identical code paths that we wanted to keep separate for branch prediction reasons.

The compiler generates better-optimized code when a switch statement is rewritten as a chain of tail calls? Great writeup.

Brendan Gregg's Blog - USENIX LISA2021 Computing Performance: On the Horizon

https://brendangregg.com/blog/2021-07-05/computing-performance-on-the-horizon.html

My key takeaway: will the NUMA architecture die out?

Modern Microprocessors - A 90-Minute Guide!

http://www.lighterra.com/papers/modernmicroprocessors/

  • "Register Renaming & OOO"

RISC, software register renaming versus hardware renaming

  • "The Brainiac Debate"

  • "The Power Wall & The ILP Wall"

One of the most interesting members of the RISC-style x86 group was the Transmeta Crusoe processor, which translated x86 instructions into an internal VLIW form, rather than internal superscalar, and used software to do the translation at runtime, much like a Java virtual machine. This approach allowed the processor itself to be a simple VLIW, without the complex x86 decoding and register-renaming hardware of decoupled x86 designs, and without any superscalar dispatch or OOO logic either. The software-based x86 translation did reduce the system's performance compared to hardware translation (which occurs as additional pipeline stages and thus is almost free in performance terms), but the result was a very lean chip which ran fast and cool and used very little power. A 600 MHz Crusoe processor could match a then-current 500 MHz Pentium III running in its low-power mode (300 MHz clock speed) while using only a fraction of the power and generating only a fraction of the heat. This made it ideal for laptops and handheld computers, where battery life is crucial. Today, of course, x86 processor variants designed specifically for low power use, such as the Pentium M and its Core descendants, have made the Transmeta-style software-based approach unnecessary, although a very similar approach is currently being used in NVIDIA's Denver ARM processors, again in the quest for high performance at very low power.

  • "More Cores or Wider Cores?"
    cost of OOO SMT area

but the maximum outright single-threaded performance of large cores, perhaps in the future we might see asymmetric designs, with one or two big, wide, brainiac cores plus a large number of smaller, narrower, simpler cores. In many ways, such a design makes the most sense – highly parallel programs would benefit from the many small cores more than a few large ones, but single-threaded, sequential programs want the might of at least one large, wide, brainiac core, even if it does take four times the area to provide only twice the single-threaded performance.

Amazing article from beginning to end. Literally spent more than two hours reading it.

Introduction to open source private LTE and 5G networks

https://ubuntu.com/blog/introduction-to-open-source-private-lte-and-5g-networks

Very attractive stack. It is safer than WiFi and probably more suitable for building a more secure smart home setup. But it looks like you may need a license to operate it?

Some comments from HackerNews discussion:

I'd love to setup a small scale network for personal use, but the elephant in the room is licensing.... Is it actually possible to formally license a DIY LTE network? (Or AMPS, 2G, 3G). In the UK, for example it is possible to get experimental licenses in principle, but I doubt it would be possible for an individual to legally operate a permanent or long term network on/near commercially allocated frequencies? You need to put a custom SIM into every device, and set up a PLMN identity for the network, which is just a 5 or 6 digit number that identifies the network to handsets. The SIM tells the phone what network it should try to join, and contains the crypto keys used to do authentication with the network. You can often get a PLMN allocated by your national telecoms regulator, or use one in the 999/xx range, which are set aside for private, uncoordinated use.

linear interpolation in frequency domain

http://conf.uni-obuda.hu/cinti2009/54_cinti2009_submission.pdf

I joined a computer graphics basics training for new hires. It covered texture filtering, and I started wondering how effective the bilinear filter (the most common one) is at reducing aliasing. Here is the answer in the frequency domain: bilinear interpolation is already a pretty good low-pass filter. This paper covers cubic interpolation as well, with a comparison showing how much better the cubic one is.

The Ultimate Guide to Inflation

https://www.lynalden.com/inflation/

Banks, QE, and Money-Printing

https://www.lynalden.com/money-printing/

These two come as a series. Both are well written. Some of the facts are actually easy to think through, for example why the Great Financial Crisis (2008) did not cause big inflation. But how the fiscal authorities play together with the banking system was new to me. I also found the microeconomic examples showing how loans/savings affect the broad money supply very interesting.

Software reciprocal

http://pvk.ca/Blog/LowLevel/software-reciprocal.html

A very powerful optimization technique. I think it falls into the category of numerical analysis. An IEEE float is composed of three parts: sign, exponent, and significand. When computing the reciprocal, simply negating the exponent gives a pretty close approximation. However, it is not accurate enough to let Newton's method converge quickly. Therefore, the author did more numerical analysis to find a magic number to subtract any float's bits from, which produces a close enough reciprocal. It is much simpler and easier to understand than the "fast inverse square root".

Beating the L1 cache with value speculation

https://mazzo.li/posts/value-speculation.html

Value dependencies can sometimes be eliminated; a good technique learned.

The stack monoid revisited

https://raphlinus.github.io/gpu/2021/05/13/stack-monoid-revisited.html

A follow-up on the stack monoid. To be honest, I do not fully understand the algorithm, and it is a little cumbersome to read. But I keep trying to read it, because designing a monoid for an algorithm in order to parallelize it is probably a universal approach.

The key takeaway from the last read: PRAM can give a theoretical upper bound on parallelism.

Finding Windows HANDLE leaks, in Chromium and others

https://randomascii.wordpress.com/2021/07/25/finding-windows-handle-leaks-in-chromium-and-others/

Good writeup. I don't have much Windows experience. On Linux, checking for fd leaks is much easier, but I am not aware of a tool that can visualize them well. WPA is pretty impressive.

Some funny comments from HackerNews:

First thing I do on a new Windows setup is turn on "handles", "threads", "Commit size", "NP pool" columns in task manager details... If you want to see some real offenders... have a quick look at Asus's "LightingService.exe" (the daemon that controls their rgb LED coloring suite). Gets up to 2m+ handles after a day or two of running on my system.

The Night Watch

https://www.usenix.org/system/files/1311_05-08_mickens.pdf

In case the title is misleading: it is actually an article from a systems researcher talking about how "great" systems researchers and developers are. To be honest, I don't like the attitude in this article that systems people are superior to others. Though it might just be an acceptable literary technique.

I have to admit the language is vivid and the metaphors are unique. IMO, it might be a good reference for managers in charge of a bunch of systems people to quote some motivational slogans. And it might work well. Systems people are indeed "nerds" :)

What mRNA is Good For, And What It Maybe Isn’t

https://blogs.sciencemag.org/pipeline/archives/2021/06/29/what-mrna-is-good-for-and-what-it-maybe-isnt

Okay, it is probably hard for mRNA to find its place outside of vaccines :(

  1. Protein production from mRNA does not last; you have to keep receiving the mRNA
  2. Not triggering any immune response is hard (the opposite of what vaccination wants)
  3. Targeting is hard
  4. We have several better alternatives to mRNA

Visibility Buffer Rendering with Material Graphs

http://filmicworlds.com/blog/visibility-buffer-rendering-with-material-graphs/

Great writing, great experiments and great numbers.

It presents a new rendering technique besides forward rendering and deferred rendering. Visibility buffer rendering further breaks the second pass of deferred rendering into two.

The most amazing part is that almost everything is done in compute shaders! Handwritten shader code performs the interpolation and texel fetching. As far as I know, those two steps are done by dedicated hardware in GPUs. It means we could throw away part of the circuitry on the GPU and still get almost the same rendering performance with this technique. That is AMAZING!!

As a graphics driver developer, I believe one day the GPU (if it is still called that) will have no hardware blocks for the graphics pipeline, only the compute pipeline. I think this algorithm will play a big role on our way toward that future.

Okay, forget the subjective opinions above. When I was reading the quad utilization part, I was not convinced that it would affect performance much. A modern GPU runs more than 30 threads at once, so low quad utilization for fewer than 30 or so pixels should cost nothing. However, the experiments show that visibility buffer rendering performs much better than deferred rendering in the high-triangles-per-pixel cases. I am not sure whether there are reasons other than quad utilization here.

The experiments only show the single-material case, though I suppose visibility buffer rendering will perform even better with more materials. Doing everything in compute shaders means far fewer PSO switches, and PSO switches are a really big performance killer in today's games.

Decoupled Visibility Multisampling

http://filmicworlds.com/blog/decoupled-visibility-multisampling/

A follow-up to the article above: anti-aliasing in visibility buffer rendering. I don't understand the algorithm, but it's nice to see there is a solution for aliasing.

Project Starline: Feel like you're there, together

https://blog.google/technology/research/project-starline/

So cool! A light field display from a device that looks like a normal monitor. I think it is cooler than the streaming and compression techniques used here!!

I always wonder why light field rendering has not become a thing. For almost-static scenes, we could do one offline light field rendering pass; then, with a query and some sampling, we could get pretty realistic views efficiently. I think it is quite achievable in today's "cloud" world.

On leaving California and the Silicon Valley

https://bartwronski.com/2021/06/28/on-leaving-california-and-the-silicon-valley/

Haha, I don't like Cali either. Not because it lacks city life, but because it has very poor natural scenery (Bay Area only; a two-hour drive to Yosemite does not count). For example, I spent a summer in Boulder, CO, which has far more approachable and better mountains (better than Yosemite): countless peaks over 3000 meters, some even within a 20-minute walk. And Boulder has plenty of city life, too. I believe it is the same in other small towns in the middle states.

Naval Architecture

https://ciechanow.ski/naval-architecture/

Read to learn something about boats just for fun. The article is very approachable, and the second half about propulsion is mathematically very interesting.

Programming Language Memory Models

https://research.swtch.com/plmm

I spent maybe two hours reading this, but I think it was really worth it. Memory models are hard, and putting several memory models together makes it even harder. I tried very hard to keep my thinking straight. The most surprising thing is that a non-coherent read/write is defined behavior in Java?! Okay, that does make finding and debugging bugs much easier.

Anatomy of a Linux DNS Lookup

https://zwischenzugs.com/2018/06/08/anatomy-of-a-linux-dns-lookup-part-i/

I always get very confused when setting up DNS at home. This series of articles does a great job showing how DNS works by reading through the different configuration files. DNS really is a weird design on Linux :(