I have uploaded a very brief document on optimizing several families of reversible circuits.
Original version
An engineering issue.
If you would like to contribute to reversible computing, why don’t you make a chip?
There are many academic papers describing reversible logic circuit families, the results of circuit simulations, and sometimes measurements of fabricated chips. Making a chip used to require a project with a million-dollar (USD) budget and was further inaccessible to individuals and small groups because commercial semiconductor fabs required Non-Disclosure Agreements (NDAs).
Yet the Sky130 open-source chip design program https://github.com/google/skywater-pdk, which emerged in the last couple of years, allows anybody with internet access and a laptop to design a chip. Fabbing the design is possible through a Google/SkyWater free multi-project wafer fab https://www.skywatertechnology.com/mpw/open-source-mpw-program/. Basic testing should be possible with readily available lab equipment.
Let me describe what I did over a period of a couple weeks. Individuals and small groups can make important innovative contributions to technology at certain phases of development. So, I tried to see what I could do from an office in an extra bedroom in my house and with a handful of Windows 10/11 PCs. Having worked in the field for some time, my experience is not representative of a new entrant, but I wanted to demonstrate setting up the open-source tool chain to become an effective platform for reversible computing R&D.
I devised the reversible logic family Q2LAL [DeBenedictis 21], [DeBenedictis 22] about a year ago and it had never been subject to intensive engineering analysis or physical testing, so it seemed like a good starting point. The basic replication unit in Q2LAL is an adiabatic amplifier followed by a transmission gate, illustrated in [DeBenedictis 21, Fig. 3 and 4c]. The circuit ultimately has to be laid out in arrays that effectively exploit symmetries inscrutably buried in the circuit structure. Fig. 1 is my second attempt at layout planning. It is in an ad hoc format that I found convenient and which I will use for explanation.
The top quarter, outlined by a dotted gray rectangle with Âi-1 on the left and Q̂i on the right, has six transistors organized in a row. Each transistor is represented by the schematic symbol for a transistor (identifiable as a short vertical red line). The four transistors on the left of each row comprise an adiabatic amplifier [DeBenedictis 21, left side of Fig. 4c] and the two on the right are a transmission gate [DeBenedictis 21, small rectangle in Fig. 3b].
I created the layout in Fig. 2 using Magic, a tool in the open PDK. For the reader’s orientation, the six transistors in each row of Fig. 1 each correspond to a red (vertical) rectangle crossing a wider green or orange horizontal region. The vertical power and clock lines were not present at this point.
I created the layout in Fig. 2 with Fig. 1 visible in one window of a PC and the Magic workspace in a second window, going back and forth between the two until I could make a layout that was both compact and satisfied the constraints.
Q2LAL has dual-rail signaling, with the second-from-the-top gray rectangle in Fig. 1 processing the – rail [DeBenedictis 21, right side of Fig. 4c]. The circuits need signals from both rails, leading to the crossover between + and – signals in the top two quarters of Fig. 1. The top two quarters are flipped vertically with respect to each other to accommodate the crossover (and likewise for the third and fourth quarters).
Q2LAL circuits compute in the forward direction and recover energy in the reverse direction. Energy recovery uses the circuit in the top two quarters, but horizontally flipped. Thus, the bottom half of Fig. 1 was created by flipping the graphic of the top half in Microsoft Word.
The entire structure in Fig. 1 must be designed so it can be replicated both horizontally and vertically. Horizontal replication controls the length of the shift register and vertical replication increases the word-width of the stored information.
Replication leads to geometric constraints. For example, the vertical blue (metal) lines would carry power (V), ground (G) and various clock phases φ to all the circuits in a column. But there is a complication. The circuit has one set of wires for the even-numbered circuits and another for the odd-numbered circuits. Thus, the symmetries of the circuit must cause the extension of each blue line to line up with a gap in the circuit above and below. This is easily seen to be true, but creating the layout required solving a puzzle.
I thus enhanced the layout in Fig. 2 to accommodate arrays, yielding Fig. 3.
Fig. 4 shows four copies of the layout in Fig. 3 in a stack with the following orientation changes:
This yields a Q2LAL stage, such as the left or right half of [DeBenedictis 21, Fig. 3b]. An experienced designer will see that I have not fully mastered Sky130 and Magic.
The next step was to integrate the layout with the Q2LAL ngspice simulations discussed elsewhere on this site (https://revcomp.org/q2lal/). Much of the dissipation in integrated circuits is due to the wires rather than the transistors, so simulation of the layout with wire capacitance would tend to validate the advantage of reversible computing over CMOS.
Unlike academic papers where schematics are created with just thinking, layouts like Fig. 2 can be automatically “extracted” to yield a netlist such as the one below. The netlist is pretty much what a designer would create manually for an academic paper and it can be verified as correct in a few minutes through examination.
* SPICE3 file created from q2v13.ext - technology: sky130A
.subckt q2v13
X0 G ck T G sky130_fd_pr__nfet_01v8 ad=4.65e+11p pd=3.9e+06u as=1.28e+12p ps=1.07e+07u w=450000u l=150000u
X1 T -A G G sky130_fd_pr__nfet_01v8 ad=0p pd=0u as=0p ps=0u w=450000u l=150000u
X2 T A phi0 G sky130_fd_pr__nfet_01v8 ad=0p pd=0u as=2.7e+11p ps=2.1e+06u w=450000u l=150000u
X3 -Q phi3 T Vp sky130_fd_pr__pfet_01v8 ad=2.25e+11p pd=1.9e+06u as=2.7e+11p ps=2.1e+06u w=450000u l=150000u
X4 -Q phi7 T G sky130_fd_pr__nfet_01v8 ad=4.075e+11p pd=3.6e+06u as=0p ps=0u w=450000u l=150000u
X5 T -A phi0 Vp sky130_fd_pr__pfet_01v8 ad=0p pd=0u as=2.25e+11p ps=1.9e+06u w=450000u l=150000u
.ends
However, design tools can also extract information about wires, such as length, resistance, and capacitance. This information is human readable, but understanding the impact on performance, dissipation, etc. requires circuit simulation. For example, the lines starting with “cap” in the excerpt below give the capacitance between pairs of wires, such as φ0 and –Q.
timestamp 1651599653
version 8.3
tech sky130A
style ngspice()
scale 1000 1 500000
resistclasses 4400000 2200000 950000 3050000 120000 197000 114000 191000 120000 197000 114000 191000 48200 319800 2000000 48200 48200 12800 125 125 47 47 29 5
parameters sky130_fd_pr__nfet_01v8 l=l w=w a1=as p1=ps a2=ad p2=pd
parameters sky130_fd_pr__pfet_01v8 l=l w=w a1=as p1=ps a2=ad p2=pd
node "T" 3028 286.743 -80 40 ndif 0 0 0 0 0 0 0 0 51200 2140 10800 420 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 50400 2360 0 0 0 0 0 0 0 0 0 0 0 0
[snip]
cap "phi0" "-Q" 30.2468
cap "-A" "ck" 13.56
[snip]
device msubckt sky130_fd_pr__nfet_01v8 1000 40 1001 41 l=30 w=90 "G" "phi7" 60 0 "T" 90 0 "-Q" 90 0
[snip]
Obtaining this information requires a test layout of the circuit so wire geometry is known and can be subject to a numerical assessment of how close the wires come to each other. Some circuits have more internal interconnectivity than other circuits, which leads to denser wiring with more crosstalk that ultimately reduces speed and increases power. These issues are inherent to any design process, but can be controlled with tools such as those that produced the diagrams in this document.
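The coupling capacitances in an extraction file like the excerpt above can be collected with a few lines of scripting. The following is a minimal sketch, not part of the repository: it assumes the `cap "node1" "node2" value` line shape shown in the excerpt, assumes the values are in attofarads, and uses hypothetical helper names (`coupling_caps`, `per_node_totals`).

```python
import re
from collections import defaultdict

# The two "cap" records quoted in the excerpt above.
EXT_EXCERPT = '''\
cap "phi0" "-Q" 30.2468
cap "-A" "ck" 13.56
'''

# Matches: cap "node1" "node2" value
CAP_LINE = re.compile(r'^cap\s+"([^"]+)"\s+"([^"]+)"\s+([-+0-9.eE]+)\s*$')

def coupling_caps(ext_text):
    """Return {(node_a, node_b): capacitance} for every 'cap' record."""
    caps = {}
    for line in ext_text.splitlines():
        match = CAP_LINE.match(line.strip())
        if match:
            caps[(match.group(1), match.group(2))] = float(match.group(3))
    return caps

def per_node_totals(caps):
    """Sum the coupling capacitance attached to each node."""
    totals = defaultdict(float)
    for (node_a, node_b), value in caps.items():
        totals[node_a] += value
        totals[node_b] += value
    return dict(totals)

caps = coupling_caps(EXT_EXCERPT)
totals = per_node_totals(caps)
for node, value in sorted(totals.items()):
    # Units are whatever the .ext file uses (assumed attofarads here).
    print(f"{node}: {value:.4f}")
```

A per-node total of this kind gives a quick sense of which wires pick up the most coupling before any values are pasted into a simulation deck.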
So, here is the unexpected and somewhat untidy story. The English description used a hierarchy of (1) dual-rail adiabatic amplifier [DeBenedictis 21, Fig. 4c] (2) dual-rail adiabatic amplifier+latch (3) bi-directional stage [DeBenedictis 21, Fig. 3b], but the layout was constructed as (1) adiabatic amplifier+latch+wire flyovers, Fig. 3 (2) dual-rail (3) bidirectional stage using wire flyovers, Fig. 4. So, I had to substantially rewire the sQ2.cir file to support the different hierarchy. The result is sQ130.cir in the repository. Note that sQ130.cir has more functions than the layout because it includes a test harness, creates plots, etc.
I then took the .spice file with parasitic extraction that Magic created and manually pasted the capacitance values into sQ130.cir. I included .if (0)-type lines so I could compare the dissipation with and without the parasitics. The results were 1.99532E-14 J with parasitics on and 6.02372E-15 J with parasitics off, so the extracted wire capacitance raises the simulated dissipation by a factor of about 3.3.
I fulfilled my objective of validating the suitability of the open-source design flow, although I did not carry this specific project to an R&D milestone.
If I were to proceed as an individual, it would take me another couple months to make a test design that would contribute to the field (such as the note-note design https://revcomp.org/note-note/). The test design could be fabricated with the Google-subsidized multi-project wafer service, which has about a four-month turn-around time. After chips come back, testing is possible with readily available lab equipment – although some more exotic testing (cryo?) would take a lot of specialized equipment.
In a very brief summary, I loaded the Docker version of the SkyWater Open Source PDK https://github.com/google/skywater-pdk on several Windows 10 and 11 laptops in my office. I have general knowledge of IC design from a university class, but the YouTube video Creating a Hierarchical Layout in Magic Using the Sky130 PDK [bminch 21] is a tutorial that describes commands specific to the open-source Magic tool.
The Magic files are QAAmp11.mag, QLatch.mag, and QPhase.mag, available at https://github.com/erikdebenedictis/Shift. The GitHub repository is not fully set up and this information may change.
An earlier version of the Magic data files can be downloaded with the links below. Due to web page limitations, the file name extension has been changed from .mag to .txt. You will need to change the file names back to .mag, after which they can be loaded into the Magic tool.
I recommend the approach above for students and hobbyists. You can make a difference even with modest resources.
There is enough promise and opportunity in reversible computing that larger organizations and funding agencies are ponying up real R&D money. Such organizations and the people in them can perhaps take inspiration from this article and realize that the entry point may be lower than it was a few years ago.
[DeBenedictis 21] Quiet 2-Level Adiabatic Logic. Zettaflops, LLC Technical Report ZF009 https://ar.zettaflops.org/CATC/Q2LAL.pdf
[DeBenedictis 22] Q2LAL page on this website https://revcomp.org/Q2LAL.
[bminch 21] bminch. Creating a Hierarchical Layout in Magic Using the Sky130 PDK, https://www.youtube.com/watch?v=RPppaGdjbj0
High performance adiabatic computing systems will need an engineered powertrain, which will be new technology given that current adiabatic demonstrations are too small to reveal important issues. This will be true even if the high performance computer is a quantum computer.
Today’s high performance computing systems dissipate around 200 W per chip. For high performance computers, the objective is more performance for the same 200 W per chip, not lower power chips.
Reversible circuits use multiple AC power-clocks. If the chip is running at 200 W per chip, the power-clock generators will need to be far enough away from the chip to avoid excessive dissipation in a small volume causing cooling difficulties. Hence, a separation will be required between the power-clock generators and the chip, as shown in the figure below.
CMOS microprocessors require clock waveforms to be generated within a few centimeters of the chip to avoid waveform distortion. However, reversible systems send power over the AC power-clock conductors, which carry a lot of power and hence require a larger separation than microprocessor clock lines. In conjunction with the requirement for precise waveforms, both classical and quantum system analyses [ZF008] indicate that the power-clock conductors are long enough that transmission line effects, i.e., characteristic impedance and reflections, need to be considered.
In summary, document [ZF008] describes how the power-clock generators must launch predistorted power-clocks into the transmission lines such that they arrive with the desired waveform.
Document [ZF008] also describes how the load presented by the chip affects the proper predistortion, and how different circuit families and design techniques can reduce load variance. Some design techniques inevitably cause load variance, such as turning clocks to unused portions of the circuit on and off. The document describes how power-clock generators can change their predistorted waveform in anticipation of changes in chip load.
The page http://revcomp.org/optimal-ramps further describes how the linear ramp waveform frequently found in the literature is a good starting point but the lowest dissipation shape is flatter in the middle (an “s” shape). The overall result is that predistortion needs to be applied to the “s” shape, not a linear ramp.
A single shift register stage places very little loading on a 50 Ohm coax and produces little distortion. While a single shift register stage may be easy to simulate in Spice, it is not representative of a high performance computer. Scaling the shift register length to 5,000 stages is more representative of a scaled-up system, but the circuit is too big for Spice simulation. To address these issues, the script in the appendix of [ZF008] scales transmission line parameters. Multiplying the characteristic impedance by 5,000 in a Spice simulation model yields the same voltage waveform as multiplying the number of shift register stages by 5,000. See [ZF008] for details.
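The scaling argument can be checked with a simple lumped model. The sketch below is not the script from [ZF008]; it assumes the stages appear as identical admittances in parallel at the end of the line, models each stage as a series R-C with hypothetical values, and looks at one frequency. Under those assumptions the load voltage depends only on the product of characteristic impedance and total load admittance, so 5,000 stages on 50 Ohm and one stage on 250,000 Ohm give the same result.

```python
import math

def stage_admittance(freq_hz, r_ohm=1e3, c_farad=10e-15):
    """Admittance of one stage modeled as series R-C (HYPOTHETICAL values)."""
    omega = 2 * math.pi * freq_hz
    impedance = r_ohm + 1 / (1j * omega * c_farad)
    return 1 / impedance

def load_voltage_ratio(z0_ohm, n_stages, freq_hz):
    """V_load / V_incident at the end of the line, i.e. 1 + reflection coefficient."""
    y_load = n_stages * stage_admittance(freq_hz)
    return 2 / (1 + z0_ohm * y_load)

FREQ_HZ = 100e6  # hypothetical frequency of interest

single = load_voltage_ratio(50.0, 1, FREQ_HZ)        # one stage on 50 Ohm coax
full = load_voltage_ratio(50.0, 5000, FREQ_HZ)       # 5,000 stages on 50 Ohm coax
scaled = load_voltage_ratio(50.0 * 5000, 1, FREQ_HZ) # one stage, impedance x 5,000

print("single stage, 50 Ohm:  ", abs(single))
print("5,000 stages, 50 Ohm:  ", abs(full))
print("1 stage, 250 kOhm line:", abs(scaled))
```

In this model the single stage barely loads the line (the ratio stays close to the open-circuit value of 2), while the 5,000-stage and impedance-scaled cases agree with each other.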
The script in [ZF008] simulates driving a circuit with predistortion to compensate for transmission line distortion.
Say you are tasked to make an ASIC for a computational accelerator and you have the ability to create a CMOS chip that can dissipate 200 W. The task is to assess the feasibility of using reversible logic circuits, yet still using the same CMOS process. Based on the figure above, say the power-clocks will drive 2,000 W into the chip, of which 1,800 W comes out. That leaves the 200 W dissipation, so you will not have to change the chip’s cooling.
First, engineer the power-clocks and figure out how to cool them. This will give an estimate of the volume required by the power-clock generators and 2,000 W of heat exchangers. This will then give an idea of how long the cables will be and hence the impact of cable RF losses and reflections. Then figure out the predistorted waveform, starting with the objective of a predistortion to yield a linear ramp and then a predistortion for an optimized waveform.
The second topic is to repeat the task above for a 4 K stage in a quantum computer. Say the quantum computer needs a 1 MHz clock, which will determine the fraction of energy dissipated versus ejected. From this, engineer room-temperature power-clocks and the wiring to get the signals to the 4 K stage. This wiring will be longer than in the previous example because it will have to pass through a cryostat boundary. Include filtering for extra credit. Figure out the predistorted waveform for both linear and optimized waveforms at the chip.
[ZF008] Erik DeBenedictis. Energy Management for Adiabatic Circuits. Zettaflops, LLC Technical Report ZF008, v1.2, April 15, 2021 https://www.debenedictis.org/erik/CATC/EMgt4Adia-ZF008-v1.2.pdf.
Additional information
The literature almost universally depicts reversible logic clocks with linear rising and falling segments, yet ramps that are flatter in the middle dissipate less. The reason for linear ramps is that the simplest explanation of adiabatic behavior assumes transistors have a fixed Ron when conducting, and linear ramps have the lowest dissipation when Ron is constant. The simplest explanation is given first.
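The constant-Ron case can be put into numbers with the standard first-order estimate for charging a capacitance through a fixed resistance with a linear ramp: the charging current is constant, so the dissipated energy is I²·Ron·T = (Ron·C/T)·C·V². The sketch below uses hypothetical component values and is only valid when the ramp time is much longer than Ron·C.

```python
import math

def linear_ramp_dissipation(r_on, c_load, v_swing, t_ramp):
    """Energy lost in r_on while charging c_load with a linear ramp.

    First-order estimate, valid when t_ramp is much longer than r_on * c_load.
    """
    current = c_load * v_swing / t_ramp      # constant charging current
    return current ** 2 * r_on * t_ramp      # equals (RC/T) * C * V^2

# HYPOTHETICAL values chosen only for illustration.
R_ON = 10e3       # ohms
C_LOAD = 10e-15   # farads
V_SWING = 1.8     # volts

abrupt = 0.5 * C_LOAD * V_SWING ** 2         # conventional abrupt switching loss
print(f"abrupt switching: {abrupt:.3e} J")
for t_ramp in (1e-9, 10e-9, 100e-9):
    energy = linear_ramp_dissipation(R_ON, C_LOAD, V_SWING, t_ramp)
    print(f"ramp {t_ramp:.0e} s: {energy:.3e} J")
```

Each tenfold increase in ramp time cuts the estimated dissipation by a factor of ten, which is the basic trade between speed and energy in adiabatic circuits.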
Ron actually decreases with larger forward bias on the gate, and behavior is further complicated by saturation. The lowest forward bias is at the middle of the swing, or the midpoint of the ramp.
In simple terms, the best waveform rises quickly when the transistor is on strongly or off, but needs to slow down when the resistance is higher to avoid I²R losses.
Ramps are described in this software as having unit width and height, so the baseline linear ramp goes from (t=0, v=0) to (t=1, v=1). Pretty good ramps can be created by dividing the unit time into five linear sub-ramps of length .2. With two parameters, h1 and h2 (h for height), the segmented ramp will go through the following points: (t=0, v=0), (t=.2, v=h1), (t=.4, v=h2), (t=.6, v=1-h2), (t=.8, v=1-h1), (t=1, v=1).
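The point list above translates directly into a piecewise-linear function. This is a minimal sketch rather than the software on this site; the function names `ramp_points` and `ramp_value` are made up for illustration.

```python
import math

def ramp_points(h1, h2):
    """Breakpoints (t, v) of the five-segment unit ramp described above."""
    return [(0.0, 0.0), (0.2, h1), (0.4, h2),
            (0.6, 1.0 - h2), (0.8, 1.0 - h1), (1.0, 1.0)]

def ramp_value(points, t):
    """Piecewise-linear interpolation of the ramp at time t (clamped to [0, 1])."""
    if t <= points[0][0]:
        return points[0][1]
    for (t0, v0), (t1, v1) in zip(points, points[1:]):
        if t <= t1:
            return v0 + (v1 - v0) * (t - t0) / (t1 - t0)
    return points[-1][1]

# h1=.2, h2=.4 reproduces the baseline linear ramp.
linear = ramp_points(0.2, 0.4)
# Any h1 > .2 and h2 > .4 gives a ramp that is steeper at the ends.
shaped = ramp_points(0.34, 0.46)

for t in (0.0, 0.1, 0.3, 0.5, 0.7, 0.9, 1.0):
    print(f"t={t:.1f}  linear={ramp_value(linear, t):.3f}  shaped={ramp_value(shaped, t):.3f}")
```

By construction the ramp is symmetric about (t=.5, v=.5), so only h1 and h2 need to be chosen.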
In developing software on this site, it was found that dissipation can be reduced by around 30% through proper ramp shape, with 5 segments adequate to get within about 1% of optimal.
For many circuits, the parameters o1=.20 h1=.34 o2=.40 h2=.46 are a good starting point. The values h1 and h2 can be fine-tuned by optimization. For additional generality, the time divisions are defined by statements o1=.2 and o2=.4 (o for offset), although changing them is not necessary in most cases.
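Why a ramp that is flatter in the middle can win is easy to see in a toy model. The sketch below is not the optimization software on this site and does not use a transistor model: it assumes the load voltage tracks the ramp, so that dissipation is proportional to the integral of Ron(v)·(dv/dt)² over the ramp, and it uses a purely hypothetical Ron(v) with a Gaussian bump at mid-swing. The numbers it prints illustrate the trend only.

```python
import math

def breakpoints(o1=0.20, h1=0.34, o2=0.40, h2=0.46):
    """Breakpoints (t, v) of the symmetric five-segment ramp."""
    return [(0.0, 0.0), (o1, h1), (o2, h2),
            (1.0 - o2, 1.0 - h2), (1.0 - o1, 1.0 - h1), (1.0, 1.0)]

def r_on(v, peak=10.0, width=0.1):
    """HYPOTHETICAL normalized on-resistance, largest at mid-swing (v = 0.5)."""
    return 1.0 + peak * math.exp(-((v - 0.5) / width) ** 2)

def relative_dissipation(points, steps_per_segment=2000):
    """Midpoint-rule estimate of the integral of r_on(v) * (dv/dt)^2 dt."""
    total = 0.0
    for (t0, v0), (t1, v1) in zip(points, points[1:]):
        slope = (v1 - v0) / (t1 - t0)
        dt = (t1 - t0) / steps_per_segment
        for k in range(steps_per_segment):
            v_mid = v0 + slope * (k + 0.5) * dt
            total += r_on(v_mid) * slope ** 2 * dt
    return total

# h1=.20, h2=.40 is the baseline linear ramp; the defaults are the starting point above.
linear = relative_dissipation(breakpoints(h1=0.20, h2=0.40))
shaped = relative_dissipation(breakpoints())

print(f"linear ramp:    {linear:.4f}")
print(f"segmented ramp: {shaped:.4f}")
print(f"ratio:          {shaped / linear:.3f}")
```

With a constant Ron (set `peak=0.0`) the same calculation favors the linear ramp, consistent with the simple explanation given earlier; the segmented ramp only pays off when resistance is higher near the middle of the swing.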