

## LJMU Research Online

Zheng, X, Zhang, C, Lv, F, Zhao, F, Yuan, S, Yue, S, Wang, Z, Li, F, Wang, Z and Jiang, H

A 40-Gb/s Quarter-Rate SerDes Transmitter and Receiver Chipset in 65-nm CMOS

http://researchonline.ljmu.ac.uk/id/eprint/9304/

Article

**Citation** (please note it is advisable to refer to the publisher's version if you intend to cite from this work)

Zheng, X, Zhang, C, Lv, F, Zhao, F, Yuan, S, Yue, S, Wang, Z, Li, F, Wang, Z and Jiang, H (2017) A 40-Gb/s Quarter-Rate SerDes Transmitter and Receiver Chipset in 65-nm CMOS. IEEE Journal of Solid-State Circuits, 52 (11). pp. 2963-2978. ISSN 0018-9200

LJMU has developed LJMU Research Online for users to access the research output of the University more effectively. Copyright © and Moral Rights for the papers on this site are retained by the individual authors and/or other copyright owners. Users may download and/or print one copy of any article(s) in LJMU Research Online to facilitate their private study or for non-commercial research. You may not engage in further distribution of the material or use it for any profit-making activities or any commercial gain.

The version presented here may differ from the published version or from the version of the record. Please see the repository URL above for details on accessing the published version and note that access may require a subscription.

For more information please contact researchonline@ljmu.ac.uk

http://researchonline.ljmu.ac.uk/

# A 40-Gb/s Quarter-Rate SerDes Transmitter and Receiver Chipset in 65-nm CMOS

Xuqiang Zheng, Chun Zhang, Member, IEEE, Fangxu Lv, Feng Zhao, Shuai Yuan, Shigang Yue, Senior Member, IEEE, Ziqiang Wang, Fule Li, Zhihua Wang, Fellow, IEEE, and Hanjun Jiang, Member, IEEE

Abstract—This paper presents a 40-Gb/s transmitter (TX) and receiver (RX) chipset for chip-to-chip communications in a 65-nm CMOS process. The TX implements a quarter-rate multi-multiplexer (MUX)-based four-tap feed-forward equalizer (FFE), where a charge-sharing-effect elimination technique is introduced into the 4:1 MUX to optimize its jitter performance and power efficiency. The RX employs a two-stage continuous-time linear equalizer as the analog front end and integrates a low-cost sign-based zero-forcing engine relying on edge-data correlation to automatically adjust the tap weights of the TX-FFE. By embedding low-pass filters with an adaptively adjusting bandwidth into the data-sampling path and adopting high-linearity compensating phase interpolators, the clock data recovery achieves both high jitter tolerance and low jitter generation. The fabricated TX and RX chipset delivers 40-Gb/s PRBS data at BER $< 10^{-12}$ over a channel with >16-dB loss at half-baud frequency, while consuming a total power of 370 mW.

*Index Terms*—4:1 multiplexer (MUX), 40 Gb/s, charge-sharing effect, clock data recovery (CDR), continuous-time linear equalizer (CTLE), edge-data correlation, feed-forward equalizer (FFE), jitter suppression, jitter tolerance (JTOL), low-pass filters (LPFs), sign-based zero-forcing (S-ZF), transmitter (TX) and receiver (RX) chipset.

#### I. INTRODUCTION


THE exponential growth of cloud computing, social networking, and multimedia sharing has led to an explosive bandwidth demand on data communication in both telecommunication equipment and inter/intra data center [1], [2]. To accommodate this requirement, the data rate of the wireline serializer/deserializer (SerDes) transceiver has been continuously increased [3]-[5]. Currently, 25-28 Gb/s serial links approved by InfiniBand EDR, 32GFC, and CEI-28G have stepped into the period of industrial deployment [1], [3], [6]. The 38-64 Gb/s transceivers, which

Manuscript received March 17, 2017; revised June 23, 2017 and August 18, 2017; accepted August 21, 2017. This paper was approved by Associate Editor Jack Kenney. This work was supported in part by the China 863 Program under Grant 2013AA014302, in part by European FP7-LIVCODE under Grant 295151, and in part by HAZCEPT under Grant 318907.

X. Zheng is with the Institute of Microelectronics, Tsinghua University, Beijing 100084, China, and also with the School of Computer Science, University of Lincoln, Lincoln LN6 7TS, U.K.

C. Zhang, F. Lv, S. Yuan, Z. Wang, F. Li, Z. Wang, and H. Jiang are with the Institute of Microelectronics, Tsinghua University, Beijing 100084, China (e-mail: zhangchun@tsinghua.edu.cn).

F. Zhao and S. Yue are with the School of Computer Science, University of Lincoln, Lincoln LN6 7TS, U.K. (e-mail: syue@lincoln.ac.uk).

Color versions of one or more of the figures in this paper are available online at http://ieeexplore.ieee.org.

Digital Object Identifier 10.1109/JSSC.2017.2746672

will play a key role in the next-generation data rate supported by Ethernet 400GbE, InfiniBand HDR, and CEI-56G, have attracted increasing attention in the industry and the academia [2], [4], [5], [7]–[11]. The main challenges in designing such high-speed transceivers originate from the ever-decreasing unit interval (UI) period, which not only poses high bandwidth requirements on the blocks located at the critical path, but also makes the link timing budget extremely tight. Moreover, advanced processes cannot completely solve these problems, since the parasitic capacitances/resistances at the high-speed outputs usually do not scale well with the technology due to the bonding and/or electro-static discharge (ESD) protection requirements.

The major difficulty in the transmitter (TX) design is insufficient timing margin for the final-stage serialization. To address this issue, traditional half-rate TXs often apply extra delay matching buffers [10], [12] or phase calibration loops [9], [13], [14] to guarantee an appropriate data selection window. These techniques result in substantial power and area overhead. An alternative solution is to replace the last three 2:1 multiplexers (MUXs) with a single 4:1 MUX [4], [11], [15], [16]. The resulting quarter-rate serialization relaxes the critical path timing margin to 3 UI, halves the maximum clock speed, and saves considerable power. These benefits come with the penalty of a doubled self-loading drain capacitance, which dramatically degrades the bandwidth of the 4:1 MUX, hence limiting its maximum operation speed.
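The timing figures quoted above can be checked with a short calculation. The sketch below is illustrative arithmetic only: the function name `serializer_timing` and the idealized rule that an N:1 MUX leaves (N - 1) UI of selection margin are assumptions made here, not taken from the paper.

```python
# Illustrative link-timing arithmetic for half-rate vs. quarter-rate serialization.
DATA_RATE_GBPS = 40.0  # line rate reported in the paper


def serializer_timing(data_rate_gbps: float, mux_ratio: int) -> dict:
    """Return UI, final-stage clock frequency, and the ideal selection margin."""
    ui_ps = 1e3 / data_rate_gbps            # one unit interval in picoseconds
    clock_ghz = data_rate_gbps / mux_ratio  # clock that drives the final N:1 MUX
    # Assumption: each MUX input is valid for N UI and selected for 1 UI,
    # which leaves (N - 1) UI of margin around the selection pulse.
    margin_ui = mux_ratio - 1
    return {
        "ui_ps": ui_ps,
        "clock_ghz": clock_ghz,
        "margin_ui": margin_ui,
        "margin_ps": margin_ui * ui_ps,
    }


half = serializer_timing(DATA_RATE_GBPS, 2)     # final 2:1 MUX
quarter = serializer_timing(DATA_RATE_GBPS, 4)  # single 4:1 MUX
print(half)
print(quarter)
```

Under these assumptions, the quarter-rate case gives a 25-ps UI, a 10-GHz clock (half of the 20 GHz needed by a half-rate design), and a 3-UI margin.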

The main challenge in designing high-speed clock data recovery (CDR) is how to satisfy the bandwidth requirement while maintaining excellent jitter performance. In many SerDes protocols, the CDR bandwidth grows linearly with the data rate [2]. In a phase interpolator (PI)-based digital CDR (preferred choice because of its robustness, portability, and compactness), this requirement can be achieved by either raising the update rate of the CDR logic or increasing the data width of the CDR logic. The update rate is constrained by the synthesized logic speed while the increased data width directly increases the update step size and extends the loop latency that are both prone to enlarge the dithering jitter [2]. The CDR performance is also limited by the PI nonlinearity, which not only deteriorates the uniformity of the phase steps but also causes phase-spacing errors among the multi-phase sampling clocks. The short UI makes the CDR design even more challenging, since there is smaller margin left for the sampling deviation, clock dithering, duty cycle distortion, and quadrature phase errors [2], [5], [17].

0018-9200 © 2017 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission. See http://www.ieee.org/publications\_standards/publications/rights/index.html for more information.




Fig. 1. Block diagram of the TX chip.

For serial links operating around tens of Gb/s, adaptive equalization has become a dominant option [18]-[20]. One common reason applicable to all data rates is that the practical channel diversity and uncertainty make it difficult and unreliable to manually calibrate the equalization parameters. Another reason is that the channel loss variation becomes particularly severe for data rates beyond 10 Gb/s. This is because the fast rolling-down channel profile makes the channel loss sensitive to manufacturing errors and ambient environment changes. For example, the insertion loss variation of a CAUI-4 compliant channel has been measured to exceed 1.9 dB over a temperature range from -5 to 75 °C at 14 GHz [21].

To alleviate these difficulties and provide potential solutions for ultra-high-speed transceiver design, this paper presents a 40-Gb/s quarter-rate SerDes TX and receiver (RX) chipset. The remainder of this paper is organized as follows. Section II describes the TX chip, mainly focusing on the improved 4:1 MUX. Section III illustrates the RX chip, where the CDR performance is enhanced by introducing jitter-suppression filters and adopting high-linearity compensating PIs. In Section IV, a low-cost sign-based zero-forcing (S-ZF) adaptation algorithm relying on edge-data cross correlation is designed to achieve adaptive tap-weight adjustment for the TX-feed-forward equalizer (FFE). Section V gives the experimental results and performance comparison, and Section VI concludes this paper.


#### II. TRANSMITTER CHIP

#### A. Overall Architecture

Fig. 1 shows the block diagram of the TX chip. It contains a multi-MUX-based four-tap FFE combiner, a latch array, an on-chip PRBS generator, and a clock bundle. The parallel quarter-rate data $D0\langle n \rangle$, $D1\langle n \rangle$, $D2\langle n \rangle$, and $D3\langle n \rangle$ are generated by the on-chip PRBS generator and are then latched in an interleaved manner by the compact latch array to produce the 16-path quarter-rate data for the following four 4:1 MUXs. The desired timing relationship (see the signal positions in the latch array), which enables each MUX to share the same timing margin, is satisfied by 90°-spaced quarter-rate clock relatching. The full-rate UI-spaced outputs of the 4:1 MUXs are first buffered by the pre-drivers and then sent to the four-tap FFE combiner. In the clock bundle, a clock conditioner is employed



Fig. 2. Topology of the 4:1 MUX. (a) Conceptual schematic. (b) Timing diagram.

to convert the incoming single-ended half-rate clock into differential outputs, which are then fed into a divider (DIV2) to generate the quarter-rate I, Q clocks. After being transformed into full swing by the CML2CMOS converters, these clocks are further applied to four driving buffers and four pseudo-AND2s to produce 50% and 25% duty cycle clocks for the latch array and the 4:1 MUXs, respectively.

The main feature of the TX chip is the compact implementation of the multiple 4:1 MUX-based four-tap FFE, which not only relaxes the stringent timing requirement of the final serialization stage, but also provides a robust approach to support a wide operation range. On the other hand, the doubled self-drain capacitance in the 4:1 MUX significantly reduces the bandwidth of the MUX, which is the key factor that constrains the maximum operation speed. Additionally, the output performance highly relies on the quality of the multi-phase gating clocks. The remainder of this section will focus on the enhancement of the 4:1 MUX, including topology consideration, unit cell improvement, and clocking techniques.

#### B. Topology of the 4:1 MUX

Fig. 2(a) describes the conceptual schematic of the 4:1 MUX, which is composed of a pair of shunt-peaked loads and four identical pull-down unit cells. These unit cells are activated sequentially by the UI-spaced clocks (CK0-90-180-270) to combine the four quarter-rate data streams (D0-1-2-3) into one serial sequence (SDATA) [see Fig. 2(b)]. Unlike the 4:1 MUXs presented in [4] and [11] that combine both the ANDing operation and sampling operation into the pull-down unit cell, the unit cell in this design only performs the sampling operation while the ANDing operation is carried out by the pseudo-AND2s in the clock bundle (see Fig. 1). This splitting arrangement allows the four 4:1 MUXs in Fig. 1 to share one common ANDing stage, thus offering greater potential for power efficiency.
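The serialization order described above can be captured in a purely behavioral model. The following is a minimal sketch that ignores all analog effects; the function name `mux4_serialize` is hypothetical and the bits are plain integers.

```python
from typing import List, Sequence


def mux4_serialize(d0: Sequence[int], d1: Sequence[int],
                   d2: Sequence[int], d3: Sequence[int]) -> List[int]:
    """Interleave four quarter-rate streams into one full-rate stream.

    In every quarter-rate clock period the four unit cells are enabled one
    after another for 1 UI each (CK0, CK90, CK180, CK270), so the serial
    output order within a period is D0, D1, D2, D3.
    """
    if not (len(d0) == len(d1) == len(d2) == len(d3)):
        raise ValueError("all four lanes must have the same length")
    sdata: List[int] = []
    for word in zip(d0, d1, d2, d3):
        sdata.extend(word)  # one 4-bit word occupies 4 consecutive UIs
    return sdata


# Two quarter-rate periods produce eight serial bits.
print(mux4_serialize([1, 0], [0, 0], [1, 1], [0, 1]))
```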



Fig. 3. Traditional unit cell implementations for high-speed 4:1 MUX. (a) Data-up structure. (b) Clock-up structure.



Fig. 4. Improved unit cell implementation. (a) Schematic details. (b) Swing variations for different PVT corners.

#### C. Enhancement on the Unit Cell of the 4:1 MUX

Fig. 3 shows the two widely used traditional unit cells for high-speed 4:1 MUX, where the current source transistors are eliminated to avoid stacked devices. In the data-up structure [11], [22] [see Fig. 3(a)], the output can be corrupted by the data transitions on other branches through the forward-coupling path from the data input to the output when the MUX is performing data selection on one branch [23]. Fig. 3(b) shows the clock-up structure [1], [12], where the forward-coupling path is eliminated by moving the clocking pairs to the top. However, it suffers from a severe charge-sharing effect between the outputs VOP/VON and junction nodes X/Y. Inspired by the voltage mode source-series terminated (SST) driver discussed in [24], we introduce a pair of pre-charging transistors PM1/PM2 into the pull-down unit cell [see Fig. 4(a)]. The pre-charging PM1/PM2 and the data-gating NM1/NM2 actually constitute two inverters, which keep nodes X/Y always pre-driven to the desired states, thus eliminating the charge-sharing effect. Compared to the SST implementation in [24], the improved 4:1 MUX shows greater potential for high-speed applications. This is because it can fully exploit the process potential, as its compact NMOS driving topology naturally features fast current switching speed and small parasitic capacitance. Additionally, the speed-constraining output capacitances, including self-drain load, routing wire, and far-end driving load, can be neutralized by adopting on-chip peaking inductors. In the rest of this part, we will discuss the adverse effect of the charge sharing in the conventional clock-up structure and the favorable effect of the introduced pre-charging transistors.

1) Charge-Sharing Effect in Conventional Clock-Up Structure: The top row of the simulated waveforms in Fig. 5(a) and (b) demonstrates the two adverse effects of the charge sharing in the conventional clock-up structure [see Fig. 3(b)].



Fig. 5. Effect of the introduced PM on (a) high-level glitches and (b) edge transitions.



Fig. 6. Simulated eye-diagrams of the 4:1 MUX. (a) Without PM. (b) With PM.

Assuming that the upcoming data D0P/D0N are logic high/low, node Y is pre-discharged to the ground through NM2, which helps to speed up the falling edge. The voltage of node X depends on the previously transmitted data. If the previous D0N is logic low, node X should have been charged to an allowed maximum value (VDD $- V_{\text{THN}}$) during the selection-enabled period (high pulse duration of CK0), and this value is held until the present instant since NM1 has always been in the cutoff state. Therefore, this will not cause a prominent charge-extraction effect, as node X has already been charged to the allowed maximum value by the previously transmitted bit. If the previous D0N is logic high, node X should keep the ground voltage that is pulled down during the hold time in the previous bit period [i.e., $T_{hold}$ in Fig. 2(b)]. When the high pulse of CK0 arrives, the capacitance at node X will extract charge from the output, thus causing a remarkable glitch for two consecutive output bits at high level or slowing down the rising edge for a low-to-high transition, as shown in the top row of Fig. 5(a) and (b), respectively.

2) Effect of the Introduced Pre-Charging Transistors: To demonstrate the effect of the introduced pre-charging transistors PM1/PM2 shown in Fig. 4(a), we take PH0 branch as an example to illustrate the operation process of the proposed pull-down unit cell. When input data arrive, depending on D0N/D0P, nodes X/Y are either pre-charged to VDD or pre-discharged to VSS by the two inverters consisting of PM1/PM2 and NM1/NM2. This makes nodes X/Y always in desired states, which are coincident with the output signal levels. Then, NM3/NM4 are turned on to send D0N/D0P to the MUX's outputs as the high level of CK0 comes. After a period of 1 UI, the pull-down path is switched off by the falling edge of CK0 and the voltage level of nodes X/Y stays unchanged until the next input data come. The main feature





Fig. 7. Circuit details of the clocking blocks. (a) Clock conditioner. (b) DIV2. (c) CML2CMOS. (d) Pseudo-AND2.

of this 4:1 MUX is its ability to eliminate the charge-sharing effect caused by parasitic capacitances at nodes X/Y, which brings in several benefits. First, the deterministic jitter and glitches caused by charge extraction can be remarkably mitigated [see the middle row in Fig. 5(a) and (b)]. The simulated eye-diagrams in Fig. 6 indicate that the inter-symbol interference (ISI) induced by charge sharing is reduced from 1.6 to 0.3 ps and the voltage glitches are mostly removed. Moreover, the glitch elimination effectively improves the noise margin, which allows a lower output swing to save power. Second, the elimination of the charge-sharing effect makes the capacitances at nodes X/Y less significant. Thus, large-size NM1/NM2 can be used to enhance the discharging capabilities. Note that the output swing is determined by the proportion of resistive load and equivalent resistance of stacked NM1/NM3 (NM2/NM4). For a fixed output swing, the large size of NM1/NM2 implies that NM3/NM4's size can be reduced. The smaller size of NM3/NM4 helps to decrease the self-loading drain capacitances of the unit cells. Consequently, the bandwidth of the overall 4:1 MUX can be expanded. Fig. 4(b) gives the swing variation for different process, voltage, and temperature (PVT) corners, which can be controlled under 25%. By adopting a tunable resistor, it can be further reduced [4]. Third, the added transistors PM1/PM2 provide another path through NM3/NM4 to help pull up the output, which can accelerate the rising transitions.

#### D. Clocking Blocks for the 4:1 MUX

As shown in Fig. 1 (bottom), the desired full swing clocks for the latch array and the 4:1 MUXs are produced by a clock bundle, where current mode logic (CML)-style circuits are employed in the clock conditioner and DIV2 to support the highest-speed operation (half-rate) while the CML2CMOS and pseudo-AND2 are implemented in a more power efficient CMOS style. Fig. 7 shows the implementation details of these building blocks. As shown in Fig. 7(a), the clock conditioner



Fig. 8. Block diagram of the RX chip.

is composed of an ac-coupled S2D and two cascaded CML buffers, where the former is used to convert the single-ended clock input into differential outputs and the latter is utilized to further rectify the clock waveforms. For the DIV2, a traditional inductorless CML latch shown in Fig. 7(b) is used to balance the operation speed and layout compactness. Fig. 7(c) gives the schematic details of the CML2CMOS, where an ac-coupled inverter with a feedback resistor is utilized to convert the CML voltage level to full swing CMOS logic. For the pseudo-AND2, its function is to AND the two 50% duty cycle half-rate clocks with 90° phase shift to generate the 25% duty cycle clocks for the 4:1 MUXs. In this design, a pseudo-NAND2 associated with a driving inverter [see Fig. 7(d)] is employed to perform the ANDing operation [25]. In contrast to conventional NAND2, this pseudo-NAND2 eliminates the pulling-up transistor PM1, thus reducing the output capacitance. The similar circuit realizations of the pseudo-AND2 and the BUF (consisting of two cascaded inverters) also mitigate the delay mismatch between $t_{d1}$ and $t_{d2}$ (see Fig. 1), which helps to meet the stringent timing constraints against PVT variations.
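The ANDing step can be illustrated with ideal square waves. The sketch below is a logic-level model only; which clock polarities feed each pseudo-AND2 is an assumption made here for illustration, and the helper names `square_clock` and `gating_clocks` are hypothetical.

```python
def square_clock(phase_deg: float, t_deg: float) -> int:
    """Ideal 50% duty-cycle clock, high for 180 degrees starting at phase_deg."""
    return 1 if ((t_deg - phase_deg) % 360.0) < 180.0 else 0


def gating_clocks(t_deg: float) -> dict:
    """AND pairs of 90-degree-spaced 50% clocks to obtain 25% duty-cycle pulses."""
    i = square_clock(0.0, t_deg)    # in-phase clock
    q = square_clock(90.0, t_deg)   # quadrature clock
    ib, qb = 1 - i, 1 - q           # complementary phases
    return {
        "CK0": i & qb,     # high from 0 to 90 degrees
        "CK90": i & q,     # high from 90 to 180 degrees
        "CK180": ib & q,   # high from 180 to 270 degrees
        "CK270": ib & qb,  # high from 270 to 360 degrees
    }


# Sample one clock period in 1-degree steps and measure each duty cycle.
samples = [gating_clocks(float(t)) for t in range(360)]
for name in ("CK0", "CK90", "CK180", "CK270"):
    print(name, sum(s[name] for s in samples) / 360.0)
```

In this idealized model each gating clock is high for one quarter of the period, and exactly one of the four is high at any instant, which is the non-overlapping 1-UI selection window the 4:1 MUX relies on.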

#### III. RECEIVER CHIP

#### A. Overall Architecture

The main task of the RX is to extract the transmitted data from the received signal using appropriate equalization and CDR techniques [26]-[29]. Fig. 8 shows the block diagram of the RX chip. It consists of a two-stage continuous-time linear equalizer (CTLE), a quarter-rate CDR, an FFE adaptation unit, and some testing circuits for the recovered data and clock measurements. The received signal is first equalized by the CTLE and then sliced by eight data/edge samplers, where the sampling clocks are generated by two quarter-rate compensating PIs and the sampling positions are adjusted by a CDR logic using bang-bang phase detectors (BBPDs). In addition, a newly developed S-ZF algorithm along with three 6-bit DACs is adopted to produce the bias voltages for the TX-FFE. The rest of this section focuses on the optimization techniques for the CDR, and the S-ZF algorithm will be elaborated in Section IV.



Fig. 9. Conventional BBPD-based CDR.

#### B. Challenges in Conventional BBPD-Based CDR

Fig. 9 shows the conventional architecture of the BBPD-based CDR. Due to the nonlinear behavior and inevitable loop delay, the phase code applied to the PI usually exhibits steady-state oscillation, which brings in substantial deterministic jitter through rotating the PI. This effect can become more severe as the data rate increases, because the increased loop gain and the not-well-scaled loop latency are prone to cause a larger limit-cycle oscillation amplitude. To attenuate this amplitude, a split-path CDR/DFE architecture was proposed in [30], which employs a digital averaging technique to filter the phase code for the separate data-sampling clocks. This approach can effectively improve the jitter tolerance (JTOL) amplitude at high frequencies, but the inevitable delay added by the digital averaging block may make the sampling clocks drift away from the optimal positions, thus degrading the maximum tolerable amplitude at low frequencies.
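The dependence of the limit-cycle amplitude on loop latency can be seen in a toy model. The sketch below assumes a first-order bang-bang loop with a static input phase and a fixed phase step per update; the function name `limit_cycle_pp` and all parameter values are illustrative and do not represent the CDR in this paper.

```python
from collections import deque


def limit_cycle_pp(latency: int, step: float = 1.0, n_updates: int = 400) -> float:
    """Peak-to-peak steady-state dither of a first-order bang-bang loop.

    The loop moves the phase by one step per update, using an early/late
    decision that is `latency` updates old.
    """
    error = 0.5 * step                  # start off-center so the error is never 0
    pending = deque([error] * latency)  # decisions still travelling through the loop
    trace = []
    for _ in range(n_updates):
        pending.append(error)
        observed = pending.popleft()    # phase error seen `latency` updates ago
        error -= step if observed > 0 else -step
        trace.append(error)
    tail = trace[n_updates // 2:]       # discard the start-up transient
    return max(tail) - min(tail)


for latency in (0, 1, 2, 4):
    print(latency, limit_cycle_pp(latency))
```

In this simplified model the dither grows with both the step size (loop gain) and the latency, which is the trend described above.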

Another factor that limits the performance of the BBPD-based CDR is the nonlinearity of the phase-rotating PI, where both the differential nonlinearity (DNL) and integral nonlinearity (INL) can result in serious adverse effects on the overall CDR performance. Specifically, the DNL introduces a much larger phase jump than the ideal one, which can be directly converted into recovered clock jitter. The INL can make the data-sampling clocks drift away from their optimal decision points in quarter-rate architectures using multiple PIs [2].

#### C. Improvement on CDR Architecture

Fig. 10 shows the block diagram of the improved CDR. It employs separate PI1 and PI2 to produce the two sets of 45°-spaced clocks for the data sampling and edge sampling, where passive low-pass filters (LPFs) are introduced into the clock branch for the data sampling to provide extra jitter suppression on the data-sampling clocks. The bandwidth of these introduced LPFs is adaptively adjusted by the same DF(2:0), which is the absolute value of the truncated frequency code generated by the frequency integrator in the digital loop filter. Particularly, the minimum bandwidth of the LPFs is about 4 MHz while the maximum one is around 50 MHz. In addition, a limiter is utilized to set the DF(2:0) to its



Fig. 10. Block diagram of the modified CDR architecture.



Fig. 11. Functional view of the introduced LPFs. (a) Principle of the BBPD. (b) Linearized CDR model. (c) Jitter transfer functions.

maximum value when the frequency code becomes too large. In principle, a large frequency code indicates a continuous phase slewing to accommodate the accumulative jitter tracking. Thus, a wide bandwidth is chosen to improve the jitter tracking ability. On the contrary, a small frequency code implies that there is little trackable jitter. Accordingly, a narrow bandwidth is selected to suppress the high-frequency jitter.
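The bandwidth-selection rule can be sketched in a few lines. Only the absolute-value, truncation, and limiter steps and the 4 to 50 MHz range come from the text; the truncation width (`truncate_bits`), the geometric spacing of the eight settings, and both function names are assumptions for illustration.

```python
def bandwidth_code(freq_code: int, truncate_bits: int = 4, max_code: int = 7) -> int:
    """DF(2:0): magnitude of the truncated frequency code, saturated by the limiter."""
    magnitude = abs(freq_code) >> truncate_bits  # drop LSBs (width is an assumption)
    return min(magnitude, max_code)              # limiter clamps to the 3-bit maximum


def lpf_bandwidth_mhz(df: int, bw_min: float = 4.0, bw_max: float = 50.0,
                      max_code: int = 7) -> float:
    """Map DF(2:0) onto the 4-50 MHz range (geometric spacing is an assumption)."""
    return bw_min * (bw_max / bw_min) ** (df / max_code)


# A small frequency code selects a narrow LPF; a large one selects a wide LPF.
for code in (0, -35, 60, -1000):
    df = bandwidth_code(code)
    print(code, df, round(lpf_bandwidth_mhz(df), 1))
```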

The working principle of the BBPD is shown in Fig. 11(a). Considering the fact that the data sampling occurring at the



Fig. 12. Effect of the LPFs with a bandwidth of (a) 4 MHz, (b) 20 MHz, (c) 50 MHz, and (d) adaptively adjusting.

center of the eye-diagram serves as a reference to judge whether the edge sampling is leading or lagging the input data transitions, there should be sufficient margin for the data sampling. Accordingly, the outputs of the data samplers show a fairly low sensitivity to phase errors in normal operating CDRs, which means that further jitter suppression on data-sampling clocks exhibits little effect on the loop parameters for jitter tracking. Leveraging this characteristic of the BBPD, we introduce LPFs into the data-sampling path to further filter the output jitter while keeping the loop parameters unchanged to satisfy the JTOL specification. Fig. 11(b) shows the small-signal model of the modified CDR, where the LPF located outside of the feedback loop is able to provide additional jitter suppression for the data-sampling clocks [see Fig. 11(c)]. Therefore, the dithering jitter caused by the limit-cycle oscillation can be effectively attenuated. The noise sources are also shown in Fig. 11(b), including the input noise ($S_{IN}$), quantization noise ($S_{QBB}$) of the BBPD, truncation noise I ($S_{TF}$) due to finite resolution of the integral path, truncation noise II ($S_{TD}$) due to limited resolution of the IDAC, and nonlinearity noise ($S_{PI1}$, $S_{PI2}$) of the PIs. Fig. 11(c) shows the transfer function characteristics for these noise sources. It can be seen that the introduced LPFs can dramatically attenuate the remaining band-frequency and high-frequency components from $S_{TF}$ and $S_{TD}$. The low-frequency components of $S_{IN}$, $S_{PI2}$, and $S_{QBB}$ can be further reduced by these LPFs when lower bandwidths are employed. In addition, the potential jitter peak can be suppressed to alleviate the jitter amplification problem.
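The early/late judgment made by a BBPD can be written as a small truth table. This is the generic Alexander-style decision rule, given as a sketch; the sign convention (+1 for a late clock) and the function name `bbpd_decision` are assumptions, and the per-sampler wiring of this particular chip is not modeled.

```python
def bbpd_decision(d_prev: int, edge: int, d_next: int) -> int:
    """Return +1 if the sampling clocks are late, -1 if early, 0 if no transition.

    d_prev and d_next are two consecutive data samples; edge is the sample
    taken by the edge clock between them.
    """
    if d_prev == d_next:
        return 0  # no data transition, so no timing information
    # If the edge sample already equals the new bit, the edge clock sampled
    # after the transition, i.e., the clocks are late.
    return 1 if edge == d_next else -1


def vote(data: list, edges: list) -> int:
    """Sum the decisions over a block of samples; edges[k] lies between data[k] and data[k+1]."""
    return sum(bbpd_decision(data[k], edges[k], data[k + 1])
               for k in range(len(edges)))


print(vote([0, 1, 1, 0, 1], [1, 1, 0, 1]))  # all transitions report "late"
```

Because the data samples only serve as the reference in this comparison, moderate phase noise on the data-sampling clock leaves the decisions unchanged as long as the data are sliced correctly.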

Note that the phase delay caused by the LPFs should be small enough to ensure that the data-sampling clocks stay in the vicinity of the optimal sampling point. Otherwise, the high-frequency jitter suppression could be overwhelmed by the delay-caused phase shift, thus deteriorating the overall CDR performance. Fig. 12 shows the filtering effect on the current



Fig. 13. Properties of the adaptive-bandwidth jitter suppression.

mirror bias for 0°-phase and the jitter performance of the data-sampling clock with different LPF bandwidths, where the eye-diagrams are overlapped from 0.9 to 2.1 $\mu$s. These simulations are performed under the condition that a 500-kHz sinusoidal jitter with a 1 UI amplitude and a 5-ps peak-to-peak random jitter are respectively injected into the input clock and input data using PRBS7. For the simulated diagrams with the bandwidth of 4 MHz in Fig. 12(a), the high-frequency ripples on the bias can be significantly suppressed by the LPF. However, the dithering jitter of the data-sampling clock reaches 7.54 ps, which is much larger than that of the edge-sampling clock without the LPF (3.04 ps). It means that the CDR performance is actually deteriorated. This is mainly because of the prominent phase shift caused by the LPF delay. As the bandwidth increases, the delay-caused phase shift becomes smaller, thus indicating a descending trend in dithering jitter of the sampling clock [see Fig. 12(b) and (c)]. For the bandwidth fixed at 50 MHz, the dithering jitter of the data-sampling clock (2.66 ps) becomes smaller than that of the edge-sampling clock (3.04 ps). This implies that the jitter optimization contributed by the bias-ripple suppression overwhelms the delay-caused phase shift. Based on the above-mentioned discussion, it can be found that adopting a fixed bandwidth is inadvisable, since the low bandwidth suffers from delay-caused phase shift while the high bandwidth exhibits limited jitter suppression. Fig. 12(d) shows the simulation results when utilizing the proposed adaptive bandwidth-adjusting technique, where the low dithering jitter is achieved by balancing the bias tracking and ripple suppression. When the input pattern ranges from PRBS7 to PRBS15, PRBS23, and PRBS31, the CDR exhibits a similar balance between high-frequency ripple suppression and low-frequency bias tracking but with a slightly increased jitter due to the increased run length of "1s" or "0s".
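The delay-caused phase shift can be estimated with a first-order model. The sketch below assumes that the data-sampling phase is an ideal single-pole low-pass-filtered copy of the tracked phase and that the 1-UI jitter amplitude is a peak value; both are simplifying assumptions, so the numbers only indicate the trend and are not expected to reproduce the simulated jitter values quoted above.

```python
import math

UI_PS = 25.0  # unit interval at 40 Gb/s


def tracking_error_pp_ps(jitter_freq_hz: float, jitter_amp_ui: float,
                         lpf_bw_hz: float) -> float:
    """Peak-to-peak lag error between a sinusoid and its single-pole LPF copy."""
    x = jitter_freq_hz / lpf_bw_hz
    # |1 - H(jf)| for H(jf) = 1 / (1 + j * f / f_c)
    err_gain = x / math.sqrt(1.0 + x * x)
    return 2.0 * jitter_amp_ui * err_gain * UI_PS


# 500-kHz sinusoidal jitter with an assumed 1-UI peak amplitude.
for bw_mhz in (4.0, 20.0, 50.0):
    err = tracking_error_pp_ps(500e3, 1.0, bw_mhz * 1e6)
    print(f"{bw_mhz:5.1f} MHz LPF -> {err:.2f} ps lag error (peak-to-peak)")
```

The lag error shrinks roughly in proportion to the bandwidth, which is why a narrow fixed bandwidth is penalized while the jitter is being tracked.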

To further explore the adaptive bandwidth-adjusting process, Fig. 13 gives the transient simulation waveforms



Fig. 14. Proposed compensating PI. (a) Quarter-rate  $45^{\circ}$ -spaced clock generation. (b) In-phase I, Q clock generation for the data sampling. (c)  $45^{\circ}$  phase-shifted I, Q clock generation for the edge sampling.



Fig. 15. Phase transfer characteristics based on trigonometric-function approximation.

using PRBS7. For the fast input jitter changing region (a jitter tracking region), a large frequency code is accumulated in the frequency integrator (see Fig. 10), thus a high bandwidth control code DF(2:0) for the LPFs can be obtained (see the bottom waveform in Fig. 13). As a result, the data-sampling clocks can tightly track the edge-sampling clocks to avoid data-sampling lagging. For the slow input jitter changing region (a jitter suppression region), the frequency code becomes small and so does the bandwidth control code DF(2:0). Correspondingly, the bandwidth of the LPFs decreases, thus exhibiting prominent jitter suppression effect. Owing to the proposed adaptive bandwidth-adjusting scheme, the jitter suppression and jitter tracking can be automatically balanced in this CDR. Overall, this automatic bandwidth selection technique makes it possible to use a low bandwidth to significantly suppress the high-frequency jitter while exhibiting little effect on the low-frequency jitter tracking ability.

#### D. Compensating PI

Fig. 14(a) shows the widely used scheme for $45^{\circ}$-spaced clock generation, where two conventional PIs (PIA and PIB) with 1/2-quadrant-step spaced phase codes (PHA$\langle 8:0 \rangle$ and PHB$\langle 8:0 \rangle$) are utilized to produce the two sets of $45^{\circ}$-spaced clocks (CKA0-90-180-270 and CKB45-135-225-315) [2], [12]. Their phase transfer characteristics based



Fig. 16. Simulation results of the phase-compensating PI. (a) Simulated phase transfer characteristics. (b) DNL performance. (c) INL performance.

on trigonometric-function approximation can be described by the respective red dashed and blue dotted lines in Fig. 15. When PIA rotates to point E and PIB rotates to point F, the phase shift between them can reach a maximum of 8.1° (or 0.09 UI). Since the edge-sampling clocks tightly track the edge transitions in the received data stream, any phase-spacing variation between the edge-sampling and data-sampling clocks could make the data-sampling clocks drift away from the expected decision point. Moreover, improving the PI resolution cannot optimize this effect since fine step weights cannot change the shape of the phase transfer characteristics.

To address these issues, we develop a phase-compensating technique, which applies four time-averaging (TA) circuits [see Fig. 14(b) and (c)] to further average the two sets of 45°-spaced clocks. Specifically, the data-sampling clocks (CK0-90-180-270) are obtained by averaging CKA0-90-180-270 and CKB45-135-225-315, while the edge-sampling clocks (CK45-135-225-315) are obtained by averaging CKA90-180-270-0 and CKB45-135-225-315.



Fig. 17. Implemented equalization scheme with the proposed S-ZF algorithm.

Mathematical analysis shows that the phase transfer function of the proposed compensating PI is a combination of the two arctan functions given in Fig. 15, where a more linear phase transfer curve with negligible phase deviations smaller than 0.17° can be achieved. In practical implementation (see the schematic details of PI and TA in Fig. 14), the linearity optimization is degraded by the transistors' inherent nonlinearity and the nonideal input clock waveform. Simulation results shown in Fig. 16 imply that the INL can be controlled below 2.5 LSB (or $1.8^{\circ}$), which is only a quarter of that of the conventional PI. The simulation also shows that the additional PI and TAs in each compensating PI consume around 10 mW.
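The effect of the averaging can be checked numerically with a minimal sketch. It assumes an idealized PI whose phase within a quadrant is arctan(w/(1 - w)) for a mixing weight w, and it models each TA as a plain average of the two input phases; both are simplifications, and the function names are illustrative rather than taken from the design.

```python
import math


def pi_phase_deg(w):
    """Phase in degrees of an idealized PI; w is the phase code in quadrants."""
    quadrant, frac = divmod(w, 1.0)
    # Mixing I and Q with weights (1 - frac) and frac gives an arctan transfer.
    return 90.0 * quadrant + math.degrees(math.atan2(frac, 1.0 - frac))


def spacing_error_deg(w):
    """Deviation of the PIB-to-PIA spacing from the ideal 45 degrees."""
    return pi_phase_deg(w + 0.5) - pi_phase_deg(w) - 45.0


def averaged_error_deg(w):
    """Deviation from a straight line after averaging the PIA and PIB phases."""
    averaged = 0.5 * (pi_phase_deg(w) + pi_phase_deg(w + 0.5))
    return averaged - (90.0 * w + 22.5)


codes = [i / 1024 for i in range(1024)]  # one full quadrant of PIA codes
max_spacing_error = max(abs(spacing_error_deg(w)) for w in codes)
max_averaged_error = max(abs(averaged_error_deg(w)) for w in codes)
print(f"max spacing error (conventional): {max_spacing_error:.2f} deg")
print(f"max deviation after averaging:    {max_averaged_error:.2f} deg")
```

Under these assumptions the worst-case spacing error of the conventional scheme comes out close to the 8.1° quoted above, while the averaged transfer stays within a small fraction of a degree of a straight line. In this model the edge-sampling phase is the data-sampling phase plus exactly 45°, because both share the same CKB input and the CKA inputs differ by 90°.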

#### IV. CHANNEL EQUALIZATION

The equalization scheme consisting of a TX-FFE and an RX-CTLE is utilized to compensate for the channel loss. As shown in Fig. 17, the RX-CTLE is manually calibrated while the tap weights of the TX-FFE are adaptively adjusted by an edge-data correlation-based S-ZF algorithm on the RX side. The digital tap weights generated by the S-ZF engine are first constrained by three range limiters and then applied to three 6-bit DACs to produce the bias voltages for the TX-FFE taps. These bias voltages are transferred to the TX through PCB traces. The TX-FFE is implemented by a CML-based four-tap FFE combiner, where the tap weights are adjusted by changing the bias voltages of the current sources (see Fig. 1). The RX-CTLE schematic details and its frequency responses are described in Fig. 18.

#### A. Previous Adaptation Algorithms

According to different evaluation criteria [18]–[20], [31]–[34], previous adaptation algorithms for wireline communications can be mainly categorized into sign-sign least mean square (SS-LMS) [18]–[20], [31], ZF [32], [33], and maximum eye opening (MEO) [34]. A common drawback of these methods is that they need auxiliary circuits to extract the error information. In particular, the SS-LMS algorithm requires additional samplers to detect the signed errors between the equalized and expected eye heights [18]–[20], [31]. The traditional ZF necessitates an extra ADC to convert the equalized output voltages into digital codes [32], [33]. The MEO requires an even more complicated eye monitor, which usually incorporates threshold-adjusting samplers, phase-adjusting PIs, a micro-controller, and measurement software [34], to measure the internal eye opening. These auxiliary circuits make these methods less competitive for applications at tens of Gb/s for the following reasons: 1) maximum bandwidth deterioration because their input capacitances are directly connected to the critical signal path; 2) substantial power consumption as the additional circuits usually operate at high speed; and 3) more complicated layout placing and routing.

#### B. Edge-Data Correlation-Based S-ZF Adaptation Algorithm

To preclude the auxiliary circuits in previous adaptation algorithms [18]–[20], [31]–[34], a low-cost S-ZF algorithm utilizing edge-data cross correlation is developed. The target is to force the cross correlation between the sign of the edge-sampling error and received data to zero. The iterative procedure of the TX-FFE tap weights is given by

$$\alpha_l(k+1) = \alpha_l(k) - \lambda \cdot \operatorname{sign}[e(k)] \cdot D(k-l), \quad l = -1, 0, 1, 2 \tag{1}$$

where $\alpha_l(k)$ is the instant *l*-tap weight, sign[e(k)] represents the sign of the edge-sampling error, D(k) denotes the recovered data, and $\lambda$ stands for the scale factor controlling the adjustment rate and its value is usually much smaller than 1. The sign of the edge-sampling error sign[e(k)] caused by the ISI is directly mapped from the quantized edge sequence E(k), and it is correlated with the data bit D(k-l) to produce the product sign$[e(k)] \cdot D(k-l)$. The result is then integrated to update the FFE tap weight $\alpha_l(k)$.
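A single iteration of (1) can be sketched as follows. This is a floating-point illustration only: the dictionary keyed by tap index, the ±1 data encoding, and the `step` value standing in for $\lambda$ are assumptions, and the hardware described later works on integer codes instead.

```python
def sign(x):
    """Return +1 or -1 (zero is treated as +1 in this sketch)."""
    return 1 if x >= 0 else -1


def szf_update(taps, edge_error, data, k, step=1e-3):
    """Apply (1) once: taps maps tap index l to alpha_l, data[k] is +1 or -1.

    The caller must keep k - 2 >= 0 and k + 1 < len(data) so that every
    D(k - l) for l = -1, 0, 1, 2 refers to a real sample.
    """
    return {l: a - step * sign(edge_error) * data[k - l] for l, a in taps.items()}


taps = {-1: 0.0, 0: 1.0, 1: 0.0, 2: 0.0}
data = [1, -1, -1, 1, 1]
taps = szf_update(taps, edge_error=0.3, data=data, k=2)
```

Each tap moves by one step in the direction that reduces the correlation between the edge-error sign and its own data bit; only the sign of the error is used, never its magnitude.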

The main feature of this approach is that it only involves the existing quantized edge sequence E(k) and recovered data sequence D(k). As a result, the auxiliary circuits that previous adaptive equalizers rely on, such as samplers, ADCs, and PIs [18], [19], [31]–[34], are removed, which gives the approach greater potential in operation speed and cost effectiveness.

#### C. Derivation of the Edge-Data Correlation-Based S-ZF Adaptation

For a TX with *l*-tap UI-spaced FFE, the pre-distorted output can be represented by

$$t(k) = \sum_{l} \alpha_l I(k-l) \tag{2}$$



Fig. 18. RX-CTLE. (a) Schematic details. (b) Frequency responses for different control voltages.



Fig. 19. Block diagram of the edge-data correlation-based S-ZF adaptation algorithm.

where I(k) is the transmitting sequence,  $\alpha_l$  denotes the tap weight, and l is the tap index [34]. To make the analysis more compact, the cascaded passive channel and RX-CTLE are treated as a combined channel with a new pulse response of  $c_k$ . By calculating the convolution of pre-distorted output t(k) and the channel pulse response  $c_k$ , the received discretetime sequence before binary quantization can be given by

$$r(k) = \sum_{l} \alpha_l \left( \sum_{i} I(i) c_{k-l-i} \right). \tag{3}$$
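The step from (2) to (3) is a discrete convolution, which can be verified with a short sketch. The tap weights, pulse-response samples, and symbol sequence below are hypothetical values chosen only to exercise the formulas; symbols outside the sequence are treated as zero.

```python
def ffe_sample(k, symbols, taps):
    """t(k) per (2); symbols outside the sequence are treated as 0."""
    return sum(
        alpha * symbols[k - l]
        for l, alpha in taps.items()
        if 0 <= k - l < len(symbols)
    )


def received_sample(k, symbols, taps, pulse):
    """r(k) per (3): double sum over taps l and transmitted symbols i."""
    return sum(
        alpha * sym * pulse.get(k - l - i, 0.0)
        for l, alpha in taps.items()
        for i, sym in enumerate(symbols)
    )


def received_via_convolution(k, symbols, taps, pulse):
    """The same r(k), computed by convolving t(j) with the pulse response."""
    span = range(-2, len(symbols) + 3)  # covers every j where t(j) is non-zero
    return sum(ffe_sample(j, symbols, taps) * pulse.get(k - j, 0.0) for j in span)


# Hypothetical example values (not from the paper).
taps = {-1: -0.1, 0: 0.7, 1: -0.2, 2: 0.0}
pulse = {-1: 0.1, 0: 1.0, 1: 0.4, 2: 0.15}   # c_k of channel plus RX-CTLE
symbols = [1, -1, -1, 1, 1, -1]

direct = [received_sample(k, symbols, taps, pulse) for k in range(len(symbols))]
convolved = [received_via_convolution(k, symbols, taps, pulse) for k in range(len(symbols))]
```

Both lists agree to within floating-point rounding, since substituting i = j - l in the convolution sum gives exactly the double sum in (3).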

According to the discussion in [35], the cross-correlation coefficient $\rho_{y,x}(n)$ between the output signal y(m) and the input signal x(m) is exactly equal to the impulse response h(n). Applying this conclusion and treating the recovered data sequence D(k) as equivalent to the input sequence I(k), we obtain the cross-correlation coefficient between the edge-sampling error sequence e(k) and the recovered data sequence D(k)

$$\hat{\rho}_{e,d}(n) = \sum_{l} \alpha_l c_{n-l+0.5}. \tag{4}$$

I(k) can be considered equivalent to D(k) because the bit error rate (BER) is usually very low (<1e-12) for links in normal operation.

For an ideally equalized serial link, the edge-sampling error sequence is supposed to be a 0-sequence. Hence, all the cross-correlation coefficients should be zero. However, this needs infinite taps to cancel all the residual ISI. Considering that the ISI tail decreases exponentially over time, it is reasonable to assume that the ISI affects a finite number of symbols, and previous research has demonstrated that equalizers with a specific number of taps can effectively compensate for legacy channels [17], [19], [31], [34], [36]. In principle, when the tap weights are adjusted close to the targeted values, the resulting cross-correlation coefficient $\hat{\rho}_{e,d}(n)$ should be forced toward zero. Taking the implemented four-tap FFE in this design as an example, for a group of proper tap weights, we have

$$\hat{\rho}_{e,d} = C\alpha = 0 \tag{5}$$

where

$$\hat{\rho}_{e,d} = (\hat{\rho}_{e,d}(-1), \hat{\rho}_{e,d}(0), \hat{\rho}_{e,d}(1), \hat{\rho}_{e,d}(2))^T$$

$$\alpha = (\alpha_{-1}, \alpha_0, \alpha_1, \alpha_2)^T$$

$$C = \begin{pmatrix} c_{0.5} & c_{-0.5} & c_{-1.5} & c_{-2.5} \\ c_{1.5} & c_{0.5} & c_{-0.5} & c_{-1.5} \\ c_{2.5} & c_{1.5} & c_{0.5} & c_{-0.5} \\ c_{3.5} & c_{2.5} & c_{1.5} & c_{0.5} \end{pmatrix}.$$
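As a consistency check, each row of C follows from (4) by fixing n and letting l run over -1, 0, 1, 2. For n = 0, for instance, the subscript n - l + 0.5 takes the values 1.5, 0.5, -0.5, and -1.5, so that

$$\hat{\rho}_{e,d}(0) = \alpha_{-1} c_{1.5} + \alpha_0 c_{0.5} + \alpha_1 c_{-0.5} + \alpha_2 c_{-1.5}$$

which is the second row of C multiplied by $\alpha$. The other three rows follow in the same way for n = -1, 1, and 2.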

To find the optimal TX-FFE tap weights, a recursive equation is constructed as

$$\alpha(k+1) = \alpha(k) - \lambda C \alpha(k) = \alpha(k) - \lambda \hat{\rho}_{e,d}(k). \tag{6}$$

In each iteration, a small portion of the instant cross-correlation coefficient vector $\lambda \hat{\rho}_{e,d}(k)$ is subtracted from the



Fig. 20. CD. (a) Operation principle illustration. (b) Function table.

tap-weight vector $\alpha(k)$ to make it closer to the targeted value. For convergence, mathematical analysis indicates that a sufficient condition is to keep the 1-norm of the matrix $I - \lambda C$ smaller than 1 (i.e., the maximum absolute column sum is smaller than 1). For any bandwidth-limited channel, the transmitted symbol spreads over multiple symbols at the RX side, so the above condition holds. Consequently, a set of optimal tap weights of the TX-FFE can be obtained by iterating (6). To handle unexpected divergence, range limiters are inserted between the S-ZF engine and the DACs (see Fig. 17) to keep the control codes received by the DACs within the specified minimum and maximum values.
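The sufficient condition can be evaluated numerically once C is known. The sketch below builds C from (4) and computes the maximum absolute column sum of $I - \lambda C$; the pulse-response samples and the step size are hypothetical numbers chosen for illustration, not measurements from this design.

```python
def one_norm(matrix):
    """Induced 1-norm of a matrix: the maximum absolute column sum."""
    columns = range(len(matrix[0]))
    return max(sum(abs(row[j]) for row in matrix) for j in columns)


def iteration_matrix(c_matrix, step):
    """Build I - step * C for a square matrix C."""
    n = len(c_matrix)
    return [
        [(1.0 if i == j else 0.0) - step * c_matrix[i][j] for j in range(n)]
        for i in range(n)
    ]


# Hypothetical half-UI-offset pulse samples c_{m+0.5} (not from the paper).
c = {-2.5: 0.0, -1.5: 0.02, -0.5: 0.10, 0.5: 0.50, 1.5: 0.20, 2.5: 0.08, 3.5: 0.03}
indices = (-1, 0, 1, 2)
C = [[c[n - l + 0.5] for l in indices] for n in indices]  # entry (n, l) per (4)

norm = one_norm(iteration_matrix(C, step=0.01))
print(f"1-norm of I - lambda*C: {norm:.4f}")
```

With these example numbers the diagonal entry c_{0.5} exceeds the sum of the other magnitudes in every column, which is what pushes each column sum of $I - \lambda C$ below 1 for a small positive $\lambda$.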

Taking sign[e(k)] as the binary quantization of the edge-sampling error, the cross correlation between the sign of the edge-sampling error and the received data, sign[e(k)] $\cdot D(k - l)$, can be considered as an instant estimation of $\hat{\rho}_{e,d}(l)$. Hence, the final iterative equation presented in the previous part is obtained [refer to (1)].

#### D. Implementation of the Edge-Data Correlation-Based S-ZF

Fig. 19 shows the implementation of our S-ZF adaptation algorithm, which contains three identical paths to process the quantized data/edge sequences to produce the desired bias voltages for TX-FFE taps. Here, the main tap weight is pre-fixed to accelerate the convergence speed. In each path, the edge and data streams with proper time shift are applied to a correlation detector (CD) to generate the residual correlation ResCor<sub>l</sub>(n), which denotes sign[e(n)] · D(n - l) in (1). These parallel correlation coefficients are first summed and then fed into a 16-bit integrator to execute the iteration of (1), where $\lambda$ is determined by the subsequent truncation operation. In this design, a set of consecutive 4-bit data/edge of the 1/16-rate demultiplexed data/edge are employed, which ensures that the data/edge information used for equalization adaptation comes from different samplers. This decentralized error collection method reduces the possibility of non-optimal adaptation caused by imperfections, such as fabrication mismatch, duty cycle distortion, and I, Q quadrature error. Fig. 20



Fig. 21. Transistor-level simulation of the S-ZF adaptation. (a) Channel frequency response. (b) Convergence process of the TX-FFE tap weights. (c) Eye-diagram with zero TX-FFE tap weights. (d) Eye-diagram with adaptively adjusted TX-FFE tap weights.

further details the operation principle and function table of the CD. If there is no data transition $[D(n) \oplus D(n+1) = 0]$, ResCor<sub>l</sub>(n) is assigned 0. In case of a data transition $[D(n) \oplus D(n+1) = 1]$, ResCor<sub>l</sub>(n) is assigned +1 when the polarities of D(n-l) and E(n) are identical $[D(n-l) \oplus E(n) = 0]$ and -1 when they are opposite $[D(n-l) \oplus E(n) = 1]$.
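The function table of the CD maps directly onto a few lines of logic. The sketch below is a behavioral model only, with bits represented as the integers 0 and 1; the function and argument names are illustrative.

```python
def residual_correlation(d_n, d_next, d_tap, e_n):
    """Return ResCor_l(n) from the bits D(n), D(n+1), D(n-l), and E(n)."""
    if d_n ^ d_next == 0:
        return 0                      # no data transition
    return 1 if d_tap ^ e_n == 0 else -1


# Transition from 0 to 1 with matching D(n-l) and E(n) gives +1.
example = residual_correlation(d_n=0, d_next=1, d_tap=1, e_n=1)
```

The first XOR acts as a gate so that only edge samples taken at an actual data transition contribute, and the second XOR supplies the sign.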

Fig. 21 gives the transistor-level simulation results of the serial link with the S-ZF adaptation, where the control voltage of the RX-CTLE is pre-set to 700 mV, and the dispersive channel is imitated by an LPF with a -15.9 dB loss at 20 GHz. The channel frequency response and the eye-diagram after the channel are shown in Fig. 21(a). Fig. 21(b) describes the convergence process of the TX-FFE tap weights. Fig. 21(c) and (d) shows the eye-diagrams (measured at the output of the RX-CTLE) with zero and adaptively adjusted tap weights, respectively. The developed S-ZF adaptation algorithm gradually tunes the TX-FFE tap weights to optimal values, which effectively improves the eye opening and reduces the eyelid thickness.
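The sum, integrate, truncate, and limit chain of one adaptation path can be modeled behaviorally as below. The 16-bit integrator and 6-bit DAC widths come from the text; the unsigned accumulator, the mid-bin initial value, keeping the six most significant bits, and the limiter bounds are assumptions made for this sketch.

```python
class TapPath:
    """Behavioral model: ResCor sum -> integrator -> truncation -> limiter."""

    ACC_BITS = 16   # integrator width stated in the text
    DAC_BITS = 6    # DAC resolution stated in the text

    def __init__(self, code_min=0, code_max=63, initial_code=32):
        self.shift = self.ACC_BITS - self.DAC_BITS   # assumed: drop 10 LSBs
        # Assumed: start in the middle of the bin of the initial DAC code.
        self.acc = (initial_code << self.shift) + (1 << (self.shift - 1))
        self.code_min = code_min
        self.code_max = code_max

    def update(self, rescor_group):
        """Accumulate one group of ResCor values and return the DAC code."""
        self.acc -= sum(rescor_group)                # minus sign follows (1)
        self.acc = max(0, min(self.acc, (1 << self.ACC_BITS) - 1))  # saturate
        code = self.acc >> self.shift                # truncation
        return max(self.code_min, min(code, self.code_max))  # range limiter


path = TapPath()
first_code = path.update([1, 1, 1, 1])   # four parallel ResCor values
```

Under these assumptions one DAC LSB corresponds to 1024 integrator counts, so the truncation plays the role of a step size of roughly 2<sup>-10</sup> DAC LSB per unit of ResCor, and the DAC code moves only after many consistent votes.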

#### V. EXPERIMENTAL RESULTS

The TX and RX chips are fabricated in a 65-nm CMOS process. The chips are mounted on PCBs through wire bonding and they are connected to the testing instruments via SMA connectors and connection cables. Fig. 22 shows the micrographs and power breakdown when applying a 1.2-V supply at 40 Gb/s. The TX chip occupies an area of 0.6 mm<sup>2</sup> and consumes a total power of 145 mW with a 400-mV single-ended swing. The RX chip occupies 1.92 mm<sup>2</sup> (including the testing circuits) and dissipates 225 mW (excluding the testing circuits).

#### A. Transmitter Chip Measurement

The TX output is measured after a channel consisting of a doubled bonding wire, a 4-cm PCB trace, and a 0.5-m



Fig. 22. Micrographs and power breakdown of (a) TX chip and (b) RX chip.



Fig. 23. Measured output eye-diagrams of the TX at (a) 5 Gb/s with over equalization, (b) 40 Gb/s without equalization, (c) 40 Gb/s with proper equalization, and (d) 50 Gb/s with proper equalization.

connection cable. Fig. 23(a) shows the over-equalized eye-diagram at 5 Gb/s, where the four sub-levels are contributed by the four FFE taps. Fig. 23(b) and (c) gives the output eye-diagrams at 40 Gb/s before and after applying the four-tap FFE. The FFE significantly improves the eye opening: the eye height and eye width increase from 140 mV and 0.45 UI to 180 mV and 0.68 UI, respectively. Meanwhile, the thickness of the eyelid is reduced from around 330 to 140 mV. Fig. 23(d) shows the properly



Fig. 24. Measured output eye-diagrams with four separate eyes. (a) Clock pattern. (b) PRBS pattern.

compensated eye-diagram at the maximum operation speed of 50 Gb/s. Its eye height and eye width are 50 mV and 0.38 UI. Clearly, a wide operation range from 5 to 50 Gb/s is achieved, which is mainly attributed to the multi-MUX-based FFE implementation. Fig. 24 further shows the TX output with four separate eyes. It can be seen that the horizontal eye widths for both fixed clock and PRBS patterns are almost identical, thus proving that the four sampling phases are properly aligned.

#### B. Receiver Chip Measurement

The RX standalone measurement results are presented in this part. Fig. 25(a) shows the eye-diagram of the 40-Gb/s input data generated by an Anritsu MP1812A through combining four 10-Gb/s PRBS7 sequences, where the single-ended eye height and eye width are around 360 mV and 0.71 UI. Fig. 25(b) shows the eye-diagram of the 10-Gb/s recovered

TABLE I Performance Summary and Comparison

| Transmitter Chip                   | [4]       | [16]                         | [20]        | This work                   |
|------------------------------------|-----------|------------------------------|-------------|-----------------------------|
| Technology (nm)                    | 65        | 14                           | 65          | 65                          |
| Supply (V)                         | 1.2       | N/A                          | 1.2         | 1.2                         |
| Data Rate (Gb/s)                   | 50-64     | 16-40                        | 38.8-42     | 5-50                        |
| Area (mm<sup>2</sup>)              | 1.2 × 1.0 | 0.215 × 0.13                 | 0.9 × 0.7   | 1.2 × 0.5                   |
| TX Equalization                    | 4-tap FFE | 4-tap FFE                    | 5-tap FFE   | 4-tap FFE                   |
| 1UI-Delay Gen.                     | LC-Delay  | Multi-MUX                    | CML-Delay   | Multi-MUX                   |
| MUX Type                           | 4:1       | 4:1                          | N/A         | 4:1                         |
| Data Jitter<br>RJ (ps<sub>rms</sub>) | N/A     | 0.33@28Gb/s<br>0.51@40Gb/s   | N/A         | 0.23@40Gb/s<br>0.18@50Gb/s  |
| Data Jitter (ps)<br>TJ (BER=1e-12) | N/A       | 10.72@28Gb/s<br>12.89@40Gb/s | 15.1@40Gb/s | 9.90@40Gb/s<br>10.58@50Gb/s |
| Power (mW)                         | 199*      | 518                          | 135**       | 145*                        |
| Energy Efficiency<br>(pJ/bit)      | 3.1       | 12.9                         | 3.4         | 3.6                         |

\*Exclude PLL power consumption, \*\* Standalone FFE without data serializer.

| Receiver Chip                   | [2]                   | [17]             | This work                              |
|---------------------------------|-----------------------|------------------|----------------------------------------|
| Technology (nm)                 | 28                    | 22               | 65                                     |
| Supply (V)                      | 1.1/0.85              | 1.07             | 1.2                                    |
| Data Rate (Gb/s)                | 40                    | 4-32             | 40                                     |
| Area (mm<sup>2</sup>)           | 0.81/Lane*            | 0.079/Lane       | 1.92                                   |
| Sampling Rate                   | Quarter-Rate          | Half-Rate        | Quarter-Rate                           |
| Equalization<br>Adaptation      | Minimum-BER<br>Engine | SS-LMS<br>Engine | Edge-Data<br>Correlation<br>Based S-ZF |
| Multi-phase Gen.                | DLL+PIs               | MCDLL+PIs        | DIV2+PIs                               |
| Jitter Suppression<br>Technique | Split-Path<br>CDR     | N/A              | Adaptive-BW<br>LPFs                    |
| JTOL Amplitude (UI)             | 0.2@80MHz             | 0.2@40MHz        | 0.41@100MHz                            |
| JTOL Bandwidth (MHz)            | 10                    | 20**             | 20                                     |
| Power (mW)                      | 630                   | 79.64            | 225                                    |

\* Area of whole transceiver, \*\* Estimated from jitter tolerance results.

Fig. 25. Measured eye-diagrams for (a) input data at 40 Gb/s, (b) recovered data at 10 Gb/s, (c) recovered edge-sampling clock without LPFs at 5 GHz, and (d) recovered data-sampling clock with LPFs at 5 GHz.

data with a total jitter of 12.73 ps. The eye-diagrams of the recovered clocks (divided by 2) for the edge sampling and data sampling are shown in Fig. 25(c) and (d), which reveal that the introduced LPFs reduce the total jitter from 11.48 to 7.66 ps. To demonstrate the effect of the LPFs with adaptively adjusting bandwidth, the jitter transfer (JTRAN) and JTOL curves are measured using a Tektronix BSA286C with a CDR block. The input peak-to-peak swing is tuned to 800 mV and the control voltage of the CTLE is manually set to 710 mV. The JTRAN curves in Fig. 26 illustrate that the bandwidth of the data-sampling path, which depends on the LPFs, is 4 MHz, much smaller than the 18 MHz of the edge-sampling path determined by the loop parameters. The measured JTOL in Fig. 26 indicates that the embedded LPFs significantly attenuate the dip around the corner frequency and noticeably improve the JTOL amplitudes at



Fig. 26. Measured JTRAN and JTOL with PRBS7 at 28 Gb/s.

high jitter frequencies. Meanwhile, because the bandwidth of the LPFs adjusts adaptively, they have little effect on the phase-tracking slew rate at low jitter frequencies. Additionally, the corner frequency of the JTOL is about 20 MHz, which is much larger than the JTRAN bandwidth of 4 MHz.

#### C. Adaptive Equalization Validation

To demonstrate the effectiveness of the developed edge-data cross correlation-based S-ZF algorithm, a chip-to-chip interconnect is constructed, as shown in Fig. 27(a). The outputs of the TX chip and the inputs of the RX chip are separately wire bonded to the two terminals of a 12-cm PCB channel. An auxiliary PCB with a TX chip bonded to a replica channel is manufactured to measure the far-end eye-diagrams. Fig. 27(b) shows the frequency response of the PCB channel,



Fig. 27. Constructed chip-to-chip interconnect. (a) PCB photograph. (b) Channel frequency response.



Fig. 28. Adaptively adjusted bias voltages of the TX-FFE with different RX-CTLE control voltages.

where the channel loss at the half-baud frequency is over 16 dB. Fig. 28 shows the adaptively adjusted bias voltages of the TX-FFE taps as the control voltage of the RX-CTLE changes from 900 to 615 mV [see the corresponding equalization abilities in Fig. 18(b)] when operating at 40 Gb/s. Fig. 29 shows the far-end eye-diagrams under the bias conditions of A, B, D, and F shown in Fig. 28. As the control voltage of the RX-CTLE is decreased (i.e., improving the high-frequency peaking ability of the RX-CTLE), the TX-FFE bias voltages are adjusted accordingly to decrease the equalization capability



Fig. 29. Measured far-end eye-diagrams for (a) bias condition A, (b) bias condition B, (c) bias condition D, and (d) bias condition F shown in Fig. 28.



Fig. 30. Measured bathtub curves under different bias conditions shown in Fig. 28.

of the TX-FFE, thus maintaining the frequency response of the combined TX-FFE, RX-CTLE, and transmission channel close to a flat profile. By detecting the BER while adjusting the sampling positions, the bathtub diagram can be obtained. Fig. 30 shows the measured bathtub curves under the bias conditions of A, C, and F described in Fig. 28. For the balanced equalization coefficient allocation under bias condition C, the horizontal eye opening at BER = 1e-12 reaches 0.51 UI, which is much better than those measured under bias condition A (0.30 UI) and bias condition F (0.35 UI). This indicates that combining the TX-FFE and RX-CTLE is a good choice for the equalization of the 40-Gb/s link.

#### D. Performance Summary and Comparison

The performance summary and comparison with previous studies are given in Table I. The results indicate that the TX chip achieves good jitter performance and power efficiency, even in comparison with the TX embedding LC-delay-based FFE [4] and the standalone CML-based FFE combiner [20]. This is mainly because of the proposed high-speed 4:1 MUX and the compact interleaved-latching scheme. For the RX, the maximum tolerable amplitude of sinusoidal jitter at high frequency outperforms the other two, owing to the introduced LPFs and the developed compensating PI. By removing the auxiliary circuits for error information extraction, the proposed edge-data correlation-based S-ZF algorithm not only avoids introducing the capacitance overhead to the critical path, but also helps to optimize the power efficiency.
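The energy-efficiency figures follow from dividing power by data rate, since 1 mW per Gb/s equals 1 pJ/bit. The short sketch below applies this to the measured power numbers reported for this work at 40 Gb/s; the function name is illustrative.

```python
def energy_per_bit_pj(power_mw, data_rate_gbps):
    """Energy efficiency in pJ/bit; mW divided by Gb/s gives pJ/bit directly."""
    return power_mw / data_rate_gbps


tx_efficiency = energy_per_bit_pj(145, 40)             # TX chip
rx_efficiency = energy_per_bit_pj(225, 40)             # RX chip
chipset_efficiency = energy_per_bit_pj(145 + 225, 40)  # TX plus RX
print(tx_efficiency, rx_efficiency, chipset_efficiency)
```

The TX value of 3.625 pJ/bit corresponds to the 3.6 pJ/bit entry in Table I, and the combined 370 mW gives 9.25 pJ/bit for the full chipset.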

#### VI. CONCLUSION

This paper implements a 40-Gb/s TX and RX chipset over a >16-dB loss PCB channel using a 65-nm CMOS process. The TX utilizes a bandwidth-enhanced 4:1 MUX and an interleaved-retiming latch array to obtain wide operation range, high power efficiency, and small area occupation. By introducing bandwidth-adaptively adjusting LPFs into the clock path for data sampling, the CDR achieves high performance on both low-frequency jitter tracking and high-frequency jitter suppression. To further improve the CDR performance, a TA-based compensating PI is designed to optimize the phase-step uniformity and reduce the phase-spacing shift between edge-sampling and data-sampling clocks. A combined TX-FFE and RX-CTLE is employed to compensate for the channel loss, where a low-cost edge-data correlation-based S-ZF adaptation algorithm is proposed to automatically adjust the TX-FFE's tap weights.

#### ACKNOWLEDGMENT

The authors would like to thank Dr. J. Jia and Dr. Y. Gao for their discussions on the convergence analysis of the iterative equation.

#### References

- [1] U. Singh et al., "A 780 mW 4 × 28 Gb/s transceiver for 100 GbE gearbox PHY in 40 nm CMOS," IEEE J. Solid-State Circuits, vol. 49, no. 12, pp. 3116-3129, Dec. 2014.
- [2] R. Navid et al., "A 40 Gb/s serial link transceiver in 28 nm CMOS technology," IEEE J. Solid-State Circuits, vol. 50, no. 4, pp. 814-827, Apr. 2015.
- [3] P.-C. Chiang, J.-Y. Jiang, H.-W. Hung, C.-Y. Wu, G.-S. Chen, and J. Lee, "4×25 Gb/s transceiver with optical front-end for 100 GbE system in 65 nm CMOS technology," IEEE J. Solid-State Circuits, vol. 50, no. 2, pp. 573-585, Feb. 2015.
- [4] M.-S. Chen and C.-K. K. Yang, "A 50-64 Gb/s serializing transmitter with a 4-tap, LC-ladder-filter-based FFE in 65 nm CMOS technology," IEEE J. Solid-State Circuits, vol. 50, no. 8, pp. 1903-1916, Aug. 2015.
- [5] J. Lee et al., "Design of 56 Gb/s NRZ and PAM4 SerDes transceivers in CMOS technologies," IEEE J. Solid-State Circuits, vol. 50, no. 9, pp. 2061-2073, Sep. 2015.
- [6] T. Takemoto et al., "A 25-Gb/s 2.2-W 65-nm CMOS optical transceiver using a power-supply-variation-tolerant analog front end and data-format conversion," IEEE J. Solid-State Circuits, vol. 49, no. 2, pp. 471-485, Feb. 2014.
- [7] B. Welch. (May 2014). 400G Optics-Technologies, Timing, and Transceivers. Accessed: Oct. 22, 2016. [Online]. Available: http://www.ieee802.org/3/bs/public/14\_05/welch\_3bs\_01\_0514.pdf
- [8] InfiniBand Roadmap. Accessed: Oct. 22, 2016. [Online]. Available: http://www.infinibandta.org/content/pages.php?pg=technology\_overview
- [9] P.-C. Chiang, H.-W. Hung, H.-Y. Chu, G.-S. Chen, and J. Lee, "60 Gb/s NRZ and PAM4 transmitters for 400 GbE in 65 nm CMOS," in IEEE Int. Solid-State Circuits Conf. (ISSCC) Dig. Tech. Papers, Feb. 2014, pp. 42-43.
- [10] H. Tao et al., "40-43-Gb/s OC-768 16:1 MUX/CMU chipset with SFI-5 compliance," IEEE J. Solid-State Circuits, vol. 38, no. 12, pp. 2169-2180, Dec. 2003.
- [11] A. A. Hafez, M.-S. Chen, and C.-K. K. Yang, "A 32-48 Gb/s serializing transmitter using multiphase serialization in 65 nm CMOS technology," IEEE J. Solid-State Circuits, vol. 50, no. 3, pp. 763-775, Mar. 2015.
- [12] B. Raghavan et al., "A sub-2 W 39.8-44.6 Gb/s transmitter and receiver chipset with SFI-5.2 interface in 40 nm CMOS," IEEE J. Solid-State Circuits, vol. 48, no. 12, pp. 3219-3228, Dec. 2013.
- [13] K. Kanda et al., "A single-40 Gb/s dual-20 Gb/s serializer IC with SFI-5.2 interface in 65 nm CMOS," IEEE J. Solid-State Circuits, vol. 44, no. 12, pp. 3580-3589, Dec. 2009.
- [14] S. Kaeriyama et al., "A 40 Gb/s multi-data-rate CMOS transmitter and receiver chipset with SFI-5 interface for optical transmission systems," IEEE J. Solid-State Circuits, vol. 44, no. 12, pp. 3568-3579, Dec. 2009.
- [15] P. Chiang, W. J. Dally, M. J. E. Lee, R. Senthinathan, Y. Oh, and M. A. Horowitz, "A 20-Gb/s 0.13-µm CMOS serial link transmitter using an LC-PLL to directly drive the output multiplexer," IEEE J. Solid-State Circuits, vol. 40, no. 4, pp. 1004-1011, Apr. 2005.
- [16] J. Kim et al., "A 16-to-40 Gb/s quarter-rate NRZ/PAM4 dual-mode transmitter in 14 nm CMOS," in IEEE Int. Solid-State Circuits Conf. (ISSCC) Dig. Tech. Papers, Feb. 2015, pp. 60-61.
- [17] T. Musah et al., "A 4-32 Gb/s bidirectional link with 3-tap FFE/6-tap DFE and collaborative CDR in 22 nm CMOS," IEEE J. Solid-State Circuits, vol. 49, no. 12, pp. 3079-3090, Dec. 2014.
- [18] C. Thakkar, L. Kong, K. Jung, A. Frappe, and E. Alon, "A 10 Gb/s 45 mW adaptive 60 GHz baseband in 65 nm CMOS," IEEE J. Solid-State Circuits, vol. 47, no. 4, pp. 952-968, Apr. 2012.
- [19] J. Jaussi et al., "A 205 mW 32 Gb/s 3-tap FFE/6-tap DFE bidirectional serial link in 22 nm CMOS," in IEEE Int. Solid-State Circuits Conf. (ISSCC) Dig. Tech. Papers, Feb. 2014, pp. 440-441.
- [20] M.-S. Chen, Y.-N. Shih, C.-L. Lin, H.-W. Hung, and J. Lee, "A fully-integrated 40-Gb/s transceiver in 65-nm CMOS technology," IEEE J. Solid-State Circuits, vol. 47, no. 3, pp. 627-640, Mar. 2012.
- [21] A. Cavaciuti et al. (Jul. 2014). CAUI4 Channel Loss Variation Due to Temperature. Accessed: Oct. 22, 2016. [Online]. Available: http://www.ieee802.org/3/bm/public/jul14/interim/tooyserkani\_01\_0714\_optx.pdf
- [22] H. Wang and J. Lee, "A 21-Gb/s 87-mW transceiver with FFE/DFE/analog equalizer in 65-nm CMOS technology," IEEE J. Solid-State Circuits, vol. 45, no. 4, pp. 909-920, Apr. 2010.
- [23] D. Cui et al., "A dual-channel 23-Gbps CMOS transmitter/receiver chipset for 40-Gbps RZ-DQPSK and CS-RZ-DQPSK optical transmission," IEEE J. Solid-State Circuits, vol. 47, no. 12, pp. 3249-3260, Dec. 2012.
- [24] C. Menolfi et al., "A 28 Gb/s source-series terminated TX in 32 nm CMOS SOI," in IEEE Int. Solid-State Circuits Conf. (ISSCC) Dig. Tech. Papers, Feb. 2012, pp. 334-335.
- [25] X. Zheng et al., "A 5-50 Gb/s quarter rate transmitter with a 4-tap multiple-MUX based FFE in 65 nm CMOS," in Proc. IEEE Eur. Solid-State Circuits Conf., Sep. 2016, pp. 305-308.
- [26] R. Reutemann et al., "A 4.5 mW/Gb/s 6.4 Gb/s 22+1-lane source synchronous receiver core with optional cleanup PLL in 65 nm CMOS," IEEE J. Solid-State Circuits, vol. 45, no. 12, pp. 2850-2860, Dec. 2010.
- [27] B. Casper and F. O'Mahony, "Clocking analysis, implementation and measurement techniques for high-speed data links-A tutorial," IEEE Trans. Circuits Syst. I, Reg. Papers, vol. 56, no. 1, pp. 17-39, Jan. 2009.
- [28] N. Kalantari and J. F. Buckwalter, "A multichannel serial link receiver with dual-loop clock-and-data recovery and channel equalization," IEEE Trans. Circuits Syst. I, Reg. Papers, vol. 60, no. 11, pp. 2920-2931, Nov. 2013.
- [29] L. Rodoni, G. von Buren, A. Huber, M. Schmatz, and H. Jackel, "A 5.75 to 44 Gb/s quarter rate CDR with data rate selection in 90 nm bulk CMOS," IEEE J. Solid-State Circuits, vol. 44, no. 7, pp. 1927-1941, Jul. 2009.
- [30] M. Hossain et al., "A 4×40 Gb/s quad-lane CDR with shared frequency tracking and data dependent jitter filtering," in IEEE Symp. VLSI Circuits Dig. Tech. Papers, Jun. 2014, pp. 1-2.
- [31] M. Pozzoni et al., "A multi-standard 1.5 to 10 Gb/s latch-based 3-tap DFE receiver with a SSC tolerant CDR for serial backplane communication," IEEE J. Solid-State Circuits, vol. 44, no. 4, pp. 1306-1315, Apr. 2009.
- [32] J. W. M. Bergmans, Digital Baseband Transmission and Recording. Dordrecht, The Netherlands: Springer, 1996.
- [33] H. Higashi et al., "A 5-6.4-Gb/s 12-channel transceiver with pre-emphasis and equalization," IEEE J. Solid-State Circuits, vol. 40, no. 4, pp. 978-985, Apr. 2005.
- [34] K. Krishna et al., "A multigigabit backplane transceiver core in 0.13-µm CMOS with a power-efficient equalization architecture," IEEE J. Solid-State Circuits, vol. 40, no. 12, pp. 2658-2666, Dec. 2005.
- [35] J. G. Proakis and D. G. Manolakis, Digital Signal Processing: Principles, Algorithms, and Applications, 4th ed. Pearson, 2006.
- [36] H. Kimura et al., "A 28 Gb/s 560 mW multi-standard SerDes with single-stage analog front-end and 14-tap decision feedback equalizer in 28 nm CMOS," IEEE J. Solid-State Circuits, vol. 49, no. 12, pp. 3091-3103, Dec. 2014.



Xuqiang Zheng received the B.S. and M.S. degrees from the School of Physics and Electronics, Central South University, Hunan, China, in 2006 and 2009, respectively. He is currently pursuing the Ph.D. degree with the University of Lincoln, Lincoln, U.K. Since 2010, he has been a Mixed Signal Engineer with the Institute of Microelectronics, Tsinghua University, Beijing, China. His current research interests include high-performance A/D converters and highspeed wireline communication systems.





Chun Zhang (M'03) received the B.S. and Ph.D. degrees from the Department of Electronic Engineering, Tsinghua University, Beijing, China, in 1995 and 2000, respectively.

Since 2000, he has been with Tsinghua University, where he was with the Department of Electronic Engineering from 2000 to 2004 and he has been an Associate Professor with the Institute of Microelectronics since 2005. His current research interests include mixed signal integrated circuits and systems, embedded microprocessor design, digital signal processing, and radio frequency identification.



Fangxu Lv received the B.S. and M.S. degrees from Air Force Engineering University, Xi'an, China, in 2011 and 2014, respectively. He is currently pursuing the Ph.D. degree with Tsinghua University, Beijing, China.

His current research interests include high-speed wireline system design.



Feng Zhao received the B.Eng. degree in electronic engineering from the University of Science and Technology of China, Hefei, China, in 2000, and the M.Phil. and Ph.D. degrees in computer vision from The Chinese University of Hong Kong, Hong Kong, in 2002 and 2006, respectively.

From 2006 to 2007, he was a Post-Doctoral Fellow with the Department of Information Engineering, The Chinese University of Hong Kong. From 2007 to 2010, he was a Research Fellow with the School of Computer Engineering, Nanyang Technological University, Singapore. He was then a Post-Doctoral Research Associate with the Intelligent Systems Research Centre, University of Ulster, Londonderry, U.K. From 2011 to 2015, he was a Workshop Developer and a Post-Doctoral Research Fellow with the Department of Computer Science, Swansea University, Swansea, U.K. From 2015 to 2017, he was a Post-Doctoral Research Fellow with the School of Computer Science, University of Lincoln, Lincoln, U.K. Since 2017, he has been with the Department of Computer Science, Liverpool John Moores University, Liverpool, U.K., where he is currently a Senior Lecturer. His research interests include image processing, biomedical image analysis, computer vision, pattern recognition, machine learning, artificial intelligence, and robotics.



Shuai Yuan received the B.S. and Ph.D. degrees from the Institute of Microelectronics, Tsinghua University, Beijing, China, in 2011 and 2016, respectively.

He is currently a Post-Doctoral Researcher with the Institute of Microelectronics, Tsinghua University. His current research interests include high-speed wireline transceivers and low-power equalizers.



Shigang Yue (M'05-SM'17) received the B.Eng. degree from Qingdao Technological University, Shandong, China, in 1988, and the M.Sc. and Ph.D. degrees from the Beijing University of Technology (BJUT), Beijing, China, in 1993 and 1996, respectively.

He was with BJUT as a Lecturer from 1996 to 1998 and an Associate Professor from 1998 to 1999. He was an Alexander von Humboldt Research Fellow at the University of Kaiserslautern, Kaiserslautern, Germany, from 2000 to 2001. He is currently a Professor of computer science with the School of Computer Science, University of Lincoln, Lincoln, U.K. Before joining the University of Lincoln as a Senior Lecturer in 2007, where he was promoted to Reader in 2010 and Professor in 2012, he held research positions with the University of Cambridge, Cambridge, U.K., Newcastle University, Newcastle upon Tyne, U.K., and University College London, London, U.K. His current research interests include artificial intelligence, computer vision, robotics, brains and neuroscience, biological visual neural systems, evolution of neuronal subsystems, and their applications, e.g., in collision detection for vehicles, interactive systems, and robotics.

Dr. Yue is a member of the International Neural Network Society, International Society of Artificial Life, and International Symposium on Biomedical Engineering. He is the Founding Director of the Computational Intelligence Laboratory, Lincoln. He is the coordinator for several EU FP7 projects.



Ziqiang Wang received the B.S. and Ph.D. degrees from the Department of Electronic Engineering, Tsinghua University, Beijing, China, in 1999 and 2006, respectively.

After receiving the Ph.D. degree, he was a Research Assistant with the Institute of Microelectronics, Tsinghua University, where he has been an Associate Professor since 2015. His current research interests include analog circuit design.



Fule Li received the B.S. and M.S. degrees in electrical engineering from Xidian University, Xi'an, China, in 1996 and 1999, respectively, and the Ph.D. degree in electronic engineering from Tsinghua University, Beijing, China, in 2003.

Since 2003, he has been with Tsinghua University, where he is currently an Associate Professor with the Institute of Microelectronics. His current research interests include analog and mixed-mode integrated circuit design, especially high-performance data converters.

Zhihua Wang (SM'04-F'17) received the B.S., M.S., and Ph.D. degrees in electronic engineering from Tsinghua University, Beijing, China, in 1983, 1985, and 1990, respectively.

In 1983, he joined the faculty at Tsinghua University, where he has been a Full Professor since 1997 and the Deputy Director of the Institute of Microelectronics since 2000. From 1992 to 1993, he was a Visiting Scholar with Carnegie Mellon University, Pittsburgh, USA. From 1993 to 1994, he was a Visiting Researcher with KU Leuven, Leuven, Belgium. He is the co-author of ten books and book chapters, over 90 papers in international journals, and over 300 papers in international conferences. He holds 58 Chinese patents and four U.S. patents. His current research interests include CMOS radio frequency integrated circuit (RFIC), biomedical applications, radio frequency identification, phase locked loop, low-power wireless transceivers, and smart clinic equipment with combination of leading edge CMOS RFIC and digital imaging processing techniques.

Prof. Wang was an Official Member of the China Committee for the Union Radio-Scientifique Internationale from 2000 to 2010. He served as a Technologies Program Committee Member of the IEEE International Solid-State Circuit Conference from 2005 to 2011. He has been a Steering Committee Member of the IEEE Asian Solid-State Circuit Conference since 2005. He has served as the Deputy Chairman of the Beijing Semiconductor Industries Association and the ASIC Society of Chinese Institute of Communication, as well as the Deputy Secretary General of the Integrated Circuit Society in the China Semiconductor Industries Association. He was one of the chief scientists of the China Ministry of Science and Technology, serving on the Expert Committee of the National High Technology Research and Development Program of China (863 Program) in the area of information science and technologies from 2007 to 2011. He was the Chairman of the IEEE Solid-State Circuit Society Beijing Chapter from 1999 to 2009. He has served as the Technical Program Chair of the 2013 A-SSCC. He served as the Guest Editor of the IEEE JOURNAL OF SOLID-STATE CIRCUITS Special Issue in 2006 and 2009. He is an Associate Editor of the IEEE TRANSACTIONS ON BIOMEDICAL CIRCUITS AND SYSTEMS and the IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS-PART II: EXPRESS BRIEFS.

Hanjun Jiang (S'01-M'07) received the B.S. degree in electronic engineering from Tsinghua University, Beijing, China, in 2001, and the Ph.D. degree in electrical engineering from Iowa State University, Ames, IA, USA, in 2005.

From 2005 to 2006, he was with Texas Instruments, Dallas, TX, USA. After that, he was with Tsinghua University, where he is currently an Associate Professor. He has authored over 80 peer-reviewed journal and conference papers. His current research interests include analog and RF circuit design, and system technologies for wireless medical and healthcare applications.

Dr. Jiang has been the IEEE Solid-State Circuits Society Beijing Chapter Chair since 2015. He is currently the Associate Editor of the IEEE TRANSACTIONS ON BIOMEDICAL CIRCUITS AND SYSTEMS.



