Abstract
The crossbar is the most popular network for digital systems such as Internet routers, multiprocessors (across multiple chips or on a single chip), and others. However, because its cost grows with the square of its radix, and because of earlier implementations in various technologies, it is widely accepted that the crossbar is prohibitively expensive for radices above 32 or 64, so that more complex networks are needed, in which the crossbar serves as a building block. In this thesis, we develop VLSI architectures and their detailed layouts in order to scale the crossbar to radices well above 100.
Specifically, we laid out in detail in VLSI a 128×128×24 Gb/s crossbar that interconnects 128 user "tiles" of 1 mm² each in a single hop, using 16 mm² in 90 nm CMOS technology. The crossbar is 32 bits wide, runs at 750 MHz, and consumes 7 W.
In router applications, the tiles will contain memory, implementing combined queueing at the inputs and outputs, plus a small piece of control logic. We show that this architecture is the best within a range of router architectures, for two reasons: (a) it achieves ideal performance using only a small speedup in the crossbar or in the memories, regardless of radix; and (b) it partitions the memory minimally, providing (i) high SRAM density by using few, large, and therefore dense blocks, and (ii) high memory utilization through efficient sharing among flows. In multiprocessor applications, the tiles will contain a processor and its cache. When traffic is global and heavy, such a system is competitive with the popular mesh systems, owing to the simplified routing and load distribution of the crossbar.
To scale the crossbar to high radix, we developed novel VLSI architectures. We implement the datapath with trees of multiplexor gates, since tristate buses are very slow due to inherently large parasitic capacitances, and we show that compacting the multiplexor trees increases their speed. In addition, we show that: (a) the area of the crossbar is determined by the multiplexor gates for all practical values of its radix N and its width W, and thus grows as O(N²W), not as O(N²W²), the rate at which it would grow if it were determined by the wires, as is believed in the literature; and (b) the delay of the crossbar is determined by the wire parasitics, and because wire length grows with the perimeter of the crossbar, the delay grows as O(N√W), not as O(log N), the rate at which it would grow if it were determined by the gates, as is believed in the literature. Finally, we show that, through custom placement of the gates, EDA tools can be guided to solutions that efficiently exploit the abundance of available wires.
For the control path, we study a traditional architecture of iSLIP, one of the most popular parallel matching schedulers, which implements the matching decision of each input and each output in a separate arbiter block and communicates the matching decisions over block-to-block links. First, we show that the links occupy O(N⁴) area. Thus, a radix-128 iSLIP occupies 14 mm², of which the links take up more than 50%. Next, we observe that the internal wires of an arbiter block occupy O(N log N) area, and we propose a new architecture that inverts the locality of the wires by orthogonally interleaving the arbiter blocks, thereby reducing the wiring area to O(N² log² N). With this architecture, the radix-128 iSLIP needs negligible area for the links and fits in 7 mm², a 50% reduction compared with the traditional architecture. For a radix-256 iSLIP, the reduction approaches an order of magnitude. Finally, the total delay is less than 10 ns, so the crossbar can operate with packets as small as 30 bytes.
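The headline figures in the abstract can be cross-checked with a few lines of arithmetic. The sketch below is illustrative only and rests on two assumptions that are not spelled out in the abstract: that the per-port line rate is one 32-bit word per 750 MHz clock cycle, and that the minimum packet is the amount of data arriving on one port during one 10 ns scheduling period. The `growth` helper is a hypothetical name, and it compares only the asymptotic terms O(N⁴) and O(N² log² N) with all constant factors omitted, so its ratios are not area predictions.

```python
import math

WIDTH_BITS = 32      # crossbar width, from the abstract
CLOCK_MHZ = 750      # crossbar clock, from the abstract
SCHED_TIME_NS = 10   # upper bound on the scheduling time, from the abstract

# Per-port line rate in Mb/s: one WIDTH_BITS word per clock cycle (assumed).
line_rate_mbps = WIDTH_BITS * CLOCK_MHZ

# Bits arriving on one port during one scheduling period.
# Mb/s * ns = 1e6 * 1e-9 bits = 1e-3 bits, hence the division by 1000.
bits_per_schedule = line_rate_mbps * SCHED_TIME_NS // 1000
min_packet_bytes = bits_per_schedule // 8


def growth(area_term, n_from=128, n_to=256):
    """Ratio of an asymptotic area term when the radix goes n_from -> n_to."""
    return area_term(n_to) / area_term(n_from)


# Arbiter-to-arbiter links of the traditional scheduler: O(N^4).
links_growth = growth(lambda n: n ** 4)
# Wiring of the interleaved scheduler: O(N^2 log^2 N).
interleaved_growth = growth(lambda n: (n * math.log2(n)) ** 2)

print(line_rate_mbps, "Mb/s per port")
print(min_packet_bytes, "bytes minimum packet")
print(links_growth, round(interleaved_growth, 2))
```

Under these assumptions the script reports 24000 Mb/s and 30 bytes, which agrees with the 24 Gb/s and 30-byte figures quoted in the abstract. Doubling the radix from 128 to 256 multiplies the N⁴ term by 16 but the N² log² N term by only about 5.2, which is consistent in direction with the larger savings the abstract reports at radix 256.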
Abstract in another language
The crossbar is the most popular switch for digital systems such as Internet routers, clusters, and multiprocessors (on-chip as well as multichip). However, because the cost of the crossbar grows with the square of its radix, and because of past implementations in various technologies, it is widely believed that the crossbar is not scalable to radices beyond 32 or 64, and that for higher radices more complicated networks are needed, where the crossbar is the basic building block. In this thesis, we scale the crossbar to radices well beyond 100 by crafting novel VLSI micro-architectures and their detailed CMOS layouts.
As a case study, we laid out a 128×128×24 Gb/s crossbar, interconnecting 128 1 mm² "user tiles" in a single hop, using just 16 mm² of silicon in 90 nm CMOS. The crossbar is 32 bits wide, runs at 750 MHz, and consumes 7 W.
In router systems, the user tiles will contain memory implementing combined queueing at the inputs and outputs of the crossbar, plus a small part of logic for port control. We show that this architecture is the best among a range of known router memory architectures (e.g. totally shared memory, solely input queueing, or crosspoint queueing), for two reasons: (i) it gives top performance using only a modest speedup on either the crossbar or the memories, independent of radix; and (ii) it partitions the memory space only linearly with the radix, thus yielding: (a) high SRAM density by using few, large, and area-efficient blocks; and (b) high memory space utilization through flexible sharing among flows. In chip multiprocessors, the user tiles will contain cache or local memory, plus a small part of logic for the processor. When traffic is global and heavy, such a system is competitive with the popular mesh-centric systems, owing to the simplified routing and load balancing of the crossbar.
We made high-radix crossbars feasible by developing novel VLSI micro-architectures for both their datapath and their control path. We implement the datapath using trees of multiplexor gates, as tristate buses are slowed down by intrinsically large parasitic capacitances, and we show that highly concentrated trees are more area efficient by further reducing the parasitic capacitance of their internal wires. Moreover, we contribute an experimental analysis showing that: (i) the area of the crossbar is gate limited for all practical values of its radix N and its width W, thus growing as O(N²W), not as O(N²W²), which would have been the case had area been wire limited, as is commonly believed in the literature; and (ii) the delay of the crossbar is dominated by the parasitics of wires, and because wire length grows with the perimeter of the crossbar, delay grows as O(N√W), not as O(log N), which would have been the case had delay been gate limited, as is commonly believed in the literature. Next, we propose novel pipelines to cope with the delay of the interconnect. Finally, we demonstrate that modern EDA tools can be guided to exploit the abundance of wiring resources through custom, but algorithmic, placement of gates.
For the control path, we study the architecture of iSLIP, which is the most popular parallel matching crossbar scheduler. In particular, we study a traditional iSLIP architecture that implements the matching decision of each input and each output of the crossbar in a separate arbiter block, and communicates the matching decisions between the input and the output arbiters through global arbiter-to-arbiter links. First, we show that this architecture is expensive because the arbiter-to-arbiter links take up O(N⁴) area. Thus, a radix-128 iSLIP scheduler occupies 14 mm², where the arbiter-to-arbiter links account for more than 50%. Next, by observing that the wiring of an arbiter fits in O(N log N) area, we propose a novel architecture that inverts the locality of wires by orthogonally interleaving the input with the output arbiters, thus lowering the wiring area of the scheduler to O(N² log² N). Using this architecture, the radix-128 iSLIP scheduler becomes gate limited, fitting in 7 mm², which is a 50% reduction compared to the traditional architecture. For a higher radix of 256, area is reduced by almost an order of magnitude. Finally, the running time of the proposed scheduler is less than 10 ns, thus allowing operation with a minimum packet as small as 30 bytes at a 24 Gb/s line rate.
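For readers unfamiliar with iSLIP, the matching decisions that the input and output arbiters make can be sketched in software. The following is a minimal behavioural sketch of the request-grant-accept rounds as iSLIP is generally described, with round-robin pointers that advance only when a grant is accepted in the first iteration. It is not the hardware architecture proposed in the thesis, and the names `round_robin_pick`, `islip`, `grant_ptr` and `accept_ptr` are hypothetical.

```python
def round_robin_pick(candidates, pointer, n):
    """Return the first index at or after `pointer` (cyclically) that is
    in `candidates`, or None if `candidates` is empty."""
    for offset in range(n):
        idx = (pointer + offset) % n
        if idx in candidates:
            return idx
    return None


def islip(requests, grant_ptr, accept_ptr, iterations=1):
    """Compute one cell-time matching.

    requests[i][j] is True when input i holds a cell for output j.
    grant_ptr[j] and accept_ptr[i] are round-robin pointers, updated in place.
    Returns a list where entry i is the output matched to input i, or None.
    """
    n = len(requests)
    in_match = [None] * n   # output matched to each input
    out_match = [None] * n  # input matched to each output

    for it in range(iterations):
        # Grant: each unmatched output picks one requesting unmatched input.
        grants = {}  # input index -> set of outputs that granted to it
        for j in range(n):
            if out_match[j] is not None:
                continue
            reqs = {i for i in range(n)
                    if in_match[i] is None and requests[i][j]}
            i = round_robin_pick(reqs, grant_ptr[j], n)
            if i is not None:
                grants.setdefault(i, set()).add(j)

        # Accept: each input that received grants accepts exactly one.
        for i, outs in grants.items():
            j = round_robin_pick(outs, accept_ptr[i], n)
            in_match[i], out_match[j] = j, i
            if it == 0:
                # Pointers move one past the match, first iteration only.
                grant_ptr[j] = (i + 1) % n
                accept_ptr[i] = (j + 1) % n
    return in_match


# Example: 3x3 switch; inputs 0 and 1 both want output 0.
requests = [[True, True, False],
            [True, False, False],
            [False, False, True]]
grant_ptr, accept_ptr = [0, 0, 0], [0, 0, 0]
first = islip(requests, grant_ptr, accept_ptr)
second = islip(requests, grant_ptr, accept_ptr)
print(first, second)
```

In the example, output 0 serves input 0 in the first cell time and input 1 in the second, because its grant pointer moved past input 0 once that grant was accepted. Every output needs to see the requests of every input and every input the grants of every output, which is the all-to-all communication that the arbiter-to-arbiter links carry in the traditional architecture described above.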