A New Approach to Design of Massively Parallel Systems
Narendra Karmarkar
August 13, 2008
Massively parallel systems
• Given availability of
  • high density CMOS logic,
  • large volume of data, and
  • instant access through internet,
  the benefit of building massively parallel systems and creating several useful applications is well recognized
• However, current parallel systems have serious limitations
  • Power hungry: several megawatts per petaflop
  • Programming complexity: very high
  • Not very general purpose
Underlying technical reasons
• Global communication
  Most energy dissipation and delays come from
  • moving the data around rather than
  • processing locally available data.
• Bandwidth is too low
• Latency in moving data is too high.
• Energy dissipation per bit is too high for data movement as compared to logical operations
Fundamental reasons?
• Are these difficulties due to any
  • logical necessity: a complexity theory result?
  • physical necessity: laws of physics preventing better systems?
• It turns out it is neither; just historical
  • Incremental evolution of computer architecture has trapped us in a local minimum
  • Fresh thinking is needed to pull us out towards better alternatives
Alternative approaches
• Other approaches
  • optical computing
  • superconducting logic, etc.
• Our approach:
  • Recognize that CMOS is very good at logic
    • Don't fix what isn't broken
  • Processing power of CMOS is grossly underutilized.
    • Instead, supplement Moore's law actively
    • Supply the missing component
• Our new design: very powerful global communication
  • Low latency, very high bandwidth
  • Extremely low energy per bit, scales with gate capacitance
  • High packing density
  • Scales with CMOS scaling
Several new designs
• Parallel supercomputers based on:
  • massively parallel quantum tunneling,
  • electron optics, and
  • finite projective geometry.
• As compared to contemporary designs:
  • Overcomes the bandwidth and latency obstacle
  • Large improvement in teraflops per kilowatt.
  • Significant reduction in programming complexity
• Broad applicability to many domains, including those requiring a lot of pointer chasing
• Can be implemented using known fabrication techniques.
New designs
• Secondary storage
  • Low latency, multiported
  • Based on magneto-optics
  • Integrates shared memory directly at physical level.
  • Valuable feature for business databases as well as applications requiring frequent shared access to massive amounts of common data, such as Google Earth
High bandwidth switches
• Required for building next generation internet infrastructure
• Based on insights derived from our lectures on interior point theory of global optimization
Interior point approach: more than just world's fastest linear programming solver
Has many other powerful applications to:
• Combinatorial problems
• Logic, theorem proving
• Inductive inference, machine learning
• Global optimization
Global optimization
Traditionally, optimization theory deals with
convex problems: essentially unique global minimum
However, optimization problems with multiple global minima arise in many contexts
• In nature
• In engineered systems
They lead to a number of problems:
• Finding optimal solution
• Proving optimality
• Inverse problem
• Synthesis
Problems related to global optimization
• Finding optimal solution
  • Given energy function, find minima (value and locations)
• Proving optimality
  • Given one such solution,
  • How does one prove that there is no better solution?
  • What types of proofs are possible?
  • How long might such proofs be?
  • How do different global minima relate to each other?
• Inverse problem
  • If we know all (or many) global minima,
  • how can we find the energy function?
  • This is like reverse engineering nature.
Synthesis
• Given specification of desired minima in terms of their
  • value
  • locations
  • shape of energy landscape around the minima
• How do we design a system with appropriate energy function
  subject to further constraints?
  • Mixture of direct and inverse problems in the same context
• All these questions are
  • related to each other and also to
  • how systems occurring in nature behave
  • how artificially engineered systems might be designed
Energy functions with multiple global minima
in physical systems
Example: crystal structure
• Potential energy function is periodic in 3D
• The related structure has translational symmetry
Mathematical analysis of the structure
Motion of electrons in the resulting structure can be analyzed theoretically. The theory of semiconductors is based on such analysis.
• 3D translational symmetries of the structure form a group
• Unit cell of the crystal: called the fundamental domain
• Translated copies of the fundamental domain exactly fill up 3-space.
Similarly, we artificially create a structure with the symmetries we want
Artificially created symmetry
Analysis of artificially created symmetric structure
• We are able to analyze motion of electrons in such a structure theoretically, based on our new theory. (Previous theory could only analyze electron optic systems with a dominant central optical axis.)
• Picture shows 2D projections of calculated 3D electron trajectories.
• The structure is based on finite projective geometry, which forms the basis of interprocessor communication in this architecture.
• As before, copies of the fundamental domain moved by action of the symmetry group fill the free space exactly.
• Within each fundamental region, there is a conceptual tube for electron flow.
• Quantum tunneling at the origin of each tube is very fast (10^… sec).
Basis for energy efficiency of the structure
• Quantum tunneling is a lossless process.
• Movement of electrons through free space is also lossless, as there are no collisions.
• Energy that electrons gain in the first half of the flight (acceleration phase) is returned to the field in the second half of the flight (deceleration phase).
• As a result, this surface normal communication link in the 3rd dimension is more energy efficient than conventional surface parallel links on the chip itself.
• The only energy loss is in charging and discharging of gate capacitances in the transmitter and receiver, which is similar for on-chip local links between logic blocks.
• Such 3D links are formally as powerful as an infinite number of 2D links.
• Once you have these, you can reduce the number of metallization layers in CMOS and save cost.
Encoding information on electron channels
• Again exploit multiple global optima in quantum systems
• Choose a quantum system where a priori (without any specific field) the tunneling probability has a number of global maxima for symmetry reasons
• Create a bias in favor of one of the outcomes by applying a voltage, encoding one out of many possibilities
Introduction to finite projective geometries
Consider a finite field $F = GF(s)$ having $s$ elements, where $s = p^k$, $p$ a prime number, $k$ a positive integer.
Take a $(d+1)$-dimensional vector space over the field $F$.
Each ray through the origin is a point of the projective space.
Each $(n+1)$-dimensional subspace of the vector space is an $n$-dimensional subspace of the projective geometry.
Number of points in the projective space:
$\frac{s^{d+1} - 1}{s - 1}$
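The point count can be checked by brute force on small cases. The sketch below is an illustration only: the function names `projective_points` and `point_count` are made up here, and it assumes the field size is a prime p (that is, k = 1) so that the field elements can be written as the integers 0..p-1. Each ray through the origin is represented by the one vector on it whose first nonzero coordinate is 1.

```python
from itertools import product

def projective_points(d, p):
    """One normalized representative per point of P(d, GF(p)), p assumed prime."""
    points = []
    for v in product(range(p), repeat=d + 1):
        # First nonzero coordinate of v (None for the zero vector).
        first = next((x for x in v if x != 0), None)
        # Every ray through the origin contains exactly one vector whose
        # first nonzero coordinate equals 1; keep only that one.
        if first == 1:
            points.append(v)
    return points

def point_count(d, s):
    """Closed-form count (s^(d+1) - 1) / (s - 1)."""
    return (s ** (d + 1) - 1) // (s - 1)

if __name__ == "__main__":
    for d, p in [(2, 2), (2, 3), (4, 2)]:
        print(d, p, len(projective_points(d, p)), point_count(d, p))
```

For d = 2 and s = 2 both counts give 7, and the closed form gives 21 for d = 2, s = 4.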
Architecture based on projective geometries
1. Select parameters of the geometry
   p, k, d → P(d, GF(p^k))
   Let Ω_i denote the collection of all projective subspaces of dimension i.
   Thus,
   • Ω_0 is the set of all points
   • Ω_1 is the set of all lines
   • Ω_{d-1} is the set of all hyperplanes, etc.
   Consider the collection of dimensions 0, 1, ..., d-1
2. Each hardware resource is associated with a subspace
3. Two resources corresponding to x, y are connected
   iff x ⊂ y
   and dim(x) = dim(y) - 1
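As a small illustration of the connection rule (resources at subspaces x and y are linked when x is contained in y and the dimensions differ by one), the sketch below wires the point resources to the line resources of the smallest plane, P(2, GF(2)). The names `points`, `lines`, `incident` and `links` are assumptions made for this example, and the dual-vector labelling of lines is one convenient convention, not something stated on the slides.

```python
from itertools import product

P = 2  # field size for this sketch; the code below is only valid for GF(2)

# Points of P(2, GF(2)): the nonzero vectors of GF(2)^3
# (over GF(2) every ray contains exactly one nonzero vector).
points = [v for v in product(range(P), repeat=3) if any(v)]

# By duality, a nonzero vector a also labels the line {x : a . x = 0}.
lines = list(points)

def incident(point, line):
    """True when the point lies on the line (dot product is 0 in GF(2))."""
    return sum(x * a for x, a in zip(point, line)) % P == 0

# Connection rule: a dimension-0 resource is linked to a dimension-1
# resource exactly when the point is contained in the line.
links = [(pt, ln) for pt in points for ln in lines if incident(pt, ln)]

if __name__ == "__main__":
    print(len(points), "points,", len(lines), "lines,", len(links), "links")
```

Each of the 7 points ends up linked to 3 lines and each line to 3 points, 21 links in total.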
Rule based allocation of computational resources
The smallest-dimension subspace containing all the required data and available computing resources does the computation.
This rule can be directly implemented in hardware.
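One way to read this rule: the subspace that does the work is the span of the points holding the operands, and its projective dimension is the rank of those points minus one. The sketch below is a software illustration only, not the hardware mechanism. It assumes a geometry over GF(2) with points encoded as nonzero integer bitmasks; `gf2_rank` and `span_dimension` are hypothetical helper names.

```python
def gf2_rank(vectors):
    """Rank over GF(2) of vectors given as integer bitmasks."""
    pivots = {}  # highest set bit -> basis vector with that leading bit
    for v in vectors:
        while v:
            high = v.bit_length() - 1
            if high not in pivots:
                pivots[high] = v      # new independent direction
                break
            v ^= pivots[high]         # eliminate the leading bit and retry
    return len(pivots)

def span_dimension(points):
    """Projective dimension of the smallest subspace containing the points."""
    return gf2_rank(points) - 1

if __name__ == "__main__":
    print(span_dimension([0b001]))                 # a single point
    print(span_dimension([0b001, 0b010]))          # two points span a line
    print(span_dimension([0b001, 0b010, 0b011]))   # three collinear points
    print(span_dimension([0b001, 0b010, 0b100]))   # three points spanning a plane
```

Under this reading, operands held at one point are handled at dimension 0, operands at two distinct points at the line through them, and so on.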
Examples of computation done at various dimensions
1. Computation done at points (dim = 0) → PIM
2. Computation done at lines (dim = 1)
   Consider a binary operation
   Suppose a ∈ memory module corresponding to point p
   Suppose b ∈ memory module corresponding to point q
   The pair of points p, q determines a unique line
   The processor associated with the line is responsible for the operation
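The lookup from a pair of memory modules to the responsible processor can be sketched for the 7-point plane over GF(2). The encoding (points as the integers 1..7 read as 3-bit vectors) and the names `line_through`, `PROCESSOR_OF` and `processor_for` are assumptions for illustration; the processor numbering is arbitrary.

```python
def line_through(p, q):
    """Unique line of P(2, GF(2)) through two distinct points.

    Points are the integers 1..7 read as nonzero 3-bit vectors over GF(2);
    the third point on the line is the XOR (vector sum) of the two.
    """
    if p == q:
        raise ValueError("both operands live at the same point (dim 0 case)")
    return frozenset((p, q, p ^ q))

# Enumerate the 7 lines once and give each an arbitrary processor number.
LINES = sorted({line_through(p, q) for p in range(1, 8)
                for q in range(1, 8) if p != q}, key=sorted)
PROCESSOR_OF = {line: index for index, line in enumerate(LINES)}

def processor_for(p, q):
    """Processor responsible for a binary operation on data at points p and q."""
    return PROCESSOR_OF[line_through(p, q)]

if __name__ == "__main__":
    print(len(LINES), processor_for(1, 2), processor_for(2, 3), processor_for(1, 4))
```

Pairs of points on the same line, such as (1, 2) and (2, 3), resolve to the same processor, while (1, 4) lies on a different line and resolves to a different one.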
3. Computation done at planes (dim = 2)
   Join operation in relational databases
[Figure: decomposition of binary relations. Map: 1. items → points; 2. pair of points → line (rows of relation). All rows mapping to the same line are stored on memory module ___. The relations (a, b), (b, c), (a, c) map to lines l, m, n.]
Let P = processor corresponding to plane <i, j, k>
Operation → P
• Each processor performs the portion of the join it can see directly
• Parallelizes automatically
• Location and access to a pair depends on its value and not just on some arbitrary address
• Generalizes content-addressable memory to combination of values
• Exploits semantic locality based on values vs. traditional spatial and temporal locality based on addresses
4. Computation at higher dimensions (dim > 2): very powerful for algorithms involving
   • tensors
   • hypergraphs etc.
   As compared to graph search algorithms, hypergraphs can capture more semantics in structural form
Virtual memory organization based on subspaces
Generally, virtual memory is organized in a hierarchical fashion:
total memory space is divided into pages, pages consist of words, and words consist of bits. In the projective geometry architecture there are superpages organized in the form of a lattice.
Two superpages are either disjoint or intersect in another superpage (or a page at the bottommost level).
Each superpage is associated with a subspace of the projective space, and the intersection of two superpages is associated with the intersection of the corresponding subspaces.
Accessing many pages from the same superpage can lead to predictive fetching of the entire superpage, enabling exploitation of another kind of locality that is frequently present in many applications.
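The "disjoint or intersect in a smaller unit" property can be seen in the smallest 3-dimensional case, P(3, GF(2)), if pages are taken to be points and superpages to be lines. That assignment, the bitmask encoding of points as the integers 1..15, and the variable names are all assumptions made for this sketch.

```python
from itertools import combinations

# Pages: the 15 points of P(3, GF(2)), encoded as nonzero 4-bit integers.
pages = range(1, 16)

# Superpages: the lines. The line through a and b also contains a XOR b.
superpages = {frozenset((a, b, a ^ b)) for a, b in combinations(pages, 2)}

# Sizes of the overlaps between every pair of distinct superpages.
overlap_sizes = {len(s & t) for s, t in combinations(superpages, 2)}

if __name__ == "__main__":
    print(len(superpages), "superpages; overlap sizes:", sorted(overlap_sizes))
```

Every pair of distinct superpages in this model either shares no page or shares exactly one, which mirrors the lattice behaviour described above at the bottommost level.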
Perfect patterns for 2-d geometry
Let n = number of points = number of lines

     pairs of points    corresponding lines
1    (p1, q1)           l1 = <p1, q1>
2    (p2, q2)           l2 = <p2, q2>
...
n    (pn, qn)           ln = <pn, qn>

A perfect access pattern is a collection of n ordered pairs of points s.t.
1. First members of all pairs (p1, p2, ..., pn) form a permutation of all points.
2. Second members (q1, q2, ..., qn) form a permutation.
3. The lines (l1, l2, ..., ln) determined by these pairs form a permutation of all lines of the geometry.
Clearly, if one schedules binary operations corresponding to such a set of index-pairs for parallel execution:
1. There are no read/write conflicts in memory accesses.
2. There is no conflict in processor usage.
3. All processors are fully utilized.
4. Memory bandwidth is fully utilized.
Hence the name perfect pattern.
Definition
A collection of perfect patterns is called complete if every index-pair (a, b), a ≠ b, occurs in exactly one pattern.
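For the 7-point geometry, a complete collection of perfect patterns can be written down and checked directly. The sketch below uses the standard cyclic model of the 7-point plane (lines are the translates of the difference set {0, 1, 3} modulo 7). This cyclic construction is one known way to obtain such patterns and is not claimed to be the construction used in the talk; all names are illustrative.

```python
N = 7
DIFF_SET = (0, 1, 3)  # every nonzero residue mod 7 is a difference exactly once

# Line j consists of the points j + d (mod 7) for d in the difference set.
LINES = [frozenset((j + d) % N for d in DIFF_SET) for j in range(N)]

def line_through(p, q):
    """Index of the unique line containing the distinct points p and q."""
    (index,) = [j for j, line in enumerate(LINES) if p in line and q in line]
    return index

def shift_pattern(k):
    """Candidate pattern: the ordered pairs (p, p + k mod 7), for k in 1..6."""
    return [(p, (p + k) % N) for p in range(N)]

def is_perfect(pattern):
    """Check the three permutation conditions of a perfect access pattern."""
    everything = list(range(N))
    firsts = sorted(p for p, _ in pattern)
    seconds = sorted(q for _, q in pattern)
    used_lines = sorted(line_through(p, q) for p, q in pattern)
    return firsts == everything and seconds == everything and used_lines == everything

patterns = [shift_pattern(k) for k in range(1, N)]
all_pairs = [pair for pattern in patterns for pair in pattern]

if __name__ == "__main__":
    print(all(is_perfect(p) for p in patterns))       # each pattern is perfect
    print(len(all_pairs), len(set(all_pairs)))        # 42 ordered pairs, no repeats
```

The six shift patterns together cover each of the 42 ordered pairs (a, b) with a ≠ b exactly once, so in this model they form a complete collection.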
Perfect access patterns for 4-d geometry
Let n = number of planes = number of lines

     triplet of points    triplet of lines                                   planes
1    (p1, q1, r1)         u1 = <p1, q1>, v1 = <q1, r1>, w1 = <r1, p1>        h1 = <p1, q1, r1>
2    (p2, q2, r2)         u2 = <p2, q2>, v2 = <q2, r2>, w2 = <r2, p2>        h2 = <p2, q2, r2>
...
n    (pn, qn, rn)         un = <pn, qn>, vn = <qn, rn>, wn = <rn, pn>        hn = <pn, qn, rn>

A perfect pattern is a collection of n such triplets such that
• lines u1, u2, ..., un determined by the first pair of points from each triplet form a permutation of all lines
• similarly, lines v1, v2, ..., vn determined by the second pair of points from each triplet form a permutation of all lines
• and lines w1, w2, ..., wn determined by the third pair of points also form a permutation of all lines
• planes h1, h2, ..., hn determined by the n triplets form a permutation of all planes
Automorphisms of the geometry
It maps
• points → points
• lines → lines
• planes → planes
• any subspace of dimension k → another subspace of dimension k
Automorphism group of the geometry acts transitively on the subspaces of the geometry
Perfect patterns and complete collections of perfect patterns can be generated using the automorphism group of the geometry.
For implementation of perfect patterns using electron optics, we express the required electromagnetic fields in terms of a symmetrized multipole expansion using a suitable subgroup of the automorphism group.
To show pictures of high dimensional geometries, we often use 2D projections of electron beam trajectories implementing that geometry.
Here are some examples of 2D geometries
• 7-point geometry
• GF(4) is an extension of GF(2)
• 21-point geometry contains 3 copies of 7-point geometry
• …-point geometry
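The idea that automorphisms carry lines to lines, and that a pattern can be generated as the orbit of one pair under an automorphism, can be tried out on the 7-point geometry. The sketch assumes the cyclic model of that geometry (lines are translates of {0, 1, 3} modulo 7); the function names are hypothetical and only two automorphisms are checked, not the whole group.

```python
N = 7
LINES = {frozenset((j + d) % N for d in (0, 1, 3)) for j in range(N)}

def is_automorphism(f):
    """True when the point map f sends the set of lines onto itself."""
    return {frozenset(f(x) for x in line) for line in LINES} == LINES

def shift(x):
    """Cyclic shift of the points."""
    return (x + 1) % N

def double(x):
    """Multiplication by 2 modulo 7."""
    return (2 * x) % N

def swap01(x):
    """Swap points 0 and 1 and fix the rest (not an automorphism)."""
    return {0: 1, 1: 0}.get(x, x)

def orbit_of_pair(pair, f, steps=N):
    """Images of an ordered pair under repeated application of f."""
    orbit = []
    for _ in range(steps):
        orbit.append(pair)
        pair = (f(pair[0]), f(pair[1]))
    return orbit

if __name__ == "__main__":
    print(is_automorphism(shift), is_automorphism(double), is_automorphism(swap01))
    print(orbit_of_pair((0, 1), shift))
```

The orbit of the pair (0, 1) under the shift is the list of pairs (p, p + 1 mod 7), in which every point appears once as a first member and once as a second member.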
Electromagnetic cavity machine
[Figure: electromagnetic cavity machine with electron flow tubes]
The surface of the cavity is lined with CMOS logic circuits
• organized as millions of ultra low power cores
• designed to enable massive multithreading
A petaflop configuration requires only about a square meter of silicon real estate for computing cores, even if they are designed to run at slower clock speed to significantly improve performance per watt
Besides computing, the surface circuits also provide two types of communication devices between computing resources
The first type is for the traditional 2D nearest neighbor communication.
Surface normal communication
The second type is a new type of surface normal communication
• based on massively parallel quantum tunneling and
• free space electron optics through the cavity volume
This type of global communication uses highly symmetric flow patterns derived from the mathematical structure of finite projective geometry
Its implementation involves a novel electron optical system that does not have any dominant central optical axis
The electromagnetic fields to guide electrons along the required massively parallel trajectories are set up by creating appropriate boundary conditions on the surface of the electromagnetic cavity
The electrodes required to apply such boundary conditions are patterned on the surface of the cavity by standard lithographic techniques.
The receivers and other controlling electronics required for this purpose are located on the surface of the electromagnetic cavity.
Communication based on electron optics through the cavity volume
• We choose a complete set of perfect patterns or a k-fold covering of perfect patterns
• Corresponding to each perfect pattern, the cavity volume is partitioned into fundamental domains
• Each fundamental domain contains a tube in free space for transmitted electrons
• The ends of the tube have field emitters and detectors located on the cavity surface
• When the perfect pattern changes, the emitters and detectors are connected by a new configuration of electron tubes
Illustration of this process with a simple example
• Choose the smallest {p = 2, k = 1, d = 2} 7-point projective geometry.
• Choose one perfect pattern for the geometry
[Figure: example of one perfect pattern for the 7-point 2-d geometry]
Partitioning of the cavity volume into fundamental domains
An example of a building block for partitioning free space
[Figure: building block]
Some curved sides of the building blocks have tangents which correspond to some element of the Lie algebra
[Figure: a perfect pattern for 7-point 2-d geometry]
Example of space partition of the cavity volume corresponding to a perfect pattern
[Figure: corresponding volume partition]
Example of tubes for electron flow for the perfect pattern
[Figures: a perfect pattern; corresponding electron flow tubes]
The vertical component of the electron motion is due to the electric field; the circular component is due to the magnetic field in the axial direction, resulting in overall helical motion
Different perfect patterns from the complete collection can be achieved simply by changing the ratio of magnitude of the electric to magnetic field
The physical complexity of connecting n sources to n destinations in one hop using a complete set of perfect patterns is O(n).
Alternative arrangements that have been used in contemporary designs for one-hop connection are O(n^2).
Alternative approaches
Multihop networks
• increase latency
• multiply the energy used per bit communicated by the number of hops, since each time a bit is received, detected, amplified and retransmitted, additional energy is consumed.
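The multihop energy argument can be written as a one-line estimate. The symbols $E_{\text{hop}}$ (energy to receive, detect, amplify and retransmit one bit at one hop) and $h$ (number of hops) are notation assumed here for illustration, and the estimate assumes every hop costs about the same.

```latex
E_{\text{bit}}(h) \approx h \, E_{\text{hop}},
\qquad
\frac{E_{\text{bit}}(h)}{E_{\text{bit}}(1)} \approx h
```

Under that assumption, a route with $h = 4$ hops spends roughly four times the per-bit energy of a single-hop link.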
Earth Simulator (Japan)
A single hop, very good architecture from the point of view of generality, but uses a crossbar of O(n^2) physical complexity
Further refinements: replace electrical cables with optical fibers and dense WDM
This saves optical fibers (which is the cheapest component anyway).
However, the total physical complexity is still O(n^2),
since there are O(n) mux/demux components, each of O(n) complexity.
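The counting step behind the quadratic total can be spelled out as a product. This is only the arithmetic implied by the two O(n) factors named above, written in LaTeX for clarity.

```latex
\underbrace{O(n)}_{\text{mux/demux components}}
\times
\underbrace{O(n)}_{\text{complexity of each}}
= O(n^2)
```

In such a design, doubling n roughly quadruples the hardware, whereas a design of O(n) physical complexity roughly doubles it.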
[Figure: single hop communication network, a complete bipartite graph]
• In contrast, our design provides communication channels that are
  • high bandwidth
  • low latency, single hop
  • electromagnetically reconfigurable
  • energy efficient
The physical complexity of the overall design is O(n).
Other benefits
High packing density
Unlike optoelectronic devices in the infrared range, whose packing density is orders of magnitude less than current logic circuits due to the diffraction limit,
our devices for surface normal communication can be packed with much higher density (10^… per sq cm)
Single electron devices
In our design, both logic circuits and communication devices on the cavity surface will be mainly based on quantum tunneling, in tangential and normal directions respectively, and both can be operated as single electron devices,
like a conveyor belt for electrons
Conclusion
The new design leads to massively parallel systems that are
highly energy efficient and compact, significantly simpler to program, and broadly applicable to many domains
This will enable provisioning of significantly higher computing power, opening a number of new business opportunities for offering more intelligent services on the internet
For more information please
follow links on the Wikipedia page about the author
Email: narendrakarmarkar@yahoo.com