X-Message-Number: 4168
Date: Fri, 7 Apr 1995 19:49:49 +0200 (MET DST)
From: Eugen Leitl <>
Subject: uploading vistas

Cellular Automata, Soft Circuits, Neural Nets & Uploading.

0. Disclaimer

I want to present here some computational ideas I deem relevant for the uploading (sub)community in the cryonics movement. This is going to be a series of longish posts, having only loose links to the cryonics mainstream, so I apologize in advance. Moreover, there is some lack of consistency, since they are pasted parts of a larger paper. Tell me to quit if I am wasting the bandwidth of this list, and I will.

1. PalaeoComputing: The Age of the Dinosaurs.

The CA computation paradigm, akin to the Turing machine, is not exactly new to the computer community, having been invented by John von Neumann in the 50's in his quest to build a self-replicating machine (recent xref: nanotechnology (NT)). While the Turing automaton has had a heavy impact upon the way we do computation (all current machines are, at their heart, Turing automatons), CAs until recently were quite unknown.

What makes all computers in existence seem like poor copies of the Turing engine? The random access memory (RAM, core, store) is linearly addressable, one cell at a time, which makes it equivalent to the Turing tape (it is finite, though). That it is random-access is quite irrelevant, as random access can be simulated by the Turing machine through macros. The CPU has a local state, and it can do reads and writes to the "tape": RAM. All additions, such as caches, hard disks, tape drives, etc., are only irrelevant artefacts of an efficient real-world implementation. The von Neumann architecture (which is, ironically, the name given to an architecture based on the Turing engine) has a serious flaw, which has become prominent only in the last 5-10 years.
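The RAM-as-Turing-tape equivalence above can be made concrete with a toy sketch (my illustration, not part of the original argument): a minimal Turing machine whose "tape" is simply a linearly addressable array, here incrementing a binary number. The machine, its rules and the tape contents are all invented for illustration.

```python
# Toy Turing machine: a linearly addressable array (RAM) plays the
# role of the finite tape. This machine increments a binary number
# written on the tape, least significant bit at the right.
def run_tm(tape, head, state, rules, max_steps=1000):
    """rules: (state, symbol) -> (new_symbol, move, new_state)."""
    for _ in range(max_steps):
        if state == "halt":
            return tape
        sym = tape[head]
        new_sym, move, state = rules[(state, sym)]
        tape[head] = new_sym              # a "write" is just a RAM store
        head += {"L": -1, "R": +1}[move]  # head motion = address arithmetic
    raise RuntimeError("machine did not halt")

# Increment: walk left from the LSB, turning 1s into 0s until a 0
# (or a blank '_') can be turned into a 1.
rules = {
    ("inc", "1"): ("0", "L", "inc"),
    ("inc", "0"): ("1", "L", "halt"),
    ("inc", "_"): ("1", "L", "halt"),
}

tape = list("_0111")  # 7 in binary, with a blank pad cell
print("".join(run_tm(tape, head=4, state="inc", rules=rules)))  # _1000 = 8
```

Random access only lets the head jump instead of walk; it adds convenience, not computational power.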
The duality of a smart central engine and a distinct dumb memory where data/instructions are stored made perfect sense when data was recorded as sonic waves in mercury tubes (mercury delay lines), as local changes in magnetic domains on the surface of a rotating drum, as electron buckets on the surface of a special CRT, or as the magnetisation of small ferrite cores, which lasted until the late 60's. Tubes or single transistors were much too costly to become the basis for RAM in those days. With the advent of transistor RAM ICs, where the same elements (normally-open switches) are the basis of both the CPU and the memory, this natural distinction became obsolete. Of course, the mainstream plodded on in the same direction, the investments in programmer training/hardware as well as simple human conservatism being its carrying momentum.

The crisis became visible only in the last decade, when CPU speeds began to overtake RAM speeds: the von Neumann bottleneck became prominent. There is every reason to suspect that our current designs began to be vastly inefficient with the advent of modern 32-bit CPUs such as the i386 or its RISC cousins. Some symptoms of the creeping disease:

1) The total-transistors-per-system vs. performance curve shows distinct saturation, most evident in the 8080/8088/286/386/486/586/P6 (Intel) or the 6809/68000/68020/68030/68040/68060 (Motorola) evolutionary (?) sequence. Other CPU lines show the same trend. Even the (still) exponential MIPS/buck function shows signs of slowing down, as semiconductor photolithography rapidly approaches physical limits.

2) The time needed for a total memory search has _increased_ constantly over the last 15 years. The real-world memory bandwidth has hardly risen at all (only 30 MByte/s for a well-designed Pentium board, compared to 15 MByte/s for an i486, and that despite a concurrent bus width expansion from 32 to 64 bit).
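Symptom 2 can be put in numbers. The bandwidth figures below are the ones quoted in the post; the RAM sizes are my assumed typical-for-the-era configurations, so this is a back-of-envelope sketch, not a measurement: because installed memory grows faster than real-world bandwidth, the time for a full memory sweep rises.

```python
# Back-of-envelope: time to stream ("search") all of RAM once.
# Bandwidths (MByte/s) are the post's figures; RAM sizes (MByte) are
# assumed era-typical configurations, chosen for illustration only.
systems = [
    ("i486 box, ~1991",     8, 15),
    ("Pentium box, ~1995", 32, 30),
]
for name, ram_mb, bw_mbs in systems:
    sweep_s = ram_mb / bw_mbs
    print(f"{name}: {sweep_s:.2f} s per full memory sweep")
```

With RAM doubling faster than bandwidth, the sweep time roughly doubles per machine generation, which is exactly the cache-hostile, nonlocal workload the later sections worry about.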
This trend would be much more visible if current RAM prices weren't artificially held up (price speculation/bad system design). A modern machine should have about 32-64 MByte instead of 4-8 MByte by now, if we cling to former extrapolations. The hierarchy of caches cloaks this phenomenon, but the discrepancy between CPU and memory speed becomes even more prominent when caching fails. (And for a big problem class it does.)

3) The fraction of transistors devoted to pure computation (ALUs, registers, glue logic, etc.) has decreased constantly. One has trouble locating the CPU among all the caches, branch prediction units and the accompanying glue logic. I would estimate the core CPU (sans FPU) at about 5-10% (sic) of die size in the latest designs (recent BYTE issues).

There is more of the above, but by now I think my salient point(s) is/are obvious: our present computers have become increasingly inefficient for certain tasks (e.g. NN emulation, connectionist AI, fluid dynamics, etc.) and grow even more so with each generation. The reasons are partly the lack of innovation/investment conservatism, partly the tremendous commercial success, since these machines _can_ do what they were tailored to do very well: office suites (PCs), or matrix math (scientific supercomputers). The "all-purpose computer" idea begins to sound somewhat hollow.

2. The Shape of Things to Come

Things aren't as gloomy as they appear to be. In particular, Texas Instruments in its DSP line seems to converge on multiple DSPs with integrated RAM on one die, a potentially very promising approach. Several obscure chip lines, e.g. those based on the Novix Forth chip by Chuck Moore, offer significantly more bang for the transistor buck. The Inmos (now Inmos/SGS-Thomson) Transputer chip line, though still severely impeded by the recent privatisation (thanks, M. Thatcher, for that), prepares the launch of its Chameleon chip.
Several special-purpose neuroarchitectures, such as Siemens' Synapse 2 engine, have been built recently. In recent MIT Press publications on connectionist AI even wafer-scale integrated neurochip engines have been introduced (5-8 wafers, 55 kTransistors/neuron, 8 MHz shared bus, built from die segments). Why is WSI one of the key features of the near future's computers?

2.1 The Lure of Wafer Scale Integration

Current chips are produced en bloc (32-64 dies apiece) by wafer steppers through semiconductor photolithography; the silicon wafer is then cut up with a diamond saw, tested, packaged and tested again. Because of the random defects (roughly 1 cm^-2) inherent to the production process, usable chip yields decrease exponentially with die size. Defect rates for a standard die often lie at 50-90%; the chances of producing a defect-free wafer-sized circuit resemble those of encountering a snowball in hell. Since chip production requires a large number of steps, each needing large amounts of very pure and often highly toxic chemicals (thus being an extremely environmentally damaging industry), reducing defective chip yields even slightly would be highly desirable.

Fault tolerance of a wafer-scale integrated circuit must be achieved by _redundancy_ and _self-healing_ at the level of the individual die as well as at wafer scale. In highly repetitive structures (memory cell arrays in particular are highly redundant) self-healing can be achieved by bad-cell remap via software. By keeping the accompanying CPU area small, the probability of a random defect hitting it, which would render the die inoperable, can be greatly reduced. DRAM area defects are not lethal, as they are mapped and marked as unusable by software. Finally, by introducing die self-test and remap at wafer level, _several_ defective dies on one wafer can be tolerated. Since cutting up and repackaging is a major source of costs and introduces additional defects, usage of the wafer as a whole element is highly desirable.
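The "exponential with die size" claim is the standard Poisson yield approximation, Y = exp(-D*A). A short sketch with the defect density quoted above (the die areas are my illustrative assumptions) shows why a monolithic defect-free wafer is hopeless without redundancy:

```python
# Poisson yield model (standard textbook approximation, not from the post):
# Y = exp(-D * A), D = defect density [1/cm^2], A = die area [cm^2].
from math import exp

D = 1.0  # defects per cm^2, the figure quoted above

for name, area_cm2 in [("small die, 0.5 cm^2", 0.5),
                       ("large die, 2 cm^2",   2.0),
                       ("whole 6-inch wafer, ~180 cm^2", 180.0)]:
    print(f"{name}: yield ~ {exp(-D * area_cm2):.2%}")
```

A 2 cm^2 die already yields only ~14% good parts; a whole-wafer circuit yields effectively zero, which is why WSI must tolerate defects by remapping rather than avoid them.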
As we will see, devising WSI-appropriate hardware and software architectures is not as big a problem as one might imagine.

2.2 Why They Are So Slow

All modern systems suffer from a kind of schizophrenia: the basic elements our computers are built from are fast and consume little power, whereas the computers themselves are both slow and dissipate a lot of heat. A silicon tunnel diode based on quantum effects works at 3 THz (terahertz, 3*10^12 /s), a ring oscillator or transistor still switches at 0.3 THz (300*10^9 /s), whereas a modern PC's CPU can do memory accesses at 0.00003 THz (30*10^6 /s). Just count the zeros. A whole _lot_ of them, don't you think?

There are two simple reasons for this. The first: our CPUs have _Complex Designs_. Yes, even the RISC ones. The logic gate delay times add up arithmetically. A complex logic engine needs a number of clock ticks to settle into its final state. Reading its state earlier is risky: the result could well be still in the making.

The second: there is the _Problem of Locality_. As the memory chips and the CPU sit quite far apart, a signal has to propagate through the wires first, is then processed (relatively fast) by the CPU, and gets RAMmed back through the bus. The propagation velocity of the electric signal lies well below the speed-of-light limit. Why? A wire has both a capacitance and an inductance. As electrons move, they generate a magnetic field opposing their movement. The more electrons there are and the faster they flow, the higher the counterflux of the adverse magnetic field. The potential (the number of electron charges in a volume) in the wire has to reach a minimum value before switching another logic gate becomes possible. That means burning power, as (re)charging costs energy. Just changing the charge of a 25x25 um bond pad at high frequencies can take about 1 mW of power.
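The bond-pad figure follows from the usual dynamic-power relation P = C*V^2*f. The capacitance, voltage and frequency below are my assumed mid-90s CMOS values (a pad plus its wiring is plausibly around 1 pF), not numbers given in the post; with them the quoted ~1 mW drops out directly:

```python
# Dynamic power to (re)charge a bond pad: P = C * V^2 * f.
# All three inputs are assumed illustrative values, not from the post.
C = 1e-12   # farad: ~1 pF for a bond pad plus attached wiring
V = 5.0     # volt: 5 V CMOS supply, typical mid-90s
f = 40e6    # hertz: pad toggled every cycle of a 40 MHz bus
P = C * V**2 * f
print(f"P = {P * 1e3:.1f} mW per pad")
```

Multiply by a hundred-odd pads on a pin-grid package and the pad drivers alone burn a noticeable fraction of the chip's power budget.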
All modern CPUs have a lot of signals, run at ultra-short-wave radio frequencies and have many pins in pin grid packages. They are difficult (costly) to fabricate, they need a lot of power, and their bond pads gobble up silicon die space, which becomes unavailable for the CPU logic. We have seen that sending a signal across a wire burns power, and the higher the frequency, the more power. Furthermore, as wire geometries begin to reach radio wavelengths, the darn thing begins to act as an antenna and to radiate power into space, which impairs nearby telecommunication. Logic uses not sine waves but rectangular pulses (somewhat rounded because of the above mechanisms), which by Fourier decomposition still contain a wide range of sine waves. In other words: it is a radio transmitter, and a wide-band one at that. Each bus wire begins to experience crosstalk from its neighbours, the signal-to-noise ratio turning rapidly abysmal.

_Is_ there a better way to do things? Yes. When we switch to photons, the situation becomes drastically different. If we use a laser (integrated ones are best) talking through a fiber, we have only the one problem of converting our electric signal to light. Once there, the signal speeds off at the speed of light in glass (which, while somewhat lower than the 299 792 458 m/s in vacuum, is still _quite_ impressive), experiencing none of the above problems. There is no interfiber crosstalk, only crosstalk at the source (laser) and the detector. Since the light is only slowly attenuated by the fiber, it can travel much longer distances. Using high-quality fiber and (non-semiconductor) lasers, signals have been sent up to 200 km (and even more) _without repeaters_. Moreover, one could dope a length of fiber with neodymium or some such and pump it with a semiconductor IR laser. The doped length of fiber thus becomes a laser cavity and amplifies incoming weak signals by stimulated photon emission.
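How impressive is "the speed of light in glass"? A quick sketch, assuming a refractive index of about 1.5 for the silica core (a standard figure, not given in the post), for the link lengths discussed here:

```python
# Signal propagation in fiber: v = c / n.
c = 299_792_458.0   # m/s, vacuum speed of light (the post's figure)
n = 1.5             # assumed refractive index of a silica fiber core
v = c / n           # ~2e8 m/s in the glass

for name, d_m in [("interchip link", 0.01),
                  ("across a room", 10.0),
                  ("200 km repeaterless span", 200_000.0)]:
    print(f"{name}: {d_m / v * 1e6:.4f} us one way")
```

Even the 200 km span costs only about a millisecond of flight time; for the centimeter-scale links of section 2.2, propagation delay all but vanishes as a design constraint.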
As glass is a very poor conductor of electricity, potential differences of up to several kV can easily exist between sender and receiver without either of them ever noticing. This ability to prevent current loops is widely used in optocoupler-based devices. A good monomode fiber has a potential bandwidth of about 100 THz (!). Today's fastest detectors and lasers operate at only 10 GHz. There is still quite a lot of headroom for growth, huh? One can integrate as many as one million tiny lasers on _one_ chip today. With _smart pixels_, the integration of lasers, detectors and processing electronics, one can manage complexities of 20,000 transistors on GaAs chips.

There. Now I did say it: gallium arsenide. Unfortunately, silicon won't lase. (Nanoporous silicon can emit light, but it is no laser.) Photodiodes are possible in Si, though. One has to use exotic III-V semiconductors such as GaAs, InP and the like to see some IR or visible light. (Those semiconductor lasers are just glorified LEDs (light-emitting diodes), by the way.) These III-V semiconductors are tricky. One has problems making them, as they are not elements but compounds of two elements. As gallium and arsenic have different volatilities, GaAs simply decomposes when heated. GaAs is brittle. Gallium metal is dear. Growing big crystals of GaAs is tricky, too. And to make the bitter cup overflow, GaAs does not form nonconductive protective layers when heated, as silicon does, which is the reason semiconductor people made silicon their favourite in the first place. These are some of the reasons GaAs chips are still not quite there. Some exotic Crays do use GaAs gate arrays, though. Well, one can always fabricate an array of GaAs lasers, cut it up and bond the dies onto silicon as a hybrid circuit. This is not elegant, and it costs some bucks more, but hell! it works with today's technologies.

Glass optofibers are today almost unrivaled for very high speed, long-range communications.
But one should use cheap plastic optofibers for short (about 1 cm - 10 m) links, i.e. interchip and hardware device connections. They can be cut with a simple tool and glued or melted (thermoplastic fibers) onto smart pixels. A single serial optic fiber link will laugh itself silly looking at PCI's or VLB's bus performance. To wit:

- Buses are slow.
- Buses eat power and make heat and radio waves.
- Buses need big areas on PC boards.
- Buses will have to go.

(In the long run, at least. The FireWire guys got _that_ right. But it is a high-speed serial link, and they still won't use optics. There is an optical SCSI successor in the pipeline, though.) One should not abandon electrical signaling altogether. Wherever _tiny structures_ of very small capacitance/inductance and very short wire lengths (again small capacitance/inductance, with reduced signal propagation delays) are needed, electrical signaling is still unavoidable.

2.3 Why Is Neurocomputing Hard?

As I argued in my previous ("von Neumann AI-inadequate") posting, von Neumann machines have very limited processing power, as defined by the product of the mean translatory speed through n-state-space and the (e.g. Shannon) path entropy. This factor is most commonly referred to as limited memory bandwidth. The currently utilized cache hierarchy scheme fails utterly when the need is to process nonlocal data. The real-world bandwidth of a well-designed Pentium board lies at about 30 MByte/s. The plethora of new RAM architectures awaiting their financial breakthrough in the course of this year boosts memory performance somewhat, but does not eliminate the basic problem.

A comparison of the biological and the silicon way of doing computation. Key features of bioNNs:

- slow signal propagation velocity (about 100 m/s for myelinized (Schwann cell) axons, 1 ms (1 kHz) spikes)
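The "laugh itself silly" claim can be quantified. The PCI peak rate below is the standard 32-bit/33 MHz figure; the serial-link rate assumes a single fiber driven at the 10 GHz detector/laser speed mentioned in section 2.2 (an idealized peak, ignoring coding overhead):

```python
# Peak-rate comparison: shared parallel bus vs. one serial optical link.
# PCI: standard 32-bit bus at 33 MHz. Fiber: assumes one 10 Gbit/s
# channel, the detector/laser limit quoted earlier; both are raw peaks.
pci_peak   = 32 * 33e6 / 8 / 1e6    # MByte/s
fiber_peak = 10e9 / 8 / 1e6         # MByte/s
print(f"PCI peak:   {pci_peak:.0f} MByte/s")
print(f"one fiber:  {fiber_peak:.0f} MByte/s")
print(f"ratio:      {fiber_peak / pci_peak:.1f}x")
```

One idealized fiber carries roughly an order of magnitude more than the whole 32-signal bus, with no shared-medium arbitration and none of the crosstalk and radiation problems above.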
- very slow switching time (10-100 ms integration window at the synapse, due to postsynaptic cleft neurotransmitter diffusion time)
- very high connectivity: average branching factor 10^3-10^4, top convergence factors up to 10^5 (!)
- very low power/volume dissipation
- extremely high performance
- very high number of elements
- very dense packaging
- 3d packaging on a (noisy) grid, hence much shorter (as compared to 2d) avg. link length
- functionally robust (noisy input, physical (hardware) damage) due to redundant holographic representation
- excellent adaptivity (at both the hardware and software level)
- low linearity/dynamic range at the neuron level
- mechanically/temporally unstable
- currently cannot be expanded/modified, though potentially able to take advantage of eventual additions (neuroplasticity)
- morphologically/functionally distinct functional blocks/modules (down to cell level), highly conserved in the evolutionary process
- neocortex: hypergrid connectivity of individual neurons (connectivity about 10^3), layered structure consisting of overlapping modules

Photolithographically constructed transistor chips, on the other hand:

- low integration density (since 2d)
- regular 2d grid, thus longer links
- very low connectivity (due to fan-out and wiring constraints (2d))
- extremely short switching time
- high power dissipation
- physical robustness
- easily expandable through usage of blocks with well-defined interfaces
- susceptible to defects (architecture problems)

What does all this suggest?
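The figures in the bioNN list combine into a back-of-envelope cps (connections per second) target. The branching factor comes from the list above; the neuron counts and the effective firing rate (well below the 1 kHz spike ceiling) are my assumed orders of magnitude, so treat the result as a rough scale, not a measurement:

```python
# Rough cps target for real-time bioNN emulation:
# cps ~ neurons * fan-out * effective firing rate.
fanout = 1e4   # synapses/neuron, upper end of the 10^3-10^4 range above
rate   = 1e2   # assumed effective spikes/s per neuron (below 1 kHz peak)

for name, neurons in [("mouse brain (~1e7 neurons, assumed)", 1e7),
                      ("human brain (~1e10 neurons, assumed)", 1e10)]:
    print(f"{name}: ~{neurons * fanout * rate:.0e} cps")
```

Even the mouse-scale figure of ~10^13 cps dwarfs a 30 MByte/s memory system that must fetch edge data for every connection update, which is the point of the wafer-scale, time-multiplexed architecture sketched next.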
If we are going to emulate bioNNs in silicon at mammal cps (connections per second) performance (transiently, until we have molecular circuitry) and at significant neurons/area density, we will have to do the following things:

- use a representation/architecture which will hold all the way down to molecular circuits
- time-multiplex existing structures (trading speed for higher (virtual) connectivity)
- use DRAM (dense) or SRAM (fast) memory cells to hold edge & co. data
- use an optimum die size for WSI, both to keep defects down and to boost performance (sequential emulation grain size)
- use a hypergrid topology with serial/optical inter-die links and integrated routers (more on that later)
- interconnect wafers with optical fibres for high bandwidth/low power dissipation

From my estimations it seems possible to integrate the computational equivalent of about one cubic millimeter of the neocortex on _one_ wafer at contemporary/near-future integration density. But it won't scale to, say, 1 MWafers very well, and the achievable complexity (due to financial/power dissipation constraints) seems to be limited to the mouse equivalent. To upload humans we will need molecular circuitry.
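What does one cubic millimeter of neocortex demand of such a wafer? The densities below are assumed textbook orders of magnitude (roughly 10^5 neurons and 10^9 synapses per mm^3), and the byte-per-synapse figure is my own assumption for a minimal weight entry; none of these numbers are from the post:

```python
# Sanity check on the "one wafer ~ one mm^3 of neocortex" estimate.
# All inputs are assumed orders of magnitude, chosen for illustration.
neurons  = 1e5   # neurons per mm^3 of neocortex (assumed)
synapses = 1e9   # synapses per mm^3 (assumed)
rate     = 1e2   # effective spikes/s per neuron (assumed)

edge_mb = synapses * 1 / 2**20        # 1 byte/synapse edge table
cps     = synapses * rate             # real-time connection updates/s
print(f"edge table: ~{edge_mb:.0f} MByte at 1 byte/synapse")
print(f"real-time load: ~{cps:.0e} cps")
```

Under these assumptions, a single mm^3 already needs on the order of a GByte of edge data and ~10^11 cps, which makes plain both why the whole wafer must be mostly DRAM and why scaling to human-brain volumes points past silicon toward molecular circuitry.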