X-Message-Number: 4168
Date: Fri, 7 Apr 1995 19:49:49 +0200 (MET DST)
From: Eugen Leitl <>
Subject: uploading vistas 

Cellular Automata, Soft Circuits, Neural Nets & Uploading.


0. Disclaimer

I want to present here some computational ideas I deem relevant
for the uploading (sub)community in the cryonics movement. This
is going to be a series of longish posts, having only loose
links to the cryonics mainstream, so I apologize in advance.
Moreover, there is some lack of consistency, since these are
pasted-together parts of a larger paper. Tell me to quit if I am
wasting the bandwidth of this list, and I will.

1. PalaeoComputing: The Age of the Dinosaurs.

The CA computation paradigm, akin to the Turing machine, is not
exactly new to the computing community, having been invented by
John von Neumann in the 50's in his quest to build a
self-replicating machine (recent xref: nanotechnology (NT)).
While the Turing automaton has had a heavy impact upon the way
we do computations (all current machines are, at their heart,
Turing automatons), CAs were, until recently, quite unknown.

What makes all computers in existence look like poor copies of
the Turing engine?

The random access memory (RAM, core, store) is linearly
addressable, one cell at a time, which makes it equivalent to
the Turing tape (it's finite, though). That it is random-access
is quite irrelevant, as it can be simulated by the Turing
machine through macros. The CPU has a local state and does reads
and writes to the "tape": RAM. All additions, such as caches,
hard disks, tape drives, etc., are merely artefacts of an
efficient real-world implementation and irrelevant to the model.
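
To make the "CPU plus tape" picture concrete, here is a toy
Python sketch (mine, with a made-up three-instruction set): the
CPU keeps a small local state and touches the "tape" (RAM) one
cell per step.

  def run(program, ram, pc=0, acc=0):
      # program: list of (op, addr) pairs; a made-up 3-instruction set
      while pc < len(program):
          op, addr = program[pc]
          if op == "LOAD":       # read one cell of the "tape"
              acc = ram[addr]
          elif op == "STORE":    # write one cell of the "tape"
              ram[addr] = acc
          elif op == "ADD":
              acc = acc + ram[addr]
          pc = pc + 1            # the head moves on
      return acc

  ram = [0] * 8
  ram[0], ram[1] = 2, 3
  print(run([("LOAD", 0), ("ADD", 1), ("STORE", 2)], ram))  # prints 5; ram[2] is now 5

Everything else a real machine adds is just there to make this
loop run fast.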

The von Neumann architecture (which is, ironically, the name
given to an architecture based on the Turing engine) has a
serious flaw, which has become prominent only in the last 5-10
years.

The duality of a smart central engine and a distinct dumb memory
where data/instructions are stored made perfect sense back when
data was recorded as sonic waves in mercury tubes (mercury delay
lines), as local changes in magnetic domains on the surface of a
rotating drum, as electron buckets on the surface of a special
CRT, or, until the late 60's, as the magnetisation of small
ferrite cores. Tubes or single transistors were much too costly
to become the basis for RAM in those days.

With the advent of transistor RAM ICs, where the same elements
(normally-open switches) are the basis of both the CPU and the
memory, this natural distinction became obsolete.

Of course, the mainstream plodded on in the same direction, the
investments in programmer training/hardware as well as simple
human conservatism being its carrying momentum.

The crisis became visible only in the last decade, when CPU
speeds began to overtake RAM speeds: the von Neumann bottleneck
became prominent. There is every reason to suspect that our
current designs began to be vastly inefficient with the advent
of modern 32-bit CPUs such as the i386 or its RISC cousins.

Some symptoms of the creeping disease:

1) The transistors-per-system vs. performance curve
   shows distinct saturation, most evident in the
   8080/8088/286/386/486/586/P6 (Intel) or the
   6809/68000/68020/68030/68040/68060 (Motorola)
   evolutionary (?) sequence. Other CPU lines show
   the same trend. Even the (still) exponential
   MIPS/buck function shows signs of slowing down,
   as semiconductor photolithography rapidly
   approaches physical limits.

2) The time needed for a total memory sweep has
   _increased_ constantly over the last 15 years. The real-world
   memory bandwidth has hardly risen at all when one considers
   the concurrent bus width expansion from 32 to 64 bits (only
   about 30 MByte/s for a well-designed Pentium board versus
   15 MByte/s for an i486). A small calculation follows this
   list.

   This trend would be much more visible if current RAM prices
   weren't artificially held up (price speculation/bad system
   design). A modern machine should have about 32-64 MByte
   instead of 4-8 MByte by now, if we follow earlier extrapolations.

   The hierarchy of caches cloaks this phenomenon, but the
   discrepancy between CPU and memory speed becomes even more
   prominent when caching fails. (And for a large class of
   problems it does.)

3) The share of transistors devoted to pure computation
   (ALUs, registers, glue logic, etc.) has decreased
   constantly. One has trouble locating the CPU among
   all the caches, branch prediction units and the accompanying
   glue logic. I would estimate the core CPU (sans FPU) at
   about 5-10% (sic) of the die size of the latest designs
   (see recent BYTE issues).
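
The calculation promised under point 2, in Python (the memory
sizes and bandwidths are the figures quoted above; the 64 MByte
is what the extrapolation says a 1995 machine should carry):

  i486    = dict(ram_mbyte=8,  bw_mbyte_s=15)   # a typical i486 box
  pentium = dict(ram_mbyte=64, bw_mbyte_s=30)   # what a Pentium box "should" be

  for name, m in (("i486", i486), ("Pentium", pentium)):
      sweep_s = float(m["ram_mbyte"]) / m["bw_mbyte_s"]
      print("%-8s full memory sweep: %.2f s" % (name, sweep_s))
  # prints about 0.53 s for the i486 case and 2.13 s for the Pentium case:
  # the time to touch all of memory goes up, not down.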

There is more of the above, but by now I think my salient
point(s) is/are obvious:

Our present computers have become increasingly inefficient for
certain tasks (e.g. NN emulation, connectionist AI, fluid
dynamics, etc.) and grow even more so with each generation. The
reasons are partly the lack of innovation/investment
conservatism, partly the tremendous commercial success, since
these machines _can_ do what they were tailored to do very well:
office suites (PCs), for example, or matrix math (scientific
supercomputers). The "all-purpose-computer" idea begins to sound
somewhat hollow.

2. The Shape of Things to Come

Things aren't as gloomy as they appear to be. In particular,
Texas Instruments in its DSP line seems to be converging on
multiple DSPs with integrated RAM on one die, a potentially very
promising approach. Several obscure chip lines, e.g. those based
on the Novix Forth chip by Chuck Moore, offer significantly more
bang for the transistor buck. The Inmos (Inmos/SGS Thompson now)
Transputer chip line, though still severely impeded by recent
privatisation (thanks, M. Thatcher, for that), prepares the
launch of its Chameleone chip. Several special-purpose
neuroarchitectures, such as Siemens' Synapse 2 engine, have been
built recently. In recent MIT Press publications on connectionist
AI, even wafer scale integrated neurochip engines have been
introduced (5-8 wafers, 55 kTransistors/neuron, 8 MHz shared bus,
built from die segments).

Why is WSI one of the key features of the near future's computers?

2.1 The Lure of Wafer Scale Integration

Current chips are produced en bloc (32-64 dies apiece) by wafer
steppers through semiconductor photolithography; the silicon
wafer is then cut up with a diamond saw, tested, packaged and
tested again. Because of the random defects (roughly 1 per cm^2)
inherent to the production process, usable chip yields decrease
exponentially with the die size. Defective-chip rates for a
standard die often lie at 50-90 %, and the chances of producing
a defect-free wafer-sized circuit resemble those of encountering
a snowball in hell. Since chip production requires a large number
of steps, each needing large amounts of very pure and often
highly toxic chemicals (thus being an extremely environmentally
damaging industry), reducing defective chip yields even slightly
would be highly desirable.
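
The usual back-of-envelope tool here is the Poisson yield model,
Y = exp(-D*A). A quick Python sketch with the rough 1 defect per
cm^2 figure from above (the die areas are my examples):

  import math

  D = 1.0                                  # defects per cm^2 (rough figure)
  for area_cm2 in (0.5, 1.0, 2.0, 50.0):   # small die ... roughly a whole wafer
      yield_frac = math.exp(-D * area_cm2)
      print("area %5.1f cm^2: defect-free yield %g %%" % (area_cm2, 100 * yield_frac))
  # 0.5 cm^2 gives ~61 %, 2 cm^2 gives ~14 %, a whole wafer is practically zero

That exponential is the snowball in hell, in numbers.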

Fault tolerance of a wafer-scale integrated circuit must be
achieved by _redundancy_ and _self-healing_ at the level of the
individual die as well as at wafer scale. In highly repetitive
structures (particularly memory cell arrays are highly
redundant) self-healing can be achieved by bad cell remap via
software. By keeping the accompanying CPU area small, random
defect hit probability, which would render the die inoperable,
can be greatly reduced. DRAM area defects are not lethal as they
are mapped and marked as unusable by software. Finally, by
introducing die self-test and remapping at the wafer level,
_several_ defective dies on one wafer can be tolerated.
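
A toy Python illustration of the bad-cell remap idea (mine; real
DRAMs remap rows with fuses or a controller table, this only
shows the principle):

  class RemappedRAM:
      def __init__(self, rows, spares, bad_rows):
          self.data  = [0] * (rows + spares)
          self.remap = {}                       # defective row -> spare row
          for i, bad in enumerate(sorted(bad_rows)):
              self.remap[bad] = rows + i        # hand out the next spare row
      def write(self, row, value):
          self.data[self.remap.get(row, row)] = value
      def read(self, row):
          return self.data[self.remap.get(row, row)]

  ram = RemappedRAM(rows=1024, spares=16, bad_rows={7, 300})  # self-test found 2 bad rows
  ram.write(7, 42)       # physically lands in a spare row
  print(ram.read(7))     # 42: the defect is invisible to the software above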

Since cutting up and repackaging is a major source of costs and
introduces additional defects, using the wafer as a whole is
highly desirable.

As we will see, devising WSI-appropriate hardware and software
architectures is not as big a problem as one might imagine.

2.2 Why They Are So Slow

All modern systems suffer from a kind of schizophrenia: the basic
elements our computers are built from are fast and consume
little power, whereas the computers themselves are both slow and
dissipate a lot of heat.

A silicon tunnel diode based on quantum effects works at 3 THz
(terahertz, 3*10^12 /s), a ring oscillator or transistor still
switches at 0.3 THz (300*10^9 /s), whereas a modern PC's CPU can
do memory accesses at 0.00003 THz (30*10^6 /s). Just count the
zeros. A whole _lot_ of them, don't you think so?
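
Counting the zeros in Python, with the figures just quoted:

  tunnel_diode_hz = 3e12     # 3 THz
  transistor_hz   = 0.3e12   # 0.3 THz
  memory_hz       = 30e6     # 30 MHz memory access rate
  print(tunnel_diode_hz / memory_hz)   # 100000.0 -- five orders of magnitude
  print(transistor_hz / memory_hz)     # 10000.0  -- still four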

There are two simple reasons for this.

The first: our CPUs have _Complex Designs_. Yes, even the RISC
ones. The logic gate delay times add up arithmetically. A
complex logic engine needs a whole number of clock ticks to
settle into its final state. Reading its state before that is
risky: the result could well be still in the making.

The second: there is the _Problem of Locality_. As the memory
chips and the CPU sit quite apart from each other, the signal has
to propagate through the wires first, is then processed
(relatively fast) by the CPU, and gets RAMmed back through the
bus. The propagation velocity of the electric signal lies well
below the speed-of-light limit. Why? A wire has both a
capacitance and an inductance. As electrons move, they generate a
magnetic field opposing their movement. The more electrons there
are and the faster they flow, the higher the counterflux of the
opposing magnetic field. The potential (the number of electron
charges in a volume) in the wire has to reach a minimum value
before switching the next logic gate becomes possible. That means
burning power, as (re)charging costs energy. Just changing the
charge of a 25x25 um bond pad at high frequencies can take about
1 mW of power.
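
The 1 mW figure can be checked with the standard dynamic power
estimate P = C * V^2 * f; the load capacitance, voltage swing and
toggle rate below are my assumptions, not measured values:

  C = 1e-12     # ~1 pF total load for pad plus bond wire plus trace (assumed)
  V = 5.0       # 5 V signal swing (assumed)
  f = 66e6      # 66 MHz toggle rate (assumed)
  P = C * V * V * f
  print("%.1f mW" % (P * 1e3))   # about 1.7 mW -- the right order of magnitude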

All modern CPUs have a lot of signals, run at ultra short wave
radio frequencies and have many pins in pin grid packages. They
are difficult (costly) to fabricate, they need a lot of power
and their bond pads gobble up silicon die space, which becomes
unavailable for the CPU logic.

We have seen that sending a signal across a wire burns power:
the higher the frequency, the more power. Furthermore, as wire
geometries begin to reach radio wavelengths, the darn thing
begins to act as an antenna and to radiate power into space,
which impairs nearby telecommunication. Logic uses not sine
signals but (somewhat rounded, because of the above mechanisms)
square waves, which by the Fourier theorem still contain a wide
range of sine components. In other words: it is a radio
transmitter, and a wide-band one at that. Each bus wire begins to
experience cross-talk from its neighbours, the signal-to-noise
ratio rapidly turning abysmal.
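
A small Python illustration of the wide-band point: the Fourier
series of an ideal square wave at f0 contains all odd harmonics,
falling off only as 1/k (the 66 MHz clock is an assumed example):

  import math

  f0 = 66e6                              # assumed clock frequency
  for k in (1, 3, 5, 7, 9, 11):          # odd harmonics only
      amplitude = 4.0 / (math.pi * k)    # relative to the square wave's swing
      print("%6.0f MHz  relative amplitude %.3f" % (k * f0 / 1e6, amplitude))
  # significant energy still sits at several hundred MHz -- deep radio territory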

_Is_ there a better way to do things? Yes.

When we switch to photons, the situation becomes drastically
different. If we use a laser (integrated ones are best) talking
through a fiber, our only problem is converting the electric
signal to light. Once there, the signal speeds off at the speed
of light in glass (which, while somewhat lower than the vacuum
value of 299 792 458 m/s, is still _quite_ impressive),
experiencing none of the above problems. There is no interfiber
crosstalk, only crosstalk at the source (laser) and the detector.
Since the light is only slowly attenuated by the fiber, it can
travel much longer distances. Using high-quality fiber and
(non-semiconductor) lasers, signals have been sent over 200 km
(and even more) _without repeaters_. Moreover, one could dope a
length of fiber with neodymium or some such thing and pump it
with a semiconductor IR laser. The doped length of fiber thus
becomes a laser cavity and amplifies incoming weak signals by
stimulated emission. As glass is a very poor conductor of
electricity, potential differences of up to several kV can easily
exist between sender and receiver without either of them ever
noticing it. This ability to break current loops is widely used
by devices utilizing optocouplers. A good monomode fiber has a
potential bandwidth of about 100 THz (!). Today's fastest
detectors and lasers operate at only 10 GHz. There is still
quite a lot of headroom for growth, huh?
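
Two quick numbers behind this, in Python (the refractive index of
glass is my assumption; the 100 THz and 10 GHz figures are the
ones above):

  c = 299792458.0        # m/s in vacuum
  n = 1.5                # typical refractive index of glass (assumed)
  link_m = 10.0          # a 10 m interdevice link
  print("delay over %g m: %.0f ns" % (link_m, link_m * n / c * 1e9))   # ~50 ns
  print("bandwidth headroom: %.0fx" % (100e12 / 10e9))                 # 10000x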

One can now integrate as many as one million tiny lasers on _one_
chip today. With _smart pixels_, the integration of lasers,
detectors and processing electronics, one can manage complexities
of 20 000 transistors on GaAs chips.

There. Now I did say it: gallium arsenide. Unfortunately,
silicon won't lase. (Nanoporous silicon can emit light, but it
is no laser.) Photodiodes are possible in Si, though. One has to
use exotic III-V semiconductors such as GaAs and InP and the like
to see some IR or visible light. (Those semiconductor lasers are
just glorified LEDs (light emitting diodes), by the way.) These
III-V semiconductors are tricky. One has problems making them,
as they are not elements but compounds of two elements. As
gallium and arsenic have different volatilities, GaAs simply
decomposes when heated. GaAs is brittle. Gallium metal is dear.
Growing big crystals of GaAs is tricky, too. And to make the
bitter cup overflow, GaAs does not form a nonconductive
protective layer when heated, as silicon does, which is the
reason semiconductor people made silicon their favourite in the
first place.

These are some of the reasons the GaAs chips are still not quite
there. Some exotic Crays do use GaAs gate arrays, though.

Well, one can always fabricate an array of GaAs lasers, cut it
up and bond them upon silicon as a hybrid circuit. This is not
elegant, it costs some bucks more, but hell! it works with
today's technologies.

These glass optofibers are today almost unrivaled for very high
speed, long range communications. But one should use cheap
plastic optofibers for short links (about 1 cm - 10 m), that is,
for interchip and hardware device connections. They can be cut
with a simple tool and glued or melted (thermoplastic fibers)
onto smart pixels.

A single serial optic fiber link will laugh itself silly looking
at PCI's or VLB's bus performance.
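
A rough comparison, in Python (the PCI peak figure follows from a
32-bit bus at 33 MHz; the 10 Gbit/s serial rate is the
detector/laser limit quoted earlier and is optimistic for a cheap
plastic link):

  pci_peak_mbyte_s = 33e6 * 4 / 1e6     # ~132 MByte/s, and that is a peak figure
  fiber_mbyte_s    = 10e9 / 8 / 1e6     # one serial fiber at 10 Gbit/s
  print(pci_peak_mbyte_s, fiber_mbyte_s)   # 132.0 vs. 1250.0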

To wit:

- Buses are slow.

- Buses eat power and make heat and radio waves.

- Buses need big areas on PC boards.

- Buses will have to go.

(In the long run, at least. The FireWire guys had gotten _that_
right. But it is a high speed serial link, and they still won't
use optics. There's an optical SCSI successor in the pipeline,
though.)

One should not abandon electrical signaling altogether. Wherever
_tiny structures_ with very short wire lengths, and hence very
small capacitance/inductance and small signal propagation
delays, are needed, electrical signaling is still unavoidable.

2.3 Why is Neurocomputing Hard?

As I have argued in my previous ("von Neumann AI-inadequate")
posting, von Neumann machines have very limited processing
power, defined as a product of the mean translatory speed
through n-state-space and the (e.g. Shannon) path entropy. This
factor is most commonly referred to as limited memory bandwidth.
The currently used cache hierarchy scheme fails utterly when
nonlocal data has to be processed. The real-world bandwidth of a
well-designed Pentium board lies at about 30 MByte/s. The
plethora of new RAM architectures awaiting their financial
breakthrough in the course of this year boosts memory
performance somewhat but does not eliminate the basic problem
altogether.
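
To put a number on it, here is a crude ceiling on connections per
second for a bandwidth-bound emulator (the 30 MByte/s is the
figure above; the bytes moved per connection update is my
assumption):

  bandwidth_byte_s = 30e6
  bytes_per_conn   = 8            # assumed: weight plus state traffic per update
  print("%.1e CPS" % (bandwidth_byte_s / bytes_per_conn))   # ~3.8e+06 CPS

Compare this with the biological numbers in the list below.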


A comparison of the biological and the silicon ways of doing computation:

key features of bioNNs:

- slow signal propagation velocity (about 100 m/s for myelinated
  (Schwann cell) axons, 1 ms (1 kHz) spikes).

- very slow switching time (10-100 ms integration window at the
  synapse due to neurotransmitter diffusion time across the
  synaptic cleft)

- very high connectivity: average branching factor 10^3-10^4,
  top convergence factors up to 10^5 (!)

- very low power/volume dissipation

- extremely high performance (a rough CPS estimate follows this list)

- very high number of elements

- very dense packaging

- 3d packaging on a (noisy) grid, hence much shorter (as
  compared to 2d) avg. link length

- functionally robust (noisy input, physical (hardware) damage)
  due to redundant holographic representation

- excellent adaptivity (both hardware and software level)

- low linearity/dynamic range at neuron level

- mechanically/temporally unstable

- currently cannot be expanded/modified, though potentially able
  to take advantage of eventual additions (neuroplasticity)

- morphologically/functionally distinct functional blocks/modules
  (down to the cell level), highly conserved in the evolutionary
  process

- neocortex: hypergrid connectivity of individual neurons (connectivity
  about 10^3), layered structure consisting of overlapping modules
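
The rough CPS estimate promised above, in Python; the connectivity
figure is from the list, while the neuron count and mean firing
rate are my assumptions:

  neurons      = 1e10    # assumed order of magnitude for a human brain
  connectivity = 1e3     # average branching factor (low end of the range above)
  rate_hz      = 1e2     # assumed mean rate, well below the 1 kHz spike ceiling
  print("%.0e CPS" % (neurons * connectivity * rate_hz))   # ~1e+15 CPS

Set against the few-million-CPS bandwidth ceiling of a Pentium
board computed earlier, the gap is eight to nine orders of
magnitude.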


photolithographically constructed transistor chips on the other hand:

- low integration density (since 2d)

- regular 2d grid, thus longer links

- very low connectivity (due to fan-out and wiring constraints (2d))

- extremely short switching time

- high power dissipation

- physical robustness

- easily expandable through usage of blocks with
  well-defined interfaces

- susceptible to defects (architecture problems)


What does all this suggest?

If we are going to emulate bioNNs in silicon at mammal cps
(connections per second) performance (transiently, until we have
molecular circuitry) and significant neurons/area density, we
will have to do the following things:

- use a representation/architecture which will hold all
  the way down to molecular circuits

- time-multiplex existing structures (trading speed for
  higher (virtual) connectivity; a minimal sketch follows
  this list)

- use DRAM (dense) or SRAM (fast) memory cells to hold
  edge & Co data

- use an optimum die size for WSI both to keep defects
  down and boost performance (sequential emulation grain
  size)

- use hypergrid topology with serial/optical inter-die links
  and integrated routers (later more on that)

- interconnect wafers with optical fibres for high bandwidth/
  low power dissipation
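
As promised, a minimal Python sketch (mine, not a concrete
design) of the time-multiplexing point: one physical processing
element sweeps over many virtual neurons, with the edge (synapse)
data sitting in a flat, DRAM-like array:

  import array

  N_VIRTUAL = 1024     # virtual neurons per physical processing element
  FAN_IN    = 64       # virtual connections per neuron (toy figure)

  # the "DRAM": one flat array of weights, one of presynaptic indices
  weights = array.array("f", [0.01] * (N_VIRTUAL * FAN_IN))
  sources = array.array("i", [(i * 7) % N_VIRTUAL for i in range(N_VIRTUAL * FAN_IN)])
  state   = array.array("f", [1.0] * N_VIRTUAL)

  def step(state):
      # one emulation time step: the PE walks sequentially over all virtual
      # neurons, trading raw speed for (virtual) connectivity
      new_state = array.array("f", [0.0] * N_VIRTUAL)
      for n in range(N_VIRTUAL):
          base, acc = n * FAN_IN, 0.0
          for k in range(FAN_IN):                    # one burst of edge data
              acc += weights[base + k] * state[sources[base + k]]
          new_state[n] = max(0.0, acc)               # a toy activation function
      return new_state

  state = step(state)
  print(state[0])    # about 0.64 with these toy numbers

The inner loop is exactly what the DRAM bursts and the serial
inter-die links have to feed; the virtual connectivity costs
time, not wires.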


From my estimations it seems to be possible to integrate the
computational equivalent of about one cubic millimeter of the
neocortex on _one_ wafer at contemporary/near-future integration
density.

But it won't scale to, say, 1 MWafers very well, and achievable
complexity (due to financial/power dissipation constraints)
seems to be limited to the mouse equivalent. To upload humans we
will need molecular circuitry.

