Fast 3d party IP

[Dec 28, 2012] OR the external intellectual property (IP) that makes Allwinner’s unprecedented pace of next-gen SoC introductions possible despite a company size of only 500 employees. Preliminary reading: The future of the semiconductor IP ecosystem [my other, ‘Experiencing the Cloud’ trend tracking blog, Dec 13, 2012], which provides quite extensive background on how Allwinner was able to produce its A10, A13 and now A31 SoCs in the last 2 years, when it had even fewer people. But now let’s see the proof points that this pace will continue unremittingly as early as next year, 2013. This goes as far as achieving in 2013 as much as 12 times the performance of the 65 nm Cortex-A8 implementation, by combining 2xCortex-A15 and 2xCortex-A7 in a 28 nm implementation with almost the same energy consumption. It also extends to the capability of adding true x86 (x64) VIA Eden X2 silicon to a derivative of A31, as early as January 2013, in order to match Intel’s Z2760 (Clover Trail) SoC at much lower cost, thus opening a quite promising new avenue for delivering lower-cost Windows 8 hybrid tablets.

This whole page is therefore organised into three sections:

  1. Achieving 12 times the Cortex-A8 performance with almost the same energy consumption
  2. Skipping Cortex-A9 after using Cortex-A8 in A10 and A13, with the silicon cost of all four Cortex-A7 cores being four times less than the single A8 core
  3. Matching Intel’s Z2760 (Clover Trail) SoC at much lower cost for Windows 8, and with the same 4K UHDTV capability as Intel’s next-gen Haswell SoC

1. Achieving 12 times the Cortex-A8 performance with almost the same energy consumption

Here is the initial state of the 3rd-party system IP which quite probably will form the basis of Allwinner’s next-gen SoC, i.e. the chip which will extend its SoC offerings as fundamentally as the latest A31 did in December this year.
image
Source: CoreTile Express A15x2 A7x3 Power Management [ARM Application Note #318, Oct 15, 2012], which says:

The CA15_CA7 test-chip provides isolated power domains for the CA15 cluster, CA7 cluster and the SOC. These are supplied by PSU0, PSU1 and PSU2 respectively on the V2P-CA15_A7 board. The DDR and VIO domains are supplied from PSU3 and VIO.

This is a test chip by ARM Holdings, the same leading IP vendor that provided the IP for the Cortex-A8 and Cortex-A7 CPU cores, as well as the Mali-400 GPU cores, to Allwinner’s current A10, A13 and A31 SoC products. The chip (in fact an SoC itself) contains two CPU clusters: the one labelled 2 x CA15 on the above block diagram contains two cores, while the other, labelled 3 x CA7, contains three cores of another type. Placed on the CoreTile Express A15x2 A7x3 (V2P-CA15_A7 HBI-0249) board, this test SoC was developed by the IP provider to enable:

  • Exploration of the big.LITTLE architecture
  • Boot code, hypervisor, cluster-switching, device driver and application software development
  • Software debug and development of software tools through the on-chip CoreSight™ debug and trace infrastructure

as reported in the post Cortex-A15 and Cortex-A7 big.LITTLE hardware from ARM [Software Enablement blog of ARM, Nov 5, 2012].

What ARM calls the “big.LITTLE architecture” is based on the observation that, in order to keep increasing the performance of ARM-based SoCs while keeping power requirements as low as possible, processing in a multi-core chip should be shared not only among cores of the same type, but also between two quite different types:
– one type optimized for energy efficiency, for workloads which do not require much CPU performance; this is the “LITTLE” part of the architecture, and
– a second type optimized for performance, for workloads which require much greater performance; this is the “big” part of the architecture.

Both types of cores have already been developed by ARM, and chip vendors have just started to deliver them:
– the Cortex-A7 (CA7), first implemented in the Allwinner A31 (quad-core) SoC, which led to a number of Chinese tablet products in December, is the one ARM calls its most energy-efficient core so far;
– the Cortex-A15 (CA15), first implemented in the Samsung Exynos 5 (dual-core) SoC, which was brought to market in November in the Nexus 10 tablet jointly by Google and Samsung, is the high-performance one.

ARM presented the following performance results for these two types of cores (in terms of several well-known benchmarks) against the highest-performing ARM core in SoCs so far, the Cortex-A9:

image
noting that:

    • Cortex-A7 delivers performance approaching Cortex-A9
    • …At much lower power and area

Source: Advances in big.LITTLE Technology for Power and Energy Savings [Brian Jeff presentation for ARM Technology Symposiums first held on November 5 in Bangalore India,  published on Dec 5, 2012 for the Nov 8 Hyderabad event].

We should add to that the following Power/Performance curve for the two types of cores from the earlier, Big.LITTLE Processing with ARM Cortex™-A15 & Cortex-A7 [ARM whitepaper by Peter Greenhalgh, published on Oct 24, 2012]:

image
This shows that Cortex-A15 has a steeper curve, i.e. its power grows faster as performance increases, so there is indeed another good reason (besides splitting the performance/power territories) to have two types of cores.

Each individual curve represents how power is regulated in terms of the required performance by a method called Dynamic Voltage and Frequency Scaling (DVFS). Brian Jeff described this method in his article ARM’s ‘big.LITTLE’ processor taps DVFS to save energy [EE Times Asia, Oct 11, 2012] as:

Dynamic Voltage and Frequency Scaling (DVFS) is a technique employed by pretty much every mobile phone in production today. In this approach, the voltage and frequency of the applications processor are scaled based on the performance needs at a particular instant in time. This is implemented in Linux, for example, in a kernel space driver called cpu_freq. This driver samples the operating system load every 50ms or so – and based on the OS load and the power policy – makes a decision to ramp up, ramp down, or stay at the same voltage and frequency operating point. In this way, the applications processor can respond dynamically to the performance needs of the device, and save energy during periods where the processor is on but waiting for input, lightly loaded, or dealing with background activity. Big.LITTLE processor technology makes use of these existing DVFS mechanisms to save even more energy.
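The decision loop described in this quote can be sketched roughly as follows. This is only a toy model under stated assumptions, not the Linux driver code: the operating-point table, the two load thresholds and the function names are all hypothetical, chosen just to illustrate the "ramp up, ramp down, or stay" decision taken at each sampling period.

```python
# Toy model of a DVFS decision loop. All values below are hypothetical;
# real operating-point tables and thresholds are SoC- and policy-specific.
OPERATING_POINTS = [(200, 0.90), (600, 1.00), (1000, 1.10), (1200, 1.20)]  # (MHz, volts)

UP_THRESHOLD = 0.80    # assumed: ramp up when load exceeds 80%
DOWN_THRESHOLD = 0.30  # assumed: ramp down when load falls below 30%


def next_operating_point(index: int, load: float) -> int:
    """Return the operating-point index to use for the next sampling period."""
    if load > UP_THRESHOLD and index < len(OPERATING_POINTS) - 1:
        return index + 1  # ramp up one step
    if load < DOWN_THRESHOLD and index > 0:
        return index - 1  # ramp down one step
    return index          # stay at the same voltage and frequency


def simulate(loads, start_index=0):
    """Apply one decision per sampled load value (e.g. one sample every 50 ms)."""
    index = start_index
    trace = []
    for load in loads:
        index = next_operating_point(index, load)
        trace.append(OPERATING_POINTS[index])
    return trace


if __name__ == "__main__":
    # A burst of activity followed by an idle period.
    for mhz, volts in simulate([0.9, 0.95, 0.5, 0.1, 0.1]):
        print(f"{mhz} MHz @ {volts:.2f} V")
```

In a big.LITTLE system the same kind of table would simply continue from the top operating point of the LITTLE cluster into the operating points of the big cluster, which is how the existing DVFS mechanism gets reused.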

So the availability of the test chip on a board product, with a full suite of capabilities for SoC vendors to experiment with the big.LITTLE architecture and develop suitable control software, marks the last phase before combined big.LITTLE SoCs become reality. The first such product will most definitely come from Samsung, as the company is generally ahead of every other SoC vendor in exploiting the big.LITTLE opportunity. The product announcement could come as early as January, in the CES 2013 keynote by Dr. N-S (Stephen) Woo, president of the System LSI Business of Samsung Electronics’ Device Solutions, to which the SoC products belong. The event teaser video is already talking about things like “taking giant leaps forward”, and a 28-nm SoC with two quad-core clusters will be detailed at ISSCC in February.

Given Allwinner’s excellent track record, they could make this kind of “giant leap” as well, and a big.LITTLE product, most probably named A40, may be introduced by them as early as H1 CY2013.

Still, even after that some –otherwise well prepared– technical experts could have the following misperception (as proven by an actual response, translated below from the original language):

This big.LITTLE thing is quite exciting. But for the time being I don’t see the point of it beyond tablets. As Apple didn’t see the point of it and rather did a hacked to the maximum ARM9 called A6, instead of putting the A15 into the phone. The result? Almost the performance of A15, in some tests even better than that, in addition to a relatively reasonable power consumption. It’s enough just to look at the comparison tests of Nexus10 and the 4th-gen iPad .

If they could resolve that the A15 would not make the phone battery dead too quickly that would be a Canaan.

Funny (as this was not clear so far for readers with such a misperception), but that is the exact reason why big.LITTLE was introduced. The graph below shows that quite clearly:

image

Source: ARM Cortex-A57 – So Big is Relative but How Relative is Your Big? [ARM’s Ian Forsyth on SoC Design blog of ARM, Oct 30, 2012]

More information (in addition to those already linked above):

This video provides an overview of ARM big.LITTLE technology and shows an operating demonstration of how a Cortex-A15 and a Cortex-A7 processor in a big.LITTLE configuration can provide optimum performance and reduce energy consumption.

This is quite a significant help for the SoC company in developing and debugging power management software running on the V2P-CA15_A7 CoreTile, and thus for the big.LITTLE part of its future SoC (note that there would be other parts on that SoC as well, such as the GPU, the video engine etc.).


2. Skipping Cortex-A9 after using Cortex-A8 in A10 and A13, with the silicon cost of all four Cortex-A7 cores being four times less than the single A8 core

To understand the current Allwinner situation and further prospects with A31 let’s see the following performance comparison of the leading ARM cores as of today:
image
Source: Advances in big.LITTLE Technology for Power and Energy Savings [ARM whitepaper by Brian Jeff, published on Dec 3, 2012] where:

Across a range of benchmarks, the Cortex-A15 delivers roughly 2x the performance of the Cortex-A7 per unit MHz, and the Cortex-A7 is roughly 3x as energy efficient as the Cortex-A15 in completing the same workloads.

is the concluding note by the author about this graph. For me, however, this graph also shows that a quad-core Cortex-A7 SoC at a somewhat higher frequency can have exactly the same performance as a quad-core Cortex-A9 SoC.
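The two "roughly" figures in the quote (2x performance per MHz, 3x energy efficiency) can be combined into a rough average-power ratio. The sketch below assumes, for illustration only, that both cores run the same workload at the same clock frequency; the function name is hypothetical and the result is no more precise than the quoted approximations.

```python
def relative_average_power(perf_ratio: float, energy_ratio: float) -> float:
    """Average power of core B relative to core A on the same workload.

    perf_ratio:   how many times faster B completes the workload than A
    energy_ratio: how many times more energy B uses than A for that workload

    Since power = energy / time and time_B = time_A / perf_ratio,
    the power ratio is energy_ratio * perf_ratio.
    """
    return energy_ratio * perf_ratio


# Quoted figures: Cortex-A15 is ~2x faster per MHz, Cortex-A7 is ~3x as energy
# efficient (i.e. the A15 uses ~3x the energy for the same work).
a15_vs_a7_power = relative_average_power(perf_ratio=2.0, energy_ratio=3.0)

if __name__ == "__main__":
    print(f"Cortex-A15 draws roughly {a15_vs_a7_power:.0f}x the average power of Cortex-A7")
```

Under these assumptions the big core finishes in half the time but draws several times the power while doing so, which is the trade-off big.LITTLE is meant to exploit.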

This is important not only for the already very high-volume Android market, but for the just-opening “Windows Phone 8 / Windows RT / Windows 8” market as well. To me it shows that the current Cortex-A9 based Windows RT SoC, NVIDIA Tegra 3 [Feb 27, 2012], can easily be beaten by the Allwinner A31: in quad-core operation (as in the Microsoft Surface) Tegra 3 runs at 1.3 GHz for Windows, so a Cortex-A7 quad-core needs 1.6 GHz at most to match it. While Allwinner carefully avoids stating the A31 clock frequency, a deep search on the web turns up a couple of indications that it is 1.6 GHz in the currently available high-end Android tablet, the Onda V972.
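The clock-frequency reasoning here can be written out as simple arithmetic. In the sketch below the 1.3 GHz and 1.6 GHz figures come from the text; the per-MHz performance ratio is not stated in the source and is only derived from those two numbers, and the function names are hypothetical.

```python
def implied_perf_per_mhz_ratio(reference_clock_ghz: float, candidate_clock_ghz: float) -> float:
    """Per-MHz performance of the candidate core relative to the reference core
    that is implied if the candidate matches the reference at the given clocks."""
    return reference_clock_ghz / candidate_clock_ghz


def required_clock_ghz(reference_clock_ghz: float, perf_per_mhz_ratio: float) -> float:
    """Clock the candidate core needs in order to match the reference core."""
    return reference_clock_ghz / perf_per_mhz_ratio


TEGRA3_QUAD_CLOCK_GHZ = 1.3   # Cortex-A9 quad-core clock for Windows, from the text
A7_WORST_CASE_CLOCK_GHZ = 1.6  # "at maximum" Cortex-A7 clock, from the text

ratio = implied_perf_per_mhz_ratio(TEGRA3_QUAD_CLOCK_GHZ, A7_WORST_CASE_CLOCK_GHZ)

if __name__ == "__main__":
    print(f"Implied Cortex-A7 per-MHz performance: {ratio:.1%} of Cortex-A9")
    print(f"Clock needed at that ratio: {required_clock_ghz(TEGRA3_QUAD_CLOCK_GHZ, ratio):.2f} GHz")
```

In other words, the 1.6 GHz upper bound corresponds to assuming that a Cortex-A7 delivers at least about four fifths of the Cortex-A9 performance per MHz.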

This has a number of SoC advantages:

ARM claims a single Cortex A7 core will measure only 0.5mm2 on a 28nm process [in fact 0.45mm2 without L2 cache]. On an equivalent process node ARM expects customers will be able to implement an A7 in 1/3 – 1/2 the die area of a Cortex A8. As a reference, an A9 core uses about the same (if not a little less) die area as an A8 while an A15 is a bit bigger than both.

  • Or in other terms, the cost of the four CPU cores in A31 will be about 4 times less than the cost of the four CPU cores in Tegra 3 (less than the pure area ratio would suggest, as 28nm costs somewhat more than 40nm, being a newer node with more demand), as supported by this quote from here (and by a number of other places as well):

Looking at size ARM Cortex-A9 is about 2.5mm2 at 40nm [versus Cortex-A7 size being in fact 0.45mm2 at 28nm]
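The area figures quoted above can be turned into a quick back-of-the-envelope check. The core areas and the "about 4 times" cost figure are taken from the text; the per-mm2 cost premium of 28nm over 40nm is not given in any source here and is only what those numbers would imply, so treat it as an illustration rather than a foundry price.

```python
A7_CORE_AREA_MM2_28NM = 0.45  # Cortex-A7 at 28nm without L2 cache, from the text
A9_CORE_AREA_MM2_40NM = 2.5   # Cortex-A9 at 40nm, from the quote
CORES = 4                     # quad-core CPU clusters compared in the text
QUOTED_COST_RATIO = 4.0       # "about 4 times less", from the text


def total_cpu_area_mm2(core_area_mm2: float, cores: int = CORES) -> float:
    """Total silicon area of the CPU cores only (caches and other logic excluded)."""
    return core_area_mm2 * cores


a31_cpu_area = total_cpu_area_mm2(A7_CORE_AREA_MM2_28NM)
tegra3_cpu_area = total_cpu_area_mm2(A9_CORE_AREA_MM2_40NM)
area_ratio = tegra3_cpu_area / a31_cpu_area

# Derived, not sourced: how much more a mm2 of 28nm silicon would have to cost
# than a mm2 of 40nm silicon for the cost ratio to come out at about 4.
implied_28nm_cost_premium = area_ratio / QUOTED_COST_RATIO

if __name__ == "__main__":
    print(f"Four Cortex-A7 cores at 28nm: {a31_cpu_area:.1f} mm2")
    print(f"Four Cortex-A9 cores at 40nm: {tegra3_cpu_area:.1f} mm2")
    print(f"Area ratio: {area_ratio:.2f}x")
    print(f"Implied 28nm cost premium per mm2: {implied_28nm_cost_premium:.2f}x")
```

Note also that all four Cortex-A7 cores together occupy less area than a single 2.5mm2 core at 40nm.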

This simple reasoning was definitely behind Allwinner’s decision to skip Cortex-A9 after using Cortex-A8 in A10 and A13 (which are produced at 55nm, yet run at a whopping 1.5GHz in the case of A10).


3. Matching Intel’s Z2760 (Clover Trail) SoC at much lower cost for Windows 8, and with the same 4K UHDTV capability as Intel’s next-gen Haswell SoC

But there are much more ambitious considerations as well behind the decision outlined above, as I will first explain in a couple of points:

  1. With A31 as described above, Allwinner has produced a fully Windows RT capable SoC. Windows RT is the version of Microsoft Windows 8 produced for ARM-based SoCs like NVIDIA Tegra 3. It does not run legacy x86 (x64) software, however, only the new Windows 8 tablet software designed for the innovative new user experience (earlier named Metro) and applications ported by Microsoft or others, such as Microsoft Office now.
  2. Having this, a device vendor could produce a fully x86 (x64) capable device by adding to a proper A31 derivative a sufficiently low-power and sufficiently high-performance multi-core x86 (x64) chip, both integrated in a proper multi-chip carrier package (in a similar way as the Pentium D was). Taiwan-based VIA Technologies actually has such a chip technology already, ready for introduction as early as CES 2013. In fact the mastermind of such a chip fusion will very likely be Cher Wang, who is chairman and controlling owner of both VIA Technologies and the well-known HTC, a company already excelling in the smartphone market but now also in great need of boosting its portfolio against such giants as Samsung and Apple. Much more than a simple clue leading to this conclusion is given in Allwinner vis-à-vis HTC on 2013 International CES [my other, ‘Experiencing the Cloud’ trend tracking blog, Dec 10, 2012].
  3. Allwinner already indicated on its A31 product page that it will “Support Microsoft Windows 8”, not simply Windows RT, which is all that would otherwise be possible.
  4. It is yet another clue for me that Allwinner itself is a private company which was well funded for its first two years while working only on its market-leading video technology, and after that carried out industry-wide licensing of a number of leading CPU and GPU core IP, which costs quite a lot of money as well. In addition to all that, Allwinner introduced its first big SoC product, the A10 chip, at essentially dumping prices ($7 in volume quantities). All this required a lot of funding capital, which can only come from a private investment fund such as the one Cher Wang has (and with which she also partially funded the GPU-related IP company S3 years ago). So Cher Wang might be the major financier behind Allwinner as well! We will find this out as early as CES 2013, only two weeks ahead of us!

The integrated multi-chip carrier might look like this simple structure:

image

Below I will explain in detail why such a dual chip solution is quite feasible for the early January 2013 timeframe:

VIA Eden X2 Processors: Fanless Dual-core Performance [March 1, 2011]

Designed from the ground up for fanless implementation, VIA Eden X2 processors leverage the latest 40nm manufacturing process, combining two 64-bit, superscalar VIA Eden cores on one die.

All VIA Eden X2 processors … use a VIA NanoBGA2 package of 21mm x 21mm with a die size of 11mm x 6mm [66mm2 !! with a large L2 cache].

According to the latest VIA Dual Core Processor [Feb 20, 2012] datasheet:

| Model | U4100E | U4200E | U4300E | L4350E |
| --- | --- | --- | --- | --- |
| Processor | VIA Eden X2 | VIA Eden X2 | VIA Nano X2 | VIA Nano X2 |
| L2 cache | 2 x 1MB | 2 x 1MB | 2 x 1MB | 2 x 1MB |
| Speed | 800MHz | 1.0+GHz | 1.2+GHz | 1.6+GHz |
| FSB | 533MHz | 800MHz | 1066MHz | 1066MHz |
| VRM Type | For 7 bit only | For 7 bit only | For 7 bit only | For 7 bit only |
| TDP | 5~6W | 9W | 13W | 27.5W |

Then there is VIA QuadCore Processor [Sept 9, 2011]:

| Part No. | U4650E | L4700E |
| --- | --- | --- |
| CPU Speed | 1.0+GHz (ULV) | 1.2+GHz |
| Turbo Speed | 1.2GHz | 1.46GHz |
| FSB | 800MHz | 1066MHz |
| TDP Power | 18W | 27.5W |
| L2 Cache | 4M | 4M |
| Tj Max | 90C | 90C |
| PWM Design Requirement | Single Phase (PMON) | Dual Phase (PMON) |

where the performance is indicated as:

image

Now, the Intel Atom Z2760 “Clover Trail” system-on-chip features two x86 cores based on the Saltwell microarchitecture with a clock speed of up to 1.80GHz. For the Z2760 the reported PassMark CPU benchmark value is 679 at 1.8GHz. According to the above table, the same benchmark value for the VIA Nano dual core at 1.2GHz is 883.1. There is no TDP published for the Z2760 yet, so I will take the TDP of the recent S1200 Series Atom server products, since their CPU cores have the same microarchitecture (code-named Saltwell) and are manufactured with the same 32nm process. My estimate of the Z2760 TDP is therefore 7.3W (the average of the 6.1W S1240 running at 1.6GHz and the 8.5W S1260 running at 1.8GHz). A VIA Eden X2 at 0.973 GHz will have the same TDP, and its proportionally scaled PassMark CPU benchmark value will be 716, i.e. 5.5% higher than that of the Z2760 at its maximum frequency.
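The arithmetic in this paragraph can be reproduced in a few lines. All input figures are the ones given in the text; the 0.973 GHz equal-TDP clock is used as stated rather than re-derived, and the linear scaling of the benchmark score with clock frequency is the same simplifying assumption the paragraph makes.

```python
# Figures taken from the text above.
S1240_TDP_W = 6.1             # Atom S1240 at 1.6GHz
S1260_TDP_W = 8.5             # Atom S1260 at 1.8GHz
Z2760_PASSMARK = 679.0        # Atom Z2760 at 1.8GHz
NANO_X2_PASSMARK = 883.1      # VIA Nano dual core at 1.2GHz
NANO_X2_CLOCK_GHZ = 1.2
EDEN_X2_EQUAL_TDP_CLOCK_GHZ = 0.973  # equal-TDP clock as stated in the text

# Z2760 TDP estimated as the average of the two server parts.
z2760_tdp_estimate_w = (S1240_TDP_W + S1260_TDP_W) / 2

# Benchmark score scaled proportionally with clock frequency (an assumption).
eden_x2_passmark = NANO_X2_PASSMARK * EDEN_X2_EQUAL_TDP_CLOCK_GHZ / NANO_X2_CLOCK_GHZ

# Relative advantage over the Z2760 at its maximum frequency, in percent.
advantage_pct = (eden_x2_passmark / Z2760_PASSMARK - 1) * 100

if __name__ == "__main__":
    print(f"Estimated Z2760 TDP: {z2760_tdp_estimate_w:.1f} W")
    print(f"Scaled Eden X2 PassMark value: {eden_x2_passmark:.0f}")
    print(f"Advantage over Z2760: {advantage_pct:.1f}%")
```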

More information about VIA Technologies and its Nano processor based products:
Can VIA Technologies save the mobile computing future of the x86 (x64) legacy platform? [my other, ‘Experiencing the Cloud’ trend tracking blog, Nov 23, 2012] from which I am including the following quote:

Currently, work is being done on the CN-R, a small quad-core processor designed for TSMC’s 28nm process. About a year ago, Centaur released a processor called VIA QuadCore, internally referred to as CN-Q, which is divided onto two chips, in a similar way as the Pentium D was. Each single chip is a VIA Nano X2 manufactured in 40nm that was well able to compete with other chips of its class, like the Atom D510 and the AMD E-350. It is compatible with the classic Eden boards and a quad-core solution for mini-ITX boards was released a few weeks ago, the VIA EPIA P910.

Centaur still doesn’t have an integrated memory controller, for external communication they are still using the VIA V4 bus with 1333MHz, which is mostly identical to the bus of the Pentium 4. It serves as a link to the much bigger companion chip from VIA with north and south bridge. However, according to Henry, there are plans to integrate the chips into a SoC. But first, the CN-R is supposed to hit the market in the classic format with clock speeds between 1.2 and 2GHz around mid-2013. Which market that’s going to be, Henry doesn’t know yet: tablets seem likely, the netbook market is all but dead, but there’s still a niche market for small desktop PCs [“mini PCs”] and mini-servers as well as the embedded sector.

A few highlights could make the chip stand out among the competition: AVX2 and an advanced PadLock unit with new cryptography operations – Atom and Bobcat/Jaguar don’t offer either. In contrast to Intel’s Haswell processor, however, CN-R will neither support fused multiply-add nor offer a transactional memory extension, as the effort would have been too expensive.

Just a few hours before I had arrived, another economically motivated cut had been decided: instead of the initially planned central L3 cache with 4MB, Centaur chose to go down to 2MB in order to save space, costs and, above all, energy. Centaur has to work on the latter in particular to be able to compete with the big players in the business.

This means that first the current VIA Eden X2 derivative will be paired with an Allwinner A31 derivative and put on a two-chip carrier package (in the way the VIA QuadCore was produced). Then, at the time of the Intel Haswell introduction, VIA’s new 28nm quad-core processor chip, code-named CN-R, can be integrated in the same way with either an A31 successor that also includes Cortex-A15 core[s] or with an A31 derivative chip. Both products may even be introduced.

See also the post Intel Haswell: “Mobile computing is not limited to tiny, low-performing devices” [my other, ‘Experiencing the Cloud’ trend tracking blog, Nov 15, 2012]. As described there, only Haswell will have the same 4K UHDTV capability as we have now in A31.

More information about the current Intel Atom Z2760:
– An Intel® Atom™ Processor Designed for Windows* 8 Tablet [Intel® Developer Zone blog, Dec 21, 2012]
– Intel® Atom™ Processor Z2760 (1MB Cache, 1.80 GHz) [specifications on http://ark.intel.com/, Sept 27, 2012]
– Intel’s biggest flop: at least 3-month delay in delivering the power management solution for its first tablet SoC [my other, ‘Experiencing the Cloud’ trend tracking blog, Dec 20, 2012]
