Allwinner’s first big.LITTLE “Ultra Octa Core” A80 development board

Update: Allwinner A80 OptimusBoard big.LITTLE Octa Core A15/A7 [Charbax YouTube channel, Jan 21, 2014]

Allwinner launches the A80, an octa-core SoC that pairs ARM Cortex-A15 with ARM Cortex-A7 in a big.LITTLE MP configuration, with a PowerVR Series6 GPU supporting OpenGL ES 3.0 and GPU compute. It offers potential support for Chrome OS, ultra-fast Ubuntu, and advanced ARM projects that use extreme parallel processing. Allwinner is shipping these boards in Q1 2014; anyone interested in developing with the A80 should contact Allwinner here: http://www.allwinnertech.com/en/awt/contact.html

This is the first product to be delivered on schedule from the roadmap announced earlier in Three octa-cores and two more quad-cores to enhance the Allwinner SoC portfolio in the next two and one year respectively [‘USD 99 Allwinner’, Oct 14, 2013]:

image

Allwinner “Ultra Octa-Core” A80 OptimusBoard to be Unveiled at 2014 CES [press release, Dec 31, 2013]

The 2014 International Consumer Electronics Show (CES) will take place in Las Vegas from January 7 to January 10. As the leading mobile application processor design company, Allwinner Technology will showcase during CES its latest flagship Ultra Octa Core A80 development board, named OptimusBoard, 4G tablets powered by Allwinner A31/A31s, and dual-SIM phablets sporting A23 dual core. Expo visitors are welcome to visit at LVCC-South Hall S-2, MP25563.

Touted as Ultra Octa Core, Allwinner A80’s CoolFlex™-powered multi-CPU architecture enables low power, high peak performance devices. In addition, A80 packs a cutting-edge OpenGL ES 3.0 compliant, GPU compute optimized graphics processor to deliver the best performance while keeping the smallest area and lowest power consumption. More information on A80 octa core will be disclosed during CES.

image

One of the characteristics that makes big.LITTLE advantageous is the varied performance requirements of typical mobile workloads. The graph above shows the percentage of time spent in DVFS [Dynamic Voltage and Frequency Scaling] states, and in idle and full shutdown states, by two cores in a currently shipping Cortex-A9 based mobile device. In the diagram, the red color indicates the highest frequency operating point, while the green colored regions indicate the lowest frequency operating point, and colors in between represent intermediate frequencies. In addition to the DVFS states, the OS power management can idle a CPU. The light blue regions in the graph indicate this idle time (WFI [Wait-For-Interrupt] and OFF). When a CPU has been idled for a long enough period, the system power control software may take a core to a full shutdown to save leakage power. This is shown by the darkest color on the graph (Cluster Off).

It is clear from the graph above that the applications processors spend a considerable portion of time in lower frequency states across several common workloads. In a big.LITTLE system, the SoC would have the opportunity to run all but the dark red portions of the work on a lower power Cortex-A7 CPU.
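The reading of the residency graph can be sketched as a small calculation. This is only an illustration: the state names and percentages below are invented, since the actual figures from the ARM graph are not reproduced here. It computes the share of active CPU time spent below the highest operating point, which is the portion the text says could run on a Cortex-A7.

```python
# Hypothetical residency data: fraction of wall-clock time per state for one
# workload. State names and numbers are invented for illustration only.
residency = {
    "cluster_off": 0.30,
    "idle_wfi": 0.25,
    "freq_low": 0.20,
    "freq_mid": 0.15,
    "freq_max": 0.10,
}


def little_eligible_share(residency, peak_state="freq_max",
                          inactive_states=("cluster_off", "idle_wfi")):
    """Return the share of active time spent below the peak operating point."""
    active = {s: t for s, t in residency.items() if s not in inactive_states}
    total_active = sum(active.values())
    if total_active == 0:
        return 0.0
    below_peak = sum(t for s, t in active.items() if s != peak_state)
    return below_peak / total_active


share = little_eligible_share(residency)
print(f"{share:.0%} of active time is below the peak operating point")
```

With measured residency data in place of the invented numbers, the same ratio gives a rough upper bound on how much work a LITTLE core could absorb.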

image

In the following graph, more intense workloads are analyzed in the same way, and even in these cases there is significant opportunity to map frequencies below 1GHz to a Cortex-A7 processor, which is known to provide performance per clock within 5~10% of the Cortex-A9 processor.

image

Source: Software Techniques [ARM Holdings, April 2013]

Understand ARM big.LITTLE with Brian Jeff, Product Manager, Roadmap [Charbax YouTube channel, Nov 3, 2013]

Watch this overview of the latest ARM technology with Brian Jeff, Product Manager at ARM. He talks about big.LITTLE, the ARM Cortex-A series, 64-bit, the latest Mali graphics implementations and more. Nearly a dozen designs are about to go through with big.LITTLE; it is ready on production chips and shipping in devices on the market, and the new version of the software will make it into devices next year.

image

big.LITTLE Software Evolution from big.LITTLE technology moves towards fully heterogeneous Global Task Scheduling_final (pdf).pdf (2.2 MB) [ARM Techcon 2013 presentation, Nov 1, 2013]

From big.LITTLE technology moves towards fully heterogeneous Global Task Scheduling – Techcon Presentation [ARM blog post, Nov 7, 2013]:

Q2. How does the software react to low intensity tasks?

A. The big.LITTLE Global Task Scheduling software keeps a load history for each thread, so low intensity threads (that complete quickly) are scheduled to LITTLE processors on subsequent runs. If their load profile changes they can be up-migrated. The combination of load tracking and load history allows the software to dynamically react to the performance needs of running code.

Q3. When will it be running in Android?

A. Now. All of our results are running apps or benchmarks on Android, with the big.LITTLE Global Task Scheduling software applied as a patch set to the Linux kernel.
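The load-history idea from the answer to Q2 can be illustrated with a toy model. This is a minimal sketch, not the ARM patch set: the thresholds, the decay factor, and the `ThreadLoad` class are all hypothetical names and values chosen for illustration. Each thread keeps a decayed history of how busy it was, and crossing a threshold triggers an up- or down-migration.

```python
from dataclasses import dataclass

# Hypothetical tuning values, not taken from the ARM patch set.
UP_THRESHOLD = 0.80    # tracked load above this: prefer a big core
DOWN_THRESHOLD = 0.30  # tracked load below this: prefer a LITTLE core
DECAY = 0.5            # weight given to history versus the newest sample


@dataclass
class ThreadLoad:
    name: str
    load: float = 0.0        # decayed history of busy fraction (0..1)
    cluster: str = "LITTLE"  # threads start on LITTLE in this toy model

    def record(self, busy_fraction: float) -> None:
        """Fold the latest scheduling period into the load history."""
        self.load = DECAY * self.load + (1 - DECAY) * busy_fraction
        if self.cluster == "LITTLE" and self.load > UP_THRESHOLD:
            self.cluster = "big"      # up-migrate
        elif self.cluster == "big" and self.load < DOWN_THRESHOLD:
            self.cluster = "LITTLE"   # down-migrate


def simulate(name, samples):
    """Return the cluster chosen after each sample."""
    thread = ThreadLoad(name)
    placements = []
    for busy in samples:
        thread.record(busy)
        placements.append(thread.cluster)
    return placements


print(simulate("mail-sync", [0.1, 0.1, 0.2, 0.1]))
print(simulate("game-render", [1.0, 1.0, 1.0, 1.0, 0.1, 0.0, 0.0]))
```

The gap between the two thresholds is a deliberate choice in this sketch: it keeps a thread whose load hovers near one threshold from bouncing between clusters.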

From big.LITTLE Technology: The Future of Mobile [ARM Holdings whitepaper, Nov 18, 2013]

Results

Figure 8 shows CPU and SoC level power savings for a variety of representative mobile use-cases. When compared to a system composed only of big Cortex-A15 processors, a big.LITTLE system running ARM big.LITTLE MP implementation shows substantial power savings.

image
Figure 8: big.LITTLE MP Power Savings compared to a Cortex-A15 processor-only based system

image

Figure 9: big.LITTLE MP Benchmark Improvements

Figure 9 shows how the big.LITTLE MP model benefits benchmarks. The comparison is between a big.LITTLE system composed of four LITTLE processors and four big processors and a system composed of only four big processors.

The software thread affinity management techniques discussed earlier result in substantial performance gains for threaded benchmarks where the number of threads is greater than four. In this situation on the system under test, big.LITTLE MP enables the use of more processors to aid the benchmark. Offload migration helps with spreading the number of compute intensive benchmark threads to the LITTLE processors when the big processors are busy and overloaded. Idle-pull migration results in the best utilisation of the big processors which effectively work as accelerators.

For those benchmarks with fewer threads, using big.LITTLE MP either provides no degradation or a marginal but noticeable improvement. Compared to the test system with only four big processors, the dynamic software thread affinity management will promote better utilisation of the big processors, which will not be encumbered with low intensity and frequently running threads (such as system services) or interrupts.
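The two migration types named above, offload and idle-pull, can be sketched with a toy run-queue model. The whitepaper excerpt gives only their names and effects, so the selection rules here (move the lightest surplus task off an overloaded big CPU, pull the heaviest task onto an idle big CPU) and the function names are assumptions made for illustration, not the scheduler's actual heuristics.

```python
def offload(big, little, max_per_big=1):
    """Move surplus tasks from overloaded big CPUs to the least loaded LITTLE CPU."""
    for big_queue in big:
        while len(big_queue) > max_per_big:
            task = min(big_queue, key=lambda t: t[1])  # lightest task leaves
            big_queue.remove(task)
            target = min(little, key=lambda q: sum(load for _, load in q))
            target.append(task)


def idle_pull(big, little):
    """Let each idle big CPU pull the heaviest task from the LITTLE cluster."""
    for big_queue in big:
        if big_queue:
            continue  # this big CPU already has work
        donors = [q for q in little if q]
        if not donors:
            break
        donor = max(donors, key=lambda q: max(load for _, load in q))
        task = max(donor, key=lambda t: t[1])
        donor.remove(task)
        big_queue.append(task)


# Two big and two LITTLE CPUs; each task is a (name, load) pair.
big = [[("t1", 0.9), ("t2", 0.8)], [("t3", 0.9)]]
little = [[], []]

offload(big, little)      # more heavy threads than big CPUs: t2 spills over
after_offload = ([list(q) for q in big], [list(q) for q in little])

big[1].clear()            # t3 finishes, leaving a big CPU idle
idle_pull(big, little)    # the idle big CPU pulls t2 back
print(after_offload)
print(big, little)
```

In this model the LITTLE CPUs add throughput while the big CPUs are saturated, and the big CPUs take work back as soon as they have room, which mirrors the "accelerator" role the text describes.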

Conclusion

The ARM big.LITTLE MP technology has been well qualified with Android on multiple silicon implementations. The code is self contained and freely available as a drop-in into the vendor stack. It is interesting to note that the code doesn’t require any significant modification or tuning. The only requirement is that the platform board-support package be well tuned in terms of DVFS and idle power management, allowing the scheduler extensions to focus on getting the job done.

The big.LITTLE MP scheduler extensions are available in two forms:

  1. As a part of monthly Linaro Stable Kernel releases for the ARM TC2 platform. These releases, also known as LSK releases, contain a complete Android software stack for TC2 based on a very recent linux-stable kernel. The stack is available in source form and also as a pre-built binary set complete with boot firmware, boot loaders, ramdisk images and an Android root filesystem image.
    See https://releases.linaro.org/13.09/android/vexpress-lsk for details on the LSK.
  2. As an isolated patch set against the LSK’s kernel. See
    https://wiki.linaro.org/ARM/VersatileExpress?action=AttachFile&do=get&target=big-LITTLE-MP-scheduler-patchset-13.08-lsk.tar.bz2.

Introducing big.LITTLE [ARM Holdings microsite, Feb 26, 2013]

  • How it Works [Sept 2011]
  • Power and Energy Savings [Sept 2012]
  • Software Techniques [April 2013]

big.LITTLE™ is a state-of-the-art technology designed by ARM. With performance and data consumption predicted to increase by eight times, a future proofed energy efficient solution had to be found. That’s where big.LITTLE comes in.

By using multiple processors that are developed for different energy and performance budgets, big.LITTLE provides optimum performance and maximum efficiency. The first big.LITTLE solutions will pair the ARM Cortex-A15 processor with the ARM Cortex-A7 processor. This will be followed by a roadmap of new Cortex processors that build on big.LITTLE, making it an important innovation for future mobile devices.

If you want to think big, think big.LITTLE – technology that enables the next generation of mobile gaming, bigger brighter screens and longer mobile battery life.

Scroll down to see what’s possible when you think big.LITTLE.

Think big

High performance

  • Welcome to the next generation of mobile gaming. Console-style graphics can now be integrated into games for mobile devices.
  • big.LITTLE technology saves up to 70% processor energy consumption for common workload applications, so you can now get performance without compromise.

image

Longer battery life

  • Mobile devices that can last for days without needing to recharge are now possible.
  • Devices can stay connected for longer and will be able to stay constantly updated.
  • big.LITTLE extends battery life by optimising performance for apps based on the amount of processing power they need.
  • big will take care of high-performance apps such as gaming and video, while LITTLE will handle tasks such as texts and emails.

Key benefits

A long-term investment and part of a bigger strategy in simplifying software development and making the SoC far more efficient.

Allows all our semiconductor partners to reduce the energy consumption and increase performance, enabling consumers to do more for longer.

Enables seamless connectivity for consumers. We reach for our mobile every six minutes* and big.LITTLE caters to this need.

Sets a new standard for high performance and energy-efficient processing.

Transparent to apps, middleware and most parts of the operating system, so today’s software will work seamlessly.

Builds on ARM’s twenty years of experience in low-energy technology.

Saves up to 70% processor energy consumption in common workload tasks such as instant browser, whilst delivering console-quality gaming.

ARM scalable architecture that allows processors to be designed for different power and performance operating points.

Transparent to today’s diverse applications software, which can run on either of the identical processors.

*According to a study commissioned by Nokia in February 2013

Partnership

image

Partners using big.LITTLE technology are able to scale and offer product diversification in everything from tablets to smartphones. They can design once and choose many different performance points. In addition, the energy reduction they achieve can be used to add more and more functionality.

We are helping many OEMs by delivering low cost processors that can provide a range of high performance, low energy products and devices into the marketplace.

Our technology gives many members of the ARM ecosystem a chance to innovate, which ultimately provides massive breadth and diversity in a growing market where mobile is the primary computing device.

image

 

What is the latest progress on big.LITTLE technology? [ARM Processors > Blog, Nov 19, 2013]

The first few big.LITTLE devices are now shipping in volume production, and several new SoCs are coming to market soon that will accelerate the usage of this power and performance optimizing technology in mobile devices. Two of the public platforms employing big.LITTLE technology are the Samsung Exynos 5420 and the Allwinner A80, alongside a new platform introduced by MediaTek that is now sampling in pre-production tablet devices.

image

There are nearly a dozen other platforms in various stages of development. While the first big.LITTLE systems have employed an equal number of big and LITTLE cores, up to the maximum of eight, newer big.LITTLE systems will soon be deployed with varying numbers of big and LITTLE ARM Processors, tailored to specific segments within the mobile market.

In parallel with all of this hardware innovation, ARM software engineers have been enhancing big.LITTLE software even further. As mentioned in earlier blogs, the global task scheduling software that enables asymmetric topologies, with different numbers of big and LITTLE cores, is now being deployed in production devices.

See the top 10 things to know blog for more details – http://blogs.arm.com/soc-design/1009-ten-things-to-know-about-biglittle/

Now that ARM software engineers have access to multiple silicon vendor platforms, the work has shifted from getting the software working to optimizing system tuning for a broad range of mobile use cases, finding ways to extract even more performance improvements and power savings from the technology. The current results are extremely promising, and future development is expected to unlock even further benefits of big.LITTLE making it an increasingly important mobile SoC technology.

For a background on big.LITTLE it might be useful to start with a few other blog entries on the topic:

http://blogs.arm.com/soc-design/828-biglittle-in-64-bit/

http://blogs.arm.com/soc-design/598-combining-large-and-small-compute-engines-arm-cortex-a7/

Also, a recent Google Hangout goes into a discussion of the key technical aspects of the technology: http://www.youtube.com/watch?v=mT87oi6fGGM

To give an update on big.LITTLE technology, I’d like to introduce 3 key cases of big.LITTLE operation, and provide detailed use case CPU activity data and power savings metrics for use cases that illustrate those cases. I will roll out these updates over the course of a few weeks in 3 separate blog posts on each of the key types of use case outlined below:

big.LITTLE in Operation

There are three categories of run-time behavior that are particularly interesting for exploring the benefits and optimization potential of big.LITTLE:

  1. High-intensity (bursty) workloads
  2. Sustained workloads (with power and thermal limits)
  3. Long-use low-intensity workloads

image

In the high intensity cases, big.LITTLE software is particularly well suited for responding to the bursts of peak performance as well as the troughs of lower required performance. Performance peaks can be allocated to Cortex-A15 class “big” CPUs, and the troughs can be addressed with Cortex-A7 class “LITTLE” CPUs. Because of the hardware cache coherency and the architecture of the global task scheduling software, work can be migrated very quickly to big processors, and high performance threads are identified by their load history and automatically started on big processors when they run. In these cases, the effectiveness of the software is determined by the ability to react quickly to peak performance requirements (so as not to slow things down relative to a big core only system) and by the ability to effectively use LITTLE cores with big cores shut down during the troughs to save power.

In the sustained performance cases, the benefits of big.LITTLE are just becoming evident as real systems reach the market and are tuned to real-world workloads. One example here is mobile gaming: in mobile games, the graphics processor is operating at near peak capacity for much of the time, potentially consuming 80% of the SoC power budget. In a constrained thermal envelope, that leaves a reduced power budget for the CPU subsystem. big.LITTLE enables an important reduction in CPU power for these use cases, which allows the GPU to run faster and deliver a better mobile gaming experience under the same SoC power budget. It is also possible for big.LITTLE to enable a more optimal mix of compute resources for a significantly constrained power and thermal budget. Libraries are in development to exploit the optimal balance of thermal capacity between CPU and GPU; this paper will highlight two use cases that are already exhibiting improved performance through big.LITTLE power savings at the CPU subsystem level.

In the low intensity cases, big.LITTLE has the most obvious benefits, as these workloads can run entirely on LITTLE processors at lower operating voltages. In these cases, the effectiveness of the software is determined by the ability to remain running on the LITTLE processors without waking up the big cores.
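The power-budget argument for sustained workloads such as gaming comes down to simple arithmetic, sketched below. Only the 80% GPU share comes from the text; the 4 W SoC budget and the 50% CPU saving are hypothetical values picked to make the numbers easy to follow, and `gpu_headroom` is an illustrative helper, not part of any ARM library.

```python
def gpu_headroom(soc_budget_w, gpu_share, cpu_saving):
    """Watts freed for the GPU if CPU power drops by `cpu_saving` (0..1)."""
    cpu_budget_w = soc_budget_w * (1 - gpu_share)
    return cpu_budget_w * cpu_saving


soc_budget_w = 4.0  # hypothetical SoC thermal budget in watts
gpu_share = 0.80    # "potentially consuming 80%" of the budget, from the text
cpu_saving = 0.50   # hypothetical CPU power reduction from big.LITTLE

freed_w = gpu_headroom(soc_budget_w, gpu_share, cpu_saving)
gpu_before_w = soc_budget_w * gpu_share
print(f"GPU budget: {gpu_before_w:.2f} W -> {gpu_before_w + freed_w:.2f} W")
```

Because the CPU share is small to begin with, even a large relative CPU saving frees a modest absolute amount of power, but under a fixed thermal envelope that is the headroom the GPU can use.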

I will be publishing 3 subsequent blogs in the coming weeks describing the measurements from each of these three categories of operation. In the meantime, you can see the slides I presented on this topic at Techcon 2013 at the following link: big.LITTLE technology moves towards fully heterogeneous Global Task Scheduling – Techcon Presentation

big.LITTLE technology moves towards fully heterogeneous Global Task Scheduling – Techcon Presentation

created by brianjeff on Nov 7, 2013 8:32 PM, last modified by brianjeff on Nov 7, 2013 8:32 PM Version 1

I’m attaching my presentation from Tuesday the 29th at Techcon 2013. It was a well attended session with several good questions at the end of the talk and in a closed door press briefing immediately following the talk. I’ll be posting a blog entry shortly with a more detailed description of the data I presented. For now, here are the slides for your review – happy to take questions and comments.

Please comment or like if you find these slides useful.

Some of the questions from the event:

Q1. Why didn’t ARM use a unified L2 cache in big.LITTLE?

A. The design of the L2 is a fundamental part of the microarchitecture, and we optimized it differently for the “big” and LITTLE ARM processors. The L2 cache of the Cortex-A7 and Cortex-A53 is optimized for efficiency, while the L2 design in Cortex-A15 and Cortex-A57 processors is optimized for performance first. The L2 cache runs synchronously with the CPU frequency in the big and LITTLE CPU clusters to keep latency to L2 short (which is key to performance). This places the asynchronous boundary (which adds latency) beyond the L2, so that the big and LITTLE CPU clusters can be dynamically scaled in voltage and frequency based on the needs of the application. We are looking at different cache hierarchies in the next generation of big.LITTLE systems, but the separate L2 cache for the clusters in big.LITTLE is a key element of the design that has a relatively small area impact (given the L2 size on the LITTLE CPU cluster is typically smaller than that of the “big” CPU cluster).

Q2. How does the software react to low intensity tasks?

A. The big.LITTLE Global Task Scheduling software keeps a load history for each thread, so low intensity threads (that complete quickly) are scheduled to LITTLE processors on subsequent runs. If their load profile changes they can be up-migrated. The combination of load tracking and load history allows the software to dynamically react to the performance needs of running code.

Q3. When will it be running in Android?

A. Now.  All of our results are running apps or benchmarks on Android, with the big.LITTLE Global Task Scheduling software applied as a patch set to the Linux kernel.

More frequently asked questions can be found at – Ten Things to Know About big.LITTLE

big.LITTLE technology moves towards fully heterogeneous Global Task Scheduling_final (pdf).pdf (2.2 MB)
