Digging a well is difficult to parallelize. The digging at the bottom of the hole is essentially a one-person job, and other workers can only shovel away the excavated soil. Putting more people in the hole therefore does not finish the work sooner. On the contrary, they get in each other's way and slow it down. Some tasks are simply not suited to parallelization.
On the other hand, some tasks are easy to parallelize. Digging a waterway is a good example: multiple workers can work side by side.
The picture above illustrates a form of parallelization called MIMD (Multiple Instruction, Multiple Data). Each worker can perform a different task as an independent unit. In this case, four workers complete the work in roughly a quarter of the time one person would need.
SIMD (Single Instruction, Multiple Data) can be compared to a single worker using a shovel, as shown below.
A SIMD unit can perform only one type of operation at a time, but it applies that operation to multiple data items in parallel. Instructions of this kind are widely used in the vector processing units of many processors. SIMD is useful when the data is highly regular and the same processing must be repeated over a large dataset, as in image processing. For more general computing tasks, however, the model lacks flexibility and offers little improvement.
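The difference between the two models can be sketched in a few lines of Python. This is only a conceptual model, not real vector hardware: `simd_step` and `mimd_step` are hypothetical helper names, and the thread pool merely stands in for independent workers.

```python
from concurrent.futures import ThreadPoolExecutor


def simd_step(op, lanes):
    """Model of one SIMD step: the same operation is applied to every lane."""
    return [op(x) for x in lanes]


def mimd_step(jobs):
    """Model of one MIMD step: each worker runs its own operation on its own data."""
    with ThreadPoolExecutor(max_workers=len(jobs)) as pool:
        futures = [pool.submit(op, data) for op, data in jobs]
        # Results are collected in submission order, one per worker.
        return [f.result() for f in futures]


# SIMD: one instruction ("add 5"), many data items (pixels).
pixels = [10, 20, 30, 40]
brightened = simd_step(lambda p: p + 5, pixels)

# MIMD: four workers, four unrelated operations on unrelated data.
mixed = mimd_step([(abs, -3), (len, "core"), (sum, [1, 2, 3]), (str.upper, "ok")])
```

The SIMD helper cannot express the second workload at all, which is the lack of flexibility described above.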
For this reason, the usual choice is to place multiple complete CPU subsystems on one chip, that is, to build a multi-core processor. Placing multiple cores on one chip improves performance. Each core is a complete CPU and can operate on its own or in cooperation with other cores.
Different types of core can be combined on a processor chip in various ways, and processing can be assigned to them in various ways.
A homogeneous multi-core processor has two or more processor cores. Each core works autonomously and can communicate or synchronize with the other cores through mechanisms such as shared memory and mailbox systems. Each core has its own registers and functional units, and some have local memory and cache. These processors are called homogeneous (of the same kind) because all of the cores are of the same type.
Another type of multi-core chip is called heterogeneous multi-core and has two or more different types of CPU core. Typically, each core has quite different characteristics and serves a different processing need in the system. One example is a Bluetooth communication chip in which one core manages the Bluetooth protocol stack and the other handles external communication, application processing, and the human interface. This type of chip suits applications that need both a core with real-time performance and a core with system management functions.
Let's take a look at how these cores are used. When multiple cores run code from the same project code base, this is called symmetric multiprocessing (SMP). Different cores can execute different code at the same time, but the code is built as a single project and dispatched to each core by a control program such as a real-time operating system (RTOS). Cores that work this way must necessarily be of the same type, because all of them run the same project code compiled for one type of processor.
When multiple cores or processors run different project applications, this is called asymmetric multiprocessing (AMP). The cores can synchronize or communicate as needed, but each executes different code. Because each core runs its own project, the cores can be of different types (heterogeneous cores). This is not a requirement, however. Running separate project code on two or more cores of the same type is AMP on homogeneous cores.
In SMP operation, all cores run code from the same project code base, so multiple homogeneous cores are required. When different cores run different code from multiple projects, a heterogeneous system with different core types can be used, but it is equally possible to use identical cores.
In recent years, Moore's law, first proposed in the mid-1960s, has been approaching its limits, or at least losing momentum. Processor clock rates no longer double every two to three years, and the fastest CPUs have in fact plateaued in the low single-digit GHz range for years.
One way to keep improving performance is to put more CPU cores to work, provided they can be used efficiently.
Clock speeds have flattened, but transistors have continued to shrink. Although the pace is slower than before, smaller transistors allow more logic to be integrated on one chip. As a result, using these transistors to integrate multiple CPU cores on one chip allows much faster and wider bus connections between the CPUs and the memory subsystem than before.
Asymmetric multiprocessing with homogeneous cores is very effective when the application has two or more workloads with markedly different characteristics and requirements. For example, one workload may be real-time and strongly affected by interrupt latency, while the other depends more on throughput than on response time. This model works very well. In a device, for instance, one core may act as a communication processor that manages a protocol stack such as Bluetooth or Zigbee, while the other acts as an application processor that manages human interaction and the system as a whole. Separating out the communication processor provides the excellent real-time response the protocol stack requires. In addition, because the communication software can be verified for standards conformance, isolating functional changes from this part of the system makes it easier to certify the whole product.
What problems arise when multiple cores are integrated on one chip? Let's take a closer look.
Monolithic applications and software may not use the available computing resources efficiently. To use multiple cores, the application must be structured as parallel tasks that can execute at the same time. Software engineers may therefore need to think about embedded design in unfamiliar ways. Porting existing single-loop code is probably not easy. Both too few and too many threads can hurt performance.
Applications that share data structures and I/O devices between multiple threads and processes can suffer from serial bottlenecks. To maintain data integrity, access to these shared resources often has to be serialized with locking mechanisms such as read locks, read-write locks, write locks, spinlocks, and mutexes. If the locking design is inefficient, contention increases between the threads and processes trying to acquire locks on shared resources, and a bottleneck results. This can reduce the performance of applications and software. If one core is stalled while the other waits for a shared lock, performance can be worse than with a single core even though two cores are in use. As the number of cores and processors increases, application performance may degrade further.
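As a minimal illustration of serializing access to a shared resource, the following Python sketch protects a counter with a mutex. `SharedCounter` and `run_workers` are hypothetical names used only for this example, and Python threads stand in for tasks on separate cores.

```python
import threading


class SharedCounter:
    """A shared resource whose updates are serialized by a mutex."""

    def __init__(self):
        self.value = 0
        self._lock = threading.Lock()

    def increment(self, times):
        for _ in range(times):
            # Only one thread at a time may be inside this block.
            with self._lock:
                self.value += 1


def run_workers(num_workers, increments_each):
    """Start num_workers threads that all update the same counter."""
    counter = SharedCounter()
    threads = [
        threading.Thread(target=counter.increment, args=(increments_each,))
        for _ in range(num_workers)
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return counter.value


total = run_workers(num_workers=4, increments_each=10_000)
```

The lock keeps the count correct, but every increment passes through the same critical section, so adding workers adds contention rather than speed. This is the kind of serial bottleneck described above; narrowing the critical section or giving each worker a private counter that is merged at the end would reduce it.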
Computing resources also cannot be used efficiently if the load is uneven. Large tasks must be divided into small tasks that can run in parallel. In some cases, serial algorithms must be replaced with parallel algorithms to improve performance and scalability. Even then, if some tasks finish very quickly while others take considerably longer, the fast tasks have to wait a long time for the slow ones to complete. This leaves valuable computing resources idle and limits how well performance scales.
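The cost of an uneven load can be made concrete with a little arithmetic. The sketch below assumes one task per core and made-up task times; `parallel_stats` is a hypothetical helper written for this illustration.

```python
def parallel_stats(task_times):
    """task_times[i] is the time core i needs for its share of the work."""
    makespan = max(task_times)       # the job ends when the slowest task ends
    serial_time = sum(task_times)    # time a single core would need
    speedup = serial_time / makespan
    idle = [makespan - t for t in task_times]  # time each core spends waiting
    utilization = serial_time / (makespan * len(task_times))
    return makespan, speedup, idle, utilization


# Same total work (20 time units) on four cores, split two different ways.
balanced = parallel_stats([5, 5, 5, 5])
uneven = parallel_stats([2, 2, 2, 14])

print("balanced:", balanced)
print("uneven:  ", uneven)
```

With the balanced split the four cores finish together and the speedup is 4. With the uneven split three cores sit idle for 12 time units each and the speedup drops to about 1.4, even though the total amount of work is identical.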
An RTOS may help with these problems, but it cannot solve all of them. In an SMP system, an RTOS is essentially a prerequisite for scheduling tasks across cores of the same type. The work to be performed can be divided either by data or by function. When the work is divided into chunks of data, each thread performs every step of the processing pipeline on its own chunk. When the work is divided by function, one thread executes one step of the processing and another thread executes another step. Which approach is more efficient depends on the characteristics of the work.
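The two ways of dividing work can be sketched as follows. This is a simplified Python model with a made-up two-step pipeline (`scale`, then `offset`); the function names are hypothetical, and threads stand in for tasks that an RTOS would schedule onto cores.

```python
import queue
import threading
from concurrent.futures import ThreadPoolExecutor


def scale(x):
    return x * 2


def offset(x):
    return x + 1


def process_chunk(chunk):
    """Data decomposition: one thread runs every step on its own chunk."""
    return [offset(scale(x)) for x in chunk]


def by_data(samples, num_threads=2):
    size = max(1, (len(samples) + num_threads - 1) // num_threads)
    chunks = [samples[i:i + size] for i in range(0, len(samples), size)]
    with ThreadPoolExecutor(max_workers=num_threads) as pool:
        results = list(pool.map(process_chunk, chunks))
    return [y for chunk in results for y in chunk]


def by_function(samples):
    """Functional decomposition: one thread per step, linked by a queue."""
    handoff = queue.Queue()
    done = object()  # sentinel marking the end of the stream
    out = []

    def stage_scale():
        for x in samples:
            handoff.put(scale(x))
        handoff.put(done)

    def stage_offset():
        while True:
            item = handoff.get()
            if item is done:
                break
            out.append(offset(item))

    threads = [threading.Thread(target=stage_scale),
               threading.Thread(target=stage_offset)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return out


samples = [1, 2, 3, 4, 5]
```

Both functions produce the same result for the same input. The data split needs no hand-off between threads but requires chunks that can be processed independently, while the functional split keeps each step on one thread at the cost of queue traffic between stages.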
The most useful thing when debugging a multi-core system is visibility into all of the cores. Ideally, you can start and stop multiple cores at the same time, or single-step one core while the others keep running or remain stopped. Multi-core breakpoints can be extremely effective for controlling the behavior of other cores based on a condition on one core.
Multi-core tracing is not easy. Managing high-bandwidth trace information from multiple cores, and handling different kinds of trace data from different types of core, is very difficult.
Here is an example of a processor that implements both heterogeneous and homogeneous multi-core. There are two groups of homogeneous cores, one based on two Arm Cortex-A57 cores and the other on four Cortex-A53 cores. Each group is homogeneous, but the two groups are heterogeneous with respect to each other.
The CoreSight debug architecture provides protocols and mechanisms for communicating with the debug resources of every core. Managing all of this information and interpreting the messages from the different cores is the debugger's role. The Cross Trigger Interface (CTI) and Cross Trigger Matrix (CTM) make it possible to halt the cores simultaneously, trigger trace, and perform various other operations. The trace infrastructure includes a serial trace port (SWO) and a parallel trace port (TPIU) for streaming trace data out, and a trace funnel that merges the trace from each source into a single stream. As the figure above shows, this chip is far more complex and harder to control than a dual-core device.
The IAR Embedded Workbench C-SPY debugger supports both symmetric and asymmetric multi-core debugging. This is enabled through the options on the debugger's multi-core tab. The only step required to enable symmetric multi-core debugging is to enter the number of cores, which tells the debugger how many processors it has to communicate with. Similar options are available in other IDEs.
On the right side of the figure above is the debugger view. It shows the status of each core in a four-core Cortex-A9 SMP cluster: core 2 has stopped, while the other three cores are running.
An asymmetric multi-core system may use a heterogeneous multi-core device such as the STM32H745/755, which has one Cortex-M7 core and one Cortex-M4 core. In this case, two instances of the IDE are used when debugging (a master and a partner). Because the two cores run code from different projects, one instance is used per core.
Each IDE instance has status information about the core it controls and about the core controlled by the other window. Debugger behavior is controlled through options, so developers can start or stop the cores simultaneously or individually, as needed.
What makes such fine-grained control possible is Arm's built-in cross-trigger facility, which consists of the Cross Trigger Interface (CTI) and the Cross Trigger Matrix (CTM). There are three CTI components: one at system level, one dedicated to the Cortex-M7, and one dedicated to the Cortex-M4. As shown in the figure below, the three CTIs are connected to each other through the CTM. The system-level CTI and the Cortex-M4 CTI are accessible to the debugger through the system access port and its associated APB-D. The Cortex-M7 CTI is physically built into the Cortex-M7 core and is accessible to the debugger through the Cortex-M7 access port.
With the CTI, debug and trace actions can be triggered by events from various sources. For example, a breakpoint hit on one processor core can stop another processor, or a transition detected on an external trigger input can be set to start code trace.
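The routing idea behind cross triggering can be pictured with a toy model. The Python sketch below is purely conceptual and does not reflect the real CTI/CTM register interface: the class names, the event and action strings, and the channel numbers are all assumptions made for illustration.

```python
class CrossTriggerMatrix:
    """Toy CTM: forwards a channel event to every connected CTI."""

    def __init__(self):
        self.ctis = []

    def connect(self, cti):
        self.ctis.append(cti)
        cti.ctm = self

    def broadcast(self, channel):
        for cti in self.ctis:
            cti.on_channel(channel)


class CrossTriggerInterface:
    """Toy CTI: maps local events to channels, and channels to local actions."""

    def __init__(self, core_name):
        self.core_name = core_name
        self.ctm = None
        self.event_to_channel = {}   # e.g. "breakpoint" -> 0
        self.channel_to_action = {}  # e.g. 0 -> "halt"
        self.halted = False
        self.tracing = False

    def raise_event(self, event):
        channel = self.event_to_channel.get(event)
        if channel is not None and self.ctm is not None:
            self.ctm.broadcast(channel)

    def on_channel(self, channel):
        action = self.channel_to_action.get(channel)
        if action == "halt":
            self.halted = True
        elif action == "start_trace":
            self.tracing = True


ctm = CrossTriggerMatrix()
core_a = CrossTriggerInterface("core A")
core_b = CrossTriggerInterface("core B")
for cti in (core_a, core_b):
    ctm.connect(cti)

# Channel 0: a breakpoint on core A halts both cores.
core_a.event_to_channel["breakpoint"] = 0
core_a.channel_to_action[0] = "halt"
core_b.channel_to_action[0] = "halt"

# Channel 1: an external trigger seen by core B's CTI starts trace on core B.
core_b.event_to_channel["external_trigger"] = 1
core_b.channel_to_action[1] = "start_trace"

core_a.raise_event("breakpoint")
core_b.raise_event("external_trigger")
```

After the two events, both cores are halted and only core B is tracing. In real hardware the debugger sets up the equivalent mappings, so the developer normally sees only the resulting behavior.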
This example uses a heterogeneous multi-core processor with one Cortex-M7 core and one Cortex-M4 core on one chip. Two different programs are used, one running on the Cortex-M4 and the other on the Cortex-M7. In each project, FreeRTOS manages the software running on that processor, and the two cores communicate through a shared memory interface. Both applications use the FreeRTOS message-passing mechanism to communicate with the other processor, so the complexity of the underlying mechanism is hidden. From the point of view of one CPU, it is simply exchanging messages with another task. The fact that the other task runs on a different CPU core is transparent.
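The transparency described here can be illustrated with a small model. The sketch below is not the FreeRTOS API: `Mailbox`, `sender_task`, and `receiver_task` are hypothetical names, a Python queue stands in for the shared-memory channel, and threads stand in for the two cores.

```python
import queue
import threading


class Mailbox:
    """Hypothetical stand-in for a message channel over shared memory."""

    def __init__(self):
        self._q = queue.Queue()

    def send(self, msg):
        self._q.put(msg)

    def receive(self, timeout=1.0):
        return self._q.get(timeout=timeout)


def sender_task(outbox, count):
    """Runs on one 'core'. It only knows the mailbox, not where the peer runs."""
    for i in range(count):
        outbox.send(i)
    outbox.send(None)  # sentinel: no more messages


def receiver_task(inbox, received):
    """Runs on the other 'core' and collects messages until the sentinel."""
    while True:
        msg = inbox.receive()
        if msg is None:
            break
        received.append(msg)


def run_demo(count=5):
    channel = Mailbox()
    received = []
    core_m7 = threading.Thread(target=sender_task, args=(channel, count))
    core_m4 = threading.Thread(target=receiver_task, args=(channel, received))
    core_m4.start()
    core_m7.start()
    core_m7.join()
    core_m4.join()
    return received


messages = run_demo()
```

Neither task refers to a core or to shared memory. Moving the receiver to another core would change only the implementation behind `Mailbox`, which is the point of hiding the transport behind a message-passing interface.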
The figure below shows the IDE's workspace explorer window. It gives an overview of the two projects, so you can inspect the contents of both the Cortex-M7 project and the Cortex-M4 project.
If you select a tab at the bottom of the window, you can switch focus to either the M4 project or the M7 project.
The Cortex-M7 project has a task that sends messages to tasks running on the Cortex-M4. The Cortex-M4 runs two instances of the receiving task. The Cortex-M7 also has a "check" task that runs periodically to confirm that the processing is being carried out correctly.
Eventually, the debugger loads both projects. This means that a new instance of Embedded Workbench is started for the second debugger.
To set up the debugger for asymmetric multiprocessing, one project must be designated as the "master" project and the other as the "partner" project. In fact, the choice is arbitrary. It only determines which project launches the other at startup.
The "partner" project needs no special settings and is not aware that it is acting as the partner of another project.
In this way, when the "master" project launches the debugger, a second instance of the IDE starts automatically and provides a second debug session in which the second project runs.
Multi-core makes performance gains possible even as Moore's law reaches its limits. However, multi-core brings its own debugging challenges, and multi-core-specific development methods are needed to get the most out of multi-core architectures.
Once the debug setup is complete, multi-core debugging is as straightforward as single-core debugging. If you have used single-core tools in the past, you will find that this tool has everything you need, and you may wonder why people talk about how difficult multi-core debugging is.
The latest hardware and software tools will certainly help solve multi-core debugging problems.