
Detailed Platform Analysis in RightMark Memory Analyzer. Part 12: VIA C7/C7-M Processors
A new generation of desktop/mobile processors, VIA C7/C7-M, was presented by VIA Technologies relatively recently, at the end of May/beginning of June 2005. Nevertheless, these processors appeared in ready solutions (notebooks, as well as pc1000/pc1500 platforms) much later. There is still no technical documentation on these processors (unlike the previous model, VIA C3); the main official information is published only on the manufacturer's web site.

So, the new generation of VIA C7/C7-M processors (they differ mostly in their titles and maximum clock rates) uses the Esther core, based on the VIA CoolStream(tm) architecture. The philosophy of this architecture rests on three components: Secure by Design, Low Power by Design, and Performance by Design.

The first component is provided by the traditional VIA PadLock module built into the core, which was supplemented in these processors with SHA-1 and SHA-256 hashing, hardware support for the Montgomery multiplier used in the RSA encryption algorithm, and the NX (No Execute) bit.

The second CoolStream component is achieved thanks to IBM's 90nm SOI process technology. As a result, power consumption of C7/C7-M processors amounts to just 12-20 W (depending on the clock rate, from 1.5 to 2.0 GHz, respectively).

The third component (performance) is provided by the VIA StepAhead(tm) technology, which consists of the VIA V4 bus (a counterpart of the P4 Quad-Pumped bus) at up to 200 MHz (800 MHz quad-pumped), a 16-stage pipeline, a full-speed 128 KB exclusive L2 Cache, and the improved branch prediction unit typical of VIA. The second important factor of the third component is VIA TwinTurbo(tm), which allows the processor to switch between full-speed and power-saving modes within a single(!) CPU cycle thanks to two PLL units in the processor.
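Returning to the security component for a moment: the Montgomery multiplier mentioned for PadLock replaces the costly division in modular multiplication with shifts and masks. Below is a generic textbook sketch of Montgomery reduction (REDC) in Python to illustrate the operation such hardware accelerates; it is not PadLock's actual programming interface, which is not covered here.

```python
# Montgomery reduction (REDC) computes t * R^-1 mod n using only shifts,
# masks and multiplications: no division by n. A generic textbook sketch
# of the operation a hardware Montgomery multiplier accelerates for RSA.
# Requires Python 3.8+ for pow(x, -1, m).

def montgomery_setup(n, r_bits):
    """Precompute n' such that n * n' == -1 (mod R), with R = 2**r_bits."""
    r = 1 << r_bits
    n_prime = (-pow(n, -1, r)) % r
    return r, n_prime

def redc(t, n, r_bits, n_prime):
    """Return t * R^-1 mod n for 0 <= t < n * R."""
    r_mask = (1 << r_bits) - 1
    m = ((t & r_mask) * n_prime) & r_mask   # m = t * n' mod R
    u = (t + m * n) >> r_bits               # t + m*n is divisible by R
    return u - n if u >= n else u

# Usage: compute 7 * 11 mod 101 entirely in Montgomery form.
n, r_bits = 101, 8
r, n_prime = montgomery_setup(n, r_bits)
a_m, b_m = (7 * r) % n, (11 * r) % n            # to Montgomery form
prod_m = redc(a_m * b_m, n, r_bits, n_prime)    # Montgomery product
result = redc(prod_m, n, r_bits, n_prime)       # back to normal form
```

The payoff in real RSA code is that a whole modular exponentiation can stay in Montgomery form, converting only once at the start and once at the end.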
Users of VIA C7/C7-M-based platforms cannot yet see VIA TwinTurbo in action (to be more exact, the VIA PowerSaver technology as a whole): the manufacturer still hasn't released an official CPU driver that would allow an operating system to manage the power modes of these processors. Fortunately, we have found out experimentally that these technologies are implemented much better than their earlier version in VIA C3, in full compliance with the well-known, one might even say standard, Enhanced Intel SpeedStep technology.

Nevertheless, this article is not devoted to these technologies; it analyzes the main low-level characteristics of the new Esther core in RightMark Memory Analyzer. We'll compare these characteristics with what we previously got for the VIA Antaur platform (a mobile modification of the second-generation VIA C3 with the Nehemiah core, which was the first to support SSE instructions).

Testbed configurations

Testbed 1 (Notebook MarcoPolo43T)
CPU: VIA Antaur (Nehemiah core, 1.0 GHz)
Chipset: VIA CLE266/VT8622
Memory: 256 MB Hyundai DDR-266, 2.5-3-3-7 timings

Testbed 2 (Notebook MaxSelect Optima C4)
CPU: VIA C7-M (Esther core, 1.5 GHz)
Chipset: VIA CN700 (PM880/VT8235)
Memory: 512 MB Hyundai DDR-400 in DDR-333 mode, 3-3-3-7 timings
Video: VIA/S3 Graphics UniChrome Pro IGP, 64 MB UMA buffer
BIOS: 4.06CJ15, 10/31/2005

CPUID Characteristics

We'll start examining the Esther core, by the example of the 1.5 GHz VIA C7-M processor, with an analysis of the main CPUID characteristics.

Table 1. VIA Antaur (Nehemiah) CPUID
Table 2. VIA C7-M (Esther) CPUID
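For reference, the processor-signature values quoted in Tables 1 and 2 decode with simple bit arithmetic. A minimal sketch (the extended family/model fields used by Family 15 parts are ignored, which is fine for these Family 6 CPUs):

```python
# Decoding an x86 CPUID processor signature (EAX of leaf 1) into
# family / model / stepping fields.

def decode_signature(sig):
    stepping = sig & 0xF          # bits 3:0
    model = (sig >> 4) & 0xF      # bits 7:4
    family = (sig >> 8) & 0xF     # bits 11:8
    return family, model, stepping

print(decode_signature(0x698))    # VIA Antaur (Nehemiah) -> (6, 9, 8)
print(decode_signature(0x6A9))    # VIA C7-M (Esther)     -> (6, 10, 9)
```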
Esther differs very little from Nehemiah in the main ID code, the processor signature: the family number remains the same (6), and only the model number was incremented by one (from 9 to 10). We shouldn't take the new stepping number seriously, as there may be several models differing in this parameter. However, we shouldn't have expected serious differences in the family number (for example, from the Pentium II/III/M-like Family 6 to the Pentium 4-like Family 15): Esther can be considered an evolutionary step from the Nehemiah core, just as Yonah is an evolutionary step from Dothan. The analogy is direct: the family number remained the same (6), while the model number was incremented by one (from 13 to 14), even despite the dual cores in Yonah.

At the same time, the changes in the main low-level characteristics of Esther are quite serious. First of all, this concerns the L1-D/L1-I and L2 Caches of the processor: along with the L2 Cache increased to 128 KB, its associativity has grown to 32(!), and the line sizes in each cache level have been increased to 64 bytes. Other changes include SSE2 and SSE3 instructions (the latter without the specific MONITOR and MWAIT instructions, which are relevant only for Intel processors with Hyper-Threading and/or multiple cores), as well as Thermal Monitor 1 (TM1), Thermal Monitor 2 (TM2), and Enhanced Intel SpeedStep (EIST). As we have already mentioned, the latter corresponds to the EIST-like implementation of the proprietary VIA PowerSaver technology. And finally, the CPUID data of C7/C7-M state that these processors support the NX bit, mentioned above in the description of VIA PadLock.

Real Bandwidth of Data Cache/Memory

Let's proceed to the test results. As usual, we'll start with the tests of real L1/L2 D-Cache and RAM bandwidth.
Picture 1. Real Bandwidth of L1/L2 Data Cache and Memory

The bandwidth versus block size curves (Picture 1) look quite typical. We can say that the Esther core features an exclusive L1/L2 Cache: the L1 Cache size is 64 KB (the first inflection on the read and write curves), and the total size of the L1+L2 D-Caches is 192 KB (the second inflection).

Table 3
Bandwidth ratings of all three memory levels are published in Table 3. Compared to the previous solution from VIA, Nehemiah, Esther demonstrates no evident advantages in its L1 D-Cache architecture. Its efficiency in reading data into MMX registers has grown, but at the same time, the efficiency of write operations from both MMX and SSE registers has gone down a little. In return, the L1-L2 data bus has become more efficient (we are going to analyze it separately below), which manifests itself in higher L2 Cache efficiency for both read and write operations (in the latter case, L2 Cache bandwidth is even a tad higher than for reading, which is a rare phenomenon).

Despite the faster 100 MHz V4 FSB (400 MHz quad-pumped) offering 3.2 GB/s of theoretical bandwidth, as well as the faster DDR-333 memory (maximum bandwidth: 2.67 GB/s), the real memory bandwidth for plain reading (without optimizations) remains on a mediocre level, about 0.56 GB/s. At the same time, it's worth mentioning that this platform offers significantly higher bandwidth for writing data into memory, up to 0.62 GB/s, which is 3-4 times as high as on the earlier platform with a VIA Antaur processor.

Maximum Real Memory Bandwidth

Let's evaluate the maximum real memory bandwidth using various optimizations. First of all, let's analyze software prefetch, which does very well at reading from memory on most modern processors.
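The theoretical figures quoted above follow from simple bus arithmetic, assuming 64-bit (8-byte) data paths for both the V4 FSB and DDR memory:

```python
# Back-of-the-envelope check of the theoretical bandwidth figures.

def bus_bandwidth_gbs(base_mhz, transfers_per_clock, bus_bytes):
    """Peak bandwidth in GB/s for a bus clocked at base_mhz."""
    return base_mhz * 1e6 * transfers_per_clock * bus_bytes / 1e9

v4_fsb = bus_bandwidth_gbs(100, 4, 8)     # quad-pumped 100 MHz -> 3.2 GB/s
ddr333 = bus_bandwidth_gbs(166.67, 2, 8)  # double data rate    -> 2.67 GB/s

# The measured unoptimized read (~0.56 GB/s) uses only about a fifth
# of what DDR-333 could theoretically deliver.
read_utilization = 0.56 / ddr333
```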
Picture 2. Maximum real memory bandwidth, Software Prefetch / Non-Temporal Store

The curves of memory bandwidth versus PF distance (Picture 2) on a C7-M processor look noticeably different from the curves previously obtained on VIA C3/Antaur. In particular, the curves show a distinct PF efficiency maximum at a distance of 192 bytes (that is, prefetching data 3 cache lines ahead of the current data), while software prefetch on C3/Antaur processors was not very efficient, and memory bandwidth went down smoothly as the PF distance was increased. The maximum real memory bandwidth for this method (962 MB/s) is the absolute maximum: as you can see in Table 4, the other optimization methods are not as efficient. This also differs from the previous generation of processors on the Nehemiah core; in that case, the maximum real memory bandwidth was achieved only with the "foul" method of reading whole cache lines (it's "foul" because this method cannot read all data from memory, even though the data transfer along the FSB is done at full scale).
Picture 3. Maximum real memory bandwidth, Block Prefetch 1 / Non-Temporal Store
Picture 4. Maximum real memory bandwidth, Block Prefetch 2 / Non-Temporal Store

As for the Block Prefetch 1 and Block Prefetch 2 methods, the first was originally designed for achieving maximum memory bandwidth on AMD K7 processors and the second for AMD K8; their behaviour on VIA C7-M (Pictures 3 and 4, respectively) is similar to what we saw on VIA C3/Antaur. Maximum efficiency of Method 1 is achieved at a 64 KB PF block, of Method 2 at 1 KB (that is, the second method is practically inefficient). And finally, the method of reading whole cache lines, which yielded the best results on VIA Antaur, is also quite efficient on C7-M (memory bandwidth of 910-920 MB/s), but its efficiency is a tad lower than that of the software prefetch method. This balance evidently favours the new Esther core, as software prefetch is the most frequently used, practically universal optimization method for reading data from memory.

Table 4
*values relative to the maximum possible memory bandwidth for this memory type are given in parentheses

As for optimization methods for writing data into memory, only one method is of real practical value: non-temporal store directly from MMX/SSE registers into memory via the Write-Combining buffers, bypassing the entire hierarchy of processor caches. It is the undisputed leader both on the earlier Nehemiah and on the new Esther platform; memory bandwidth on the latter reaches about 1.72 GB/s (Table 5), that is, nearly 65% of the theoretical maximum. The method of writing whole cache lines, which yielded a tad higher memory bandwidth on Nehemiah, springs a surprise here: its efficiency is lower than that of writing data through the entire hierarchy of processor caches. We cannot explain this behaviour, but it is of no great practical importance, as this method is purely synthetic and cannot be used for writing real data into memory.

Table 5
*values relative to the maximum possible memory bandwidth for this memory type are given in parentheses

Average Latency of Data Cache/Memory

Before we proceed to analyzing the latencies of various memory levels, we'd like to remind you that the most significant change to the D-Cache and I-Cache in the new C7/C7-M processors with the Esther core was the increase of their line length from 32 to 64 bytes. As this very value appears in all latency tests and is determined automatically when RMMA starts on a new, unknown processor, let's have a look at the curves for the L1 and L2 D-Cache line sizes.
Picture 5. Determining L1 D-Cache line size

The L1 D-Cache line size (Picture 5) leaves no room for doubt: in all walk modes, the maximum access latency increase in this test (which is a modification of the data arrival test) occurs when reading a neighbouring element that is 64 bytes or more away from the main element. Consequently, the L1 Cache line size is indeed 64 bytes.
Picture 6. Determining L2 D-Cache line size

A fairly clear picture, though with some interference from hardware prefetch (yes, it was implemented for the first time in VIA processors; we shall look at it in more detail below), is also demonstrated by the second, L2-RAM bus, data arrival test. We can conclude from these curves that in all cases data are transferred from memory into L2 Cache in whole L2 Cache lines, whose size is also 64 bytes. And now let's proceed to the L1/L2 D-Cache and RAM latencies as such.
Picture 7. L1/L2 D-Cache and RAM Latency

These curves (Picture 7) look typical of processors with an exclusive cache hierarchy. The most significant difference between Esther and Nehemiah, which we have already seen above, is the implementation (for the first time in VIA processors) of hardware prefetch, which is efficient in the case of a forward sequential memory walk. A similar picture could be seen, for example, on AMD K7 processors.

Table 6
*4MB block size

Quantitative ratings of the L1/L2 Cache and RAM latencies are published in Table 6. L1 Cache latency in the new Esther core remains at the level of 6 cycles, which first appeared in the Nehemiah core (the previous generation of VIA C3 processors, with the Ezra/Samuel cores, used to have a 4-cycle L1 Cache). Thus, VIA processors still have the highest L1 D-Cache access latency around. The "average" L2 D-Cache latency of the new Esther core has also grown (that is, the latency in normal conditions, without unloading the bus): it is 4 cycles higher than in Nehemiah. Well, the changes are not unexpected, considering the significant changes in the cache structure, namely the longer cache lines (64 bytes, which are more common now).

RAM latency on the platform under review is very high; it's even higher than on the previously reviewed platform with VIA Antaur. A partial remedy to this situation is hardware prefetch for forward sequential memory access, which reduces the latency to 120 ns, while the true latency is demonstrated during backward and pseudo-random walks: 194 ns. As we have written many times, the further increase in latency in random walk mode has to do with depleting the D-TLB, which in this case can hold only 128 pages, that is, it can cover an area of no more than 512 KB.

Minimum Latency of L2 D-Cache/Memory

Let's rate the minimum L2 D-Cache latency of VIA C7-M processors by unloading the L1-L2 bus with empty operations (Picture 8).
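As an aside, the walk patterns these latency tests rely on can be sketched as a pointer chain, where every access depends on the previous one. This is a simplified illustration of the technique, not RMMA's actual code:

```python
# Each element stores the index of the next element to visit, so the CPU
# cannot overlap the dependent loads; that exposes the true access latency.
import random

def build_chain(n_lines, mode):
    """Build a cyclic pointer chain over n_lines cache lines."""
    if mode == "forward":
        order = list(range(n_lines))
    elif mode == "backward":
        order = list(range(n_lines - 1, -1, -1))
    else:  # a random walk defeats hardware prefetch entirely
        order = list(range(n_lines))
        random.shuffle(order)
    nxt = [0] * n_lines
    for i in range(n_lines):
        nxt[order[i]] = order[(i + 1) % n_lines]  # close the cycle
    return nxt

def cycle_length(chain, start=0):
    """Follow the chain until it returns to the start."""
    steps, i = 1, chain[start]
    while i != start:
        i = chain[i]
        steps += 1
    return steps

# A valid chain is one cycle visiting every line exactly once.
assert cycle_length(build_chain(64, "random")) == 64

# The 128-entry D-TLB covers 128 * 4 KiB = 512 KiB, so a 4 MB block
# walked randomly misses the TLB on most accesses.
dtlb_coverage = 128 * 4 * 1024
```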
Picture 8. Minimum L2 D-Cache Latency

For unknown reasons (this happens on all VIA processors), the preliminary test measuring the execution time of a single NOP gives an excessive result (in this case, about 1.03 cycles instead of 1.00), which leads to a sinking baseline as the number of NOPs inserted between cache access commands grows. Nevertheless, the error is relatively small in this case. It allows us to see that the minimum latency in all cases is achieved when 12 NOPs are inserted: 17 cycles. The same value (with 8 or more NOPs inserted) could also be seen on VIA Antaur (see Table 7).

Table 7
*4MB block size

We shall rate the minimum memory latency using a similar test, with the block size increased to 4 MB.
Picture 9. Minimum RAM Latency

The picture of the unloaded L2-RAM bus (Picture 9) resembles that of AMD K7 processors, where hardware prefetch is likewise implemented for forward sequential walks only and is of relatively low efficiency: the unload curve goes down very smoothly and does not reach its minimum even with 64 NOPs inserted. The minimum memory latency at this point is 94 ns. In the case of backward, pseudo-random, and random walks, the curves have a typical "saw-tooth" look with a period of 15 NOPs, which corresponds to the FSB frequency multiplier (100 MHz x 15 = 1.5 GHz). As usually happens, the minimum latencies in these modes are not much different from their mean values.

Data Cache Associativity

The L2 Cache of the new Esther core has a very interesting peculiarity: very high, 32-way associativity. VIA C7/C7-M models are currently the first processors with such a highly associative cache (it's not quite clear why, considering its small capacity of just 128 KB). Let's try to determine the L1/L2 D-Cache associativity of the processor under review using a standard test (Picture 10).
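The idea behind such an associativity test can be sketched as follows (a simplified model assuming the 128 KB, 32-way, 64-byte-line L2 reported by CPUID): addresses spaced exactly one "way size" apart all map to the same cache set, so touching more segments than there are ways forces an eviction on every access.

```python
# Set mapping in a set-associative cache: the way size is
# cache size / associativity, and any two addresses a multiple of it
# apart compete for the same set.
LINE = 64

def set_index(addr, cache_bytes, ways):
    """Set number an address maps to in a set-associative cache."""
    n_sets = cache_bytes // (LINE * ways)
    return (addr // LINE) % n_sets

l2_bytes, l2_ways = 128 * 1024, 32
way_size = l2_bytes // l2_ways    # 4 KB: the segment stride of the test

# Walking 33 segments of a 32-way cache overflows one set, which is
# exactly the inflection the associativity test looks for.
addrs = [seg * way_size for seg in range(l2_ways + 1)]
sets_hit = {set_index(a, l2_bytes, l2_ways) for a in addrs}
```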
Picture 10. L1/L2 D-Cache Associativity

Unfortunately, this test does not give a clear picture. According to its results, we can say that the L1 Cache associativity equals four (the first inflection on the graph); but as the number of segments grows, the picture gets blurry. That's not surprising: the maximum number of segments in this test is limited to 32 (when this test was developed, there were no processors with such caches), while the second inflection in this case (exclusive organization) must appear at a number of segments equal to the total associativity of the L1+L2 Caches, that is, 36. Well, we can only take the 32-way associativity of the L2 Cache in C7/C7-M processors on faith. Later on we shall expand the functionality of our test package.

Real L1-L2 Cache Bus Bandwidth

Considering the exclusive organization of the L1-L2 D-Caches in Antaur and C7/C7-M processors, where each access to L2 Cache, along with transferring data from L2 into L1, is accompanied by pushing the "victim" line from L1 to L2, the real L1-L2 bus bandwidth values obtained in RMMA were doubled (Table 8).

Table 8
*taking into account the exclusive organization of the cache

As we know from our previous analysis of VIA processors, the L1-L2 data bus in VIA Antaur is characterized by relatively low bandwidth, just 2.56 bytes/cycle, which might suggest a mere 32-bit organization. Nevertheless, the arrival test in that review showed that the bus width was actually 64 bits, as reading adjacent elements within a single 32-byte cache line was not accompanied by additional latencies. The L1-L2 bus has become much faster in the new Esther core, up to 4.4 bytes/cycle (a tad faster for writing than for reading; the same picture could be seen in the very first test, when we examined L2 Cache bandwidth for reading and writing). That's quite good, actually, considering the increased bandwidth requirements imposed on this bus by the cache line enlarged to 64 bytes. At the same time, the bus width still remains at 64 bits, which is not that bad: the same situation (64-bit bus, 64-byte lines) can be seen in the AMD K7 architecture, for example. Let's use the arrival test (Picture 11) to confirm our assumption about the 64-bit L1-L2 bus.
Picture 11. Data Arrival from L1-L2 Bus, Test 1

This test reads two elements from the same cache line, the second element being at a specified distance (4-60 bytes) from the first one (the beginning of the line). It shows that only the first 48 bytes (from 0 to 47 inclusive) arrive from the L2 cache within the 6 cycles of an L1 Cache access, while a request for the next, 48th byte is accompanied by a noticeable latency growth. It means that the data transfer rate is indeed 48/6 = 8 bytes/cycle, that is, the bus width is 64 bits. Besides, the second variation of the arrival test (Picture 12) allows us to learn additional details about the order of data arrival from L2 to L1.
Picture 12. Data Arrival from L1-L2 Bus, Test 2

The offset of the first requested element from the start of the line is a variable in this test (from 0 to 60 bytes), while the second requested element is always shifted from the first one by -4 bytes (except for the initial point, where the offset equals -4 + 64 = +60 bytes, because both elements must be in the same cache line). The curves (Picture 12) show that data from the L2 Cache of VIA C7/C7-M processors are read in 16-byte blocks. Reading can start from any 16-byte-aligned position (the maximum points on the curves):

1) 0-15, 16-31, 32-47, 48-63
2) 16-31, 32-47, 48-63, 0-15
3) 32-47, 48-63, 0-15, 16-31
4) 48-63, 0-15, 16-31, 32-47

A similar picture, but with 8-byte "granularity", is demonstrated by the AMD K7/K8 architecture; that is, the D-Cache organization in VIA processors is growing increasingly similar to that in AMD processors.

I-Cache, Decode/Execute Efficiency

First, let's have a look at the situation with decoding/executing the simplest NOP instructions, because it's quite different from the typical picture (Picture 13).
Picture 13. Decode/execute efficiency, NOP instructions

Namely, there is no distinct area corresponding to executing instructions from L2 Cache. This does not mean that the L2 Cache of this processor "does not work" for caching code (as opposed to data): the decode/execute speed in the 64-192 KB range does actually differ from that beyond the total size of the L1 and L2 caches. At the same time, the situation with decoding/executing other instructions, for example the 6-byte compare operations cmp eax, xxxxxxxxh (CMP 3-6), is more typical (Picture 14).
Picture 14. Decode/execute efficiency, CMP instructions

Let's proceed to the quantitative ratings of decode/execute speed in Table 9.

Table 9
As for decoding/executing the simplest ALU operations (independent as well as pseudo-dependent) from L1 Cache, nothing has changed here since the first VIA C3 processors (probably even since earlier VIA/Centaur processors). The maximum decode/execute speed of these instructions remains at one instruction per cycle, which is far too slow these days. The decode/execute speed of CMP 2 (cmp ax, 0000h) and Prefixed CMP 1-4 ([rep][addrovr]cmp eax, xxxxxxxxh) instructions is still lower than that of the other instructions: their execution is slowed down by a factor equal to the number of prefixes plus the main operation. For example, CMP 2 executes twice as slowly (1 prefix + 1 operation), while the prefixed CMP instructions execute three times as slowly (2 prefixes + 1 operation). It means that, like all previous VIA processors, processors on the Esther core still spend their execution units on "executing" each prefix. The Prefixed NOP Decode Efficiency test, which decodes/executes instructions of the form [66h]nNOP, n = 0..14, corroborates this fact (Picture 15).
Picture 15. Decode/execute efficiency for prefixed NOP instructions

The decode/execute efficiency of such instructions, expressed in bytes/cycle, does not depend on the number of prefixes and always equals 1 byte/cycle. It means that the execution time of a single instruction, expressed in CPU cycles, indeed grows linearly with the number of its prefixes. This approach is quite inefficient, considering that prefixes are not that rare in x86 code (especially in the SSE/SSE2/SSE3 instructions now supported by the Esther core).

I-Cache Associativity
Picture 16. I-Cache Associativity

As in the case of L1/L2 D-Cache associativity, the I-Cache associativity test (Picture 16) demonstrates only the official L1 I-Cache associativity (4). The second inflection area, which corresponds to the total associativity of the L1 I-Cache and the shared L2 I/D-Cache (36), simply does not fit this graph. Let's hope to see it in a future RMMA version.

Instruction Re-Order Buffer (I-ROB)

The Esther core demonstrates quite an interesting picture in the I-ROB test (Picture 17), which works like this: it runs one simple instruction that takes a long time to execute (a dependent load of the next element from memory, mov eax, [eax]) and, right after it, a series of very simple operations that do not depend on the previous instruction (nop). Ideally, as soon as the execution time of this combination starts to depend on the number of NOPs, the I-ROB can be considered exhausted.
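This measurement logic can be captured in a toy timing model. Purely illustrative: the load latency and buffer size below are made-up parameters, not measured values for any of these processors.

```python
# One long-latency load followed by n independent NOPs. With a reorder
# buffer of rob_size entries, NOPs beyond the buffer can no longer issue
# under the load's shadow, so total time starts growing with n.

def combo_cycles(n_nops, load_latency, rob_size, nop_throughput=1):
    """Cycles to retire one long-latency load followed by n_nops NOPs."""
    hidden = min(n_nops, rob_size)      # NOPs issued under the load's shadow
    exposed = n_nops - hidden           # NOPs that must wait for the load
    return max(load_latency, hidden * nop_throughput) + exposed

# An out-of-order core hides short NOP runs entirely, so its curve is flat;
# an in-order design (rob_size = 0) pays for every NOP immediately, which
# is the behaviour Picture 17 shows for C7/C7-M.
ooo = [combo_cycles(n, load_latency=20, rob_size=24) for n in range(8)]
in_order = [combo_cycles(n, load_latency=20, rob_size=0) for n in range(8)]
# ooo      -> [20, 20, 20, 20, 20, 20, 20, 20]
# in_order -> [20, 21, 22, 23, 24, 25, 26, 27]
```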
Picture 17. Instruction Re-Order Buffer Size

Interestingly, the execution time of this combination starts depending on the number of NOPs practically immediately. This can mean only one thing: VIA C7/C7-M processors lack an instruction re-order buffer, that is, out-of-order code execution is out of the question for these processors. However, the lack of an I-ROB fits well into the general picture of a simple CPU microarchitecture, along with the decoder and execution units described above.

TLB Characteristics
Picture 18. D-TLB size
Picture 19. D-TLB associativity

As in the case of the Nehemiah core, and unlike the earlier models, the tests of D-TLB size (Picture 18) and associativity (Picture 19) bring no surprises. The D-TLB size is indeed 128 entries (we saw this in the CPUID characteristics as well as in the L1/L2/RAM latency test), and the miss penalty (beyond its limits) is very large, about 49 cycles. The associativity level is 8; an associativity miss is accompanied by approximately the same penalty.
Picture 20. I-TLB size
Picture 21. I-TLB associativity

The above also applies to the I-TLB tests: size (Picture 20) and associativity (Picture 21). The I-TLB size is also 128 entries, and its associativity level is 8. Besides, the initial area of the I-TLB size test allows us to determine the L1 I-Cache latency, that is, the execution time of a single unconditional short-range branch: 3 cycles. The I-TLB miss penalty is more difficult to determine, because latency keeps growing with the number of memory pages walked. But in both cases (size miss and associativity miss) we can identify an initial area where latency grows to about 38-39 cycles, so the minimum I-TLB miss penalty amounts to 35-36 processor cycles.

Conclusion

In microarchitectural terms, the new Esther core of VIA C7/C7-M processors is not a breakthrough or anything cardinally new. In many respects it's just a revision of the previous Nehemiah core, which was used in the second-generation desktop VIA C3 processors and in the mobile VIA Antaur processors. The most important differences of C7/C7-M from C3/Antaur visible to the naked eye come down to support for the SSE2 and SSE3 SIMD instructions, the enlargement of the L2 Cache to 128 KB with its associativity raised to 32, and (less evident to a common user) the increase of all cache line sizes to 64 bytes. While the memory system changes are quite successful (the L1-L2 bus and L2 Cache bandwidths were increased, which in turn allowed higher memory bandwidth and justified the faster V4 FSB, a counterpart of the Pentium 4 Quad-Pumped bus), alas, we cannot say the same about the computing units of these processors. VIA C7/C7-M processors still have a mediocre decoder, which cannot effectively process prefixed instructions, all SIMD instructions in particular. To all appearances, the number of execution units was not changed either: at the very least, even the simplest ALU operations are still executed at a rate of one operation per cycle.
Thus, you can hardly expect high performance from the new VIA C7/C7-M processors. Even if we trust VIA's advertising that these processors offer the best performance-per-watt ratio of all processors, it's quite clear that this is achieved solely through low power consumption, not high performance. So the field of application of VIA processors is still limited to ultra-low-power solutions that do not require high performance.
Table 1. VIA Antaur (Nehemiah) CPUID

| CPUID function | Value | Comments |
|---|---|---|
| Processor signature | 698h | Family 6, Model 9, Stepping 8 |
| Cache/TLB descriptors | 40040120h | L1-D Cache: 64 KB, 4-way, 32-byte line |
| | 40040120h | L1-I Cache: 64 KB, 4-way, 32-byte line |
| | 00408120h | L2 Cache: 64 KB, 16-way, 32-byte line |
| | 0880h | D-TLB: 4 KB pages, 8-way, 128 entries |
| | 0880h | I-TLB: 4 KB pages, 8-way, 128 entries |
| Basic features, EDX (selected) | 0381B93Fh | Bit 23: MMX; Bit 25: SSE |
| Basic features, ECX | 00000000h | – |
| Extended features, EDX | 00000000h | – |
Table 2. VIA C7-M (Esther) CPUID

| CPUID function | Value | Comments |
|---|---|---|
| Processor signature | 6A9h | Family 6, Model 10, Stepping 9 |
| Cache/TLB descriptors | 40040140h | L1-D Cache: 64 KB, 4-way, 64-byte line |
| | 40040140h | L1-I Cache: 64 KB, 4-way, 64-byte line |
| | 0080A140h | L2 Cache: 128 KB, 32-way, 64-byte line |
| | 0880h | D-TLB: 4 KB pages, 8-way, 128 entries |
| | 0880h | I-TLB: 4 KB pages, 8-way, 128 entries |
| Basic features, EDX (selected) | A7C9BBFFh | Bit 23: MMX; Bit 25: SSE; Bit 26: SSE2; Bit 29: TM1 |
| Basic features, ECX | 00000181h | Bit 0: SSE3; Bit 7: EIST; Bit 8: TM2 |
| Extended features, EDX | 00100000h | Bit 20: NX bit |
Table 3. Average real bandwidth

| Level, access mode | VIA Antaur (Nehemiah) | VIA C7-M (Esther) |
|---|---|---|
| L1 read, MMX, bytes/cycle | 4.89 | 7.79 |
| L1 read, SSE, bytes/cycle | 7.98 | 8.00 |
| L1 write, MMX, bytes/cycle | 5.80 | 4.32 |
| L1 write, SSE, bytes/cycle | 7.08 | 5.98 |
| L2 read, MMX, bytes/cycle | 1.28 | 1.84 |
| L2 read, SSE, bytes/cycle | 1.28 | 1.99 |
| L2 write, MMX, bytes/cycle | 1.25 | 2.07 |
| L2 write, SSE, bytes/cycle | 1.28 | 2.11 |
| RAM read, MMX | 439 MB/s | 525 MB/s |
| RAM read, SSE | 433 MB/s | 564 MB/s |
| RAM write, MMX | 151 MB/s | 625 MB/s |
| RAM write, SSE | 201 MB/s | 622 MB/s |
Table 4. Maximum real memory read bandwidth, MB/s*

| Access mode | VIA Antaur (Nehemiah) | VIA C7-M (Esther) |
|---|---|---|
| Read, MMX | 439 (20.6%) | 525 (19.7%) |
| Read, SSE | 433 (20.3%) | 564 (21.1%) |
| Read, MMX, Software Prefetch | 334 (15.7%) | 911 (34.2%) |
| Read, SSE, Software Prefetch | 485 (22.7%) | 962 (36.1%) |
| Read, MMX, Block Prefetch 1 | 524 (24.6%) | 843 (31.6%) |
| Read, SSE, Block Prefetch 1 | 537 (25.2%) | 844 (31.6%) |
| Read, MMX, Block Prefetch 2 | 539 (25.3%) | 785 (29.4%) |
| Read, SSE, Block Prefetch 2 | 609 (28.6%) | 839 (31.5%) |
| Cache lines read, forward | 660 (31.0%) | 920 (34.5%) |
| Cache lines read, backward | 660 (31.0%) | 910 (34.1%) |

*values relative to the maximum possible memory bandwidth for this memory type are given in parentheses
Table 5. Maximum real memory write bandwidth, MB/s*

| Access mode | VIA Antaur (Nehemiah) | VIA C7-M (Esther) |
|---|---|---|
| Write, MMX | 151 (7.1%) | 625 (23.4%) |
| Write, SSE | 201 (9.4%) | 622 (23.3%) |
| Write, MMX, Non-Temporal | 1046 (49.1%) | 1721 (64.5%) |
| Write, SSE, Non-Temporal | 1046 (49.1%) | 1722 (64.5%) |
| Cache lines write, forward | 200.4 (9.4%) | 576 (21.6%) |
| Cache lines write, backward | 200.4 (9.4%) | 576 (21.6%) |

*values relative to the maximum possible memory bandwidth for this memory type are given in parentheses
Data Cache/Memory Average Latency
Before analyzing the latencies of each memory level, we should note that the most important change in the data and instruction caches of the new C7/C7-M processors with the Esther core is the growth of their line size from 32 to 64 bytes. Since this value appears in all latency tests, and RMMA determines it automatically when it detects an unknown processor, let's first look at the curves for the L1 and L2 data cache line sizes.
Figure 5. Determining the L1 data cache line size
The L1 data cache line size (Figure 5) thus raises no doubts: in all access modes, the maximal growth of access latency in this test (a variant of the data arrival test) occurs when a neighboring element located 64 or more bytes away from the main one is read. Hence the L1 cache line size is indeed 64 bytes.
Figure 6. Determining the L2 data cache line size
The second data arrival test, on the L2-RAM bus, also gives a clear picture, though one distorted by hardware prefetch (yes, implemented for the first time in VIA processors; we'll examine it in detail below). These curves allow the conclusion that in all cases data travels from memory into the L2 cache in full L2 cache lines, also 64 bytes long.
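The arrival tests above boil down to checking whether the main element and its neighbour fall into one cache line or two. A toy model of that check (a sketch assuming the 64-byte lines established above):

```python
def lines_touched(first, offset, line=64):
    """How many distinct cache lines are touched when reading a byte at
    address `first` and a neighbour at `first + offset`."""
    return len({first // line, (first + offset) // line})

# With the main element at a line start, neighbours up to 60 bytes away
# stay within the same line; from 64 bytes on, a second line must be
# fetched, which is exactly where the latency jump appears on the curves.
boundary = min(off for off in range(4, 132, 4) if lines_touched(0, off) == 2)
print(boundary)  # 64
```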
Now let's analyze the latency characteristics of the L1/L2 data caches and of RAM itself.
Figure 7. L1/L2 data cache and RAM latency
These curves (Figure 7) are typical of processors with an exclusive cache hierarchy. The most notable difference between Esther and Nehemiah, as mentioned above, is hardware prefetch, implemented for the first time in VIA processors, which works efficiently on forward sequential memory accesses. AMD K7 processors, for example, behave similarly.
Table 6
Cache level, access mode   Average latency, cycles
                           VIA Antaur (Nehemiah)   VIA C7-M (Esther)
L1, forward                6                       6
L1, backward               6                       6
L1, pseudo-random          –                       6
L1, random                 6                       6
L2, forward                25                      29
L2, backward               25                      29
L2, pseudo-random          –                       29
L2, random                 25                      29
RAM, forward               126 (126 ns)            181 (120 ns)
RAM, backward              126 (126 ns)            294 (194 ns)
RAM, pseudo-random*        –                       294 (194 ns)
RAM, random*               195 (195 ns)            379 (250 ns)
*4 MB block size
Quantitative estimates of the L1/L2 cache and memory latencies are given in Table 6. The L1 cache latency of the new Esther core remains at the 6-cycle level first seen in the Nehemiah core (the previous generation of VIA C3 processors on Ezra/Samuel cores used a faster L1 cache with 4-cycle latency). So VIA processors still have the highest L1 data cache access latency.
The "average" L2 data cache latency of the new Esther core (that is, the latency in the normal case, without unloading the bus) has also grown: it is 4 cycles higher than on Nehemiah. Given the major changes in the cache organization, above all the longer 64-byte lines that are now commonplace, these changes are not surprising.
The memory latency of this platform looks quite poor, even higher than on the previously reviewed VIA Antaur platform. On forward sequential accesses, hardware prefetch brings the latency down to 120 ns, but the true latency is exposed in the backward and pseudo-random access modes: as much as 194 ns. As we have stressed more than once, the further latency growth in the random access mode is caused by the insufficient D-TLB size: it holds only 128 pages, covering a mere 512 KB area.
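The 512 KB figure follows directly from the TLB geometry. A small arithmetic sketch, assuming standard 4 KB x86 pages:

```python
DTLB_ENTRIES = 128
PAGE_SIZE = 4 * 1024          # standard x86 page, bytes

coverage = DTLB_ENTRIES * PAGE_SIZE
print(coverage // 1024)       # 512 (KB walkable without D-TLB misses)

# For the 4 MB block used in the random-access walk, only an eighth of
# the pages can be mapped at any moment, so most accesses also pay a
# D-TLB miss penalty on top of the memory latency itself.
block = 4 * 1024 * 1024
print(coverage / block)       # 0.125
```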
L2 Data Cache/Memory Minimal Latency
The minimal L2 data cache latency of the VIA C7-M processor is estimated by unloading the L1-L2 bus with empty operations (Figure 8).
Figure 8. L2 data cache minimal latency
For an unknown reason (the effect is present on all VIA processors), the preliminary measurement of the execution time of a single NOP is slightly off (about 1.03 cycles instead of 1.00 in this case), so the baseline decreases as the number of NOPs inserted between cache access instructions grows. The error is relatively small in this case, though. As we can see, the minimal latency in all cases, 17 cycles, is reached with 12 NOPs inserted. The same value is observed on VIA Antaur (see Table 7) with 8 or more NOPs inserted.
Table 7
Level, access mode   Minimal latency, cycles
                     VIA Antaur (Nehemiah)   VIA C7-M (Esther)
L2, forward          17                      17
L2, backward         17                      17
L2, pseudo-random    –                       17
L2, random           17                      17
RAM, forward         126 (126 ns)            142 (94 ns)
RAM, backward        126 (126 ns)            291 (192 ns)
RAM, pseudo-random*  –                       289 (191 ns)
RAM, random*         192 (192 ns)            373 (246 ns)
*4 MB block size
We'll estimate the minimal memory latency with a similar test, raising the block size to 4 MB.
Figure 9. Minimal RAM latency
The unloading curves of the L2-RAM bus (Figure 9) resemble those of AMD K7 processors, whose hardware prefetch also works only on forward sequential walks and not very efficiently: the unloading curve descends gently and fails to reach a minimum even with 64 NOPs inserted. The minimal memory latency in this case is 94 ns. In the backward, pseudo-random, and random access modes the curves show a characteristic sawtooth with a period of 15 empty operations, a value that corresponds to the FSB frequency multiplier (100 MHz x 15 = 1.5 GHz). On the whole, the minimal latencies in these modes differ little from the average ones.
Data Cache Associativity
The L2 cache of the new Esther core has a truly distinctive feature: associativity as high as 32 ways. VIA C7/C7-M are the first processors to use such a highly associative cache (the point of which is unclear, given its mere 128 KB size).
Let's try to determine the L1/L2 data cache associativity of this processor with the standard test (Figure 10).
Figure 10. L1/L2 data cache associativity
Unfortunately, this test gives no definite answer. From its results one can infer an L1 cache associativity of 4 (the first inflection of the curves), but the picture blurs as the number of segments grows. That is no surprise: the test is limited to 32 segments (no processors with such caches existed when it was developed), whereas the second inflection, for an exclusive cache organization, should appear at a segment count equal to the total L1+L2 associativity, i.e. 36. For now we can only take it on faith that the L2 cache of C7/C7-M processors is 32-way associative. We will extend the capabilities of our test suite in the future.
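The principle behind the test can be illustrated with a toy LRU set-associative cache: addresses that map to one set stay resident only while their count does not exceed the number of ways. A minimal sketch (not the RMMA implementation):

```python
from collections import OrderedDict

def all_hits_in_steady_state(n_segments, ways):
    """Cyclically touch n_segments addresses that all map to one cache set
    of the given associativity (LRU replacement). Return True if, after
    warm-up, every access hits."""
    cache = OrderedDict()
    misses = 0
    for step in range(n_segments * 50):
        tag = step % n_segments
        if tag in cache:
            cache.move_to_end(tag)          # refresh LRU position
        else:
            if len(cache) >= ways:
                cache.popitem(last=False)   # evict the LRU entry
            cache[tag] = True
            if step >= n_segments:          # past the warm-up pass
                misses += 1
    return misses == 0

# A 4-way L1: four segments fit, a fifth makes the set thrash
# (the first inflection of the curves).
print([n for n in range(1, 7) if all_hits_in_steady_state(n, ways=4)])  # [1, 2, 3, 4]

# With an exclusive hierarchy the second inflection is expected at the
# total L1+L2 associativity, beyond the test's 32-segment limit:
print(4 + 32)  # 36
```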
Real L1-L2 Cache Bus Bandwidth
Since the L1-L2 data caches of both Antaur and C7/C7-M processors are organized exclusively, every L2 cache access that transfers data from L2 into L1 must also push a "victim" line from L1 into L2, so the real L1-L2 bus bandwidth values measured by RMMA come out doubled (Table 8).
Table 8
Access mode          Real L1-L2 bandwidth, bytes/cycle*
                     VIA Antaur (Nehemiah)   VIA C7-M (Esther)
Reading (forward)    2.56                    4.32
Reading (backward)   2.56                    4.32
Writing (forward)    2.56                    4.40
Writing (backward)   2.56                    4.40
*Taking into account the exclusive cache organization
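The doubling itself is simple bookkeeping. A sketch, under the assumption (as stated above) that the table figures are the bus-side totals including victim traffic:

```python
# Each L1 miss in an exclusive hierarchy moves two lines across the
# L1-L2 bus: the requested line (L2 -> L1) plus the evicted victim
# (L1 -> L2), so the bus-side figure is twice the useful data rate.
bus_side = 4.4                 # C7-M write figure from Table 8, bytes/cycle
useful = bus_side / 2
print(useful)                  # 2.2 bytes/cycle of requested data

# For comparison, a 64-bit bus tops out at 8 bytes per cycle one way.
print(64 // 8)                 # 8
```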
According to our earlier analyses of VIA processors, the L1-L2 data bus bandwidth of VIA Antaur was rather low, just 2.56 bytes per cycle, which might suggest a 32-bit organization. However, the arrival test in that review showed the true bus width to be 64 bits, because reading a neighboring element within a single 32-byte cache line incurred no extra latency.
In the new Esther core the L1-L2 bus is considerably faster, up to 4.4 bytes per cycle (writes slightly faster than reads, something already visible in the initial L2 cache read/write bandwidth test). Given the higher bandwidth demands of the 64-byte cache lines, that is quite good. At the same time the bus width remains 64 bits, which is not bad either. A similar configuration (a 64-bit bus with 64-byte cache lines) is also found in the AMD K7 architecture.
Let's verify the hypothesis of a 64-bit L1-L2 bus with the data arrival test (Figure 11).
Figure 11. L1-L2 bus data arrival test 1
This test reads two elements from the same cache line, the second at a fixed offset (4-60 bytes) from the first (the beginning of the cache line). It shows that within the 6 cycles of an L1 cache access only the first 48 bytes (bytes 0 to 47) arrive from the L2 cache; a request for the 48th byte causes a noticeable latency growth. This confirms a data transfer rate of 48/6 = 8 bytes per cycle, i.e. a 64-bit bus.
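The arithmetic behind that conclusion, spelled out with the figures from the test above:

```python
BYTES_IN_WINDOW = 48   # bytes that arrive within the L1 access window (Fig. 11)
L1_LATENCY = 6         # L1 data cache access latency, cycles

rate = BYTES_IN_WINDOW // L1_LATENCY
print(rate)            # 8 bytes per cycle
print(rate * 8)        # 64, i.e. a 64-bit bus
```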
In addition, the second variant of the data arrival test (Figure 12) reveals further details of the order in which data arrives from L2 into L1.
Figure 12. L1-L2 bus data arrival test 2
In this test, the offset of the first requested element from the beginning of the cache line is variable (0-60 bytes), while the second element is always offset by -4 bytes from the first (except at the starting point, where the offset is -4 + 64 = +60 bytes, since both elements must lie within the same cache line).
The curves (Figure 12) show that L2 cache data on VIA C7/C7-M processors is read in 16-byte blocks. Data can be read starting from any of the 16-byte positions (the maxima on the curves):
1) 0-15, 16-31, 32-47, 48-63
2) 16-31, 32-47, 48-63, 0-15
3) 32-47, 48-63, 0-15, 16-31
4) 48-63, 0-15, 16-31, 32-47
AMD K7/K8 architectures show a similar picture, but with 8-byte granularity. This indicates that the data cache organization of VIA processors is drawing ever closer to that of AMD processors.
Instruction Cache, Decode/Execute Efficiency
Let's first look at the decoding/execution of the simplest instructions, NOPs, whose picture differs noticeably from the typical one (Figure 13).
Figure 13. Decode/execute efficiency of NOP instructions
It shows no distinct region corresponding to the execution of instructions from the L2 cache. This does not mean that the L2 cache of this processor cannot cache code (rather than data), because in the 64-192 KB range the decode/execute speed does differ from that beyond the total L1+L2 cache size.
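The 64-192 KB range matches the exclusive organization: code can occupy the L1 instruction cache plus the L2 cache. A sketch, assuming the 64 KB L1-I size documented for this processor family:

```python
L1_I_KB = 64    # L1 instruction cache (assumed per family specs), KB
L2_KB = 128     # unified L2 cache, KB

# With an exclusive hierarchy the cacheable code footprint is the sum,
# so decode speed changes at 64 KB and again past 192 KB.
total_kb = L1_I_KB + L2_KB
print(total_kb)  # 192
```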
Meanwhile, the picture for other instructions, e.g. the 6-byte compare operation cmp eax, xxxxxxxxh (CMP 3-6), is more telling (Figure 14).
Figure 14. Decode/execute efficiency (CMP instructions)
Let's make a quantitative estimate of the decode/execute speeds with Table 9.
Table 9
Instruction type (size, bytes)   Decode/execute efficiency, bytes/cycle (instructions/cycle)
                                 VIA Antaur (Nehemiah)          VIA C7-M (Esther)
                                 L1-I cache     L2 cache        L1-I cache     L2 cache
NOP (1)                          1.00 (1.00)    1.00 (1.00)     1.00 (1.00)    ~0.84 (0.84)
SUB (2)                          2.00 (1.00)    1.14 (0.57)     2.00 (1.00)    0.93 (0.46)
XOR (2)                          2.00 (1.00)    1.14 (0.57)     2.00 (1.00)    0.93 (0.46)
TEST (2)                         2.00 (1.00)    1.14 (0.57)     2.00 (1.00)    0.93 (0.46)
XOR/ADD (2)                      2.00 (1.00)    1.14 (0.57)     2.00 (1.00)    0.93 (0.46)
CMP 1 (2)                        2.00 (1.00)    1.14 (0.57)     2.00 (1.00)    0.93 (0.46)
CMP 2 (4)                        2.00 (0.50)    1.14 (0.28)     2.00 (0.50)    0.93 (0.23)
CMP 3-6 (6)                      5.99 (1.00)    1.14 (0.19)     5.99 (1.00)    1.00 (0.16)
Prefixed CMP 1-4 (8)             2.67 (0.33)    1.14 (0.14)     2.67 (0.33)    0.96 (0.12)
As for decoding/executing the simplest ALU operations (both independent and pseudo-dependent) from the L1 cache, this mechanism has not changed since the very first VIA C3 processors (and even the earlier VIA/Centaur processors). The maximal decode/execute rate of these instructions remains one instruction per cycle, too slow by today's standards.
The decode/execute rates of the CMP 2 instruction (cmp ax, 0000h) and of prefixed CMP 1-4 instructions ([rep][addrovr]cmp eax, xxxxxxxxh) are still lower than those of the other instructions. Their execution rate drops in proportion to the number of prefixes stacked onto the main operation: CMP 2 executes at half speed (1 prefix + 1 operation), while prefixed CMP executes at one third speed (2 prefixes + 1 operation). This shows that, like all earlier VIA processors, Esther-based processors still spend execution unit resources on "executing" each prefix.
This is confirmed by the prefixed NOP decode efficiency test, which decodes/executes instructions of the form [66h]nNOP, n = 0..14 (Figure 15).
Figure 15. Decode/execute efficiency of prefixed instructions
The decode/execute efficiency of such instructions, expressed in bytes per cycle, is independent of the number of prefixes and always equals 1 byte per cycle. This means the execution time of a single instruction, in CPU cycles, really does grow linearly with the number of prefixes. Given that prefixes are not rare in x86 code (especially in the SSE/SSE2/SSE3 instructions now supported by the Esther core), this approach is quite inefficient.
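A toy model of this decoder behaviour reproduces both observations, the flat 1 byte/cycle for prefixed NOPs and the throughput figures in Table 9 (a sketch under the assumption that each prefix costs one full cycle):

```python
def decode_rate(prefixes, base_len=1):
    """Each prefix is 'executed' as if it were an instruction: an
    instruction of (base_len + prefixes) bytes takes (1 + prefixes)
    cycles. Returns (bytes/cycle, instructions/cycle)."""
    length = base_len + prefixes
    cycles = 1 + prefixes
    return length / cycles, 1 / cycles

# [66h]^n NOP (1-byte base): bytes/cycle is pinned at 1.0 for any n...
print([decode_rate(n)[0] for n in range(4)])   # [1.0, 1.0, 1.0, 1.0]

# ...while throughput degrades as 1/(1+n), matching Table 9:
print(decode_rate(1)[1])                       # 0.5  (CMP 2: one prefix)
print(round(decode_rate(2)[1], 2))             # 0.33 (prefixed CMP: two prefixes)
```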
Instruction Cache Associativity
Figure 16. Instruction cache associativity
Like the L1/L2 data cache associativity test, the instruction cache associativity test (Figure 16) shows only the official 4-way associativity of the L1 instruction cache. The second inflection region, which corresponds to the total associativity of the L1 instruction cache and the shared L2 instruction/data cache (36), does not appear on this chart. We hope a future version of RMMA will be able to reveal it.
Instruction Re-Ordering Buffer (I-ROB)
The Esther core shows a rather interesting picture in the instruction re-ordering buffer (I-ROB) test (Figure 17), which works as follows: a simple instruction with a long execution time is executed (mov eax, [eax], a load dependent on the next element of a chain), immediately followed by a series of simple operations (nop) that do not depend on the preceding instruction. In theory, the re-ordering buffer can be considered exhausted when the execution time of this combination starts to depend on the number of NOPs.
Figure 17. Instruction re-ordering buffer size
Remarkably, the execution time of this combination starts to depend on the number of NOPs almost immediately. This admits only one conclusion: VIA C7/C7-M processors lack an instruction re-ordering buffer, i.e. these processors are incapable of out-of-order execution. Then again, the absence of an I-ROB is consistent with the overall simplicity of the CPU microarchitecture, like the decoder and execution units described above.
TLB Characteristics
Figure 18. D-TLB size
Figure 19. D-TLB associativity
As on the Nehemiah core (but unlike earlier models), the D-TLB size (Figure 18) and associativity (Figure 19) tests bring no surprises. The D-TLB really has 128 entries (which we already verified both in the CPUID characteristics and in the L1/L2/RAM latency test), and its miss penalty (on exceeding that limit) is rather high, about 49 cycles. The associativity level is 8, and an associativity miss carries roughly the same penalty.
Figure 20. I-TLB size
Figure 21. I-TLB associativity
The above applies equally to the I-TLB tests, its size (Figure 20) and associativity (Figure 21). The I-TLB also has 128 entries, and its associativity is also 8. In addition, the initial region of the I-TLB size test lets us determine the L1 instruction cache latency, i.e. the execution time of a single unconditional short jump: 3 cycles. The I-TLB miss penalty is harder to determine, because the latency keeps growing with the number of memory pages accessed. Nevertheless, in both miss cases (a size miss and an associativity miss) we can identify an initial region where the latency grows to about 38-39 cycles, i.e. the minimal I-TLB miss penalty is 35-36 processor cycles.
Conclusion
From the microarchitectural point of view, the new Esther core of VIA C7/C7-M processors is no breakthrough. In many respects it is merely an improvement on the previous Nehemiah core, used in the second generation of desktop VIA C3 processors and in the mobile VIA Antaur. The most important differences of C7/C7-M from C3/Antaur are support for the SSE2 and SSE3 SIMD instruction sets, the L2 cache enlarged to 128 KB and raised to 32-way associativity, and (less visible to the average user) the growth of all cache line sizes to 64 bytes. While the memory subsystem improvements are quite successful (the L1-L2 bus and L2 cache bandwidths have grown, enabling higher memory bandwidth and supporting the faster V4 FSB, a counterpart of the Pentium 4 quad-pumped bus), the computational units of the processor fall short of that level. VIA C7/C7-M processors still use a mediocre instruction decoder that cannot handle prefixed instructions efficiently, and that includes all SIMD instructions. Judging by the available evidence, the number of execution units has not grown either: even the simplest ALU operations still execute at one instruction per cycle. So no outstanding performance should be expected from the new C7/C7-M processors. Even if we take on faith VIA's claim that its processors offer the industry's best performance per watt, that clearly stems entirely from low power consumption rather than from high performance. The application area of VIA processors thus remains confined to ultra-low-power solutions that cannot offer high performance.