GTPin
The Itrace tool generates a dynamic control-flow trace of kernel execution.
The trace is provided for each kernel, for each Draw/Enqueue command, and for each separate HW thread.
The Itrace tool (like all GTPin tracing tools) works in two phases, pre-processing (phase 1) and trace gathering (phase 2), which must be run separately.
To run the pre-processing phase of the itrace tool (in its default configuration) use the following command:
Profilers/Bin/gtpin -t itrace --phase 1 -- app
NOTE: You can run this phase only once per application.
To run the trace-gathering phase of the itrace tool (in its default configuration), use the following command:
Profilers/Bin/gtpin -t itrace --phase 2 -- app
When you run the Itrace tool in its default configuration for pre-processing (phase 1), GTPin generates the directory GTPIN_PROFILE_ITRACE0. In addition, the following two files are created in the current directory:
itrace_pre_process.txt: Describes the trace buffer required for each kernel dispatched for execution on the device. The file contains the maximum number of trace records generated by each kernel, taken over all Draw/Enqueue commands in which the kernel executed on the device. For example, if a kernel generates between 5 and 15 records per execution, the buffer allocated for that kernel must be large enough to hold 15 records. This file is an input to the trace-gathering phase. It has the following format:
VS_asm7cf5b819f3a88d8e_simd8___VS_asm7cf5b819f3a88d8e_simd8_10 2048
CS_asm3f2b2787381d7600_simd8___CS_asm3f2b2787381d7600_simd8_28 2048
VS_asmfab317bc037b0584_simd8___VS_asmfab317bc037b0584_simd8_30 4326
PS_asm967d2b3a9b5e0245_simd8___PS_asm967d2b3a9b5e0245_simd8_35 17888
PS_asm967d2b3a9b5e0245_simd16___PS_asm967d2b3a9b5e0245_simd16_37 95844
VS_asmfab317bc037b0584_simd8___VS_asmfab317bc037b0584_simd8_40 4326
where the numerical field indicates the number of required trace records, and the string field indicates a kernel.
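If you want to inspect the pre-processing output yourself, the format above is simple enough to read with a few lines of C++. The following is a minimal sketch, assuming that kernel names never contain whitespace (as in the sample); ParsePreProcess and ParsePreProcessText are hypothetical helper names, not part of GTPin.

```cpp
#include <cassert>
#include <cstdint>
#include <istream>
#include <map>
#include <sstream>
#include <string>

// Hypothetical helper (not a GTPin API): read whitespace-separated
// "<kernel name> <required records>" pairs from a stream.
// Assumption: kernel names contain no whitespace, as in the sample above.
std::map<std::string, std::uint64_t> ParsePreProcess(std::istream& in)
{
    std::map<std::string, std::uint64_t> required;
    std::string kernel;
    std::uint64_t count = 0;
    while (in >> kernel >> count)
    {
        required[kernel] = count;
    }
    return required;
}

// Convenience wrapper for text that is already in memory.
std::map<std::string, std::uint64_t> ParsePreProcessText(const std::string& text)
{
    std::istringstream in(text);
    return ParsePreProcess(in);
}
```

To read the real file, pass a std::ifstream opened on itrace_pre_process.txt to ParsePreProcess.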
itrace_pre_process_sw_threads.txt: Contains information about the SW threads created by each kernel when the kernel is executed on the device. Each line contains the name of the kernel executed on the device; the number of SW threads generated by that kernel's execution of a Draw or Enqueue command; the ID of the Draw or Enqueue command; and some other metadata. This file contains informational data only, and has the following format:
VS_asmfab317bc037b0584_simd8___VS_asmfab317bc037b0584_simd8_30 101 DX12 0 34 318 3 0
VS_asmfab317bc037b0584_simd8___VS_asmfab317bc037b0584_simd8_30 99 DX12 0 34 319 3 0
VS_asmfab317bc037b0584_simd8___VS_asmfab317bc037b0584_simd8_30 28 DX12 0 34 320 3 0
VS_asmfab317bc037b0584_simd8___VS_asmfab317bc037b0584_simd8_30 38 DX12 0 34 321 3 0
VS_asmfab317bc037b0584_simd8___VS_asmfab317bc037b0584_simd8_30 76 DX12 0 34 322 3 0
VS_asmfab317bc037b0584_simd8___VS_asmfab317bc037b0584_simd8_30 22 DX12 0 34 323 3 0
VS_asmfab317bc037b0584_simd8___VS_asmfab317bc037b0584_simd8_30 13 DX12 0 34 324 3 0
VS_asmfab317bc037b0584_simd8___VS_asmfab317bc037b0584_simd8_30 104 DX12 0 34 325 3 0
VS_asmfab317bc037b0584_simd8___VS_asmfab317bc037b0584_simd8_30 97 DX12 0 34 326 3 0
where each line corresponds to a single kernel for a single Draw/Enqueue command. Each line starts with the kernel name and also carries the number of SW threads, the ID of the Draw/Enqueue command, and additional metadata.
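As an illustration of how this informational file might be consumed, the sketch below totals the SW threads per kernel over all Draw/Enqueue commands. It rests on an assumption that is not confirmed by the text: that the second field on each line is the SW thread count. SumSwThreads is a hypothetical helper name, and the remaining fields are ignored.

```cpp
#include <cassert>
#include <cstdint>
#include <map>
#include <sstream>
#include <string>

// Hypothetical helper (not a GTPin API): total SW threads per kernel.
// Assumption: field 1 is the kernel name and field 2 is the SW thread count;
// all later fields (API, command identifiers, metadata) are skipped.
std::map<std::string, std::uint64_t> SumSwThreads(const std::string& text)
{
    std::map<std::string, std::uint64_t> totals;
    std::istringstream lines(text);
    std::string line;
    while (std::getline(lines, line))
    {
        std::istringstream fields(line);
        std::string kernel;
        std::uint64_t swThreads = 0;
        if (fields >> kernel >> swThreads)
        {
            totals[kernel] += swThreads; // one line per Draw/Enqueue command
        }
    }
    return totals;
}
```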
When the Itrace tool is run for trace gathering (phase 2), the tool generates a directory: GTPIN_PROFILE_ITRACE1. GTPin saves the profiling results within the folder: GTPIN_PROFILE_ITRACE1\Session_Final. The traces for each kernel are saved in a separate sub-folder that has the same name as the kernel. Each Draw/Enqueue command has a separate trace, which is saved in a corresponding sub-directory, as shown in the following screenshot:
Each trace is saved in a compressed binary format, in a file called itrace_compressed.bin, as shown above. To uncompress the trace, run the Python* script Profilers\Scripts\uncompress_itrace.py as follows (Python 3.5 or above is required):
python3 Profilers\Scripts\uncompress_itrace.py --input_dir GTPIN_PROFILE_ITRACE1\Session_Final\PS_asm967d2b3a9b5e0245_simd16\device_0__cl_34__draw_0__pso_3 --gen 9
Running this script unpacks the compressed trace into a separate trace for each HW thread, as shown in the following screenshot:
where the trace generated on each HW thread is saved in a text file with a name such as itrace___s_0_ss_0_eu_1_tid_5.out. The file name encodes the HW thread topology ID (Slice (S), DualSubSlice (DSS), SubSlice (SS), Execution Unit (EU), and HW thread ID (TID)). The resulting trace is provided in the following format:
BBL ID INS OFFSET
===============================
0 0x00
0 0xc8
16 0x9c0
16 0xa78
17 0xa88
17 0xae8
19 0xb58
20 0xbd8
21 0xbe8
22 0xcc8
23 0xcd8
24 0xdb8
25 0xdc8
26 0xdd0
--- EOT ---
0 0x00
0 0xc8
16 0x9c0
16 0xa78
17 0xa88
17 0xae8
19 0xb58
20 0xbd8
21 0xbe8
22 0xcc8
23 0xcd8
24 0xdb8
25 0xdc8
26 0xdd0
--- EOT ---
The left column indicates the ID of the basic block (BBL) to which control flows. The right column indicates the offset of the first instruction of this BBL. If a BBL consists of multiple instructions and the last one is a control-flow instruction, two lines are provided for the same BBL: one for its first instruction and one for its last instruction. If the last instruction of a BBL is a control-flow instruction that passes control to the first instruction of the same basic block (a single-BBL loop), the number of sequential repetitions of this BBL within the control flow is indicated.
An EOT indication separates consecutive dispatches of different SW threads of the kernel on the same HW thread.
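The uncompressed per-thread text format can be post-processed with ordinary tools. The following is a minimal sketch that splits one .out file into per-SW-thread traces at each EOT marker. TraceEntry and SplitTrace are hypothetical names; the sketch assumes every data line begins with a decimal BBL ID followed by a hexadecimal offset, and it ignores any further text on the line (for example, a repetition count).

```cpp
#include <cassert>
#include <cstdint>
#include <cstdlib>
#include <sstream>
#include <string>
#include <vector>

// Hypothetical types and helper; not part of GTPin.
struct TraceEntry
{
    std::uint32_t bblId;  // left column: basic block ID
    std::uint32_t offset; // right column: instruction offset
};
using SwThreadTrace = std::vector<TraceEntry>;

// Split the text of one HW-thread trace into one trace per SW thread.
std::vector<SwThreadTrace> SplitTrace(const std::string& text)
{
    std::vector<SwThreadTrace> threads;
    SwThreadTrace current;
    std::istringstream lines(text);
    std::string line;
    while (std::getline(lines, line))
    {
        if (line.find("EOT") != std::string::npos)
        {
            threads.push_back(current); // EOT closes the current SW thread
            current.clear();
            continue;
        }
        std::istringstream fields(line);
        std::uint32_t bblId = 0;
        std::string offsetText;
        // Header and separator lines fail this extraction and are skipped.
        if (fields >> bblId >> offsetText)
        {
            unsigned long offset = std::strtoul(offsetText.c_str(), nullptr, 16);
            current.push_back({bblId, static_cast<std::uint32_t>(offset)});
        }
    }
    if (!current.empty())
    {
        threads.push_back(current); // tolerate a missing final EOT
    }
    return threads;
}
```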
To map the basic block IDs and instruction offsets to the kernel code, look in the Session_Final\ISA sub-folder, where the GEN assembly of all kernels is saved.
In addition to the trace, GTPin dumps the control-flow graph (CFG) of the kernel into a text file called itrace_total.cfg. This file represents the CFG as a list of its edges, along with the frequency of the control-flow transition on each edge, accumulated over the traces of all HW threads. The itrace_total.cfg file has the following format:
srcBBL, dstBBL, Frequency
=========================
0,1,0
0,16,262144
1,2,0
1,9,0
2,3,0
3,4,0
3,6,0
3,7,0
4,5,0
5,6,0
5,7,0
6,7,0
7,8,0
8,9,0
8,15,0
9,10,0
10,11,0
10,13,0
10,14,0
11,12,0
12,13,0
12,14,0
13,14,0
14,15,0
15,16,0
15,25,0
16,17,262144
16,18,0
17,18,0
17,19,262144
18,19,0
19,20,262144
20,21,262144
20,23,0
20,24,0
21,22,262144
22,23,262144
22,24,0
23,24,262144
24,25,262144
25,26,262144
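where each line represents a separate CFG edge and contains the following values: the source BBL ID, the destination BBL ID, and the frequency with which the edge between the source and destination BBLs was visited.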
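To find the path a kernel actually took, you can filter the itrace_total.cfg edge list down to the edges with a non-zero frequency. The sketch below shows one way to do this. CfgEdge, ParseCfg and ExecutedEdges are hypothetical names, and the parser assumes the comma-separated layout of the sample file.

```cpp
#include <cassert>
#include <cstdint>
#include <sstream>
#include <string>
#include <vector>

// Hypothetical types and helpers; not part of GTPin.
struct CfgEdge
{
    std::uint32_t src;  // source BBL ID
    std::uint32_t dst;  // destination BBL ID
    std::uint64_t freq; // accumulated transition count
};

// Parse "src,dst,freq" lines; header and separator lines are skipped
// because they do not start with a number.
std::vector<CfgEdge> ParseCfg(const std::string& text)
{
    std::vector<CfgEdge> edges;
    std::istringstream lines(text);
    std::string line;
    while (std::getline(lines, line))
    {
        std::istringstream fields(line);
        CfgEdge e{};
        char comma1 = 0;
        char comma2 = 0;
        if ((fields >> e.src >> comma1 >> e.dst >> comma2 >> e.freq) &&
            comma1 == ',' && comma2 == ',')
        {
            edges.push_back(e);
        }
    }
    return edges;
}

// Keep only the edges that were taken at least once.
std::vector<CfgEdge> ExecutedEdges(const std::vector<CfgEdge>& edges)
{
    std::vector<CfgEdge> taken;
    for (const CfgEdge& e : edges)
    {
        if (e.freq != 0)
        {
            taken.push_back(e);
        }
    }
    return taken;
}
```

Applied to the sample file, this would leave the chain of edges whose frequency is 262144 and drop every edge with frequency 0.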
/*========================== begin_copyright_notice ============================
Copyright (C) 2018-2023 Intel Corporation

SPDX-License-Identifier: MIT
============================= end_copyright_notice ===========================*/

/*!
 * @file Itrace tool definitions
 */

#ifndef ITRACE_H_
#define ITRACE_H_

#include <list>
#include <map>
#include <vector>
#include <set>

#include "gtpin_api.h"
#include "kernel_weight.h"

using namespace gtpin;

/* ============================================================================================= */
// Struct ItraceRecord
/* ============================================================================================= */
/*!
 * Structure of the trace record header.
 * The header details architectural state during execution of a BBL.
 */
struct ItraceRecord
{
    uint16_t bblId;  ///< BBL identifier
    uint16_t sr0;    ///< LSB-16 of the State register sr0.0:ud
    uint32_t tileId; ///< TileId
};

/* ============================================================================================= */
// Class ItraceDispatch
/* ============================================================================================= */
/*!
 * Class that holds trace collected during a single kernel dispatch
 */
class ItraceDispatch
{
public:
    /// Construct a ItraceDispatch object with the empty trace
    explicit ItraceDispatch(const IGtKernelDispatch& dispatch) : _isTrimmed(false) { dispatch.GetExecDescriptor(_kernelExecDesc); }

    /// Read the entire trace from the specified profile buffer into this object
    bool ReadTrace(const GtProfileTrace& traceAccessor, const IGtProfileBuffer& profileBuffer);

    const GtKernelExecDesc& KernelExecDesc() const { return _kernelExecDesc; }            ///< @return Descriptor of this kernel dispatch
    uint32_t                Size()           const { return (uint32_t)_rawTrace.size(); } ///< @return Trace size in bytes
    const uint8_t*          Data()           const { return _rawTrace.data(); }           ///< @return Trace data collected in this dispatch
    uint8_t*                Data()                 { return _rawTrace.data(); }           ///< @return Trace data collected in this dispatch
    bool                    IsEmpty()        const;                                       ///< @return true if the trace is empty
    bool                    IsTrimmed()      const { return _isTrimmed; }                 ///< @return true if the trace has been trimmed

private:
    GtKernelExecDesc     _kernelExecDesc; ///< Kernel execution descriptor
    std::vector<uint8_t> _rawTrace;       ///< Trace data collected in this kernel dispatch
    bool                 _isTrimmed;      ///< true if the trace has been trimmed to avoid buffer overflow
};

/* ============================================================================================= */
// Class ItraceKernel
/* ============================================================================================= */
/*!
 * Class that contains
 * - Static information about basic block offsets in the kernel
 * - Collection of instruction traces recorded by kernel dispatches
 */
class ItraceKernel
{
public:
    ItraceKernel() = default;

    /// Construct a ItraceKernel object intended to hold traces of the specified kernel
    explicit ItraceKernel(const IGtKernelInstrument& kernelInstrument, uint32_t numTiles);

    /*!
     * Read a trace recorded by the specified kernel dispatch. Create and add the corresponding ItraceDispatch
     * instance to this object
     */
    ItraceDispatch& AddItrace(IGtKernelDispatch& kernelDispatch);

    std::string           Name()          const { return _name; }               ///< @return Kernel's name
    std::string           ExtendedName()  const { return _extName; }            ///< @return Kernel's extended name
    std::string           UniqueName()    const { return _uniqueName; }         ///< @return Kernel's unique name
    const GtGpuPlatform   Platform()      const { return _platform; }           ///< @return Kernel's platform
    const IGtGenModel&    GenModel()      const { return GetGenModel(_genId); } ///< @return Kernel's GEN model
    const GtProfileTrace& TraceAccessor() const { return _traceAccessor; }      ///< @return Trace accessor
    void                  DumpAsm()       const;                                ///< Dump kernel's assembly text to file

    /// @return true, if tracing of this kernel is enabled
    uint32_t IsEnabled() const { return (_traceAccessor.MaxTraceSize() != 0); }

    /// @return Control-Flow Graph edges
    typedef std::pair<BblId, BblId> Edge;
    typedef std::set<Edge> Edges;
    const Edges& GetEdges() const { return _edges; }

    /// @return Traces collected in kernel's dispatches
    typedef std::list<ItraceDispatch> Traces;
    const Traces& GetTraces() const { return _traces; }

    typedef std::pair<ImgOffset, ImgOffset> BblBounds;    ///< Basic block's head and tail offsets
    typedef std::map<BblId, BblBounds>      BblBoundsMap; ///< Basic Block Id to Offsets map
    /// @return Basic Block to Offset map
    const BblBoundsMap& GetBblBounds() const { return _bblBoundsMap; }

    /// @return Number of tiles
    uint32_t NumTiles() const { return _numTiles; }

private:
    std::string    _name;          ///< Kernel's name
    std::string    _uniqueName;    ///< Kernel's unique name
    std::string    _extName;       ///< Kernel's extended name
    GtGpuPlatform  _platform;      ///< Kernel's platform
    GtGenModelId   _genId;         ///< Identifier of the GEN model, the kernel is compiled for
    std::string    _asmText;       ///< Kernel's assembly text
    Edges          _edges;         ///< Kernel's set of control-flow graph's edges
    GtProfileTrace _traceAccessor; ///< Trace accessor
    Traces         _traces;        ///< Traces collected in kernel's dispatches
    BblBoundsMap   _bblBoundsMap;  ///< Basic Block to BBlBounds map
    uint32_t       _numTiles;      ///< The number of supported tiles
};

/* ============================================================================================= */
// Class Itrace
/* ============================================================================================= */
/*!
 * Implementation of the IGtTool interface for the Itrace tool
 */
class Itrace : public GtTool
{
public:
    /// Implementation of the IGtTool interface
    const char* Name() const { return "Itrace"; }
    uint32_t ApiVersion() const { return GTPIN_API_VERSION; }

    void OnKernelBuild(IGtKernelInstrument& instrumentor);
    void OnKernelRun(IGtKernelDispatch& dispatcher);
    void OnKernelComplete(IGtKernelDispatch& dispatcher);

public:
    static void OnFini(); ///< Callback function registered with atexit()

    static Itrace* Instance(); ///< @return Single instance of this class

private:
    Itrace() {}
    Itrace(const Itrace&) = delete;
    Itrace& operator = (const Itrace&) = delete;
    ~Itrace() = default;

    /*!
     * Generate instrumentation for the specified basic block
     * @param[in] instrumentor     Interface of the kernel being instrumented
     * @param[in] bbl              Basic block to be instrumented
     * @param[in] ItraceKernel     Object that holds information about basic blocks in the kernel
     * @return success/failure status
     */
    bool InstrumentBbl(IGtKernelInstrument& instrumentor, const IGtBbl& bbl, const ItraceKernel& ItraceKernel);

    /*!
     * Generate procedure that allocates space for the new trace record in the trace and stores the record
     * for the specified basic block. The procedure sets _offsetReg equal to the offset of the location within the
     * profile buffer immediately following the record header.
     * If new record can not be allocated due to the trace capacity limitations, the procedure zeroes _offsetReg.
     *
     * @param[in, out] proc             Procedure, the generated code is appended to
     * @param[in]      coder            GEN code generator
     * @param[in]      bbl              Basic block being instrumented
     * @param[in]      ItraceKernel     Object that holds information about basic blocks in the kernel
     * @param[in]      recordSize       Size of the new record, in bytes
     */
    void StoreRecord(GtGenProcedure& proc, const IGtGenCoder& coder, const IGtBbl& bbl,
                     const ItraceKernel& ItraceKernel, uint32_t recordSize);

private:
    std::map<GtKernelId, ItraceKernel> _kernels; ///< Collection of traces per kernel

    GtReg _addrReg;   ///< Virtual register that holds address within profile buffer
    GtReg _dataReg;   ///< Virtual register that holds data to be read from/written to profile buffer
    GtReg _offsetReg; ///< Virtual register that holds offset within the trace
    GtReg _tileIdReg; ///< Virtual register that holds tile ID
};

/* ============================================================================================= */
// Class ItracePreProcessor
/* ============================================================================================= */
/*!
 * Class that computes per-kernel trace sizes in the preprocessing phase, and provides access to
 * this data in the trace gathering phase
 */
class ItracePreProcessor : public KernelWeight
{
public:
    uint64_t TraceSize(const std::string& extKernelName) const; ///< Given extended kernel name, return the trace size in bytes
    static ItracePreProcessor* Instance();                      ///< @return Single instance of this class
    static void OnFini();                                       ///< Callback function registered with atexit()

private:
    ItracePreProcessor();
    ItracePreProcessor(const ItracePreProcessor&) = delete;
    ItracePreProcessor& operator = (const ItracePreProcessor&) = delete;

private:
    /// KernelWeight interface overrides (@see description of KernelWeight functions)
    uint32_t GetBblWeight(IGtKernelInstrument& kernelInstrument, const IGtBbl& bbl) const;
    void AggregateDispatchCounters(KernelWeightCounters& kc, KernelWeightCounters dc) const;

private:
    KernelWeightProfileData _kernelCounters;             ///< Per-kernel counters of required trace records; collected in preprocessing phase
    static const char*      _kernelPreProcessFileName;   ///< Name of the file that contains preprocessing data per kernel
    static const char*      _dispatchPreProcessFileName; ///< Name of the file that contains preprocessing data per kernel dispatch
};

/* ============================================================================================= */
// Class ItracePostProcessor
/* ============================================================================================= */
/*!
 * Function object that processes kernel traces - stores them in files within the profile directory:
 *
 *   kernel_name
 *     |
 *     |- kernel_dispatch_1
 *          |- itrace_compressed.bin
 *          |- itrace_total.cfg
 *     |- kernel_dispatch_2
 *          |- itrace_compressed.bin
 *          |- itrace_total.cfg
 * The .bin trace files can be uncompressed by the uncompress_itrace.exe utility.
 *
 * Format of .bin trace files:
 * - Static information:
 *   - Number of BBLs in the kernel
 *   - For each BBL:
 *     - BBL ID
 *     - Offset:
 *         value = 0xFFFFXXXX for BBLs which do not end with control flow instruction (XXXX == offset of the BBL within the original binary)
 *         value = 0xYYYYXXXX for BBLs which end with control flow instruction (XXXX == offset of the BBL within the original binary, YYYY == the offset of the control flow instruction)
 * - Dynamic trace data:
 *   - Number of HW threads in which the trace was collected
 *   - For each HW thread:
 *     - HW Thread ID (in the format of sr0.0)
 *     - Number of records collected for this HW thread
 *     - All the records collected for this HW thread
 *
 * The format of *.cfg files:
 * Each line related to a separate edge of the control-flow graph:
 * - source BBL ID, destination BBL ID, frequency of visiting the edge between source and destination BBLs.
 */
class ItracePostProcessor
{
public:
    /// Construct a ItracePostProcessor object for the specified collection of kernel traces
    ItracePostProcessor(const IGtCore& gtpinCore, const ItraceKernel& ItraceKernel);

    /// Process all kernel traces associated with this object - store them in files within the profile directory
    bool operator()();

private:

    /// Process the specified trace
    void ProcessTrace(const ItraceDispatch& trace);

    /// Store the processed trace in the specified file stream
    void StoreTrace(std::ofstream& fs);

    /// Store static information about BBL offsets in the kernel in the specified file stream
    void StoreBblBoundsInfo(std::ofstream& fs);

    /// Store Global Thread Identifier
    void StoreGlobalTid(uint32_t gtid, std::ofstream& fs);

    /// Store the specified value in the specified file stream in the binary format
    template <typename T> void Store(const T& val, std::ofstream& fs) { fs.write((const char*)&val, sizeof(val)); }

    /// Store kernel's Control Flow Graph
    void StoreCfg(std::ofstream& fs);

private:

    using TraceRecord         = std::pair<uint32_t, uint32_t>; ///< Trace record
    using TraceRecordList     = std::list<TraceRecord>;        ///< List of ItraceRecords per single HW thread
    using PerTileTraceRecords = std::vector<TraceRecordList>;  ///< Per tile Itrace records
    typedef std::map<ItraceKernel::Edge, uint32_t> ItraceCfg;  ///< Weightened Control-flow graph

private:
    const ItraceKernel*              _kernel;             ///< Kernel&traces to be processed
    std::string                      _kernelDir;          ///< Directory to store kernel's trace files
    std::vector<PerTileTraceRecords> _itraceRecords;      ///< Map of tile ID to list of BBL trace with their frequencies, indexed by the thread ID
    std::vector<uint32_t>            _numProfiledThreads; ///< Number of profiled (active) threads per tile
    ItraceCfg                        _cfg;                ///< Control-flow graph
    static const char*               _traceFileName;      ///< Name of the file to store trace in
};

#endif
00001 /*========================== begin_copyright_notice ============================ 00002 Copyright (C) 2018-2025 Intel Corporation 00003 00004 SPDX-License-Identifier: MIT 00005 ============================= end_copyright_notice ===========================*/ 00006 00007 /*! 00008 * @file Implementation of the Itrace tool 00009 */ 00010 00011 #include <fstream> 00012 00013 #include "itrace.h" 00014 #include "gtpin_tool_utils.h" 00015 00016 using namespace gtpin; 00017 using namespace std; 00018 00019 /* ============================================================================================= */ 00020 // Configuration 00021 /* ============================================================================================= */ 00022 Knob<int> gKnobMaxTraceBufferInMB("max_buffer_mb", 3072, "itrace - the max allowed size of the trace buffer per kernel in MB"); 00023 Knob<int> gKnobPhase("phase", 0, "tracing tool - processing phase\n { 1 - pre-processing, 2 - processing - trace gathering}"); 00024 Knob<bool> gKnobCfgOnly("cfg-only", false, "indicates that the collected trace should not be saved - only resulting cfg file"); 00025 00026 /* ============================================================================================= */ 00027 // ItraceDispatch implementation 00028 /* ============================================================================================= */ 00029 bool ItraceDispatch::ReadTrace(const GtProfileTrace& traceAccessor, const IGtProfileBuffer& profileBuffer) 00030 { 00031 uint32_t traceSize = traceAccessor.Size(profileBuffer); 00032 _rawTrace.resize(traceSize); 00033 _isTrimmed = traceAccessor.IsTruncated(profileBuffer); 00034 return traceAccessor.Read(profileBuffer, _rawTrace.data(), 0, traceSize); 00035 } 00036 00037 bool ItraceDispatch::IsEmpty() const 00038 { 00039 return _rawTrace.size() < sizeof(ItraceRecord); 00040 } 00041 00042 /* ============================================================================================= */ 00043 
// ItraceKernel implementation 00044 /* ============================================================================================= */ 00045 ItraceKernel::ItraceKernel(const IGtKernelInstrument& kernelInstrument, uint32_t numTiles) : _numTiles(numTiles) 00046 { 00047 const IGtKernel& kernel = kernelInstrument.Kernel(); 00048 const IGtCfg& cfg = kernelInstrument.Cfg(); 00049 00050 _name = GlueString(kernel.Name()); 00051 _extName = ExtendedKernelName(kernel); 00052 _platform = kernel.GpuPlatform(); 00053 _genId = kernel.GenModel().Id(); 00054 _asmText = CfgAsmText(cfg); 00055 _uniqueName = kernel.UniqueName(); 00056 00057 // Initialize trace accessor. The trace capacity is expected to be computed during the preprocessing phase. 00058 uint64_t traceCapacity = ItracePreProcessor::Instance()->TraceSize(_extName); 00059 if (traceCapacity == 0) 00060 { 00061 // Unknown trace capacity 00062 GTPIN_WARNING("ITRACE: unknown trace capacity for kernel " + _name + ". Assuming the kernel is filtered out. " 00063 "Allocating a buffer of 8KB size. If the kernel is supposed to run, expect buffer overflow. " 00064 "In this case, please re-run phase 1 and make sure the kernel is not filtered out."); 00065 traceCapacity = 0x2000; 00066 } 00067 else 00068 { 00069 traceCapacity += 0x2000; // Add some space to account for possible fluctuation of trace sizes between phases 00070 if (traceCapacity > UINT32_MAX) 00071 { 00072 GTPIN_WARNING("ITRACE: The kernel " + _name + " exceedeed maximum trace capacity."); 00073 traceCapacity = UINT32_MAX; 00074 } 00075 } 00076 if (traceCapacity > (uint64_t(gKnobMaxTraceBufferInMB) * 0x100000)) 00077 { 00078 GTPIN_WARNING("ITRACE: required capacity (" + DecStr(traceCapacity) + ") for kernel " + _name + " is too big - cut to " + DecStr(gKnobMaxTraceBufferInMB) + "MB. 
" 00079 "Expect the final trace to contain partial data."); 00080 traceCapacity = uint64_t(gKnobMaxTraceBufferInMB) * 0x100000; 00081 } 00082 uint32_t maxRecordSize = sizeof(ItraceRecord); 00083 _traceAccessor = GtProfileTrace((uint32_t)traceCapacity, maxRecordSize); 00084 _traceAccessor.Allocate(kernelInstrument.ProfileBufferAllocator()); 00085 // Fill basic block offsets info 00086 for (auto bblPtr : cfg.Bbls()) 00087 { 00088 const IGtBbl& bbl = *bblPtr; 00089 BblId bblId = bbl.Id(); 00090 const IGtIns& insHead = bbl.FirstIns(); 00091 const IGtIns& insTail = bbl.LastIns(); 00092 uint32_t offsetHead = cfg.GetInstructionOffset(insHead); 00093 uint32_t offsetTail = insTail.IsChangingIP() ? uint32_t(cfg.GetInstructionOffset(insTail)) : 0xFFFFFFFF; 00094 _bblBoundsMap[bblId] = BblBounds(offsetHead, offsetTail); 00095 const EdgeSpan& outgoingEdges = bbl.OutgoingEdges(); 00096 for (auto outEdge : outgoingEdges) 00097 { 00098 const IGtBbl& dstBbl = outEdge->DstBbl(); 00099 uint32_t dstBblId = dstBbl.Id(); 00100 _edges.emplace(bblId, dstBblId); 00101 } 00102 } 00103 } 00104 00105 ItraceDispatch& ItraceKernel::AddItrace(IGtKernelDispatch& kernelDispatch) 00106 { 00107 // Create a new ItraceDispatch object and store the entire trace within this object 00108 _traces.emplace_back(kernelDispatch); 00109 ItraceDispatch& ItraceDispatch = _traces.back(); 00110 if (!ItraceDispatch.ReadTrace(_traceAccessor, *kernelDispatch.GetProfileBuffer())) 00111 { 00112 GTPIN_ERROR_MSG("ITRACE: Failed to read profile buffer for kernel " + _name); 00113 } 00114 return ItraceDispatch; 00115 } 00116 00117 void ItraceKernel::DumpAsm() const 00118 { 00119 DumpKernelAsmText(_name, _uniqueName, _asmText); 00120 } 00121 00122 /* ============================================================================================= */ 00123 // Itrace implementation 00124 /* ============================================================================================= */ 00125 Itrace* Itrace::Instance() 00126 { 
00127 static Itrace instance; 00128 return &instance; 00129 } 00130 00131 void Itrace::OnKernelBuild(IGtKernelInstrument& instrumentor) 00132 { 00133 const IGtKernel& kernel = instrumentor.Kernel(); 00134 uint32_t numTiles = (instrumentor.Coder().IsTileIdSupported()) ? GTPin_GetCore()->GenArch().MaxTiles(kernel.GpuPlatform()) : 1; 00135 00136 // Create new KernelData object and add it to the data base 00137 auto ret = _kernels.emplace(piecewise_construct, forward_as_tuple(kernel.Id()), forward_as_tuple(instrumentor, numTiles)); 00138 if (ret.second) 00139 { 00140 ItraceKernel& ItraceKernel = (*ret.first).second; 00141 if (!ItraceKernel.IsEnabled()) 00142 { 00143 GTPIN_WARNING("ITRACE: The trace won't be generated for kernel " + ItraceKernel.Name()); 00144 return; 00145 } 00146 00147 const IGtCfg& cfg = instrumentor.Cfg(); 00148 IGtVregFactory& vregs = instrumentor.Coder().VregFactory(); 00149 00150 // Initialize virtual registers 00151 _addrReg = vregs.MakeMsgAddrScratch(); 00152 _dataReg = vregs.MakeMsgDataScratch(); 00153 _offsetReg = vregs.MakeScratch(VREG_TYPE_DWORD); 00154 _tileIdReg = vregs.Make(VREG_TYPE_DWORD); 00155 00156 GtGenProcedure preCode; 00157 instrumentor.Coder().LoadTileId(preCode, _tileIdReg); 00158 00159 // Instrument kernel entries 00160 instrumentor.InstrumentEntries(preCode); 00161 00162 // Instrument basic blocks 00163 for (auto bblPtr : cfg.Bbls()) 00164 { 00165 InstrumentBbl(instrumentor, *bblPtr, ItraceKernel); 00166 } 00167 } 00168 } 00169 00170 void Itrace::OnKernelRun(IGtKernelDispatch& dispatcher) 00171 { 00172 bool isProfileEnabled = false; 00173 00174 const IGtKernel& kernel = dispatcher.Kernel(); 00175 GtKernelExecDesc execDesc; dispatcher.GetExecDescriptor(execDesc); 00176 if (kernel.IsInstrumented() && IsKernelExecProfileEnabled(execDesc, kernel.GpuPlatform(), kernel.Name().Get())) 00177 { 00178 auto it = _kernels.find(kernel.Id()); 00179 if (it != _kernels.end()) 00180 { 00181 const ItraceKernel& ItraceKernel = it->second; 
00182 if (ItraceKernel.IsEnabled()) 00183 { 00184 IGtProfileBuffer* buffer = dispatcher.CreateProfileBuffer(); GTPIN_ASSERT(buffer); 00185 const GtProfileTrace& traceAccessor = ItraceKernel.TraceAccessor(); 00186 if (traceAccessor.Initialize(*buffer)) 00187 { 00188 isProfileEnabled = true; 00189 } 00190 else 00191 { 00192 GTPIN_ERROR_MSG("ITRACE: Failed to write into memory buffer for kernel " + string(kernel.Name())); 00193 } 00194 } 00195 } 00196 } 00197 dispatcher.SetProfilingMode(isProfileEnabled); 00198 } 00199 00200 void Itrace::OnKernelComplete(IGtKernelDispatch& dispatcher) 00201 { 00202 if (!dispatcher.IsProfilingEnabled()) 00203 { 00204 return; // Do nothing with unprofiled kernel dispatches 00205 } 00206 00207 const IGtKernel& kernel = dispatcher.Kernel(); 00208 auto it = _kernels.find(kernel.Id()); 00209 if (it != _kernels.end()) 00210 { 00211 // Read the trace from the profile buffer 00212 ItraceKernel& ItraceKernel = it->second; 00213 ItraceKernel.AddItrace(dispatcher); 00214 } 00215 } 00216 00217 bool Itrace::InstrumentBbl(IGtKernelInstrument& instrumentor, const IGtBbl& bbl, const ItraceKernel& ItraceKernel) 00218 { 00219 const IGtGenCoder& coder = instrumentor.Coder(); 00220 const IGtCfg& cfg = instrumentor.Cfg(); 00221 00222 // Generate code that allocates space for the new record in the trace and stores the trace record. 00223 // Insert this procedure before the first instruction in the basic block. 
00224 GtGenProcedure headerProc; 00225 auto firstInsIt = bbl.Instructions().begin(); 00226 const IGtIns& firstIns = cfg.GetInstruction((*firstInsIt)->Id()); 00227 StoreRecord(headerProc, coder, bbl, ItraceKernel, sizeof(ItraceRecord)); 00228 instrumentor.InstrumentInstruction(firstIns, GtIpoint::Before(), headerProc); 00229 00230 return true; 00231 } 00232 00233 void Itrace::StoreRecord(GtGenProcedure& proc, const IGtGenCoder& coder, const IGtBbl& bbl, 00234 const ItraceKernel& ItraceKernel, uint32_t recordSize) 00235 { 00236 IGtInsFactory& insF = coder.InstructionFactory(); 00237 00238 GtPredicate predicate(FlagReg(0)); 00239 00240 // Set values of ItraceRecord fields in _dataReg 00241 proc += insF.MakeShl(_dataReg, StateReg(0), 16); // idFieldReg[16:31] = sr0.0 00242 proc += insF.MakeAdd(_dataReg, _dataReg, GtImmU32(bbl.Id())); // idFieldReg[0:15] = bbl.Id() 00243 00244 // Allocate new record in the trace. 00245 // Set _offsetReg = offset of the allocated record in the profile buffer, _addrReg = address of the allocated record 00246 ItraceKernel.TraceAccessor().ComputeNewRecordOffset(coder, proc, recordSize, _offsetReg); 00247 coder.ComputeAddress(proc, _addrReg, _offsetReg); 00248 00249 // Zero _offsetReg if the trace buffer is overflowed (predicate == true) 00250 proc += insF.MakeMov(_offsetReg, 0).SetPredicate(predicate); 00251 00252 // Store Sr0.0 and BBL ID 00253 //if (!predicate) { STORE buffer[_addrReg] = _dataReg; 00254 proc += insF.MakeAtomicStore(_addrReg, _dataReg).SetPredicate(!predicate); 00255 00256 // Store tile ID 00257 //if (!predicate) { STORE buffer[_addrReg] = _dataReg; 00258 proc += insF.MakeMov(_dataReg, _tileIdReg); 00259 coder.ComputeRelAddress(proc, _addrReg, _addrReg, offsetof(ItraceRecord, tileId)); 00260 proc += insF.MakeAtomicStore(_addrReg, _dataReg).SetPredicate(!predicate); 00261 00262 if (!proc.empty()) { proc.front()->AppendAnnotation(__func__); } 00263 } 00264 00265 void Itrace::OnFini() 00266 { 00267 Itrace& me = *Instance(); 
    IGtCore* gtpinCore = GTPin_GetCore();
    for (auto& ref : me._kernels)
    {
        const ItraceKernel& ItraceKernel = ref.second;
        ItracePostProcessor(*gtpinCore, ItraceKernel)();
        ItraceKernel.DumpAsm();
    }
}

/* ============================================================================================= */
// ItracePreProcessor implementation
/* ============================================================================================= */
const char* ItracePreProcessor::_kernelPreProcessFileName = "itrace_pre_process.txt";
const char* ItracePreProcessor::_dispatchPreProcessFileName = "itrace_pre_process_dispatch.txt";

ItracePreProcessor::ItracePreProcessor()
{
    if (gKnobPhase == 2)
    {
        // Read the data collected during the preprocessing phase
        std::ifstream is(_kernelPreProcessFileName);
        GTPIN_ASSERT_MSG(is, string("File ") + _kernelPreProcessFileName + " does not exist. The trace won't be generated");
        is >> _kernelCounters;
    }
    else if (gKnobPhase == 1)
    {
        // Create the pre_process files, or clear the old pre_process files' content if they exist
        CreateCleanFile(_kernelPreProcessFileName);
        CreateCleanFile(_dispatchPreProcessFileName);
    }
}

ItracePreProcessor* ItracePreProcessor::Instance()
{
    static ItracePreProcessor instance;
    return &instance;
}

void ItracePreProcessor::OnFini()
{
    ItracePreProcessor& tool = *Instance();
    tool.DumpKernelProfiles(_kernelPreProcessFileName);
    tool.DumpDispatchProfiles(_dispatchPreProcessFileName);
}

uint64_t ItracePreProcessor::TraceSize(const string& extKernelName) const
{
    auto it = _kernelCounters.find(extKernelName);
    return ((it == _kernelCounters.end()) ?
            0 : it->second.weight);
}

uint32_t ItracePreProcessor::GetBblWeight(IGtKernelInstrument&, const IGtBbl&) const
{
    return sizeof(ItraceRecord);
}

void ItracePreProcessor::AggregateDispatchCounters(KernelWeightCounters& kc, KernelWeightCounters dc) const
{
    kc.weight = std::max(kc.weight, dc.weight);
    kc.freq += dc.freq;
}

/* ============================================================================================= */
// ItracePostProcessor implementation
/* ============================================================================================= */
const char* ItracePostProcessor::_traceFileName = "itrace_compressed.bin";

ItracePostProcessor::ItracePostProcessor(const IGtCore& gtpinCore, const ItraceKernel& ItraceKernel) :
    _kernel(&ItraceKernel),
    _kernelDir(JoinPath(string(gtpinCore.ProfileDir()), ItraceKernel.UniqueName())) {}

bool ItracePostProcessor::operator()()
{
    if (!MakeDirectory(_kernelDir))
    {
        GTPIN_WARNING("ITRACE: Could not create directory " + _kernelDir);
        return false;
    }

    // Process traces recorded in kernel dispatches
    for (const ItraceDispatch& trace : _kernel->GetTraces())
    {
        if (!trace.IsEmpty())
        {
            if (trace.IsTrimmed())
            {
                GTPIN_WARNING("ITRACE: Detected trace buffer overflow in kernel " + _kernel->Name());
            }

            ProcessTrace(trace);

            string subdir = trace.KernelExecDesc().ToString(_kernel->Platform(), ExecDescFileNameFormat());
            string dir = MakeSubDirectory(_kernelDir, subdir);

            if (!gKnobCfgOnly)
            {
                string filePath = JoinPath(dir, _traceFileName);
                ofstream fs(filePath, std::ios::binary);
                if (!fs)
                {
                    GTPIN_WARNING("ITRACE: Could not create file " + filePath);
                    continue;
                }
                StoreTrace(fs);
            }

            string cfgFilePath =
                JoinPath(dir, "itrace_total.cfg");
            ofstream cfgfs(cfgFilePath);
            if (!cfgfs)
            {
                GTPIN_WARNING("ITRACE: Could not create file " + cfgFilePath);
                continue;
            }
            StoreCfg(cfgfs);
        }
    }

    return true;
}

void ItracePostProcessor::ProcessTrace(const ItraceDispatch& trace)
{
    const uint8_t* traceData = trace.Data();
    uint32_t traceSize = trace.Size();

    // Associate trace records with threads - populate _threadTraceRecords array
    const GtStateRegAccessor& sra = _kernel->GenModel().StateRegAccessor();
    uint32_t maxThreads = _kernel->GenModel().MaxThreads(); // Max number of HW threads

    // Build control-flow graph
    _cfg.clear();
    ItraceKernel::Edges edges = _kernel->GetEdges();
    for (auto& edge : edges)
    {
        _cfg.emplace(edge, 0);
    }

    // Reference to the trace record
    struct Record
    {
        const ItraceRecord* header; ///< Pointer to the header of the record
        uint32_t size;              ///< Size of the record in bytes, including header
    };

    using RecordList = std::list<Record>;           ///< List of Records
    using PerTileRecords = std::vector<RecordList>; ///< Per tile Records
    std::vector<PerTileRecords> threadRecords;      ///< Vector of per tile records.
                                                    ///< Tile ID is an index to this vector

    threadRecords.resize(_kernel->NumTiles());
    _itraceRecords.resize(_kernel->NumTiles());
    _numProfiledThreads.resize(_kernel->NumTiles(), 0);

    for (uint32_t tile = 0; tile < _kernel->NumTiles(); tile++)
    {
        threadRecords[tile].resize(maxThreads);

        _itraceRecords[tile].clear();
        _itraceRecords[tile].resize(maxThreads);
    }

    for (uint32_t recordOffset = 0; recordOffset + sizeof(ItraceRecord) <= traceSize;)
    {
        // Retrieve thread ID and BBL ID from the record
        const ItraceRecord* record = (const ItraceRecord*)(traceData + recordOffset);
        uint32_t tid = sra.GetGlobalTid(record->sr0);
        uint32_t tileId = record->tileId; GTPIN_ASSERT(tileId < _kernel->NumTiles());
        uint32_t recordSize = sizeof(ItraceRecord);
        if (recordOffset + recordSize > traceSize)
        {
            break; // end of trace
        }

        auto& tileRecords = threadRecords[tileId];
        auto& records = tileRecords[tid];

        // Add a new trace record reference to _threadTraceRecords
        if (records.empty()) { ++_numProfiledThreads[tileId]; } // Increment thread count on the first relevant record
        records.emplace_back(Record{ record, recordSize });

        recordOffset += recordSize;
    }

    for (uint32_t tileId = 0; tileId < threadRecords.size(); tileId++)
    {
        if (_numProfiledThreads[tileId] == 0)
        {
            continue;
        }

        const auto& tileRecords = threadRecords[tileId];
        auto& tileItraceRecords = _itraceRecords[tileId];

        // Store per-thread traces
        for (uint32_t tid = 0; tid < maxThreads; tid++)
        {
            const auto& threadRecordList = tileRecords[tid];
            auto& records = tileItraceRecords[tid];

            if (threadRecordList.empty())
            {
                continue;
            }

            // Store trace records
            BblId prevBblId;
            for (const auto& record : threadRecordList)
            {
                const auto& header = *(record.header);
                BblId bblId = header.bblId;

                if (prevBblId != bblId)
                {
                    records.emplace_back(bblId, 1);
                }
                else
                {
                    auto& r = records.back();
                    r.second++;
                }

                if (prevBblId.IsValid())
                {
                    auto it = _cfg.find(ItraceKernel::Edge(prevBblId, bblId));
                    if (it != _cfg.end())
                    {
                        it->second += 1;
                    }
                }

                prevBblId = bblId;
            }
        }
    }
}

void ItracePostProcessor::StoreTrace(std::ofstream& fs)
{
    StoreBblBoundsInfo(fs);

    // Compute and store the number of involved tiles
    uint32_t numOfTiles = 0;
    for (uint32_t i = 0; i < _numProfiledThreads.size(); i++)
    {
        numOfTiles += (_numProfiledThreads[i] == 0) ? 0 : 1;
    }
    Store(numOfTiles, fs);

    for (uint32_t tileId = 0; tileId < _itraceRecords.size(); tileId++)
    {
        if (_numProfiledThreads[tileId] == 0) { continue; }

        const auto& tileRecords = _itraceRecords[tileId];

        Store(tileId, fs);

        // Store the number of profiled threads
        Store(_numProfiledThreads[tileId], fs);

        // Store per-thread traces
        for (uint32_t tid = 0; tid < tileRecords.size(); tid++)
        {
            const auto& records = tileRecords[tid];

            if (records.empty())
            {
                continue;
            }

            StoreGlobalTid(tid, fs); // Store Global Thread Identifier

            uint32_t numRecords = (uint32_t)records.size();
            Store(numRecords, fs); // Store #records collected in the thread

            // Store trace records
            for (const auto& record : records)
            {
                uint32_t bblId = record.first;
                uint32_t loopCount = record.second;
                Store(bblId, fs);     // Store BBL ID
                Store(loopCount, fs); // Store loopCount
            }
        }
    }
}

void ItracePostProcessor::StoreBblBoundsInfo(std::ofstream& fs)
{
    // Store static information about memory accesses in BBLs
    ItraceKernel::BblBoundsMap bblBoundsMap = _kernel->GetBblBounds();
    uint32_t numBbls = (uint32_t)bblBoundsMap.size();
    Store(numBbls, fs); // Store the number of BBLs that access memory

    for (uint32_t bblId = 0; bblId < numBbls; bblId++)
    {
        auto bounds = bblBoundsMap[bblId];
        Store(bblId, fs); // Store BBL ID
        uint32_t val = bounds.first;
        Store(val, fs);
        val = bounds.second;
        Store(val, fs);
    }
}

void ItracePostProcessor::StoreGlobalTid(uint32_t gtid, std::ofstream& fs)
{
    const GtStateRegAccessor& sra = _kernel->GenModel().StateRegAccessor();
    uint32_t sr0 = sra.SetGlobalTid(0, gtid);

    auto storeSr0Field = [&](const ScatteredBitFieldU32& sbf)
    {
        uint32_t val = (sbf.IsEmpty() ? UINT32_MAX : sbf.GetValue(sr0));
        Store(val, fs);
    };

    storeSr0Field(sra.SliceIdField());
    storeSr0Field(sra.DualSubSliceIdField());
    storeSr0Field(sra.SubSliceIdField());
    storeSr0Field(sra.EuIdField());
    storeSr0Field(sra.ThreadSlotField());
}

void ItracePostProcessor::StoreCfg(std::ofstream& fs)
{
    ostringstream ostr;
    ostr << "srcBBL, dstBBL, Frequency" << std::endl;
    ostr << "=========================" << std::endl;
    for (const auto& it : _cfg)
    {
        ItraceKernel::Edge edge = it.first;
        uint32_t frequency = it.second;

        // print srcBBL, dstBBL, frequency
        ostr << std::dec << edge.first << "," << edge.second << "," << frequency << std::endl;
    }
    fs << ostr.str();
    fs.close();
}

/* ============================================================================================= */
// GTPin_Entry
/* ============================================================================================= */
EXPORT_C_FUNC void GTPin_Entry(int argc, const char*
        argv[])
{
    ConfigureGTPin(argc, argv);
    if (gKnobPhase == 1)
    {
        ItracePreProcessor::Instance()->Register();
        atexit(ItracePreProcessor::OnFini);
    }
    else
    {
        GTPIN_ASSERT_MSG((gKnobPhase == 2), "Itrace: Invalid phase value. Should be 1 or 2, provided " + std::to_string(gKnobPhase));
        Itrace::Instance()->Register();
        atexit(Itrace::OnFini);
    }
}
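The per-thread loop in ItracePostProcessor::ProcessTrace does two things at once: it collapses consecutive visits to the same basic block into (BBL ID, loopCount) pairs, and it increments the frequency of each traversed edge that is present in the kernel's control-flow graph. The standalone sketch below illustrates that logic on plain integers. It is an illustration only, not GTPin code: the names BblId, kInvalidBbl, Edge, Run, ThreadSummary and Summarize are hypothetical stand-ins, and a sentinel value is assumed to play the role of a default-constructed (invalid) BblId.

```cpp
#include <cassert>
#include <cstdint>
#include <map>
#include <utility>
#include <vector>

// Hypothetical stand-ins for the tool's types; none of these are GTPin API names.
using BblId = uint32_t;
constexpr BblId kInvalidBbl = UINT32_MAX;  // assumed sentinel for "no previous BBL"
using Edge = std::pair<BblId, BblId>;      // (source BBL, destination BBL)
using Run = std::pair<BblId, uint32_t>;    // (BBL ID, number of consecutive visits)

struct ThreadSummary
{
    std::vector<Run> runs;              // run-length-encoded BBL sequence of one thread
    std::map<Edge, uint32_t> edgeFreq;  // traversal count for each known CFG edge
};

// Compress one thread's BBL sequence and count traversals of the known edges.
// Transitions that are not listed in knownEdges are ignored, mirroring the
// _cfg.find() check in the listing.
ThreadSummary Summarize(const std::vector<BblId>& trace, const std::vector<Edge>& knownEdges)
{
    ThreadSummary out;
    for (const Edge& e : knownEdges) { out.edgeFreq.emplace(e, 0); }

    BblId prev = kInvalidBbl;
    for (BblId cur : trace)
    {
        assert(cur != kInvalidBbl);  // the sentinel must not appear in the input

        if (prev != cur)
        {
            out.runs.emplace_back(cur, 1);  // a new run starts
        }
        else
        {
            out.runs.back().second++;       // same BBL again: extend the current run
        }

        if (prev != kInvalidBbl)
        {
            auto it = out.edgeFreq.find(Edge(prev, cur));
            if (it != out.edgeFreq.end()) { it->second += 1; }
        }
        prev = cur;
    }
    return out;
}
```

For example, calling Summarize on the sequence 0, 1, 1, 1, 2, 0 with the known edges (0,1), (1,1) and (1,2) yields four runs, with the second run equal to (1, 3), and counts the self-edge (1,1) twice. The final transition from 2 to 0 is not a known edge, so it leaves the frequencies unchanged.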
Copyright (C) 2013-2025 Intel Corporation
SPDX-License-Identifier: MIT