CUBLAS
Updated: 2024-04-20 16:09:01
The library is self-contained at the API level; that is, no direct interaction with the CUDA driver is necessary. The interface to the CUBLAS library is the header file cublas.h.
The type cublasStatus is used for function status returns.
cublasStatus
cublasInit (void)
initializes the CUBLAS library and must be called before any other
CUBLAS API function is invoked. It allocates hardware resources
necessary for accessing the GPU.
cublasStatus
cublasShutdown (void)
releases CPU‐side resources used by the CUBLAS library.
cublasStatus
cublasGetError (void)
returns the last error that occurred on invocation of any of the
CUBLAS core functions.
cublasStatus
cublasAlloc (int n, int elemSize, void **devicePtr)
creates an object in GPU memory space capable of holding an array of
n elements, where each element requires elemSize bytes of storage. If
the function call is successful, a pointer to the object in GPU memory
space is placed in devicePtr. Note that this is a device pointer that
cannot be dereferenced in host code. Function cublasAlloc() is a
wrapper around cudaMalloc(). Device pointers returned by
cublasAlloc() can therefore be passed to any CUDA device kernels,
not just CUBLAS functions.
Note: memory allocated with cublasAlloc is equivalent to memory allocated with cudaMalloc.
cublasStatus
cublasFree (const void *devicePtr)
destroys the object in GPU memory space referenced by devicePtr.
Note: frees GPU memory.
cublasStatus
cublasSetVector (int n, int elemSize, const void *x, int incx, void *y, int incy)
copies n elements from a vector x in CPU memory space to a vector y
in GPU memory space. Elements in both vectors are assumed to have a
size of elemSize bytes. Storage spacing between consecutive elements
is incx for the source vector x and incy for the destination vector y. In general, y points to an object, or part of an object, allocated via
cublasAlloc().
Note: host-to-device vector copy.
cublasStatus
cublasGetVector (int n, int elemSize, const void *x, int incx, void *y, int incy)
copies n elements from a vector x in GPU memory space to a vector y
in CPU memory space. Elements in both vectors are assumed to have a
size of elemSize bytes. Storage spacing between consecutive elements
is incx for the source vector x and incy for the destination vector y.
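Note: device-to-host vector copy.
Question: what do the parameters incx and incy mean?
incx is the storage spacing between consecutive elements of x. In the manual's 1-based notation, element i (i = 0 to n-1) is x[1 + i * incx]:
i = 0: stored at x[1]
i = 1: stored at x[1 + incx]
i = 2: stored at x[1 + 2 * incx]
i = 3: stored at x[1 + 3 * incx]
incy plays the same role for y.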
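The spacing described above can be sketched in plain C. This is a host-only illustration, not CUBLAS code: strided_copy is a hypothetical helper, it assumes positive increments, and it uses 0-based C offsets, so the manual's x[1 + i * incx] becomes element offset i * incx from the pointer.

```c
#include <assert.h>
#include <string.h>

/* Hypothetical host-only helper: copy n elements of elemSize bytes from x
   (stride incx) to y (stride incy). Strides are counted in elements, not
   bytes. Assumes incx > 0 and incy > 0. */
void strided_copy(int n, int elemSize,
                  const void *x, int incx,
                  void *y, int incy)
{
    const unsigned char *src = (const unsigned char *)x;
    unsigned char *dst = (unsigned char *)y;
    assert(n >= 0 && elemSize > 0 && incx > 0 && incy > 0);
    for (int i = 0; i < n; ++i) {
        /* Element i sits i * inc elements past the start of its vector. */
        memcpy(dst + (size_t)i * incy * elemSize,
               src + (size_t)i * incx * elemSize,
               (size_t)elemSize);
    }
}
```

With x = {1, 2, 3, 4, 5, 6}, n = 3, incx = 2 and incy = 1, the destination receives 1, 3, 5: every second element of x, packed contiguously.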
cublasStatus
cublasSetMatrix (int rows, int cols, int elemSize, const void *A, int lda, void *B, int ldb)
copies a tile of rows×cols elements from a matrix A in CPU memory space to a matrix B in GPU memory space. Each element requires storage of elemSize bytes.
Note: host-to-device matrix copy.
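A rows×cols tile copy between matrices with different leading dimensions might look like this on the host. The sketch assumes column-major storage and that lda and ldb are the leading dimensions (number of rows as allocated) of A and B; tile_copy is a hypothetical name, and nothing here touches the GPU.

```c
#include <assert.h>
#include <string.h>

/* Hypothetical host-only helper: copy a rows x cols tile from column-major
   matrix A (leading dimension lda) to column-major matrix B (leading
   dimension ldb). */
void tile_copy(int rows, int cols, int elemSize,
               const void *A, int lda, void *B, int ldb)
{
    const unsigned char *src = (const unsigned char *)A;
    unsigned char *dst = (unsigned char *)B;
    assert(rows >= 0 && cols >= 0 && elemSize > 0);
    assert(lda >= rows && ldb >= rows);
    for (int j = 0; j < cols; ++j) {
        /* Column j is contiguous: rows elements starting at offset j * ld. */
        memcpy(dst + (size_t)j * ldb * elemSize,
               src + (size_t)j * lda * elemSize,
               (size_t)rows * elemSize);
    }
}
```

Copying the top 2×2 tile of a 3×2 matrix stored as {1, 2, 3, 4, 5, 6} (lda = 3) into a buffer with ldb = 2 yields {1, 2, 4, 5}: the third row is skipped in each column.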
cublasStatus
cublasGetMatrix (int rows, int cols, int elemSize, const void *A, int lda, void *B, int ldb)
copies a tile of rows×cols elements from a matrix A in GPU memory space to a matrix B in CPU memory space. Each element requires storage of elemSize bytes.
Note: device-to-host matrix copy.
int
cublasIsamax (int n, const float *x, int incx)
finds the smallest index of the maximum magnitude element of single-precision
vector x; that is, the result is the first i, i = 0 to n-1, that maximizes abs(x[1 + i * incx]).
Note: finds the index of the element with the largest absolute value.
int
cublasIsamin (int n, const float *x, int incx)
finds the smallest index of the minimum magnitude element of single-precision
vector x; that is, the result is the first i, i = 0 to n-1, that minimizes abs(x[1 + i * incx]).
Note: finds the index of the element with the smallest absolute value.
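The two index searches above can be mirrored by a small CPU reference. This is only a sketch: ref_isamax, ref_isamin and abs_f32 are hypothetical names, and two behaviours are assumptions rather than facts taken from the text above: the result is reported as a 1-based position (Fortran convention), and 0 is returned when n <= 0 or incx <= 0.

```c
#include <assert.h>

/* Absolute value of a float, written out to avoid depending on libm. */
float abs_f32(float v) { return v < 0.0f ? -v : v; }

/* 1-based position of the first element with the largest magnitude. */
int ref_isamax(int n, const float *x, int incx)
{
    if (n <= 0 || incx <= 0) return 0;
    assert(x != 0);
    int best = 0;
    for (int i = 1; i < n; ++i) {
        /* Strict comparison keeps the earliest index on ties. */
        if (abs_f32(x[i * incx]) > abs_f32(x[best * incx])) best = i;
    }
    return best + 1;
}

/* 1-based position of the first element with the smallest magnitude. */
int ref_isamin(int n, const float *x, int incx)
{
    if (n <= 0 || incx <= 0) return 0;
    assert(x != 0);
    int best = 0;
    for (int i = 1; i < n; ++i) {
        if (abs_f32(x[i * incx]) < abs_f32(x[best * incx])) best = i;
    }
    return best + 1;
}
```

For x = {1, -7, 3, 7, -1} with incx = 1, ref_isamax gives 2 (the -7 comes before the 7) and ref_isamin gives 1 (the first of the two elements with magnitude 1).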
float
cublasSasum (int n, const float *x, int incx)
computes the sum of the absolute values of the elements of single-precision
vector x; that is, the result is the sum from i = 0 to n-1 of abs(x[1 + i * incx]).
Note: computes the sum of the absolute values of n elements.
Parameters:
n     number of elements in input vector
x     single-precision vector with n elements
incx  storage spacing between elements of x
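The sum of absolute values, with the three parameters listed above, can be sketched as a CPU reference. ref_sasum is a hypothetical name, and returning 0 for n <= 0 or incx <= 0 is an assumption about the edge cases, not something stated above.

```c
#include <assert.h>

/* CPU reference for the sum of |x[i * incx]|, i = 0 .. n-1 (0-based). */
float ref_sasum(int n, const float *x, int incx)
{
    if (n <= 0 || incx <= 0) return 0.0f;
    assert(x != 0);
    float sum = 0.0f;
    for (int i = 0; i < n; ++i) {
        float v = x[i * incx];
        sum += v < 0.0f ? -v : v;   /* add the magnitude of element i */
    }
    return sum;
}
```

For x = {1, -2, 3, -4}, ref_sasum(4, x, 1) is 10, and ref_sasum(2, x, 2) visits only 1 and 3 and returns 4.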
Vocabulary:
scalar: adj. graded, quantitative, scalar; n. a quantity, a scalar
precision: n. exactness, accuracy; adj. precise, accurate, meticulous
The parameters mean essentially the same thing across functions. There are vector-type and matrix-type functions.
void
cublasSaxpy (int n, float alpha, const float *x, int incx, float *y, int incy)
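multiplies single-precision vector x by single-precision scalar alpha
and adds the result to single-precision vector y; that is, it overwrites
single-precision y with single-precision alpha * x + y. For i = 0 to n-1, it replaces y[ly + i * incy] with alpha * x[lx + i * incx] + y[ly + i * incy], where
lx = 1 if incx >= 0, else lx = 1 + (1 - n) * incx;
ly is defined in a similar way using incy.
Note: when incx >= 0 the spacing is positive, and 1 is the usual value in a call. When incx < 0 the vector is traversed backwards: for incx = -1, lx = 1 + (1 - n) * (-1) = n, so the first element processed is x[n].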
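The start-index rule for negative increments is easier to see in code. The following CPU sketch uses 0-based C offsets, so the manual's lx = 1 + (1 - n) * incx becomes (1 - n) * incx; start_offset and ref_saxpy are hypothetical names, and this is a reference for the arithmetic only, not how the library is implemented.

```c
#include <assert.h>

/* 0-based equivalent of the manual's lx / ly:
   0 when inc >= 0, (1 - n) * inc when inc < 0. */
int start_offset(int n, int inc)
{
    return inc >= 0 ? 0 : (1 - n) * inc;
}

/* CPU reference: y <- alpha * x + y over n strided elements. */
void ref_saxpy(int n, float alpha, const float *x, int incx,
               float *y, int incy)
{
    if (n <= 0) return;
    assert(x != 0 && y != 0);
    int kx = start_offset(n, incx);
    int ky = start_offset(n, incy);
    for (int i = 0; i < n; ++i) {
        y[ky + i * incy] += alpha * x[kx + i * incx];
    }
}
```

With n = 3, incx = -1 and incy = 1, start_offset gives 2 for x, so x is read last element first: x = {1, 2, 3}, y = {10, 20, 30} and alpha = 2 leave y = {16, 24, 32}.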
void
cublasScopy (int n, const float *x, int incx, float *y, int incy)
copies the single-precision vector x to the single-precision vector y. For
i = 0 to n-1, it copies x[lx + i * incx] to y[ly + i * incy].
Note: copy function.
float
cublasSdot (int n, const float *x, int incx, const float *y, int incy)
computes the dot product of two single‐precision vectors. It returns
the dot product of the single‐precision vectors x and y if successful,
and 0.0f otherwise. It computes the sum for i = 0 to n-1 of x[lx + i * incx] * y[ly + i * incy].
Note: computes the dot product of two vectors.
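As a cross-check for results coming back from the GPU, the dot product can be computed on the CPU. ref_sdot is a hypothetical name; the sketch uses 0-based offsets and assumes the same start-index rule for negative increments that the manual gives for lx and ly.

```c
#include <assert.h>

/* CPU reference: sum over i of x[kx + i * incx] * y[ky + i * incy]. */
float ref_sdot(int n, const float *x, int incx, const float *y, int incy)
{
    if (n <= 0) return 0.0f;
    assert(x != 0 && y != 0);
    /* 0-based start offsets; non-zero only for negative increments. */
    int kx = incx >= 0 ? 0 : (1 - n) * incx;
    int ky = incy >= 0 ? 0 : (1 - n) * incy;
    float sum = 0.0f;
    for (int i = 0; i < n; ++i) {
        sum += x[kx + i * incx] * y[ky + i * incy];
    }
    return sum;
}
```

For x = {1, 2, 3} and y = {4, 5, 6}, unit increments give 1*4 + 2*5 + 3*6 = 32, while incy = -1 pairs x with y reversed and gives 1*6 + 2*5 + 3*4 = 28.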
void
cublasSrot (int n, float *x, int incx, float *y, int incy, float sc, float ss)
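Thinking actively while analyzing makes the material more interesting:
Reading passively without thinking is the easiest way to waste time, because it is dull and hard to concentrate on.
Whether it is MIT OCW or The Interpretation of Dreams, real gains come only from careful, focused analysis.
Analyzing the categories of CUBLAS functions:
Needed: a matrix subtraction function, and matrix-vector subtraction and multiplication functions.
Analyzing textures: once a CUDA array or linear memory is bound to a texture, the data can be read directly, without any transformation.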
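CUBLAS functions used so far:
cublasAlloc, cublasSetVector, cublasSgemm, cublasGetError, cublasGetVector, cublasFree, cublasInit, cublasShutdown
1. Allocation and data-transfer functions: cublasAlloc, cublasSetVector, cublasGetVector, cublasFree
2. Initialization and shutdown functions: cublasInit, cublasShutdown
3. Computation functions: cublasSgemm, cublasSgemv, cublasSdot, cublasSaxpy, cublasSscal
Each is analyzed below, one at a time, and tested in an example: the meaning of each parameter and the typical usage.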
float
cublasSdot (int n, const float *x, int incx, const float *y, int incy)
computes the dot product of two single-precision vectors. It returns the dot product of the single-precision vectors x and y if successful, and 0.0f otherwise. It computes the sum for
i = 0 to n-1 of x[lx + i * incx] * y[ly + i * incy], where
lx = 1 if incx >= 0, else lx = 1 + (1 - n) * incx;
ly is defined in a similar way using incy.
Note: computes the dot product of the two vectors x and y.
Parameters:
n     number of elements in input vectors
x     single-precision vector with n elements
incx  storage spacing between elements of x
y     single-precision vector with n elements
incy  storage spacing between elements of y
incx and incy are usually set to 1 in a call.
Example call:
alpha = r2 / cublasSdot(n, p, 1, Ap, 1);
where p and Ap are device vectors allocated with:
status = cublasAlloc(n, sizeof(float), (void **)&p);
status = cublasAlloc(n, sizeof(float), (void **)&Ap);
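1. Do indices into memory allocated with cublasAlloc start at 1?
Consider y[ly + i * incy] with incy = 1: i = 0 refers to y[1] and i = 1 refers to y[2]. This is the manual's 1-based (Fortran-style) notation for the first and second elements; in C code they sit at offsets 0 and 1 from the device pointer.
2. The 1-based notation is a documentation convention, not a property of the memory. As the manual states, cublasAlloc() is a wrapper around cudaMalloc(), so memory from either call can be passed to CUBLAS functions and to CUDA kernels alike.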
Return value:
returns single-precision dot product (returns zero if n <= 0)
Error Status
CUBLAS_STATUS_NOT_INITIALIZED if CUBLAS library was not initialized
CUBLAS_STATUS_ALLOC_FAILED if function could not allocate reduction buffer
CUBLAS_STATUS_EXECUTION_FAILED if function failed to execute on GPU
The basic model by which applications use the CUBLAS library is to create matrix and vector objects in GPU memory space, fill them with data, call a sequence of CUBLAS functions, and, finally, upload the results from GPU memory space back to the host. To accomplish this, CUBLAS provides helper functions for creating and destroying objects in GPU space, and for writing data to and retrieving data from these objects.
For maximum compatibility with existing Fortran environments, CUBLAS uses column-major storage and 1-based indexing. (Take note of this.)
Since C and C++ use row-major storage, applications cannot use the native array semantics for two-dimensional arrays. Instead, macros or inline functions should be defined to implement matrices on top of one-dimensional arrays. For Fortran code ported to C in mechanical fashion, one may choose to retain 1-based indexing to avoid the need to transform loops. In this case, the array index of a matrix element in row i and column j can be computed via the following macro:
#define IDX2F(i,j,ld) ((((j)-1)*(ld))+((i)-1))   /* 1-based indexing */
Here, ld refers to the leading dimension of the matrix as allocated (the dimension along which elements are stored contiguously), which in the case of column-major storage is the number of rows. For natively written C and C++ code, one would most likely choose 0-based indexing, in which case the indexing macro becomes
#define IDX2C(i,j,ld) (((j)*(ld))+(i))   /* 0-based indexing */
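To see that the two macros address the same storage, the following host-only sketch fills a small column-major matrix through IDX2C and reads it back through IDX2F. The macros are the ones quoted above; fill_colmajor and the 10 * i + j fill pattern are made up for illustration.

```c
#include <assert.h>

#define IDX2F(i,j,ld) ((((j)-1)*(ld))+((i)-1))  /* 1-based row i, column j */
#define IDX2C(i,j,ld) (((j)*(ld))+(i))          /* 0-based row i, column j */

/* Fill an m x n column-major matrix so that the element in 0-based row i
   and column j holds the value 10 * i + j. */
void fill_colmajor(float *a, int m, int n, int ld)
{
    assert(a != 0 && ld >= m);
    for (int j = 0; j < n; ++j) {
        for (int i = 0; i < m; ++i) {
            a[IDX2C(i, j, ld)] = (float)(10 * i + j);
        }
    }
}
```

For a 2×3 matrix with ld = 2 the buffer ends up as {0, 10, 1, 11, 2, 12}: each column is contiguous, and a[IDX2C(1, 2, 2)] and a[IDX2F(2, 3, 2)] both name the last element, 12.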
Work through a simple example and study the result to see what this is for.
cublasStatus cublasInit (void)
initializes the CUBLAS library and must be called before any other CUBLAS API function is invoked. It allocates hardware resources necessary for accessing the GPU.
Note: initialization function.
cublasStatus
cublasShutdown (void)
releases CPU‐side resources used by the CUBLAS library. The release of GPU‐side resources may be deferred until the application shuts down.
Note: releases resources.
1. CUBLAS helper functions
2. CUBLAS core functions
cublasStatus
cublasGetError (void)
returns the last error that occurred on invocation of any of the CUBLAS core functions.
Reading the error status via cublasGetError() resets the internal error state to
CUBLAS_STATUS_SUCCESS.
CUBLAS functions can be called directly from a .cpp file.
void CcublasDlg::simple_sgemm(int n, float alpha, const float *A, const float *B,
                              float beta, float *C)
{
    // Host reference: C = alpha * A * B + beta * C for n x n matrices
    // stored in column-major order.
    for (int i = 0; i < n; ++i) {
        for (int j = 0; j < n; ++j) {
            float prod = 0;
            for (int k = 0; k < n; ++k) {
                prod += A[k * n + i] * B[j * n + k];   // A(i, k) * B(k, j)
            }
            C[j * n + i] = alpha * prod + beta * C[j * n + i];
        }
    }
}
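Analysis: this computes alpha * A * B + beta * C. The inner loop sums over k: the sum of A(i, k) * B(k, j) gives element (i, j) of the product.
Element A(i, k) is stored at A[k * n + i]; the k * n term shows that storage is by column. This matches the column-major storage that CUBLAS uses. The indices in this code are 0-based, as in the IDX2C macro, whereas the manual's notation is 1-based.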
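A free-function version of the same loop makes it easy to check the column-major indexing on a small case. This is a sketch for verification on the host only; ref_sgemm is a hypothetical name, and the loop order is swapped so that C is written one column at a time, which does not change the result.

```c
#include <assert.h>

/* Host reference: C = alpha * A * B + beta * C for n x n matrices stored
   in column-major order (element (i, j) lives at index j * n + i). */
void ref_sgemm(int n, float alpha, const float *A, const float *B,
               float beta, float *C)
{
    assert(n >= 0);
    for (int j = 0; j < n; ++j) {
        for (int i = 0; i < n; ++i) {
            float prod = 0.0f;
            for (int k = 0; k < n; ++k) {
                prod += A[k * n + i] * B[j * n + k];  /* A(i, k) * B(k, j) */
            }
            C[j * n + i] = alpha * prod + beta * C[j * n + i];
        }
    }
}
```

Take A = [[1, 2], [3, 4]] and B = [[5, 6], [7, 8]], stored by column as {1, 3, 2, 4} and {5, 7, 6, 8}. Their product is [[19, 22], [43, 50]], so with alpha = 1, beta = 2 and C initially all ones, C becomes {21, 45, 24, 52} in column-major order.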
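Pay attention to how the indices line up. Distinguish cublasAlloc from cudaMalloc.
M + alloc: memory allocate. Only through practice can one appreciate what the computer does.
main() can also take two parameters: int main(int argc, char** argv)
Texture memory resides in global memory; it is a region of global memory designated to be read through a cache. It is usually declared with texture
Textures are generally suited to fixed table structures and to random reads.
Texture memory is part of global memory, but it has two levels of cache, used to accelerate and filter data accesses; it is read-only.
Comment: if the constant arrays are bound to textures, then w_d, noise_d, an_d, mn_d, cos_d and sin_d can all be read with texture fetches.
As for CUBLAS: the two matrices are multiplied element by element, which cannot be expressed as a matrix operation. The denoising, however, can be done with just two vectors.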