fastllm is a cross-platform LLM acceleration library implemented in pure C++. It can be called from Python, reaches 10,000+ tokens/s on a single card for ChatGLM-6B-class models, supports GLM, LLaMA, and MOSS base models, and runs smoothly on mobile phones.

Function overview

Pure C++ implementation that is easy to port across platforms and can be compiled directly on Android.
ARM platforms support NEON instruction set acceleration, x86 platforms support AVX instruction set acceleration, and NVIDIA platforms support CUDA acceleration, so it runs fast on each platform.
Supports acceleration of floating-point (FP32), half-precision (FP16), and quantized (INT8, INT4) models.
Supports batch speed optimization.
Supports stream…

#High #performance #large #model #reasoning #library #fastllm
