News

Research has shown that large language models (LLMs) tend to overemphasize information at the beginning and end of a document or conversation while neglecting the middle, a positional bias often called the "lost in the middle" effect.
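This bias is typically measured with a position-sweep test: the same fact is planted at different depths of a long filler context and the model is asked to retrieve it. The sketch below is illustrative only; `ask_model` is a hypothetical stand-in for whatever completion call is being evaluated, and the substring scoring is deliberately crude.

```python
# Position-sweep probe: hide one fact at varying depths of a long filler
# context and check whether the model can still retrieve it.

def make_prompt(depth: float, filler_lines: int = 200) -> str:
    """Build a long context with a 'needle' fact inserted at the given depth (0..1)."""
    needle = "The access code is 7412."
    filler = ["This paragraph is unrelated filler text."] * filler_lines
    filler.insert(int(depth * filler_lines), needle)
    return "\n".join(filler) + "\n\nQuestion: What is the access code?"

def accuracy_by_position(ask_model, depths=(0.0, 0.25, 0.5, 0.75, 1.0)):
    """Query the model with the needle at each depth; score by substring match."""
    return {depth: "7412" in ask_model(make_prompt(depth)) for depth in depths}
```

A "lost in the middle" model scores well at depths near 0.0 and 1.0 but poorly around 0.5.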
Building on this observation, the researchers designed an architecture that selectively applies quantization or sparsification to different components of the model, guided by the distribution pattern of each component's activations.
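The selection logic can be made concrete with a small sketch. Under the assumption (not stated in the source) that the decision statistic is the fraction of outlier activations, a component with heavy-tailed activations keeps its few large values via magnitude sparsification, while a component with smooth, roughly Gaussian activations is uniformly quantized. All names here (`outlier_ratio`, `compress_component`, the 8-bit scheme) are illustrative, not the paper's actual method.

```python
import numpy as np

def outlier_ratio(x: np.ndarray, k: float = 6.0) -> float:
    """Fraction of activations more than k standard deviations from the mean."""
    mu, sigma = x.mean(), x.std()
    return float(np.mean(np.abs(x - mu) > k * sigma))

def quantize_int8(w: np.ndarray) -> np.ndarray:
    """Uniform symmetric 8-bit quantization (dequantized back for illustration)."""
    scale = np.abs(w).max() / 127.0 + 1e-12
    return np.clip(np.round(w / scale), -127, 127) * scale

def sparsify(w: np.ndarray, keep: float = 0.1) -> np.ndarray:
    """Magnitude pruning: zero out all but the top `keep` fraction of entries."""
    thresh = np.quantile(np.abs(w), 1.0 - keep)
    return np.where(np.abs(w) >= thresh, w, 0.0)

def compress_component(activations: np.ndarray, weights: np.ndarray) -> np.ndarray:
    """Pick a compression scheme for one component from its activation statistics.

    Heavy-tailed activations (many outliers) suggest the component's output is
    dominated by a few large values, so sparsification preserves them; smooth
    activations tolerate uniform quantization with little error.
    """
    if outlier_ratio(activations) > 1e-3:
        return sparsify(weights)
    return quantize_int8(weights)
```

A real system would apply this per layer or per channel and calibrate the outlier threshold on held-out data rather than using a fixed constant.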