Grok-1 is a mixture-of-experts model with eight experts and 314 billion parameters. The author of the video notes that while the released model has not yet been quantized, they were able to test the unquantized version through X itself.
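In a mixture-of-experts layer, a small router picks only a few experts per token, so most of the 314B parameters sit idle on any given forward pass. The sketch below is a toy illustration of top-k routing with made-up shapes and names (it is not Grok-1's actual implementation, which lives in the linked repo):

```python
import numpy as np

def moe_layer(x, expert_weights, gate_weights, top_k=2):
    """Route one token to its top-k experts and mix their outputs.

    x: (d,) token activation; expert_weights: (n_experts, d, d) toy expert
    matrices; gate_weights: (d, n_experts) router. Hypothetical sizes --
    real MoE layers are vastly larger.
    """
    logits = x @ gate_weights                 # router score for each expert
    top = np.argsort(logits)[-top_k:]         # indices of the top-k experts
    probs = np.exp(logits[top] - logits[top].max())
    probs /= probs.sum()                      # softmax over the chosen experts only
    # Only the selected experts run, so compute scales with top_k, not n_experts.
    return sum(p * (x @ expert_weights[i]) for p, i in zip(probs, top))

rng = np.random.default_rng(0)
d, n_experts = 16, 8
x = rng.standard_normal(d)
out = moe_layer(x, rng.standard_normal((n_experts, d, d)),
                rng.standard_normal((d, n_experts)))
print(out.shape)  # (16,)
```

The design point is the sparsity: with eight experts and two active per token, most weights are loaded in memory but untouched per step, which is why MoE models are cheap to run relative to their parameter count.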
One of the most notable aspects of this release is the licensing. Grok-1 is released under the Apache 2.0 license, allowing commercial use and opening up a world of possibilities for companies looking to leverage the power of large language models.
Grok repo: https://github.com/xai-org/grok-1 (Open source, Apache 2.0 license)
However, running Grok-1 locally presents a significant challenge due to its size. As Emad Mostaque, the CEO of Stability AI, points out, “In order to run this in 4-bit, you will likely need around 320 GB of VRAM, and to run it in 8-bit, you will need a DGX H100 with eight H100s, each having 80 GB of VRAM.” This hefty hardware requirement may limit the accessibility of Grok-1 for some users.
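As a rough sanity check on those figures, the weights-only memory footprint of a 314B-parameter model is easy to estimate (real deployments also need headroom for activations and the KV cache, which is why quoted requirements run higher than the naive math):

```python
# Back-of-the-envelope VRAM needs for 314B parameters at common precisions.
# Weights only -- activations, KV cache, and framework overhead add more.
PARAMS = 314e9
for name, bytes_per_param in [("fp16", 2), ("int8", 1), ("int4", 0.5)]:
    gb = PARAMS * bytes_per_param / 1024**3
    print(f"{name}: ~{gb:,.0f} GB of weights")
```

Even at 4-bit, the weights alone are in the hundreds of gigabytes, well beyond any single consumer GPU.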

The evaluation results demonstrate the impressive performance improvements achieved with Grok-1 compared to its predecessor Grok-0 and other models in its compute class. Let’s analyze the results for each benchmark:
- GSM8k (middle school math word problems): Grok-1 achieved 80.7% with an 8-shot prompt, outperforming GPT-3.5 (57.1%), LLaMa 2 70B (56.8%), and Inflection-1 (62.9%). It is surpassed only by more resource-intensive models like Claude 2 (88.0%) and GPT-4 (92.0%).
- MMLU (multidisciplinary multiple-choice questions): Grok-1 scored 73.0% with 5-shot in-context examples, surpassing GPT-3.5 (70.0%), LLaMa 2 70B (68.9%), and Inflection-1 (72.7%). Again, it is outperformed only by models trained with significantly more data and compute, such as Palm 2 (78.0%) and GPT-4 with chain-of-thought (86.4%).
- HumanEval (Python code completion task): In the zero-shot pass@1 evaluation, Grok-1 achieved an impressive 63.2%, surpassing GPT-3.5 (48.1%), LLaMa 2 70B (29.9%), and Inflection-1 (35.4%). It comes close to the performance of more advanced models like Claude 2 (70%) and GPT-4 (67%).
- MATH (middle and high school mathematics problems in LaTeX): Grok-1 scored 23.9% with a fixed 4-shot prompt, outperforming GPT-3.5 (23.5%), LLaMa 2 70B (13.5%), and Inflection-1 (16.0%). Once again, it is surpassed only by more resource-intensive models like Palm 2 (34.6%) and GPT-4 (42.5%).
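For a quick side-by-side view, the compute-class scores quoted above can be collected into a small table; a minimal Python sketch:

```python
# Benchmark scores (%) as reported in the section above (compute class only).
scores = {
    "GSM8k (8-shot)":      {"Grok-1": 80.7, "GPT-3.5": 57.1, "LLaMa 2 70B": 56.8, "Inflection-1": 62.9},
    "MMLU (5-shot)":       {"Grok-1": 73.0, "GPT-3.5": 70.0, "LLaMa 2 70B": 68.9, "Inflection-1": 72.7},
    "HumanEval (pass@1)":  {"Grok-1": 63.2, "GPT-3.5": 48.1, "LLaMa 2 70B": 29.9, "Inflection-1": 35.4},
    "MATH (4-shot)":       {"Grok-1": 23.9, "GPT-3.5": 23.5, "LLaMa 2 70B": 13.5, "Inflection-1": 16.0},
}
for bench, by_model in scores.items():
    leader = max(by_model, key=by_model.get)
    print(f"{bench}: {leader} leads at {by_model[leader]}%")
```

Grok-1 tops every row within its compute class; only the larger-budget models (Claude 2, Palm 2, GPT-4) beat it, as noted above.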
These results showcase the significant progress xAI has made in training large language models efficiently. Grok-1 consistently outperforms other models in its compute class, including GPT-3.5 and Inflection-1, across benchmarks that measure math, coding, and reasoning abilities. That Grok-1 is surpassed only by models trained with significantly more data and compute underscores how far this model has come.
The key ideas discussed in the video:
- Grok-1 is a large language model developed by Elon Musk’s AI company, xAI, with 314 billion parameters and eight experts.
- Grok-1 has the unique ability to pull real-time information from X (Twitter), allowing it to stay current with recent events.
- The video’s author tested Grok-1’s capabilities against other models such as Gemini, Llama, and ChatGPT.
- Grok-1 performed well in tasks such as writing a Python script to output numbers, solving math problems, and creating JSON data structures.
- However, Grok-1 struggled with writing the game “Snake” in Python, predicting the number of words in its own response, and solving a physics-based logic problem.
- Grok-1 is uncensored, in line with X’s stance on freedom of speech.
- The author is eager to test a quantized version of Grok-1 and see its performance when fine-tuned for specific tasks.
- The video serves as an initial assessment of Grok-1’s capabilities, highlighting its strengths and weaknesses compared to other large language models.