LLM Training in Scala
This project is an implementation of a Language Model (LLM) training framework in Scala. It provides a set of modules and utilities for building, training, and evaluating language models using the transformer architecture.
Inspired by the llm.c project, this Scala version aims to provide a clean, efficient, and extensible codebase for training language models.
Features
- Transformer-based language model architecture
- Multi-head self-attention mechanism
- Positional encoding for sequence information
- Feed-forward neural network layers
- Embedding layer for input tokens
- Layer normalization for stable training
- GELU activation function (a minimal sketch follows this list)
- Adam optimizer for parameter updates
- Data loading and batching utilities
- Tokenization and vocabulary handling
- Test suite for all modules
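For concreteness, the GELU activation listed above is commonly implemented with the tanh approximation popularized by GPT-style models. The sketch below is an independent illustration, not the code from `GELU.scala`:

```scala
// Tanh approximation of GELU: 0.5 * x * (1 + tanh(sqrt(2/pi) * (x + 0.044715 * x^3))).
// Illustrative sketch only; the project's GELU.scala may differ.
object GeluSketch {
  def gelu(x: Double): Double =
    0.5 * x * (1.0 + math.tanh(math.sqrt(2.0 / math.Pi) * (x + 0.044715 * math.pow(x, 3))))

  def main(args: Array[String]): Unit = {
    println(gelu(1.0))  // ~0.8412: close to identity for positive inputs
    println(gelu(-1.0)) // ~-0.1588: small negative value, unlike ReLU's hard zero
  }
}
```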
Project Structure
The project follows a standard Scala project structure:
```
llm-training/
├── build.sbt
└── src/
    ├── main/
    │   └── scala/
    │       └── llm/
    │           ├── Config.scala
    │           ├── Model.scala
    │           ├── Attention.scala
    │           ├── LayerNorm.scala
    │           ├── GELU.scala
    │           ├── Embedding.scala
    │           ├── PositionalEncoding.scala
    │           ├── FeedForward.scala
    │           ├── Transformer.scala
    │           ├── Optimizer.scala
    │           ├── DataLoader.scala
    │           ├── Tokenizer.scala
    │           ├── Utils.scala
    │           └── Main.scala
    └── test/
        └── scala/
            └── llm/
                ├── ModelSpec.scala
                ├── AttentionSpec.scala
                ├── LayerNormSpec.scala
                ├── GELUSpec.scala
                ├── EmbeddingSpec.scala
                ├── PositionalEncodingSpec.scala
                ├── FeedForwardSpec.scala
                ├── TransformerSpec.scala
                ├── OptimizerSpec.scala
                ├── DataLoaderSpec.scala
                ├── TokenizerSpec.scala
                └── UtilsSpec.scala
```
- `src/main/scala/llm/`: Contains the main source code for the language model implementation.
- `src/test/scala/llm/`: Contains the test specifications for each module.
- `build.sbt`: The build configuration file for the Scala project.
- `project/`: Contains the sbt version and plugin configuration.
Getting Started
Prerequisites
- Scala 2.13.8
- sbt 1.5.5
Installation
1. Clone the repository:

   ```
   git clone https://github.com/wassemgtk/llm.scala.git
   ```

2. Navigate to the project directory:

   ```
   cd llm-training
   ```

3. Compile the project:

   ```
   sbt compile
   ```
Training
To train the language model, follow these steps:
1. Prepare your training data:
   - Place your training data file (e.g., `tiny_shakespeare_train.bin`) in the `data/` directory.
   - Update the `dataFile` value in `Main.scala` to point to your training data file.
2. Configure the model hyperparameters in the `Config` case class in `Config.scala`.
3. Run the training script:

   ```
   sbt run
   ```

4. Monitor the training progress and metrics logged to the console. The quantity a run like this minimizes is the per-token cross-entropy, sketched after this list.
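For intuition about the training objective, here is a minimal, self-contained sketch of per-token cross-entropy over logits. It illustrates the standard loss and is not code taken from this repository:

```scala
// Numerically stable cross-entropy for one position: -log softmax(logits)[target].
// Illustrative sketch; the repo's actual loss computation may differ.
object LossSketch {
  def crossEntropy(logits: Array[Double], target: Int): Double = {
    val m = logits.max // subtract the max before exponentiating for stability
    val logSumExp = m + math.log(logits.map(x => math.exp(x - m)).sum)
    -(logits(target) - logSumExp)
  }

  def main(args: Array[String]): Unit = {
    val logits = Array(2.0, 0.5, -1.0)
    println(crossEntropy(logits, 0)) // ~0.24: low loss, target already has the largest logit
    println(crossEntropy(logits, 2)) // ~3.24: high loss, target is the least likely token
  }
}
```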
Text Generation
To generate text using a trained model, follow these steps:
1. Make sure you have a trained model checkpoint in the `checkpoints/` directory.
2. Update the `modelCheckpoint` value in `Main.scala` to point to your trained model checkpoint file.
3. Set the desired generation parameters (e.g., `maxNewTokens`, `temperature`) in the `Main` object; temperature sampling is sketched after this list.
4. Run the text generation script:

   ```
   sbt run
   ```

5. The generated text will be printed to the console.
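To make the `temperature` parameter concrete, here is a self-contained sketch of temperature sampling over next-token logits. The names are illustrative assumptions, not the repository's generation code:

```scala
import scala.util.Random

// Sample a token index from logits scaled by temperature.
// Lower temperature sharpens the distribution (greedier); higher flattens it.
// Illustrative sketch; the repo's sampler may differ.
object SampleSketch {
  def sample(logits: Array[Double], temperature: Double, rng: Random): Int = {
    val scaled = logits.map(_ / temperature)
    val m = scaled.max
    val exps = scaled.map(x => math.exp(x - m)) // numerically stable softmax
    val probs = exps.map(_ / exps.sum)
    val u = rng.nextDouble()
    val cdf = probs.scanLeft(0.0)(_ + _).tail // cumulative distribution
    val i = cdf.indexWhere(u <= _)
    if (i >= 0) i else probs.length - 1 // guard against floating-point round-off
  }

  def main(args: Array[String]): Unit = {
    val logits = Array(2.0, 1.0, 0.1)
    val rng = new Random(42)
    println(Seq.fill(10)(sample(logits, 0.8, rng)).mkString(" "))
  }
}
```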
Testing
To run the test suite and verify the correctness of the implemented modules, use the following command:

```
sbt test
```

This will execute all the test specifications in the `src/test/scala/llm/` directory.
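The `*Spec.scala` naming suggests a ScalaTest-style suite. Purely as an assumed illustration (the actual specs may use a different style or assertions), a GELU spec could look like:

```scala
import org.scalatest.flatspec.AnyFlatSpec
import org.scalatest.matchers.should.Matchers

// Hypothetical spec shape, assuming ScalaTest's AnyFlatSpec; not taken from the repo.
class GeluSketchSpec extends AnyFlatSpec with Matchers {
  private def gelu(x: Double): Double =
    0.5 * x * (1.0 + math.tanh(math.sqrt(2.0 / math.Pi) * (x + 0.044715 * math.pow(x, 3))))

  "gelu" should "vanish at zero and approach the identity for large inputs" in {
    gelu(0.0) shouldBe 0.0 +- 1e-9
    gelu(10.0) shouldBe 10.0 +- 1e-3
  }
}
```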
Configuration
The `Config` case class in `src/main/scala/llm/Config.scala` contains the hyperparameters and configuration settings for the language model. You can modify these values to experiment with different model architectures and training setups.
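As an illustration only, a hyperparameter case class of this kind typically looks like the sketch below; the field names and defaults are assumptions, not the repository's actual `Config`:

```scala
// Hypothetical hyperparameter container; the real fields in Config.scala may differ.
object ConfigSketchDemo {
  case class ConfigSketch(
      vocabSize: Int = 50257,     // tokenizer vocabulary size
      blockSize: Int = 256,       // maximum sequence length
      nLayer: Int = 6,            // number of transformer blocks
      nHead: Int = 6,             // attention heads per block
      nEmbed: Int = 384,          // embedding / hidden dimension
      learningRate: Double = 3e-4,
      batchSize: Int = 32
  )

  def main(args: Array[String]): Unit = {
    val base = ConfigSketch()
    // Case classes make it easy to vary one setting per experiment.
    val wider = base.copy(nEmbed = 512, nHead = 8)
    println(wider)
  }
}
```

A case class works well here because `copy` gives cheap, immutable variations of a baseline configuration.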
Model Checkpointing
During training, the model checkpoints will be saved in the `checkpoints/` directory. You can use these checkpoints to resume training from a previous state or to generate text using a trained model.
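As a rough illustration of flat-binary checkpointing (in the spirit of llm.c; the repository's actual checkpoint format is not specified here), parameters could be written and read like this:

```scala
import java.io.{DataInputStream, DataOutputStream, FileInputStream, FileOutputStream}

// Hypothetical flat checkpoint I/O: a length-prefixed array of float parameters.
// Illustrative sketch; not the repo's actual format.
object CheckpointSketch {
  def save(path: String, params: Array[Float]): Unit = {
    val out = new DataOutputStream(new FileOutputStream(path))
    try {
      out.writeInt(params.length)
      params.foreach(out.writeFloat)
    } finally out.close()
  }

  def load(path: String): Array[Float] = {
    val in = new DataInputStream(new FileInputStream(path))
    try {
      val n = in.readInt()
      Array.fill(n)(in.readFloat())
    } finally in.close()
  }
}
```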
Logging
The project uses the Logback library for logging. You can configure the logging settings in the `src/main/resources/logback.xml` file.
Contributing
Contributions to this project are welcome! If you find any issues or have suggestions for improvements, please open an issue or submit a pull request on the GitHub repository.
License
This project is licensed under the MIT License.
Acknowledgments
- This project is inspired by the llm.c project by Andrej Karpathy.
- The transformer architecture is based on the paper "Attention Is All You Need" by Vaswani et al.
- The implementation draws inspiration from various open-source language model implementations in the Scala ecosystem.