Static Embeddings Tutorial

This tutorial shows how to train encoding models using static word embeddings with the LeBel assembly. Static embeddings provide pre-trained word representations that can be highly predictive of brain activity.

Overview

Static embeddings capture semantic relationships between words using pre-trained models like Word2Vec or GloVe. These embeddings provide rich semantic representations that can be highly predictive of brain activity.

Key Components

  • Assembly: Pre-packaged LeBel assembly containing brain data and stimuli

  • Feature Extractor: StaticEmbeddingFeatureExtractor using pre-trained embeddings

  • Embedding Models: Word2Vec, GloVe, or other static embedding models

  • Downsampler: Aligns word-level features with brain data timing

  • Model: Ridge regression with nested cross-validation

  • Trainer: AbstractTrainer orchestrates the entire pipeline

Step-by-Step Tutorial

  1. Load the Assembly

    from encoding.assembly.assembly_loader import load_assembly
    
    # Load the pre-packaged LeBel assembly
    assembly = load_assembly("assembly_lebel_uts03.pkl")
    
  2. Create Static Embedding Feature Extractor

    from encoding.features.factory import FeatureExtractorFactory
    
    # You need to provide the path to your embedding file
    vector_path = "/path/to/your/embeddings.bin.gz"  # Replace with your path
    
    extractor = FeatureExtractorFactory.create_extractor(
        modality="embeddings",
        model_name="word2vec",  # Can be "word2vec", "glove", or any identifier
        config={
            "vector_path": vector_path,
            "binary": True,  # Set to True for .bin files, False for .txt files
            "lowercase": False,  # Set to True if your embeddings expect lowercase tokens
            "oov_handling": "copy_prev",  # How to handle out-of-vocabulary words
            "use_tqdm": True,  # Show progress bar
        },
        cache_dir="cache",
    )
    
  3. Set Up Downsampler and Model

    from encoding.downsample.downsampling import Downsampler
    from encoding.models.nested_cv import NestedCVModel
    
    downsampler = Downsampler()
    model = NestedCVModel(model_name="ridge_regression")
    
  4. Configure Training Parameters

    # FIR delays for hemodynamic response modeling
    fir_delays = [1, 2, 3, 4]
    
    # Trimming configuration for LeBel dataset
    trimming_config = {
        "train_features_start": 10,
        "train_features_end": -5,
        "train_targets_start": 0,
        "train_targets_end": None,
        "test_features_start": 50,
        "test_features_end": -5,
        "test_targets_start": 40,
        "test_targets_end": None,
    }
    
    downsample_config = {}
    
  5. Create and Run Trainer

    from encoding.trainer import AbstractTrainer
    
    trainer = AbstractTrainer(
        assembly=assembly,
        feature_extractors=[extractor],
        downsampler=downsampler,
        model=model,
        fir_delays=fir_delays,
        trimming_config=trimming_config,
        use_train_test_split=True,
        logger_backend="wandb",
        wandb_project_name="lebel-embeddings",
        dataset_type="lebel",
        results_dir="results",
        downsample_config=downsample_config,
    )
    
    metrics = trainer.train()
    print(f"Median correlation: {metrics.get('median_score', float('nan')):.4f}")
    

Understanding Static Embeddings

Key Parameters

  • modality: “embeddings” - specifies the feature type

  • model_name: “word2vec” - identifier for the extractor

  • vector_path: Path to the embedding file

  • binary: True for .bin files, False for .txt files

  • lowercase: Whether to lowercase tokens before lookup

  • oov_handling: How to handle out-of-vocabulary words

  • use_tqdm: Whether to show progress bar

  • cache_dir: “cache” - directory for caching

Embedding Models

Supported embedding models include: - Word2Vec: Google News vectors, custom Word2Vec models - GloVe: Stanford GloVe embeddings - Custom embeddings: Any compatible embedding format

File Formats

Supported file formats: - Binary files (.bin): Set binary=True - Text files (.txt): Set binary=False - Compressed files (.gz): Automatically handled

OOV Handling

Out-of-vocabulary (OOV) word handling strategies: - “copy_prev”: Use the previous word’s embedding - “zero”: Use zero vector - “random”: Use random vector - “mean”: Use mean of all embeddings

Choose based on your research question and data characteristics.

Training Configuration

  • fir_delays: [1, 2, 3, 4] - temporal delays for hemodynamic response

  • trimming_config: LeBel-specific trimming to avoid boundary effects

  • downsample_config: {} - no additional downsampling configuration needed