Static Embeddings Tutorial
==========================

This tutorial shows how to train encoding models on the LeBel assembly using static word embeddings, pre-trained word representations that can be highly predictive of brain activity.

Overview
--------

Static embeddings capture semantic relationships between words using pre-trained models such as Word2Vec or GloVe, giving each word in the stimulus a rich semantic representation.

Key Components
--------------

- **Assembly**: Pre-packaged LeBel assembly containing brain data and stimuli
- **Feature Extractor**: ``StaticEmbeddingFeatureExtractor`` backed by pre-trained embeddings
- **Embedding Models**: Word2Vec, GloVe, or other static embedding models
- **Downsampler**: Aligns word-level features with brain data timing
- **Model**: Ridge regression with nested cross-validation
- **Trainer**: ``AbstractTrainer`` orchestrates the entire pipeline

Step-by-Step Tutorial
---------------------

1. **Load the Assembly**

   .. code-block:: python

      from encoding.assembly.assembly_loader import load_assembly

      # Load the pre-packaged LeBel assembly
      assembly = load_assembly("assembly_lebel_uts03.pkl")

2. **Create Static Embedding Feature Extractor**

   .. code-block:: python

      from encoding.features.factory import FeatureExtractorFactory

      # You need to provide the path to your embedding file
      vector_path = "/path/to/your/embeddings.bin.gz"  # Replace with your path

      extractor = FeatureExtractorFactory.create_extractor(
          modality="embeddings",
          model_name="word2vec",  # Can be "word2vec", "glove", or any identifier
          config={
              "vector_path": vector_path,
              "binary": True,        # True for .bin files, False for .txt files
              "lowercase": False,    # True if your embeddings expect lowercase tokens
              "oov_handling": "copy_prev",  # How to handle out-of-vocabulary words
              "use_tqdm": True,      # Show a progress bar
          },
          cache_dir="cache",
      )

3. **Set Up Downsampler and Model**

   .. code-block:: python

      from encoding.downsample.downsampling import Downsampler
      from encoding.models.nested_cv import NestedCVModel

      downsampler = Downsampler()
      model = NestedCVModel(model_name="ridge_regression")

4. **Configure Training Parameters**

   .. code-block:: python

      # FIR delays for hemodynamic response modeling
      fir_delays = [1, 2, 3, 4]

      # Trimming configuration for the LeBel dataset
      trimming_config = {
          "train_features_start": 10,
          "train_features_end": -5,
          "train_targets_start": 0,
          "train_targets_end": None,
          "test_features_start": 50,
          "test_features_end": -5,
          "test_targets_start": 40,
          "test_targets_end": None,
      }

      downsample_config = {}
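   The start/end values in ``trimming_config`` are ordinary Python slice
   offsets: negative values trim from the tail, and ``None`` leaves that
   edge untouched. A minimal sketch of the slicing behaviour (the ``trim``
   helper below is illustrative, not part of the library):

   .. code-block:: python

      import numpy as np

      def trim(array, start, end):
          # end=None keeps the array through its final sample;
          # a negative end drops samples from the tail (standard slicing)
          return array[start:end]

      train_features = np.arange(300)[:, None]  # 300 timepoints of toy data
      trimmed = trim(train_features, 10, -5)    # train_features_start/_end above
      assert trimmed.shape[0] == 285            # 300 - 10 - 5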
5. **Create and Run Trainer**

   .. code-block:: python

      from encoding.trainer import AbstractTrainer

      trainer = AbstractTrainer(
          assembly=assembly,
          feature_extractors=[extractor],
          downsampler=downsampler,
          model=model,
          fir_delays=fir_delays,
          trimming_config=trimming_config,
          use_train_test_split=True,
          logger_backend="wandb",
          wandb_project_name="lebel-embeddings",
          dataset_type="lebel",
          results_dir="results",
          downsample_config=downsample_config,
      )

      metrics = trainer.train()
      print(f"Median correlation: {metrics.get('median_score', float('nan')):.4f}")

Understanding Static Embeddings
-------------------------------

A static embedding assigns each word a single fixed vector, so the same word always yields the same feature regardless of its surrounding context. The parameters below control how those vectors are loaded and looked up.

Key Parameters
--------------

- **modality**: ``"embeddings"`` selects the static embedding feature type
- **model_name**: ``"word2vec"`` identifier for the extractor
- **vector_path**: path to the embedding file
- **binary**: ``True`` for ``.bin`` files, ``False`` for ``.txt`` files
- **lowercase**: whether to lowercase tokens before lookup
- **oov_handling**: how to handle out-of-vocabulary words
- **use_tqdm**: whether to show a progress bar
- **cache_dir**: directory for caching extracted features (``"cache"`` above)

Embedding Models
----------------

Supported embedding models include:

- **Word2Vec**: Google News vectors, custom Word2Vec models
- **GloVe**: Stanford GloVe embeddings
- **Custom embeddings**: any compatible embedding format

File Formats
------------

Supported file formats:

- **Binary files (.bin)**: set ``binary=True``
- **Text files (.txt)**: set ``binary=False``
- **Compressed files (.gz)**: handled automatically

OOV Handling
------------

Out-of-vocabulary (OOV) word handling strategies:

- **"copy_prev"**: reuse the previous word's embedding
- **"zero"**: use a zero vector
- **"random"**: use a random vector
- **"mean"**: use the mean of all embeddings

Choose a strategy based on your research question and data characteristics; a sketch of how these strategies behave follows the Training Configuration section below.

Training Configuration
----------------------

- **fir_delays**: ``[1, 2, 3, 4]`` temporal delays for modeling the hemodynamic response
- **trimming_config**: LeBel-specific trimming to avoid boundary effects
- **downsample_config**: ``{}`` no additional downsampling configuration is needed
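To make the OOV strategies concrete, here is a minimal sketch of the kind of lookup an extractor might perform, assuming a gensim-compatible Word2Vec file. The ``embed_words`` helper is hypothetical and only illustrates the four strategies; it is not the library's implementation.

.. code-block:: python

   import numpy as np
   from gensim.models import KeyedVectors

   def embed_words(words, kv, oov_handling="copy_prev", lowercase=False, seed=0):
       """Map words to vectors, filling OOV entries per the chosen strategy."""
       rng = np.random.default_rng(seed)
       mean_vec = kv.vectors.mean(axis=0)   # used by the "mean" strategy
       prev = np.zeros(kv.vector_size)      # fallback if the first word is OOV
       out = []
       for word in words:
           token = word.lower() if lowercase else word
           if token in kv.key_to_index:     # in-vocabulary lookup
               vec = kv[token]
           elif oov_handling == "copy_prev":
               vec = prev
           elif oov_handling == "zero":
               vec = np.zeros(kv.vector_size)
           elif oov_handling == "random":
               vec = rng.standard_normal(kv.vector_size)
           elif oov_handling == "mean":
               vec = mean_vec
           else:
               raise ValueError(f"Unknown oov_handling: {oov_handling}")
           prev = vec
           out.append(vec)
       return np.stack(out)

   # binary=True matches .bin files; .gz compression is handled automatically
   # kv = KeyedVectors.load_word2vec_format("/path/to/your/embeddings.bin.gz", binary=True)
   # features = embed_words(["the", "hippocampus", "zzzxq"], kv)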