encoding.features
- class encoding.features.LanguageModelFeatureExtractor(config: Dict[str, Any])[source]
Feature extractor that uses HookedTransformer to extract embeddings from text.
This extractor supports different language models and can extract features from either the last token or an average across all tokens. It supports multi-layer extraction with lazy model loading. A usage sketch appears below, after the method listings.
- __init__(config: Dict[str, Any])[source]
Initialize the language model feature extractor.
- Parameters:
config (Dict[str, Any]) – Configuration dictionary containing:
- model_name (str): Name of the language model to use
- layer_idx (int): Index of the layer to extract features from (kept for backward compatibility)
- hook_type (str): Type of hook to use (default: "hook_resid_pre")
- last_token (bool): Whether to use only the last token's features
- device (str): Device to run the model on ('cuda' or 'cpu')
- context_type (str): Type of context to use ('fullcontext', 'nocontext', or 'halfcontext')
- extract_features(stimuli: str | List[str], layer_idx: int | None = None, **kwargs) numpy.ndarray [source]
Extract features from the input stimuli, processing each stimulus sequentially.
- Parameters:
stimuli (Union[str, List[str]]) – Input text or list of texts
layer_idx (Optional[int]) – Specific layer to extract from. If None, uses self.layer_idx
**kwargs – Additional arguments for feature extraction
- Returns:
Extracted features
- Return type:
np.ndarray
- extract_all_layers(stimuli: str | List[str], **kwargs) Dict[int, numpy.ndarray] [source]
Extract features from all layers for the input stimuli.
- Parameters:
stimuli (Union[str, List[str]]) – Input text or list of texts
**kwargs – Additional arguments for feature extraction
- Returns:
Dictionary mapping layer indices to features
- Return type:
Dict[int, np.ndarray]
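A minimal usage sketch (the import path matches the module shown here; the config values are illustrative):

    from encoding.features import LanguageModelFeatureExtractor

    config = {
        "model_name": "gpt2-small",       # any HookedTransformer-supported model
        "layer_idx": 9,                   # default layer for extract_features
        "hook_type": "hook_resid_pre",
        "last_token": True,               # one vector per stimulus (last token only)
        "device": "cpu",
        "context_type": "fullcontext",
    }
    extractor = LanguageModelFeatureExtractor(config)

    feats = extractor.extract_features(["The cat sat.", "On the mat."])   # np.ndarray
    per_layer = extractor.extract_all_layers(["The cat sat."])            # {layer_idx: np.ndarray}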
- class encoding.features.SpeechFeatureExtractor(model_name: str, chunk_size: float, context_size: float, layer: str | int = 'last', pool: str = 'last', device: str | None = None, target_sample_rate: int = 16000, disable_tqdm: bool = False)[source]
Unified feature extractor for HF speech models (Whisper encoder, HuBERT, Wav2Vec2).
extract_features(wav_path, layer=None) -> (features [n_chunks, D], times [n_chunks])
extract_all_layers(wav_path) -> (layer_to_features {idx: [n_chunks, D]}, times [n_chunks])
Notes
Pooling over encoder time can be 'last' or 'mean'.
For Whisper, only the encoder is called (model.get_encoder()).
'layer' indices are 0-based over the encoder blocks (the embedding layer is excluded).
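A usage sketch; the model id, chunk/context sizes, and audio path are illustrative:

    from encoding.features import SpeechFeatureExtractor

    extractor = SpeechFeatureExtractor(
        model_name="openai/whisper-tiny",   # HF id; HuBERT / Wav2Vec2 ids also work
        chunk_size=2.0,                     # seconds of audio per output row
        context_size=16.0,                  # seconds of context fed to the encoder
        layer="last",
        pool="mean",
    )

    features, times = extractor.extract_features("story.wav")        # [n_chunks, D], [n_chunks]
    layer_feats, times = extractor.extract_all_layers("story.wav")   # {idx: [n_chunks, D]}, [n_chunks]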
- class encoding.features.WordRateFeatureExtractor(config: Dict[str, Any])[source]
Feature extractor for pre-computed word rate features.
- class encoding.features.StaticEmbeddingFeatureExtractor(config: Dict[str, Any])[source]
Local-only static token embedding extractor (Word2Vec / GloVe).
- Input (extract_features):
List[str]: list of tokens/words (preferred), order preserved
str: a raw string (will be tokenized using tokenizer_pattern)
- Output:
np.ndarray with shape [N, D], one row per input token.
- Config (Dict[str, Any]):
- lowercase (bool): lowercase tokens before lookup
(GoogleNews: False; GloVe/Wiki-Giga: True) [default: True]
- oov_handling (str): one of:
"copy_prev" -> OOV copies the previous valid embedding (default)
"zero" -> OOV becomes a zero vector (length preserved)
"skip" -> OOV is dropped (length may shrink)
"error" -> raise on first OOV
- use_tqdm (bool): show a progress bar for long inputs [default: True]
- mmap (bool): memory-map the .kv file [default: True]
- binary (Optional[bool]): force the word2vec binary flag; auto-inferred if None
- no_header (Optional[bool]): force GloVe no-header mode; auto-inferred if None
- l2_normalize_tokens (bool): L2-normalize each token vector [default: False]
- tokenizer_pattern (str): ONLY used if the input is a single string.
Default: r"[A-Za-z0-9_']+" (keeps underscores)
Note: This has also been tested with ENG1000; you just have to convert it to the .kv format first. We'll provide a script to do that! A usage sketch follows.
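In the sketch below, the key naming the vectors file (shown as "path") is an assumption, since it is not listed in the config reference above, and the .kv filename is hypothetical:

    from encoding.features import StaticEmbeddingFeatureExtractor

    config = {
        "path": "glove-wiki-gigaword-300.kv",  # ASSUMED key name; point at your converted .kv file
        "lowercase": True,                     # GloVe/Wiki-Giga vocabularies are lowercased
        "oov_handling": "copy_prev",
        "mmap": True,
    }
    extractor = StaticEmbeddingFeatureExtractor(config)

    vecs = extractor.extract_features(["the", "quick", "brown", "fox"])
    # vecs.shape == (4, D): one row per token, order preserved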
- class encoding.features.FIR(delays: Iterable[int] | None = None, circpad: bool = False)[source]
Finite Impulse Response (FIR) expander for creating delayed feature matrices.
- Usage options:
Static/class usage: FIR.make_delayed(stim, delays, circpad=False)
Instance usage: FIR(delays, circpad).expand(stim)
- delays: Iterable[int] | None = None
- circpad: bool = False
- static make_delayed(stim: numpy.ndarray, delays: Iterable[int], circpad: bool = False) numpy.ndarray [source]
- valid_length(nt: int) int [source]
Number of valid (non-padded) time points. With circpad=True this is always nt; without circular padding it depends on the maximum shift.
- summary(input_dim: int | None = None, nt: int | None = None) str [source]
Return a readable summary of FIR configuration.
- __init__(delays: Iterable[int] | None = None, circpad: bool = False) None
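A sketch of both usage options; the shapes shown assume the conventional behavior of stacking one time-shifted copy of the stimulus matrix per delay along the feature axis:

    import numpy as np
    from encoding.features import FIR

    stim = np.random.randn(100, 5)   # [n_timepoints, n_features]
    delays = [1, 2, 3, 4]            # e.g., HRF delays in TRs

    delayed = FIR.make_delayed(stim, delays)             # static usage
    delayed2 = FIR(delays, circpad=False).expand(stim)   # instance usage, same result

    # One shifted copy per delay, stacked along features: [100, 5 * 4]
    assert delayed.shape == (100, 20)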
- class encoding.features.FeatureExtractorFactory[source]
Factory class for creating feature extractors with caching support.
- classmethod create_extractor(modality: str, model_name: str, config: Dict[str, Any], cache_dir: str = 'cache') BaseFeatureExtractor [source]
Create a feature extractor based on modality and model name.
- Parameters:
modality – The type of feature extractor ('language_model', 'speech', 'wordrate', 'embeddings')
model_name – The specific model name (e.g., 'gpt2-small', 'word2vec', 'openai/whisper-tiny')
config – Configuration dictionary for the extractor
cache_dir – Directory for caching
- Returns:
The appropriate feature extractor instance
- Return type:
BaseFeatureExtractor
- Raises:
ValueError – If modality is not supported
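A usage sketch with illustrative values:

    from encoding.features import FeatureExtractorFactory

    extractor = FeatureExtractorFactory.create_extractor(
        modality="language_model",
        model_name="gpt2-small",
        config={
            "model_name": "gpt2-small",
            "layer_idx": 9,
            "last_token": True,
            "device": "cpu",
            "context_type": "fullcontext",
        },
        cache_dir="cache",
    )
    feats = extractor.extract_features("The quick brown fox jumps.")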
- classmethod extract_features_with_caching(extractor: BaseFeatureExtractor, assembly: Any, story: str, idx: int, layer_idx: int = 9, lookback: int = 256, dataset_type: str = 'narratives') numpy.ndarray | Tuple[numpy.ndarray, numpy.ndarray] [source]
Extract features with caching support.
- Parameters:
extractor – The feature extractor instance
assembly – The assembly containing data
story – Story name
idx – Story index
layer_idx – Layer index for multi-layer extractors
lookback – Number of tokens to look back (for language models)
dataset_type – Type of dataset (e.g., 'narratives', 'lebel')
- Returns:
Features array, or (features, times) tuple for speech
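A sketch of the cached path; assembly is assumed to have been loaded elsewhere, and the story name is hypothetical:

    feats = FeatureExtractorFactory.extract_features_with_caching(
        extractor=extractor,   # e.g., the instance from create_extractor above
        assembly=assembly,     # dataset assembly providing the stimuli
        story="pieman",        # hypothetical story name
        idx=0,
        layer_idx=9,
        lookback=256,
        dataset_type="narratives",
    )
    # Speech extractors return a (features, times) tuple instead.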