encoding.features

class encoding.features.LanguageModelFeatureExtractor(config: Dict[str, Any])[source]

Feature extractor that uses HookedTransformer to extract embeddings from text.

This extractor supports different language models and can extract features from either the last token or the average across all tokens. It also supports multi-layer extraction with lazy loading.

__init__(config: Dict[str, Any])[source]

Initialize the language model feature extractor.

Parameters:

config (Dict[str, Any]) – Configuration dictionary containing:
- model_name (str): Name of the language model to use
- layer_idx (int): Index of the layer to extract features from (for backward compatibility)
- hook_type (str): Type of hook to use (default: “hook_resid_pre”)
- last_token (bool): Whether to use only the last token’s features
- device (str): Device to run the model on (‘cuda’ or ‘cpu’)
- context_type (str): Type of context to use (fullcontext, nocontext, halfcontext)
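As an illustration of the configuration keys listed above, a dictionary might look like the following. The keys come from the parameter list; the values are illustrative choices (the model name and layer index are borrowed from examples and defaults elsewhere on this page), not recommended settings.

```python
from typing import Any, Dict

# Illustrative values only; every key is taken from the parameter list above.
config: Dict[str, Any] = {
    "model_name": "gpt2-small",      # example model name used elsewhere on this page
    "layer_idx": 9,                  # layer to extract from
    "hook_type": "hook_resid_pre",   # documented default hook type
    "last_token": True,              # use only the last token's features
    "device": "cpu",                 # 'cuda' or 'cpu'
    "context_type": "fullcontext",   # fullcontext, nocontext, or halfcontext
}
```

Such a dictionary would presumably be passed as `LanguageModelFeatureExtractor(config)`.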

extract_features(stimuli: str | List[str], layer_idx: int | None = None, **kwargs) → numpy.ndarray[source]

Extract features from the input stimuli using a for loop.

Parameters:

stimuli (Union[str, List[str]]) – Input text or list of texts
layer_idx (Optional[int]) – Specific layer to extract from. If None, uses self.layer_idx
**kwargs – Additional arguments for feature extraction

Returns:

Extracted features

Return type:

np.ndarray
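The difference between last-token and averaged features can be sketched with plain NumPy. This is a minimal illustration, not the extractor's actual code: `pool_tokens` is a hypothetical helper, and the activation matrix is made-up data standing in for one stimulus's per-token activations.

```python
import numpy as np


def pool_tokens(hidden: np.ndarray, last_token: bool) -> np.ndarray:
    """Reduce a (n_tokens, d_model) activation matrix to one (d_model,) vector."""
    if last_token:
        # Keep only the final token's activation.
        return hidden[-1]
    # Otherwise average over the token axis.
    return hidden.mean(axis=0)


# Made-up activations: 3 tokens, 2 hidden dimensions.
hidden = np.array([[1.0, 2.0], [3.0, 4.0], [5.0, 9.0]])
last_vec = pool_tokens(hidden, last_token=True)    # array([5., 9.])
mean_vec = pool_tokens(hidden, last_token=False)   # array([3., 5.])
```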

extract_all_layers(stimuli: str | List[str], **kwargs) → Dict[int, numpy.ndarray][source]

Extract features from all layers for the input stimuli.

Parameters:

stimuli (Union[str, List[str]]) – Input text or list of texts
**kwargs – Additional arguments for feature extraction

Returns:

Dictionary mapping layer indices to features

Return type:

Dict[int, np.ndarray]
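One way to work with the returned mapping is to stack the per-layer arrays into a single tensor. The sketch below uses a hand-built dictionary as a stand-in for the real return value and assumes every layer yields an array of the same shape.

```python
from typing import Dict

import numpy as np

# Stand-in for the return value of extract_all_layers: 2 layers,
# each with 4 stimuli and 3 feature dimensions (made-up numbers).
layer_features: Dict[int, np.ndarray] = {
    layer: np.full((4, 3), float(layer)) for layer in range(2)
}

# Stack in ascending layer order -> (n_layers, n_stimuli, n_features).
stacked = np.stack([layer_features[k] for k in sorted(layer_features)])
```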

class encoding.features.SpeechFeatureExtractor(model_name: str, chunk_size: float, context_size: float, layer: str | int = 'last', pool: str = 'last', device: str | None = None, target_sample_rate: int = 16000, disable_tqdm: bool = False)[source]

Unified feature extractor for HF speech models (Whisper encoder, HuBERT, Wav2Vec2).

extract_features(wav_path, layer=None) -> (features [n_chunks, D], times [n_chunks])
extract_all_layers(wav_path) -> (layer_to_features {idx: [n_chunks, D]}, times [n_chunks])

Notes

Pooling over encoder time can be ‘last’ or ‘mean’.
For Whisper, we call the ENCODER ONLY (model.get_encoder()).
‘layer’ indices are 0-based over encoder blocks (exclude embeddings).
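The indexing and pooling notes above can be sketched as follows. This assumes the Hugging Face convention that a model run with `output_hidden_states=True` returns the embedding output first, followed by one entry per encoder block; `select_and_pool` is a hypothetical helper, and the arrays are made-up stand-ins for real hidden states.

```python
from typing import Sequence, Union

import numpy as np


def select_and_pool(
    hidden_states: Sequence[np.ndarray],
    layer: Union[str, int] = "last",
    pool: str = "last",
) -> np.ndarray:
    """Pick one encoder block's (T, D) output and pool it over time."""
    # Entry 0 is assumed to be the embedding output, so block i sits at i + 1.
    blocks = hidden_states[1:]
    idx = len(blocks) - 1 if layer == "last" else int(layer)
    block_output = blocks[idx]
    if pool == "last":
        return block_output[-1]
    if pool == "mean":
        return block_output.mean(axis=0)
    raise ValueError(f"pool must be 'last' or 'mean', got {pool!r}")


# Embedding output plus 3 encoder blocks, each (T=5, D=4), filled with the entry index.
hidden_states = tuple(np.full((5, 4), float(i)) for i in range(4))
first_block = select_and_pool(hidden_states, layer=0, pool="mean")
final_block = select_and_pool(hidden_states, layer="last", pool="last")
```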

__init__(model_name: str, chunk_size: float, context_size: float, layer: str | int = 'last', pool: str = 'last', device: str | None = None, target_sample_rate: int = 16000, disable_tqdm: bool = False)[source]

extract_features(*args, **kwargs)[source]

extract_all_layers(*args, **kwargs)[source]

class encoding.features.WordRateFeatureExtractor(config: Dict[str, Any])[source]

Feature extractor for pre-computed word rate features.

__init__(config: Dict[str, Any])[source]

Initialize the feature extractor with configuration.

Parameters: config (Dict[str, Any]) – Configuration dictionary containing extractor parameters

extract_features(stimuli: numpy.ndarray, **kwargs) → numpy.ndarray[source]

Return pre-computed word rate features.

Parameters: stimuli – Pre-computed word rate array
Returns: Word rate features with shape (n_timepoints, 1)
Return type: np.ndarray
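The documented output shape, (n_timepoints, 1), amounts to a single-column matrix. A minimal sketch of producing that shape from a 1-D word rate vector is shown below; `as_column` is a hypothetical helper, and whether the extractor itself reshapes 1-D input is an assumption.

```python
import numpy as np


def as_column(word_rate: np.ndarray) -> np.ndarray:
    """Return word rate as a float array of shape (n_timepoints, 1)."""
    arr = np.asarray(word_rate, dtype=float)
    if arr.ndim == 1:
        # Assumption: a flat vector is treated as one feature per timepoint.
        return arr.reshape(-1, 1)
    if arr.ndim == 2 and arr.shape[1] == 1:
        return arr
    raise ValueError(f"expected shape (n,) or (n, 1), got {arr.shape}")


# Made-up word counts per timepoint.
column = as_column(np.array([2, 0, 3, 1]))
```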

class encoding.features.StaticEmbeddingFeatureExtractor(config: Dict[str, Any])[source]

Local-only static token embedding extractor (Word2Vec / GloVe).

Input (extract_features):

List[str]: list of tokens/words (preferred), order preserved
str: a raw string (will be tokenized using tokenizer_pattern)

Output:

np.ndarray with shape [N, D], one row per input token (fewer rows when oov_handling is “skip”).

Config (Dict[str, Any]):

vector_path (str, required): local vectors path. Supported:
    *.kv -> KeyedVectors.load (mmap capable)
    *.bin / *.bin.gz -> word2vec binary (binary=True)
    *.w2v.txt -> word2vec text WITH header (binary=False, no_header=False)
    *.txt / *.txt.gz -> GloVe text WITHOUT header (binary=False, no_header=True)
lowercase (bool): lowercase tokens before lookup (GoogleNews: False; GloVe/Wiki-Giga: True) [default: True]
oov_handling (str): one of:
    “copy_prev” -> OOV copies the previous valid embedding (DEFAULT)
    “zero” -> OOV becomes a zero vector (length preserved)
    “skip” -> OOV is dropped (length may shrink)
    “error” -> raise on first OOV
use_tqdm (bool): show progress bar for long inputs [default: True]
mmap (bool): memory-map .kv [default: True]
binary (Optional[bool]): force word2vec binary flag; auto-infer if None
no_header (Optional[bool]): force GloVe no-header; auto-infer if None
l2_normalize_tokens (bool): L2-normalize each token vector [default: False]
tokenizer_pattern (str): ONLY used if input is a single string. Default r"[A-Za-z0-9_']+" (keeps underscores)

Note: This has also been tested with ENG1000. You just have to convert it to the .kv format first. We’ll provide a script to do that!
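The tokenization and OOV options described above can be illustrated with a toy vocabulary. This is a sketch under stated assumptions, not the extractor's implementation: `embed_tokens` is a hypothetical helper, the vectors are made up, and using a zero vector for "copy_prev" when no valid token has been seen yet is an assumption.

```python
import re
from typing import Dict, List

import numpy as np

TOKENIZER_PATTERN = r"[A-Za-z0-9_']+"  # documented default pattern


def embed_tokens(
    tokens: List[str],
    vectors: Dict[str, np.ndarray],
    dim: int,
    oov_handling: str = "copy_prev",
    lowercase: bool = True,
) -> np.ndarray:
    """Look up each token and apply one of the four OOV policies."""
    rows = []
    prev = np.zeros(dim)  # assumption: fallback before any valid token is seen
    for tok in tokens:
        key = tok.lower() if lowercase else tok
        if key in vectors:
            prev = vectors[key]
            rows.append(prev)
        elif oov_handling == "copy_prev":
            rows.append(prev)
        elif oov_handling == "zero":
            rows.append(np.zeros(dim))
        elif oov_handling == "skip":
            continue
        elif oov_handling == "error":
            raise KeyError(f"out-of-vocabulary token: {tok!r}")
        else:
            raise ValueError(f"unknown oov_handling: {oov_handling!r}")
    return np.vstack(rows) if rows else np.empty((0, dim))


# Toy 2-dimensional vocabulary; "xyzzy" is deliberately missing.
vectors = {"the": np.array([1.0, 0.0]), "cat_sat": np.array([0.0, 1.0])}
tokens = re.findall(TOKENIZER_PATTERN, "The cat_sat, xyzzy!")
copied = embed_tokens(tokens, vectors, dim=2, oov_handling="copy_prev")
zeroed = embed_tokens(tokens, vectors, dim=2, oov_handling="zero")
skipped = embed_tokens(tokens, vectors, dim=2, oov_handling="skip")
```

Only "skip" changes the number of rows, which matters when the output has to stay aligned with per-token timing information.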

__init__(config: Dict[str, Any])[source]

Initialize the feature extractor with configuration.

Parameters: config (Dict[str, Any]) – Configuration dictionary containing extractor parameters

extract_features(stimuli: str | List[str], **kwargs) → numpy.ndarray[source]: Tokens -> [N, D], one row per input token (fewer rows when oov_handling is “skip”). If stimuli is a string, it is tokenized. OOV handling per config (default: copy previous valid embedding).

class encoding.features.FIR(delays: Iterable[int] | None = None, circpad: bool = False)[source]

Finite Impulse Response (FIR) expander for creating delayed feature matrices.

Usage options:

Static/class usage: FIR.make_delayed(stim, delays, circpad=False)
Instance usage: FIR(delays, circpad).expand(stim)

delays: Iterable[int] | None = None

circpad: bool = False

expand(stim: numpy.ndarray) → numpy.ndarray[source]

static make_delayed(stim: numpy.ndarray, delays: Iterable[int], circpad: bool = False) → numpy.ndarray[source]
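To make the idea of a delayed feature matrix concrete, here is a minimal NumPy sketch of a common FIR construction: one time-shifted copy of the stimulus per delay, stacked horizontally. It is not the library's code. `make_delayed_sketch` is a hypothetical name, zero-filling without `circpad` and wrap-around with `circpad` are assumptions about the behavior, and each delay is assumed to be smaller in magnitude than the number of time points.

```python
from typing import Iterable

import numpy as np


def make_delayed_sketch(
    stim: np.ndarray, delays: Iterable[int], circpad: bool = False
) -> np.ndarray:
    """Horizontally stack one shifted copy of `stim` (nt, ndim) per delay."""
    nt = stim.shape[0]
    shifted_copies = []
    for d in delays:
        if circpad:
            # Wrap around: rows pushed past one end re-enter at the other.
            shifted = np.roll(stim, d, axis=0)
        else:
            # Zero-fill the rows that have no source sample.
            shifted = np.zeros_like(stim)
            if d > 0:
                shifted[d:] = stim[: nt - d]
            elif d < 0:
                shifted[:d] = stim[-d:]
            else:
                shifted[:] = stim
        shifted_copies.append(shifted)
    return np.hstack(shifted_copies)


# 4 time points, 1 feature, delays of 0, 1, and 2 samples.
stim = np.arange(4.0).reshape(4, 1)
delayed = make_delayed_sketch(stim, [0, 1, 2])
wrapped = make_delayed_sketch(stim, [1], circpad=True)
```

In this example, each column of `delayed` holds the stimulus shifted down by one more sample, with zeros filling the top rows.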

n_delays() → int[source]: Return the number of delays used.

output_dim(input_dim: int) → int[source]: Return the output dimensionality after FIR expansion.

valid_length(nt: int) → int[source]: Number of valid time points (non-padded). With circpad=True, always nt. Without circpad, depends on max shift.
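The bookkeeping behind these helpers can be sketched as simple arithmetic. The functions below are hypothetical stand-ins, not the class's methods. The output width assumes one copy of the input per delay, and the non-circular valid length assumes that rows touched by zero padding at either end are excluded, which the documentation above does not spell out.

```python
from typing import Sequence


def output_dim_sketch(input_dim: int, delays: Sequence[int]) -> int:
    """Width after expansion, assuming one copy of the input per delay."""
    return input_dim * len(delays)


def valid_length_sketch(nt: int, delays: Sequence[int], circpad: bool = False) -> int:
    """Rows with no zero padding; equals nt under circular padding."""
    if circpad or not delays:
        return nt
    # Assumption: positive delays pad the start, negative delays pad the end.
    padded_start = max(0, max(delays))
    padded_end = max(0, -min(delays))
    return max(0, nt - padded_start - padded_end)


width = output_dim_sketch(5, [1, 2, 3])                  # 15
causal_valid = valid_length_sketch(10, [1, 2, 3])        # 7
mixed_valid = valid_length_sketch(10, [-1, 0, 2])        # 7
circular_valid = valid_length_sketch(10, [1, 2, 3], circpad=True)  # 10
```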

summary(input_dim: int | None = None, nt: int | None = None) → str[source]: Return a readable summary of FIR configuration.

__init__(delays: Iterable[int] | None = None, circpad: bool = False) → None

class encoding.features.FeatureExtractorFactory[source]

Factory class for creating feature extractors with caching support.

classmethod create_extractor(modality: str, model_name: str, config: Dict[str, Any], cache_dir: str = 'cache') → BaseFeatureExtractor[source]

Create a feature extractor based on modality and model name.

Parameters:

modality – The type of feature extractor (‘language_model’, ‘speech’, ‘wordrate’, ‘embeddings’)
model_name – The specific model name (e.g., ‘gpt2-small’, ‘word2vec’, ‘openai/whisper-tiny’)
config – Configuration dictionary for the extractor
cache_dir – Directory for caching

Returns:

The appropriate feature extractor instance

Return type:

BaseFeatureExtractor

Raises:

ValueError – If modality is not supported

classmethod extract_features_with_caching(extractor: BaseFeatureExtractor, assembly: Any, story: str, idx: int, layer_idx: int = 9, lookback: int = 256, dataset_type: str = 'narratives') → numpy.ndarray | Tuple[numpy.ndarray, numpy.ndarray][source]

Extract features with caching support.

Parameters:

extractor – The feature extractor instance
assembly – The assembly containing data
story – Story name
idx – Story index
layer_idx – Layer index for multi-layer extractors
lookback – Number of tokens to look back (for language models)
dataset_type – Type of dataset (e.g., ‘narratives’, ‘lebel’, etc.)

Returns:

Features array, or (features, times) tuple for speech
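Because the return value is either a bare array or a (features, times) tuple, calling code may want to normalize it. The helper below is a hypothetical sketch of one way to do that; the arrays are made-up stand-ins for real results.

```python
from typing import Optional, Tuple, Union

import numpy as np

Result = Union[np.ndarray, Tuple[np.ndarray, np.ndarray]]


def unpack_features(result: Result) -> Tuple[np.ndarray, Optional[np.ndarray]]:
    """Return (features, times), with times set to None for non-speech results."""
    if isinstance(result, tuple):
        features, times = result
        return features, times
    return result, None


# Made-up results: a plain feature array and a speech-style tuple.
text_features, text_times = unpack_features(np.zeros((3, 2)))
speech_features, speech_times = unpack_features((np.zeros((3, 2)), np.arange(3.0)))
```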

classmethod get_supported_modalities() → list[source]: Get list of supported modalities.

classmethod register_extractor(modality: str, extractor_class: type)[source]

Register a new feature extractor class.

Parameters:

modality – The modality name
extractor_class – The extractor class to register
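The registration and lookup behavior described for the factory follows a familiar registry pattern, sketched below with toy classes. `ToyFactory` and `ToyExtractor` are hypothetical and only mirror the documented method names; the real factory also takes `model_name` and `cache_dir`, which this sketch omits.

```python
from typing import Any, Dict, List


class ToyExtractor:
    """Stand-in extractor that just stores its config."""

    def __init__(self, config: Dict[str, Any]) -> None:
        self.config = config


class ToyFactory:
    """Minimal registry keyed by modality name."""

    _registry: Dict[str, type] = {}

    @classmethod
    def register_extractor(cls, modality: str, extractor_class: type) -> None:
        cls._registry[modality] = extractor_class

    @classmethod
    def get_supported_modalities(cls) -> List[str]:
        return list(cls._registry)

    @classmethod
    def create_extractor(cls, modality: str, config: Dict[str, Any]) -> Any:
        if modality not in cls._registry:
            # Mirrors the documented ValueError for unsupported modalities.
            raise ValueError(f"Unsupported modality: {modality!r}")
        return cls._registry[modality](config)


ToyFactory.register_extractor("toy", ToyExtractor)
extractor = ToyFactory.create_extractor("toy", {"example_key": 1})
```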