encoding.features

class encoding.features.LanguageModelFeatureExtractor(config: Dict[str, Any])[source]

Feature extractor that uses HookedTransformer to extract embeddings from text.

This extractor supports different language models and can take features from either the last token or the average across all tokens. It also supports multi-layer extraction with lazy loading.

__init__(config: Dict[str, Any])[source]

Initialize the language model feature extractor.

Parameters:

config (Dict[str, Any]) – Configuration dictionary containing:

  • model_name (str): Name of the language model to use

  • layer_idx (int): Index of the layer to extract features from (for backward compatibility)

  • hook_type (str): Type of hook to use (default: “hook_resid_pre”)

  • last_token (bool): Whether to use only the last token’s features

  • device (str): Device to run the model on (‘cuda’ or ‘cpu’)

  • context_type (str): Type of context to use (fullcontext, nocontext, halfcontext)
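A configuration of this shape might look like the sketch below. The values are illustrative only (`gpt2-small` and layer 9 are borrowed from examples and defaults elsewhere in this reference), and the constructor call is left commented out because it needs the `encoding` package and a model download.

```python
from typing import Any, Dict

# Illustrative values only; every key is one of those documented above.
config: Dict[str, Any] = {
    "model_name": "gpt2-small",
    "layer_idx": 9,
    "hook_type": "hook_resid_pre",
    "last_token": True,
    "device": "cpu",
    "context_type": "fullcontext",
}

# Sanity-check the context type against the three documented options.
allowed_context_types = {"fullcontext", "nocontext", "halfcontext"}
assert config["context_type"] in allowed_context_types

# With the package installed, the extractor would presumably be built as:
# from encoding.features import LanguageModelFeatureExtractor
# extractor = LanguageModelFeatureExtractor(config)
# features = extractor.extract_features(["the quick brown fox"])
```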

extract_features(stimuli: str | List[str], layer_idx: int | None = None, **kwargs) numpy.ndarray[source]

Extract features from the input stimuli using a for loop.

Parameters:
  • stimuli (Union[str, List[str]]) – Input text or list of texts

  • layer_idx (Optional[int]) – Specific layer to extract from. If None, uses self.layer_idx

  • **kwargs – Additional arguments for feature extraction

Returns:

Extracted features

Return type:

np.ndarray

extract_all_layers(stimuli: str | List[str], **kwargs) Dict[int, numpy.ndarray][source]

Extract features from all layers for the input stimuli.

Parameters:
  • stimuli (Union[str, List[str]]) – Input text or list of texts

  • **kwargs – Additional arguments for feature extraction

Returns:

Dictionary mapping layer indices to features

Return type:

Dict[int, np.ndarray]

class encoding.features.SpeechFeatureExtractor(model_name: str, chunk_size: float, context_size: float, layer: str | int = 'last', pool: str = 'last', device: str | None = None, target_sample_rate: int = 16000, disable_tqdm: bool = False)[source]

Unified feature extractor for HF speech models (Whisper encoder, HuBERT, Wav2Vec2).

  • extract_features(wav_path, layer=None) -> (features [n_chunks, D], times [n_chunks])

  • extract_all_layers(wav_path) -> (layer_to_features {idx: [n_chunks, D]}, times [n_chunks])

Notes

  • Pooling over encoder time can be ‘last’ or ‘mean’.

  • For Whisper, we call the ENCODER ONLY (model.get_encoder()).

  • ‘layer’ indices are 0-based over encoder blocks (exclude embeddings).
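The two pooling modes in the notes above can be illustrated with plain NumPy. This is a minimal sketch, not the extractor's own code: `pool_over_time` is a hypothetical helper, and the input stands in for one chunk's encoder states of shape [T, D].

```python
import numpy as np


def pool_over_time(hidden: np.ndarray, pool: str = "last") -> np.ndarray:
    """Reduce encoder states [T, D] for one chunk to a single vector [D]."""
    if pool == "last":
        return hidden[-1]          # keep only the final encoder time step
    if pool == "mean":
        return hidden.mean(axis=0)  # average over all encoder time steps
    raise ValueError(f"unknown pool mode: {pool!r}")


# Toy encoder output for one chunk: 3 time steps, 2 dimensions.
hidden = np.array([[1.0, 2.0], [3.0, 4.0], [5.0, 9.0]])
last_vec = pool_over_time(hidden, "last")  # array([5., 9.])
mean_vec = pool_over_time(hidden, "mean")  # array([3., 5.])
```

Stacking one pooled vector per chunk yields the [n_chunks, D] feature matrix described above.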

__init__(model_name: str, chunk_size: float, context_size: float, layer: str | int = 'last', pool: str = 'last', device: str | None = None, target_sample_rate: int = 16000, disable_tqdm: bool = False)[source]

extract_features(*args, **kwargs)[source]

extract_all_layers(*args, **kwargs)[source]

class encoding.features.WordRateFeatureExtractor(config: Dict[str, Any])[source]

Feature extractor for pre-computed word rate features.

__init__(config: Dict[str, Any])[source]

Initialize the feature extractor with configuration.

Parameters:

config (Dict[str, Any]) – Configuration dictionary containing extractor parameters

extract_features(stimuli: numpy.ndarray, **kwargs) numpy.ndarray[source]

Return pre-computed word rate features.

Parameters:

stimuli – Pre-computed word rate array

Returns:

Word rate features with shape (n_timepoints, 1)

Return type:

np.ndarray
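As a rough illustration of the documented output shape, the sketch below turns a hypothetical 1-D word-rate array into a single-column feature matrix. Whether the extractor reshapes its input exactly this way is an assumption.

```python
import numpy as np

# Hypothetical pre-computed word counts, one value per time point.
word_rate = np.array([3, 0, 5, 2, 4], dtype=float)

# Assumption: a 1-D array becomes a single-column feature matrix.
features = word_rate.reshape(-1, 1)
print(features.shape)  # (5, 1)
```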

class encoding.features.StaticEmbeddingFeatureExtractor(config: Dict[str, Any])[source]

Local-only static token embedding extractor (Word2Vec / GloVe).

Input (extract_features):
  • List[str]: list of tokens/words (preferred), order preserved

  • str: a raw string (will be tokenized using tokenizer_pattern)

Output:
  • np.ndarray with shape [N, D], one row per input token.

Config (Dict[str, Any]):
  • vector_path (str, required): local vectors path. Supported:

    *.kv -> KeyedVectors.load (mmap capable)
    *.bin / *.bin.gz -> word2vec binary (binary=True)
    *.w2v.txt -> word2vec text WITH header (binary=False, no_header=False)
    *.txt / *.txt.gz -> GloVe text WITHOUT header (binary=False, no_header=True)

  • lowercase (bool): lowercase tokens before lookup

    (GoogleNews: False; GloVe/Wiki-Giga: True) [default: True]

  • oov_handling (str): one of:

    “copy_prev” -> OOV copies the previous valid embedding (DEFAULT)
    “zero” -> OOV becomes a zero vector (length preserved)
    “skip” -> OOV is dropped (length may shrink)
    “error” -> raise on first OOV

  • use_tqdm (bool): show progress bar for long inputs [default: True]

  • mmap (bool): memory-map .kv [default: True]

  • binary (Optional[bool]): force word2vec binary flag; auto-infer if None

  • no_header (Optional[bool]): force GloVe no-header; auto-infer if None

  • l2_normalize_tokens (bool): L2-normalize each token vector [default: False]

  • tokenizer_pattern (str): ONLY used if input is a single string.

    Default r"[A-Za-z0-9_']+" (keeps underscores)

    Note: This has also been tested with ENG1000; you just have to convert it to the .kv format first. We’ll provide a script to do that.
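The four OOV modes and the string tokenizer can be sketched with a toy vocabulary. This is an illustration under stated assumptions rather than the extractor's code: `embed_tokens` and `TOKEN_PATTERN` are hypothetical names, the pattern is assumed to use a straight apostrophe, and a leading OOV under "copy_prev" is assumed to fall back to zeros.

```python
import re
from typing import Dict, List

import numpy as np

# Assumed default pattern; only relevant when the input is a raw string.
TOKEN_PATTERN = r"[A-Za-z0-9_']+"


def embed_tokens(tokens: List[str], vectors: Dict[str, np.ndarray], dim: int,
                 oov_handling: str = "copy_prev", lowercase: bool = True) -> np.ndarray:
    """Look up each token; handle out-of-vocabulary tokens per oov_handling."""
    rows: List[np.ndarray] = []
    prev = np.zeros(dim)  # assumption: no previous vector yet, so use zeros
    for tok in tokens:
        key = tok.lower() if lowercase else tok
        if key in vectors:
            prev = vectors[key]
            rows.append(prev)
        elif oov_handling == "copy_prev":
            rows.append(prev)            # repeat the last valid embedding
        elif oov_handling == "zero":
            rows.append(np.zeros(dim))   # keep the row, fill with zeros
        elif oov_handling == "skip":
            continue                     # drop the row; output gets shorter
        elif oov_handling == "error":
            raise KeyError(f"out-of-vocabulary token: {tok!r}")
        else:
            raise ValueError(f"unknown oov_handling: {oov_handling!r}")
    return np.vstack(rows) if rows else np.empty((0, dim))


# Toy 2-D vocabulary in which "cat" is missing.
vectors = {"the": np.array([1.0, 0.0]), "sat": np.array([0.0, 1.0])}
tokens = re.findall(TOKEN_PATTERN, "The cat sat")  # ['The', 'cat', 'sat']

copied = embed_tokens(tokens, vectors, 2, "copy_prev")  # 3 rows, "cat" repeats "the"
zeroed = embed_tokens(tokens, vectors, 2, "zero")       # 3 rows, "cat" is all zeros
skipped = embed_tokens(tokens, vectors, 2, "skip")      # 2 rows, "cat" is dropped
```

Only "skip" changes the number of rows, which matters when the output has to stay aligned with per-token timing.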

__init__(config: Dict[str, Any])[source]

Initialize the feature extractor with configuration.

Parameters:

config (Dict[str, Any]) – Configuration dictionary containing extractor parameters

extract_features(stimuli: str | List[str], **kwargs) numpy.ndarray[source]

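Maps tokens to an array of shape [N, D], one row per input token (fewer rows if oov_handling is “skip”, which drops OOV tokens). If stimuli is a string, it is first tokenized with tokenizer_pattern. OOV tokens are handled according to the config (default: copy the previous valid embedding).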

class encoding.features.FIR(delays: Iterable[int] | None = None, circpad: bool = False)[source]

Finite Impulse Response (FIR) expander for creating delayed feature matrices.

Usage options:
  • Static/class usage: FIR.make_delayed(stim, delays, circpad=False)

  • Instance usage: FIR(delays, circpad).expand(stim)

delays: Iterable[int] | None = None

circpad: bool = False

expand(stim: numpy.ndarray) numpy.ndarray[source]

static make_delayed(stim: numpy.ndarray, delays: Iterable[int], circpad: bool = False) numpy.ndarray[source]

n_delays() int[source]

Return the number of delays used.

output_dim(input_dim: int) int[source]

Return the output dimensionality after FIR expansion.

valid_length(nt: int) int[source]

Number of valid time points (non-padded). With circpad=True, always nt. Without circpad, depends on max shift.

summary(input_dim: int | None = None, nt: int | None = None) str[source]

Return a readable summary of FIR configuration.
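The idea behind a delayed feature matrix can be sketched in NumPy as follows. `make_delayed_sketch` is a hypothetical stand-in written for illustration, not the library's `FIR.make_delayed`. It assumes each delay is smaller in magnitude than the number of time points, that non-circular shifts are zero-padded, and that the delayed copies are concatenated along the feature axis.

```python
from typing import Iterable

import numpy as np


def make_delayed_sketch(stim: np.ndarray, delays: Iterable[int],
                        circpad: bool = False) -> np.ndarray:
    """Stack time-shifted copies of stim [nt, D] side by side -> [nt, D * n_delays]."""
    nt = stim.shape[0]
    blocks = []
    for d in delays:
        if circpad:
            # Wrap rows that fall off one end around to the other end.
            shifted = np.roll(stim, d, axis=0)
        else:
            shifted = np.zeros_like(stim)
            if d > 0:
                shifted[d:] = stim[:nt - d]   # delay: push rows later in time
            elif d < 0:
                shifted[:d] = stim[-d:]       # advance: pull rows earlier
            else:
                shifted[:] = stim
        blocks.append(shifted)
    return np.hstack(blocks)


# One feature over four time points, expanded with delays 0, 1 and 2.
stim = np.arange(1.0, 5.0).reshape(4, 1)
delayed = make_delayed_sketch(stim, [0, 1, 2])
wrapped = make_delayed_sketch(stim, [0, 1, 2], circpad=True)
print(delayed.shape)  # (4, 3): input_dim * number of delays
```

In the zero-padded case the first rows of the delayed columns are zeros, whereas with `circpad=True` they hold values wrapped around from the end of the stimulus.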

__init__(delays: Iterable[int] | None = None, circpad: bool = False) None

class encoding.features.FeatureExtractorFactory[source]

Factory class for creating feature extractors with caching support.

classmethod create_extractor(modality: str, model_name: str, config: Dict[str, Any], cache_dir: str = 'cache') BaseFeatureExtractor[source]

Create a feature extractor based on modality and model name.

Parameters:
  • modality – The type of feature extractor (‘language_model’, ‘speech’, ‘wordrate’, ‘embeddings’)

  • model_name – The specific model name (e.g., ‘gpt2-small’, ‘word2vec’, ‘openai/whisper-tiny’)

  • config – Configuration dictionary for the extractor

  • cache_dir – Directory for caching

Returns:

The appropriate feature extractor instance

Return type:

BaseFeatureExtractor

Raises:

ValueError – If modality is not supported

classmethod extract_features_with_caching(extractor: BaseFeatureExtractor, assembly: Any, story: str, idx: int, layer_idx: int = 9, lookback: int = 256, dataset_type: str = 'narratives') numpy.ndarray | Tuple[numpy.ndarray, numpy.ndarray][source]

Extract features with caching support.

Parameters:
  • extractor – The feature extractor instance

  • assembly – The assembly containing data

  • story – Story name

  • idx – Story index

  • layer_idx – Layer index for multi-layer extractors

  • lookback – Number of tokens to look back (for language models)

  • dataset_type – Type of dataset (e.g., ‘narratives’, ‘lebel’, etc.)

Returns:

Features array, or (features, times) tuple for speech
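A compute-once, load-afterwards cache of the kind this method suggests could look like the sketch below. It is purely illustrative: `cached_features`, the `.npy` file layout, and the cache key format are all assumptions, not the library's actual scheme.

```python
import os
import tempfile
from typing import Callable, List

import numpy as np


def cached_features(cache_dir: str, key: str,
                    compute: Callable[[], np.ndarray]) -> np.ndarray:
    """Load features from cache_dir/key.npy, computing and saving them on a miss."""
    os.makedirs(cache_dir, exist_ok=True)
    path = os.path.join(cache_dir, f"{key}.npy")
    if os.path.exists(path):
        return np.load(path)
    features = compute()
    np.save(path, features)
    return features


calls: List[int] = []


def compute() -> np.ndarray:
    """Stand-in for an expensive extraction; records each invocation."""
    calls.append(1)
    return np.ones((4, 3))


with tempfile.TemporaryDirectory() as tmp:
    # Hypothetical key built from the documented arguments.
    key = "narratives_story0_layer9_lookback256"
    first = cached_features(tmp, key, compute)   # cache miss: computes and saves
    second = cached_features(tmp, key, compute)  # cache hit: loads from disk
```

Including every argument that affects the output (dataset, story, layer, lookback) in the key avoids returning stale features when one of them changes.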

classmethod get_supported_modalities() list[source]

Get list of supported modalities.

classmethod register_extractor(modality: str, extractor_class: type)[source]

Register a new feature extractor class.

Parameters:
  • modality – The modality name

  • extractor_class – The extractor class to register
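The registration and creation methods above follow a common registry pattern, sketched here in a self-contained form. `BaseExtractorSketch`, `ExtractorRegistrySketch`, and `ConstantExtractor` are hypothetical classes written for illustration; the real factory also takes `model_name` and `cache_dir`, which are omitted here.

```python
from typing import Any, Dict, List, Type

import numpy as np


class BaseExtractorSketch:
    """Hypothetical minimal base class: stores the config dictionary."""

    def __init__(self, config: Dict[str, Any]):
        self.config = config


class ExtractorRegistrySketch:
    """Hypothetical registry mapping modality names to extractor classes."""

    _registry: Dict[str, Type[BaseExtractorSketch]] = {}

    @classmethod
    def register_extractor(cls, modality: str,
                           extractor_class: Type[BaseExtractorSketch]) -> None:
        cls._registry[modality] = extractor_class

    @classmethod
    def get_supported_modalities(cls) -> List[str]:
        return list(cls._registry)

    @classmethod
    def create_extractor(cls, modality: str,
                         config: Dict[str, Any]) -> BaseExtractorSketch:
        if modality not in cls._registry:
            raise ValueError(f"unsupported modality: {modality!r}")
        return cls._registry[modality](config)


class ConstantExtractor(BaseExtractorSketch):
    """Toy extractor that returns a column of ones, one row per stimulus."""

    def extract_features(self, stimuli: List[str]) -> np.ndarray:
        return np.ones((len(stimuli), 1))


ExtractorRegistrySketch.register_extractor("constant", ConstantExtractor)
extractor = ExtractorRegistrySketch.create_extractor("constant", {"note": "demo"})
features = extractor.extract_features(["a", "b", "c"])
```

Registering a class under a new modality name makes it available through the same creation call, and an unknown modality raises `ValueError`, matching the documented behaviour.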