alpbench.util.pytorch_tabnet.utils

Functions

check_embedding_parameters(cat_dims, ...)

Check parameters related to embeddings and rearrange them in a unique manner.

check_input(X)

Raise a clear error if X is a pandas DataFrame, then check the array according to scikit-learn rules

check_list_groups(list_groups, input_dim)

Check that list_groups is a valid list of feature groups.

check_warm_start(warm_start, from_unsupervised)

Gives a warning about ambiguous usage of the two parameters.

create_dataloaders(X_train, y_train, ...)

Create dataloaders with or without subsampling depending on weights and balanced.

create_explain_matrix(input_dim, ...)

Build the matrix used to sum importances from embedding columns back onto the initial feature indices.

create_group_matrix(list_groups, input_dim)

Create the group matrix corresponding to the given list_groups

create_sampler(weights, y_train)

This creates a sampler from the given weights

define_device(device_name)

Define the device to use during training and inference.

filter_weights(weights)

This function makes sure that weights are in correct format for regression and multitask TabNet

validate_eval_set(eval_set, eval_name, ...)

Check if the shapes of eval_set are compatible with (X_train, y_train).

Classes

ComplexEncoder(*[, skipkeys, ensure_ascii, ...])

PredictDataset(x)

Format for numpy array

SparsePredictDataset(x)

Format for csr_matrix

SparseTorchDataset(x, y)

Format for csr_matrix

TorchDataset(x, y)

Format for numpy array

class alpbench.util.pytorch_tabnet.utils.ComplexEncoder(*, skipkeys=False, ensure_ascii=True, check_circular=True, allow_nan=True, sort_keys=False, indent=None, separators=None, default=None)[source]

Bases: JSONEncoder

default(obj)[source]

class alpbench.util.pytorch_tabnet.utils.PredictDataset(x)[source]

Bases: Dataset

Format for numpy array

Parameters:

X (2D array) – The input matrix

class alpbench.util.pytorch_tabnet.utils.SparsePredictDataset(x)[source]

Bases: Dataset

Format for csr_matrix

Parameters:

X (CSR matrix) – The input matrix

class alpbench.util.pytorch_tabnet.utils.SparseTorchDataset(x, y)[source]

Bases: Dataset

Format for csr_matrix

Parameters:
  • X (CSR matrix) – The input matrix

  • y (2D array) – The one-hot encoded target

class alpbench.util.pytorch_tabnet.utils.TorchDataset(x, y)[source]

Bases: Dataset

Format for numpy array

Parameters:
  • X (2D array) – The input matrix

  • y (2D array) – The one-hot encoded target

alpbench.util.pytorch_tabnet.utils.check_embedding_parameters(cat_dims, cat_idxs, cat_emb_dim)[source]

Check parameters related to embeddings and rearrange them in a unique manner.

alpbench.util.pytorch_tabnet.utils.check_input(X)[source]

Raise a clear error if X is a pandas DataFrame, then check the array according to scikit-learn rules

alpbench.util.pytorch_tabnet.utils.check_list_groups(list_groups, input_dim)[source]

Check that list_groups:
  • is a list of lists

  • does not contain the same feature in more than one group

  • does not contain unknown features (index >= input_dim)

  • does not contain empty groups

Parameters:
  • list_groups (list of list of int) – Each element is a list of the features that belong to the same group. A feature may appear in at most one group. Features that are not assigned to a group each form their own single-feature group.

  • input_dim (int) – Number of features in the initial dataset.
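
The four rules above can be sketched in plain Python. This is a minimal illustration, not the library's implementation: the name `check_list_groups_sketch` and the choice of `TypeError`/`ValueError` are assumptions made for this example.

```python
def check_list_groups_sketch(list_groups, input_dim):
    """Raise if list_groups breaks one of the four documented rules."""
    if not isinstance(list_groups, list):
        raise TypeError("list_groups must be a list of lists")
    seen = set()  # features already claimed by an earlier group
    for pos, group in enumerate(list_groups):
        if not isinstance(group, list):
            raise TypeError(f"group {pos} must be a list")
        if len(group) == 0:
            raise ValueError(f"group {pos} is empty")
        for feature in set(group):
            if feature >= input_dim:
                raise ValueError(
                    f"feature {feature} is unknown (input_dim={input_dim})"
                )
            if feature in seen:
                raise ValueError(
                    f"feature {feature} appears in more than one group"
                )
        seen.update(group)


# Features 2 and 4 are left ungrouped, which is allowed.
check_list_groups_sketch([[0, 1], [3]], input_dim=5)
```

Raising distinct exception types is a design choice of the sketch; it lets callers tell a malformed container apart from invalid contents.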

alpbench.util.pytorch_tabnet.utils.check_warm_start(warm_start, from_unsupervised)[source]

Gives a warning about ambiguous usage of the two parameters.

alpbench.util.pytorch_tabnet.utils.create_dataloaders(X_train, y_train, eval_set, weights, batch_size, num_workers, drop_last, pin_memory)[source]

Create dataloaders with or without subsampling depending on weights and balanced.

Parameters:
  • X_train (np.ndarray) – Training data

  • y_train (np.array) – Mapped Training targets

  • eval_set (list of tuple) – List of eval tuple set (X, y)

  • weights (either 0, 1, dict or iterable) –

    if 0 (default) : no weights will be applied

    if 1 : classification only, classes are balanced with inverse frequency

    if dict : keys are the class values, values are the sample weights for each class

    if iterable : list or np array whose length must equal the number of elements in the training set

  • batch_size (int) – how many samples per batch to load

  • num_workers (int) – how many subprocesses to use for data loading. 0 means that the data will be loaded in the main process

  • drop_last (bool) – set to True to drop the last incomplete batch, if the dataset size is not divisible by the batch size. If False and the size of dataset is not divisible by the batch size, then the last batch will be smaller

  • pin_memory (bool) – Whether to pin GPU memory during training

Returns:

train_dataloader, valid_dataloader – Training and validation dataloaders

Return type:

torch.utils.data.DataLoader, torch.utils.data.DataLoader

alpbench.util.pytorch_tabnet.utils.create_explain_matrix(input_dim, cat_emb_dim, cat_idxs, post_embed_dim)[source]

A computational trick for rapidly summing the importances of all embedding columns that come from the same feature back onto that feature's initial index.

Parameters:
  • input_dim (int) – Initial input dim

  • cat_emb_dim (int or list of int) – If int, the embedding size used for all categorical features. If list of int, the embedding size of each categorical feature.

  • cat_idxs (list of int) – Initial position of categorical features

  • post_embed_dim (int) – Post embedding inputs dimension

Returns:

reducing_matrix – Matrix of dim (post_embed_dim, input_dim) used to perform the reduction

Return type:

np.array
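
To make the trick concrete, here is a dependency-free sketch that builds such a 0/1 matrix as nested lists and uses it to reduce a post-embedding importance vector. The function names are hypothetical, and the sketch assumes the embedded columns keep the initial feature order, with each categorical feature expanded in place; the library itself returns an array rather than lists.

```python
def create_explain_matrix_sketch(input_dim, cat_emb_dim, cat_idxs, post_embed_dim):
    """Return a (post_embed_dim, input_dim) 0/1 matrix as nested lists."""
    if isinstance(cat_emb_dim, int):
        emb_dims = [cat_emb_dim] * len(cat_idxs)
    else:
        emb_dims = list(cat_emb_dim)
    # Number of embedded columns produced by each categorical feature.
    width = dict(zip(cat_idxs, emb_dims))
    expected = sum(width.get(f, 1) for f in range(input_dim))
    if expected != post_embed_dim:
        raise ValueError(f"post_embed_dim should be {expected}, got {post_embed_dim}")
    matrix = [[0] * input_dim for _ in range(post_embed_dim)]
    row = 0
    for feature in range(input_dim):
        # Numerical features keep one column; categorical ones span several.
        for _ in range(width.get(feature, 1)):
            matrix[row][feature] = 1
            row += 1
    return matrix


def reduce_importances(post_embed_importances, matrix):
    """Sum embedded-column importances back onto the initial features."""
    n_features = len(matrix[0])
    return [
        sum(v * row[j] for v, row in zip(post_embed_importances, matrix))
        for j in range(n_features)
    ]


# Feature 1 is categorical with a 2-dimensional embedding,
# so 3 input features become 4 post-embedding columns.
M = create_explain_matrix_sketch(
    input_dim=3, cat_emb_dim=2, cat_idxs=[1], post_embed_dim=4
)
reduced = reduce_importances([0.1, 0.2, 0.3, 0.4], M)
```

Here the two middle columns both map to feature 1, so their importances are added together while features 0 and 2 pass through unchanged.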

alpbench.util.pytorch_tabnet.utils.create_group_matrix(list_groups, input_dim)[source]

Create the group matrix corresponding to the given list_groups

Parameters:
  • list_groups (list of list of int) – Each element is a list of the features that belong to the same group. A feature may appear in at most one group. Features that are not assigned to a group each form their own single-feature group.

  • input_dim (int) – Number of features in the initial dataset.

Returns:

group_matrix – A matrix of size (n_groups, input_dim) where m_ij represents the importance of feature j in group i. The rows must sum to 1, as each group is equally important a priori.

Return type:

torch matrix
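
As an illustration of the description above, the following sketch builds a group matrix as nested Python lists. It assumes that features within a group share the row weight equally and that ungrouped features are appended as single-feature groups after the given ones; the name `create_group_matrix_sketch` is hypothetical, and the library returns a torch matrix rather than lists.

```python
def create_group_matrix_sketch(list_groups, input_dim):
    """Return an (n_groups, input_dim) matrix whose rows each sum to 1."""
    grouped = {feature for group in list_groups for feature in group}
    # Every feature without a group becomes its own single-feature group.
    singletons = [[f] for f in range(input_dim) if f not in grouped]
    all_groups = list(list_groups) + singletons
    matrix = [[0.0] * input_dim for _ in all_groups]
    for row, group in zip(matrix, all_groups):
        for feature in group:
            # Equal share inside the group (an assumption of this sketch).
            row[feature] = 1.0 / len(group)
    return matrix


# Features 0 and 1 form one group; feature 2 is left on its own.
G = create_group_matrix_sketch([[0, 1]], input_dim=3)
```

With these inputs the matrix has two rows: one splitting its weight between features 0 and 1, and one giving full weight to feature 2.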

alpbench.util.pytorch_tabnet.utils.create_sampler(weights, y_train)[source]

This creates a sampler from the given weights

Parameters:
  • weights (either 0, 1, dict or iterable) –

    if 0 (default) : no weights will be applied

    if 1 : classification only, classes are balanced with inverse frequency

    if dict : keys are the class values, values are the sample weights for each class

    if iterable : list or np array whose length must equal the number of elements in the training set

  • y_train (np.array) – Training targets
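
The four accepted forms of weights can be illustrated by turning each into one weight per training sample. The helper below is a hypothetical sketch in plain Python; it only shows the mapping described above and does not build a sampler. Per-sample weights of this kind are the usual input to a weighted sampler such as `torch.utils.data.WeightedRandomSampler`, although whether the library uses exactly that class is an assumption here.

```python
from collections import Counter


def per_sample_weights_sketch(weights, y_train):
    """Map the documented `weights` options to one weight per sample.

    Returns None when no weighting is requested (weights == 0).
    """
    y = list(y_train)
    if isinstance(weights, int):
        if weights == 0:
            return None
        if weights == 1:
            # Inverse class frequency: rare classes get larger weights.
            counts = Counter(y)
            return [1.0 / counts[label] for label in y]
        raise ValueError("weights should be 0, 1, a dict or an iterable")
    if isinstance(weights, dict):
        # Keys are class values, values are the weight for that class.
        return [float(weights[label]) for label in y]
    per_sample = [float(w) for w in weights]
    if len(per_sample) != len(y):
        raise ValueError("custom weights must have one entry per training sample")
    return per_sample


labels = [0, 0, 0, 1]
balanced = per_sample_weights_sketch(1, labels)
by_class = per_sample_weights_sketch({0: 1.0, 1: 5.0}, labels)
```

In the balanced case each of the three samples of class 0 gets weight 1/3 and the single sample of class 1 gets weight 1, so both classes are drawn with the same total probability.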

alpbench.util.pytorch_tabnet.utils.define_device(device_name)[source]

Define the device to use during training and inference. If “auto”, it automatically detects whether to use cuda or cpu.

Parameters:

device_name (str) – Either “auto”, “cpu” or “cuda”

Returns:

Either “cpu” or “cuda”

Return type:

str
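
A minimal sketch of this kind of device resolution is shown below. The name `define_device_sketch` is hypothetical, and two behaviours are assumptions of the sketch rather than documented facts: falling back to “cpu” when PyTorch is not installed, and rejecting names other than the three listed above.

```python
def define_device_sketch(device_name):
    """Resolve "auto" to "cuda" or "cpu"; pass "cpu" and "cuda" through."""
    if device_name == "auto":
        try:
            import torch  # imported lazily so the sketch runs without PyTorch
        except ImportError:
            return "cpu"  # assumption: no PyTorch means no GPU support
        return "cuda" if torch.cuda.is_available() else "cpu"
    if device_name in ("cpu", "cuda"):
        return device_name
    # Assumption of this sketch: unknown names are rejected.
    raise ValueError(f"unknown device name: {device_name!r}")


device = define_device_sketch("auto")
```

The result can then be passed to calls such as `tensor.to(device)` or `model.to(device)`.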

alpbench.util.pytorch_tabnet.utils.filter_weights(weights)[source]

This function makes sure that weights are in correct format for regression and multitask TabNet

Parameters:

weights (int, dict or list) – Initial weights parameters given by user

Returns:

None. This function only raises an error if the format is wrong.

Return type:

None
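
The sketch below shows what such a check could look like. It assumes that the two class-based options described for weights elsewhere on this page (1 for class balancing, and a dict keyed by class value) are the formats rejected for regression and multitask models, since both rely on class labels. The function name and error messages are hypothetical.

```python
def filter_weights_sketch(weights):
    """Raise ValueError for weight formats that only suit classification."""
    if isinstance(weights, dict):
        # A dict maps class values to weights, which needs class labels.
        raise ValueError("dict weights are only meaningful for classification")
    if isinstance(weights, int) and weights == 1:
        # 1 requests inverse-frequency class balancing.
        raise ValueError("weights=1 is only meaningful for classification")
    return None


filter_weights_sketch(0)                # no weighting: accepted
filter_weights_sketch([0.5, 1.0, 2.0])  # per-sample weights: accepted
```

Because the function returns nothing on success, a caller simply invokes it before training and lets any exception propagate.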

alpbench.util.pytorch_tabnet.utils.validate_eval_set(eval_set, eval_name, X_train, y_train)[source]

Check if the shapes of eval_set are compatible with (X_train, y_train).

Parameters:
  • eval_set (list of tuple) – List of eval tuple set (X, y). The last one is used for early stopping

  • eval_name (list of str) – List of eval set names.

  • X_train (np.ndarray) – Training data

  • y_train (np.array) – Training targets

Returns:

  • eval_names (list of str) – Validated list of eval_names.

  • eval_set (list of tuple) – Validated list of eval_set.
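
The compatibility check can be sketched as follows. This is an illustration only: the name `validate_eval_set_sketch`, the default names `val_0`, `val_1`, ... and the use of `ValueError` are assumptions. To stay dependency-free, the example passes stand-in objects that expose only a `.shape` attribute; real calls would pass NumPy arrays.

```python
from types import SimpleNamespace


def validate_eval_set_sketch(eval_set, eval_name, X_train, y_train):
    """Check eval_set shapes against the training data.

    Returns (eval_name, eval_set), following the order documented above.
    """
    # Assumption of this sketch: missing names default to val_0, val_1, ...
    eval_name = eval_name or [f"val_{i}" for i in range(len(eval_set))]
    if len(eval_set) != len(eval_name):
        raise ValueError("eval_set and eval_name must have the same length")
    for name, pair in zip(eval_name, eval_set):
        if len(pair) != 2:
            raise ValueError(f"{name}: each eval set must be a tuple (X, y)")
        X, y = pair
        # All dimensions except the number of samples must match training.
        if X.shape[1:] != X_train.shape[1:]:
            raise ValueError(f"{name}: X shape {X.shape} does not match X_train")
        if y.shape[1:] != y_train.shape[1:]:
            raise ValueError(f"{name}: y shape {y.shape} does not match y_train")
        if X.shape[0] != y.shape[0]:
            raise ValueError(f"{name}: X and y have different numbers of samples")
    return eval_name, eval_set


def shaped(*shape):
    """Stand-in for an array: an object with only a .shape tuple."""
    return SimpleNamespace(shape=shape)


names, sets = validate_eval_set_sketch(
    [(shaped(20, 4), shaped(20))], None, shaped(100, 4), shaped(100)
)
```

An evaluation set with a different number of feature columns than `X_train`, or with mismatched sample counts between X and y, would raise instead of returning.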