API

peppy

Project configuration, particularly for logging.

Project-scope constants may reside here, but more importantly, some setup here will provide a logging infrastructure for all of the project’s modules. Individual modules and classes may provide separate configuration on a more local level, but this will at least provide a foundation.

class peppy.AttributeDict(entries=None, _force_nulls=False, _attribute_identity=False)[source]

A class that converts nested mappings into objects whose key-value pairs are accessed with attribute syntax (attr_dict.attribute) instead of getitem syntax (attr_dict["key"]). Nested mappings are converted recursively, which allows chained attribute traversal (e.g., attr_dict.attr.attr).
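The recursive conversion described above can be illustrated with a small stand-in class. This is only a sketch: AttrDictSketch is a hypothetical name invented for illustration, and it is not peppy's implementation of AttributeDict.

```python
from collections.abc import Mapping


class AttrDictSketch:
    """Hypothetical stand-in for illustration; not peppy's AttributeDict."""

    def __init__(self, entries=None):
        for key, value in (entries or {}).items():
            # Nested mappings become nested objects, so traversal can chain.
            if isinstance(value, Mapping):
                value = AttrDictSketch(value)
            setattr(self, key, value)

    def __getitem__(self, key):
        # Keep getitem syntax working alongside attribute syntax.
        return getattr(self, key)


config = AttrDictSketch({"metadata": {"output_dir": "/tmp/results"}, "name": "demo"})
print(config.metadata.output_dir)  # attribute traversal through a nested mapping
print(config["name"])              # getitem syntax on the same object
```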

add_entries(entries)[source]

Update this AttributeDict with provided key-value pairs.

Parameters:entries (Iterable[(object, object)] | Mapping | pandas.Series) – collection of pairs of keys and values
Return AttributeDict:
 the updated instance
copy()

Copy self to a new object.

is_null(item)[source]

Conjunction of presence in underlying mapping and value being None

Parameters:item (object) – Key to check for presence and null value
Return bool:True iff the item is present and has null value
non_null(item)[source]

Conjunction of presence in underlying mapping and value not being None

Parameters:item (object) – Key to check for presence and non-null value
Return bool:True iff the item is present and has non-null value
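As a minimal sketch of the two predicates above, written against a plain dict rather than an AttributeDict (the free functions here are illustrative stand-ins, not peppy code), note that both return False for a key that is absent:

```python
def is_null(mapping, item):
    """True iff item is a key in mapping and its value is None."""
    return item in mapping and mapping[item] is None


def non_null(mapping, item):
    """True iff item is a key in mapping and its value is not None."""
    return item in mapping and mapping[item] is not None


data = {"genome": None, "protocol": "ATAC"}
for key in ("genome", "protocol", "missing"):
    print(key, is_null(data, key), non_null(data, key))
```

Because each predicate is a conjunction with presence, non_null is not simply the negation of is_null.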
class peppy.Project(config_file, subproject=None, default_compute=None, dry=False, permissive=True, file_checks=False, compute_env_file=None, no_environment_exception=None, no_compute_exception=None, defer_sample_construction=False)[source]

A class to model a Project (collection of samples and metadata).

Parameters:
  • config_file (str) – Project config file (YAML).
  • subproject (str) – Subproject to use within configuration file, optional
  • default_compute (str) – Configuration file (YAML) for default compute settings.
  • dry (bool) – If dry mode is activated, no directories will be created upon project instantiation.
  • permissive (bool) – Whether an error should be thrown if sample input file(s) do not exist or cannot be opened.
  • file_checks (bool) – Whether sample input files should be checked for their attributes (read type, read length) if this is not set in sample metadata.
  • compute_env_file (str) – Environment configuration YAML file specifying compute settings.
  • no_environment_exception (type) – type of exception to raise if environment settings can’t be established, optional; if null (the default), a warning message will be logged, and no exception will be raised.
  • no_compute_exception (type) – type of exception to raise if compute settings can’t be established, optional; if null (the default), a warning message will be logged, and no exception will be raised.
  • defer_sample_construction (bool) – whether to wait to build this Project’s Sample objects until they’re needed, optional; by default, the basic Sample is created during Project construction
Example:
from peppy import Project
prj = Project("config.yaml")
exception MissingMetadataException(missing_section, path_config_file=None)[source]

Project needs certain metadata.

exception MissingSampleSheetError(sheetfile)[source]

Represent case in which sample sheet is specified but nonexistent.

activate_subproject(subproject)[source]

Activate a subproject.

This method will update Project attributes, adding new values associated with the subproject indicated, and in case of collision with an existing key/attribute the subproject’s value will be favored.

Parameters:subproject (str) – A string with a subproject name to be activated
Return Project:A Project with the selected subproject activated
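The collision rule described above (the subproject's value wins) can be sketched with plain dicts. The function name and the flat, one-level merge are assumptions made for illustration; how peppy treats nested sections is not stated here.

```python
def apply_subproject(base, overrides):
    """Return a copy of base in which subproject values take precedence."""
    merged = dict(base)       # leave the original settings untouched
    merged.update(overrides)  # new keys are added; colliding keys are replaced
    return merged


base_settings = {"genome": "hg19", "output_dir": "/results"}
subproject_settings = {"genome": "hg38", "aligner": "bowtie2"}
print(apply_subproject(base_settings, subproject_settings))
```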
build_sheet(*protocols)[source]

Create all Sample objects for this project for the given protocol(s).

Return pandas.core.frame.DataFrame:
 DataFrame built from the base version of each of this Project’s samples, for indicated protocol(s) if given, else all of this Project’s samples
compute_env_var

Environment variable through which to access compute settings.

Return str:name of the environment variable pointing to compute settings
constants

Return key-value pairs of pan-Sample constants for this Project.

Return Mapping:collection of KV pairs, each representing a pairing of attribute name and attribute value
copy()

Copy self to a new object.

default_compute_envfile

Path to default compute environment settings file.

Return str:Path to this project’s default compute env config file.
derived_columns

Collection of sample attributes for which value of each is derived from elsewhere

Return list[str]:
 sample attribute names for which value is derived
finalize_pipelines_directory(pipe_path='')[source]

Finalize the establishment of a path to this project’s pipelines.

With the passed argument, override anything already set. Otherwise, prefer path provided in this project’s config, then local pipelines folder, then a location set in project environment.

Parameters:

pipe_path (str) – (absolute) path to pipelines

Raises:
  • PipelinesException – if (prioritized) search in attempt to confirm or set pipelines directory failed
  • TypeError – if pipeline(s) path(s) argument is provided and can’t be interpreted as a single path or as a flat collection of path(s)
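The search order described above can be sketched as a first-match lookup. Everything in this block is hypothetical: the function name, the keyword parameters standing in for the config, local, and environment locations, and the use of LookupError in place of PipelinesException.

```python
def choose_pipelines_path(pipe_path="", config_path=None, local_path=None, env_path=None):
    """Pick a pipelines path following the documented priority order."""
    # An explicitly passed path overrides anything already set.
    if pipe_path:
        return pipe_path
    # Otherwise: project config first, then local folder, then environment.
    for candidate in (config_path, local_path, env_path):
        if candidate:
            return candidate
    # Stand-in for PipelinesException when the prioritized search fails.
    raise LookupError("no pipelines path could be established")


print(choose_pipelines_path(local_path="/home/user/pipelines", env_path="/opt/pipelines"))
```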
get_arg_string(pipeline_name)[source]

For this project, given a pipeline, return an argument string specified in the project config file.

get_sample(sample_name)[source]

Get an individual sample object from the project.

Will raise a ValueError if the sample is not found. In the case of multiple samples with the same name (which is not typically allowed), a warning is raised and the first sample is returned.

Parameters:sample_name (str) – The name of a sample to retrieve
Return Sample:The requested Sample object
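A rough sketch of the lookup behavior described above, using dicts in place of Sample objects. The "sample_name" key, the function name, and the use of warnings.warn (rather than a logger) are assumptions for illustration.

```python
import warnings


def find_sample(samples, sample_name):
    """Return the first sample with the given name, per the documented rules."""
    matches = [s for s in samples if s["sample_name"] == sample_name]
    if not matches:
        raise ValueError("Project has no sample named {}".format(sample_name))
    if len(matches) > 1:
        # Duplicate names are tolerated, but flagged; the first match is used.
        warnings.warn("Multiple samples named {}; using the first".format(sample_name))
    return matches[0]


samples = [
    {"sample_name": "a", "id": 0},
    {"sample_name": "dup", "id": 1},
    {"sample_name": "dup", "id": 2},
]
print(find_sample(samples, "a"))
```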
get_samples(sample_names)[source]

Returns a list of sample objects given a list of sample names

Parameters:sample_names (list) – A list of sample names to retrieve
Return list[Sample]:
 A list of Sample objects
implied_columns

Collection of sample attributes for which value of each is implied by other(s)

Return list[str]:
 sample attribute names for which value is implied by other(s)
infer_name()[source]

Infer project name from config file path.

First assume the name is the folder in which the config file resides, unless that folder is named “metadata”, in which case the project name is the parent of that folder.

Parameters:path_config_file (str) – path to the project’s config file.
Return str:inferred name for project.
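The naming rule above can be sketched with os.path. This is an illustrative free function rather than the method itself, and the example paths are made up.

```python
import os


def infer_project_name(path_config_file):
    """Infer a project name from the location of its config file."""
    config_folder = os.path.dirname(os.path.abspath(path_config_file))
    folder_name = os.path.basename(config_folder)
    if folder_name == "metadata":
        # Config lives in <project>/metadata/, so step up one level.
        return os.path.basename(os.path.dirname(config_folder))
    return folder_name


print(infer_project_name("/data/projects/myproj/metadata/config.yaml"))
print(infer_project_name("/data/projects/myproj/config.yaml"))
```

Both calls yield the same name, since the "metadata" folder is skipped in the first case.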
make_project_dirs()[source]

Creates project directory structure if it doesn’t exist.

num_samples

Count the number of samples available in this Project.

Return int:number of samples available in this Project.
output_dir

Directory in which to place results and submissions folders.

By default, assume that the project’s configuration file specifies an output directory, and that this is therefore available within the project metadata. If that assumption does not hold, though, consider the folder in which the project configuration file lives to be the project’s output directory.

Return str:path to the project’s output directory, either as specified in the configuration file or the folder that contains the project’s configuration file.
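The fallback described above might look like the following sketch. The "output_dir" metadata key and the function name are assumptions; the metadata is modeled as a plain dict.

```python
import os


def resolve_output_dir(metadata, config_file):
    """Prefer the configured output directory; else use the config's folder."""
    output_dir = metadata.get("output_dir")  # assumed key name
    if output_dir:
        return output_dir
    # Fall back to the folder in which the configuration file lives.
    return os.path.dirname(os.path.abspath(config_file))


print(resolve_output_dir({"output_dir": "/results"}, "/projects/demo/config.yaml"))
print(resolve_output_dir({}, "/projects/demo/config.yaml"))
```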
parse_config_file(subproject=None)[source]

Parse provided yaml config file and check required fields exist.

Parameters:subproject (str) – Name of subproject to activate, optional
Raises:KeyError – if config file lacks required section(s)
static parse_sample_sheet(sample_file, dtype=str)[source]

Check if csv file exists and has all required columns.

Parameters:
  • sample_file (str) – path to sample annotations file.
  • dtype (type) – data type for CSV read.
Raises:
  • IOError – if given annotations file can’t be read.
  • ValueError – if required column(s) is/are missing.
project_folders

Names of folders to nest within a project output directory.

Return Iterable[str]:
 names of output-nested folders
protocols

Determine this Project’s unique protocol names.

Return Set[str]:
 collection of this Project’s unique protocol names
required_metadata

Names of metadata fields that must be present for a valid project.

Make a base project as unconstrained as possible by requiring no specific metadata attributes. It’s likely that some common-sense requirements may arise in domain-specific client applications, in which case this can be redefined in a subclass.

Return Iterable[str]:
 names of metadata fields required by a project
sample_names

Names of samples of which this Project is aware.

samples

Generic/base Sample instance for each of this Project’s samples.

Return Iterable[Sample]:
 Sample instance for each of this Project’s samples
set_compute(setting)[source]

Set the compute attributes according to the specified settings in the environment file.

Parameters:setting (str) – name for non-resource compute bundle, the name of a subsection in an environment configuration file
Return bool:success flag for attempt to establish compute settings
set_project_permissions()[source]

Make the project’s public_html folder executable.

sheet

Annotations/metadata sheet describing this Project’s samples.

Return pandas.core.frame.DataFrame:
 table of samples in this Project
templates_folder

Path to folder with default submission templates.

Return str:path to folder with default submission templates
update_environment(env_settings_file)[source]

Parse data from environment configuration file.

Parameters:env_settings_file (str) – path to file with new environment configuration data
class peppy.Sample(series, prj=None)[source]

Class to model Samples based on a pandas Series.

Parameters:series (Mapping | pandas.core.series.Series) – Sample’s data.
Example:
from models import Project, SampleSheet, Sample
prj = Project("ngs")
sheet = SampleSheet("~/projects/example/sheet.csv", prj)
s1 = Sample(sheet.iloc[0])
as_series()[source]

Returns a pandas.Series object with all the sample’s attributes.

Return pandas.core.series.Series:
 pandas Series representation of this Sample, with its attributes.
check_valid(required=None)[source]

Check provided sample annotation is valid.

Parameters:required (Iterable[str]) – collection of required sample attribute names, optional; if unspecified, only a name is required.
Return (Exception | NoneType, str, str):
 exception and messages about what’s missing/empty; null with empty messages if there was nothing exceptional or required inputs are absent or not set
copy()

Copy self to a new object.

determine_missing_requirements()[source]

Determine which of this Sample’s required attributes/files are missing.

Return (type, str):
 hypothetical exception type along with message about what’s missing; null and empty if nothing exceptional is detected
generate_filename(delimiter='_')[source]

Create a name for file in which to represent this Sample.

This uses knowledge of the instance’s subtype, sandwiching a delimiter between the name of this Sample and the name of the subtype before the extension. If the instance is a base Sample type, then the filename is simply the sample name with an extension.

Parameters:delimiter (str) – what to place between sample name and name of subtype; this is only relevant if the instance is of a subclass
Return str:name for file with which to represent this Sample on disk
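A sketch of the naming scheme described above, using two hypothetical stand-in classes. The ".yaml" extension is an assumption (suggested by to_yaml, but not stated for this method), and using the class name as the subtype name is likewise assumed.

```python
class BaseSampleSketch:
    """Hypothetical stand-in for the base Sample type."""

    def __init__(self, name):
        self.name = name

    def generate_filename(self, delimiter="_"):
        base = self.name
        if type(self) is not BaseSampleSketch:
            # Subclass instance: sandwich the delimiter between the
            # sample name and the subtype name.
            base = delimiter.join([self.name, type(self).__name__])
        return base + ".yaml"  # assumed extension


class SubtypeSketch(BaseSampleSketch):
    """Hypothetical stand-in for a Sample subclass."""


print(BaseSampleSketch("s1").generate_filename())
print(SubtypeSketch("s1").generate_filename(delimiter="-"))
```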
generate_name()[source]

Generate name for the sample by joining some of its attribute strings.

get_attr_values(attrlist)[source]

Get value corresponding to each given attribute.

Parameters:attrlist (str) – name of an attribute storing a list of attr names
Return list | NoneType:
 value (or empty string) corresponding to each named attribute; null if this Sample’s value for the attribute given by the argument to the “attrlist” parameter is empty/null, or if this Sample lacks the indicated attribute
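The indirection in get_attr_values (an attribute that names other attributes) is easy to misread, so here is a minimal sketch over a plain dict. The attribute names in the example data are invented.

```python
def get_attr_values(sample, attrlist):
    """Look up the value of each attribute named in sample[attrlist]."""
    names = sample.get(attrlist)
    if not names:
        # The list attribute is missing, null, or empty.
        return None
    # An attribute that is named but not set yields an empty string.
    return [sample.get(name, "") for name in names]


sample = {"required_inputs": ["read1", "read2"], "read1": "a.fastq"}
print(get_attr_values(sample, "required_inputs"))
print(get_attr_values(sample, "not_an_attribute"))
```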
get_sheet_dict()[source]

Create K-V pairs for items originally passed in via the sample sheet.

This is useful for summarizing; it provides a representation of the sample that excludes things like config files and derived entries.

Return OrderedDict:
 mapping from name to value for data elements originally provided via the sample sheet (i.e., a map-like representation of the instance, excluding derived items)
get_subsample(subsample_name)[source]

Retrieve a single subsample by name.

Parameters:subsample_name (str) – The name of the desired subsample. Should match the subsample_name column in the subannotation sheet.
Return Subsample:
 Requested Subsample object
get_subsamples(subsample_names)[source]

Retrieve subsamples assigned to this sample

Parameters:subsample_names (list) – List of names of subsamples to retrieve
Return list:List of subsamples
infer_attributes(implications)[source]

Infer value for additional field(s) from other field(s).

Add columns/fields to the sample based on values in those already-set that the sample’s project defines as indicative of implications for additional data elements for the sample.

Parameters:implications (Mapping) – Project’s implied columns data
Return None:this function mutates state and is strictly for effect
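The shape of the implications mapping is not spelled out above. The sketch below assumes a three-level layout (implying attribute, then its value, then the attributes to set) and assumes that implied values overwrite existing ones; both points are guesses for illustration, and the sample is modeled as a plain dict.

```python
def infer_attributes(sample, implications):
    """Add implied fields to sample in place; returns None."""
    for implier_name, implied_by_value in implications.items():
        implier_value = sample.get(implier_name)
        implied = implied_by_value.get(implier_value)
        if implied is None:
            continue  # this sample's value implies nothing
        sample.update(implied)  # assumed: implied values overwrite
    return None


# Assumed layout: {implying attribute: {value: {implied attribute: value}}}
implications = {"organism": {"human": {"genome": "hg38"}}}
human = {"sample_name": "s1", "organism": "human"}
infer_attributes(human, implications)
print(human)
```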
input_file_paths

List the sample’s data source / input files

Return list[str]:
 paths to data sources / input files for this Sample.
is_dormant()[source]

Determine whether this Sample is inactive.

By default, a Sample is regarded as active. That is, if it lacks an indication about activation status, it’s assumed to be active. If, however, there’s an indication of such status, it must be ‘1’ in order to be considered switched ‘on.’

Return bool:whether this Sample’s been designated as dormant
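The activation rule above can be sketched as follows. The attribute name "toggle" is an assumption made for illustration, the sample is modeled as a plain dict, and the comparison is against the string "1" exactly as the text states.

```python
def is_dormant(sample, flag_name="toggle"):
    """Return True if the sample has a status flag that is not '1'."""
    if flag_name not in sample:
        return False  # no indication of status: assumed active
    return sample[flag_name] != "1"


print(is_dormant({"sample_name": "s1"}))
print(is_dormant({"sample_name": "s2", "toggle": "1"}))
print(is_dormant({"sample_name": "s3", "toggle": "0"}))
```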
library

Backwards-compatible alias.

Return str:The protocol / NGS library name for this Sample.
locate_data_source(data_sources, column_name='data_source', source_key=None, extra_vars=None)[source]

Uses the template path provided in the project config section “data_sources” to piece together an actual path by substituting variables (encoded as “{variable}”) with sample attributes.

Parameters:
  • data_sources (Mapping) – mapping from key name (as a value in a cell of a tabular data structure) to, e.g., filepath
  • column_name (str) – Name of sample attribute (equivalently, sample sheet column) specifying a derived column.
  • source_key (str) – The key of the data_source, used to index into the project config data_sources section. By default, the source key will be taken as the value of the specified column (as a sample attribute). For cases where the sample doesn’t have this attribute yet (e.g. in a merge table), you must specify the source key.
  • extra_vars (dict) – By default, this will look to populate the template location using attributes found in the current sample; however, you may also provide a dict of extra variables that can also be used for variable replacement. These extra variables are given a higher priority.
Return str:

regex expansion of data source specified in configuration, with variable substitutions made

Raises:

ValueError – if argument to data_sources parameter is null/empty
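The substitution step can be sketched with str.format. This illustrative function models sample attributes as a dict and covers only the variable replacement and the priority of extra_vars; the regex expansion mentioned in the return description is left out, and the function name is invented.

```python
def fill_data_source(data_sources, sample_attrs, column_name="data_source",
                     source_key=None, extra_vars=None):
    """Fill a data source template with sample attributes."""
    if not data_sources:
        raise ValueError("data_sources must be a non-empty mapping")
    # Default the key to the sample's value in the named column.
    key = source_key if source_key is not None else sample_attrs[column_name]
    template = data_sources[key]
    variables = dict(sample_attrs)
    variables.update(extra_vars or {})  # extra variables take priority
    return template.format(**variables)


data_sources = {"local": "/data/{sample_name}_{lane}.fastq"}
attrs = {"sample_name": "s1", "data_source": "local", "lane": "L1"}
print(fill_data_source(data_sources, attrs))
print(fill_data_source(data_sources, attrs, extra_vars={"lane": "L2"}))
```

Passing source_key explicitly covers the case in which the sample does not yet have the column as an attribute.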

make_sample_dirs()[source]

Creates sample directory structure if it doesn’t exist.

set_file_paths(project=None)[source]

Sets the paths of all files for this sample.

Parameters:project (AttributeDict) – object with pointers to data paths and such, either full Project or AttributeDict with sufficient data
set_genome(genomes)[source]

Set the genome for this Sample.

Parameters:genomes (Mapping[str, str]) – genome assembly by organism name
set_pipeline_attributes(pipeline_interface, pipeline_name, permissive=True)[source]

Set pipeline-specific sample attributes.

Some sample attributes are relative to a particular pipeline run, like which files should be considered inputs, what is the total input file size for the sample, etc. This function sets these pipeline-specific sample attributes, provided via a PipelineInterface object and the name of a pipeline to select from that interface.

Parameters:
  • pipeline_interface (PipelineInterface) – A PipelineInterface object that has the settings for this given pipeline.
  • pipeline_name (str) – Which pipeline to choose.
  • permissive (bool) – whether to simply log a warning or error message rather than raising an exception if sample file is not found or otherwise cannot be read, default True
set_read_type(rlen_sample_size=10, permissive=True)[source]

For a sample with attr ngs_inputs set, this sets the read type (single, paired) and read length of an input file.

Parameters:
  • rlen_sample_size (int) – Number of reads to sample to infer read type, default 10.
  • permissive (bool) – whether to simply log a warning or error message rather than raising an exception if sample file is not found or otherwise cannot be read, default True.
set_transcriptome(transcriptomes)[source]

Set the transcriptome for this Sample.

Parameters:transcriptomes (Mapping[str, str]) – transcriptome assembly by organism name
to_yaml(path=None, subs_folder_path=None, delimiter='_')[source]

Serializes itself in YAML format.

Parameters:
  • path (str) – A file path to write yaml to; provide this or the subs_folder_path
  • subs_folder_path (str) – path to folder in which to place file that’s being written; provide this or a full filepath
  • delimiter (str) – text to place between the sample name and the suffix within the filename; irrelevant if there’s no suffix
Return str:

filepath used (same as input if given, otherwise the path value that was inferred)

Raises:

ValueError – if neither full filepath nor path to extant parent directory is provided.
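How the two path parameters interact can be sketched as below. The function and its filename argument are hypothetical, and the check that the parent directory exists is omitted to keep the example independent of the filesystem.

```python
import os


def resolve_yaml_path(filename, path=None, subs_folder_path=None):
    """Choose the file path to write to, per the documented parameters."""
    if path:
        return path  # a full file path is used as given
    if subs_folder_path:
        # Otherwise the path is inferred from the folder and filename.
        return os.path.join(subs_folder_path, filename)
    raise ValueError("Provide either a full filepath or a folder path")


print(resolve_yaml_path("s1.yaml", path="/out/custom.yaml"))
print(resolve_yaml_path("s1.yaml", subs_folder_path="/out/submission"))
```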

update(newdata, **kwargs)[source]

Update Sample object with attributes from a dict.

exception peppy.PeppyError(msg)[source]

Base error type for peppy custom errors.