Quickstart

In [1]:
import impact as impt

The impact framework is designed to help scientists parse, interpret, explore, and visualize data to understand and engineer microbial physiology. The core framework is open-source and written entirely in Python.

Data is parsed into an object-oriented data structure built on top of a relational mapping to most SQL databases. This allows efficient saving and querying to ease data exploration.

Here we provide the basics to get started analyzing data with the core framework. Before diving in, it is worth understanding the basic data schema:

Model            Function
TrialIdentifier  Describes a trial (time, analyte, strain, media, etc.)
AnalyteData      Time points and data vectors for quantified data (g/L product, OD, etc.)
SingleTrial      All analytes for a given unit (e.g. a tube, a well on a plate, a bioreactor, etc.)
ReplicateTrial   A set of SingleTrials grouped as replicates to calculate statistics
Experiment       All of the trials performed on a given date
Project          Groups of experiments with overall goals
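The containment between these models can be pictured with a minimal stand-alone sketch. These are plain illustrative Python classes, not the real impact ORM models, and all attribute names here are simplified stand-ins:

```python
# Illustrative sketch of the schema's containment hierarchy.
# Plain stand-in classes, not the real impact models.

class AnalyteData:
    def __init__(self, name, time, data):
        self.name, self.time, self.data = name, time, data

class SingleTrial:
    def __init__(self):
        self.analyte_dict = {}   # analyte name -> AnalyteData

class ReplicateTrial:
    def __init__(self, single_trials):
        self.single_trials = single_trials  # replicates grouped for statistics

class Experiment:
    def __init__(self, replicate_trials):
        self.replicate_trials = replicate_trials

# One well measured for OD600, replicated twice, in one experiment
trial_a, trial_b = SingleTrial(), SingleTrial()
trial_a.analyte_dict['OD600'] = AnalyteData('OD600', [0, 1, 2], [0.05, 0.20, 0.80])
trial_b.analyte_dict['OD600'] = AnalyteData('OD600', [0, 1, 2], [0.06, 0.22, 0.75])
expt_sketch = Experiment([ReplicateTrial([trial_a, trial_b])])

print(len(expt_sketch.replicate_trials[0].single_trials))  # -> 2
```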

On import, data will automatically be parsed into this format. In addition, data will most commonly be queried by metadata in the TrialIdentifier, which is composed of three main identifiers:

Model        Function
Strain       Describes the organism being characterized (e.g. strain, knockouts, plasmids, etc.)
Media        Describes the medium used to characterize the organism (e.g. M9 + 0.02% glc_D)
Environment  The conditions and labware used (e.g. 96-well plate, 250 RPM, 37 C)

Importing data

Data is imported using the parse_raw_data function from the impact.parsers module. This function returns an Experiment, the result of organizing all of your data.

To parse data, provide it in an xlsx file in one of the supported formats. If your data doesn't conform to a built-in format, you can use the provided parsers as a cookbook to build your own; generally, only minor edits are required to support a new format.
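As a rough, self-contained sketch of what any custom parser ultimately has to do — regroup flat rows of (trial, time, analyte, value) into per-trial, per-analyte time courses — consider the following. The field names and values here are purely hypothetical; the real parsers live in impact.parsers:

```python
# Hypothetical sketch: regroup flat (trial, time, analyte, value) rows
# into per-trial, per-analyte time courses -- the shape a parser must build.
from collections import defaultdict

rows = [
    ('strain1', 0.0, 'glucose', 20.0),
    ('strain1', 4.0, 'glucose', 12.5),
    ('strain1', 0.0, 'ethanol', 0.0),
    ('strain1', 4.0, 'ethanol', 3.1),
]

trials = defaultdict(lambda: defaultdict(list))
for trial, time, analyte, value in rows:
    trials[trial][analyte].append((time, value))

print(sorted(trials['strain1']['ethanol']))  # -> [(0.0, 0.0), (4.0, 3.1)]
```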

Here we use the sample test data, a typical format for data from HPLC. Each row is a specific trial and time point, and the columns represent the different analytes and their types. You can see this data in tests/test_data/sample_input_data.xlsx.

In [2]:
from impact.parsers import parse_raw_data
from pprint import pprint
expt = parse_raw_data('default_titers', file_name='../tests/test_data/sample_input_data.xlsx')
expt.calculate()

Importing data from ../tests/test_data/sample_input_data.xlsx...0.1s
Parsed 2884 timeCourseObjects in 0.528s...Number of lines skipped:  0
Parsing time point list...Parsed 2884 time points in 2.7s
Parsing analyte list...Parsed 18 analytes in 633.9ms
Parsing single trial list...Parsed 18 replicates in 0.0s
Analyzing data...
Ran analysis in 1.1s

The data is now imported and organized; we can quickly get an overview of what we've imported.

In [3]:
print(expt)
strain    media    environment    analytes
--------  -------  -------------  --------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
strain1            None           ['ethanol', 'glucose', '1,3-butanediol', 'lactate', 'succinate', 'pyruvate', 'formate', 'R/S-2,3-butanediol', 'acetate', '3-hydroxybutyrate', 'acetoin', 'meso-2,3-butanediol', 'acetaldehyde', 'OD600']
strain2            None           ['ethanol', 'glucose', '1,3-butanediol', 'lactate', 'succinate', 'pyruvate', 'formate', 'R/S-2,3-butanediol', 'acetate', '3-hydroxybutyrate', 'acetoin', 'meso-2,3-butanediol', 'acetaldehyde', 'OD600']
strain2            None           ['1,3-butanediol', 'glucose', 'ethanol', 'lactate', 'succinate', 'pyruvate', 'formate', 'R/S-2,3-butanediol', 'acetate', '3-hydroxybutyrate', 'acetoin', 'meso-2,3-butanediol', 'acetaldehyde', 'OD600']
strain3            None           ['1,3-butanediol', 'glucose', 'ethanol', 'lactate', 'succinate', 'pyruvate', 'formate', 'R/S-2,3-butanediol', 'acetate', '3-hydroxybutyrate', 'acetoin', 'meso-2,3-butanediol', 'acetaldehyde', 'OD600']
strain3            None           ['1,3-butanediol', 'glucose', 'ethanol', 'lactate', 'succinate', 'pyruvate', 'formate', 'R/S-2,3-butanediol', 'acetate', '3-hydroxybutyrate', 'acetoin', 'meso-2,3-butanediol', 'acetaldehyde', 'OD600']
strain4            None           ['glucose', '1,3-butanediol', 'ethanol', 'lactate', 'succinate', 'pyruvate', 'formate', 'R/S-2,3-butanediol', 'acetate', '3-hydroxybutyrate', 'acetoin', 'meso-2,3-butanediol', 'acetaldehyde', 'OD600']
strain4            None           ['1,3-butanediol', 'glucose', 'ethanol', 'lactate', 'succinate', 'pyruvate', 'formate', 'R/S-2,3-butanediol', 'acetate', '3-hydroxybutyrate', 'acetoin', 'meso-2,3-butanediol', 'acetaldehyde', 'OD600']

Before we dive into data analysis, it is worth having a basic understanding of the schema to know where to look for data.

First, all data is funneled into a ReplicateTrial, even if you only have one replicate, so it is convenient to always look for data in this object. A ReplicateTrial contains avg and std attributes where you can find the respective statistics. Both are instances of SingleTrial, so statistical data is accessed in the same way as the raw data itself.
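The statistics behind avg and std amount to point-wise means and standard deviations across replicates. A minimal stand-alone sketch of that calculation, using plain lists rather than the real impact objects:

```python
# Point-wise mean and sample standard deviation across replicates,
# mirroring what ReplicateTrial's avg and std attributes hold.
from statistics import mean, stdev

replicate_vectors = [
    [0.0, 1.0, 3.0, 7.0],   # replicate 1, e.g. g/L ethanol over time
    [0.0, 2.0, 3.0, 9.0],   # replicate 2
]

avg = [mean(vals) for vals in zip(*replicate_vectors)]
std = [stdev(vals) for vals in zip(*replicate_vectors)]

print(avg)  # -> [0.0, 1.5, 3.0, 8.0]
```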

Querying and filtering for data

After import, all data is sorted into Python objects associated with a SQL database using an object-relational mapper, SQLAlchemy. Usually we're interested in comparing a set of features across a set of conditions (strain, media, environment), and the queryable database lets us search for the data we are interested in.

Although it is usually simple to use the ORM to access the database directly, basic querying can also be done with Python list comprehensions. The major limitation is that you can only query experiments loaded in memory, i.e. experiments that were parsed in this notebook.

In [4]:
print('All')
reps = [rep for rep in expt.replicate_trials]
for rep in reps:
    print(rep.trial_identifier)

print('Filtered')
reps = [rep for rep in expt.replicate_trials
        if rep.trial_identifier.strain.name == 'strain1']
for rep in reps:
    print(rep.trial_identifier)
All
strain: strain1,        media: ,        env: None
strain: strain2,        media: ,        env: None
strain: strain3,        media: ,        env: None
strain: strain4,        media: ,        env: None
strain: strain3,        media: ,        env: None
strain: strain4,        media: ,        env: None
strain: strain2,        media: ,        env: None
Filtered
strain: strain1,        media: ,        env: None

To use the database, we must query data through a session object. A single session stays open for the lifetime of the application.

In [10]:
from impact.database import session, create_db
create_db()
session
Out[10]:
<sqlalchemy.orm.session.Session at 0x24ed289cb70>

Now that we have a session, we can use the standard SQLAlchemy ORM language to query; it is described in detail at http://docs.sqlalchemy.org/en/latest/orm/tutorial.html#querying

In [12]:
session.add(expt)
In [15]:
reps = session.query(impt.ReplicateTrial)\
                .join(impt.ReplicateTrialIdentifier)\
                .join(impt.Strain)\
                .filter(impt.Strain.name == 'strain1').all()

for rep in reps:
    print(rep.trial_identifier)
strain: strain1,        media: ,        env: None

Visualization

Several packages already exist for visualization in Python. The most popular is matplotlib; it has a simple syntax that should feel familiar to MATLAB users, but it generates static plots. The Impact visualization module is built around plotly, so it is worthwhile to understand the basic syntax of plotly charts.

In [23]:
import impact.plotting as implot
import numpy as np

# Charts are made up in a hierarchical structure, but can be quickly generated as follows:
x = np.linspace(0,10,10)
y = np.linspace(0,10,10)**2
implot.plot([implot.go.Scatter(x=x,y=y),
             implot.go.Scatter(x=x,y=y*2)])

# For more control over these plots, they can be built from the ground up
# Traces are defined for each feature
traces = [implot.go.Scatter(x=x, y=y),
          implot.go.Scatter(x=x, y=y*2)]

layout = implot.go.Layout(width=400)

# Traces are joined to a figure
fig = implot.go.Figure(data=traces, layout=layout)

# And a figure is printed using plot
implot.plot(fig)

The implot package offers direct wrappers for useful plotly functions (which can also be accessed through plotly directly), and the Impact visualization module adds functions to help extract useful data and generate traces.

In [30]:
from impact.plotting import time_profile_traces

implot.plot(time_profile_traces(replicate_trials=expt.replicate_trials,
                                analyte='ethanol'))
In [ ]:

# Grab the average, standard deviation, and per-replicate data for ethanol
rep_list = list(expt.replicate_trial_dict.values())

etoh_datas = []
for rep in rep_list:
    etoh_datas.append({'avg': rep.avg.analyte_dict['ethanol'],
                       'std': rep.std.analyte_dict['ethanol']})
    for rep_id in rep.single_trial_dict:
        etoh_datas[-1][rep_id] = rep.single_trial_dict[rep_id].analyte_dict['ethanol']

# Plot the average and each replicate for every trial
for etoh_data in etoh_datas:
    layout = implot.go.Layout(title=str(etoh_data['avg'].trial_identifier))
    implot.plot(implot.go.Figure(data=[implot.go.Scatter(x=etoh_data[key].time_vector,
                                                         y=etoh_data[key].data_vector,
                                                         name=key)
                                       for key in sorted(etoh_data)],
                                 layout=layout))
In [ ]:
import impact.plotting
import plotly.graph_objs as go
from plotly.offline import iplot

# Collect the average and standard deviation profiles for ethanol
ethanol_avg_data = [rep.avg.analyte_dict['ethanol'] for rep in expt.replicate_trial_dict.values()]
ethanol_std_data = [rep.std.analyte_dict['ethanol'] for rep in expt.replicate_trial_dict.values()]

iplot(
    [go.Scatter(
            x=analyte_data_avg.time_vector,
            y=analyte_data_avg.data_vector,
            name=str(analyte_data_avg),
            error_y=dict(type='data', array=analyte_data_std.data_vector)
               )
     for analyte_data_avg, analyte_data_std in zip(ethanol_avg_data, ethanol_std_data)]
)

Exploring features

With a standard schema for the data, we can now begin to explore some of the features which have been generated. Features include things like:

  • rate (\(g\ \ h^{-1}\))
  • yield (\(g_{product}\ \ g_{substrate}^{-1}\))
  • specific productivity (\(g\ \ gdw^{-1}\ \ h^{-1}\))
  • normalized data (e.g. \(a.u. fluorescence\ \ OD_{600}^{-1}\))
In [ ]:
ethanol_avg_specific_productivity = [rep.avg.analyte_dict['ethanol'].specific_productivity
                                     for rep in expt.replicate_trial_dict.values()]

iplot(
    [go.Scatter(
            x=analyte_data_avg.time_vector,
            y=analyte_data_avg.specific_productivity,
            name=str(analyte_data_avg)
               )
     for analyte_data_avg in ethanol_avg_data]
)

Or maybe just the endpoints.

In [ ]:
# Sort trials by their final ethanol titer
ethanol_avg_data = sorted(ethanol_avg_data, key=lambda x: x.pd_series.iloc[-1])

iplot(
    [go.Bar(
            x=[str(analyte_data_avg) for analyte_data_avg in ethanol_avg_data],
            y=[analyte_data_avg.pd_series.iloc[-1] for analyte_data_avg in ethanol_avg_data],
            name='ethanol'
               )]
)