# Genome Evolution in Host-Pathogen Systems


## Finally the host pathogen entry!

This entry is the reason behind setting up the blog. The idea is to build a mathematical-computational model of genome evolution in plant pathogens, in order to explore different scenarios such as evolutionary responses to host jumps. In what follows I will present and discuss every aspect of the model in detail: maths, code, etc. Then (soon, I hope) it will be turned into a manuscript.

Perhaps the final version will be published in Nature or Science, and the results will grant me a Nobel prize and the blog a Pulitzer. Or perhaps not. Either way, it is an experiment which will allow me to share with everyone, in real time, the ideas, techniques and intellectual process involved in carrying out research of this nature.

One of the advantages of this format is that I can share not only ideas and calculations but also code! In fact, both GitHub and Octopress are mostly used for that purpose, so yeah, I will be sharing and describing bits and pieces of code where appropriate. The full code (and more…) to reproduce the results described here is hosted and documented in its own repository.

For this project I am using Python (I do a lot of testing and tinkering in IPython, and I strongly advise you to do the same!). The model relies heavily on the sampling of random variables, and because I love and respect the algorithms contained in the GNU Scientific Library I ended up using the pygsl wrapper. However, NumPy and SciPy also contain well-documented and functional modules.

As usual with this sort of thing, it would be really cool if you left a comment or e-mailed me about the ideas presented here (computational, biological or mathematical). Having said all the above, let’s begin without further ado.

## Index

1. Introduction
2. Model
3. Results
4. References

## Introduction

Comparative genomic analyses of the members of plant pathogen clades reveal striking differences in genome length and architecture. These divergences are mostly localised in the genes and regions of the genome that encode the host-pathogen interactions. Such domains have a high content of transposable elements, in which are embedded sparse genetic units in charge of establishing the trophic link with the host. Those regions and genes are the object of this study.

The analyses also suggest that such domains, made up of non-coding sequence and effector genes, could be the result of processes related to extreme environmental pressures, specifically a shift from one host to another (a host jump).

We will restrict ourselves to the case in which the set of effector proteins carried by the pathogen is already successful in invading and tinkering with the host’s biochemistry, and we will not consider any co-evolutionary processes.

## Model

The genes responsible for the host-pathogen interaction are what we will call here effector genes (EGs). We will assume that these EGs encode proteins which mediate the biochemical processes that allow the pathogen to extract resources from the host, by establishing links with target units (TUs) or molecules within the host.

### I. Host Target Sets.

We will identify the set of all TUs with a set of integers $\mathcal{T}=\{1,\dots,K_T\}$ from which we can choose a subset to form particular instances of hosts. For example if $K_T=5000$, then we can form two hosts by forming ordered arrays of some length $L<K_T$, such as $h_1=(1,10,100,1000)$ and $h_2=(20,90,835)$.

In Python this can be achieved efficiently in a number of different ways. In order to illustrate one method, and to introduce the way in which we will be using the modules we write for the model, I will write a function which creates lists of TU numbers with no repeated elements. The function takes three arguments: the length of the list $L$, the value $K_T$, and the handler for the random number generator, which we will call rk. The function looks like:
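A minimal sketch of such a function follows. It uses Python's standard `random` module as a stand-in for pygsl, so it is not the repository's NEWHOST: the name `newhost_sketch` is hypothetical, and `rk` is assumed here to be a `random.Random` instance rather than a GSL generator handle. The list is returned sorted, on the assumption that "ordered arrays" means ascending order as in $h_1$ and $h_2$.

```python
import random


def newhost_sketch(L, K_T, rk):
    """Return a sorted list of L distinct target units drawn from 1..K_T."""
    if not 0 <= L <= K_T:
        raise ValueError("L must satisfy 0 <= L <= K_T")
    # sample() draws without replacement, so no element is repeated.
    return sorted(rk.sample(range(1, K_T + 1), L))


rk = random.Random(42)  # seeded generator handle, for reproducibility
host = newhost_sketch(4, 5000, rk)
print(host)
```

Passing the generator in as an argument, rather than using a global one, keeps every draw in the model tied to a single seeded stream.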

Although Python is a very easy language to use, a word of caution must be said at this stage. In Python we should be aware of the indentation level of the statements at all times! Indeed, this is one of those languages in which indentation matters, and if it is not correct the code will not work. So always keep it in mind.

The function is called NEWHOST and is saved in a file named newhost.py. This and every other file related to the model will be placed in a folder called codes. Of particular interest is a file named hpmodel.py, which I will use to import every function separately. For instance, to import the host-generating function we must add to it the line:

The setup described so far will allow us to call hpmodel, and all the custom functions we import into it, in the same way we use any other Python module. For instance:

These lines import some basic Python modules which will be useful later, and also import the custom hpmodel with the alias mda. Then, if we execute something like:

We will end up with three dictionaries, each holding ten different hosts. The one called HSTa is populated with the previously described function. The other two are populated with lists of correlated hosts (HSTb) and with lists of different lengths padded with zeroes (HSTc), respectively. Both of these generators are also imported through hpmodel. A typical output should look similar to:
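As a rough illustration of the kind of structure involved, here is a sketch that builds stand-ins for HSTa and HSTc only. Everything in it is an assumption: the helper names, the host length of 8, right-padding with zeroes, and the use of the standard `random` module in place of pygsl. The correlated hosts in HSTb are left out because the text does not say how the correlation is built.

```python
import random


def newhost_sketch(L, K_T, rk):
    """Sorted list of L distinct target units drawn from 1..K_T."""
    return sorted(rk.sample(range(1, K_T + 1), L))


def padded_host_sketch(L_max, K_T, rk):
    """Host of random length 1..L_max, right-padded with zeroes to L_max."""
    L = rk.randint(1, L_max)  # randint includes both endpoints
    host = newhost_sketch(L, K_T, rk)
    return host + [0] * (L_max - L)


rk = random.Random(7)
K_T, L, n_hosts = 5000, 8, 10  # illustrative values only

# Fixed-length hosts, keyed by an integer host label.
HSTa = {i: newhost_sketch(L, K_T, rk) for i in range(n_hosts)}
# Variable-length hosts stored in fixed-width lists.
HSTc = {i: padded_host_sketch(L, K_T, rk) for i in range(n_hosts)}

for label in range(3):
    print(label, HSTa[label], HSTc[label])
```

Zero is a safe padding value in this sketch because the target units are numbered from 1 to $K_T$, so it can never be confused with a real TU.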

Now that we have a function to generate lists of TUs, which we will regard as the inner degrees of freedom of our model hosts, we move on to describing our model pathogen genomes in detail.

### II. Effector Units.

As stated earlier, we are interested in the structural aspects of the pathogen’s genome evolution. We are not directly addressing the dynamics of gene frequencies, which are described by the theory of population genetics. We do consider that the changes induced in the genome are the result of processes such as mutation, selection and drift, coupled with dynamics at different scales such as horizontal gene transfer. Every one of these processes will be explained in detail in the forthcoming sections.

The first thing we would like to introduce is the model for EGs. We will define EGs by a number of attributes rather than by their particular nucleotide sequences. For instance, we will consider an EG’s number of bases (length), the existence and strength of links between the EG and its particular list of TUs within a host, etc.

First and foremost, we will introduce the list of possible targets a particular EG is able to connect to out of the universal set $\mathcal{T}$. We call this list the EG’s adjacency list. To construct the adjacency list of a given effector we introduce a parameter $c \in (0,1)$, which we will use to assign the initial number of targets in the list accessible to that effector.

If we want the initial number of TUs of every EG present in the genome to be (on average) equal to $cK_T$, we need to sample numbers $n_i$ from a binomial distribution with $K_T$ trials and success probability $c$. Then the initial adjacency list for each EG can be generated by using our function NEWHOST, passing to it $n_i$ and $K_T$, for example:
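The two steps can be sketched as follows, under stated assumptions: the binomial draw is done with the standard `random` module as a sum of Bernoulli trials (a stand-in for the pygsl sampler), `newhost_sketch` is a hypothetical substitute for NEWHOST, and the values of $c$, $K_T$ and the number of effectors are purely illustrative.

```python
import random


def binomial_sketch(n, p, rk):
    """One Binomial(n, p) draw, computed as a sum of n Bernoulli(p) trials."""
    return sum(rk.random() < p for _ in range(n))


def newhost_sketch(L, K_T, rk):
    """Sorted list of L distinct target units drawn from 1..K_T."""
    return sorted(rk.sample(range(1, K_T + 1), L))


rk = random.Random(1)
K_T, c, n_effectors = 5000, 0.002, 5  # so that c * K_T = 10 targets on average

# Step 1: one binomial draw n_i per effector gene.
sizes = [binomial_sketch(K_T, c, rk) for _ in range(n_effectors)]
# Step 2: an adjacency list of n_i distinct targets for each effector gene.
adjacency = [newhost_sketch(n_i, K_T, rk) for n_i in sizes]

for n_i, adj in zip(sizes, adjacency):
    print(n_i, adj)
```

The Bernoulli sum costs $K_T$ uniform draws per effector, which is fine for a sketch; a dedicated binomial sampler is the better choice for large genomes.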

This should produce output similar to:

The above example shows the way we will generate the initial adjacency list for every EG in a genome. These lists will determine whether or not the effector is able to gain resources from the host, by comparing the elements present in its list with the ones present in the host.

The adjacency lists are not the only attribute we will consider to determine the interactions between a given EG and a host. We would also like the strength of such interactions to be an evolving quantity.
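One simple way to make that comparison is a set intersection. The sketch below assumes that an effector can gain resources from a host exactly when the two lists share at least one target, and that a zero entry only marks padding; neither rule is fixed by the text at this point. The function name and the two adjacency lists are hypothetical.

```python
def shared_targets(adjacency, host):
    """Targets present both in an effector's adjacency list and in the host."""
    # Zero is dropped on the assumption that it only marks padding.
    return sorted(set(adjacency) & (set(host) - {0}))


host = [1, 10, 100, 1000]    # h_1 from the text
eg_a = [10, 57, 1000, 4200]  # hypothetical adjacency lists
eg_b = [3, 57, 835]

print(shared_targets(eg_a, host))  # [10, 1000]
print(shared_targets(eg_b, host))  # []
```

Under this assumption `eg_a` links to the host through two targets, while `eg_b` has no link at all.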

#### Link Weights and Effector Scores

Quantifying the dynamics of the host-pathogen interaction is perhaps the most complicated aspect of building models which operate at this mesoscopic level of description. To do so we will introduce two quantities, which we call the effector-target link score, denoted by $s_{e,t}$, and the effector-target gain, denoted by $g_{e,t}$. The former attempts to capture how good an effector is at obtaining resources for the pathogen.