Tuesday, November 14, 2017

Open Science

Nothing really original in this post: I just recollect what is already said on the FosterOpenScience web pages. Their definition is:

"Open Science represents a new approach to the scientific process based on cooperative work and new ways of diffusing knowledge by using digital technologies and new collaborative tools (European Commission, 2016b:33). The OECD defines Open Science as: “to make the primary outputs of publicly funded research results – publications and the research data – publicly accessible in digital format with no or minimal restriction” (OECD, 2015:7), but it is more than that. Open Science is about extending the principles of openness to the whole research cycle (see figure 1), fostering sharing and collaboration as early as possible thus entailing a systemic change to the way science and research is done." The wikipedia page is also useful.

This approach goes along with that of doing reproducible research, which I have already talked about several times. I do not have much to add to what they wrote, but I also want to note that "there are in fact multiple approaches to the term and definition of Open Science, that Fecher and Friesike (2014) have synthesized and structured by proposing five Open Science schools of thought".

In our work, a basic assumption is that openness also requires the appropriate tools, and we are working hard to produce them, and to use those made by others, that make a scientific workflow open.

Friday, November 10, 2017

About Benettin et al. 2017, equation (1)

Gianluca (Botter), in his review of Marialaura (Bancheri)'s Ph.D. thesis, brought to my attention the paper Benettin et al. 2017. A great paper indeed, where a couple of ideas are clearly explained:

  • SAS functions can be derived from the knowledge of the travel time and residence time probabilities
  • a virtual experiment where they show that the traditional pdfs (travel time pdfs) can be seen as the ensemble of the actual time-varying travel time distributions.

The paper is obviously relevant also for the hydrological content it explains, but that is not what I want to argue about here. I just want to discuss a little the way they present their first equation.

SAS stands for StorAge Selection functions, and they are defined, for instance in Botter et al. 2011 (with a slight difference in notation), as:
$$
\omega_x(t,t_{in}) = \frac{p_x(t-t_{in}|t)}{p_S(t-t_{in}|t)} \ \ \ (1)
$$
as the ratio between the travel time probability related to output $x$ (for instance discharge or evapotranspiration) and the residence time probability.
In the above equation (1)
  •  $\omega_x$ is the symbol that identifies the SAS
  • $t$ is the clock time
  • $t_{in}$ is the injection time, i.e. the time when water has entered the control volume
  • $p_x(t-t_{in}|t)$, with $x \in \{Q, ET, S\}$, is the probability that a molecule of water that entered the system at time $t_{in}$ is, at time $t$, still inside the control volume ($x=S$) or is revealed as discharge ($x=Q$) or evapotranspiration ($x=ET$)

Equation (1) in Benettin et al. is therefore written as
$$
\frac{\partial S_T(T,t)}{\partial t} + \frac{\partial S_T(T,t)}{\partial T} = J(t) - Q(t) \Omega_Q(S_T(T,t),t)-ET(t) \Omega_{ET}(S_T(T,t),t) \ \ \ \ (2)
$$
Where:

  • $T$ is the residence time (they call it water age, but this could be a little misleading because, by their own theory, the age of water could differ among storage, discharge and evapotranspiration)
  • $S_T$ is the age-ranked storage, i.e. “the cumulative volumes of water in storage as ranked by their age” (I presume the word “cumulative” implies some integration. After thinking a while and looking around, also at the paper van der Velde et al. 2012, I presume the integration is over all the travel times up to $T$, which, because the variable of integration in my notation is $t_{in}$, means that $t_{in} \in [t-T, t]$)
  • $J(t)$ is  the precipitation rate at time $t$
  • $Q(t)$ is the discharge rate at time $t$
  • $\Omega_x$ are the cumulative (integrated) SAS functions, which are more extensively derived below.

In fact, this (2) should be just an integrated version (integrated over $t_{in}$) of equation (9) of Rigon et al., 2016:
$$
\frac{ds(t,t_{in})}{dt} = j(t,t_{in}) - q(t,t_{in}) -et(t,t_{in})
\ \ \ \ (3)
$$
where:
  • $s(t,t_{in})$ is the water stored in the control volume at time $t$ that was injected at time $t_{in}$
  • $j(t,t_{in})$ is the water input, which can have age $T=t-t_{in}$
  • $q(t,t_{in})$ is the discharge that exits the control volume at time $t$ and entered the control volume at time $t_{in}$
  •  $et(t,t_{in})$ is the evapotranspiration that exits the control volume at time $t$ and entered the control volume at time $t_{in}$
In terms of the SAS and the formulation of the problem given in Rigon et al. (2016), the $\Omega$s can be defined as follows:
\begin{equation}
\Omega_x(T,t) \equiv \Omega_x(S_T(T,t),t) := \int_{t-T}^t \omega_x(t,t_{in})\, p_S(t-t_{in}|t)\, dt_{in} = \int_0^{P_S(T|t)} \omega_x(P_S,t)\, dP_S
\end{equation}
Where the ":=" sign on the l.h.s. denotes a definition, so the $\Omega$s ($\Omega_Q$ and $\Omega_{ET}$) are this type of object, and $P_S(T|t)$ is the cumulative residence time probability. The identity $\equiv$ stresses that the dependence on $t_{in}$ is mediated by a dependence on the cumulative storage $S_T$, $T$ being the travel time. As soon as $T \to \infty$, $\Omega_x \to 1$ (which is what is written in equation (2) of Benettin's paper). This is easily understood because, by definition, $\omega_x(t,t_{in})\, p_S(t-t_{in}|t) \equiv p_x(t-t_{in}|t)$ are probabilities (as deduced from (1)).
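As a quick side check of this limit, here is a toy numerical sketch (my own illustration, not taken from the papers): I assume, purely for convenience, exponential travel time and residence time pdfs at a fixed clock time $t$, build $\omega_Q$ from equation (1), and integrate it against $p_S$ as above. All names and numbers are hypothetical.

```python
# Toy check that Omega_Q(T) -> 1 as T grows, under assumed exponential pdfs.
import numpy as np

MEAN_TT_Q = 30.0   # assumed mean travel time to discharge [days]
MEAN_RT_S = 50.0   # assumed mean residence time in storage [days]

def p_Q(T):
    """Travel time pdf of discharge, p_Q(T|t) (assumed exponential)."""
    return np.exp(-T / MEAN_TT_Q) / MEAN_TT_Q

def p_S(T):
    """Residence time pdf of storage, p_S(T|t) (assumed exponential)."""
    return np.exp(-T / MEAN_RT_S) / MEAN_RT_S

def omega_Q(T):
    """SAS function from equation (1): ratio of the two pdfs."""
    return p_Q(T) / p_S(T)

def Omega_Q(T_max, n=200_000):
    """Omega_Q(T,t) = integral over [0, T] of omega_Q(tau) * p_S(tau|t) dtau."""
    tau = np.linspace(0.0, T_max, n)
    integrand = omega_Q(tau) * p_S(tau)
    dtau = tau[1] - tau[0]
    return dtau * (integrand.sum() - 0.5 * (integrand[0] + integrand[-1]))  # trapezoidal rule

for T in (30.0, 100.0, 300.0, 1000.0):
    print(f"Omega_Q({T:g}) = {Omega_Q(T):.4f}")   # approaches 1 for large T
```

The product $\omega_Q\, p_S$ collapses back to $p_Q$, so the integral is just the cumulative travel time distribution, which tends to one, as stated.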
An intermediate passage to derive (2) from (3) requires making explicit the dependence of the age-ranked functions on the probabilities. From the definitions given in Rigon et al., 2016, it is
$$
\frac{d \left[ S(t)\, p_S(t-t_{in}|t)\right]}{dt} = J(t)\, \delta (t-t_{in}) - Q(t)\, p_Q(t-t_{in}|t) - ET(t)\, p_{ET}(t-t_{in}|t)
$$
which  is Rigon et al. equation (14).
Now integration over $t_{in} \in [t-T, t]$ can be performed to obtain:
$$
S_T(T,t):= \int_{t-T}^{t} s(t,t_{in})\, dt_{in}
$$
and, trivially,
$$
J(t) = J(t) \int_{t-T}^{t} \delta(t-t_{in})\, dt_{in}
$$
while for the $\Omega$s I have already said above what is needed.
The final step is to make a change of variables that eliminates $t_{in}$ in favor of $T := t-t_{in}$. This implies one last transformation. In fact:
$$
\frac{dS_T(T(t_{in},t),t)}{dt} =\frac{\partial S_T(T,t)}{\partial t} + \frac{\partial S_T(T,t)}{ \partial T}\frac{\partial T}{ \partial t} = \frac{\partial S_T(T,t)}{\partial t} + \frac{\partial S_T(T,t)}{ \partial T}
$$
since $\partial T/\partial t = 1$. Assembling all the results, equation (2) is obtained.

Note:
Benettin et al., 2017 redefine the probability $p_S$ as the “normalized rank storage … which is confined in [0,1]”, which seems weird with respect to the Authors' own literature. In previous papers this $p_S$ was called the backward probability and written as $\overleftarrow{p}_S(T,t)$. Now, probably, they have doubts that we are really talking about a probability. In any case, please read it again: “normalized rank storage … which is confined in [0,1]”. Does it not sound unnatural if it is not a probability? Especially when you repeatedly estimate averages with it and come out with “mean travel times”? Operationally, it IS a probability. Ontologically, the discussion about whether there is really random sampling or not, because of some kind of convoluted determinism in the formation of travel times, can be interesting, but it leads to a dead end. On the same premises we should ban the word probability from the stochastic theory of water flow which, since Dagan, has been enormously fruitful.

This long circumlocution looks to me like the symbol below

or TAFKAP, the acronym used for The Artist Formerly Known As Prince when he had problems with his record company.
In any case, Authors should pay attention to this never-ending tendency to redefine the problem, because it can look like what Fisher (as attributed by Box, 1976) called mathematistry. This is fortunately not the case for the paper we are talking about. But then, why not stick with the established notation?

The Authorea version of this blog post can be found here.


Tuesday, October 31, 2017

Meledrio, or a simple reflection on Hydrological modelling - Part VI - A little about calibration

The normal calibration strategy is to split the data we want to reproduce into two sets:

  • one for the calibration phase
  • one for the "validation" phase
Let's assume that we have an automatic calibrator (a minimal sketch of such a loop is given below). It usually:
  • generates a set of model parameters,
  • estimates the discharges with the rainfall-runoff hydrological model for any given set of parameters,
  • compares what is computed with what is measured by using a goodness-of-fit indicator,
  • keeps the set of parameters that gives the best performance,
  • repeats the operation a huge number of times (and uses some heuristics for searching for the best set overall)

This set of parameters is the one used for "forecasting" and

  • is now used against the validation set to check its performance.
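Purely as an illustration of the loop just described, here is a minimal sketch in Python. It assumes a hypothetical `run_model(params, forcing)` function returning simulated discharges, a plain uniform random search over the parameter bounds, and the Nash-Sutcliffe efficiency as goodness-of-fit indicator; none of these choices comes from a specific calibrator.

```python
# A minimal sketch of the standard two-set calibration loop (illustrative only).
import numpy as np

rng = np.random.default_rng(42)

def nse(sim, obs):
    """Nash-Sutcliffe efficiency (goodness-of-fit indicator)."""
    return 1.0 - np.sum((sim - obs) ** 2) / np.sum((obs - obs.mean()) ** 2)

def calibrate(run_model, forcing_cal, q_obs_cal, bounds, n_iter=10_000):
    """Keep the parameter set giving the best NSE on the calibration set."""
    best_params, best_score = None, -np.inf
    for _ in range(n_iter):
        # generate a set of model parameters (uniform within the given bounds)
        params = {name: rng.uniform(low, high) for name, (low, high) in bounds.items()}
        # estimate the discharges with the rainfall-runoff model
        q_sim = run_model(params, forcing_cal)
        # compare what is computed with what is measured
        score = nse(q_sim, q_obs_cal)
        # keep the best-performing set so far
        if score > best_score:
            best_params, best_score = params, score
    return best_params, best_score

# the retained set is then checked against the validation set, e.g.
# nse(run_model(best_params, forcing_val), q_obs_val)
```

Any smarter heuristic (particle swarm, genetic algorithms, and so on) would replace the uniform sampling, but the structure of the loop stays the same.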
However, my experience (with my students, who usually perform it) is that the best parameter set in the calibration procedure is usually not the best in the validation procedure. So I suggest, at least as a trial and for further investigation, to:

  • separate the initial data set into 3 parts (one for a first calibration, one for selection, and one for validation);
  • in the calibration phase, select the best-performing 1% (or x%, with x left to your decision) of the parameter sets (the behavioural set); then, in the selection phase, further sieve the best-performing 1% of these;
  • use this 1 per ten thousand of the original sets in the validation phase.
The hypothesis to test is that this three-step way of calibrating usually returns better performance in validation than the original two-step one; a sketch of the procedure follows.
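To make it concrete, the sketch below reuses the same hypothetical `run_model` introduced above; the splits, the variable names and the 1% thresholds are just the illustrative choices from the list, not a definitive implementation.

```python
# A sketch of the proposed three-step sieve. `run_model(params, forcing)` is a
# hypothetical rainfall-runoff model; `splits` holds (forcing, observed Q)
# pairs for the calibration ("cal"), selection ("sel") and validation ("val") sub-periods.
import numpy as np

def nse(sim, obs):
    """Nash-Sutcliffe efficiency (goodness-of-fit indicator)."""
    return 1.0 - np.sum((sim - obs) ** 2) / np.sum((obs - obs.mean()) ** 2)

def three_step_calibration(run_model, splits, param_sets, x=0.01):
    # 1. score every candidate parameter set on the calibration split
    cal_scores = [nse(run_model(p, splits["cal"][0]), splits["cal"][1]) for p in param_sets]
    n_keep = max(1, int(x * len(param_sets)))
    behavioural = [param_sets[i] for i in np.argsort(cal_scores)[::-1][:n_keep]]  # best x%
    # 2. sieve again: best x% of the behavioural set on the selection split
    sel_scores = [nse(run_model(p, splits["sel"][0]), splits["sel"][1]) for p in behavioural]
    n_keep = max(1, int(x * len(behavioural)))
    finalists = [behavioural[i] for i in np.argsort(sel_scores)[::-1][:n_keep]]
    # 3. the survivors (about 1 in 10^4 of the original sets) go to validation
    val_scores = [nse(run_model(p, splits["val"][0]), splits["val"][1]) for p in finalists]
    return finalists, val_scores
```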

Sunday, October 29, 2017

Open Science Framework - OSF

And recently I discovered OSF, the Open Science Framework. My students told me that there exist many tools of this type, online tools that leverage the cloud to store files and help groups manage their workflow. However, OSF seems particularly well suited for groups of scientists, since it links to various science-oriented services, like Mendeley, Figshare, GitHub and others. An OSF “project” can contain writings, figures, code, data. All of this can be uploaded for free to their servers or maintained in one of your cloud storages, like Dropbox or Google Drive.

To start, you can take one hour of your time to follow one of their YouTube videos, like the one below.

Their web page also contains some useful guides that cover the rest (do not hesitate to click on the icons: they contain useful material!). The first one you can start with is about the wiki, a customizable initial page that appears in any project or sub-project. There are some characteristics that I want to emphasize here. Starting a new project is easy, and once you have learned how to do it, you have learned almost all of it. Any project can have subprojects, called “components”. Each component behaves like a project by itself, so when dealing with it you do not have to learn anything really new. Any (sub)project can be private (the default) or public, separately, and therefore your global workflow can contain both private and public material.

Many people are working with OSF. For instance, Titus Brown’s Living in an Ivory Basement blog has a detailed review of it. They also coded a command-line client for downloading files from OSF, which can be further useful.

Wednesday, October 25, 2017

Return Period


Some people, I have realised, have problems with the concept of return period. This is the definition in Wikipedia (accessed October 25th, 2017):
A return period, also known as a recurrence interval (sometimes repeat interval) is an estimate of the likelihood of an event, such as an earthquake, flood[1], landslide[2], or a river discharge flow to occur.
It is a statistical measurement typically based on historic data denoting the average recurrence interval over an extended period of time, and is usually used for risk analysis (e.g. to decide whether a project should be allowed to go forward in a zone of a certain risk, or to design structures to withstand an event with a certain return period). The following analysis assumes that the probability of the event occurring does not vary over time and is independent of past events.
Something that Wikipedia does not include is rainfall intensity. The first paragraph should then be something like:
"A return period of x time units, also known as a recurrence interval (sometimes repeat interval) is an estimate of the likelihood of an event, such as an earthquake, flood[1], landslide[2], rainfall intensity, a river discharge flow or any observable, to occur (or be overcome) on average every x time units."

Return period clearly involves a statistical concept, which is traced back to a probability, and a time concept, that is the sampling time.
Let us assume we have a sequence of data, composed of a discrete number, $n$, of data, for which, at the moment, the sampling time is unknown.
The empirical cumulative distribution function (ECDF) of the data is a representation of the empirical statistics for those data. Let $ECDFc$ be the complementary empirical cumulative distribution function, meaning $ECDFc(h) \equiv 1 - ECDF(h)$.
Let $h^*$ be one of the possible values of these data (not necessarily present in the sequence, but included in the range of experimental values). We are interested in the probability of $h^*$ being overcome. If $m$ is the number of times $h^*$ is matched or overcome, then
$$ ECDFc(h^*)= m/n $$
$$ECDF(h^*) = 1 - m/n$$
We can, at this point, assume that the ECDF resembles some probability function, but this is a further topic we do not want to discuss here. What we want to stress is that ECDFs (probabilities) are not automatically associated with a time. All the data in the sequence refer to different picks of a random variable, and these picks are not necessarily time-ordered, or could all have happened at the same time. So the “frequencies” that can be associated with the above events are not time frequencies.
Now let us introduce time by saying that, for instance, each datum was sampled at a regular time step $\Delta t$, what before I called “time units”, and that, for practical reasons, we are not interested in the ECDF of the data but in how frequently (in a clock-time sense) a value is repeated. So, we can say that the total time of our record is
$$T = n\, \Delta t$$
and in this time span, the number of times $h^*$ is overcome is (by construction)
$$m=ECDFc(h^*)*n$$
On average, along the record obtained, the time frequency with which values greater than $h^*$ occur is the empirical return period:
$$T_r:=\frac{T}{m} =\frac{n\, \Delta t}{ECDFc(h^*)\, n} = \frac{\Delta t}{ECDFc(h^*)}$$
So, the empirical return period of $h^*$ is inversely proportional to the complementary ECDF, $ECDFc(h^*)$, but, properly, there is a “$\Delta t$” to remind us that it is given in time units. One basic assumption in our definition is that the underlying probability is well defined, which it is not if climate change is in action. This is a delicate and well-discussed topic**, but again, it is not the core of this page.

There is a crucial initial step, the sampling of the data, which affects the final result. If the data in the sequence are, for instance, annual maxima of precipitation, then the return period is given in years. If the data were daily precipitation totals, then the return period would be given in days. And so on. Because usually the time unit has value “1” (but the dimension of a time), the numeric value of the return period is just the inverse of the ECDFc. We should not forget, however, that the equation contains a mute dimension: we are talking about times, not dimensionless numbers (probabilities).
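As a minimal numerical sketch of this bookkeeping, assume a (made-up) record of annual precipitation maxima, so that $\Delta t$ is one year; the numbers and names below are illustrative only.

```python
# Empirical return period of a threshold h* from made-up annual maxima.
import numpy as np

annual_maxima = np.array([42., 55., 61., 38., 70., 47., 90., 52., 65., 58.])  # invented values [mm/h]
dt = 1.0  # sampling time Delta t: one year, because these are annual maxima

def return_period(h_star, data, dt):
    n = data.size
    m = np.sum(data >= h_star)      # number of times h* is matched or overcome
    ecdfc = m / n                   # complementary ECDF at h*
    return dt / ecdfc if ecdfc > 0 else np.inf

print(return_period(60.0, annual_maxima, dt))  # 4 of 10 values reach 60, so T_r = 2.5 years
```

If the same values were daily totals, only `dt` would change, and the return period would come out in days rather than years.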

Being Bayesian, you could probably introduce this in a different way. I leave it to you as an exercise.

** On the topic of stationarity, please have a look at:

Milly, P. C. D., Betancourt, J., Falkenmark, M., Hirsch, R. M., Kundzewicz, Z. W., Lettenmaier, D. P., & Stouffer, R. J. (2008). Stationarity is dead: Whither water management? Science, 319, 573–574.

Montanari, A., & Koutsoyiannis, D. (2014). Modeling and mitigating natural hazards: Stationarity is immortal! Water Resources Research, 50(12), 9748–9756. doi:10.1002/2014WR016092.

Serinaldi, F., & Kilsby, C. G. (2015). Stationarity is undead: Uncertainty dominates the distribution of extremes. Advances in Water Resources, 77, 17–36. doi:10.1016/j.advwatres.2014.12.013.

Sunday, October 22, 2017

Simple models for hydrological hazard mapping

This post contains the second talk I gave to high-school teachers at MUSE for the Life FRANCA project. My intention was to show (under a lot of simplifying assumptions) how hydrological models work, and to give a few hints on which types of hydraulic models of sediment transport can be useful.
Clicking on the figure above you can access the slides (in Italian, but with a little time I will provide a translation). In their simplicity, the slides are a storyboard for actions that could be taken in the SteepStream project to provide an estimate of the hazards of the Meledrio river basin (and the other two selected basins).

Friday, October 20, 2017

On some Hydrological Extremes

This is the talk I gave at MUSE for the Life FRANCA project. Life FRANCA has the objective of communicating with people about hydrological hazards and risk. In this case, the audience was composed of high-school teachers.


Clicking on the Figure you will be redirected to the presentation.