intro and slide 1
Okay, I think there are no more people in the waiting room. We have a slightly different setting this time. Can somebody confirm that you can hear me? Yes? Okay, perfect, great. So we have a slightly different setting because today we start a new topic, and I need my computer for that.
In the previous lectures before Christmas I gave you an introduction to the theoretical methods that we need in order to understand how order emerges in non-equilibrium systems, and we also discussed how order manifests in non-equilibrium systems. Now, in the new year, we will look at the other side: we will introduce methods that allow us to actually identify order in experimental data. And of course I'm not talking about a few data points here; we're talking about large, high-dimensional data sets as they have become common in many fields of science, like social science but also biology.
So to begin, let me share the screen. I hope that works. This was actually a lecture that I gave last year already. Can you see the screen? Okay, I assume you can see the slides; let me just make them full screen.
This is actually from a course that I gave last year together with Fabian Ross, the course "Data Science for Physicists". Today we'll discuss the methods that actually allow us to identify order in high-dimensional data sets. And what do I mean by high-dimensional and large data sets?
slide 2
This is one of the data sets that I showed you at the very beginning of the lecture, one that partially motivated this lecture, and it was produced by collaborators in Cambridge. In this data set, on the x-axis you basically have part of the DNA. In the DNA you have roughly three billion base pairs, in mouse and in human. In these experiments, for each of these base pairs, for each of these three billion sites on the DNA, we can take a measurement and measure whether there is a chemical modification on the DNA or not. On the y-axis we have different cells, because we can do that in individual cells: roughly 600 cells from a mouse embryo. So for each of these cells we can take roughly three billion measurements along the sequence of the DNA. That means we have a data set here that is typically a few terabytes in size and that has three billion dimensions. Of course not all of these dimensions are really meaningful, but to start with we have something that is very large in terms of size, and now we need some computational tools to process such data sets.
So what we want to do is start with these measurements that I just showed you. That's a biological figure, and it's not so important to understand the details. I'm sorry, I think I can only see your first slide. Which slide can you see? The "Data Science" title slide? Okay, let me check whether it's working now, going to the next slide, the second slide, "sequencing cells". No, it's not working. Okay, let me stop sharing and share the desktop instead. Can you see the title slide? Yes? And now another slide? Yes? Okay, great.
So I was talking all the time and you probably didn't see this slide before. This is what was on the x-axis: on the x-axis we have these three billion base pairs, and for these base pairs we can make measurements. Let me get a pointer. So we have here the x-axis, and this x-axis has these three billion dimensions, these three billion measurements that we can take for every base pair of the DNA. Each row on the y-axis is a different cell, and for each of these cells, taken from a mouse embryo, in this case a living embryo, we can take these hugely dimensional measurements. These are breakthroughs that happened in biology, but with similar detail we can also measure social systems: if you think about the social sciences, social networks and so on, we have huge amounts of data
slide 3
that we would like to understand. In this particular example from biology, we would like to understand how we can transform the top part of this big picture, where we take measurements that have lots of different dimensions, in this case for many different cells (you can do that with millions of cells if you have the money), into a physical theory of what is generating these measurements. That's what we want to do, and
slide 4
to do this we need to identify order in these structures in order to build a hypothesis. This is how these kinds of data sets arrive on our desk. Here you see tables that contain different bits of information. For example, on the top right you have a table that tells us where on the DNA we have chemical modifications; in this case it has 50 million rows. We have another table on the top left that tells us something about the same positions on the DNA: what the topological structure of the DNA is. Is it compact, like a ball of spaghetti, or is it more open? It tells us something about how this DNA lives in real space. Then we have other kinds of information, and we can ask, for example: what else do we know about this particular bit of the DNA? Is there a gene in this region, is there some other interesting stuff going on in this region? We can ask what these genes are doing; that's the topic of different experiments again, in these regions of the DNA. And we then have, for example, information about these cells: where they are located in the embryo, what the culture conditions of these cells were, and so on. So now we have these high-dimensional data sets from different sources, and we need to combine them to create a big picture that allows us as physicists to generate a hypothesis.
slide 5
This is an overview of today's lecture, and I have to make a shocking confession: we will actually be using R in this lecture. That's something we don't typically use in physics, but in this case it actually makes sense; it is actually well suited. The syntax of R, however, is a little bit different from what we're used to in physics, so I'll give you a quick introduction to the R language. I assume that most of you have already learned some programming in some other language like MATLAB or Python or C, so I'll just give you a quick introduction to the syntax of R, and then I'll show you what tools we have available to deal with these large data sets that are coming up in science: what the computational tools are and how we select the ones we actually want to use. Then the very important thing is to bring the data, which has many dimensions, into a shape that we can actually deal with in a computationally efficient way. That's called tidying the data: bringing it into a specific format so that we can vectorize our computational or numerical operations on these data sets. Once we've done that, I'll show you how we can then very easily perform computations on these data sets, and finally I'll end by showing how we can combine these steps to produce data processing pipelines. And of course we want to do all of this in an elegant way that is not heavy in terms of syntax: we want to focus on the structure of the data, not on writing code as in C, where you have hundreds of lines of code for a simple operation. So this is the structure, and if we have time at the end I'll show you something more on data visualization.
slide 6
Okay, so this is R. Most of you won't have used R before; we don't use it in physics very much. R is not better than Python or anything; it's very similar to Python in that it's an interpreted language. It's extremely flexible: nothing is fixed in R. You can, for example, have a function that rewrites itself while you execute it; that's called extreme dynamism. R is very popular in statistics; you've probably heard of it in this context. And it's very easy to include C code in R, so if you worry about speed you can always write things in C, and that's also what we usually do. The main advantage of R is that there is a huge repository of packages available, particularly in data science, genomics, biology and so on. And one thing of the most practical relevance is that it has extremely convenient and high-quality plotting functions.

So these are the benefits of R, but there are some downsides. The syntax is quite difficult for us physicists to swallow; I'll show you later why. And typically, because nothing is fixed in R and it's very dynamic, it is slower than Python in general terms, so you wouldn't use R to write a stochastic simulation or something like that. But for the tasks that we're doing today it's actually a very good choice. R is typically slow, but the core functions are written in C, so once you rely on core functions, on vectorization and things like that, it's very fast; you just have to know how to use it. Taken together, there's no particular reason for not using Python for this. The only reason is that for the tables and the data I showed you on the previous slides, R has been, so to say, the standard language in these genomics experiments, and there are a lot of packages and libraries developed for these genomics data that are collected in huge repositories in R. Meanwhile Python is catching up a little bit in this context, but that's why we're using R in our group: it's the context of genomics, these particular experiments that I showed you on the previous slide.
slide 7
Typically when you use R, you use it in a development environment, an interactive environment, that you can download from rstudio.com. Then you have something that looks very familiar if you've used MATLAB or Python. You have a notebook, a little bit like a Jupyter notebook if you use Python, where you can write code and where you then directly have the output of your code in the same window, in the same file; you can export HTML documents if you want. You have a console to type in code, you have your workspace, like in MATLAB, where you see your variables, and you have a window that is made for looking at help pages, plots and other things. I'll show you later how this works in practice, getting a little bit more interactive, but working in R basically looks like working in any other language.
slide 8, 9, 10, 11
So this is some basic R, just to show you that R looks very familiar to other languages in many respects. Before I do that, let me just show you what these boxes here mean, and for that I'll share a little R window. Let me share the screen again. Okay, now you should see this RStudio window that I just showed you, and here we have the code. I can run the code by clicking on these little arrows, and then a certain chunk of code is run; I can run this if I want and load some data. And I have a console down here that allows me to type commands if I don't want to use this notebook. If you look at the bottom: I can assign a variable a and set it to one with this somewhat unusual assignment operator, <-, as in "set a to one". If I then just type a and press enter, I get the value that is stored in a as output. I can also output more complicated values, like this table here that I already loaded, and look at it in the console. This console output is what you see on the slides. Let me open the slides again; are we back with the slides? Yes? Okay, great, it's working.
So here I have made a fake console in PowerPoint, just to show you that it's basically the same as in many other languages like Python. Here I call a function, and the arguments are given in these brackets. Let me give you a pointer. Here we go. Somebody wrote in the chat that the text is very tiny; was that related to RStudio, to this development environment? Probably yes; we don't rely on it very much. Okay, that's good to know. So, just to show you: I call a function in R in the usual way, and I can give different arguments to this function. In this case I take a normally distributed random variable and tell R to give me five of them, and then I have some parameters that I can identify by name: the parameter mean is set to one, and the parameter for the standard deviation is also set to one. These parameters have names, and you can call them by their names; that's very convenient if you have a long list of parameters and you don't want to give them all.
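As a minimal sketch of the call just described, using R's standard rnorm function:

    # Five draws from a normal distribution; n is matched by position,
    # mean and sd are matched by their parameter names
    x <- rnorm(5, mean = 1, sd = 1)
    x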
I can also write my own functions; this looks a little bit like MATLAB or C. Here I define a function called my_sum, and this function has three parameters, a, b and c, where c has the default value 1. The function returns a value that is just the sum of these three arguments, a + b + c. If I call this function with a set to 1 and b set to 1, then c is 1 by default, so I don't have to state it; I only have to state it if I want c to have a different value. The function then returns the value 3, just like in any other programming language.
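A sketch of the function from the slide:

    # Three parameters; c has the default value 1
    my_sum <- function(a, b, c = 1) {
      return(a + b + c)
    }

    my_sum(a = 1, b = 1)         # c defaults to 1, so this returns 3
    my_sum(a = 1, b = 1, c = 2)  # returns 4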
We also have loops in R, of course. This is a for loop, where you can for example have a loop variable that goes from one to five and then print something. And we have a while loop: while some condition is fulfilled, we print something. But typically in R you don't want to have loops; if you have a loop in R, that usually means you're doing something wrong. These loops are slow, because R is slow, and if you have such a loop, it means you didn't vectorize your operation: you're doing something bit by bit that you could also do in one step. So typically, if you have a loop, something is wrong with your code, or you're being extremely inefficient. I guess I have written many thousands of lines of R code in my life, and in these many thousands of lines and several years I have had exactly one loop, and that one loop I used for a stochastic simulation. The mistake there was that you would never write a stochastic simulation in R, but I did it because I was very lazy. So if you use these loops, something is wrong; you can typically replace them with much more efficient operations.
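To make this concrete, here is a minimal sketch (the task itself is made up for illustration) of the same computation written once as a loop and once vectorized:

    # Slow: an explicit loop, interpreted element by element
    total <- 0
    for (i in 1:1000000) {
      total <- total + i
    }

    # Fast: one vectorized call into R's compiled core functions
    total <- sum(1:1000000)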
These are the standard constructs that you have in any programming language. You also have if clauses: if the value of i is smaller than 5, print "hello", and otherwise print "not hello". That's just like in any other language, and you use curly brackets, like in C, to define the scope of a certain frame of the code. This should all be very familiar.
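A sketch of these constructs as described on the slide:

    for (i in 1:5) {
      if (i < 5) {
        print("hello")      # printed for i = 1, ..., 4
      } else {
        print("not hello")  # printed for i = 5
      }
    }

    # A while loop runs as long as its condition is fulfilled
    i <- 1
    while (i < 5) {
      i <- i + 1
    }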
Where things get a little bit different are the data types of R. R has standard data types; for example, we can define vectors. In principle everything is a vector; that's the simplest data type. For example, we define the vector a with this c here. That's a little bit strange coming from MATLAB. (Was that a question? Probably not; I'll repeat it if it was.) So this is a typical vector: we use the letter c, which means "combine", to create such a vector, and this vector has the elements 1, 2, 3. We can also put other stuff into this vector, like this NA. That stands, for example, for a measurement that is not available, that didn't work; it's very convenient to have a representation on the computer for measurements that didn't work. We can also have vectors of other types: here, this is a string, or character, vector of "M"s and "F"s, which we can define in the very same way. And we can access elements of such vectors with these square brackets, like in C, but unlike in C we start counting at one, not at zero.
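In code, the vectors just described:

    a <- c(1, 2, 3, NA)         # numeric vector; NA marks a missing measurement
    b <- c("M", "F", "F", "M")  # character vector, defined the same way

    a[1]  # first element; counting starts at 1, unlike C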
The next data type is a list, and now it gets more interesting. A list can contain several elements of any type; for example, the elements of a list can be vectors. I can make a list whose first element is our vector a and whose second element is our vector b. They have different types, but you can nevertheless put them together in a list. I can then access these list elements by name; I gave them the names "number" and "gender", and if I want to access the first element, number, by name, I just use the dollar sign. I can also access this first element by its position; then I use double square brackets. That's not so important; it's just to show you that these data types exist.
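A sketch of this list, reusing the vectors a and b from above:

    l <- list(number = a, gender = b)  # elements of different types

    l$number  # access an element by its name
    l[[1]]    # access the first element by its position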
What gets more important for statistics is that we also have categorical variables, and that's something you probably don't know from MATLAB or C; I don't know about Python. So here we have a vector that stores the gender of somebody. Suppose we take this measurement, say, 100 million times, and we have males and females. In principle we could store the words "male" and "female" 100 million times in memory, but that would be very inefficient. What we can do instead is say: I have a variable that can only take two values, so I need at most one byte to store which of these two values a given element has, and in addition I save labels for these two values. That's what a factor does: it's a categorical variable that can only take a limited number of values. For example, this vector b here can only take two values, and I tell R that it can only take two values by making it a factor. This categorical variable is called a factor. When I then look at this factor f, it gives me the output M F F M, and it tells me the levels, the possible values these elements can take. If I want to change these levels, the labels of these two values, I can call them "female" and "male" by changing the levels, and then I can output this again and get male, female, female, male, and so on. What happens internally in R is that I still don't use any more memory, although the strings are longer; what I changed is a lookup table in R, where R looks up, for any plotting or printing purposes, what the two values my vector can take are called. That's a factor, and it's very efficient. If you think about biology, for the measurements that I showed you in the beginning you have hundreds of millions or billions of measurements, say 200 million measurements for chromosome 1. You could either have a vector in memory that contains the element "chromosome 1" 200 million times, or you just save a number, an identifier, for chromosome 1 and give it a label with the real name in a separate table. Then you don't have to save the string "chromosome 1" 200 million times, but only once, in this table where you look up the actual name of this factor level. So this is a very efficient way of storing variables that appear very often but can only take a limited number of values, and that's called a factor. That's the data type you're probably not familiar with.
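A sketch of the factor example:

    f <- factor(b)  # b can only take the values "M" and "F"
    f               # prints M F F M together with: Levels: F M

    # Relabel the two levels (stored alphabetically as F, M);
    # only the lookup table changes, the elements use no extra memory
    levels(f) <- c("female", "male")
    f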
Then we can go on to data types that can really store the data we're looking at, for example experimental data. The simplest one in R is called a data frame; Python also has data frames, as far as I know. These data frames are nothing but lists of vectors. Remember the list: it can store any kind of element you want, and if each element of the list has the same length, then we can interpret this list as a table. That's what we do in these data frames. Here we have three vectors, x, y and z, and they have different types: this is a numerical vector, this is a character vector, and this is a boolean, or logical, vector. We can now create a data frame and say that the first element, the first column, is x, the second column is y, and the third one is z. And what you see here is how this is represented if you look at the output of such a data frame: the first vector is now the first column of our table, the second column of our table is this one, and the third column of our table is the boolean vector. So a data frame is essentially the R version of a table, and internally it is basically a list of vectors that have the same length.

So this is a data frame, and it is close to what we could actually be dealing with: it is sufficiently general to allow us to store any amount of experimental data, just as a table is a general way of storing data. These data frames are essentially tables, and they allow us to store any kind of data that we might measure.
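A sketch of the data frame from the slide, with illustrative values:

    x <- c(1, 2, 3)            # numeric vector
    y <- c("a", "b", "c")      # character vector
    z <- c(TRUE, FALSE, TRUE)  # logical (boolean) vector

    # Internally a list of equal-length vectors, printed as a table
    d <- data.frame(x = x, y = y, z = z)
    d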
slide 12
Now, the problem is, as I told you in the beginning, that we actually have data that is terabytes in size. So we need a way of performing operations on these tables in a very efficient way; we need the right methods. How important it is to pick the right methods you can see here on the left-hand side. Here you see bars, and these bars are measurements of the time it takes to perform certain operations on data of a certain size. This here is 500 megabytes of data, a pretty small data set; you measure the duration of such an operation, denoted by the bar, and you can compare different versions, different packages, that allow you to work with this data. For example, one popular method in R is called dplyr; that is an extremely popular and very easy-to-learn way of managing these tables. Another one, pandas, is basically the Python version of such a data frame. And then we have data.table here at the top, the blue one: a very fast C implementation of these operations on data frames, very fast and memory efficient. And then there are some other tools in this list that are more commonly used in companies. This all seems very reasonable: we have here 12 seconds, 20 seconds, 90 seconds; none of that stops us from getting results.

Now we increase the size of our data set to 5 gigabytes, or 50 gigabytes, still very small compared to what I've been talking about. Let's have a look at these 50 gigabytes: suddenly there's a huge difference. A lot of these packages, a lot of these methods, do not produce any results at all; they run out of memory, for example. Some of them, like this very popular one, just take a huge amount of time, while others, especially this data.table method here, perform very well. We have here, what is that, 30 minutes? That sounds reasonable... no, that's seconds, it's 13 seconds. So this was 0.2 seconds at the top for the small data set, and now we're at 13.56 seconds, which sounds pretty reasonable: 13 seconds is something you can wait out in front of the computer and still have an interactive way of working. If you go down the list, a lot of these methods just fail, but some of them simply take so long, like this dplyr here, that any reasonable way of working with the data is not possible.

That's why choosing the right method here is important, and it's also important to look at how these methods scale with the size, the complexity and the dimensions of your data. What this tells us is that there's a huge difference. Actually, when I was a postdoc, like every physicist I used MATLAB, and very quickly, almost immediately, MATLAB failed on such operations. Then, when these genomics methods came up, I used the red one here, dplyr; it's extremely popular and very easy to get into, very flexible, very well documented and so on, and I used that. But then experimental progress moved on exponentially: when I started, I was looking at data a few hundred megabytes in size, and at the beginning of this talk I showed you data about a terabyte in size. So very quickly I ran into this problem that I wasn't able to get any results at all in meaningful times, and at some point I had to rewrite all my code. What then turned out to be a reasonable choice for really large data is this data.table package. That is the R version, and its Python version, as you see here, is not much slower than the R version. data.table is written in C, it's very fast, and if you stay strict to it, you can expect a performance that is not much worse than actual C, but with orders of magnitude less programming work. So today I'll use this data.table package, and I will also introduce some concepts that are applicable to any of the other methods, to any other way of dealing with data, actually.
slide 13
Okay, so let's use this data.table package. It is very fast and works for extremely large data sets; the downside is that the syntax is not very friendly for beginners. It's a little bit cryptic, very condensed and very compact. But it turns out that, at least for me, this was the only way I could deal with data in an experimental field where the sizes of the data sets are exponentially increasing; this was the only way that let me work efficiently. To start with data.table, we load the package just by calling library(data.table), and if we have a data frame, we can convert it to a data table with the command as.data.table. It's very simple; nothing else is necessary.
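In code, reusing the data frame d from before:

    library(data.table)

    dt <- as.data.table(d)  # convert an existing data frame
    class(dt)               # "data.table" "data.frame"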
slide 14
In this lecture I would like to look at some real data. I could now load, say, one terabyte of experimental data, but that would not be very efficient for this lecture, because it would be very slow and also very difficult to follow. So I'll look at a simple data set here, and we'll do some operations on it. It's actually a data set that any of you can download, and I'll also share a link where you can download the code that I'm using in this lecture, and the data itself.

Okay, so this is the data set. It's called New York City Flights, and these are all flights that departed from New York City airports in 2013. The data set consists of several files, several tables. One holds the actual information about the flights: for each flight we have the year, that's 2013, but also the month, the day, the hour. We have the flight number, an identifier of the flight, and we know where the flight came from. We have a tail number, which means we can identify the airplane that was used, and the carrier: is it United Airlines, American Airlines, Lufthansa? Then we have some information about the delays of these flights, delays in the departure and delays in the arrival, how long the airplane was in the air, and many other bits of information.

This is the information for all flights. Now, if we want to interpret this information, for example if we want to understand where the delays in this data set come from, we need to connect it to additional information. One thing you might look at is the weather: if you ask why a flight is delayed, you can look at the weather. The weather is a different source of data, and for the weather we also have the time, the month and day and hour; we can look at the weather at certain airports, at certain locations, and we have temperature, humidity, wind speed and other information about the weather at a certain location at a certain time.

Then we have information about the planes. We have the tail number that identifies the planes, so we can also download some information about these airplanes: which manufacturer the airplane was made by, the model, the year the airplane was built, the number of engines, the number of seats, the type of airplane, and so on. We can get some information about the airports: an airport identifier, the name of the airport, the longitude, latitude and altitude of the airport, and so on. And finally we can also get some information about the airline that corresponds to a certain flight.
slide 15
Now, one thing you need to notice is that these tables share information. For example, if we want to learn what the weather was for a given flight, we can look at the year, the month, the day and the origin airport of this flight, and compare them to the same columns in this other table, on the left-hand side, that contains the weather information. If we want to learn about the airplane that was used, we can take this column, this bit of information, the tail number, and compare it to the corresponding tail number in the planes data set, and look up which manufacturer the plane was made by, which year it was produced, and so on. So these bits of information are connected by shared variables, and we can use these overlaps between the data sets to bring them all together later.

But first let me show you how to load data. Loading data is actually not a trivial task: it can take hours or minutes depending on which function you use. Within R, the fastest functions are the ones from the data.table package. They are called fread, and there is also an fwrite that allows you to write data; they are basically parallelized routines for reading huge amounts of text data, of ASCII files. It's very simple: you just use the command fread ("fast read") to load a certain file, for example a text file, and fread will do all the rest for you. It will identify the columns, it will identify the data types, and so on; typically you don't need to do anything.

For example, we can read in this flights data set, which contains information about the flights that departed from New York City, and this is how the data set looks if we just print it after loading. We have these different columns: the year, 2013; then we have different months and different days of the month; we have departure times, the real departure times and the scheduled departure times; and we have delays. We have arrival times and so on; we have the carrier, flight numbers, the tail number of the airplane, origin, destination, and so on. And we have a timestamp here, a time signature of when the flight actually took place.
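A sketch of this load step; the file names here are illustrative, assuming the tables sit as text files in the working directory:

    library(data.table)

    # fread auto-detects separators and column types, in parallel
    flights <- fread("flights.csv")
    weather <- fread("weather.csv")
    planes  <- fread("planes.csv")

    flights  # printing shows the first and last rows of the table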
Now let's go to RStudio and see what this looks like. Here we go; this is RStudio, I hope you can still see it. Okay, this was a little bit small; let me zoom in. I hope this is better. So here we're loading all of these files, the same thing that I had on the slides: for each of these different bits of information we load the files into memory. I had already done that before, so here they are in the workspace, and we can click on these files. We can either just type flights and get this huge table out, where we see that we have information about more than 300,000 flights, or I can just click on the file here and get a nice little table view of the data. This is the flight information. It's also high-dimensional, not as extremely high-dimensional as the example I gave you, but it still has enough information that we cannot easily detect patterns in it. So this is the data, 300,000 flights, and now we can try to detect some structures in this data set. Let's get back to PowerPoint. Okay, here we go: we've loaded the data, we've got all of these different data tables, or data frames, in memory,
slide 16
and now we're trying to make sense out of these data sets. The first thing we always need to do is bring the data into a format that we can deal with, and this format is typically called tidy data, or long data, or the long format. There are two simple rules that you can use to decide whether a data table or data frame is tidy. First, every column is a different observable, for example a different type of measurement. Second, every row is a different observation of these observables, or variables. Once you stick to these rules, your data is tidy. And when your data is tidy, so every column is a different observable and every row is an observation, we can use vectorized operations in R or Python to perform operations on entire columns of these data tables. That, first of all, is extremely fast, because we perform these operations vector-wise, one column at a time, and it drastically reduces the complexity of the code.

The data that I showed you is one of the standard data sets in data science that you look at for teaching, and this data is already tidy. Let's have a look at the flights: every column is a different observable, for example a different measurement, a departure time, an arrival time, a flight number, a time point. And every row is a different observation, a different flight. So this data table is already in a tidy format, already in a format that we can deal with, and there is nothing to do here.
slide 17
A standard example of a messy data set, which we also often find in physics, but here in this case in genomics, is that we have matrices, that we share data as a matrix. For example this matrix here, a typical one: this is one of these genomics measurements. Each row is a different gene, and the first column is the name of the gene (a very cryptic name; these genes have cryptic names). So each row is a gene and each column is a different cell. What these numbers mean is how many products of a given gene we have measured in a given cell; it's not important to understand the biological context. But this data set is not tidy, because the same observable, namely the number of these gene products, is repeated in different columns. And you can easily see that with this format we are not very flexible: if I now take another measurement for the same gene and the same cell, where do I put it in this matrix? I would have to have a separate matrix, and live with separate matrices in parallel. So this data set is not
slide 18
tidy, and I cannot easily perform parallelized operations on it. So now we want to make this data table tidy. There are special functions that allow us to do that, and they have similar names in basically all contexts. The function that makes such a data table tidy, like the one we have here, is called melt; you also say that you melt a data table. It brings us from the matrix format to a format where each column represents an observable, like a measurement, and each row corresponds to an observation. What you see on the right-hand side is the tidy version of the data that I showed you on the last slide: the first column is the gene, the name of the gene; the second column is the name of the cell; and the third column is the value that I measured for the gene products. That's the tidy format, and now I have one long column of numbers instead of the 200 columns on the previous slide.

So how do we do that in R, with this data.table package? The command is called melt. We give it the name of our data table, and then we identify the id variables: these are the variables that need to remain intact, so to say, when we reshape the matrix; in our case this is the id column. The value name is what we want to call the values that sit in the matrix elements; we just call this "expression" here. And the variable name, "cell", is the name of the column that the former column names are distributed into. If you look at the colors here, you can see how this command distributes what was spread over several columns, how it reshapes the data table into something that has a longer format; this is actually also called the long format. We have the cell names in one column and the corresponding measurements for the gene products in another. You can look at this in detail afterwards; I'll upload the slides, of course. It's a little bit abstract, but this is how we reshape data tables to make them tidy.
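A sketch of the call, assuming the matrix from the previous slide is stored in a data table called genes whose first column, id, holds the gene names:

    # From the wide gene-by-cell matrix to the tidy long format:
    # the id column stays intact, all other columns are stacked
    # into cell / expression pairs
    tidy <- melt(genes,
                 id.vars       = "id",
                 variable.name = "cell",
                 value.name    = "expression")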
slide 19
We can also make the data messy again. I'll go over this quickly, because it's not something we use much, but sometimes you just want your good old matrix back, because some function in some other package wants a matrix. This function is called dcast, and it just reverses the operation that I showed you on the last slide. The function takes the data table, the name of the data table variable, as the first argument, and then a formula for how the rows and columns should be separated in the new matrix that we want to have. It's not so important, so I won't go into it; we rarely use it.
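A sketch of the reverse operation on the tidy table from above:

    # Back to the wide format: one row per gene (id), one column per
    # cell, filled with the measured expression values
    wide <- dcast(tidy, id ~ cell, value.var = "expression")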
So now we have some tidy data. Every column is an observation, every row... sorry, I always mix it up: every column is an observable, or a variable, and every row is an observation.
slide 20
Once we're in this format, we can use a very simple syntax, and this syntax captures ideas that we also have in other, maybe more popular, languages like SQL; it captures operations on a data table that are quite generic. This is the general syntax that we use in the data.table package: we take the data table D, and in square brackets we have three arguments. The way you read these three arguments is: you take the data table D, and the first argument operates on the rows; here you have a condition that tells you which rows you want to take. The second argument operates on the columns; for example, you can perform a calculation on the columns in the second argument. And the third argument is a grouping variable. The first two are like a typical matrix statement with i and j; the third one is a grouping variable with which we can group rows together by certain conditions and perform the calculations group by group. So the way to read it is: take the data table D, subset the rows using i (which can be some expression, some logical statement), calculate what is in j, and do this calculation grouped by whatever is in the third argument. I'll now show you step by step how this looks in detail. These three arguments are a very general way of parametrizing, so to say, operations on these large tables: with just three arguments we can already do a lot of things, and if you think about it, it's abstract but very simple.
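Schematically, with a hypothetical query on the flights table (column names as in the data set shown earlier):

    # D[i, j, by]: subset rows with i, compute j, grouped by 'by'
    flights[origin == "JFK",                                # i: rows
            .(mean_delay = mean(dep_delay, na.rm = TRUE)),  # j: compute
            by = month]                                     # by: groups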
So first, let's have a look at the very first argument.
slide 21
Let's say we have a table and we just want to get a subset of rows from it; for example, we want to look at all planes that have four engines. The way we do that is to write this logical statement as the first argument: engines equals four. If we do that, we get back a data table that only contains the planes that have four engines: we filtered the rows of the data table for those where the value of the engines column equals four. We can also select a subset of columns, here on the right-hand side, and this we do with this notation where the dot is a shortcut for a list: we give a list of the names of the columns that we want to extract as the second argument. For example, we take the planes data set and we want the tail number and the year the airplane was built, and we get a smaller data table back that only has these two columns in it. This is all something that you could also do with any other package or programming language, of course.
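The two operations as code, assuming the column names engines, tailnum and year in the planes table:

    planes[engines == 4]        # i: keep only rows with four engines
    planes[, .(tailnum, year)]  # j: keep only these two columns;
                                # .( ) is the shortcut for list( )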
Now let's get a little bit more complicated.
slide 22
We now want to perform operations on these data tables, on these large data sets, and these operations in general follow a quite general scheme. This general scheme is already contained, so to say, in the syntax notation that I showed you a few slides ago, with the i, the j and the grouping. The steps we take are called the group-aggregate-combine scheme. The general operation on such a data table looks like this: we first group our observations in some meaningful way; for example, we group all flights that happen in the same month. Then we extract all of these groups, all flights from the same month, one at a time, and aggregating means, in the next step, that we perform some kind of summary function on these flights that departed in a certain month; for example, we then calculate the average temperature or the average delay. This is the final part, where we aggregate all of the different flights that belong to the same group and calculate a single result from them. I'll now give you a specific example of this.
slide 23
What we can do, for example, is group by carrier: we might want to ask whether all carriers are equally bad or equally good, or whether there is one carrier that is worse than the others. The way we do that is: we first group all flights; that's the last argument here. We take the flights data table and group together all flights that have the same carrier. For each carrier we then calculate the average delay using the mean function: we take the mean of the departure delay, and the additional argument, na.rm, just tells it to remove NAs, the entries where no proper measurement was taken and this information is missing; we simply ignore those. So for every carrier we take all flights and calculate the average delay time. That's what we do here: the grouping by carrier is the third argument, and the operation sits in the second argument, because it operates on the columns. The operation is that we calculate the average of the delay and save it in a new column called mean_delay. What we get out is one number for every carrier, its average delay, and you can see that it basically confirms what you expected already: United Airlines is not performing very well.
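The grouping just described, as a sketch:

    # Average departure delay per carrier; na.rm = TRUE ignores
    # flights for which no delay was recorded
    flights[, .(mean_delay = mean(dep_delay, na.rm = TRUE)),
            by = carrier]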
65:34
now we can also have more complicated
65:36
procedures now we can for example
65:39
uh combine that with only
65:42
with a subsetting in the roads we don't
65:44
want to take all flights
65:46
we always want to look we only want to
65:48
look at flights
65:49
in the evenings you know very late
65:51
flights
65:53
and then we can for example do the same
65:55
thing we uh
65:57
yeah and we can we can also have a more
66:00
fine grade
66:02
fine-grained or we can look at different
66:06
combinations of variables here for
66:08
example
66:09
in this case here we
66:12
take all flights that
66:15
depart after 8 pm
66:21
and the remaining flights we group by
66:24
month at the origin airport
66:29
and we again calculate the average
66:32
departure delay
66:35
and now for every combination of month
66:38
and airport
66:40
we get a value for the average delay
66:42
here at the bottom
66:44
For example, in January the
66:47
newark airport had an average delay
66:50
of 14 minutes
66:55
yeah and jfk was doing much better
66:58
And so you can see that this also
67:02
is something that people
67:05
expect to see here, actually.
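The same idea with the row subset added, as a sketch (assuming, as in nycflights13, that dep_time is coded as HHMM, so values above 2000 mean a departure after 8 pm):

```r
# first argument: subset rows (evening flights only);
# second: the summary; third: group by month and origin airport
flights_dt[dep_time > 2000,
           .(mean_delay = mean(dep_delay, na.rm = TRUE)),
           by = .(month, origin)]
```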
67:07
Okay, so this is how this
67:12
group-aggregate-combine paradigm
67:16
can be fit into very compact syntax
67:20
on basically arbitrarily complex tables.
67:25
So this was
67:29
a summary statistic, where
67:31
we
67:32
summarized several rows: all flights
67:36
that have the same carrier,
67:37
or all flights in a
67:39
certain month and at a certain
67:41
airport; we summarized
67:44
all of these flights into just one
67:47
quantity, for example the average delay.
67:51
So we performed a summary here. What we
slide 24
67:54
can also do
67:56
is add a new
67:59
column to our table
68:02
that has one new value for each
68:06
row that is in the original table so we
68:09
don't perform
68:10
a summary it's also sometimes called
68:12
windowing
68:13
for whatever reason and so we have
68:16
one new value for each row that we had
68:20
in our original data set for example
68:23
here the top row:
68:24
if I want
68:27
to have the speed in
68:30
kilometers per hour,
68:32
I can, for example,
68:35
calculate the distance over the air time
68:39
in minutes;
68:40
that's the speed in miles per
68:42
minute,
68:44
and then I multiply by 60 and convert to
68:48
kilometers,
68:49
and then I have, for every
68:51
flight, the average speed in
68:53
kilometers
68:53
per hour i can also rescale for example
68:57
i can combine
68:59
now these simple computations here with
69:01
the grouping
69:02
for example i can rescale all distances
69:07
by the average distance
69:11
of a certain carrier. For example, I
69:14
can ask
69:15
is this flight much further
69:18
or much shorter than a typical flight
69:21
conducted by the same carrier yeah so
69:25
then I can calculate a rescaled
69:27
distance that is just the distance
69:30
divided by the average distance
69:33
of this carrier here and then again i
69:36
get a new
69:37
variable a new value for each row that i
69:41
had in my original table
69:44
and the way I do these operations is with
69:46
this symbol here, the define
69:48
symbol, a colon followed by an equals sign (:=).
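A sketch of both windowed columns, under the same nycflights13 assumptions (distance in miles, air_time in minutes; 1 mile is roughly 1.61 km):

```r
# := adds one value per row instead of summarizing rows away
flights_dt[, speed_kmh := distance / air_time * 60 * 1.61]   # miles/min -> km/h

# rescaled distance: each flight's distance relative to its carrier's average
flights_dt[, rescaled_dist := distance / mean(distance, na.rm = TRUE), by = carrier]
```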
69:52
The way this looks is that
69:54
we still have
69:56
our original 300,000 rows,
70:00
but for each row we have now calculated
70:03
a new
70:05
value that we save in a new
70:07
column,
70:08
speed_kmh,
70:11
that is the speed in kilometers per hour,
70:15
and we also have the rescaled
70:17
distance.
70:19
For example, this flight here was
70:21
about ten percent shorter
70:23
than the average flight of this carrier,
70:27
and this also is saved into a new
70:31
column;
70:32
we've given these columns their names here.
70:35
This is something that also
70:38
happens very fast
70:40
and memory-efficiently.
70:45
So now come more complicated
70:49
things. These were in principle easy
70:52
steps: we add more columns,
70:53
we perform a summary, and this data.table
70:58
package allows us to write very simple
71:01
syntax
71:02
to do that. If you tried to do
71:05
these groupings
71:06
in C, you would write many, many lines of
71:08
code
71:09
to do it.
slide 25, 26, 27, 28, 29
71:12
In the next step,
71:16
to get some more
71:19
understanding about these delays here
71:22
in New York City, we want to make use of
71:24
the fact that we have
71:26
different bits of information,
71:28
and these different bits of information
71:31
share
71:33
columns, share information, that allows
71:35
us to link them.
71:38
For example, we can link the weather
71:42
to a certain flight by matching the year,
71:45
the month, the day, and the hour,
71:49
and the airport,
71:52
and with this we can connect the flight
71:56
to the weather that occurred when
71:58
this flight
71:59
departed. We can also
72:05
connect the airport information for
72:07
example
72:08
so this faa is an identifier that has a
72:11
different name in this table
72:13
but it's essentially the same as the
72:15
origin airport column,
72:16
so we have the information
72:18
to link these tables together.
72:21
For the planes, the tail number is a
72:24
unique
72:24
identifier that allows us to look up the
72:27
plane
72:28
of our flight
72:31
in this data set of all
72:34
planes,
72:36
and the same with the airlines: we can
72:38
look up the carrier
72:40
identifier in another data table and get
72:43
the name of the carrier
72:45
So these tables are
72:48
interlinked,
72:49
and now we need efficient ways to merge
72:53
these big, different bits of
72:55
information together.
72:58
This merging happens
73:03
in operations that are often called joins.
73:05
This is a general principle that you
73:07
use basically
73:08
in any kind of data-handling
73:11
task, independent of the
73:14
programming language that you're using,
73:17
and you can perform
73:21
these joins, these mergings of
73:23
data tables,
73:24
in different ways, and these different
73:27
ways
73:28
differ in which information you
73:31
keep in case you only find it in one
73:34
table but not in the other table.
73:37
now for example so let's begin with one
73:41
kind of join
73:43
it's called the left join
73:47
that's called the left join and what you
73:49
do
73:50
is that you keep all rows
73:53
that are in the left data table now in
73:56
this a here
73:58
now we want to combine in this case the
73:59
information in a
74:01
and b yeah and then we can look up
74:06
Okay, so here in the key column x1 we have the value
74:09
a,
74:12
and for this value x1 = a
74:15
we can look up
74:16
what the other columns, x2 and x3,
74:19
are:
74:20
x2 is 1, and x3
74:23
is TRUE.
74:25
Then we can combine these two
74:27
columns in the right way, so
74:29
that they correspond to the value x1 =
74:32
a. Same for b: we can look up b
74:36
in both tables (they don't need to be
74:37
in the same position
74:39
in both tables), look up the b and
74:42
get the values of the remaining columns
74:45
and sort them here to the end of this
74:47
table
74:48
And now we have the value c:
74:52
c is in table A, the first one,
74:55
the left one, but not in the right one,
74:59
and a left join means that we keep c
75:05
from the left one; we don't have
75:08
information
75:09
from the right one, so x3 is empty,
75:13
and we disregard all rows that are in
75:16
the right
75:17
table but not in the left table. So
75:20
that's a left
75:21
join. In R, or in this data.table
75:24
framework, this is done by the function
75:27
merge, which takes two tables,
75:31
and the left join is then specified by
75:33
the argument
75:34
all.x = TRUE: all entries in
75:37
the first (left)
75:38
table should be kept.
75:42
There's also a short notation for this
75:44
join,
75:45
and that's just when you index one data
75:47
table
75:48
by another data table: a very
75:50
compact syntax
75:51
for performing a very complex operation.
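A sketch with two toy tables mirroring the slide; A and B share only the key column x1, and both tables are made up for illustration:

```r
library(data.table)
A <- data.table(x1 = c("a", "b", "c"), x2 = c(1, 2, 3))
B <- data.table(x1 = c("a", "b", "d"), x3 = c(TRUE, FALSE, TRUE))

merge(A, B, all.x = TRUE)  # left join: keep all rows of A; x3 is NA where x1 == "c"
B[A, on = "x1"]            # short notation: indexing B by A is also a left join on A's rows
```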
75:56
These joins are
75:57
also not trivial
75:59
computations; think of doing that with
76:01
something that is terabytes
76:05
in size. In principle you have to
76:07
look up
76:08
all rows in both data tables and find the
76:10
corresponding matches,
76:12
so you need very
76:15
efficient algorithms
76:16
to do that, and depending on what you
76:19
choose here,
76:20
which package you choose to perform this
76:22
algorithm, you can wait for days
76:24
or you can wait for minutes; that
76:26
makes a huge difference.
76:29
If there's a left join, there's also a
76:30
right join, and a right join just means
76:32
that we
76:33
keep all rows from the right table,
76:38
but we disregard rows from the left
76:40
table if they're not
76:42
in the right table; that's the right join,
76:45
and
76:45
this also works with
76:48
the merge command:
76:50
we just give the argument all.y = TRUE,
76:53
saying that we should
76:55
keep all
76:57
rows in the second (y) data set,
77:01
and there's also then a short notation
77:03
for these right joins.
77:07
There's the so-called inner join; that's the
77:10
most restrictive join,
77:12
and this
77:16
inner join just keeps
77:19
rows where we have information
77:22
in both tables. If we don't have
77:26
information in both tables,
77:28
then this row will not end up in the
77:30
final
77:31
data set. For example, c is only in the
77:34
left
77:35
table, d is only in the right table,
77:40
and then
77:44
none of these ends up in the final
77:46
result;
77:48
that's the inner join. And
77:52
again here there's a merge command
77:54
for that: we just say all
77:56
= FALSE (the default), and there's again a
77:58
shorthand notation
78:00
with an additional argument for that
78:04
In all of these commands here, I
78:05
didn't specify
78:07
the column that we should use for
78:09
joining; I didn't specify
78:11
that x1 is the column we should use for
78:14
comparing.
78:15
I didn't specify, for example, if you
78:18
look here,
78:20
which column we should actually use to
78:22
match rows
78:23
together. And I didn't do that
78:27
because this column has the same
78:30
name
78:32
in both data sets: x1 pops up
78:35
both in A and B, and then R automatically
78:40
finds the columns that
78:42
have the same names
78:44
and uses them to match these different
78:46
tables.
78:48
so if there's an inner join there's also
78:50
an outer join
78:52
or a full join and this full join
78:55
retains
78:55
all values from all rows
78:59
That's what happens if we now join
79:02
or merge these two tables
79:04
then we get our column x1 that is shared
79:07
between these tables
79:09
and our column x2 from this first table
79:13
where we don't have a value for d
79:17
and the column x3 from the right table
79:21
where we don't have a value for c
79:25
now so this retains the most information
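The remaining variants, sketched with the same toy tables A and B (in merge, the all arguments default to FALSE):

```r
merge(A, B, all.y = TRUE)  # right join: keep all rows of B
merge(A, B)                # inner join: keep only keys that appear in both tables
merge(A, B, all = TRUE)    # full / outer join: keep all rows from both tables
```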
79:32
okay so now we can do some
slide 30, 31
79:36
merging. Now we can, for example, ask:
79:39
are these delays affected by bad
79:43
weather? The first thing we need
79:45
to do
79:46
is to merge the flights
79:49
and the weather data sets. Because they
79:52
share columns with the same names,
79:56
for example the time
79:59
hour
80:03
or the origin airport (I
80:05
didn't draw all of them here; the shared columns
80:07
are the ones that
80:08
are connected with these arrows),
80:10
that works automatically. So I merge
80:12
them,
80:14
and now i have a new table that
80:19
contains information about the flight
80:22
the carrier
80:23
and also the delay but also at the same
80:27
time
80:27
the columns from the weather data set
80:30
that have information about
80:32
wind speed and precipitation and about
80:34
the temperature
80:37
now and now we can go on and ask for
80:40
example
80:42
let's group this combined
80:45
table here let's group that
80:49
on the one hand into the rows that have a
80:52
wind speed larger than 30,
80:55
and on the other hand wind speeds smaller than 30,
81:00
yeah and then for each of these two
81:04
groups
81:05
calculate the average departure delay as
81:08
before
81:11
So if the wind speed is smaller than or
81:14
equal to 30,
81:15
here then the departure delay was 12
81:19
minutes
81:21
if the wind speed was larger than 30
81:24
then the departure delay was 28 minutes
81:27
so much larger
81:29
and if i couldn't evaluate
81:33
this condition here for example if i
81:35
don't
81:36
have information about wind speed yeah
81:39
then I get
81:39
a value also for these NAs, for the
81:42
non-available data.
81:44
So that means that if it's very
81:47
windy, there's
81:48
typically a
81:50
higher average delay
81:52
for these flights in New York City.
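A sketch of this step, continuing from above and assuming the weather table from nycflights13; grouping by an expression such as wind_speed > 30 creates one group per outcome, including a group for NA:

```r
weather_dt <- as.data.table(weather)
fw <- merge(flights_dt, weather_dt)  # matches automatically on all shared column names

# average departure delay for windy vs. calm conditions (plus the NA group)
fw[, .(mean_delay = mean(dep_delay, na.rm = TRUE)), by = .(windy = wind_speed > 30)]
```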
81:57
now we can also ask we can also merge
82:00
the flights table with the planes table.
82:09
Oh, is anybody stuck at the join
82:11
slide?
82:18
So which slide is it that
82:19
you're seeing at the moment?
82:24
"Are planes equally reliable?"
82:28
"Are planes equally..." So that's the right one.
82:30
Matthew is having a problem,
82:32
but at least some of you are seeing
82:36
the
82:37
right slide. Okay.
82:42
okay so we can merge the flight
82:44
information
82:46
with information about planes
82:50
and now I give this additional
82:52
argument here:
82:53
I tell the merge command to
82:56
use this column, the tail number,
82:59
to match the
83:01
rows.
83:03
If I do that, then I get my
83:05
flight information about the delay,
83:08
about the carrier, and so on, but I now
83:11
also get information about the
83:12
manufacturer,
83:14
about the model number, and about
83:18
the year this airplane was built.
83:21
And because we already had a year column,
83:24
namely the year of the flight,
83:26
we now have two different columns, year.x
83:29
and year.y:
83:30
year.x is the year of the flight, when
83:32
did the flight take place,
83:34
and year.y is the year when
83:37
the airplane was built.
83:41
okay and now we can just
83:44
do our calculation now we can ask are
83:48
certain manufacturers uh more
83:51
or less reliable are the planes by
83:53
certain manufacturers more or less
83:55
reliable
83:56
So we do the same thing:
83:59
yeah so we group by manufacturer
84:04
now by this column here we group now the
84:07
rows
84:08
and then for each group we calculate
84:11
the average departure delay
84:16
here and save it into a new column mean
84:18
delay
84:20
and in this case we also calculate the
84:21
standard error now which is
84:23
just the standard deviation divided by
84:25
the square root
84:26
of the sample size now we can have two
84:29
computations or as many computations as
84:31
we want
84:32
in the same call. And here we see
84:36
that we have here Airbus
84:38
Industrie;
84:39
I forgot what is actually the
84:42
outcome of this.
84:43
Boeing is a little bit later than Airbus,
84:47
and then we have Airbus here two
84:49
times,
84:51
and the problem here is that
84:54
Airbus apparently changed its name, and
84:56
one set of planes is older than the other, and
84:58
that's why they have different delays.
85:01
These here are smaller airplanes,
85:03
like
85:04
Embraer and Bombardier; these
85:07
are smaller planes,
85:08
and some of these smaller planes
85:21
seem to have
85:22
higher delays.
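A sketch of this step under the same assumptions (planes from nycflights13); .N is data.table's built-in group size, used here for the standard error:

```r
planes_dt <- as.data.table(planes)
fp <- merge(flights_dt, planes_dt, by = "tailnum")  # year -> year.x (flight), year.y (plane)

# several aggregates per group: the mean delay and its standard error sd / sqrt(n)
# (.N counts all rows in the group, including NAs -- fine for a sketch)
fp[, .(mean_delay = mean(dep_delay, na.rm = TRUE),
       se_delay   = sd(dep_delay, na.rm = TRUE) / sqrt(.N)),
   by = manufacturer]
```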
slide 32
85:25
And then we can also do a trick.
85:28
So
85:29
typically, in
85:32
such a processing of such data sets,
85:34
you do many, many operations
85:37
after each other, and
85:40
what we typically do in programming we
85:43
do one operation
85:44
and save the result in a new variable
85:47
now then we do the next operation and
85:49
save the result in a new variable now we
85:51
do another operation and save the result
85:53
in a new variable
85:55
now this is very inefficient if you have
85:57
long pipelines
86:00
of maybe hundreds of different steps in
86:02
your analysis
86:03
and how you transform the data
86:06
And in R,
86:10
and also in other languages, there are
86:13
ways
86:14
to chain operations after each other,
86:18
to make these pipelines in a very
86:21
efficient
86:22
and condensed
86:25
way in terms of syntax. So
86:28
there's a way that is implemented in
86:31
this data.table package:
86:33
there you just chain, you just put
86:36
the square brackets directly after
86:39
each other.
86:40
for example here we create a new column
86:45
wind speed in kilometers per hour, and
86:48
here we just
86:49
multiply the wind speed in miles per
86:51
hour by
86:52
1.61 to get the kilometers per hour
86:57
and then directly in the next step we
87:00
group by month
87:01
and calculate the average wind speed in
87:04
kilometers per hour so these are two
87:07
steps, and we don't have to save the
87:09
results in intermediate variables;
87:11
we just write them down, we chain them
87:13
directly after each other
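A sketch of this chained version (wind_speed in the nycflights13 weather table is in mph, hence the 1.61):

```r
# two steps chained with back-to-back brackets, no intermediate variable:
weather_dt[, wind_speed_kmh := wind_speed * 1.61][
  , .(mean_wind_kmh = mean(wind_speed_kmh, na.rm = TRUE)), by = month]
```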
87:16
so there's a more common way to
87:20
chain or to make these pipelines and
87:22
this is
87:23
in another package that's called
87:24
magrittr.
87:26
now we load this package just at the
87:29
beginning of our code
87:31
and this package does one thing it
87:33
provides us with an
87:34
operator this one here
87:38
and everything this operator does is
87:41
it takes whatever is on the left of it
87:47
and passes it as the first
87:50
argument
87:51
to the function that is on the right of
87:52
this operator.
87:55
So this is very, very simple: we take
87:58
what is on the left
88:00
and pass it to the function on
88:02
the right as the first argument.
88:05
now and with this we can then have
88:08
longer chains if we want
88:10
now for example we take weather
88:12
calculate the wind speed
88:14
in kilometers per hour and then
88:18
use this pipe
88:21
operator here
88:22
from this package and pass it
88:25
on: here we
88:29
have an updated data table with
88:32
this new
88:33
column, and that's what we take
88:36
and pass to the next step of our
88:39
analysis,
88:40
where we calculate the average wind
88:42
speed
88:44
per month, and we take the result again
88:48
and pass it to basically any function
88:51
that we want and here we pass it to for
88:53
example the the head
88:55
function and this head function just
88:57
gives us the first five rows or so
88:59
of our data table now so with these
89:02
operators here you can make very very
89:04
long pipelines
89:05
and you still have your code in a
89:08
very
89:10
basic form: in your code
89:11
you have a list of steps
89:13
that you perform one after each other,
89:16
and so your code is still
89:17
in a very convenient form that you can
89:20
still understand.
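A sketch of the piped version of the same two steps; with magrittr, a data.table operation on the piped-in table is written as .[ ... ]:

```r
library(magrittr)

weather_dt %>%
  .[, wind_speed_kmh := wind_speed * 1.61] %>%                                # new column
  .[, .(mean_wind_kmh = mean(wind_speed_kmh, na.rm = TRUE)), by = month] %>%  # aggregate
  head()                                                                      # first rows
```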
slide 33
89:22
So this is a more
89:25
complex example of such a
89:28
pipeline here:
89:29
we take the flights data set,
89:34
merge it with the weather data set,
89:39
then merge
89:41
the result
89:42
with the planes data set,
89:45
merge the result with the airports
89:48
data set
89:49
(here I have to specify the columns,
89:52
because they have different
89:53
names in the two data sets),
89:56
and we merge it with the airlines data
89:58
set at the very end.
90:00
And now we have a data set that has
90:02
all the different information,
90:04
all the different source information, in the
90:06
same table.
90:08
yeah and now we can for example remove
90:11
flights that don't have information
90:13
about the departure delay
90:15
and we can do the steps that we did
90:18
previously for example we can get the
90:20
speed of the plane the average speed
90:23
in kilometers per hour; we get the
90:25
rescaled speed: is a flight faster or slower
90:30
than what is typical for a certain
90:32
airplane model
90:34
and a certain carrier? And we can
90:38
also calculate things like correlations
90:41
now so for example here
90:43
we calculate the correlation between the
90:46
temperature
90:49
and the rescaled speed of the airplane.
90:54
And here, what do we do?
90:59
We calculate
91:03
the difference in speed of the airplane
91:08
for flights that have a delay larger
91:12
than 20 minutes
91:14
and where the delay is zero
91:18
or smaller than zero. So here's the
91:21
difference in the speeds of flights that have a
91:23
large delay
91:24
and of flights that have a negative delay or no
91:26
delay,
91:28
and we calculate that by carrier; that's
91:30
just an example here
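A sketch of the whole pipeline under the same nycflights13 assumptions; the grouping columns for the rescaling, model and carrier, follow the description above:

```r
airports_dt <- as.data.table(airports)
airlines_dt <- as.data.table(airlines)

full <- flights_dt %>%
  merge(weather_dt) %>%                                  # shared column names match automatically
  merge(planes_dt, by = "tailnum") %>%
  merge(airports_dt, by.x = "origin", by.y = "faa") %>%  # differently named key columns
  merge(airlines_dt, by = "carrier") %>%
  .[!is.na(dep_delay)] %>%                               # drop flights without delay information
  .[, speed_kmh := distance / air_time * 60 * 1.61] %>%
  .[, rescaled_speed := speed_kmh / mean(speed_kmh, na.rm = TRUE),
    by = .(model, carrier)]

# per carrier: correlation of temperature with rescaled speed, and the speed
# difference between delayed (> 20 min) and on-time (<= 0 min) flights
full[, .(cor_temp_speed = cor(temp, rescaled_speed, use = "complete.obs"),
         speed_diff = mean(speed_kmh[dep_delay > 20], na.rm = TRUE) -
                      mean(speed_kmh[dep_delay <= 0], na.rm = TRUE)),
     by = carrier]
```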
91:32
And what we see here is that
91:35
this correlation
91:38
here
91:39
is typically positive. So this
91:42
correlation is positive;
91:44
that means for some reason the warmer
91:46
the temperature
91:48
the higher the speed of the airplane
91:51
now there's a positive correlation yeah
91:56
now so whatever this means yeah and
91:59
also what we see here is that this
92:02
difference in speed
92:04
is very often positive; for American
92:08
Airlines, for example, it's positive.
92:09
that means that these airplanes fly
92:11
faster
92:13
when they have a delay compared to when
92:16
they don't have a delay
92:18
and you can also compare the the
92:20
carriers
92:22
united airlines has a little bit more
92:25
speed
92:26
you know than than american airlines
92:29
and you can get all kinds of insights
92:31
here from that now that's of course not
92:33
extremely insightful yeah but it
92:35
illustrates
92:36
how you can do simple operations on
92:38
complex data sets
92:40
in just a few lines of code, and I don't
92:44
have to tell you that in C
92:45
or MATLAB you would spend a lot of
92:47
time and a lot of lines of code to
92:49
implement that.
slide 34
92:52
Just to summarize this part:
92:55
if we had time, I'd show you a little
92:56
bit about plotting, but we don't have time.
92:59
um so i showed you that these data
93:03
frames are efficient ways of storing
93:06
high dimensional data of different kinds
93:10
and
93:13
depending on
93:16
where your data is stored and how large
93:19
it is,
93:20
you can choose different tools yeah if
93:23
your data is stored locally
93:25
and it's very large and you rely on
93:27
speed then data table
93:29
is a good way to go. If your data is,
93:34
for example, on a server on the
93:36
internet
93:37
and it's not that large, then you
93:40
would use
93:40
other packages that are better at
93:42
interacting with network
93:44
resources,
93:45
for example this dplyr package.
93:48
yeah and the key step is actually always
93:52
to clean up the data and
93:53
bring it to this tidy format,
93:56
and once you've done that, we have this
93:59
group-
94:00
aggregate-combine paradigm, where you
94:03
group
94:04
your data by different conditions
94:07
by basically any condition you want and
94:10
then for each group
94:12
you extract the corresponding rows of
94:14
the data table
94:16
and perform operations on them, for
94:18
example summaries
94:20
on these rows,
94:23
and the package that i showed you here
94:25
allows you to do all of these steps
94:27
in a single very short line of code
94:31
And finally, I showed you how
94:34
pipelines actually help you
94:38
to structure your code if it gets
94:41
very complex, which is usually
94:43
the case.
94:44
yeah and uh what i will do is i will
94:47
upload some slides
94:48
that show you how to visualize
94:51
data
94:51
I mean, you all have your
94:54
favorite plotting packages;
94:56
I'll upload some slides that show
94:58
you basically
95:00
how people developed a grammar of
95:03
graphics that allows you to
95:05
make arbitrarily complex
95:09
plots or data visualizations using a
95:12
very simple
95:13
grammatical construct. So I'll upload
95:16
this on the website,
95:17
and then let me know if you have any
95:20
questions
95:21
and otherwise see you all next week
95:25
Next week, I should say, next
95:26
week we go into machine learning, and
95:28
we'll have a guest
95:29
speaker, which is
95:33
basically the chief data scientist
95:37
from one of the institutes here,
95:40
from the
95:41
Center for Regenerative Therapies.
95:44
That's Fabian, and he'll
95:46
share some of his insights
95:48
into how to use machine learning to
95:51
detect
95:51
low-dimensional structures
95:56
in high-dimensional data sets. As
95:59
you probably realize, we're now
96:01
able to
96:02
do fast computations on
96:06
huge data sets, but what I showed you
96:10
was a little bit tedious, you know;
96:12
I didn't actually show you
96:13
how to come up
96:15
with low-dimensional structures, or
96:17
with order, in data.
96:19
So this is something we'll deal
96:21
with next week,
96:22
and we'll have a guest lecturer then,
96:25
which is Fabian, here
96:27
from Dresden.
96:28
okay see you all next week bye
96:32
I'll stay online for a while in case of
96:34
any questions.
96:43
I was wondering if
96:47
you dealt with, like, temporal data,
96:50
or is most of the data that's coming
96:53
out of these experiments just kind of,
96:55
you know, static
96:58
information about the
96:59
genomics and proteomics and all that? Yes,
97:02
so that's typically static data
97:04
but although it's static, it
97:06
contains dynamic information
97:09
indirectly. So you have
97:12
measurements; of course, very
97:15
often in biology you have to kill the cells
97:17
that you actually
97:19
want to measure, or you have to kill the
97:21
animals that you actually want to
97:23
look into, but you have some
97:26
indirect dynamical information. I'll
97:29
actually
97:30
share... you sent me an email about a paper;
97:33
we just
97:34
uploaded something
97:39
that actually answers the question about
97:41
how this field theory
97:43
and data science approaches work
97:46
together and this is actually an example
97:48
from genomics
97:49
i share it with you in the chat um
97:58
just see where this is here we go
98:09
okay so i'll share it with you in the
98:14
chat all right thank you
98:16
There you go. And that
98:18
actually answers the question in your
98:20
email;
98:20
it just needed
98:22
the Christmas holidays,
98:24
some time, to finish and upload.
98:27
and um this is actually an example where
98:30
you have
98:31
static data in genomics you can only
98:34
have static data
98:36
but indirectly you have dynamic
98:39
information
98:40
So actually, although in
98:42
this
98:43
specific example here we have
98:46
static data, because of course you have
98:48
to kill the cells, you have to kill the
98:49
embryo,
98:50
you can still have time courses
98:54
with different embryos or different
98:56
cells, so that you
98:59
kill one embryo at one stage (that's of
99:02
course mouse, not human),
99:03
so you kill one embryo at one stage,
99:06
and then the next embryo one day later,
99:09
and the next embryo one day later,
99:11
so that in the end you get some implicit
99:13
temporal information. And actually in
99:16
this
99:16
same paper we also conducted a time
99:19
course
99:20
with a very high temporal resolution,
99:23
but
99:24
at each time point, when you
99:26
kill, you look at different cells, and
99:28
so
99:31
the best you can hope for is
99:33
something semi-static.
99:35
But nevertheless, we
99:37
still have dynamic information
99:40
indirectly, via the measurements that we
99:43
can make, for example,
99:44
along the DNA. And
99:48
So although we cannot make direct
99:51
dynamic measurements, we can
99:54
indirectly generate hypotheses
99:57
that we can then basically
100:00
test using
100:01
static data. Have a look at this
100:04
manuscript and you'll see how it
100:06
works, and let me know if you have any
100:07
questions.
100:08
It's a little bit more biological, and
100:10
this manuscript
100:11
also has a supplement with all the field
100:13
theory in it.
100:15
okay thank you
100:18
okay great perfect
100:29
okay so if there are no more questions
100:31
then see you all next week
100:34
bye