- Review of last lecture
- introduction & slide 1
- slide 2
- slide 3
- slide 4
- slide 5
- slide 6
- slide 7
- slide 8
- slide 9 & 10: different types
- slide 11 & 12: aesthetic
- slide 13: subplots
- slide 14
- slide 15: plot of 3 variables
- question on assigning facet
- slide 16: error bar
- slide 17: interpolation
- slide 18: scales
- slide 19: positions
- slide 20: coordinate system
- slide 21
- slide 22: extensions
- slide 23: there is also python implementation
Review of last lecture
Just so you know, the lecture is actually being recorded, so you will be visible on the recorded lecture at the end, if that doesn't bother you. Just to let you know. Okay, great.

So hello everyone, welcome back to our lecture. Last time we had a special lecture, a guest lecture by one of our local data scientists, Fabian. Fabian explained to us, from a hands-on perspective, because that's his job, how you can detect order in non-equilibrium systems. The non-equilibrium systems that Fabian is working on are, of course, biological systems, and what he basically showed is how order in non-equilibrium systems manifests itself in low-dimensional structures within high-dimensional data sets. He showed you some methods, which will also appear on the website (I didn't get around to uploading the slides and the video yet), for reducing dimensionality, or for extracting hidden dimensions in these high-dimensional data sets.
introduction & slide 1
So today I will start by giving you a little more introduction to data science and some of the things that we need for the next lecture; that's the first part of the lecture. In the second part I will give you another hands-on experience, a practical example, from start to finish, of how to go through such a data science pipeline.

To start the lecture, we'll go back to our New York City flights data set. There's a little gap because we had to find dates with Fabian last time, so this lecture connects to what I told you two lectures ago. Let me just share the slides; I'll give you a brief introduction to data visualization, just a short one, because the slides have already been on the website for two weeks, so maybe some of you have already looked at them. Let me just share the screen. There we go. Okay, great, perfect.

So I'll give you a quick introduction before we go on to a hands-on example. Today we'll have a hands-on data science example, and next week we'll have a hands-on field theory combined with data science example; that will be next week. So today I still need to introduce you to some methods that we will need. Can somebody confirm that you can see my slides? Is that working? Yes? Okay, perfect.

I'll just give you a quick introduction to how to visualize data. Of course, in your work you're all visualizing data all the time, but if your data is high-dimensional and complex in structure, it actually matters what you use for visualization. Two lectures ago, when we talked about this New York City data set about the flights departing from the New York City airports, I showed you all kinds of ways to do very efficient computations on these data sets. But all these computations didn't really give us any real insight, and the reason was that we were dealing with plain numbers; we never had anything to look at. So today, in the first part of this lecture, I'll quickly show you a plotting scheme, so to say, that is very powerful for visualizing data in general, and high-dimensional data in particular.
slide 2
Before that, just a quick reminder: before we do anything, we always want to make our data set tidy. Typically we collaborate with experimentalists, and we obtain the data from them in a very messy format. The first step is to tidy the data, and that means we need to bring it into a form where every column is an observable or variable, and every row is an observation or a sample.

Sorry to interrupt, did you change your first slide?

Yes. Oh, you can't see that?

No, we cannot.

I don't know, this always happens now in Zoom; there was a Zoom update. Maybe I have to share the entire screen. Let's try this. The problem is just that I have like 100 windows on my desktop. This happens all the time lately, that Zoom is not working properly. Okay, let's share the desktop. I hope I don't have embarrassing stuff on the desktop; no, that's not the case. So you should be seeing my desktop now. Okay, a lot of windows; now you know what I've been working on. Can you see this messy-and-tidy slide, and when I change the slides now, can you see that?

Now it works.

Okay, perfect. So, just a reminder: the first step we always do is to make the data tidy. If you have the data in this tidy format, then we can perform column-wise operations, which in most programming languages are highly optimized and very efficient to run and to program. I introduced you to this very simple R package, data.table, that allows you to implement all of these operations in data science, and there are of course many other packages, in other languages as well.
slide 3
Typically, such an operation consists of three steps: you filter the data (in this data.table package, that's the first parameter), you group the data by some condition (that's the third parameter), and then you perform an operation independently on each group (that is the one in the middle). This is a typical step in such a data science calculation, and I showed you how you can then combine these steps along pipelines to perform more complex operations.
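In data.table, these three steps live in the three arguments of the bracket operator, DT[i, j, by]. A minimal sketch, using the flights table that we load further below:

    library(data.table)
    # DT[i, j, by]: filter rows (i), compute per group (j), group by (by)
    flights[origin == "JFK",                                # filter
            .(mean_delay = mean(dep_delay, na.rm = TRUE)),  # operation on each group
            by = month]                                     # grouping condition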
slide 4
Today I want to show you a way to interact with the data more intuitively. You are all familiar with plotting, of course, and the way we typically do it is that we have a plot in mind, and this plot has a name: a scatter plot, a surface plot, a histogram. Then you look for the function, in Matlab or elsewhere, that gives you this specific kind of plot.

Another way to do it, introduced by someone called Leland Wilkinson in a book, is that you don't give the plots names; you construct these plots with a grammar. That means you have a set of rules that allows you to construct, step by step, almost any visualization of your data. Once you have that, you don't have to remember long names of different kinds of plots. You just add bit by bit, like in a sentence where you add word by word to make the sentence richer in information, and the only thing you need to know is the grammar itself. This allows you to create very different kinds of visualizations from a simple grammar.

This idea that you have a grammar of graphics, a set of rules that allows you to construct visualizations, is implemented in R in the ggplot2 package, which is quite famous, and in Python, a little bit newer (I don't know how well it works), in the plotnine package, which realizes the same idea. In R we just load it with the library command, library(ggplot2), and then we are able to use all of these commands.
slide 5
The basic idea is that we start with a tidy data.table, or data frame in R, and we take this and assign different visual characteristics of our plot to different columns of this data table. The first thing we have to do is to say what to plot, a point or a line, and that's called the geometry: a point, line, bar, circle, whatever. Then we have this mapping that I mentioned, where we map different aesthetic properties of our plot to different columns in the table. For example, we could say that the position on the x coordinate should be what is in column a, the y coordinate should be what is in column b, the size of our dots should reflect whatever is in column c, and the color should be what is in column d. How these values are then translated to specific colors or to specific sizes is a different question. Now we have the aesthetic properties of our plot, of our dots: where they are located and what they look like. Then we just have to define a coordinate system to define where they appear on the screen. If we have these things together, we have the simplest version of a plot, on the right-hand side.
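As a sketch of this mapping idea, with a made-up table whose columns are named a, b, c, d as on the slide:

    library(ggplot2)
    # a small tidy table (illustrative only)
    df <- data.frame(a = 1:10, b = (1:10)^2, c = runif(10), d = letters[1:10])
    # map columns to aesthetic properties, then add a geometry;
    # the coordinate system defaults to Cartesian
    ggplot(df, aes(x = a, y = b, size = c, colour = d)) +
      geom_point()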
slide 6
The way this works in practice is that you have these little building blocks that you just put together, line by line. In R it looks like this: you first create an object, and that is just the ggplot command, where you tell the plot what data to use as the first argument, and the second argument is how to map different columns of your table to different visual properties of your plot. Then you add a geometry, for example point or line, and you have your first plot already.

You can of course also add more properties: more geometries, or more detailed aspects of your plot. For example, if you are not happy with Cartesian coordinates, you can set your own coordinate system, say polar coordinates. You can have subplots by adding a further rule, called facets. You can change how values in the data table map to different properties, so which color represents which value in your table. You can change themes, for example the way lines and so on are drawn, and you can of course also save your file. These different building blocks of your plot are connected via plus signs: you put as many of these aspects after each other as you want, and by this you construct more and more complex plots. For everything you see here below there are sensible defaults, for example Cartesian coordinates, which in many cases you don't need to touch.
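Schematically, a plot is built like this (a sketch with the made-up table from above; every line after the geometry is optional and has a sensible default):

    g <- ggplot(data = df, mapping = aes(x = a, y = b, colour = d)) +  # data + mapping
      geom_point() +           # geometry
      coord_cartesian() +      # coordinate system (the default anyway)
      facet_wrap(~ d) +        # subplots (facets)
      scale_colour_brewer() +  # how values map to visual properties
      theme_minimal()          # overall look
    ggsave("my_plot.pdf", g)   # save to file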
Now let's have a little look at our New York City data set. Here we go.
slide 7
We already discussed this two weeks ago. In this New York City data set we have information about flights departing from the New York City airports. For each flight we have different pieces of information: for example the time when the flight departed, the origin airport, the number of the plane that was used, the carrier, and so on, and the delay for this specific flight. As we already discussed, you can connect this table, which you can download from our GitHub, from the website, to other sources of information: for example the weather information for a given point in time and a given location, airport information, information about the planes, and also information about the airlines if you want.

And that's what we do: we load all of these different files again, just like last time, using the function fread, and then we merge them together using these merge commands. When we merge them, we sometimes have to specify the keys. For example, when we merge with the airports, we have to say that in one table the airport identifier is in the column origin, and in the other the airport identifier is in the column faa. We merge all of these things together and we have a huge data set, a data table containing all of this information line by line, in a tidy format. We already did that two weeks ago.
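In code, the loading and merging step looks roughly like this (the file names are placeholders, not necessarily the ones from the course website, and the weather join keys are an assumption):

    library(data.table)
    flights  <- fread("flights.csv")   # one row per departing flight
    airports <- fread("airports.csv")  # one row per airport
    weather  <- fread("weather.csv")   # hourly weather per airport
    # the airport identifier is called "origin" in flights and "faa" in airports
    flights <- merge(flights, airports, by.x = "origin", by.y = "faa")
    # weather is matched on place and time
    flights <- merge(flights, weather,
                     by = c("origin", "year", "month", "day", "hour"))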
Now let's have a simple look at these plots. Let's have a simple plot.
slide 8
The first thing we can do, just like last time, is to calculate the average delay for each month: we group the data by month, and for each month we take the average over all departure delays and save that average in the column mean_delay. What we get is, for each month in the first column, a mean departure delay in the second column. Below you can see the simple plot you can make: you tell this ggplot function to take this table, a very simple table, map the month to the x-axis and the delay to the y-axis, and then you just add a geometry, which is just a point. Then you get what you see on the right-hand side.
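Put together, the computation and the plot are just (a sketch):

    delays <- flights[, .(mean_delay = mean(dep_delay, na.rm = TRUE)), by = month]
    ggplot(delays, aes(x = month, y = mean_delay)) +  # mapping: month -> x, delay -> y
      geom_point()                                    # geometry: points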
You see that something is happening here in the summer months, and something is happening over Christmas, apparently. Okay, there's someone in the waiting room. So, something is happening over Christmas. Now let's go on.
slide 9 & 10: different types
We can of course also add different geometries to a plot. So far we just used the geometry of a point, or of a line; we can also add different geometries. For the sake of simplicity, I'm using the tools that I introduced two weeks ago to do all of these things in one line: we take the flights data set, calculate the average delay for each month, and send everything with this pipe operator to the ggplot. In the ggplot we just need to define the aesthetic mapping, that the x coordinate is the month and the y coordinate is the delay, and we save everything in an object g on the left-hand side.

Now we can take this g and add different things to it. We can add different geometries: on the top left we have the point as before; we can add a geometry line, then we get a line; we can add a bar, that's called column; or we can add all of them together to the plot, and then we have all of them together. The information about what happens with the data is not defined in the geometry; we have done that once in the beginning, and now we can just operate on this object, add different things, and change the plot the way we like it.
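A sketch of this (here with the base R pipe |>; the lecture may use a different pipe operator):

    g <- flights[, .(mean_delay = mean(dep_delay, na.rm = TRUE)), by = month] |>
      ggplot(aes(x = month, y = mean_delay))    # data and mapping, defined once
    g + geom_point()                            # dots
    g + geom_line()                             # a line
    g + geom_col()                              # bars ("column")
    g + geom_col() + geom_line() + geom_point() # all of them together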
There are also geometries that involve analysis. For example, if you have a background in biology, then maybe you know your favorite, the box plot on the right-hand side, which summarizes different properties of a statistical distribution. Here on the left-hand side I take the flights, all this combined information, use the carrier as the x coordinate and the logarithm of the departure delay as the y coordinate, and I add this box plot, where I automatically get the median, the interquartile range, and, I always forget what the whiskers mean, probably the range of the data without outliers, or so. In some disciplines these box plots are used to characterize distributions. Another way to characterize distributions are violin plots, which essentially give you a plot of the probability distribution, just in a vertical manner: the thicker the violin is, the higher the probability to find a data point there.
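For example, something like this sketch (keeping only positive delays so the logarithm is defined):

    ggplot(flights[dep_delay > 0], aes(x = carrier, y = log10(dep_delay))) +
      geom_boxplot()   # median, interquartile range, whiskers, outliers
    # or the whole distribution as a violin:
    ggplot(flights[dep_delay > 0], aes(x = carrier, y = log10(dep_delay))) +
      geom_violin()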
slide 11 & 12: aesthetic
Of course we can now also play with how these plots look. For example, if you look at the red part, I'm doing the same operation: I'm calculating the average departure delay for each month, each airport, and each carrier, but now, just for simplicity, I only take the big three carriers: United Airlines, Delta, and American Airlines. Now I create this plot again. I have this aesthetic mapping, the month should be the x coordinate, the delay the y coordinate, and now I have another aesthetic, which is the color: I say the color should be the origin airport, and the line type should correspond to the carrier. Then I just add the geometry of the line, and I get the plot that you see on the bottom here.
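A sketch of this plot (UA, DL, AA are the carrier codes of the big three):

    big3 <- flights[carrier %in% c("UA", "DL", "AA"),
                    .(mean_delay = mean(dep_delay, na.rm = TRUE)),
                    by = .(month, origin, carrier)]
    ggplot(big3, aes(x = month, y = mean_delay,
                     colour = origin,        # colour encodes the origin airport
                     linetype = carrier)) +  # line type encodes the carrier
      geom_line()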
You can see that all carriers have a problem in the summer months and also over Christmas. And something is going on around March; no, wait, that's not American Airlines, that's Newark. American Airlines has a problem in March. You see, the plot is not perfect yet.

Okay, so we can go on; we can change other aspects of the plot. For example, here I say that the fill should be the airport, and then I use a box plot, and I get an overview of how the different airports compare to each other for each carrier. What you can see is that JFK is doing well for some of them, but not for all: for American Airlines, for United, and, where is it, for Delta it's doing well. But there is no clear trend here, of course.
slide 13: subplots
Something that's more interesting is to plot these delays for the big three carriers as a function of the hour of the day. Here the x coordinate is the hour, and I turn that into a factor, from numeric to something that is discrete, just for plotting purposes. The fill color is the origin airport, and I added here a subplot, that's called a facet, by carrier. If you remember from two lectures ago, this is a formula; we can use a formula in R to specify how plots are distributed across different subplots.
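A sketch of such a faceted plot:

    g_box <- ggplot(flights[carrier %in% c("UA", "DL", "AA") & dep_delay > 0],
                    aes(x = factor(hour),      # hour as a discrete factor
                        y = log10(dep_delay),
                        fill = origin)) +      # fill colour: origin airport
      geom_boxplot()
    g_box + facet_wrap(~ carrier)              # one subplot per carrier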
You can see that for Delta and United Airlines you can nicely see how these delays add up during the day. And it even looks a little bit as if... let's have a look at the next slide.
slide 14
Let's have a look at the next slide. We can also have more complicated subplots; for example, we can have a grid, by using a more complicated formula, where the y direction should be the origin and the x direction in this grid should be the carrier.
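The only change is the facet rule, which is now a two-sided formula (rows ~ columns), reusing the plot object from the previous sketch:

    g_box + facet_grid(origin ~ carrier)  # rows: origin airports, columns: carriers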
Then we get these plots, and you can actually see how these delays add up during the day. And it seems, and this is speculation, that because we have a logarithm on the y-axis and we see a linear increase in these delays over time during the day, you have an exponential build-up of delays. That's quite interesting.
slide 15: plot of 3 variables
Okay, so we can do other fancy things if we have more than two variables. For example, when we calculate the average delay as a function of the month, the hour, and the origin airport, then we have even more variables that we want to visualize. We can do that, for example, with something called a heat map. In this heat map, the fill and also the color of the tiles is given by the mean delay, while the month and the hour are plotted on the axes. We add the geometry of the tile to get these heat maps.
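A sketch of such a heat map (faceted by origin airport, which is what the next question refers to):

    by_mho <- flights[, .(mean_delay = mean(dep_delay, na.rm = TRUE)),
                      by = .(month, hour, origin)]
    ggplot(by_mho, aes(x = hour, y = month,
                       fill = mean_delay, colour = mean_delay)) +
      geom_tile() +         # one coloured tile per (hour, month) combination
      facet_wrap(~ origin)  # one heat map per origin airport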
Then you can visualize relationships between two variables, namely month and hour, and it seems like this build-up of delays is especially drastic in the summer months, while it's not that evident in other months.
question on assigning facet
Excuse me, I have a question on syntax. Where you've written facet_wrap, does the first argument tell us the x argument, so that origin will be plotted on the x scale?

Yes. So this facet_wrap is just to say: okay, take one column of the data table, in this case origin, group the data according to this column, and then make one plot for each of these origin airports and put them next to each other, as many as fit on the screen. And if they don't fit on the screen, go to the next line. It's basically just this "wrap": you have a one-dimensional, so to say, line of plots, compared to this grid. The grid is basically the same thing, but there we have two directions: origin airport on the y direction and carrier on the x direction. This is the formula notation in R. It's a little bit counterintuitive, but you give it a formula in order to tell the package how these plots should be distributed on your screen, or in the PDF file that you export.

The reason I used a formula here is that you could do something more: you could have, say, carrier plus month. If you do something like that, you can have a more complicated formula saying that the combination of carrier and month, of these two columns, should be on the x direction, while on the y direction you have origin. So you can construct more complicated grids of plots if you wanted to.
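In formula notation, the variants just discussed are, as rules added to a plot:

    g_box + facet_wrap(~ origin)                  # a wrapped row of subplots, one per origin
    g_box + facet_grid(origin ~ carrier)          # a grid: origin as rows, carrier as columns
    g_box + facet_grid(origin ~ carrier + month)  # columns: every carrier-month combination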
Very often that's not very useful, though; just think about what fits on the screen.

That was my question, because on the next slide the origins have been plotted on the x scale, whereas in this particular case the origin airport has been plotted along the y scale.

Exactly. So now, let me get my mouse cursor back; okay, now I can change the slide, hopefully. Here we go. Here I just left away the first argument, the left-hand side of the formula, which was originally the y direction, so now I only have the x direction left, from left to right. And I use wrap here, and not grid, because if I have not three but, say, 15 different groups, I don't want them all on the same line; I wouldn't be able to see them on the screen. Wrap means: once the screen is full, go to the next line. Nothing else; it's not a complicated thing. It just makes one plot for each origin.

Excuse me, on this heat map, the minimum value of the color bar is not zero. Does that mean there were flights that departed earlier than scheduled?

Yes, exactly, they departed earlier.

That's not very funny for the passengers.

Yes, but sometimes that happens, and it's also a question of how the data is recorded. These negative departure delays specifically affect the early mornings and the very late times. Sometimes that can happen; the question is what is recorded. It's probably not the time when the gates close, but rather the time when the airplane starts, or something like this. And as you know, it happens quite often that once boarding is completed, the airplane leaves a little bit earlier than the schedule says.

Okay, thanks.

But as always, there's a good lesson here: it's always good to know how the data was actually collected. You think that a delay is well defined, but you can measure it in different ways, and that's always a very important aspect. In this data set there are also missing numbers, a lot of missing numbers, and that's when an airplane started somewhere but didn't end up at its arrival location but at another airport. That's also possible.
slide 16: error bar
Okay, let's go on. We can also use ggplot to do statistical computations on the fly, and this is particularly useful for computing fancy error bars; that's how I use it. What we do, for example here at the top, is that we have the hour on the x-axis, the departure delay on the y-axis, and color and fill as the origin airport. For each of these combinations, because we take the raw data on the left, we have many different values, many different flights. We can then take a statistical summary function from ggplot, tell it to calculate the mean, and use the geometry of a line, and it does the computation for us. For simple things like calculating a mean that's pretty good, but the nice thing is that we can also use summary functions that are more complicated. For example, here we have bootstrapped confidence intervals; that's basically a fancy way of calculating confidence intervals, and we use the geometry of a ribbon to visualize them. If you have dealt with confidence intervals, you know they're quite complicated to calculate, and then you somehow have to bring them into your plot. Here you don't have to worry about any of this: you get the most fancy methods in one line, and you get a nice visualization of the uncertainty of your data.

Another thing we can do: in the upper case our x-axis is discrete, because it's an hour from 0 to 23 (or 5 to 23). But sometimes we have real-valued numbers, and then we need to tell these functions which values to put together; that means we can bin the data. An example here is the temperature in Fahrenheit on the x-axis. We cannot calculate a separate mean for every value of the temperature, because temperature is a real-valued quantity; we need to define bins that summarize ranges of the temperature. That we can do automatically with these stat_summary_bin functions, where we just tell the function to bin the data, calculate the mean for each of these bins, and plot the result with the geometry of a line. And we can do the same fancy error bars and confidence interval calculations as before. As you probably know, these kinds of computations are quite complicated if you have to do them yourself.
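A sketch of these on-the-fly summaries (mean_cl_boot is ggplot2's wrapper around the bootstrap routine and needs the Hmisc package installed):

    # mean departure delay per hour, as a line, with a bootstrapped
    # confidence interval drawn as a ribbon
    ggplot(flights, aes(x = hour, y = dep_delay, colour = origin, fill = origin)) +
      stat_summary(fun = mean, geom = "line") +
      stat_summary(fun.data = mean_cl_boot, geom = "ribbon",
                   alpha = 0.3, colour = NA)
    # for a continuous x like the temperature, bin first, then summarize per bin
    ggplot(flights, aes(x = temp, y = dep_delay)) +
      stat_summary_bin(fun = mean, geom = "line", bins = 30) +
      stat_summary_bin(fun.data = mean_cl_boot, geom = "ribbon",
                       bins = 30, alpha = 0.3)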
Here the message is probably that it's not good if it's too cold or too hot. But then there are correlations: when it's hot, here on the right-hand side, that's also when the holidays take place, July and June, which showed these delays in the previous heat maps. So it's not quite clear whether it's the temperature that's bad for the engines, or something like this, or whether it's just the number of people who go on holiday, block the airport, and lead to delays.
slide 17: interpolation
We can do even more fancy things: we can do interpolation. On the left-hand side I have the same plot as before; we have the month, wait, okay, there's a little error here, we have the hour on the x-axis. And now we can add an interpolation line, just in one line, with this summary function. If we wanted to, we would be able to do non-linear interpolation, linear interpolation, or anything we want, just with an argument, and we get our usual nice error bars for free. Of course, if we can do non-linear fits, we can also do linear fits, so we can fit linear models to check for correlations.
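Both are one added rule; the fitting method is just an argument (a sketch):

    g + geom_smooth()               # non-linear smoother, with a confidence band
    g + geom_smooth(method = "lm")  # a linear model fit instead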
That's what we do on the right-hand side, and what is actually quite interesting here is that on the x-axis we have the month, on the y-axis we have the number of seats in an airplane, and the color is the origin airport. What we find is that there's an almost perfect linear relationship between the month and the number of seats, which is positive for Newark and JFK and negative for LGA, which is, I think, LaGuardia. No idea where this comes from, but apparently you have a higher likelihood to sit in a smaller plane in December if you are departing from LaGuardia, while at the other airports the planes get linearly larger throughout the year, for some reason. That's one of those things where you should be suspicious and check what the underlying reason in the data actually is. That's something I will also show you later: it's very important to check, when you do statistical computations, whether they actually make sense or not. Big data gives you every result that you want if you just look for it; just because you have so many dimensions, so many samples, you can find every hypothesis you want in these data sets if you just keep looking for it.
slide 18: scales
So, we can play around with scales. That means we can change how our plots look. For example, this plot here at the top is something we have seen before; that's the box plot, and we save it in a variable p. Here you see this weird arrow assignment operator in R; why the R community likes it, I don't know, but you can assign in the other direction: I build the plot and then assign the result to a variable p with the rightward arrow. It's an asymmetric assignment. So now we have our plot, and we can add different color scales, different ways in which our data values map to visual characteristics of the plot. For example, we can add a scale_color_brewer, then we get different blue tones, or we can add a manual mapping, where we say that we want black, gray, and white as the colors for our airports.
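For example (a sketch; p is the saved box plot, with the fill mapped to the airport):

    p + scale_fill_brewer(palette = "Blues")                     # blue tones
    p + scale_fill_manual(values = c("black", "grey", "white"))  # one colour per airport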
slide 19: positions
So we can change visual characteristics, and we can of course also change how things are positioned relative to each other. I'll go quickly over this because it's a little bit of a detail. For example, we can create this plot here on the right-hand side, where for each month, origin, and carrier we calculate the average delay for the three carriers, and then we make a plot where we assign the month to the x-axis, the delay to the y-axis, the fill color to the origin airport, and the transparency of this color to the carrier. We can plot all of this using a bar plot, and if we have a bar plot, we can decide how to put these bars relative to each other. I'll just give you three examples: we can stack them on top of each other, that's on the right-hand side; we can dodge them, which means we put them next to each other, that's in the middle; and we can use a fill position, which means we always fill the bars up to one, so we look at the fraction that a certain carrier and origin airport contribute to the total delays.
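A sketch of the three positions (by_moc stands for the per-month/origin/carrier summary table just described):

    b <- ggplot(by_moc, aes(x = month, y = mean_delay,
                            fill = origin,     # fill colour: origin airport
                            alpha = carrier))  # transparency: carrier
    b + geom_col(position = "stack")  # bars stacked on top of each other
    b + geom_col(position = "dodge")  # bars next to each other
    b + geom_col(position = "fill")   # bars filled up to one: fractions of the total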
Let me just see if there's anything here. You can see, for example, that a large fraction of the delays in March actually comes from this Newark airport, while in other months, for example in the summer months, the larger fraction of the delays actually comes from the other airports, JFK and LGA. We can also do something else.
slide 20: coordinate system
We can also change the coordinate system. So far we always assumed that we have Cartesian coordinates, but we can of course have any other coordinate system. Here, for example, we are plotting the wind direction as the x coordinate and the departure delay as the y coordinate, and then we just calculate the average delay, again using the summary function, for certain intervals of the wind direction. We can then plot that in different ways. We can plot it in Cartesian coordinates, of course, but something that's more instructive, when we talk about directions, is to use polar coordinates, and you can see that I do that just by adding one line, one more rule, to the plot. And now I can add more aesthetic mappings: for example, I can separate these different contributions from the wind direction by airport, as before. This is what I've done here; I just added one more aesthetic mapping, I said that these bars should be next to each other and not on top of each other, and I have the polar coordinates as before.
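A sketch of this wind rose (wind_dir is the wind direction column that came in with the weather merge):

    ggplot(flights, aes(x = wind_dir, y = dep_delay, fill = origin)) +
      stat_summary_bin(fun = mean, geom = "bar",        # mean delay per direction bin
                       bins = 36, position = "dodge") + # bars next to each other
      coord_polar()                                     # one added rule: polar coordinates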
Then you get the plot on the right-hand side, and in this plot you see that there is a relation between the departure delays and the wind direction, specifically when the wind comes from, what is that, southwest, west-southwest, for the two airports LGA and Newark. Whatever the reason for that is.

It's actually there for all of them.

Yes, it's actually there for all of them, but it's especially strong for LGA and Newark. And if you look at the location of New York, that's where the sea is; that's probably also where a lot of the strong winds come from.

Okay, this was just playing around with the data, and you get some insights just from visualizing it. These insights are of course much harder to get if you just look at data tables on the console, as we did two weeks ago. What you can also see here, as you create such plots, is that you can make more and more complicated plots, but the complexity of your code only increases linearly, because you're adding just one bit, one layer, at a time to your plot. You can make plots as complicated as you want without adding more and more complexity to your code, and without requiring more and more specialized functions. That is the advantage of having such a grammar of graphics: simple visual rules that allow you to add more and more components to a plot.
slide 21
And then, of course, we can make these plots look nice. We can add things like axis labels for all of our columns. Typically you get a data table where some experimentalist has used their own notation for things, which doesn't make much sense most of the time; you want to have your own names for the axes, the colors, and the legends, specifically including units, if you want to publish it. You can do that easily with this labs command, and there's also a title command: you can add a title and annotate your plot as much as you want.
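For example (a sketch):

    p + labs(x = "Hour of the day",
             y = "Mean departure delay (min)",
             fill = "Origin airport") +
      ggtitle("Departure delays at New York City airports")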
slide 22: extensions
And then you can get as complicated as you want: you can download extensions. For example, some nice extensions add new geometries and new coordinate systems to these plots. Here, these plots that are used in anatomy add human body and mouse body coordinate systems, and you can then easily, without more complexity than what I already showed you, visualize your data, for example imaging data, on the mouse or human body, or whatever you want to do. There are a ton of different extensions like this.
slide 23: there is also python implementation
Okay, so this is a very efficient way of visualizing data that relies on a grammar, a set of rules. I showed you an R implementation, but there's also a Python implementation. The Python implementation is rather new, so I don't know what its quality is. What we now want to do is this: I want to show you how to use these tools that we've seen in the last couple of lectures in a specific data science project.
The following slides are nowhere to be found in the slides shared on the course website. The lecturer went through a Jupyter notebook about an RNA sequencing project.
For this we'll just go through the code of a real data science project, a project that Fabian actually did while he was in the group. The starting point of this project is a so-called sequencing experiment. I've already shown you this table; this is, so to say, the matrix that experimentalists would send you: every row is a different gene, and every column is a different cell. We have maybe twenty thousand, thirty-seven thousand cells, and for each of these cells we have roughly ten thousand measurements. These measurements correspond to how strongly a certain gene, the one in the row, is expressed in this particular cell; so these numbers correspond to how many products of these genes the experimental techniques found in a given cell. These genes, as you might have heard, tell us a lot about what cells are doing, how they're behaving, and what kind of cells they are, so they're very important molecular measurements of what's going on inside cells. For example, this gene here, the one with the cryptic ID in row four, is not expressed: it has a little bit of signal in this particular cell, but not in other cells. Other genes, like this one here, have very high expression values: very high counts of products from these genes were detected in these experiments.
What I have to tell you is that these experiments are extremely messy. In particular, there is a step where the data is exponentially amplified, and that exponentially amplifies errors in these data sets, so it's a big mess. And now we have to find some structure in this high-dimensional data set, in these genomics experiments. To show you how this works, I'll share another screen. Where are we? Here we go.

I'll just give you a hands-on look at how this actually works; I won't tell you too much about the biological background of this project, because it's not yet published. You should be able to see the browser, right? You can see that this here is actually a combination of Python, the first block, and R. Here he's loading some R packages, and here he's loading Python packages, and all of this is a Jupyter notebook, combining R and Python to take the best of both worlds. Then there's a lot of data loading going on; we don't have to look in detail at how the data is loaded. There is also some biological background information about what different genes are doing, and so on.

And now we start with the pre-processing of the data. As I told you, this data is messy; it has something like 80 percent nonsensical information. In other words, it is dominated by technical noise; the technical noise is extremely strong, and it gives rise to very weird results. So the first step we always have to do, in this particular example from genomics but also in other data sets, is to look at the data and polish it in such a way that we are actually, in principle, able to detect information in it.
For example, this plot here at the top basically shows you the percentage of all the information in a cell that goes to certain genes. They have these weird names, which are completely irrelevant, but you see that this gene at the top, the one with an even weirder name, comprises in some cells eighty percent of the information. That does not make any biological sense, because if you have thirty thousand genes in a cell, it can't be that the cell is completely packed with products from a single gene; that cannot happen in real life. And that's why we see that there are a lot of cells, everything where we have more than maybe 30 percent here, where we don't have any reasonable information. That means we need to do quality control: we need to keep the cells that actually carry meaningful information and filter out the cells that don't.
What we do is look at such histograms: we calculate probability densities over all cells, so over all columns of this matrix. On the x-axis is the total amount of information that we have for a cell, the total number of molecules that we detected for a single cell, and you can see this follows a distribution. The first thing you see is that there are two peaks: some cells are worse and some cells are better, so there is already some variance in the data just because the quality of our measurement differs between two groups of cells. But all of these are actually good values, and we just take out some cells here, at this vertical line: cells that are below one million of these counts we throw away.

We can also look at other counts. This, for example, is how many genes we detect, and here we also cut off cells that are low quality, in these tails; we just remove them from the data set. Because we know that if we kept them in, in the long term we would have problems with things that are based on machine learning, clustering, dimensionality reduction; these cells would dominate in the end. So we remove them from the data set.
Sorry, excuse me. Is there any systematic way to set the threshold for filtering?

In this case it's a matter of experience; there's not a systematic way. Normally, in such a data set, you would set the threshold here in the middle between the two peaks. But because the two peaks are both at reasonable values and of about the same height, we would lose 50 percent of the data, and that's a little bit too much; they're all reasonable. But we have to check later that, if we find two groups of cells in the data, these two groups are not just representing these two peaks in the quality of the measurement. So we now go on with the analysis, and if something is suspicious, we go back to this stage, and we might have to be more rigorous with this cutoff. In this particular case there's no rigorous way of doing it; it's a matter of how much you expect, of what a good measurement is. This is actually a pretty good example in terms of these counts; this is zebrafish, by the way, so we have fewer genes than in other animals in total. Sometimes you have another peak here at very low values, and that we would then completely cut off.
Here, the bigger problem is this plot: we have a lot of cells with a high percentage of mitochondrial counts, that is, genes on the DNA that sits in the mitochondria. This DNA does not produce many gene products, so it is suspicious if you have too much of that in these cells, and here we take out roughly, I guess, 20 percent of the cells; that's what we lose in this step. We can also plot both against each other: for example, we can have on the x-axis these different values that represent the quality of our data, and then draw the thresholds, the vertical lines that we had in the histograms, in this scatter plot, and see visually what kinds of cells we keep.
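The notebook itself mixes R and Python; as an R-only sketch of this quality-control step (using Seurat here, which is not necessarily what the notebook uses; the thresholds are the ones mentioned above, or placeholders):

    library(Seurat)
    # counts: the genes-by-cells count matrix from the sequencing experiment
    srt <- CreateSeuratObject(counts = counts)
    # fraction of counts coming from mitochondrial genes, per cell
    srt[["percent_mt"]] <- PercentageFeatureSet(srt, pattern = "^mt-")
    # keep cells with enough total counts and a low mitochondrial fraction
    srt <- subset(srt, subset = nCount_RNA > 1e6 & percent_mt < 30)  # placeholder cutoffs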
Okay, so now we got rid of the bad stuff, the things that are totally crazy. The next thing we need to do is to make cells comparable. Cells still have different measurement qualities; they're still different for technical reasons: for some cells we have a lot of information, a lot of these detected counts, and in other cells we have less. But we want to make them comparable to each other, and that's why we have to normalize the data. There are fancy ways of doing this in genomics, and you can see we're basically doing all of that here. But we have to normalize the data, we have to make the cells comparable; that's what you always have to do.

What we also do here: these counts, as you could see in the matrix that I showed you, contain very large numbers and very small numbers. These counts live on an exponential scale; their distributions are very skewed: a few cells, or a few genes, have a huge amount of these counts. That's not something that works very well with dimensionality reduction or clustering methods, so we take the logarithm; we log-transform the data for any further processing. That's also something you do whenever your data is too spread out, or comes from some exponential process: you log-transform it because you want something like a normal distribution, something symmetric and rather compact. Okay, let's go on: here we can also do a variance-stabilizing transformation, and we do more stuff on the data.
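A sketch of this step, again in Seurat terms (the notebook's actual normalization may be fancier):

    srt <- NormalizeData(srt, normalization.method = "LogNormalize")  # per-cell total, then log
    srt <- FindVariableFeatures(srt)  # keep the informative genes
    srt <- ScaleData(srt)             # centre and scale for the PCA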
Then we can start to understand the data. The first thing we need to do is to see whether what we have actually makes sense: are we looking at biological, real information, or are we just looking at technical aspects of the experiments? As a first step, we plot a little PCA; Fabian showed you last week what a PCA is. On the principal component analysis plots, the data looks like this. Here I plot, for example, the total amount of these counts in a cell; that's a technical quantity, it's just measuring the quality of the measurements. You can see there is some variability: these cells have a little bit more, these cells a little bit less, and some of this technical variability is captured by the principal component analysis. But here we're fine with it: it's not extreme, we don't have disconnected clusters, so it's already in good shape. Also, we know that cells can have these differences in the total number of molecules for biological reasons.

Then we can look at these plots here: what is the percentage of variance actually explained by certain principal components (the y-axis is actually something else, but we can order the principal components by how much of the total data they explain). And what we do, and this is sort of the professional way of doing things, is that we do all further calculations not on the raw data but on the principal components of the data. That's an intermediate step we take just to get cleaner results in the end. So we take the first 20, 25 or so principal components, which constitute something like 99 percent of the total variance, and we say the rest is noise. That's a way of getting rid of the noise in the data. And now we go on.
63:10
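A minimal sketch of this PCA step, again with scanpy; the number of components and the `total_counts` column (a standard QC metric) are illustrative assumptions:

```python
import scanpy as sc

# Compute the PCA on the preprocessed data.
sc.pp.pca(adata, n_comps=50)

# Color the PCA by total counts per cell to check for technical effects.
sc.pl.pca(adata, color="total_counts")

# Rank components by explained variance; downstream steps then use only
# the leading components (e.g. n_pcs=25) and treat the rest as noise.
sc.pl.pca_variance_ratio(adata, log=True)
```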
and now we go on and do dimensionality reduction,
63:12
further dimensionality reduction
63:16
let me make this larger
63:24
now i hope you can see these plots
63:28
so this is a umap,
63:31
a umap
63:32
that is a non-linear way of reducing
63:36
the dimensions that
63:37
fabian showed to you last week
63:40
and you can see once we do the
63:42
non-linear dimensionality
63:44
reduction our data looks already much
63:47
more structured so these cells here
63:51
are actually from the brain, these
63:53
are brain cells
63:55
and of course there are different kinds
63:57
of cells in the brain
63:59
and because they're different kinds of
64:01
cells in the brain we also
64:03
expect here a structure
64:06
to pop up in these low-dimensional
64:09
representations
64:11
typically these clusters correspond to
64:13
different kinds of cells
64:16
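In code, this nonlinear reduction could look like the following sketch (the neighbor and PC counts are illustrative choices):

```python
import scanpy as sc

# Build the neighborhood graph on the leading principal components only.
sc.pp.neighbors(adata, n_neighbors=15, n_pcs=25)

# Nonlinear dimensionality reduction and plot; at this point the plot is
# still just an unlabeled cloud of dots.
sc.tl.umap(adata)
sc.pl.umap(adata)
```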
yeah so we don't know, that's just a gray
64:18
bunch of cells here,
64:19
of dots, in the two-dimensional plane,
64:21
we don't know what the axes are
64:23
and we don't know what these cells are
64:25
and now we have to dig a little bit
64:26
deeper
64:28
and we do clustering so here is
64:31
clustering
64:32
that's one of
64:35
these community-based clustering
64:37
algorithms
64:39
and we did that for several
64:43
resolutions of the clustering yeah so in
64:45
clustering
64:46
most of the time you have to tell the
64:48
algorithm
64:49
how many clusters you want yeah
64:53
and that's kind of a
64:54
resolution
64:56
that you give to these algorithms, and
64:59
by doing this,
65:00
what you see here are
65:03
different clusterings
65:06
with different resolutions now so here
65:09
you say okay
65:10
give me, what is it, 15 clusters,
65:13
then you get this plot on the bottom
65:14
left, if you say okay
65:16
give me, what is it, eight clusters, then
65:19
you get
65:20
these clusters on the top left
65:24
and you can have more clusters if you
65:27
want yeah
65:28
they're different granularities, but we don't
65:30
know yet
65:31
what makes sense yeah we don't know how
65:33
many real clusters there are in the data
65:37
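A sketch of clustering at several resolutions; leiden is one common community-detection algorithm (the lecture does not name the exact one used), and the resolution values are illustrative:

```python
import scanpy as sc

# Community-based clustering at several resolutions; higher resolution
# gives more, smaller clusters.
for res in [0.25, 0.5, 1.0]:
    sc.tl.leiden(adata, resolution=res, key_added=f"leiden_{res}")
    sc.pl.umap(adata, color=f"leiden_{res}")
```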
but we can take all of them and we can
65:39
go one step further
65:41
how do we know that such a cluster is a
65:43
real biological cluster
65:46
we know that if the cells in a cluster
65:50
all share some property that is not
65:53
shared
65:54
by other
65:57
cells
65:59
now then we know that this cluster is
66:01
something real
66:02
uh that really is going on in the brain
66:06
and the way we do that is
66:09
let's go down then we look at the
66:12
literature
66:13
so now we look at the literature, we look
66:15
at papers,
66:16
and
66:18
then in these papers
66:20
we see okay, there are different kinds of cells
66:22
in the brain
66:25
and people have done experiments
66:28
genetic experiments for example where
66:30
they found for example
66:32
that stem cells express a certain gene
66:36
now for example this glial gene here
66:38
that's expressed by stem cells
66:40
and now we plot this umap with the color
66:45
representing how much of the products of
66:48
this glial
66:49
gene we found in a certain cell
66:53
and now that makes a little bit sense
66:54
and now here in this corner here on the
66:56
top left
66:58
these are our stem cells
67:01
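Overlaying a marker gene from the literature on the umap is a one-liner; `gfap` here is a hypothetical placeholder for whatever stem-cell marker the papers suggest:

```python
import scanpy as sc

# Color each cell by the expression of a literature marker gene.
sc.pl.umap(adata, color="gfap")  # hypothetical stem-cell marker
```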
and then we can see okay so there are
67:03
also neurons in the brain and
67:05
many other things, let's see what's going on,
67:07
yeah so there's another cell type that
67:09
expresses this
67:11
gene, that's the more advanced cell
67:13
type
67:14
derived from stem cells, and
67:17
the next gene here is expressed
67:18
in these cells here, and then you
67:21
can go down
67:22
and identify all of these clusters
67:27
step by step and identify
67:30
what kind of cells you have in the data
67:34
yeah you can do that with more genes
67:37
even
67:38
so there's a lot of genes like these
67:40
genes here that identify neurons
67:43
and different kinds of neurons, so here
67:45
we have a little
67:46
feature of this plot that is
67:48
identified by this gene
67:50
and if you talk to biologists, all of these
67:52
names
67:54
are associated with different shapes or
67:56
different functions of cells
67:59
we can also do more fancy stuff and
68:01
look at different gene scores over
68:03
groups of genes,
68:05
do statistical computations, and
68:08
once we've
68:09
done that, we decide okay,
68:12
for this set of genes here we
68:15
have something unique:
68:16
for each of
68:18
these clusters here,
68:19
for these clusters, we fulfill the
68:21
condition
68:22
that each of these clusters has
68:26
a certain biological function or
68:28
represents a biological function
68:30
because we found a gene in the
68:32
literature
68:33
that corresponds to a certain cell type
68:35
in the body
68:37
that is expressed in one of them but not
68:39
in the others
68:41
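scanpy's rank_genes_groups is one such computational tool for detecting genes that single out a cluster (a sketch; the grouping key and statistical test are illustrative assumptions):

```python
import scanpy as sc

# Rank genes that are differentially expressed in each cluster versus
# the rest; these are candidate marker genes.
sc.tl.rank_genes_groups(adata, groupby="leiden_0.5", method="wilcoxon")
sc.pl.rank_genes_groups(adata, n_genes=10)
```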
yeah so that's very important
68:45
and then we can give these clusters
68:47
names, here for example
68:49
radial glia cells,
68:52
oligodendrocytes, oligodendrocyte
68:56
precursor cells and so on, and neurons
68:59
and then you find okay so these orange
69:01
ones here are the neurons
69:02
and here were the stem cells
69:06
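Attaching these names to the cluster ids is then a simple mapping; the assignment below is illustrative, using the cell types discussed here:

```python
import scanpy as sc

# Map cluster ids to biological names (illustrative assignment).
adata.obs["cell_type"] = adata.obs["leiden_0.5"].map({
    "0": "radial glia",       # the stem cells
    "1": "neurons",
    "2": "oligodendrocytes",  # OC
    "3": "microglia",         # MG
    "4": "OPC",               # oligodendrocyte precursor cells
}).astype("category")
sc.pl.umap(adata, color="cell_type")
```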
and then you can start thinking okay,
69:08
these stem cells somehow
69:09
turn here into these neurons and they
69:12
mature,
69:13
they get more and more mature, and
69:15
then at the end they turn into these
69:17
neurons here
69:18
and then we have other cell types in the
69:20
brain like microglia
69:23
and so on that we also can find here
69:27
now remember, in the umap, the
69:31
distances, the umap
69:35
keeps the global topology intact,
69:38
that means that cells that are close
69:40
in here are
69:42
actually also very similar
69:45
in this high-dimensional space,
69:49
so it's actually tempting to think
69:50
about these
69:52
paths here as a trajectory that cells
69:55
take while they go
69:58
from stem cells
70:00
into neurons in the brain
70:04
now so let's go on now there's a lot of
70:06
consistency checks
70:08
yeah so you have to check all kinds of
70:10
genes
70:11
yeah a lot of them, and discuss a lot
70:14
with the people who actually know,
70:16
who do nothing else in their lives but
70:17
look at these cells in the brain,
70:19
and they know
70:20
all of these genes and all of the papers
70:22
on these genes,
70:23
yeah and you can also do more fancy stuff
70:28
yeah and then you can do this with
70:32
different clusterings, and now we have
70:35
an
70:36
identification, and this one, i said, is
70:38
the one
70:40
that we can live with, with these eight
70:42
clusters,
70:43
while for these higher-resolution clusterings we
70:46
have
70:47
several clusters representing
70:49
the same cell type,
70:51
and we can in principle come back later
70:53
to these higher resolutions,
70:56
and typically biologists want you
70:58
to get as many clusters
71:02
as you can, yeah
71:05
yeah so now we have these classes and we
71:06
can have some measurements
71:08
of how good these clusters are actually
71:12
and there are specific plots
71:15
for this,
71:15
for example these are called dot plots,
71:18
and
71:19
what these plots show is: on the
71:22
x-axis
71:23
are gene names, and on the y-axis
71:27
are the cluster names,
71:31
and the color represents how much
71:34
a gene on average is present in the
71:38
cluster,
71:40
and the size of these dots tells you
71:44
what is the fraction of cells that
71:47
have this gene on
71:49
in this cluster. so if we now go
71:53
here,
71:54
what we want to see is that these genes
71:56
that we have here
71:58
are only on in one cluster but not in
72:01
other clusters
72:02
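A sketch of such a dot plot with scanpy; the marker-gene lists are placeholders for the genes found in the literature:

```python
import scanpy as sc

# Dot plot: color = mean expression per cluster, dot size = fraction of
# cells in the cluster expressing the gene.
marker_genes = {
    "radial glia": ["gfap", "glula"],  # hypothetical markers
    "neurons": ["elavl3"],             # hypothetical marker
}
sc.pl.dotplot(adata, marker_genes, groupby="cell_type")
```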
now for example a good thing is this one
72:05
here
72:06
this oc cluster,
72:10
there's only one cluster here,
72:13
there are these genes
72:16
here that we have
72:17
identified for this cluster, these are
72:20
called
72:21
marker genes, and there are computational tools
72:23
to detect them,
72:25
you only find them in this
72:28
cluster
72:28
but not in the other clusters, same for
72:32
this mg, the microglia,
72:34
we only find these genes in this
72:36
one cluster but not in any other
72:39
cluster
72:40
while if we go for a messy one, the
72:44
opcs,
72:45
or we go for these neurons here,
72:49
then it's more messy and it's not so
72:53
clearly defined
72:54
so if we go back now, for example,
72:57
if we go here and look at these
72:59
plots, then these
73:01
mgs, the microglia, they were very clean
73:04
in this plot that i just showed you,
73:07
and that's also represented in
73:09
this
73:10
umap, they're a different cell type
73:14
that is not produced presumably by the
73:17
same stem
73:18
cells as the neurons
73:21
also these ocs, the oligodendrocytes
73:25
i think,
73:26
that's separated from the rest, so
73:29
there's no overlap here, these are
73:30
distinct cell types
73:33
while for things like this we had an
73:35
overlap between these mature neurons
73:38
and the cluster that's called six
73:40
here,
73:41
there we found that
73:44
there is an overlap actually in these
73:46
markers, they have the same genes
73:48
that they express and probably that
73:50
means that this cluster six
73:52
is an artifact yeah that we cannot take
73:55
too seriously
73:56
and on the right-hand side you can see
73:58
that in the next step
74:01
we went a little bit broader and
74:03
merged this cluster six
74:04
into the neurons here
74:08
so now, of course, i showed you
74:10
basically a finalized version
74:12
normally you go back between these steps
74:14
and the earlier steps
74:16
again and again, back to the
74:17
quality control,
74:19
until you do not have any
74:21
trace
74:22
of experimental, technical parameters
74:25
in your plots, and then also you go back
74:29
between these clusterings, your
74:31
experimentalist friends
74:32
who do the experiments, and the
74:35
literature,
74:35
until you find something that really
74:37
corresponds
74:40
to what makes sense biologically
74:43
it's not that these techniques give you
74:46
automatically
74:47
something, as if there was a
74:50
mathematical criterion
74:53
that would tell you what makes sense,
74:55
you always have to
74:56
do that yourself,
74:59
it's not that you push a button and
75:00
then everything
75:02
works automatically, and that's why these
75:04
people who do this kind of
75:06
analysis are very much sought after
75:09
on the job market
75:12
okay so let me just see if there's
75:15
anything
75:16
else interesting, i think
75:17
you can go on and on forever, you
75:20
know, you can give the experimentalists lists
75:22
of which genes are on and off
75:25
in which cells
75:26
and once the experimentalists have these
75:29
lists they can do more
75:30
experiments uh they can create for
75:33
example
75:34
new animals that lack these genes
75:37
and what i want to show you
75:40
is now something that's on the bottom
75:44
you can see that the analysis is very
75:46
lengthy
75:50
now there are many aspects of course
75:51
that are only relevant for the biology,
75:54
i just want to show you:
76:01
a lot of stuff, that's
76:04
what the biologists are interested in,
76:05
right,
76:06
so all these genes,
76:09
a lot of
76:11
stuff, even more stuff,
76:20
also a lot
76:22
of calculations,
76:24
more heat maps to check
76:28
which genes are on where and so
76:30
on, so that's all
76:32
consistency checks
76:35
and one thing i just want to show you
76:38
is something that's called, let me see
76:40
do we have it,
76:43
okay, it's something that's called
76:46
trajectory inference,
76:48
so these are cells from a brain,
76:52
and what cells do is
76:54
they divide and they produce other cells,
76:57
and then they get more specialized over
76:59
time,
77:00
so these cells, they start as a stem cell
77:03
and then
77:04
they mature until at some point
77:06
they are
77:07
a neuron. so what we did here is
77:11
what we did here, and what you do in
77:12
many cases, is you say
77:14
okay, i have snapshot data, so this
77:17
fish or these fish were killed for the
77:20
measurement,
77:21
i don't have a time course or anything,
77:23
it's a snapshot measurement,
77:26
just one measurement,
77:28
but i have different cells that are at
77:30
different stages of this dynamic process
77:33
of cell maturation. so what people then
77:36
ask is: can we get the temporal
77:39
information back?
77:40
and this is the trajectory
77:42
inference of how
77:44
these different clusters and cell types
77:46
relate to each other,
77:48
and where you then can calculate rates
77:51
of how
77:51
one cell type turns into another,
77:54
the flux that from one cell type
77:57
leads to another cell type
77:59
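A sketch of trajectory inference; PAGA plus diffusion pseudotime is one common approach (the lecture does not say which tool was actually used), and the root-cell choice is illustrative:

```python
import scanpy as sc

# Coarse-grained graph of how the annotated clusters connect.
sc.tl.paga(adata, groups="cell_type")
sc.pl.paga(adata)

# Pseudotime: order the cells along the maturation axis, starting from an
# (illustratively chosen) stem cell as the root.
adata.uns["iroot"] = int((adata.obs["cell_type"] == "radial glia").argmax())
sc.tl.diffmap(adata)
sc.tl.dpt(adata)
sc.pl.umap(adata, color="dpt_pseudotime")
```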
now and in principle based on this you
78:01
can then come up with stochastic
78:03
models that you can also
78:05
compare
78:06
to other experiments or theoretical
78:09
work
78:10
that's what i just wanted to show you
78:13
at the end
78:13
and let me just go through here if
78:15
there's anything else that's worth
78:17
showing you
78:18
a lot of stuff. then you can
78:21
compare to humans and other animals
78:24
you can see where there are similarities
78:27
uh
78:27
these fish here that we're looking at
78:29
are very interesting because they can
78:31
regenerate their brain they can build
78:33
new neurons
78:34
that's something we cannot do, so there is so
78:36
much that we want
78:38
to learn: what are the
78:39
similarities and differences, why are we
78:41
not
78:41
able to do that as humans yeah and
78:45
um yeah and then uh
78:48
we are already you know at the end of
78:51
this very lengthy analysis
78:53
yeah so this is a typical data science
78:56
project
78:57
from start to end and so you can see
79:00
here a mixture of
79:01
r and python and
79:05
the important thing is, and this is
79:07
what we showed you
79:09
in the last lectures, you cannot just go
79:11
and take the data and
79:12
throw a umap on it or do some machine
79:15
learning on it,
79:16
so a large part
79:20
of this pipeline here is to actually
79:24
clean up the data and think about what
79:26
is the part of the data that makes sense
79:29
and what is the part of the data that
79:31
does not make sense
79:33
yeah, so for example think about this new
79:34
york city flights data set:
79:36
does it make sense to have a negative
79:37
departure delay?
79:39
that was a very good
79:41
question,
79:42
and it actually makes sense, and this
79:44
is so to say a data set
79:46
that is used for teaching a lot, so
79:48
it's already cleaned up a lot but
79:50
typically you expect
79:51
to have a lot of nonsensical
79:54
measurements
79:55
in your data, or sometimes you
79:57
have a departure delay of
79:59
10 billion years or so, that would
80:02
be the correspondence in real data,
80:04
somebody made a typo somewhere, and
80:06
then you have that in your data set and
80:08
you have to
80:09
filter it out again. this
80:11
happens all the time.
80:13
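A minimal sketch of this kind of sanity filtering on the flights example (the file and column names are hypothetical):

```python
import pandas as pd

# Load the flights table; the file name is a placeholder.
flights = pd.read_csv("nyc_flights.csv")

# Drop typo-like outliers: no real departure delay is a year long.
flights = flights[flights["dep_delay"].abs() < 365 * 24 * 60]  # minutes
```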
so if you are not looking after this, if
80:16
you're not taking care of this,
80:18
then all of these nice plots here that i
80:20
showed you,
80:21
they won't work. they only work
80:23
because we clean up the data,
80:25
we normalize the data, we transform the
80:28
data in a statistical way to make it
80:31
nicely behaved, with
80:34
no huge outliers or so, and then these
80:37
methods
80:38
like umap and so on work on the data,
80:42
but you always have to do these
80:44
pre-processing steps
80:46
before that to make that work. and the
80:47
next step is, you have these fancy
80:49
methods,
80:50
but taken alone they don't make sense
80:53
so you can see here how this algorithm has
80:56
ordered these
80:56
trajectories: cells move along here in
80:59
time,
81:00
move along this line and turn into
81:02
neurons here
81:04
and of course we have many different
81:05
neurons in the brain
81:07
but to understand what's
81:09
happening here, to make sense
81:11
of this data,
81:12
you have to come up with hypotheses
81:15
you have to connect these hypotheses
81:16
with
81:17
what is already out there in the
81:18
literature and then step by step you can
81:21
construct
81:22
an understanding about what are actually
81:24
here the degrees of freedom
81:26
that you see in this data set you know
81:29
and so this is an example, and
81:32
this
81:32
is an iterative process that you improve
81:36
over time, and so this is an example
81:39
that's a
81:40
purely data science project, there's
81:43
very little physics in it
81:45
and next time we show you how all of
81:47
this connects to something that we
81:49
actually
81:51
did in the first part of this lecture
81:53
namely field theory
81:54
phase transitions, criticality, and so on
81:58
okay great so i'll stay online and in
82:01
case there are any questions otherwise
82:03
see you next week
82:08
bye
82:21
thank you, it was very interesting. yeah,
82:24
so when are you going