This lecture was given by a guest, and there are no slides on the official website, so the usual slide-frame markers are not included. Instead, notes were taken during class; see the notes section.
00:07
okay so let's
00:08
uh let's join uh so welcome
00:12
to our uh lecture again now to this uh
00:16
today we have a special guest as we uh
00:19
which is fabian
00:21
rost that you see in the other window
00:23
and as you remember
00:25
uh we before christmas we introduced uh
00:28
theoretical physics concepts that
00:30
explain how order emerges
00:33
in non-equilibrium systems and then last
00:36
week i started
00:37
introducing some basic tools
00:40
from data science namely what are the
00:43
tools that we can use
00:45
to computationally deal with large
00:48
amounts of data
00:50
and today so i need some special help
00:53
from somebody who's a real expert
00:56
which is fabian you know fabian is a
00:59
trained
01:00
uh physicist and did a phd in
01:02
mathematical biology
01:04
he then did a postdoc in my group at the
01:07
max planck institute in dresden and now
01:10
he's a professional
01:12
uh data scientist and bioinformatician
01:16
at the center for regenerative therapies
01:18
in dresden
01:20
and so he's talking actually about
01:21
something that he's uh
01:23
doing for his professional job which is
01:25
very nice
01:26
and uh over to you fabian thanks a lot
01:29
for
01:30
for sharing your expertise yeah
01:33
thanks thanks for having me here uh for
01:36
giving me
01:37
uh giving me this opportunity so to give
01:39
a lecture on
01:40
exploring high dimensional data so i
01:42
think uh stefan
01:44
introduced me very well thank you very
01:47
much and
01:48
and also made the point of where we are
01:51
in the course
01:53
so i just wanted to ask you i just would
01:56
uh to say that uh whenever you
01:58
uh wanna ask something i think during
02:00
the course you just like uh unmute
02:01
yourself and ask the question so that's
02:03
totally fine to me also
02:05
type something in the chat and i try to
02:07
try to answer and then uh
02:09
like if you're not saying anything uh
02:11
please mute yourself to reduce noise
02:14
um
02:18
all right so um okay i'm
02:22
i'm going to start right away
02:25
um so today will be about um exploring
02:29
uh high dimensional data
02:30
right and as many of you
02:34
know with a high dimensional and in high
02:36
dimensions
02:37
strange things can happen so if you for
02:39
instance look at this thing here what
02:41
you see here is a three-dimensional
02:42
projection
02:43
of a four-dimensional object rotating
02:46
and this
02:47
looks very funny right and i mean like
02:49
four dimensions is where we kind of
02:51
still can kind of get a glimpse but but
02:53
what if it gets to
02:54
thousands or millions of dimensions what
02:56
do we do with these kind of data
02:59
um so before i start
03:02
uh let me let me give you a brief
03:04
structure of the course
03:06
today of the lecture um
03:18
might have to click into the window
03:21
no i actually want to share okay
03:25
my ipad screen yes sorry for the delay
03:29
um all right so so i basically start
03:33
by introducing what i mean by high
03:35
dimensional data
03:41
um i will then
03:44
cover two dimension
03:48
dimensionality reduction methods
03:51
03:58
and also talk a bit like what's the idea
04:01
of of
04:01
reducing dimensions and third if we
04:04
still have time i'll talk a little bit
04:06
about
04:06
clustering which we kind of can think of
04:08
as a discrete form of dimensionality
04:11
reduction i'd say
04:13
so what do i mean by high dimensional
04:16
data
04:18
well let's let's start with an object
04:21
that i call a data matrix
04:28
um x which is of shape n by m
04:41
and maybe you remember from last week um
04:45
tidy data formats so this um uh
04:49
matrix should be in a tidy format
04:50
already so in the rows we have
04:52
n samples
04:57
and in the columns we have m
04:59
measurements or observables
05:08
um so what do i mean by this so so an
05:11
example could be what you what you
05:12
measure here is for instance you um
05:15
uh take a uh take yourself right and you
05:19
measure for instance the length of your
05:21
thumb and the length of your index
05:22
finger and the length of your middle
05:24
finger and so on you do the same thing
05:26
on there
05:26
with the other hand you measure your
05:28
foot size your body height
05:30
maybe uh the heartbeat rate and so on
05:32
and and uh
05:33
you end up gathering more and more
05:35
observables
05:38
and that's m observables and this gives
05:40
us the dimensionality of our data
05:44
um and now you can do this with yourself
05:47
and you can do this well you can ask
05:49
your friends to do it with them and so
05:50
on and then
05:52
you can maybe do it with all people in
05:55
asia all over the world and this uh
05:57
gives you the number of samples that you
05:58
have right
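In R, such an n by m data matrix could be sketched like this; it is a toy example with made-up values, and the column names simply mirror the body-measurement example just described:

```r
# Toy n x m data matrix: rows = samples (people), columns = observables
set.seed(1)
X <- data.frame(
  thumb_length = rnorm(5, mean = 6,   sd = 0.4),  # cm
  index_length = rnorm(5, mean = 7.5, sd = 0.5),  # cm
  foot_size    = rnorm(5, mean = 26,  sd = 1.5),  # cm
  body_height  = rnorm(5, mean = 175, sd = 8),    # cm
  heart_rate   = rnorm(5, mean = 65,  sd = 7)     # beats per minute
)
dim(X)  # n = 5 samples, m = 5 observables
```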
06:00
um now how do you how do you treat these
06:04
this kind of data right if we go one
06:06
step back
06:07
and uh think about how how can we
06:10
visualize
06:11
and uh stephan uploaded some slides on
06:14
that and and maybe we'll also talk on
06:16
that a little bit but
06:18
what can we what can we actually
06:20
visualize easily
06:22
um
06:28
so if you uh if you would have one
06:31
dimensional data right that's rather
06:33
easy right
06:34
you you can um just uh plot your data
06:38
on a one-dimensional line for instance
06:40
and also
06:41
for two dimensions
06:45
um i think this is pretty
06:47
straightforward right
06:49
um so if this would be the
06:53
entries of our data matrix x1 x2
06:57
um we can just plot them as a scatter
07:00
plot right
07:02
and we could for instance see whether
07:03
they are correlated or not
07:05
but um as soon as we are like
07:08
in in three dimensions or larger like
07:10
three dimensions we can still
07:11
handle we can do a three-dimensional
07:13
plot maybe but this is still
07:15
hard to quantify um this this gets um
07:19
this gets uh non-trivial right and so we
07:22
need techniques to
07:24
to uh think about how to actually look
07:26
at these data and how to analyze these
07:28
data so
07:30
the aim of dimensionality reduction then
07:35
i'd say is to reduce the number of
07:42
observables
07:50
while keeping
07:57
relevant information
08:04
and what this relevant information is
08:06
right this is something uh
08:08
you you have to decide by the problem at
08:11
hand
08:12
um now um
08:15
before i go
08:18
into talking about how you can reduce
08:21
dimensionality
08:22
i wanted to show you some examples
08:26
so i'll share some slides for this
08:33
um and so
08:37
i think as many physicists are in the
08:39
audience one example i wanted to share
08:41
something
08:41
that i think many of you already
08:44
know
08:45
which is um where we know where many of
08:48
you know that a
08:49
low number of observables actually
08:51
suffice to to describe the system very
08:53
well and that's in the realm of
08:54
statistical physics and you
08:56
saw a lot of uh examples from stefan by
08:58
that
08:59
but even in a rather easy example like
09:01
a ferromagnetic
09:03
ising model right you know that there's uh
09:07
there's many many spins in this system
09:09
right and imagine you could measure all
09:11
these spins
09:12
um you have a very high dimensional
09:15
representation of the state of the
09:17
system then
09:18
but many of these spins will be
09:20
correlated and we know from statistical
09:22
physics that actually you can
09:24
very well describe the macroscopic state
09:26
of such a system
09:27
with just the magnetization
09:30
now what will happen if we come from the
09:32
other side from the data science
09:34
perspective and that's a
09:36
paper here on arxiv by sebastian
09:38
wetzel and what he did was
09:40
um he
09:44
did a simulation of a ferromagnetic
09:46
ising model on a 28 by 28 lattice
09:49
at different temperatures and so 28 by
09:52
28 gives
09:53
you 784 spins right and he recorded
09:57
the state of all these spins in 50 000
09:59
different samples at different
10:00
temperatures
10:02
and now you have a data matrix with 784
10:04
dimensions right and then you can start
10:07
to
10:07
say okay if i just would have this
10:10
matrix and i wouldn't know anything about
10:11
the system how can i look for order in
10:14
this system and how can i reduce
10:15
dimension and one of these techniques is
10:18
called principal components analysis and
10:20
i'll explain later what this actually is
10:22
but if you do this you get a result like
10:24
this the first principal component
10:27
which is a one-dimensional
10:28
representation of of this
10:31
high-dimensional data matrix perfectly
10:33
scales with the magnetization
10:35
so in this way you would actually be
10:37
able to kind of discover
10:39
an order parameter without knowing
10:42
anything about the system
10:43
by having detailed microscopic data of
10:45
the system
10:47
so this is rather an artificial example
10:50
right because we know very well how
10:52
an ising model behaves but what
10:54
about
10:55
systems where we don't really know how
10:57
they work for instance humans
11:00
um so this is a paper from nature in
11:02
2008
11:04
last author bustamante and
11:07
what they did was they quantified human
11:09
dna
11:10
so they took 3 000 european humans
11:14
and now you can imagine that the dna is
11:18
a long string of
11:19
characters right and and to a large part
11:22
these this is conserved from one human
11:25
to
11:25
an to another human so the dna will be
11:28
pretty much the same
11:29
but at some positions we have variations
11:31
right the dna from one human to another
11:33
is not exactly the same and at half a
11:35
million positions
11:37
the authors quantified the dna state
11:40
and so this gives you a a a data set
11:43
with a dimension of half a million
11:45
approximately for 3 000 humans and then
11:48
they did the dimensionality reduction
11:49
and again they did a
11:51
principal component analysis and this is
11:53
a two-dimensional representation of this
11:55
high dimensional data set
11:57
and what i want to ask you now is to
12:00
look at the picture and you don't
12:01
yet have to understand what the
12:03
principal components mean but i i want
12:05
to ask you to type in the ted
12:07
in the chat if you can what you see in
12:10
this picture
12:23
it's a very difficult or very shy
12:25
audience
12:30
what does it look like someone says
12:33
it looks like an elephant
12:38
it is a scatter plot yes and uh we see
12:41
some clusters yes
12:43
there's also uh some sensible answers
12:47
yes and it's actually um uh the uh
12:53
now i don't see my mouse anymore i think
12:57
adolfo got it right
12:58
yes adolfo got it right it's already
13:00
the second answer
13:01
um it resembles a map of europe and
13:09
that's exactly what the title of the
13:11
paper was uh genes mirror geography
13:13
within europe right so
13:15
this was uh i think it's quite amazing
13:17
so you just
13:18
record the dna of humans right
13:21
and basically this gives you a map of
13:24
europe
13:25
approximately it kind of makes sense
13:26
right that the genes are
13:28
like um uh like um
13:31
well somewhat remain local
13:36
um so here's another example
13:40
um there's a very i'd say maybe
13:42
unphysical example but i think it
13:44
it works nicely these kind of examples
13:47
using
13:48
images work nicely to explain
13:50
dimensionality reduction techniques
13:52
so this is a fashion data set called
13:55
fashion mnist
13:56
provided by zalando research and
14:01
again you can uh these these images are
14:04
on a 28 by 28 lattice
14:06
uh in a gray scale so you can record the
14:10
intensity of each pixel and you end up
14:12
with having 784 pixels for each
14:14
of 60 000 pictures and there's different
14:17
kinds of pictures
14:18
like there's uh shirts and tops and
14:21
trousers
14:22
and there's also shoes i think in there
14:24
and then you can do a dimensionality
14:26
reduction and here
14:29
this is an example for a umap a uniform
14:32
manifold approximation i'll explain
14:34
later what this is it's a non-linear
14:36
dimensionality reduction technique
14:38
and again you can see a number of things
14:41
um
14:42
by just taking this as as vectors right
14:45
and
14:45
projecting them in two dimensions you
14:47
see that uh for instance
14:50
um i think this up here which is
14:51
trousers forms one cluster
14:54
which is very different for instance
14:56
from a cluster which consists of
14:59
sandals and sneakers and
15:02
ankle boots so this is the kind of shoes
15:04
cluster
15:05
uh you have bags here which seem to have
15:07
some similarity to
15:09
tops at some point in a certain way
15:13
so um this is a again dimensionality
15:16
reduction right and
15:18
you can see how you can how you can use
15:20
it to group together different kinds of
15:22
things in in this case just pictures uh
15:25
what you also see
15:26
here is that these things cluster so
15:28
there's a discrete kind of order
15:32
and i'll talk about clustering also
15:35
again
15:35
in the second part of the lecture
15:38
um so the now i'll go into
15:42
dimensionality reduction
15:44
and how it works and i'll start with a
15:46
linear dimensionality reduction
15:48
there's a number so so this is just the
15:50
linear transformation of your data
15:52
there's a number of techniques around um
15:55
for instance principal component
15:57
analysis i'll explain this and there's
15:59
other
15:59
uh techniques like uh non-negative
16:02
matrix factorization or independent
16:04
component analysis
16:06
or factor analysis they are they are
16:09
kind of similar in that they are linear
16:12
but i only have time to cover one of
16:14
them which is the easiest one which is
16:15
pca
16:17
um for that i'll again
16:21
share my ipad
16:41
all right what's principal component
16:42
analysis i'll first sketch the idea
16:47
um so for getting the idea
16:50
it's sufficient to go into two
16:51
dimensions and i'll
16:53
make a plot a scatter plot of
16:56
two-dimensional data
16:58
where let's assume our data looks a bit
17:02
like this
17:06
all right it's essentially very well
17:08
correlated and
17:09
from looking at this now if you would uh
17:12
give me
17:12
x1 right and i would know x1 i could
17:16
very well predict x2 because i
17:18
i see that these two observables are
17:21
very well correlated and then
17:23
pca is a way to formalize
17:26
this and the first principal component
17:28
is designed in a way that it points in
17:30
the direction
17:32
of highest variability of the data and
17:35
in this example this would be this
17:36
direction
17:38
and this would be principal component
17:40
one and then the next principal
17:41
component
17:42
has to be orthogonal to the uh to pc1
17:46
and again point in the direction of uh
17:48
maximum variability of the data and
17:50
there's only one direction left here
17:52
which is
17:53
uh the orthogonal direction here and
17:54
that's pc2 principal component two
17:58
so then you just change the bases of
18:00
your data
18:03
through the directions of the principal
18:05
components
18:06
and then you end up with your
18:08
representation of your data
18:13
like this
18:19
where now in this directions your data
18:21
would be uncorrelated
18:24
and you would have
18:27
most of the variability in this
18:29
direction right
18:32
so the aim will be
18:38
to find a transformation
18:50
such that the components are
18:54
uncorrelated
19:03
and that and that they are ordered
19:08
by the explained variability
19:28
now how does the math of this look like
19:36
sorry
19:50
mathematically we would start with
19:53
an eigendecomposition of the
19:56
covariance matrix
19:58
um the
20:01
covariance matrix is an object which we
20:03
get from
20:04
or at least this is proportional to this
20:07
object it's the
20:08
data matrix transposed times the data
20:11
matrix itself
20:13
and you will see that this is um
20:16
a matrix of the shape m by m so it has
20:19
the dimensionality of
20:21
our data and
20:24
if you if you do the computations you
20:27
will see that the
20:28
diagonal elements of these objects are
20:31
proportional to the variability of our
20:33
data in this direction and the
20:35
off-diagonal elements
20:37
are proportional to the covariance of
20:39
components and then we do an eigen
20:46
decomposition
20:50
and from this we get m eigenvalues
20:53
lambda
20:53
i and m eigenvectors
20:57
and i group them
20:59
together in a matrix w
21:02
so the columns
21:07
of this matrix are eigenvectors
21:14
of the covariance matrix and importantly
21:18
what we do
21:19
here and we can always do this is we
21:21
sort them by the
21:23
we sort them by the magnitude
21:29
of the eigenvalues lambda i
21:33
and this ensures that the first
21:34
principal component points in the
21:35
direction of highest variability
21:38
now w is just the new uh
21:42
basis for our data and then we can
21:44
transform our data
21:47
so like this the the transformed data
21:51
will look like this we take the data
21:52
matrix
21:53
x times w
21:56
um so in the language of principal
21:58
components this is what's called
22:00
loadings
22:04
or also the rotation
22:09
and this the transformed data is called
22:11
scores
22:20
now i leave it to you as an exercise to
22:22
show that
22:24
the covariance matrix of the transformed data
22:28
is a diagonal matrix
22:33
which basically means in this transformed
22:36
space the components are uncorrelated
22:39
now the data matrix again here is of
22:43
shape
22:43
n by m
22:47
so the dimensionality of this
22:50
is just the same as the dimensionality
22:53
of our original data so how does this
22:56
help in reducing the dimensions
22:58
well there the idea is because the
23:00
eigenvectors are sorted
23:03
by the magnitude of lambda i that we can
23:05
truncate it
23:07
and if we define wr
23:11
which is an m by r matrix as the
23:16
first r columns of w
23:24
then we can get a a projection in lower
23:28
dimensions of our data
23:30
just by computing x
23:34
times wr
23:39
and this will this object will be of
23:40
shape n by r
23:42
and if we for instance select r equals
23:44
two we have a two dimensional
23:46
representation of our data of course
23:47
this
23:48
throws away some information but most of
23:50
the variability of our data is captured
23:59
um yes so so the take home is that that
24:03
is that principal component
24:07
analysis is a linear transformation of
24:09
the data such that the components
24:11
are uncorrelated and ordered by the
24:13
amount of explained variability
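As a compact recap of the construction just described, in the same symbols (assuming the columns of X are mean-centered; the 1/(n-1) prefactor is one common convention for the covariance, which the lecture only requires up to proportionality):

```latex
\begin{aligned}
C &= \tfrac{1}{n-1}\, X^{\top} X \in \mathbb{R}^{m \times m},
  & C\, w_i &= \lambda_i\, w_i, \quad \lambda_1 \ge \lambda_2 \ge \dots \ge \lambda_m \ge 0,\\
W &= (w_1, \dots, w_m), \quad T = X W \;(\text{scores}),
  & T_r &= X W_r, \quad W_r = (w_1, \dots, w_r) \in \mathbb{R}^{m \times r}.
\end{aligned}
```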
24:16
and what i want to do next is i want to
24:20
show you how you can actually do this
24:22
in r which is uh quite easy and you
24:25
don't actually need to know the math for
24:26
this
24:28
um so let me
24:31
um excuse me yes uh could you
24:34
repeat how do we decide uh at what
24:39
at which point to truncate w and what
24:42
exactly do we mean by sorting by
24:43
magnitude of the eigenvalues lambda
24:47
okay um if you
24:54
uh this this is the
24:58
covariance matrix right so essentially
25:00
if i
25:01
if i look at this um i saw you didn't
25:05
do you see my mouse yes yes
25:08
um so lambda 1
25:11
will be proportional to the variability
25:13
in this direction
25:14
okay yes and lambda 2 to the variability
25:18
in this direction
25:21
okay and and so this
25:24
sorting by uh
25:28
by the eigenvalues ensures that the
25:31
first component
25:32
of the first principal component will
25:35
point in this direction of highest
25:36
variability of the data
25:38
right yes
25:42
um so this is what the what the sorting
25:45
means
25:47
uh okay so in this case lambda 1 is
25:51
that the reference that we sort through
25:54
that we get w
25:55
from because lambda 1 is proportional to
25:57
the pc
25:58
one direction um
26:00
26:02
lambda lambda 1 is the eigenvalue right
26:05
that corresponds to the
26:06
to the eigenvector which is in this
26:08
direction yes
26:10
yes and it is actually and if you
26:13
if you uh if you if you do kind of
26:17
um
26:17
26:20
if you look at this at the transformed
26:22
covariance matrix
26:24
right you will see that in the diagonal
26:26
elements of this you just have the uh
26:29
lambdas right so it's the that's the the
26:32
variable the variance in this direction
26:35
okay
26:36
yes thank you exactly so
26:39
was this all i think i remember
26:42
there was another question uh you also
26:44
asked how to truncate how to decide
26:46
uh uh where we truncate
26:50
um this is not
26:53
this is up to us um you can kind of you
26:57
you can
26:57
uh what you can what i will show you in
26:59
a moment is that you can decide that you
27:01
wanna
27:02
keep like uh ninety percent of the
27:05
variability of your data and this would
27:06
give you a threshold of where you
27:08
where you stop for instance but
27:11
uh this threshold again is arbitrary so
27:13
you can also just
27:15
i mean one usual decision that you do is
27:17
if you want to visualize your data you
27:18
truncate at two
27:20
because that's easy to visualize um
27:23
you might also take this as an initial
27:25
step
27:26
of dimensionality reduction to maybe
27:29
just like
27:30
going from half a million to 50 or
27:32
something and then do a non-linear
27:33
transformation
27:35
but essentially like in my everyday work
27:37
this is a uh
27:39
it's an arbitrary it's often an
27:40
arbitrary choice and yeah in the end
27:42
have to see whether it works or not
27:44
whether it's sensible
27:45
so that's a tough question and
27:48
there's also some research still on that
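A minimal sketch of this by hand in R, doing the eigendecomposition described above and using the keep-at-least-90-percent-of-the-variance rule as the (arbitrary) truncation criterion; the iris data is used here only as a stand-in data matrix:

```r
# PCA "by hand": eigendecomposition of the covariance matrix
X <- scale(as.matrix(iris[, 1:4]), center = TRUE, scale = FALSE)  # mean-center the data
C <- cov(X)                        # m x m covariance matrix
eig <- eigen(C)                    # eigenvalues come back sorted, largest first
W <- eig$vectors                   # columns = principal directions (loadings)
scores <- X %*% W                  # transformed data, still n x m

# Keep enough components to explain at least 90% of the variance
explained <- eig$values / sum(eig$values)
r <- which(cumsum(explained) >= 0.90)[1]
W_r <- W[, 1:r, drop = FALSE]      # truncated loading matrix, m x r
X_r <- X %*% W_r                   # n x r low-dimensional representation
```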
27:51
okay um
28:03
all right so i prepared a
28:06
um a
28:10
a document a markdown
28:13
document a notebook in our studio and i
28:15
think stefan briefly introduced our
28:17
studio to do
28:18
to you last week you can also
28:22
get the code on from this url and it's
28:25
also
28:26
by now on the on the course home
28:28
homepage
28:30
um also the code from the last lecture i
28:33
forgot to mention that it's also on
28:34
github
28:37
in the same folder yeah exactly
28:40
and then um well i wanted to show you
28:43
before is that like i i want to just
28:45
show you as an example one of these uh
28:47
famous uh uh data sets you you might
28:51
have heard of the uh it's it's a data
28:52
set where uh
28:53
botanist and statistician fisher
28:56
collected
28:56
uh flowers from a flower called iris
29:01
and there's different species iris
29:04
virginica iris setosa iris versicolor
29:07
and these have different leaves the
29:10
petal and the sepal
29:12
and he measured the the sepal length
29:14
and the sepal width and the petal
29:16
length and the petal width of
29:17
all these species
29:21
and it's a it's a data set that you can
29:23
easily load in in
29:24
r that's why i used it and you can it's
29:27
four dimensional so here you see the
29:29
data it's uh
29:31
the length width and length and width
29:33
for one species
29:34
and then the other species follow
29:36
somewhere in the data
29:38
and you have this four numbers
29:41
now you can try to visualize this uh in
29:44
a
29:44
for instance pairwise scatter plots
29:47
plotting the length
29:48
of the sepal against the sepal width
29:50
and so on
29:51
and here this is color coded by species
29:54
and you already see some separation
29:56
but it's hard to choose like in this
29:58
four-dimensional plot
30:00
what you actually what to actually look
30:02
at and so you can perform a principal
30:04
component analysis on this
30:06
four-dimensional data set and then r
30:08
this is just a one-liner
30:10
so the principal components analysis is
30:12
done by prcomp
30:14
and here i plug in the four-dimensional
30:17
data set
30:20
and then you get all the things out that
30:23
we just talked about you get this uh
30:25
transformation matrix w called rotation
30:28
here
30:29
um where now you see that the
30:32
the principal the direction of principal
30:34
component one in this four dimensional
30:36
space
30:37
right it's it's a vector um you see
30:40
you can look at the transformed data
30:43
um which is again four dimensional
30:46
at first um here you can
30:50
uh look at the explained uh uh
30:53
variability or explain variance
30:55
um and this is maybe for the question of
30:59
where to
31:00
cut off so that they explained
31:03
variance here you can see the cumulative
31:05
proportion of this
31:07
and with the first two principal
31:09
component you explain
31:10
95 or 96 percent of the variability of your data
31:14
and you might say okay
31:16
that's enough and then in pc4
31:19
uh from pc3 to pc4 you hardly add
31:23
something so the fourth dimension of the
31:24
data set is probably very
31:26
it's very correlated to the first three
31:29
there's not much information left and
31:31
then this is how the data looks now
31:34
in in the direction of principal
31:35
components one and two and what you can
31:37
observe is that
31:39
uh across principal component one that's
31:41
the main direction which kind of
31:44
um discriminates between the species
31:47
you can also see that setosa is very
31:50
different from
31:51
versicolor and virginica and here it's
31:54
hard to actually draw a boundary
31:56
but just based on the measurements
32:01
this is what's called a scree plot and
32:03
this is just again a visualization of
32:05
the variance
32:06
explained by each principal component
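A sketch of the prcomp workflow just described; the notebook on the course homepage is the authoritative version, this only approximates it:

```r
# Principal component analysis of the four iris measurements
pca <- prcomp(iris[, 1:4])          # centering is on by default
pca$rotation                        # the transformation matrix W ("rotation")
head(pca$x)                         # the scores (transformed data)
summary(pca)                        # standard deviations and cumulative proportion of variance

# Data in the plane of the first two principal components, colored by species
plot(pca$x[, 1], pca$x[, 2], col = iris$Species, pch = 19,
     xlab = "PC1", ylab = "PC2")

# Scree plot: proportion of variance explained per component
plot(pca$sdev^2 / sum(pca$sdev^2), type = "b",
     xlab = "principal component", ylab = "proportion of variance")
```

Here pca$rotation corresponds to the loadings W and pca$x to the scores T from the blackboard derivation.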
32:10
now i want to also show you an another
32:13
data set
32:14
where actually principal component
32:16
analysis has a hard job
32:18
to work so this is
32:21
again an image data set where images
32:24
were collected of different objects and
32:25
they are rotated
32:27
right
32:31
and now you can again do a vector
32:34
representation of these images to a
32:35
principal component analysis and this is
32:38
what the
32:38
picture looks like and what the plot
32:41
looks like now and
32:42
and the colors now uh represent
32:45
different objects
32:46
and the dots of the same color are then
32:52
pictures of the objects at different
32:54
angles and you can see
32:56
that if you do this dimensionality
32:58
reduction you can see some structure
33:00
right but it's not really uh doesn't
33:03
really work well in discriminating the
33:05
different
33:06
objects and you would expect that maybe
33:09
you can
33:09
reveal this discrete kind of order
33:11
different kinds of objects
33:15
or that's maybe what you want by doing a
33:18
dimensionality reduction
33:23
here's another example that that i made
33:26
up
33:26
which is uh a dimensionality reduction
33:29
of a swiss roll or
33:31
actually a swiss roll with a hole so
33:33
what i did here was to say okay i
33:36
um uh randomly place
33:39
uh points on a on a two-dimensional
33:43
surface
33:43
but there's a hole in the middle and
33:45
then i wrap this up
33:47
in 3d like this um
33:50
i transform it like this okay it's
33:53
it's this roll and it's called
33:56
it's a sweet actually in switzerland
33:58
that's why it's called a swiss roll and
34:00
you can still see this uh
34:02
this uh hole in the middle now what will
34:04
happen
34:05
if you do a principal components
34:07
analysis of this
34:10
the example is shown here and this is
34:13
the same
34:14
data shown in principal component space
34:21
actually in 3d and so
34:24
what you can see is that
34:28
not much happened right it's pretty much
34:30
the same as before
34:31
just stay the principal component and
34:34
the reason for this is
34:35
that what we do is a linear
34:38
transformation of the data right
34:40
but what i did before was to take two
34:42
dimensional data and
34:43
non-linearly
34:47
projected it into 3d i used a
34:50
non-linear function
34:51
and so um projecting this back to 2d
34:54
in a way cannot work with a linear
34:58
dimensionality reduction technique and
34:59
that's why
35:02
non-linear dimensionality reduction
35:04
techniques are also very useful
35:07
so this leads me to the second part of
35:10
dimensionality reduction
35:13
which is about nonlinear dimensionality
35:15
reduction techniques and
35:17
as it's usually the case with non-linear
35:20
things
35:21
there's many of them and then actually
35:23
about one year ago
35:24
i had a look at wikipedia at the
35:28
like a list of some prominent algorithms
35:32
for dimensionality reduction and there
35:34
were 29 i'm sure now
35:36
there's more and this this doesn't claim
35:39
to be complete right so
35:41
um this makes very clear that depending
35:43
on what you want
35:44
there's different things that you can do
35:47
today i will concentrate on uniform
35:49
manifold approximation a rather
35:51
recent one introduced i think in 2018 so
35:54
three years ago
35:58
and it's used a lot in single cell
36:00
sequencing data in single cell genomics
36:03
nowadays
36:05
um so what is this
36:08
technique about so don't be scared
36:11
there's a lot of mathematics on this
36:13
page and you don't have to understand
36:15
all of that right away i won't also i
36:18
also won't have the time to fully
36:20
explain this in detail but i will give
36:22
you an intuition
36:23
of what this does uh
36:26
the basic idea is
36:30
that
36:32
36:36
so the basic idea is that we have a
36:39
higher dimensional data set
36:41
but we hypothesize that there's a low
36:43
dimensional manifold so think of
36:45
for instance there could be
36:47
a line in a two-dimensional manifold
36:50
right and then we want to discover this
36:53
low dimensional
36:54
manifold so this
36:57
there's a couple of assumptions that are
36:59
made in umap one that the data is
37:01
distributed uniformly on a riemannian
37:04
manifold
37:05
that the riemannian metric is locally
37:07
constant and that the manifold is
37:09
locally connected
37:11
and what umap then gives you is
37:14
it models this manifold with a fuzzy
37:16
topological structure
37:18
and it finds a low dimensional
37:20
projection of the data
37:22
that has a very similar fuzzy
37:25
topological structure
37:26
to your original data so what does all
37:29
this mean
white board 1
37:30
um let's let's start
37:33
by i will start by by introducing
37:36
briefly very briefly introducing an an
37:40
important idea from topological data
37:42
analysis because
37:43
uh this algorithm is or this idea is
37:47
heavily based on topological data
37:50
analysis and this is simplices
37:53
a zero simplex is a point one simplex is
37:56
a connection of two zero simplices which
37:58
is a line then there's a two simplex
38:00
which is a triangle
38:02
a three simplex which is a tetrahedron
38:05
and you can see that you can easily
38:07
extend this
38:09
concept to higher dimensions so you
38:12
can introduce a four simplex five
38:13
simplex and so on
38:16
and these objects are rather easy
38:19
to construct and the nice thing that you
38:22
can do is you can glue together such
38:24
objects such simplices
38:26
and by gluing them together
38:30
you can basically approximate um
38:33
manifolds so um you might all have
38:37
seen already some point where you
38:39
uh triangulate the surface right and
38:41
this is nothing
38:42
but approximating this manifold with
38:45
simplices
38:51
now let's look at this example which i
38:54
took from the umap homepage actually
38:56
where you can see points which are
38:59
uniformly sampled actually equidistantly
39:02
sampled i think
39:03
on a not equidistant but uniformly
39:06
sampled on a
39:10
on a line
39:14
in a two-dimensional space right so we
39:16
have a one-dimensional manifold
39:18
in a two-dimensional space and
39:22
uh um so this would be could be our data
39:25
points that were measured
39:26
right and we would be interested in
39:28
finding the topological
39:30
structure of this manifold um so what
39:33
has been done
39:34
here is to draw some spheres around
39:36
these dots
39:38
and the union of all these spheres i
39:40
mean all these spheres are sets
39:42
and the union of all this is what's
39:44
called an open cover
39:45
open cover is just the thing that uh is
39:49
this is a um is a collection of sets
39:52
that span the whole manifold right so
39:54
the line
39:55
that we see here is inside this open
39:57
cover
39:58
and now you can do some heavy
40:00
topological and mathematics
40:02
and you can then prove that if you now
40:05
construct
40:05
uh simplices from all
40:09
groups of sets that have a
40:12
non empty intersection which in this
40:14
case basically would mean
40:15
you construct a one simplex here line
40:18
here
40:19
and one simplex here and so on you i
40:21
think you get the picture
40:23
then if you do this and this would look
40:25
like
40:26
no um then this
40:29
structure this graph-like structure that
40:32
you get out of
40:33
this captures the topology of the
40:35
manifold
40:37
the information is encoded in there and
40:38
i think in this uh
40:40
two-dimensional example this this is uh
40:44
quite intuitive that the connection of
40:46
all these points
40:51
tells you about uh tells you that this
40:54
is a connected line
40:56
and there's no loops for instance
41:01
um so that's pretty cool and now we
41:04
would like to use this concept to
41:06
uh to learn something about our high
41:09
dimensional data
41:12
is the criterion for the connection
41:14
always like the
41:15
shortest distance between the points
41:18
because i mean here i could
41:20
see it but if i have more complex
41:22
structure
41:23
exactly exactly i'm actually coming to
41:25
this point in a second in this case
41:27
it was a bit like okay let's just guess
41:30
uh the size of the sphere
41:31
make it a good guess such that it works
41:33
right but if you have high dimensional
41:35
data
41:36
what would be the size of this sphere
41:37
that would be super unclear
41:39
so um i'll come back to this question in
41:42
a moment
41:46
it's also connected to a situation with
41:48
like this
41:49
right because um
41:52
our points might not actually be
41:54
uniformly distributed
41:55
okay so um if we would do the same thing
41:59
here
42:00
drawing these circles and
42:04
uh construct the simplices out of that
42:07
we would actually get disconnected parts
42:10
okay
42:11
and this is not what we would
42:13
intuitively like to get right
42:15
because here we
42:19
we saw this line um
42:24
now and this is also this kind of breaks
42:26
the assumption
42:28
of umap which is the uniform
42:30
distribution of uh
42:32
of data points on the manifold so
42:35
how can we circumvent this or how can we
42:37
solve this problem we can
42:38
kind of turn the problem upside down and
42:41
say
42:42
no these data points are uniformly
42:46
distributed we see different distances
42:50
if we apply a constant metric so this
42:52
has to be wrong the
42:53
metric is not constant actually we
42:56
could
42:57
try to construct the metric like this
42:58
and this comes back to your question
43:01
we might say that the distance in this
43:02
case to the fifth
43:04
neighbor should always on average give
43:06
us a distance of five
43:08
right such that we say um distances
43:10
between neighboring data points should
43:12
on average be one
43:15
um and this would mean that that a
43:18
distance of
43:19
five for instance for this uh data point
43:22
would be rather large
43:23
and a distance of 5 for this data
43:26
point would be rather small
43:28
so we define
43:34
we now use nearest neighbors k nearest
43:37
neighbors
43:38
to define a local metric
43:42
and we then turn this local metric
43:47
again and i'm skipping a lot of details
43:49
which are important here into a
43:53
construction of simplices into a
43:54
construction of graphs where the
43:56
edges now are weighted and the reason
44:00
that they are weighted now is that the
44:02
uh the distances between
44:04
uh points are different and also that uh
44:06
we actually have two measures of
44:09
um of distance one
44:12
using the local metric from here to
44:15
there and one using the local metric
44:16
from here to there because they might
44:18
not be the same
44:22
um but if you properly do the math
44:25
behind all that
44:26
you can show that this construction now
44:29
um captures essential parts of the
44:32
what's now called the fuzzy topology
44:35
of this manifold okay so we have a graph
44:39
representation now
44:41
that captures a topological structure
44:44
of our of the manifold we're interested
44:47
in
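For reference, the local, k-nearest-neighbour-based weights sketched here are usually written as follows in UMAP; details such as the exact choice of sigma_i are glossed over in the lecture:

```latex
\begin{aligned}
\rho_i &= \min_{j \in \mathrm{knn}(i)} d(x_i, x_j), &
w_{j \mid i} &= \exp\!\left( - \frac{\max\{0,\, d(x_i, x_j) - \rho_i\}}{\sigma_i} \right),\\
&\text{with } \sigma_i \text{ such that } \sum_{j \in \mathrm{knn}(i)} w_{j \mid i} = \log_2 k, &
w_{ij} &= w_{j \mid i} + w_{i \mid j} - w_{j \mid i}\, w_{i \mid j}.
\end{aligned}
```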
44:49
um how can we do that how can we use
44:50
this now to find low dimensional
44:52
representation
44:54
well the idea is that now you start
44:57
um you have this this graph defined and
45:01
now you start
45:02
with a low dimensional representation
45:05
you just place your points randomly
45:07
basically
45:08
and then you construct a similar
45:12
graph representation of your low
45:14
dimensional representation and then you
45:15
shuffle your points around
45:17
until the the topology the graph
45:20
representations kind of match
45:26
so yes you have a graph representation
45:30
of your high dimensional
45:31
data and of your low dimensional data you
45:33
interpret the edge weights
45:35
as probabilities and then you minimize
45:38
the cross
45:39
entropy given here where basically
45:42
the important thing is that you try to
45:44
match the
45:46
edge weights in your higher dimensional
45:48
representation with the edge weights and
45:50
the low dimensional representation like
45:52
when this
45:53
ratio gets one the term vanishes and the
45:57
same happens over here
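Written out, the objective just described is the fuzzy-set cross-entropy between the high-dimensional edge weights v_ij and the low-dimensional edge weights w_ij (standard UMAP notation):

```latex
C(V, W) \;=\; \sum_{i \neq j} \left[ \, v_{ij} \log \frac{v_{ij}}{w_{ij}}
\;+\; \left( 1 - v_{ij} \right) \log \frac{1 - v_{ij}}{1 - w_{ij}} \, \right]
```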
46:01
now how does this work in practice like
46:05
if you want to read more about this
46:06
there's a nice home page
46:11
i just wanted to show an animation here
46:15
which is called understanding umap there
46:18
for instance here you have an example
46:20
in 3d you have um dots
46:23
which run out from a center in a star
46:26
like fashion right and that's a 3d
46:28
represent this is a three-dimensional
46:30
data
46:31
and well actually it's
46:34
10 dimensional data in this case i think
46:37
and this is a three-dimensional
46:38
projection maybe
46:40
and now if you run umap
46:44
these dots are shuffled around randomly
46:47
and you stop
46:48
when you have a topological
46:50
representation which is very similar and
46:51
in this case this is also
46:53
this star-shaped pattern but
46:56
now in two dimensions
46:58
and if you want there's many more
47:00
examples here and you can
47:02
play around with this
47:09
um you can of course do the same thing
47:12
in
47:12
r um i i don't know
47:16
like i know this was a lot of
47:18
information and i've
47:19
i didn't go into the details are there
47:21
questions about umap at this point
47:28
otherwise i just show you how to do this
47:31
in r
47:35
it's essentially again a one-liner
47:38
where you have a function umap and what
47:41
i feed in here
47:43
is the swiss roll that i created
47:45
previously
47:47
okay and then um
47:50
i just i i do the um
47:54
dimensionality reduction here with a one
47:56
liner
47:57
then i convert the whole thing in a data
47:59
table object
48:00
to uh to have a nice plotting
48:03
and this is what the result looks like
48:05
now in umap coordinates
48:08
the the swiss roll was kind of unrolled
48:11
but it
48:12
heavily lost its shape but what you can
48:15
see is that the topology is preserved
48:17
right
48:18
um so so the hole in the center is still
48:21
there
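A minimal sketch of this Swiss-roll-with-a-hole example in R, assuming the uwot package for the umap call; the original notebook may use a different package and different parameter values:

```r
library(uwot)   # install.packages("uwot") for the umap() function used here

# Swiss roll with a hole: points on a 2D strip, with a disc removed, rolled up into 3D
set.seed(42)
n  <- 5000
uv <- cbind(runif(n, 0, 10), runif(n, 0, 10))
uv <- uv[sqrt((uv[, 1] - 5)^2 + (uv[, 2] - 5)^2) > 2, ]   # punch out the hole
swiss <- cbind(uv[, 1] * cos(uv[, 1]),    # non-linear map of the 2D strip into 3D
               uv[, 2],
               uv[, 1] * sin(uv[, 1]))

# Non-linear dimensionality reduction back down to 2D
emb <- umap(swiss, n_neighbors = 15, min_dist = 0.1)
plot(emb[, 1], emb[, 2], pch = ".", xlab = "UMAP 1", ylab = "UMAP 2")
```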
48:25
um i also want to show you a umap
48:29
representation of the coil 20 data that
48:31
i showed you above so remember this was
48:33
the data set
48:36
where we had pictures which are rotated
48:42
and where the the pca didn't look very
48:46
informative
48:48
now if we do the same thing with a umap
48:50
and we feed in these pictures this is
48:52
what we get
48:54
so most of the images nicely separate
48:57
right different colors mean different
48:59
objects
49:01
they nicely separate and what the umap
49:03
also gives us
49:04
is this kind is is the topology of this
49:07
data in the sense of that we can see
49:09
that this is a rotation
49:10
right so neighboring
49:14
pictures are similar and you
49:18
rotate the object until you come back to
49:20
the other side
49:27
okay i think yes that's it for um
49:30
49:31
dimensionality reduction just one
49:35
example
49:36
for linear dimensionality reduction
49:37
which was principal components
49:39
and one for non-linear topology
49:42
preserving dimensionality reduction
49:44
which was umap and i now wanted to talk
49:47
a little bit about clustering
49:50
um so
49:53
you can as you already saw in some of
49:56
the
49:56
reduced dimensionality in in the some of
49:59
the
50:00
representations of the data and reduced
50:01
dimensions uh clusters can appear in
50:04
your data which
50:05
essentially mean you have qualitatively
50:08
different
50:09
objects in your data different different
50:11
samples and
50:15
you sometimes you you
50:18
and and it's good to have a way to group
50:22
different kinds of different classes of
50:24
objects together
50:26
and clustering is nothing
50:29
which is um there is no one definition
50:33
of clustering
50:34
the idea is to find clusters such that
50:37
the data within clusters are similar
50:40
while data from different clusters are
50:42
dissimilar
50:43
but what you define a similarity or
50:46
dissimilarity that's
50:48
basically up to you and so a lot of
50:50
different clustering algorithms exist
50:54
and i just wanted to introduce three of
50:56
them today
50:58
um and you use this a lot in or at least
51:01
i use this a lot in trying to understand
51:03
high dimensional data
51:05
i'm just going to share my
51:10
ipad again
51:27
and i wanted to show you first i want to
51:29
give you three examples and the first
51:30
one would be hierarchical clustering
51:32
which you
51:33
which uh
51:36
where here i just plotted some
51:40
some data in two-dimensional space to
51:42
illustrate the concept of clustering
51:44
and what you need for hierarchical
51:47
clustering
51:48
is first of all a metric
51:54
which defines distance between your
51:57
objects
51:57
that you have here and this could just
52:00
be for instance the euclidean metric or
52:02
manhattan metric or a correlation
52:04
metric you're basically free to choose
52:06
and then what you do
52:08
is you um you group together
52:12
the two objects which are closest
52:15
together by our metric so with the
52:16
euclidean metric for this example
52:18
this would be those two guys right
52:30
so
52:35
what you then create is a kind of tree
52:37
where where um
52:39
okay as i said we group
52:41
together those two guys so that's the
52:42
first thing to do
52:48
um and now next we need another thing
52:52
which is called the linkage criterion
53:03
um and this tells you how to now treat
53:06
these clusters because
53:08
uh now you need a way to uh
53:11
to give get the distance between the
53:14
clusters
53:21
and this could for instance be the
53:23
minimum distance
53:27
so this would mean if you want to now
53:29
get the
53:30
the distance between uh this cluster
53:33
and uh this object you would
53:36
take the distance between b and d and c
53:39
and d
53:39
and take the minimum distance or you
53:43
could also
53:44
for instance use the centroid there's
53:46
many of the clusters there's many
53:48
choices right
53:51
um now if we kind of go on we maybe take
53:54
the centroid here then the next cluster
53:56
would probably be a grouping of d
54:00
and e
54:06
then i'd say i would add
54:09
those three in one cluster
54:18
then group all these b c d
54:21
e and f until finally we arrive
54:25
at one at an object which
54:29
contains all the elements
54:35
and now if we want to define clusters in
54:38
our data we
54:39
need to define a precision threshold
54:42
along this axis
54:46
and we could for instance put it here
54:53
um and then
55:01
this would cut the tree in this
55:05
in in separate subtrees and we would
55:08
have a
55:08
subtree here a subtree here and
55:11
a subtree here and then this would be
55:12
our three different clusters so we would
55:14
have a cluster one
55:15
cluster two and a cluster three and if
55:18
we uh
55:20
increase the precision like up here then
55:22
we would get more and more clusters
55:28
and as this basically
55:31
just needs some uh
55:34
distance computations um it's it's this
55:37
is rather easily done in in
55:39
uh also in higher dimensions
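A minimal sketch of hierarchical clustering in R, with Euclidean distance as the metric and complete linkage as the linkage criterion (both arbitrary choices), again using iris as a stand-in data set:

```r
# Hierarchical clustering: needs a metric and a linkage criterion
d  <- dist(iris[, 1:4], method = "euclidean")   # the metric
hc <- hclust(d, method = "complete")            # the linkage criterion
plot(hc)                                        # the tree (dendrogram)

# Cut the tree: either ask for a number of clusters ...
clusters_k <- cutree(hc, k = 3)
# ... or set the precision threshold as a height along the distance axis
clusters_h <- cutree(hc, h = 3)
table(clusters_k, iris$Species)
```

In hclust, method = "single" corresponds to the minimum-distance linkage mentioned above and method = "centroid" to centroid linkage.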
55:46
um
55:48
next i wanted to show you
55:58
um another important one which is a
56:02
centroid-based clustering called k-means
56:04
clustering and you
56:07
one comes across this quite often as
56:09
well when working with high-dimensional
56:10
data
56:11
so here in this case
56:17
um you pre-specify the number of
56:19
clusters k
56:20
so that's what you have to start with
56:23
how many clusters do you want to get
56:26
you might try out different k's in
56:28
practice
56:29
and then you partition your data such
56:31
that the square distance to the cluster
56:34
mean position so that that would be the
56:37
centroid is minimal
56:41
um so how would this look like
56:45
um actually maybe let's look at the
56:49
final result
56:50
first we're kind of uh here you see an
56:53
algorithm converging but what you try to
56:55
do is to
56:55
talk to do a grouping of the data such
56:59
that oh no it doesn't stop there
57:06
sorry so so how do you how do you get
57:08
this partitioning of your data
57:11
one way would be the lloyd algorithm you
57:13
start with random centers that you place
57:15
somewhere so in this case
57:17
this is an example taking from wikipedia
57:20
there's a center here a center there in
57:22
the center there and then
57:23
for each of your data points you assign
57:26
the data point to the cluster
57:28
where the center is nearest right so the
57:30
yellow center is here so everything here
57:32
would be assigned to the yellow
57:34
cluster now once you did this you
57:39
update the centers of your clusters
57:42
now you take the centroid of this yellow
57:45
cluster for instance
57:46
and the yellow center ends up here
57:50
and now you do a new assignment of your
57:54
data points to the nearest center in
57:57
this case all the
57:58
yellow ones up here would be assigned to
58:01
the blue center actually
58:03
so they are assigned to the blue and
58:04
then you iteratively go on and do this
58:07
go on and go and go on until this
58:10
algorithm converges and you end up
58:12
having three clusters defined by this
58:16
criterion of course there's other more
58:17
complicated more sophisticated
58:19
algorithms but this approximates the
58:20
solution
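The corresponding one-liner in R; note that kmeans defaults to the Hartigan-Wong algorithm, so the Lloyd algorithm described above is requested explicitly here, and k = 3 is an arbitrary choice:

```r
# k-means with k = 3 clusters, using the Lloyd algorithm described above
km <- kmeans(iris[, 1:4], centers = 3, algorithm = "Lloyd",
             iter.max = 100, nstart = 10)
km$centers                       # the final cluster centroids
table(km$cluster, iris$Species)  # compare clusters with the known species

# Trying out different k and comparing the total within-cluster sum of squares
wss <- sapply(1:8, function(k) kmeans(iris[, 1:4], centers = k, nstart = 10)$tot.withinss)
plot(1:8, wss, type = "b", xlab = "k", ylab = "total within-cluster sum of squares")
```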
58:27
a third algorithm that i wanted to show
58:29
you and i want to show you this because
58:31
i
58:31
work with it every day comes from social
58:34
systems which is a graph based
58:36
clustering
58:38
when you start with your high
58:39
dimensional data you first need a graph
58:41
representation of your data and we
58:43
already saw this in you
58:44
in the u-map example you could for
58:47
instance
58:48
do a k-nearest neighbor network on your
58:50
data and then
58:51
you partition your data such that the
58:53
modularity
58:54
queue is optimized um
58:59
modularity in this case is defined as
59:02
this object where where e i j are the
59:05
fraction of
59:06
links between cluster i and j
59:09
right and what you try to do then in
59:12
this object is that
59:14
the links between
59:17
cluster i and i which is the internal
59:19
links this
59:20
it's this eii right this is it's the
59:22
links inside the cluster
59:24
you try to maximize this
59:27
and uh you in in this ai term you have
59:30
the
59:30
e i j from cluster i to another cluster
59:34
so this is links out out of the cluster
59:36
you minimize this right so
59:38
what a cluster
59:42
clusters means a cluster in this case
59:45
means that it's
59:46
it's a
59:49
parts of your graph which are densely
59:51
connected inside but only sparsely
59:53
connected to the outside
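In symbols, with e_ij the fraction of edges running between cluster i and cluster j, the modularity being optimized is the standard Newman-Girvan form:

```latex
Q \;=\; \sum_{i} \left( e_{ii} - a_i^{2} \right), \qquad a_i = \sum_{j} e_{ij}
```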
59:56
there's a particular algorithm which is
59:59
used a lot nowadays which is the louvain
60:01
algorithm so how do you solve this
60:03
because this is not
60:05
uh you can't solve this by brute force
60:08
um you start with a network like this
60:12
for instance uh this would be your
60:15
nearest neighbor graph that you created
60:17
from your data and then you do copy
60:18
attempts you
60:19
start with node zero so all these now
60:23
you start with all these being separate
60:25
clusters
60:26
and then you copy cluster identity zero
60:29
to neighboring nodes
60:30
and you again compute the modularity and
60:32
if the modularity is increased you
60:34
accept this copy step
60:36
and if it's decreased you uh you reject
60:39
the copy step and you
60:40
do so um until you arrive
60:45
at a steady state like this where copy
60:47
attempts
60:49
don't increase your modularity anymore
60:52
would be in this case something like
60:54
that where you end up with four clusters
60:56
and then you do an aggregation and this
60:59
means
61:00
now that you
61:03
define a new graph representation of
61:05
your data where this
61:07
cluster here the blue cluster is now a
61:09
single node
61:12
it gets four internal connections
61:14
because there's
61:15
one two three four internal connections
61:19
and there is
61:20
four connections to cluster green
61:24
one two three four
61:28
and one connection to the light blue
61:30
cluster
61:31
and so you aggregate this you do the
61:33
copy attempt again
61:35
and so on and so on until you finally
61:38
arrive at a picture where you
61:40
where you um at a state where
61:44
you can't optimize modularity anymore
61:46
and this is um
61:48
how you this is how the algorithm works
61:50
to solve the problem
61:52
of modularity optimization and this
61:54
algorithm
61:56
works very well with many kinds of data
61:58
again for instance it's
62:00
used a lot to identify
62:03
cell types in a single cell genomics
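A minimal sketch of Louvain clustering in R with the igraph package; Zachary's karate-club network is used as a stand-in graph here, whereas for real data one would first build a k-nearest-neighbour graph from the data matrix:

```r
library(igraph)

# Stand-in graph: Zachary's karate club network
# (for real data, build a k-nearest-neighbour graph from the data matrix first)
g <- make_graph("Zachary")

# Louvain community detection: modularity optimization by copy attempts + aggregation
comm <- cluster_louvain(g)
membership(comm)     # cluster label for every node
modularity(comm)     # the optimized modularity Q
plot(comm, g)        # graph colored by cluster
```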
62:12
all right that's it already i realize
62:14
i've been
62:15
quite fast we would have some time but
62:19
i'm i'm sorry
62:22
good timing one hour um
62:27
so just as a summary um the question
62:30
i wanted to talk about is how do you
62:32
explore high dimensional data
62:34
to identify order in this high
62:36
dimensional data set and i showed you
62:39
a couple of dimensionality reduction
62:41
techniques to visualize your
62:42
high dimensional data
62:44
and that allow you to identify a small
62:46
set of
62:47
observables to describe the data um
62:51
i showed you a couple of approaches for
62:53
clustering to this
62:54
to identify discrete order in high
62:56
dimensional data and
62:58
i think one important point that i want
63:00
to make is the methods to use to work
63:03
with high dimensional data
63:05
depend on the problem at hand so really
63:08
depends on what you're looking for and
63:09
often it's just
63:10
trying out a lot of things to see what
63:12
makes sense
63:14
to understand your data and with this um
63:17
i'm at the end and
63:20
there's some references in the back
63:22
which i will upload uh
63:24
with the slides and with this yes i'm
63:26
i'm happy to take questions
63:29
okay perfect uh thanks a lot fabian yeah
63:32
thanks for taking the time and
63:35
especially since
63:36
uh data people like you are so in such
63:39
high demand these days and very busy so
63:41
it's great that you took the time to
63:43
to show us some of your stuff and
63:46
so uh i think what we usually do is that
63:49
we hang around a little bit
63:51
yeah on zoom and then if you have any
63:53
questions you just stay online
63:56
and ask your questions
63:59
yeah and uh other than uh that
64:02
yeah so then see you all next week and
64:05
there's already a question in the chat
64:13
there were questions in the chat during
64:15
the lecture and i didn't
64:16
see the chat so yes you can't see that
64:18
very well if you're sharing the screen
64:20
yes but i think they were answered
64:22
already by other people
64:24
so so now the latest question is how
64:26
computationally intensive is umap
64:29
that's that's a good question and it's
64:31
actually that's another uh
64:33
advantage of umap it's pretty fast
64:38
um it's
64:40
it also it's you don't you don't need a
64:42
lot of memory
64:43
um it scales very well with the
64:46
dimension of your data and also the
64:47
amount of data points so i think
64:49
it's it's um i mean just like
64:56
it's one of the fastest things around at
64:58
the moment um
65:01
i mean if you look at the typical data
65:02
sets uh that that you're looking at
65:05
where you have i mean 15 thousand
65:07
dimensions
65:08
and then maybe a few hundred thousand or
65:11
ten thousands of uh samples yeah you you
65:14
end up
65:15
with a few seconds yes yes on my machine
65:18
like the with the largest data sets i
65:20
have which are
65:21
of the order that you described um it
65:24
takes about
65:25
30 seconds maybe to compute the uh
65:29
compute the u map which is much faster
65:32
for instance there's
65:33
t-sne i don't know whether you know
65:36
about it but this was used a lot before
65:38
and
65:38
this is much slower usually
65:41
and actually there's um i didn't try
65:44
this out but there's also gpu
65:46
of implementations of umap nowadays and
65:48
this again
65:50
you can easily you can nicely
65:51
parallelize it and so this should be
65:53
much faster again
65:59
you're welcome
66:04
great thanks so are there any other
66:07
questions
66:08
um i have a question it's from
66:12
umap
66:15
so when we uh in all the examples of
66:18
umap we saw approximating a high
66:20
dimensional data
66:21
is there any isn't there any parameter
66:23
in the algorithm that quantifies
66:25
the uh
66:28
the acceptability of the approximation
66:30
whether the approximation is cur
66:32
how how much is it acceptable or not
66:34
because
66:35
or will it approximate any uh high
66:37
dimensional data
66:40
there didn't seem to be any parameter
66:41
that quantified or helped us to
66:44
understand if the result of the umap
66:47
approximation is
66:48
good or not uh obviously concerning the
66:51
problem at hand
66:53
i'm not fully sure whether i completely
66:56
understand the question but here are a
66:57
few thoughts on this um
67:02
so the approximation in umap is that you
67:05
you approximate
67:07
that um
67:10
your your approximation is that the data
67:12
is uniformly distributed
67:14
on the manifold right and the question
67:17
is
67:18
and maybe the question is is this a good
67:19
approximation or is it not a good
67:21
approximation
67:22
the problem is that we that we hardly
67:25
know the real metric
67:27
of the underlying space so
67:30
if you for instance think of these
67:33
examples with
67:34
pictures right what is a distance
67:36
between two pictures
67:37
what is this metric um
67:41
it's it's um i i think this is the you
67:45
you can't actually like
67:46
unless you have a unless you have a good
67:49
microscopic model
67:50
which in this case we don't have it's uh
67:53
not really possible to
67:55
to define a good metric and therefore to
67:58
judge whether whether the
68:00
whether the approximation is a good one
68:02
or not
68:04
um it's not
68:08
um yes i mean the way here that you you
68:12
do it is
68:13
there's no there's no uh right rigorous
68:16
way of doing that and that's why people
68:17
like fabian exist
68:18
who have experience and have consistency
68:21
checks
68:23
that with some experience can check
68:25
whether a umap
68:26
makes sense or not yeah or whether it's
68:29
so one of the standard problem is that a
68:31
umap can be dominated by noise by
68:33
technical artifacts for example
68:35
yeah and then you need to have some
68:37
insights into data and some experience
68:40
that tell you whether uh whether uh
68:43
uh what you see actually represents
68:46
something that's real in the data
68:48
or whether that's something just just an
68:50
artifact you know so that's
68:52
uh that's basically one of the the jobs
68:55
that bioinformaticians or data science
68:58
scientists do
68:59
because there's just no rigorous way
69:02
where you just push a button and it
69:03
tells you whether
69:04
whether what you've done is is right or
69:06
not so
69:07
it needs some experience and some some
69:09
insights and understanding
69:11
of the data and for example the
69:14
biology
69:15
uh that is underlying the the data
69:18
so typically with the typical um
69:22
method would be to use multiple
69:24
reduction methods and see which one is
69:26
giving us
69:27
an order which is much more perceptible
69:29
in our in our plots
69:31
you know i guess mean you also have to
69:34
to know the strengths and weaknesses of
69:36
each method
69:38
yeah for example what uh fabian told
69:41
about
69:41
previous uh before t-sne that had a
69:45
tendency
69:45
uh to perform to to give you clusters of
69:49
data points
69:50
yeah it was not very good at
69:52
representing global structures of data
69:54
sets
69:55
yeah and once you know that you can
69:57
interpret
69:58
interpret what you see and then you can
70:00
be careful
70:01
in interpreting uh basically clusters as
70:05
real physical or biological objects
70:09
yeah umap the one that
70:12
you've seen now
70:12
that overcame this limitation that it's
70:15
basically
70:16
also represented the global structure of
70:18
data quite well
70:20
but i guess also in umap you can have
70:22
artifacts
70:23
especially if your data has noise yeah
70:25
technical noise
70:27
you can have very weird artifacts so you
70:30
need to treat the data
70:32
before you can give it to umap in in
70:34
typical instances
70:36
yeah so for example log you have to log
70:38
transform data very often
70:40
if your data spans huge orders
70:44
of magnitude
70:46
in quant in in in values yeah then then
70:49
sometimes you do the log
70:50
transformation so that the
70:52
the very few very large data points don't
70:54
dominate everything
70:57
and so you have to treat the data in a
70:59
way to make it
71:00
say digestible for these methods
71:04
but then maybe i mean
71:07
the the other part of the question was
71:09
in in a way parameters and i didn't talk
71:10
about parameters at all and there are a
71:12
couple of parameters in the algorithm
71:14
um and i i skipped a lot of details but
71:17
the most important parameter
71:19
is the number of neighbors you look at
71:22
in your
71:24
when you construct the graph if you take
71:26
the first neighbor right you you
71:28
kind of um what you do is you
71:32
um estimate your metric very locally
71:36
um so so this is very this is very fine
71:38
but it's also very noisy so you often
71:40
will estimate the wrong metric if you
71:42
if you uh increase the number of
71:45
neighbors in there
71:46
um you will get a more robust estimate
71:50
of the local metric but you will
71:52
miss some fine structure of the data and
71:54
maybe actually there's a there's one
71:55
example that i can show
71:57
uh from here where if you look at
72:01
i don't know whether you can see this
72:02
here this is a
72:04
um maybe you can can you see this or is
72:07
this too small
72:08
i'm sorry you're not sharing the screen
72:12
ah sorry i don't share the screen so you
72:14
can't see anything
72:16
um let me show
72:18
72:20
let me show this example um
72:22
72:27
kind of if you that's maybe maybe a
72:32
actually a good example for for your
72:33
question where kind of okay here we have
72:35
a circle
72:37
with uh with dots and they on um
72:41
and then like like the distance between
72:43
the dots is not always the same right
72:46
um and so actually if we
72:49
take just this example and we say okay
72:51
here we know the euclidean metric
72:54
uh we can think of how how well does the
72:56
approximation now
72:58
or or how does umap behave if um
73:03
uh in this case right and
73:06
one thing that you can see here is that
73:08
if i increase the number of neighbors
73:13
that this will
73:16
usually lead to a
73:19
actually not
73:25
so you can see that sometimes you manage
73:28
to
73:28
for for some uh there is kind of an
73:31
optimum
73:33
um for looking at the number of
73:35
neighbors which allows you to actually
73:37
capture
73:39
um the the
73:42
circular topology the hole in the middle
73:44
right but
73:46
if you if like if you go too big you'd
73:48
kind of lose
73:49
to start to lose the information about
73:52
the order of these dots
73:54
of these uh measurements right they
73:57
start to get more and more noisy
73:59
if you look at less neighbors
74:03
you get nice ordered lines but you start
74:06
to miss
74:08
things where you have these gaps and you
74:10
start to create clusters
74:11
where there shouldn't be clusters so
74:14
this is the
74:15
most important
74:17
74:19
parameter of the of the method
74:22
i'd say there is like if you if you uh
74:27
the umap home page the like from the
74:29
umap algorithm
74:31
they uh feature a nice section of what
74:33
parameters do you have for your map and
74:36
and uh
74:37
what influence do they have but it's
74:40
it is generally relatively robust also
74:44
compared to other methods
74:45
if you go for typical like
74:49
it just but this is just empirical kind
74:51
of if you go with
74:52
about 15 to 30 neighbors
74:56
or something you usually derive good
74:59
representations
75:00
of the topology of your data if you have
75:02
enough data points
75:04
um so it's relatively robust uh to the
75:08
choice of
75:08
of parameters you don't have to play
75:10
around with it too much
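To get a feeling for this parameter, one can simply rerun the embedding for a few values of n_neighbors; this is a sketch assuming the uwot package, with a noisy circle generated on the fly in place of the interactive example shown on the page:

```r
library(uwot)

# A noisy circle with unevenly spaced points, embedded for several n_neighbors values
set.seed(7)
theta  <- sort(runif(300, 0, 2 * pi))
circle <- cbind(cos(theta), sin(theta)) + matrix(rnorm(600, sd = 0.02), ncol = 2)

par(mfrow = c(1, 3))
for (k in c(5, 15, 50)) {
  emb <- umap(circle, n_neighbors = k, min_dist = 0.1)
  plot(emb, pch = 19, cex = 0.4, main = paste("n_neighbors =", k),
       xlab = "UMAP 1", ylab = "UMAP 2")
}
```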
75:14
thank you
75:19
okay perfect very nice do we have any
75:21
other questions
75:22
yeah there's a question from kai muller
75:25
from the chat which i will read
75:26
which is uh he asked for high
75:28
dimensional data how does
75:30
umap determine the manifold dimension
75:36
um
75:38
i don't think that it does
75:41
and
75:47
i don't think that it does and this
75:51
might be related to that it only
75:54
in the u-map algorithm what you do is
75:56
when you when you construct
75:59
this uh the graph representation
76:02
um from the intersecting
76:06
spheres you only look at the
76:10
one simplices so you only construct a
76:12
graph and you don't look at
76:15
76:18
at higher order simplices i'm not sure
76:20
what this could be used to
76:22
to say something about the
76:23
dimensionality
76:26
um of your manifold i'm i'm unsure there
76:31
but you might by default it doesn't tell
76:33
you the dimensionality
76:40
okay um any other question a lot of
76:44
questions today
76:55
okay so if there are no more questions
76:57
you know then uh
76:58
let's end the meeting you know thanks
77:00
all for joining and thanks again for
77:02
fabian thank you well thank you for
77:05
having me
77:06
yeah thank you okay perfect
77:10
yeah see you all next week bye
77:16