This lecture was given by a guest speaker and no slides are posted on the official course page, so there are no slide-frame markers this time. Instead, notes were taken during the lecture; see the notes section.

[00:07] Stefan: Okay, let's begin — welcome to our lecture again. Today we have a special guest, Fabian Rost, whom you see in the other window. As you remember, before Christmas we introduced theoretical-physics concepts that explain how order emerges in non-equilibrium systems, and last week I started introducing some basic tools from data science, namely the tools we can use to deal computationally with large amounts of data. For today I need some special help from somebody who is a real expert: Fabian is a trained physicist, did a PhD in mathematical biology, then a postdoc in my group at the Max Planck Institute in Dresden, and is now a professional data scientist and bioinformatician at the Center for Regenerative Therapies in Dresden. So he is actually talking about something he does in his professional job, which is very nice. Over to you, Fabian — thanks a lot for sharing your expertise.

[01:33] Fabian: Thanks for having me and for giving me this opportunity to give a lecture on exploring high-dimensional data. Stefan introduced me very well, thank you, and also made the point of where we are in the course. Just one remark: whenever you want to ask something during the lecture, simply unmute yourself and ask — that is totally fine with me — or type something in the chat and I will try to answer. When you are not saying anything, please mute yourself to reduce noise.

[02:18] All right, I'm going to start right away. Today will be about exploring high-dimensional data, and as many of you know, strange things can happen in high dimensions. If you look at this animation, for instance, you see a three-dimensional projection of a four-dimensional object rotating, and it looks very funny. Four dimensions is where we can still sort of get a glimpse — but what if it gets to thousands or millions of dimensions? What do we do with that kind of data?

[02:59] Before I start, let me give you a brief structure of today's lecture. (I might have to click into the window — no, I actually want to share my iPad screen; sorry for the delay.) I will start by introducing what I mean by high-dimensional data. I will then cover two dimensionality-reduction methods and also talk a bit about the idea behind reducing dimensions. Third, if we still have time, I will talk a little bit about clustering, which we can think of as a discrete form of dimensionality reduction.

[04:13] So what do I mean by high-dimensional data?
[04:18] Let's start with an object that I call a data matrix X, of shape n × m. Maybe you remember the tidy data format from last week: this matrix should already be tidy, so in the rows we have n samples and in the columns we have m measurements, or observables.

[05:08] What do I mean by that? An example: you take yourself and measure, for instance, the length of your thumb, the length of your index finger, the length of your middle finger, and so on; you do the same with the other hand; you measure your foot size, your body height, maybe your heart rate, and so on — you end up gathering more and more observables. Those are the m observables, and they give the dimensionality of our data. Now you can do this with yourself, you can ask your friends to do it with them, and then maybe you do it with all people in Asia or all over the world — and that gives you the number of samples n.

[06:00] Now, how do you treat this kind of data? If we go one step back and think about visualization — Stefan uploaded some slides on that, and maybe we will also talk about it a little — what can we actually visualize easily? If you have one-dimensional data, that is rather easy: you can just plot your data on a one-dimensional line. Two dimensions are also pretty straightforward: if these are the entries of our data matrix, x1 and x2, we can plot them as a scatter plot and, for instance, see whether they are correlated or not. But as soon as we are in three dimensions or more — three dimensions we can still handle, we can make a three-dimensional plot, but it is already hard to quantify — things get non-trivial. So we need techniques to think about how to actually look at such data and how to analyze it.

[07:30] The aim of dimensionality reduction, I'd say, is to reduce the number of observables while keeping the relevant information. What this relevant information is — that is something you have to decide based on the problem at hand.
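To make the data-matrix picture concrete, here is a minimal R sketch (the observables and numbers are invented for illustration and are not part of the lecture notebook) that builds a small tidy n × m matrix and inspects two of its columns as a scatter plot:

```r
# A toy data matrix: n = 100 samples (rows), m = 3 observables (columns).
# The observables and their correlations are made up for illustration only.
set.seed(1)
n <- 100
thumb  <- rnorm(n, mean = 6.0, sd = 0.5)          # thumb length [cm]
index  <- thumb + rnorm(n, mean = 1.5, sd = 0.3)  # index finger, correlated with thumb
height <- rnorm(n, mean = 175, sd = 10)           # body height [cm]
X <- cbind(thumb, index, height)                  # the n x m data matrix

# Two dimensions are easy to inspect as a scatter plot
plot(X[, "thumb"], X[, "index"],
     xlab = "thumb length [cm]", ylab = "index finger length [cm]")
```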
[08:15] Before I go into talking about how you can reduce dimensionality, I want to show you some examples, so I'll share some slides for this. Since many physicists are in the audience, the first example is something many of you already know: a case where we know that a low number of observables suffices to describe the system very well, namely statistical physics — you saw a lot of examples of that from Stefan. Even in a rather simple example like the Ising model, there are many, many spins in the system. Imagine you could measure all these spins: you would have a very high-dimensional representation of the state of the system. But many of these spins will be correlated, and we know from statistical physics that you can actually describe the macroscopic state of such a system very well with just the magnetization.

[09:30] Now, what happens if we come from the other side, from the data-science perspective? There is a paper on the arXiv by Sebastian Wetzel. He simulated a ferromagnetic Ising model on a 28 × 28 lattice at different temperatures; 28 × 28 gives you 784 spins, and he recorded the state of all these spins in 50,000 samples at different temperatures. So now you have a data matrix with 784 dimensions, and you can ask: if I only had this matrix and knew nothing about the system, how could I look for order in it, and how could I reduce the dimension? One of these techniques is called principal component analysis — I'll explain later what it actually is — and if you apply it here, you get a result like this: the first principal component, which is a one-dimensional representation of this high-dimensional data matrix, scales perfectly with the magnetization. In this way you would actually be able to discover an order parameter without knowing anything about the system, just by having detailed microscopic data of it.

[10:47] That is a rather artificial example, because we know very well how an Ising model behaves. But what about systems where we don't really know how they work — for instance, humans? This is a paper from Nature in 2008, last author Bustamante. They quantified human DNA from 3,000 European humans. You can imagine the DNA as a long string of characters, and to a large part it is conserved from one human to another — the DNA will be pretty much the same — but at some positions there are variations. At about half a million positions the authors quantified the DNA state, which gives you a data set with a dimension of roughly half a million for 3,000 humans. Then they did a dimensionality reduction — again a principal component analysis — and this is a two-dimensional representation of that high-dimensional data set.

[11:57] What I want to ask you now is to look at the picture — you don't yet have to understand what the principal components mean — and type into the chat what you see in it. ... It's a very quiet, or very shy, audience. What does it look like? ... Someone writes that it looks like an elephant. It is a scatter plot, yes, and we see some clusters, yes — and there are also some sensible answers.
[12:53] (Now I don't see my mouse any more.) I think Adolfo got it right — yes, and it is already the second such answer: it resembles a map of Europe. That is exactly what the title of the paper was: "Genes mirror geography within Europe". I think this is quite amazing: you just record the DNA of humans, and it basically gives you a map of Europe, approximately. It kind of makes sense that the genes somewhat remain local.

[13:36] Here is another example — a rather unphysical one, I'd say, but these kinds of examples using images work nicely to explain dimensionality-reduction techniques. This is a fashion data set called Fashion-MNIST, provided by Zalando Research. The images are 28 × 28 pixels in grayscale, so you can record the intensity of each pixel and end up with 784 pixel values for each of 60,000 pictures. There are different kinds of pictures: shirts and tops and trousers, and I think there are also shoes in there. Then you can do a dimensionality reduction; shown here is a UMAP, a uniform manifold approximation — I'll explain later what that is; it is a non-linear dimensionality-reduction technique. Again you can see a number of things just by taking the images as vectors and projecting them into two dimensions: for instance, the trousers up here form one cluster, which is very different from a cluster consisting of sandals, sneakers and ankle boots — the shoes cluster; the bags here seem to have some similarity to tops in a certain way.

[15:13] So this is again dimensionality reduction, and you can see how you can use it to group together different kinds of things — in this case just pictures. What you also see is that these things cluster, so there is a discrete kind of order; I'll come back to clustering in the second part of the lecture.

[15:38] Now I'll go into dimensionality reduction and how it works, starting with linear dimensionality reduction — this is just a linear transformation of your data. There are a number of techniques around: for instance principal component analysis, which I will explain, and others like non-negative matrix factorization, independent component analysis, or factor analysis. They are similar in that they are all linear, but I only have time to cover one of them, the easiest one, which is PCA. For that I'll share my iPad again.

[16:41] What is principal component analysis? I'll first sketch the idea. For getting the idea it is sufficient to go to two dimensions, so I'll make a scatter plot of two-dimensional data, where let's assume our data looks a bit like this.
[17:06] The two coordinates in this sketch are very well correlated: if you gave me x1, I could predict x2 quite well, because I see that these two observables are strongly correlated. PCA is a way to formalize this. The first principal component is designed so that it points in the direction of highest variability of the data — in this example, this direction; that is principal component one (PC1). The next principal component has to be orthogonal to PC1 and again point in the direction of maximum variability of the data; in two dimensions there is only one direction left, the orthogonal one, and that is PC2, principal component two. You then change the basis of your data to the directions of the principal components and end up with a representation in which the components are uncorrelated and most of the variability lies along the first direction.

[18:32] So the aim is to find a transformation such that the components are uncorrelated and ordered by the explained variability.

[19:28] How does the math look? Mathematically we start with an eigendecomposition of the covariance matrix. The covariance matrix is an object that is proportional to the data matrix transposed times the data matrix itself, Xᵀ X. You can see that this is a matrix of shape m × m, so it has the dimensionality of our data. If you do the computation, you will see that the diagonal elements of this object are proportional to the variance of the data along the corresponding direction, and the off-diagonal elements are proportional to the covariances between components. We then do an eigendecomposition, and from it we get m eigenvalues λᵢ and m eigenvectors, which we group together into a matrix W: the columns of this matrix are the eigenvectors of the covariance matrix. Importantly — and we can always do this — we sort them by the magnitude of the eigenvalues λᵢ, which ensures that the first principal component points in the direction of highest variability.

[21:38] W is just the new basis for our data, so we can transform the data: the transformed data is the data matrix times W, that is, T = X W. In the language of principal components, W is what's called the loadings (or the rotation), and the transformed data T are called the scores. I leave it to you as an exercise to show that the covariance matrix of the transformed data is diagonal, which basically means that in this transformed space the components are uncorrelated.

[22:39] Now, the transformed data matrix is again of shape n × m — its dimensionality is just the same as that of our original data.
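A minimal sketch of exactly these steps in R, using the built-in iris measurements purely as convenient example data (the centring of the columns is implicit in the verbal derivation above; this is not the lecture's notebook code):

```r
# PCA "by hand": eigendecomposition of the covariance matrix
X <- scale(as.matrix(iris[, 1:4]), center = TRUE, scale = FALSE)  # centred n x m data matrix
C <- t(X) %*% X / (nrow(X) - 1)   # covariance matrix, shape m x m (same as cov(X))

eig <- eigen(C)                   # eigenvalues are returned sorted by magnitude
W   <- eig$vectors                # columns = eigenvectors = principal directions (loadings)
T_scores <- X %*% W               # scores: the data expressed in the new basis

# In the new basis the components are uncorrelated:
round(cov(T_scores), 10)          # (approximately) diagonal, eigenvalues on the diagonal
```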
[22:56] So how does this help in reducing the dimensions? The idea is that, because the eigenvectors are sorted by the magnitude of λᵢ, we can truncate: if we define W_r — an m × r matrix — as the first r columns of W, then we get a projection of our data into lower dimensions simply as X W_r, and this object is of shape n × r. If we select, for instance, r = 2, we have a two-dimensional representation of our data. Of course this throws away some information, but most of the variability of the data is captured.

[23:59] So the take-home message is that principal component analysis is a linear transformation of the data such that the components are uncorrelated and ordered by the amount of explained variability. What I want to do next is show you how you can actually do this in R, which is quite easy — you don't actually need to know the math for it.

[24:31] Student: Excuse me, could you repeat how we decide at which point to truncate W, and what exactly we mean by sorting by the magnitude of the eigenvalues λ?

Fabian: This is the covariance matrix, right? λ₁ will be proportional to the variability in this direction, and λ₂ to the variability in that direction. So sorting by the eigenvalues ensures that the first principal component points in the direction of highest variability of the data — that is what the sorting means.

Student: Okay, so in this case λ₁ is the reference we sort by, from which we get W, because λ₁ is proportional to the PC1 direction?

Fabian: λ₁ is the eigenvalue that corresponds to the eigenvector pointing in this direction, yes. And if you look at the transformed covariance matrix, you will see that its diagonal elements are just the λs — each is the variance in the corresponding direction.

Student: Okay, thank you.

Fabian: Was that all? I think there was another question — you also asked how to decide where to truncate. That is up to us. As I will show you in a moment, you can decide that you want to keep, say, ninety percent of the variability of your data, and that gives you a threshold for where to stop. But this threshold is arbitrary. One usual choice is: if you want to visualize your data, you truncate at two, because that is easy to visualize. You might also take PCA as an initial step of dimensionality reduction — for example going from half a million dimensions down to 50 or so — and then do a non-linear transformation on top. But essentially, in my everyday work, it is often an arbitrary choice, and in the end you have to see whether it works, whether it is sensible. So that is a tough question, and there is still some research on it.
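Continuing the sketch above (it reuses X, eig and W from there), truncation keeps only the first r columns of W; one common — but, as just discussed, arbitrary — rule is to keep enough components to reach, say, 90% of the variance:

```r
# Explained variance per component and a 90%-variance truncation rule
lambda    <- eig$values
explained <- lambda / sum(lambda)
r <- which(cumsum(explained) >= 0.90)[1]   # smallest r reaching 90% cumulative variance

W_r <- W[, 1:r, drop = FALSE]              # m x r: first r principal directions
T_r <- X %*% W_r                           # n x r: low-dimensional representation of the data
```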
[28:03] All right. I prepared an R Markdown document, a notebook, in RStudio — I think Stefan briefly introduced RStudio to you last week. You can get the code from this URL, and by now it is also on the course homepage. (The code from the last lecture, I forgot to mention, is also on GitHub, in the same folder.)

[28:40] What I want to show you first, as an example, is one of those famous data sets you might have heard of: the botanist and statistician Fisher collected flowers of the iris plant. There are different species — Iris setosa, Iris virginica and Iris versicolor — and they have different leaves, the petal and the sepal. He measured the sepal length, the sepal width, the petal length and the petal width for all these species. It is a data set that you can easily load in R, which is why I used it, and it is four-dimensional: here you see the data — the lengths and widths for one species, with the other species following further down — so you have these four numbers per flower.

[29:41] You can try to visualize this, for instance, as pairwise scatter plots, plotting the sepal length against the sepal width and so on, here color-coded by species. You already see some separation, but in this four-dimensional plot it is hard to choose what to actually look at. So you can perform a principal component analysis on this four-dimensional data set, and in R this is just a one-liner: the principal component analysis is done by prcomp, and I plug in the four-dimensional data set.

[30:20] You then get out all the things we just talked about: the transformation matrix W, called "rotation" here, where you can see the direction of principal component one in this four-dimensional space — it is a vector; you can look at the transformed data, which at first is again four-dimensional; and you can look at the explained variance. This is maybe relevant for the question of where to cut off: you can see the cumulative proportion of the explained variance, and with the first two principal components you already explain 95–96% of the variability of your data, so you might say that is enough. From PC3 to PC4 you hardly add anything, so the fourth dimension of the data set is probably strongly correlated with the first three — there is not much information left.
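The exact call from the notebook is not visible in the recording; a plain sketch of the prcomp workflow described here could look like this:

```r
# PCA of the four iris measurements with prcomp (one-liner plus inspection)
pca <- prcomp(iris[, 1:4])

pca$rotation   # the transformation matrix W ("loadings"/"rotation")
head(pca$x)    # the transformed data ("scores"), still four-dimensional
summary(pca)   # standard deviations and cumulative proportion of explained variance

# Two-dimensional view: scores along PC1 and PC2, coloured by species
plot(pca$x[, 1], pca$x[, 2], col = iris$Species,
     xlab = "PC1", ylab = "PC2")
```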
[31:31] This is how the data looks in the directions of principal components one and two. What you can observe is that principal component one is the main direction that discriminates between the species. You can also see that setosa is very different from versicolor and virginica, while between those two it is hard to actually draw a boundary just based on the measurements. This here is what's called a scree plot, which is again just a visualization of the variance explained by each principal component.

[32:10] Now I want to show you another data set where principal component analysis has a hard job: again an image data set (COIL-20), in which images were collected of different objects that are rotated. You can again build a vector representation of these images and do a principal component analysis, and this is what the plot looks like: the colors represent different objects, and dots of the same color are pictures of the same object at different angles. You can see some structure in this dimensionality reduction, but it does not really work well at discriminating the different objects — and you would expect, or at least want, a dimensionality reduction to reveal this discrete kind of order, the different kinds of objects.

[33:23] Here is another example that I made up: a dimensionality reduction of a Swiss roll — actually a Swiss roll with a hole. What I did was to place points randomly on a two-dimensional surface with a hole in the middle, and then wrap this up in 3D — transform it non-linearly — so that it becomes this roll. (It is a sweet, by the way; that's why it's called a Swiss roll.) You can still see the hole in the middle. Now, what happens if you do a principal component analysis of this? The result is shown here: the same data in principal-component space, actually in 3D. What you can see is that not much happened — it is pretty much the same as before, just rotated into the principal-component basis. The reason is that PCA is a linear transformation of the data, but what I did was to take two-dimensional data and non-linearly embed it into 3D using a non-linear function. Projecting this back to 2D therefore cannot work with a linear dimensionality-reduction technique, and that is why non-linear dimensionality-reduction techniques are also very useful.
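The construction used in the lecture is not shown in the recording; a minimal sketch along the lines described — uniform points on a 2D sheet, a hole punched in the middle, a non-linear roll-up into 3D, then PCA — might look like this (all parameter values are arbitrary choices):

```r
# Swiss roll with a hole: 2D sheet -> punch hole -> roll up into 3D, then try PCA
set.seed(42)
n <- 3000
u <- runif(n, 1.5 * pi, 4.5 * pi)        # coordinate along the sheet (will be rolled up)
v <- runif(n, 0, 15)                     # coordinate along the roll axis
keep <- (u - 3 * pi)^2 + (v - 7.5)^2 > 4 # punch a circular hole in the middle of the sheet
u <- u[keep]; v <- v[keep]

swiss_roll <- cbind(x = u * cos(u),      # non-linear embedding of the sheet into 3D
                    y = v,
                    z = u * sin(u))

pca <- prcomp(swiss_roll)
plot(pca$x[, 1], pca$x[, 2], pch = ".")  # linear projection: the roll stays rolled up
```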
[35:07] This leads me to the second part of dimensionality reduction, which is about non-linear dimensionality-reduction techniques. As is usually the case with non-linear things, there are many of them: about a year ago I had a look at the Wikipedia list of prominent dimensionality-reduction algorithms, and there were 29 — I'm sure there are more by now, and the list doesn't claim to be complete. This makes very clear that, depending on what you want, there are different things you can do. Today I will concentrate on uniform manifold approximation (UMAP), a rather recent one, introduced in 2018 — three years ago — and nowadays used a lot for single-cell sequencing data in single-cell genomics.

[36:05] So what is this technique about? Don't be scared — there is a lot of mathematics on this page and you don't have to understand all of it right away; I also won't have the time to explain it fully in detail, but I will give you an intuition of what it does. The basic idea is that we have a high-dimensional data set, but we hypothesize that there is a low-dimensional manifold underneath — think, for instance, of a line in a two-dimensional space — and we want to discover this low-dimensional manifold. There are a couple of assumptions made in UMAP: that the data is distributed uniformly on a Riemannian manifold, that the Riemannian metric is locally constant, and that the manifold is locally connected. What UMAP then gives you is this: it models the manifold with a fuzzy topological structure, and it finds a low-dimensional projection of the data that has a very similar fuzzy topological structure to your original data. So what does all this mean?

Whiteboard 1

[37:30] Let me start by very briefly introducing an important idea from topological data analysis, because this algorithm is heavily based on it: simplices. A 0-simplex is a point; a 1-simplex is a connection of two 0-simplices, which is a line; then there is the 2-simplex, which is a triangle, and the 3-simplex, which is a tetrahedron; and you can see that you can easily extend this concept to higher dimensions and introduce a 4-simplex, a 5-simplex, and so on. These objects are rather easy to construct, and the nice thing is that you can glue such simplices together, and by gluing them together you can approximate manifolds. You may all have seen at some point how a surface is triangulated — that is nothing but approximating the manifold with simplices.

[38:51] Now let's look at this example, which I took from the UMAP homepage. You see points that are uniformly sampled — not equidistantly, but uniformly — on a line in a two-dimensional space, so we have a one-dimensional manifold in a two-dimensional space. These could be our measured data points, and we would be interested in finding the topological structure of this manifold. What has been done here is to draw spheres around these dots. All these spheres are sets, and the union of them is what's called an open cover — an open cover is simply a collection of sets that spans the whole manifold, so the line we see here lies inside this open cover. Now you can do some heavy topological mathematics and prove that if you construct simplices from all groups of sets that have a non-empty intersection — which in this case basically means you construct a 1-simplex, a line, here, and another one here, and so on; I think you get the picture — then the graph-like structure you get out of this captures the topology of the manifold: the information is encoded in there. In this two-dimensional example it is quite intuitive that the connection of all these points tells you that this is a connected line and that there are, for instance, no loops.

[41:01] So that's pretty cool, and now we would like to use this concept to learn something about our high-dimensional data.
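As a small, hand-made illustration of this "overlapping balls give 1-simplices" idea (not from the lecture; the radius r is simply guessed "such that it works", as in the example above), one can connect every pair of points whose balls of radius r overlap, i.e. whose distance is below 2r:

```r
# Nerve-of-an-open-cover intuition in 2D: connect points whose radius-r balls overlap
set.seed(3)
s   <- sort(runif(40, 0, 2 * pi))
pts <- cbind(s, sin(s)) + matrix(rnorm(80, sd = 0.03), ncol = 2)  # noisy points on a curved line

r <- 0.25                                  # ball radius, guessed by hand
D <- as.matrix(dist(pts))                  # pairwise Euclidean distances

plot(pts, asp = 1, pch = 19, xlab = "x1", ylab = "x2")
for (i in 1:(nrow(pts) - 1)) {
  for (j in (i + 1):nrow(pts)) {
    if (D[i, j] < 2 * r) {                 # the two balls intersect -> draw a 1-simplex (edge)
      segments(pts[i, 1], pts[i, 2], pts[j, 1], pts[j, 2])
    }
  }
}
```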
[41:12] Student: Is the criterion for the connection always the shortest distance between the points? Here I can see it, but what if I have a more complex structure?

Fabian: Exactly — I'm actually coming to this point in a second. In this case it was a bit like: let's just guess the size of the sphere and make it a good guess such that it works. But if you have high-dimensional data, what the size of this sphere should be is super unclear, so I'll come back to this question in a moment.

[41:46] It is also connected to a situation like this one, because our points might not actually be uniformly distributed. If we did the same thing here — drawing these circles and constructing the simplices from them — we would get disconnected parts, and that is not what we would intuitively like to get, because here we clearly saw one line. It also breaks the UMAP assumption of a uniform distribution of data points on the manifold. How can we circumvent or solve this problem? We can turn the problem upside down and say: no, these data points are uniformly distributed; we only see different distances because we apply a constant metric, so that must be wrong — the metric is not constant. We can then try to construct the metric accordingly, and this comes back to your question: we might say that the distance to, say, the fifth nearest neighbour should always, on average, correspond to a distance of five — in other words, distances between neighbouring data points should on average be one. This means that a distance of five would be rather large for this data point here and rather small for that one over there. So we use the k nearest neighbours to define a local metric.

[43:42] We then turn this local metric — and I am skipping a lot of details that are important here — into a construction of simplices, into a graph whose edges are now weighted. They are weighted because the distances between points differ, and also because we actually have two measures of distance: one using the local metric of the first point and one using the local metric of the second point, which might not be the same. But if you do the math behind all that properly, you can show that this construction captures the essential parts of what is called the fuzzy topology of the manifold. So we now have a graph representation that captures the topological structure of the manifold we are interested in.

[44:49] How can we use this to find a low-dimensional representation? The idea is that, with this graph defined, you start with a low-dimensional representation — you basically just place your points randomly — then you construct a similar graph representation of this low-dimensional representation, and then you shuffle your points around until the two topologies, the two graph representations, match. So you have a graph representation of your high-dimensional data and one of your low-dimensional data, you interpret the edge weights as probabilities, and then you minimize the cross-entropy between them (the expression on the slide; schematically it is a sum over edges of terms like v log(v/w) + (1 − v) log((1 − v)/(1 − w)), where v and w are the edge weights in the high- and low-dimensional graphs). The important point is that you try to match the edge weights of the high-dimensional representation with those of the low-dimensional representation: when the ratio of the two becomes one, the term vanishes, and the same happens for the other term.

[46:01] How does this work in practice? If you want to read more about it, there is a nice web page called "Understanding UMAP"; I just want to show an animation from there. For instance, here you have an example of dots which run out from a centre in a star-like fashion — actually I think it is 10-dimensional data and this is a three-dimensional projection. If you run UMAP, the dots are shuffled around, and you stop when you have a topological representation that is very similar — in this case also a star-shaped pattern, but now in two dimensions. There are many more examples on that page, and you can play around with them.

[47:09] You can of course do the same thing in R. I know this was a lot of information and I didn't go into the details — are there questions about UMAP at this point? Otherwise I'll just show you how to do this in R. It is essentially again a one-liner: there is a function umap, and what I feed in is the Swiss roll that I created previously; so I do the dimensionality reduction with a one-liner and then convert the whole thing into a data.table object to get nice plotting.
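A sketch of that one-liner with the CRAN umap package (swiss_roll is the 3D point cloud from the earlier sketch; the lecture notebook converts the result to a data.table for plotting, while base graphics are used here instead):

```r
# Non-linear dimensionality reduction of the Swiss roll with UMAP
library(umap)

um  <- umap(swiss_roll)    # the one-liner: fit UMAP with default settings
emb <- um$layout           # n x 2 matrix of UMAP coordinates

plot(emb[, 1], emb[, 2], pch = ".",
     xlab = "UMAP 1", ylab = "UMAP 2")   # the roll is unrolled; the hole survives
```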
[48:03] This is what the result looks like, now in UMAP coordinates: the Swiss roll was kind of unrolled — it heavily lost its shape — but what you can see is that the topology is preserved: the hole in the centre is still there.

[48:25] I also want to show you a UMAP representation of the COIL-20 data from above — remember, this was the data set with pictures of rotated objects, where the PCA did not look very informative. If we do the same thing with UMAP and feed in these pictures, this is what we get: most of the images nicely separate (different colors mean different objects), and what UMAP also gives us is the topology of the data, in the sense that we can see that this is a rotation: neighbouring pictures are similar, and you rotate the object until you come back around to the other side.

[49:27] Okay, I think that's it for dimensionality reduction: one example of linear dimensionality reduction, which was principal components, and one of non-linear, topology-preserving dimensionality reduction, which was UMAP. I now want to talk a little bit about clustering.
[49:53] As you already saw in some of the reduced-dimensional representations of the data, clusters can appear in your data, which essentially means you have qualitatively different objects — different kinds of samples — in your data, and it is good to have a way to group different classes of objects together. There is no single definition of clustering; the idea is to find clusters such that the data within a cluster are similar, while data from different clusters are dissimilar. But what you define as similarity or dissimilarity is basically up to you, and so a lot of different clustering algorithms exist. I just want to introduce three of them today — at least I use them a lot when trying to understand high-dimensional data. I'll share my iPad again.

[51:27] The first example is hierarchical clustering. Here I have plotted some data in two-dimensional space to illustrate the concept. What you need for hierarchical clustering is, first of all, a metric that defines the distance between your objects — this could be, for instance, the Euclidean metric, the Manhattan metric, or a correlation metric; you are basically free to choose. Then you group together the two objects that are closest together according to the metric — with the Euclidean metric, in this example, those two. What you thereby create is a kind of tree: grouping those two is the first step.

[52:48] Next you need another ingredient, which is called the linkage criterion. It tells you how to treat these clusters, because you now need a way to get the distance between clusters. This could, for instance, be the minimum distance: if you want the distance between this cluster and this object, you take the distance between B and D and between C and D and use the minimum. Or you could, for instance, use the centroids of the clusters — there are many choices.

[53:51] If we go on — let's take the centroid — the next cluster would probably be a grouping of D and E; then I would add those three into one cluster; then group all of B, C, D, E and F; until finally we arrive at one object that contains all the elements. Now, if we want to define clusters in our data, we need to define a precision threshold along this axis. We could, for instance, put it here, and then it cuts the tree into separate subtrees — one subtree here, one here and one here — and these are our three clusters: cluster one, cluster two and cluster three. If we increase the precision, moving the threshold up here, we get more and more clusters. And since this basically just needs some distance computations, it is rather easily done also in higher dimensions. The steps are sketched below.
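A minimal sketch of these ingredients in R on made-up 2D points (not from the lecture notebook): the metric goes into dist(), the linkage criterion into hclust(), and the threshold into cutree():

```r
# Hierarchical clustering: metric + linkage + a threshold to cut the tree
set.seed(7)
x <- rbind(matrix(rnorm(40, mean = 0), ncol = 2),   # three made-up groups of 2D points
           matrix(rnorm(40, mean = 4), ncol = 2),
           matrix(rnorm(40, mean = 8), ncol = 2))

d  <- dist(x, method = "euclidean")   # the metric (could also be "manhattan", ...)
hc <- hclust(d, method = "single")    # the linkage criterion ("single" = minimum distance)

plot(hc)                              # the tree (dendrogram)
clusters <- cutree(hc, k = 3)         # cut into 3 clusters (or use h = ... as a height threshold)
```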
[55:48] Next I want to show you another important one, a centroid-based clustering called k-means clustering — one comes across this quite often as well when working with high-dimensional data. In this case you pre-specify the number of clusters k; that is what you have to start with: how many clusters do you want to get? In practice you might try out different k. Then you partition your data such that the squared distance to the cluster mean position — the centroid — is minimal.

[56:41] How does this look? Maybe let's look at the final result first — here you see an algorithm converging — what you try to do is to find a grouping of the data such that this criterion holds. How do you actually get this partitioning of your data? One way is Lloyd's algorithm. You start with random centres that you place somewhere — in this example, taken from Wikipedia, there is a centre here, a centre there and a centre there — and then you assign each of your data points to the cluster whose centre is nearest: the yellow centre is here, so everything around here is assigned to the yellow cluster. Once you have done this, you update the centres of your clusters: you take the centroid of the yellow cluster, for instance, and the yellow centre ends up here. Then you do a new assignment of your data points to the nearest centre — in this case all the yellow ones up here now get assigned to the blue centre — and you go on iteratively until the algorithm converges, and you end up with three clusters defined by this criterion. Of course there are other, more sophisticated algorithms, but this one approximates the solution.
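The same made-up points from the previous sketch can be partitioned with the built-in kmeans function, which offers Lloyd's algorithm as one of its options (k has to be chosen up front, as described):

```r
# k-means with k specified in advance, using Lloyd's algorithm as described above
km <- kmeans(x, centers = 3, algorithm = "Lloyd", nstart = 10, iter.max = 100)

km$cluster                     # cluster assignment of every sample
km$centers                     # final cluster centroids
plot(x, col = km$cluster)      # points coloured by cluster
points(km$centers, pch = 4, cex = 2)
```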
[58:27] A third algorithm that I want to show you — and I show it because I work with it every day — comes from the study of social networks: graph-based clustering. You start with your high-dimensional data and first need a graph representation of it (we already saw this in the UMAP example); you could, for instance, build a k-nearest-neighbour network on your data. Then you partition the data such that the modularity Q is optimized. Modularity is defined as the object on the slide — in the usual formulation Q = Σᵢ (eᵢᵢ − aᵢ²), where eᵢⱼ is the fraction of links between cluster i and cluster j and aᵢ = Σⱼ eᵢⱼ. The eᵢᵢ term counts the internal links, the links inside cluster i, and you try to maximize it; the aᵢ term contains the eᵢⱼ from i to the other clusters, that is, the links out of the cluster, and those you minimize. So a cluster in this sense is a part of your graph that is densely connected inside but only sparsely connected to the outside.

[59:56] There is a particular algorithm that is used a lot nowadays, the Louvain algorithm. How do you solve this problem — because you can't solve it by brute force? You start with a network like this, for instance the nearest-neighbour graph you created from your data, and then you do copy attempts. At the beginning, all nodes are separate clusters. You copy cluster identity 0 to neighbouring nodes, compute the modularity again, and if the modularity has increased you accept the copy step; if it has decreased you reject it. You keep doing this until you arrive at a steady state where copy attempts don't increase the modularity any more — in this case something like this, where you end up with four clusters. Then you do an aggregation, which means you define a new graph representation of your data in which this blue cluster, say, becomes a single node: it gets four internal connections, because there were one-two-three-four internal connections; it gets four connections to the green cluster — one, two, three, four — and one connection to the light-blue cluster. You aggregate, do the copy attempts again, and so on, until you finally arrive at a state where you cannot improve the modularity any more. This is how the algorithm solves the problem of modularity optimization, and it works very well with many kinds of data — for instance, it is used a lot to identify cell types in single-cell genomics.
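A sketch of this pipeline with the igraph package, again on the toy points x from above: a k-nearest-neighbour graph is built by hand here for transparency (dedicated packages do this more efficiently), and modularity is then optimised with igraph's Louvain implementation, cluster_louvain:

```r
# Graph-based clustering: kNN graph + Louvain modularity optimisation
library(igraph)

k <- 10
D <- as.matrix(dist(x))
edges <- do.call(rbind, lapply(seq_len(nrow(x)), function(i) {
  nn <- order(D[i, ])[2:(k + 1)]   # k nearest neighbours of sample i (skip i itself)
  cbind(i, nn)
}))

g  <- graph_from_edgelist(edges, directed = FALSE)
g  <- simplify(g)                  # drop duplicate edges from the symmetrised kNN lists
cl <- cluster_louvain(g)           # Louvain: greedy modularity optimisation

membership(cl)                     # cluster label of every node (sample)
modularity(cl)                     # the achieved modularity Q
```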
[62:12] All right, that's it already — I realize I have been quite fast and we would still have some time, I'm sorry. (Stefan: Good timing, one hour.) Just as a summary: the question I wanted to address is how you explore high-dimensional data to identify order in a high-dimensional data set. I showed you a couple of dimensionality-reduction techniques to visualize high-dimensional data, which allow you to identify a small set of observables that describe the data, and I showed you a couple of approaches for clustering, to identify discrete order in high-dimensional data. One important point I want to make is that the methods you use to work with high-dimensional data depend on the problem at hand — it really depends on what you are looking for, and often it is just trying out a lot of things to see what makes sense for understanding your data. With this I am at the end; there are some references at the back, which I will upload with the slides, and I'm happy to take questions.

[63:29] Stefan: Okay, perfect — thanks a lot, Fabian, thanks for taking the time, especially since data people like you are in such high demand these days and very busy; it's great that you took the time to show us some of your stuff. I think what we usually do is hang around a little bit on Zoom, so if you have any questions, just stay online and ask them. Other than that, see you all next week — and there's already a question in the chat.

[64:13] Fabian: There were questions in the chat during the lecture and I didn't see them — you can't see the chat very well while sharing the screen — but I think they were already answered by other people. The latest question is: how computationally intensive is UMAP? That's a good question, and it is actually another advantage of UMAP: it is pretty fast, it doesn't need a lot of memory, and it scales very well with both the dimension of your data and the number of data points. It is one of the fastest things around at the moment.

[65:01] Stefan: If you look at the typical data sets you work with — say 15,000 dimensions and maybe a few hundred thousand or tens of thousands of samples — you end up with a few seconds?

Fabian: Yes — on my machine, with the largest data sets I have, which are of the order you describe, it takes maybe about 30 seconds to compute the UMAP. That is much faster than, for instance, t-SNE — I don't know whether you know it, but it was used a lot before and is usually much slower. There are also GPU implementations of UMAP nowadays, which I haven't tried, but UMAP parallelizes nicely, so those should be much faster again.

[66:04] Stefan: Great, thanks. Are there any other questions?

[66:08] Student: I have a question about UMAP. In all the examples we saw of UMAP approximating high-dimensional data — isn't there any parameter in the algorithm that quantifies the acceptability of the approximation, whether the approximation is good or not, or will it approximate any high-dimensional data? There didn't seem to be a parameter that quantified, or helped us understand, whether the result of the UMAP approximation is good, obviously with respect to the problem at hand.

[66:53] Fabian: I'm not fully sure I completely understand the question, but here are a few thoughts. The approximation in UMAP is that the data is uniformly distributed on the manifold, and maybe the question is whether that is a good approximation or not. The problem is that we hardly ever know the real metric of the underlying space.
[67:30] If you think, for instance, of these examples with pictures: what is the distance between two pictures, what is this metric? Unless you have a good microscopic model — which in this case we don't have — it is not really possible to define a good metric and therefore to judge whether the approximation is a good one or not.

[68:08] Stefan: Yes — there is no rigorous way of doing that, and that's why people like Fabian exist, who have experience and consistency checks and can tell, with some experience, whether a UMAP makes sense or not. One of the standard problems is that a UMAP can be dominated by noise, by technical artifacts for example, and then you need some insight into the data and some experience to tell whether what you see actually represents something real in the data or whether it is just an artifact. That's basically one of the jobs that bioinformaticians or data scientists do, because there is simply no rigorous procedure where you push a button and it tells you whether what you've done is right or not. It needs experience, insight and understanding of the data and of, for example, the biology underlying the data.

[69:18] Student: So typically the approach would be to use multiple reduction methods and see which one gives us an order that is more perceptible in our plots?

Stefan: I guess you also have to know the strengths and weaknesses of each method. For example, what Fabian said about t-SNE, which was used before UMAP: it had a tendency to give you clusters of data points and was not very good at representing the global structure of data sets. Once you know that, you can interpret what you see and be careful about interpreting clusters as real physical or biological objects. UMAP, which is what you have seen now, overcame this limitation — it also represents the global structure of the data quite well. But I guess in UMAP, too, you can have artifacts, especially if your data has technical noise; you can get very weird artifacts. So you typically need to treat the data before you give it to UMAP — for example, you very often have to log-transform the data: if your data spans huge orders of magnitude in its values, you do log transformations so that the few very large data points don't dominate everything. You have to treat the data in a way that makes it digestible for these methods.
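A tiny example of the kind of pre-treatment mentioned here; the skewed counts matrix is made up, and log1p is just one common choice because it compresses large values while keeping zeros at zero:

```r
# Log-transform data that spans huge orders of magnitude before reducing dimensions
set.seed(11)
counts <- matrix(rpois(2000, lambda = rexp(2000, rate = 0.001)), ncol = 20)  # made-up skewed counts
log_counts <- log1p(counts)       # log(1 + x): tames the few very large values, keeps zeros at zero
# um <- umap::umap(log_counts)    # then run the dimensionality reduction on the treated data
```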
[71:04] Fabian: Maybe the other part of the question was, in a way, about parameters, and I didn't talk about parameters at all — there are a couple of parameters in the algorithm, and I skipped a lot of details. The most important parameter is the number of neighbours you look at when you construct the graph. If you take only the first neighbour, you estimate your metric very locally — very fine, but also very noisy, so you will often estimate the wrong metric. If you increase the number of neighbours, you get a more robust estimate of the local metric, but you will miss some of the fine structure of the data.

[71:54] Maybe there is one example I can show from that page — can you see this, or is it too small? (Ah, sorry, I wasn't sharing the screen, so you couldn't see anything; let me show this example.) This is maybe actually a good example for your question: here we have a circle with dots, and the distance between the dots is not always the same. Taking just this example — where we know the Euclidean metric — we can think about how well the approximation works, or how UMAP behaves. One thing you can see is that there is kind of an optimum for the number of neighbours which allows you to actually capture the circular topology, the hole in the middle. If you go too big, you start to lose the information about the order of these dots — of these measurements — they start to get more and more noisy. If you look at fewer neighbours, you get nicely ordered lines, but you start to miss things where you have these gaps, and you start to create clusters where there shouldn't be clusters. So this is the most important parameter of the method.

[74:22] The UMAP homepage — the one from the UMAP algorithm itself — features a nice section on what parameters you have for your map and what influence they have. But it is generally relatively robust, also compared to other methods: empirically, if you go with about 15 to 30 neighbours or so, you usually derive good representations of the topology of your data, provided you have enough data points. So it is relatively robust to the choice of parameters; you don't have to play around with it too much.

[75:14] Student: Thank you.
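In the CRAN umap package this parameter is set through the configuration object; a sketch of trying a few neighbourhood sizes on the swiss_roll data from before (the useful range will depend on the data):

```r
# Varying the most important UMAP parameter: the number of neighbours
library(umap)

for (k in c(5, 15, 30)) {
  cfg <- umap.defaults
  cfg$n_neighbors <- k
  um  <- umap(swiss_roll, config = cfg)
  plot(um$layout[, 1], um$layout[, 2], pch = ".",
       main = paste("n_neighbors =", k))
}
```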
[75:19] Stefan: Okay, perfect, very nice. Do we have any other questions? Yes, there's a question from Kai Müller in the chat, which I will read: for high-dimensional data, how does UMAP determine the manifold dimension?

[75:36] Fabian: I don't think it does. This might be related to the fact that in the UMAP algorithm, when you construct the graph representation from the intersecting spheres, you only look at the 1-simplices — you only construct a graph and you don't look at higher-order simplices. I'm not sure whether those could be used to say something about the dimensionality of your manifold — I'm unsure there — but by default it doesn't tell you the dimensionality.

[76:40] Stefan: Okay. Any other questions? A lot of questions today. ... Okay, if there are no more questions, then let's end the meeting. Thanks all for joining, and thanks again to Fabian.

Fabian: Thank you — well, thank you for having me.

Stefan: Okay, perfect. See you all next week. Bye.
