hansvm 2 days ago

It's a neat idea. It's not too dissimilar in spirit from gradient boosting. The point about credit assignment is crucial, and that's the same reason most architectures and initialization methods are so convoluted nowadays.

I don't really like one of their premises and conclusions:

> that does not learn hierarchical representations

There's an implicit bias here that (a) traditional networks do learn hierarchical representations, (b) that's bad, and (c) this training method does not learn those. However, (a) is situational, and it's easy to construct datasets where a standard gradient-descent neural net will learn in a different way, even with a reversed hierarchy. (b) is unproven and also doesn't make a lot of intuitive sense to me. (c), even in this paper where they make that claim, has no evidence and also doesn't seem likely to be true.

gwern 2 days ago

https://www.reddit.com/r/MachineLearning/comments/1jsft3c/r_...

I'm still not quite sure how to think of this. Maybe as being like unrolling a diffusion model, the equivalent of BPTT for RNNs?

  • ActorNightly a day ago

    I think we need to start thinking about one-shot training. I.e. instead of feeding context into an LLM, you should be able to tell it a fact, and it will encode that fact into updated weights.

  • cttet 2 days ago

    In all their experiments, backprop is used for most of their parameters though...

    • hansvm 2 days ago

      There is a meaningful distinction. They only use backprop one layer at a time, requiring additional space proportional to that layer. Full backprop requires additional space proportional to the whole network.

      It's also a bit interesting as an experimental result, since the core idea didn't require backprop. Being an implementation detail, you could theoretically swap in other layer types or solvers.
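
      A rough PyTorch sketch of that memory point (my own illustration, not the paper's code; the per-layer loss functions here are hypothetical stand-ins): detaching each layer's input keeps the autograd graph one layer deep, so only that layer's activations are held for the backward pass.

        import torch
        import torch.nn as nn

        layers = nn.ModuleList([
            nn.Sequential(nn.Linear(784, 256), nn.ReLU()),
            nn.Sequential(nn.Linear(256, 256), nn.ReLU()),
            nn.Linear(256, 10),
        ])
        opts = [torch.optim.SGD(l.parameters(), lr=1e-2) for l in layers]

        def local_step(x, local_losses):
            # local_losses[i] is whatever per-layer objective you choose
            # (denoising target, linear probe, ...). The detach() is the key:
            # autograd only ever tracks one layer's activations at a time.
            h = x
            for layer, opt, loss_fn in zip(layers, opts, local_losses):
                h = layer(h.detach())   # cut the graph here
                loss = loss_fn(h)
                opt.zero_grad()
                loss.backward()         # gradient flows through this layer only
                opt.step()
            return h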

toxik a day ago

An interesting idea for sure, but why only evaluate it on 28x28-pixel images? And why is their flow matching so much worse in some cases? Some analysis is missing; their own words on it say nothing:

> For CIFAR-100 with one-hot embeddings, NoProp-FM fails to learn effectively, resulting in very slow accuracy improvement

In general, any actual analysis is made impossible by the lack of signal in the results. Fig. 5 tells me nothing when the span is 99.58 to 99.46 percent accuracy.

DrFalkyn 2 days ago

If we could ever figure out what wet brains actually do (continuous feedback? enzyme release?), this might be possible.

  • friendzis a day ago

    We know quite a lot. For example, we know that brains have various different neuromodulatory pathways. Take, for example, the dopamine reward mechanism that is being talked about more openly these days. Dopamine is literally secreted by various different parts of the brain and affects different pathways.

    I don't think it is anywhere feasible to emulate anything resembling this in a computational neural network with fixed input and output neurons.

    • dgfl a day ago

      Dopamine is not permanent, though. We're talking about long-term synaptic plasticity, not short-term neurotransmitter modulation.

      • scarmig a day ago

        Dopamine modulates long term potentiation and depression, in some complicated way.

    • idiotsecant a day ago

      Aren't we already emulating it? It's sort of a distributed and overlaid reward function, which we just undistributed.

  • tsimionescu a day ago

    Keep in mind that our brains also have a great deal of built-in structure trained by evolution. So even if we understood exactly how a brain learns, we may still not be able to replicate it if we can't figure out the highly optimized initial state from which it starts in a fetus.

    • gpjt a day ago

      Presumably that is limited by the gig or so of information in our DNA, though?

      • tsimionescu a day ago

        The amount of information transmitted from one generation to the next is potentially much more than the contents of DNA. DNA is not an encoding of every detail of a living body; it is a set of instructions for a living body to create an approximate copy of itself. As far as we know, you can't use DNA to create a new organism from scratch without having the parent organism around to build it. We do know for certain that many parts of a cell divide separately from the nucleus and have no relation to the DNA of the cell - the best known being the mitochondria, which have their own DNA, but many organelles also just split off and migrate to the new cell quasi-independently. And this is just the simplest layer in some of the simplest organisms - we have no idea whatsoever how much other information is transmitted from the parent organism to the child in ways other than DNA.

        In particular in mammals, we have no idea how actively the mother's body helps shape the child. Of course, there's no direct neuron to neuron contact, but that doesn't mean that the mother's body can't contribute to aspects of even the fetal brain development in other ways.

        • gpjt 18 hours ago

          Interesting. As you say, that certainly makes sense for mammals. But I'd be interested to know what mechanisms you might conjecture for birds, where pretty much all foetal development happens inside the egg, separated from the mother -- or for fish, or octopuses.

    • friendzis a day ago

      I concur. It might not be feasible in terms of the computational power available, but I don't think there is anything fundamentally stopping the application of those training mechanisms, unless the whole neural-net paradigm is fundamentally incompatible with those learning methods.

    • jampekka a day ago

      How much is encoded genetically, especially of "higher-level cognition" like language, is highly controversial, and the thinking/pendulum in the last decade or two has shifted substantially towards only general mechanisms being innate. E.g. the cortex may be in an essentially "random state" prior to getting input.

      • deepsun a day ago

        Yet, for example, the auditory/language-processing part is almost always located in the same region in all humans.

        • jampekka a day ago

          E.g. ear input is connected to the same cortical location in almost all humans.

      • tsimionescu a day ago

        That's why I qualified all of my statements with "may" and "might". Still, I think it's extraordinarily unlikely that human brains could turn out, for example, to have no special bias for learning language. The training algorithm in our brains would have to be so many orders of magnitude better than the state of the art in ANNs that it would boggle the mind.

        Consider the comparison with LLM training. A state-of-the-art LLM that is, say, only an order of magnitude better than an average 4-year-old human child at language use is trained on ~all of the human text ever produced, consuming many megawatts of power in the process. And it's helped by plenty of pre-processing of this text, and receives virtually no noise.

        In contrast, a human child who is not deaf acquires language from a noisy environment with plenty of auditory stimuli, from which they first have to even work out that they are picking up language. To be able to communicate, and thus receive significant feedback on the learning, they also have to learn how to control a very complex set of organs (tongue, lips, larynx, chest muscles), all with many degrees of freedom and the precise timing needed to produce any sound at all.

        And yet virtually all human children learn all of this in a matter of 12-24 months, and then spend another 2-3 years learning more language without struggling as much with the basics of word recognition and pronunciation. And they do all this while consuming a total of some 5 kWh, which includes many bodily processes that are not directly related to language acquisition, and a lot of direct physical activity too.

        So, either we are missing something extremely fundamental, or the initial state of the brain is very, very far from random and much of this was actually trained over tens or hundreds of thousands of years of evolution of the hominids.

        • jampekka a day ago

          Language capability is a bit difficult to quantify, but LLMs know tens of languages, many of them better, at least grammar- and vocabulary-wise, than the vast majority of even native speakers. They also encode orders of magnitude more fact-type knowledge than any human being. My take is that language isn't that hard but humans just kinda suck at it, like we suck at arithmetic and chess.

          There sure is some "inductive bias" in the anatomy of the brain to develop things like language, but it could be closer to how transformer architectures differ from pure MLPs.

          The argument for decades was that no generic system can learn language from input alone. That turned out to be flat wrong.

    • stormfather a day ago

      Didn't they get neurons in a petri dish to fly a flight simulator?

erikerikson 2 days ago

We have gradient-free algorithms: Hebbian learning. Since 1949?

  • uoaei 2 days ago

    And there are good reasons why we use gradients today.

  • sva_ a day ago

    That's more a theory/principle than an algorithm by itself.

itsthecourier 2 days ago

"Whenever these kind of papers come out I skim it looking for where they actually do backprop.

Check the pseudo code of their algorithms.

"Update using gradient based optimizations""

  • f_devd 2 days ago

    I mean, the only claim is no propagation; you always need a gradient of sorts to update parameters, unless you just stumble upon the desired parameters. Even genetic algorithms effectively have gradients, which are obfuscated through random projections.

    • erikerikson 2 days ago

      No you don't. See Hebbian learning (neurons that fire together wire together). Bonus: it is one of the biologically plausible options.

      Maybe you have a way of seeing it differently so that this looks like a gradient? Gradient keys my brain into a desired outcome expressed as an expectation function.
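
      For concreteness, the plain Hebbian rule is just a local outer-product update; there's no target, no error signal, and no explicit loss anywhere (a minimal numpy sketch of my own, not from any particular paper):

        import numpy as np

        rng = np.random.default_rng(0)
        W = rng.normal(scale=0.1, size=(10, 784))   # one layer of weights

        def hebbian_step(W, x, lr=1e-3):
            # "fire together, wire together": strengthen w_ij in proportion
            # to pre-synaptic activity x_j and post-synaptic activity y_i.
            y = W @ x
            return W + lr * np.outer(y, x)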

      • srean a day ago

        Nope. That rank-one update is exactly the projected gradient of the reconstruction loss. It's just not usually taught that way, so Hebbian learning was an unfortunate example.

        Gradient descent is only one way of searching for a minimum, so in that sense it is not necessary - for example, when one can analytically solve for the extrema of the loss. As an alternative, one could do Monte Carlo search instead of gradient descent. For a convex loss that would be less efficient, of course.
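
        For instance, a crude Monte Carlo / random-search loop of this kind (names and sizes are just illustrative) will also find the minimum of a convex loss, only with far more function evaluations than gradient descent:

          import numpy as np

          def random_search(loss, dim, iters=10_000, scale=0.1, seed=0):
              rng = np.random.default_rng(seed)
              best = rng.normal(size=dim)
              best_loss = loss(best)
              for _ in range(iters):
                  cand = best + scale * rng.normal(size=dim)  # random proposal
                  cand_loss = loss(cand)
                  if cand_loss < best_loss:                   # keep improvements only
                      best, best_loss = cand, cand_loss
              return best

          # converges near w = [3, 3, 3, 3, 3], no gradients needed
          w = random_search(lambda w: np.sum((w - 3.0) ** 2), dim=5)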

      • red75prime 2 days ago

        > See Hebbian learning

        The one that is not used, because it's inherently unstable?

        Learning using locally accessible information is an interesting approach, but it needs to be more complex than "fire together, wire together". And then you might have propagation of information that allows gradients to be approximated locally.

        • erikerikson 2 days ago

          Is that what they're teaching now? Originally it was not used because it was believed it couldn't learn XOR (it can [just not as perceptrons were defined]).

          Is there anyone in particular whose work focuses on this that you know of?

      • yobbo a day ago

        If there is a weight update, there is a gradient, and a loss objective. You might not write them down explicitly.

        I can't recall exactly what the Hebbian update is, but something tells me it minimises the "reconstruction loss", and effectively learns the PCA matrix.
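
        That would be Oja's rule: the Hebbian update plus a decay term, whose fixed point is the first principal component. A quick numpy sketch to illustrate the claim (the data and learning rate here are arbitrary):

          import numpy as np

          rng = np.random.default_rng(0)
          X = rng.normal(size=(5000, 5))
          X[:, 0] *= 3.0                        # make the first axis the top PC

          w = rng.normal(size=5)
          w /= np.linalg.norm(w)

          for x in X:
              y = w @ x
              w += 0.005 * y * (x - y * w)      # Oja's rule: Hebbian term + decay

          # w now points (up to sign) along the first principal component, i.e. it
          # minimises the rank-one reconstruction loss ||x - (w.x) w||^2 on average.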

        • erikerikson a day ago

          > loss objective

          There is no prediction or desired output, at least not explicitly. I was playing with these things in my work to try to understand how our brains give rise to intelligence, rather than to solve some classification or related problem. What I managed to replicate was the learning of XOR by some nodes, and further that multidimensional XORs up to the number of inputs could be learned.

          Perhaps you can say that something PCA-ish is the implicit objective/result, but I still reject that there is any conceptual notion of what a node "should" output, even if iteratively applying the learning rule leads us there.

        • orbifold a day ago

          Not every vector field has a potential. So not every weight update can be written as a gradient.
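
          A concrete example: the 2-D update field Δw = (-w2, w1) has ∂Δw1/∂w2 = -1 but ∂Δw2/∂w1 = +1, so its Jacobian isn't symmetric and it cannot be the gradient of any scalar loss -- it just rotates the weights around the origin.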

      • HarHarVeryFunny a day ago

        Even with Hebbian learning, isn't there a synapse strength? If so, then you at least need a direction (+/-) if not a specific gradient value.

        • erikerikson a day ago

          Yes, there is a weight on every connection. At least when I was at it, gradients were talked about in reference to the solution space (e.g. gradient descent). The implication is that there is some notion of what is "correct" for some neuron to output, and then we bend it to our will by updating the weight. In Hebbian learning there isn't a notion of a correct activation, just a calculation over the local environment.

    • bob1029 a day ago

      In genetic algorithms, any gradient found would be implied by way of the fitness function and would not be something to inherently pursue. There are no free lunches like with the chain rule of calculus.

      GP is essentially isomorphic to beam search where the population is the beam. It is a fancy search algorithm; it is not "training" anything.

      • f_devd a day ago

        True, in genetic algorithms the gradients are only implied, but those implied gradients are exploited in the more successful evolution strategies. So while they might not look like backprop gradients (because they're not used in a continuous descent), when aggregated they work very much like them, although they represent a smoother function.
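
        A rough numpy sketch of that aggregation, in the spirit of the OpenAI-ES / natural-evolution-strategies estimator (the fitness function and sizes here are placeholders): the fitness-weighted average of random perturbations behaves like a smoothed gradient obtained from evaluations alone.

          import numpy as np

          def es_gradient(fitness, theta, sigma=0.1, pop=200, seed=0):
              # Estimate grad E[fitness(theta + sigma * eps)] without backprop:
              # score random perturbations, then aggregate them.
              rng = np.random.default_rng(seed)
              eps = rng.normal(size=(pop, theta.size))
              scores = np.array([fitness(theta + sigma * e) for e in eps])
              scores = scores - scores.mean()              # variance-reducing baseline
              return (eps * scores[:, None]).mean(axis=0) / sigma

          theta = np.zeros(3)
          for i in range(200):
              g = es_gradient(lambda t: -np.sum((t - 1.0) ** 2), theta, seed=i)
              theta += 0.1 * g
          # theta drifts towards the maximiser [1, 1, 1] using only fitness evaluations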

    • gsf_emergency_2 a day ago

      GP glancing at the pseudo-code is certainly an efficient way to dismiss an article, but something tells me he missed the crucial sentence in the abstract:

      >"We believe this work takes a first step TOWARDS introducing a new family of GRADIENT-FREE learning methods"

      I.e. for the time being, the authors can't convince themselves not to take advantage of efficient hardware for taking gradients

      (*Checks that Oxford University is not under sanctions*)

  • scarmig a day ago

    Check out feedback alignment. You provide feedback to earlier layers via a random, static linear transformation of the loss, and they eventually align with the feedback matrix to enable learning.

    It's certifiably insane that it works at all. And not even vaguely backprop, though if you really wanted to stretch the definition I guess you could say that the feedforward layers align to take advantage of a synthetic gradient in a way that approximates backprop.
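
    For anyone curious, here is a bare-bones numpy sketch of the trick for one hidden layer (my own toy illustration, loosely following the original feedback-alignment setup; shapes and learning rate are arbitrary). The only change from backprop is that the error is sent backwards through a fixed random matrix B rather than through W2's transpose:

      import numpy as np

      rng = np.random.default_rng(0)
      W1 = rng.normal(scale=0.1, size=(64, 32))   # input -> hidden
      W2 = rng.normal(scale=0.1, size=(32, 10))   # hidden -> output
      B  = rng.normal(scale=0.1, size=(10, 32))   # fixed random feedback, never trained

      def step(x, target, lr=0.01):
          global W1, W2
          h = np.tanh(x @ W1)                      # forward pass
          y = h @ W2
          e = y - target                           # output error
          # Backprop would use e @ W2.T here; feedback alignment uses the
          # fixed random B instead, and the forward weights align to it.
          dh = (e @ B) * (1 - h ** 2)
          W2 -= lr * np.outer(h, e)
          W1 -= lr * np.outer(x, dh)
          return 0.5 * np.sum(e ** 2)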

  • arrakark 2 days ago

    Same.

    If I had to guess it's just local gradients, not an end-to-end gradient.

itsthecourier 2 days ago

"Years of works of the genetic algorithms community came to the conclusion that if you can compute a gradient then you should use it in a way or another.

If you go for toy experiments you can brute-force the optimization. Is it efficient? Hell no."

sriku 2 days ago

Posted on 31st March, which would've been 1st April somewhere else in the world?