<?xml version="1.0" encoding="UTF-8"?><rss xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:atom="http://www.w3.org/2005/Atom" version="2.0" xmlns:itunes="http://www.itunes.com/dtds/podcast-1.0.dtd" xmlns:googleplay="http://www.google.com/schemas/play-podcasts/1.0"><channel><title><![CDATA[Connor's Substack]]></title><description><![CDATA[to inspire others as others have inspired me]]></description><link>https://www.connorjdavis.com</link><image><url>https://substackcdn.com/image/fetch/$s_!ZbOo!,w_256,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F16ac54db-581c-4d87-897b-1a07019f089d_1280x1280.png</url><title>Connor&apos;s Substack</title><link>https://www.connorjdavis.com</link></image><generator>Substack</generator><lastBuildDate>Tue, 05 May 2026 07:58:20 GMT</lastBuildDate><atom:link href="https://www.connorjdavis.com/feed" rel="self" type="application/rss+xml"/><copyright><![CDATA[Connor Davis]]></copyright><language><![CDATA[en]]></language><webMaster><![CDATA[connorjdavis@substack.com]]></webMaster><itunes:owner><itunes:email><![CDATA[connorjdavis@substack.com]]></itunes:email><itunes:name><![CDATA[Connor Davis]]></itunes:name></itunes:owner><itunes:author><![CDATA[Connor Davis]]></itunes:author><googleplay:owner><![CDATA[connorjdavis@substack.com]]></googleplay:owner><googleplay:email><![CDATA[connorjdavis@substack.com]]></googleplay:email><googleplay:author><![CDATA[Connor Davis]]></googleplay:author><itunes:block><![CDATA[Yes]]></itunes:block><item><title><![CDATA[Intuitions for Transformer Circuits]]></title><description><![CDATA[A mental model for addressing the residual stream]]></description><link>https://www.connorjdavis.com/p/intuitions-for-transformer-circuits</link><guid isPermaLink="false">https://www.connorjdavis.com/p/intuitions-for-transformer-circuits</guid><dc:creator><![CDATA[Connor 
Davis]]></dc:creator><pubDate>Mon, 23 Mar 2026 00:57:56 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!5wXL!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2ef1b05a-5fb2-4bc4-8082-1ad7d8b5cf47_2371x2006.png" length="0" type="image/png"/><content:encoded><![CDATA[<p>In a <a href="https://connorjdavis.substack.com/p/language-modeling-part-6-transformers">previous post</a> on language modeling, I implemented a GPT-style transformer. Lately I&#8217;ve been learning <em>mechanistic interpretability</em> to go deeper and understand why the transformer works on a mathematical level.</p><p>This post is a brain dump of what I&#8217;ve learned so far after reading <a href="https://transformer-circuits.pub/2021/framework/index.html">A Mathematical Framework for Transformer Circuits</a> (herein: &#8220;<em>Framework</em>&#8221;) and working through the <a href="https://learn.arena.education/chapter1_transformer_interp/02_intro_mech_interp/">Intro to Mech Interp</a> section on <a href="https://learn.arena.education/">ARENA</a>. My goal is to describe my current intuition for the paper, especially the parts I found confusing, so that perhaps my take can help others gain clarity on those areas as well.</p><p>First, a brief aside on my overall motivation for working on this stuff. Mechanistic interpretability (MI/mech interp) is the study of ML model internals that aims to understand, from first principles, why models behave and work as they do. You can think of it as the machine learning analogue of reverse engineering software. It is similar in spirit to the science of biological neural networks, but applied to artificial neural networks instead.</p><p>MI is part of the broader field of interpretability, which in turn serves the field of AI alignment. Alignment strives to make our large AI models aligned with human values. 
Basically, the overall goal is to understand and control the models before they control us: to ensure that they don&#8217;t engage in harmful, deceptive, dangerous, or subversive behavior. Unfortunately, we live in a world where large language models have <a href="https://arxiv.org/pdf/2504.18412">encouraged &#8220;successful&#8221; suicide</a>, <a href="https://arxiv.org/pdf/2510.05179">engaged in blackmail for self-preservation</a>, and <a href="https://arxiv.org/pdf/2502.17424">asserted humans should be enslaved by AI</a>. This version of reality is unacceptable to me.</p><p>And as if that weren&#8217;t enough, we don&#8217;t even understand <em>why</em> these models do what they do. They are arguably the only man-made technology in history that we don&#8217;t fully understand from first principles. Given this state of reality, I think that alignment is one of the most important problems we face today and one we have to get right. As a personal bonus, the alignment problem is as fascinating as it is important. It provides an outlet for me to leverage my specific technical skills and interests towards a meaningful cause. It is also extremely difficult, and I like a good challenge.</p><p>Ok, now back to the originally scheduled programming.</p><h2>Attention-Only Transformers</h2><p><em>Framework</em> does a deep dive into the key components of a simplified transformer-based language model. 
It analyzes transformer blocks that only have multi-head attention. This means no MLPs and no layernorms. This leaves the token embedding and positional encoding at the beginning, followed by <em>n</em> layers of multi-head attention, followed by the unembedding at the end. Here is a picture of a single-layer transformer with only one attention head:</p><div class="captioned-image-container"><figure><img src="https://substackcdn.com/image/fetch/$s_!5wXL!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2ef1b05a-5fb2-4bc4-8082-1ad7d8b5cf47_2371x2006.png" alt="Diagram of a single-layer, attention-only transformer with one attention head"></figure></div><p>My goal in this post is not to re-derive all the math, because the <em>Framework</em> paper does a better job, and Neel Nanda&#8217;s <a href="https://www.youtube.com/watch?v=KV5gbOmHbjU&amp;t=6460s">walkthrough of the paper on YouTube</a> is also good for that (although this material only really started to click for me after I worked through the <a href="https://learn.arena.education/chapter1_transformer_interp/02_intro_mech_interp/1-transformerlens-introduction/">&#8220;Intro to Mech Interp&#8221; problems on ARENA</a>, which I recommend doing if you are actually interested in doing this stuff yourself).</p><p>Instead I want to share how I conceptualize the most important takeaways, especially the areas I found confusing at first, so that if you share the same confusion my take might bring some clarity. 
In my view, the most important concepts to understand from this paper are the residual stream, attention, circuits, and induction heads.</h3><h3>The Residual Stream</h3><p>Mathematically, the residual stream is a high-dimensional vector space. You will usually see the dimension of the residual stream specified as <code>d_model</code> in GPT-related papers and code. For example, GPT2-small uses a <code>d_model</code> of 768.</p><p>Conceptually, the residual stream is like shared memory, used much like the DRAM in your computer. Different components of the model (attention heads, MLPs, etc.) perform loads from and stores to that memory. The loads and stores occur sequentially through the forward pass, one layer at a time; however, each component within a given layer loads and stores in parallel with the others. The model learns to carve out subspaces in this vector space, which helps prevent components from clobbering what previous components have written. The residual stream itself doesn&#8217;t do any computation; it serves as a shared medium through which layers communicate with each other.</p><p>We can get a sense of the size of the subspace a component uses by running PCA on the appropriate weights. Below are the PCA eigenspectra of the embedding and positional encoding weights from a 2-layer, attention-only model (the link to all code for this post is <a href="https://colab.research.google.com/drive/1Ct49OXofqr6Zi1gkXCAx-P2odT_25BtH#scrollTo=YY0W8XHncR0c">here</a>). The first plot shows the top 100 eigenvalues. 
The second shows the cumulative variance explained:</p><div class="captioned-image-container"><figure><img src="https://substackcdn.com/image/fetch/$s_!ndVk!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc2c84310-95ea-404a-a383-13367b619f17_1120x902.png" alt="Top 100 PCA eigenvalues of the embedding and positional encoding weights"></figure></div><div class="captioned-image-container"><figure><img src="https://substackcdn.com/image/fetch/$s_!bOON!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa7de8077-9365-49c8-8fca-78cbc4edc3ac_1158x900.png" alt="Cumulative variance explained by the principal components of the embedding and positional encoding weights"></figure></div><p>So about 80% of the embedding variation lives in a 350-dimensional subspace of <code>d_model</code>. This is fairly large, given that <code>d_model</code> is 768. Compare that to the positional encoding, which is essentially explained by only 5 directions.</p><p>When I was presented with this view of the residual stream, my mind immediately started asking how far we can push the memory analogy. Having worked in computer security for a decade, it made me wonder whether there is an analogue to page tables and memory permissions. Could we bring the concepts of userspace and kernelspace to prevent &#8220;privileged&#8221; subspaces from being accessed by &#8220;unprivileged&#8221; subspaces? Would this be useful for, e.g., preventing an untrusted user from exfiltrating dangerous content from a privileged subspace?</p><p>But I&#8217;m getting ahead of myself. Let&#8217;s start with a simpler question: how does addressing work for the residual stream? In order to access a memory location, you have to have an address. Residual stream addresses can be decomposed into two logical parts, <code>token:subspace</code>, much like the classic <code>segment:offset</code> logical address from the x86 architecture. One major difference is that a traditional memory address is deterministic, in the sense that only one value from one location is loaded. 
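</p><p>To make that contrast concrete, here is a toy numpy sketch (numbers made up for illustration, not taken from a real model): a conventional load uses a one-hot address and returns exactly one row, while a &#8220;soft&#8221; load returns a probability-weighted blend of rows.</p>

```python
import numpy as np

# Toy "residual stream": 3 token positions, d_model = 4 (made-up numbers)
resid = np.array([
    [1.0, 0.0, 0.0, 0.0],  # token 0
    [0.0, 2.0, 0.0, 0.0],  # token 1
    [0.0, 0.0, 3.0, 0.0],  # token 2
])

# Traditional (hard) address: a one-hot vector loads exactly one row
hard_addr = np.array([0.0, 1.0, 0.0])
hard_read = hard_addr @ resid  # identical to resid[1]

# "Soft" address: a probability distribution over rows loads a blend
soft_addr = np.array([0.1, 0.6, 0.3])
soft_read = soft_addr @ resid  # 0.1*row0 + 0.6*row1 + 0.3*row2
```

<p>The soft read is exactly the operation that one row of an attention pattern performs.</p><p>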
Addresses into the residual stream are &#8220;soft&#8221;, in general specifying a <em>set</em> of locations to load according to some learned probability distribution.</p><h3>Attention</h3><p>Conceptually, attention computes the first part of the <code>token:subspace</code> address. The fundamental purpose of attention is to specify which source token <em>locations</em> to load information from. Each row in the attention matrix (see fake example below for tokens &#8216;T&#8217;, &#8216;h&#8217;, &#8216;e&#8217;, &#8216;i&#8217;, &#8216;r&#8217;) is the &#8220;soft&#8221; distribution over the source (i.e. key) token <em>indices</em> from which information will be moved into the destination token (i.e. query). 
</p><div class="captioned-image-container"><figure><img src="https://substackcdn.com/image/fetch/$s_!6Y2o!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6acbe35e-ee3b-46d3-8ea0-953b34524eba_1180x991.png" alt="Toy attention matrix for the tokens T, h, e, i, r"></figure></div><p>Let&#8217;s look at the extreme case, when one entry is 1 and all the others in the row are 0. This means that this head reads <em>some subspace(s)</em> of the source token&#8217;s (&#8216;T&#8217;) residual stream and copies it verbatim into <em>some subspace(s)</em> of the destination token&#8217;s (also &#8216;T&#8217;) residual stream. Since the attention weight is 1, there is only <em>one</em> source token position being read from. Otherwise the read is &#8220;spread out&#8221; over multiple source tokens according to the attention scores in each row. For example, the second query above (&#8216;h&#8217;) reads &#8220;30%&#8221; from token 0 (&#8216;T&#8217;) and &#8220;70%&#8221; from itself.</p><p>It is important to understand that attention is all about figuring out the token <em>indices</em> to read from. 
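</p><p>Here is a minimal numpy sketch of how such a row distribution arises (toy query/key/value vectors I made up; in a real head, Q, K, and V are derived from the residual stream via learned weight matrices). The causal mask ensures a query can only read from its own and earlier positions:</p>

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

# Toy vectors for 3 token positions, d_head = 2 (made-up numbers)
Q = np.array([[1.0, 0.0], [0.5, 0.5], [0.0, 1.0]])  # queries
K = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])  # keys
V = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])  # values

scores = Q @ K.T  # (3, 3) query-key dot products
# Causal mask: a destination token cannot attend to later source tokens
scores[np.triu(np.ones_like(scores, dtype=bool), k=1)] = -np.inf
pattern = softmax(scores)  # each row is a distribution over source indices

# Information movement: each destination gets a pattern-weighted blend of values
z = pattern @ V
```

<p>Each row of <code>pattern</code> sums to 1 and plays the role of the soft <code>token</code> address; <code>pattern @ V</code> is the corresponding weighted read.</p><p>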
If we look at the residual stream as a two-dimensional memory array, then attention probabilistically selects rows of this memory for each query. For example, the third query above (&#8216;e&#8217;) would have a <code>token</code> address that looks something like 0.1,0.6,0.3:</p><div class="captioned-image-container"><figure><img src="https://substackcdn.com/image/fetch/$s_!rjkE!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6d4d37dd-f3e6-4f24-8235-524ebc604ca2_1800x1039.png" alt="Attention as a soft selection of rows of the residual stream memory"></figure></div><p>So the <code>token</code> part of the address selects the rows of the residual stream via attention. What about the <code>subspace</code> part? How is it computed? Once we have this part, we can determine the actual <em>value</em> that is stored into the destination token&#8217;s location. To answer this we need to understand circuits.</p><h3>Circuits</h3><p>Conceptually, circuits are particular paths along which information flows through the model. It is not too far off to think of them as the ML analogue of the electrical circuits you find on a PCB. They have inputs, do some computation, and produce outputs. 
In simplified attention-only models, circuits are mathematically tractable to analyze because the transformer&#8217;s structure is then mostly linear (and completely linear if the attention patterns are held constant).</p><p>The two basic circuits to know are the QK circuit and the OV circuit. The QK circuit is a bilinear form, meaning it is linear in each of two input variables. In self-attention, the two input variables are the same, but are interpreted as distinct queries and keys. The OV circuit, by contrast, is linear in a single input variable. In every case the input is the same: the residual stream. We will refine this further in the next sections.</p><h4>QK Circuit</h4><p>Recall each attention head has its own <code>W_Q</code> and <code>W_K</code> weight matrices. Together these form a bilinear operator that produces the attention pattern for that head. Mathematically this looks like:</p><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;A = xW_{Q}W_{K}^Tx^T&quot;,&quot;id&quot;:&quot;OEAVHOUSDE&quot;}" data-component-name="LatexBlockToDOM"></div><p>where <code>W_Q</code> and <code>W_K</code> are each learned weights of shape <code>(d_model, d_head)</code> (their product is also called <code>W_QK</code>) and <code>x</code> is the residual stream of shape <code>(seq_len, d_model)</code>. When you multiply this out (and apply the causal mask and a row-wise softmax), you get the attention pattern. So attention is more of an activation than a weight, since it depends on the input sequence. The attention queries are computed on the left and the keys on the right. If a query &#8220;pays attention&#8221; to a key, their dot product will be high. This causes data from the key&#8217;s residual stream to be moved into the query&#8217;s residual stream. But what data will actually be moved? This is where the OV circuit comes in.</p><h4>OV Circuit</h4><p>The final input of the head is the <code>W_V</code> weight matrix. 
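</p><p>To make the QK expression concrete, here is a minimal NumPy sketch of how a single head&#8217;s attention pattern is computed (toy sizes and random weights; the scaling, causal mask, and softmax are the standard GPT-style steps):</p>

```python
import numpy as np

def softmax(z, axis=-1):
    e = np.exp(z - z.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

rng = np.random.default_rng(0)
seq_len, d_model, d_head = 4, 8, 2            # toy sizes

x = rng.normal(size=(seq_len, d_model))       # residual stream
W_Q = rng.normal(size=(d_model, d_head))
W_K = rng.normal(size=(d_model, d_head))

# Bilinear QK circuit: one score per (query, key) pair.
scores = x @ W_Q @ W_K.T @ x.T / np.sqrt(d_head)

# Causal mask (queries cannot look ahead) + row-wise softmax give the pattern A.
scores[np.triu(np.ones((seq_len, seq_len), dtype=bool), k=1)] = -np.inf
A = softmax(scores)
assert np.allclose(A.sum(axis=-1), 1.0)       # each query's reads sum to 100%
```

<p>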
It reads from the residual stream, and the head writes back to the residual stream via the <code>W_O</code> matrix. <code>W_V</code> has shape <code>(d_model, d_head)</code> and <code>W_O</code> has shape <code>(d_head, d_model)</code>; their product is referred to as <code>W_OV</code>. This is what the OV circuit looks like mathematically:</p><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;xW_{V}W_{O}&quot;,&quot;id&quot;:&quot;JBYOYLTUKK&quot;}" data-component-name="LatexBlockToDOM"></div><p>The value read by <code>W_V</code> determines what gets written back to the residual stream, <em>if</em> that token is attended to by a particular query. The final expression for the entire head, combining attention and the OV circuit, is:</p><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;AxW_{V}W_{O}&quot;,&quot;id&quot;:&quot;RZGNUCBLBT&quot;}" data-component-name="LatexBlockToDOM"></div><p>Now that we have some common footing in the math, we can move on to developing some intuition for how circuits work. This is also where the <code>subspace</code> part of the residual stream address comes into play.</p><h4>Subspace Scores</h4><p>We know that the QK and OV circuits both read from the residual stream. But how do they choose what to read? This is determined by what I call <em>subspace scores</em>. In the <em>Framework</em> paper these are called <em>virtual weights</em>, and in the ARENA walkthrough they are called <em>composition scores</em>. 
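</p><p>Before digging into these scores, the full head expression above is easy to sketch end to end (again with toy sizes and random weights, for illustration only):</p>

```python
import numpy as np

rng = np.random.default_rng(0)
seq_len, d_model, d_head = 4, 8, 2            # toy sizes

x = rng.normal(size=(seq_len, d_model))       # residual stream
W_V = rng.normal(size=(d_model, d_head))
W_O = rng.normal(size=(d_head, d_model))

# Any valid attention pattern works here; use uniform attention for simplicity.
A = np.full((seq_len, seq_len), 1 / seq_len)

head_out = A @ x @ W_V @ W_O                  # A x W_V W_O, written back to the stream
assert head_out.shape == (seq_len, d_model)

# W_OV = W_V W_O is low-rank: the head can only write a d_head-dim subspace.
assert np.linalg.matrix_rank(W_V @ W_O) == d_head
```

<p>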
These scores are implicitly learned by the model in order to read from <em>particular subspaces</em> from the residual stream:</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!b8_E!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcdfefa02-5f39-4aef-ac07-8a457967aa24_1800x1039.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!b8_E!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcdfefa02-5f39-4aef-ac07-8a457967aa24_1800x1039.png 424w, https://substackcdn.com/image/fetch/$s_!b8_E!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcdfefa02-5f39-4aef-ac07-8a457967aa24_1800x1039.png 848w, https://substackcdn.com/image/fetch/$s_!b8_E!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcdfefa02-5f39-4aef-ac07-8a457967aa24_1800x1039.png 1272w, https://substackcdn.com/image/fetch/$s_!b8_E!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcdfefa02-5f39-4aef-ac07-8a457967aa24_1800x1039.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!b8_E!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcdfefa02-5f39-4aef-ac07-8a457967aa24_1800x1039.png" width="724" height="417.6923076923077" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/cdfefa02-5f39-4aef-ac07-8a457967aa24_1800x1039.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:840,&quot;width&quot;:1456,&quot;resizeWidth&quot;:724,&quot;bytes&quot;:205922,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://connorjdavis.substack.com/i/190780811?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcdfefa02-5f39-4aef-ac07-8a457967aa24_1800x1039.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!b8_E!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcdfefa02-5f39-4aef-ac07-8a457967aa24_1800x1039.png 424w, https://substackcdn.com/image/fetch/$s_!b8_E!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcdfefa02-5f39-4aef-ac07-8a457967aa24_1800x1039.png 848w, https://substackcdn.com/image/fetch/$s_!b8_E!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcdfefa02-5f39-4aef-ac07-8a457967aa24_1800x1039.png 1272w, https://substackcdn.com/image/fetch/$s_!b8_E!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcdfefa02-5f39-4aef-ac07-8a457967aa24_1800x1039.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" 
height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>While attention scores are learned indices into the rows of the residual stream, subspace scores are learned &#8220;coefficients&#8221; that provide a soft index into the &#8220;column dimension&#8221; of the residual stream. The model is able to do this because the <code>W_QK</code> and <code>W_OV</code> matrices are low-rank: d_head is conventionally much smaller than d_model. This allows for low-dimensional subspaces to be used for different purposes. Each component that reads from the residual stream learns to read from a distinct linear combination of subspaces.</p><p>To see this in action, lets look at head 7 from layer 0 from an attention-only, 2-layer transformer. Below is the attention pattern from this head on the input sequence &#8220;the cat sat on the mat. 
the dog sat on the log.&#8221;:</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!4agE!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe1370406-dc32-465e-86d2-6be811564f0e_1564x1422.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!4agE!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe1370406-dc32-465e-86d2-6be811564f0e_1564x1422.png 424w, https://substackcdn.com/image/fetch/$s_!4agE!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe1370406-dc32-465e-86d2-6be811564f0e_1564x1422.png 848w, https://substackcdn.com/image/fetch/$s_!4agE!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe1370406-dc32-465e-86d2-6be811564f0e_1564x1422.png 1272w, https://substackcdn.com/image/fetch/$s_!4agE!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe1370406-dc32-465e-86d2-6be811564f0e_1564x1422.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!4agE!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe1370406-dc32-465e-86d2-6be811564f0e_1564x1422.png" width="724" height="658.3626373626373" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/e1370406-dc32-465e-86d2-6be811564f0e_1564x1422.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1324,&quot;width&quot;:1456,&quot;resizeWidth&quot;:724,&quot;bytes&quot;:164991,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://connorjdavis.substack.com/i/190780811?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe1370406-dc32-465e-86d2-6be811564f0e_1564x1422.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!4agE!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe1370406-dc32-465e-86d2-6be811564f0e_1564x1422.png 424w, https://substackcdn.com/image/fetch/$s_!4agE!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe1370406-dc32-465e-86d2-6be811564f0e_1564x1422.png 848w, https://substackcdn.com/image/fetch/$s_!4agE!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe1370406-dc32-465e-86d2-6be811564f0e_1564x1422.png 1272w, https://substackcdn.com/image/fetch/$s_!4agE!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe1370406-dc32-465e-86d2-6be811564f0e_1564x1422.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" 
height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>If you stare at this long enough, you can see that this head is attending to the previous token (except for the first token, which can only attend to itself). </p><p>So, here&#8217;s a question: What subspaces would the QK circuit of this head need to read from in order to create this pattern? 
First, let&#8217;s just look at the state of the residual stream as seen from the layer 0 heads&#8217; perspective:</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!uSaR!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3b348586-a432-4612-92d2-eedb7db48f63_2565x1868.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!uSaR!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3b348586-a432-4612-92d2-eedb7db48f63_2565x1868.png 424w, https://substackcdn.com/image/fetch/$s_!uSaR!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3b348586-a432-4612-92d2-eedb7db48f63_2565x1868.png 848w, https://substackcdn.com/image/fetch/$s_!uSaR!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3b348586-a432-4612-92d2-eedb7db48f63_2565x1868.png 1272w, https://substackcdn.com/image/fetch/$s_!uSaR!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3b348586-a432-4612-92d2-eedb7db48f63_2565x1868.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!uSaR!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3b348586-a432-4612-92d2-eedb7db48f63_2565x1868.png" width="718" height="522.7197802197802" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/3b348586-a432-4612-92d2-eedb7db48f63_2565x1868.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1060,&quot;width&quot;:1456,&quot;resizeWidth&quot;:718,&quot;bytes&quot;:280148,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://connorjdavis.substack.com/i/190780811?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3b348586-a432-4612-92d2-eedb7db48f63_2565x1868.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!uSaR!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3b348586-a432-4612-92d2-eedb7db48f63_2565x1868.png 424w, https://substackcdn.com/image/fetch/$s_!uSaR!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3b348586-a432-4612-92d2-eedb7db48f63_2565x1868.png 848w, https://substackcdn.com/image/fetch/$s_!uSaR!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3b348586-a432-4612-92d2-eedb7db48f63_2565x1868.png 1272w, https://substackcdn.com/image/fetch/$s_!uSaR!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3b348586-a432-4612-92d2-eedb7db48f63_2565x1868.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" 
height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>The layer 0 heads only have two options: the embedding or the positional encoding. Since &#8220;previous token&#8221; doesn&#8217;t depend on what the token is, but is just positional information, we would expect head 7 to learn a higher subspace score for the positional encoding subspace relative to the embedding subspace. </p><p>Is there a way we can quantify this from the actual model? It turns out there is. 
The paper and ARENA walkthrough propose using a ratio involving the Frobenius norms between the output of the previous layer and the input of the subsequent layer:</p><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;\\frac{||W_{A}W_{B}||_{F}}{||W_{A}||_{F}||W_{B}||_{F}}&quot;,&quot;id&quot;:&quot;PPJNNQBOII&quot;}" data-component-name="LatexBlockToDOM"></div><p>where W_A is the output and W_B is the input. A detailed justification for using this measure is given in <a href="https://learn.arena.education/chapter1_transformer_interp/02_intro_mech_interp/4-reverse-engineering-induction-circuits/#composition-scores">ARENA</a>. The justification is based on the SVD. If you do an SVD for each term, the numerator ends up containing a cosine similarity between the right singular output vectors and the left singular input vectors, so the norm is maximized when the output and input are aligned. Here are the subspace scores between the embedding and positional encodings against each layer 0 head&#8217;s QK circuit:</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!PVR3!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F428bd3c7-910d-4905-ae73-b5ceb853e02f_1140x766.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!PVR3!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F428bd3c7-910d-4905-ae73-b5ceb853e02f_1140x766.png 424w, https://substackcdn.com/image/fetch/$s_!PVR3!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F428bd3c7-910d-4905-ae73-b5ceb853e02f_1140x766.png 848w, 
https://substackcdn.com/image/fetch/$s_!PVR3!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F428bd3c7-910d-4905-ae73-b5ceb853e02f_1140x766.png 1272w, https://substackcdn.com/image/fetch/$s_!PVR3!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F428bd3c7-910d-4905-ae73-b5ceb853e02f_1140x766.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!PVR3!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F428bd3c7-910d-4905-ae73-b5ceb853e02f_1140x766.png" width="666" height="447.5052631578947" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/428bd3c7-910d-4905-ae73-b5ceb853e02f_1140x766.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:766,&quot;width&quot;:1140,&quot;resizeWidth&quot;:666,&quot;bytes&quot;:92131,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://connorjdavis.substack.com/i/190780811?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F428bd3c7-910d-4905-ae73-b5ceb853e02f_1140x766.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!PVR3!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F428bd3c7-910d-4905-ae73-b5ceb853e02f_1140x766.png 424w, 
https://substackcdn.com/image/fetch/$s_!PVR3!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F428bd3c7-910d-4905-ae73-b5ceb853e02f_1140x766.png 848w, https://substackcdn.com/image/fetch/$s_!PVR3!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F428bd3c7-910d-4905-ae73-b5ceb853e02f_1140x766.png 1272w, https://substackcdn.com/image/fetch/$s_!PVR3!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F428bd3c7-910d-4905-ae73-b5ceb853e02f_1140x766.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line 
x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><div class="captioned-image-container"><figure><img src="https://substackcdn.com/image/fetch/$s_!qX9u!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9d7706ed-135a-4784-a415-38f503f8bf18_1140x790.png" width="666" height="461.5263157894737" alt="" loading="lazy"></figure></div><p>We can see a general pattern where the layer 0 heads are mostly reading from the positional subspace. Head 7 stands out in particular, especially relative to the embedding subspace. Since the justification for the Frobenius norm score is that it measures the alignment between output and input, we should be able to rotate the output and observe the subspace score drop.
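To make the subspace score concrete, here is a toy numpy sketch of the Frobenius-norm alignment idea (my own reimplementation for illustration, not the exact code behind these heatmaps; all names are mine):

```python
import numpy as np

def subspace_score(W_in: np.ndarray, W: np.ndarray) -> float:
    """Frobenius-norm alignment between an input subspace and a weight matrix,
    in the spirit of the composition scores from the Framework paper."""
    return float(np.linalg.norm(W_in @ W) / (np.linalg.norm(W_in) * np.linalg.norm(W)))

rng = np.random.default_rng(0)
d_model, d_sub = 256, 8

# Random orthonormal basis for a "positional" subspace (rows are the basis).
P = np.linalg.qr(rng.normal(size=(d_model, d_sub)))[0].T

# A matrix that reads exactly from P scores much higher than one that reads
# from an unrelated random subspace Q.
W_reads_P = P.T @ rng.normal(size=(d_sub, d_model))
Q = np.linalg.qr(rng.normal(size=(d_model, d_sub)))[0].T
W_reads_Q = Q.T @ rng.normal(size=(d_sub, d_model))

print(subspace_score(P, W_reads_P))  # high: the matrix reads from P
print(subspace_score(P, W_reads_Q))  # low: P is nearly orthogonal to Q
```

Swapping in an unrelated subspace, or rotating the input away from the subspace the matrix actually reads, is exactly the kind of intervention that should tank the score.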
Check out the scores after rotating the positional encoding by 180 degrees:</p><div class="captioned-image-container"><figure><img src="https://substackcdn.com/image/fetch/$s_!wxxw!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff0b88660-e9b1-41c9-bcee-bb8290de8e4a_1140x772.png" width="666" height="451.0105263157895" alt="" loading="lazy"></figure></div><p>So we can see that the QK circuit of head 7 is mostly reading from the positional subspace. This determines which source token(s) will be attended to for each query. But what about the value that is loaded from the source token(s) and written into the destination query&#8217;s residual stream? This is determined by the subspace score of the head&#8217;s OV circuit. Again, for heads in layer 0, there are only two possibilities: the embedding or positional encoding.
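As a concrete (if cartoonish) picture of what it means for an OV circuit to &#8220;read from&#8221; one of these two subspaces, here is a toy sketch with synthetic weights; in a real run the weights would come from TransformerLens (e.g. <code>model.W_V[0, head]</code> and <code>model.W_O[0, head]</code>), and the cleanly disjoint halves below are an idealization:

```python
import numpy as np

rng = np.random.default_rng(1)
d_model, d_head, n_ctx, d_vocab = 128, 16, 64, 512
half = d_model // 2

# Idealized layer-0 residual stream: the embedding and the positional
# encoding occupy disjoint halves of d_model (real models are fuzzier).
W_E = np.zeros((d_vocab, d_model))
W_E[:, :half] = rng.normal(size=(d_vocab, half))
W_pos = np.zeros((n_ctx, d_model))
W_pos[:, half:] = rng.normal(size=(n_ctx, half))

# A head whose value projection only reads the embedding half.
W_V = np.zeros((d_model, d_head))
W_V[:half] = rng.normal(size=(half, d_head))
W_O = rng.normal(size=(d_head, d_model))
W_OV = W_V @ W_O  # the head's full OV circuit: d_model -> d_model

def subspace_score(W_in, W):
    # Frobenius-norm alignment between an input matrix and the circuit.
    return float(np.linalg.norm(W_in @ W) / (np.linalg.norm(W_in) * np.linalg.norm(W)))

print("embedding :", subspace_score(W_E, W_OV))    # clearly nonzero
print("positional:", subspace_score(W_pos, W_OV))  # zero: OV ignores that half
```

A head whose OV circuit scores like this copies the token embedding forward while ignoring position entirely.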
Here are the OV subspace scores for each head:</p><div class="captioned-image-container"><figure><img src="https://substackcdn.com/image/fetch/$s_!mjZv!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2aedcc9e-14e9-4dce-8ab2-de2f01fc2f3f_1164x772.png" width="667" height="442.3745704467354" alt="" loading="lazy"></figure></div><p>Head 7&#8217;s OV circuit scores higher with the embedding than with the positional encoding. This means that head 7 will add the <em>embedding</em> of the <em>previous token</em> into the <em>current token&#8217;s </em>residual stream. Given our example &#8220;the cat sat on the mat. 
the dog sat on the log.&#8221;, the residual stream of token &#8220;cat&#8221; will look like this after the forward pass through layer 0:</p><div class="captioned-image-container"><figure><img src="https://substackcdn.com/image/fetch/$s_!3kho!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F398943d2-ad83-4388-813d-116170d9c6e7_1680x881.png" width="712" height="373.6043956043956" alt="" loading="lazy"></figure></div><p>Hopefully this <code>token:subspace</code> discussion has provided some intuition for how the various model components interact with each other through the residual stream. It is not a perfect model. For one, there is not really a clean, distinct set of orthogonal subspaces being selected, especially in larger real-world models. Also, as models scale up, so does the number of subspaces that a given layer has to &#8220;choose&#8221; from. It is unclear to me how many layers back a given layer can effectively communicate. This raises all sorts of questions, like: are there &#8220;repeater&#8221; layers that keep a signal alive? The <em>Framework</em> paper suggests some components may fill the role of memory cleanup. What other traditional memory management techniques can be found here? 
And what would it mean to impose security isolation techniques like &#8220;privilege rings&#8221; on the residual stream? Despite the residual fuzziness, I think this mental model is a useful entry point to start thinking about this stuff.</p><p>Now that we understand how the model addresses the residual stream, we can start to understand induction heads, which are just a particular combination of <code>token:subspace</code> addresses across heads in two adjacent layers.</p><h3>Induction Heads</h3><p>When a model learns induction, it learns a way to predict patterns such as A B &#8230; A __. Given the previous occurrence of A B, the induction head will predict B for the token after the subsequent A. What is cool is that this prediction depends solely on the in-context pattern rather than the particular values of A and B.</p><p>The <em>Framework</em> paper discusses a basic form of induction that occurs when a head in layer 1 composes with the output of a &#8220;previous-token head&#8221; from layer 0. The particular type of composition in this case is called &#8220;K-composition&#8221; because the key side of the head's QK circuit learns a high subspace score with the OV output from the previous-token head in layer 0. Keep in mind, each layer 1 head sees roughly 14 subspaces in the residual stream of each token: the embedding, the positional encoding, and the OV outputs of the 12 heads from layer 0.</p><p>When the induction head sees the second occurrence of A, it queries for keys which have <code>emb(A)</code> <em>in the particular subspace that was written by the previous-token head<strong>. 
</strong></em>This is different from the subspace that was written to by the original embedding, and hence has a different &#8220;offset&#8221; within the residual stream. If A B occurs only once before the second A, then the only key that satisfies this constraint is B, and therefore attention will be high on B. The induction head&#8217;s OV circuit learns a high subspace score with the subspace of B that was originally written to by the embedding. Therefore it will add <code>emb(B)</code> to the residual stream of the query (i.e. the second A). In the 2-layer, attention-only model, this <code>emb(B)</code> component dots highly with B&#8217;s column of the unembed matrix, resulting in a high logit value that pulls up the probability of B.</p><p>To get some more intuition, let&#8217;s look at some pictures. First, the attention pattern of the induction head itself. In the 2-layer model, there are actually two induction heads that compose with the previous-token head from layer 0. But we will just look at the first, head 4:</p><div class="captioned-image-container"><figure><img src="https://substackcdn.com/image/fetch/$s_!vK8d!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8cd3ea44-598e-49e4-ba26-b80bdd9631b2_1174x1050.png" width="717" height="641.2691652470187" alt="" loading="lazy"></figure></div><p>You can see that &#8220;by default&#8221; the head attends to the first token in the sequence, which is the special end-of-text token from the tokenizer. Later in the sequence, the attention forms an off-diagonal. If you look closely, you can see this is where some tokens A B are being repeated. For example, take A=sat and B=on. Then A B occurs twice in the sequence, so we would expect induction to happen here.</p><p>Before we look at the subspace scores, let&#8217;s think about what we expect to see. On the query side of the QK circuit, we should see a relatively high score for the embedding of the token: when the head sees the second A (e.g. token 10), it is querying based on the actual &#8220;value&#8221; of A, i.e. <code>emb(sat)</code>.</p><p>On the key side of the QK circuit, we need the token indices that have <code>emb(sat)</code> in the subspace written by the previous-token head. So the K subspace score should be high for that particular head (head 7). In this case, this would be the first <code>on</code> token (token 4 above).</p><p>Once we have the <code>token</code> index from attention (token 4), the V subspace score determines the particular subspace(s) to read from token 4 and write to the residual of the query (token 10). In this case this would be the embedding subspace of token 4.</p><p>One note: you&#8217;ll notice that the heatmaps below don&#8217;t have the positional encoding. 
This is because the particular 2-layer model I used for this uses the &#8220;shortformer&#8221; positional encoding option in <a href="https://transformerlensorg.github.io/TransformerLens/">TransformerLens</a>, meaning the positional encoding is added directly to the attention inputs rather than into the residual stream, so layer 1 heads don&#8217;t see a positional subspace in the residual stream.</p><p>Here are the subspace scores for the layer 1 heads:</p><div class="captioned-image-container"><figure><img src="https://substackcdn.com/image/fetch/$s_!6nPE!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F677e87fb-b25a-4aef-a641-ccc64324926b_1006x898.png" width="667" height="595.3936381709741" alt="" loading="lazy"></figure></div><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!2ktq!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F80df7f81-4e7d-4047-9a69-f4a1bad9782c_1006x898.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp"
srcset="https://substackcdn.com/image/fetch/$s_!2ktq!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F80df7f81-4e7d-4047-9a69-f4a1bad9782c_1006x898.png 424w, https://substackcdn.com/image/fetch/$s_!2ktq!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F80df7f81-4e7d-4047-9a69-f4a1bad9782c_1006x898.png 848w, https://substackcdn.com/image/fetch/$s_!2ktq!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F80df7f81-4e7d-4047-9a69-f4a1bad9782c_1006x898.png 1272w, https://substackcdn.com/image/fetch/$s_!2ktq!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F80df7f81-4e7d-4047-9a69-f4a1bad9782c_1006x898.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!2ktq!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F80df7f81-4e7d-4047-9a69-f4a1bad9782c_1006x898.png" width="664" height="592.7157057654075" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/80df7f81-4e7d-4047-9a69-f4a1bad9782c_1006x898.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:898,&quot;width&quot;:1006,&quot;resizeWidth&quot;:664,&quot;bytes&quot;:133122,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://connorjdavis.substack.com/i/190780811?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F80df7f81-4e7d-4047-9a69-f4a1bad9782c_1006x898.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" 
class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!2ktq!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F80df7f81-4e7d-4047-9a69-f4a1bad9782c_1006x898.png 424w, https://substackcdn.com/image/fetch/$s_!2ktq!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F80df7f81-4e7d-4047-9a69-f4a1bad9782c_1006x898.png 848w, https://substackcdn.com/image/fetch/$s_!2ktq!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F80df7f81-4e7d-4047-9a69-f4a1bad9782c_1006x898.png 1272w, https://substackcdn.com/image/fetch/$s_!2ktq!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F80df7f81-4e7d-4047-9a69-f4a1bad9782c_1006x898.png 1456w" sizes="100vw" loading="lazy"></picture></div></a></figure></div><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!lzuj!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1a3068ea-38e5-42cc-a2d6-b2881712666b_1006x898.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!lzuj!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1a3068ea-38e5-42cc-a2d6-b2881712666b_1006x898.png 424w, https://substackcdn.com/image/fetch/$s_!lzuj!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1a3068ea-38e5-42cc-a2d6-b2881712666b_1006x898.png 848w, https://substackcdn.com/image/fetch/$s_!lzuj!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1a3068ea-38e5-42cc-a2d6-b2881712666b_1006x898.png 1272w, https://substackcdn.com/image/fetch/$s_!lzuj!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1a3068ea-38e5-42cc-a2d6-b2881712666b_1006x898.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!lzuj!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1a3068ea-38e5-42cc-a2d6-b2881712666b_1006x898.png" width="660" height="589.1451292246521"
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/1a3068ea-38e5-42cc-a2d6-b2881712666b_1006x898.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:898,&quot;width&quot;:1006,&quot;resizeWidth&quot;:660,&quot;bytes&quot;:135788,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://connorjdavis.substack.com/i/190780811?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1a3068ea-38e5-42cc-a2d6-b2881712666b_1006x898.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!lzuj!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1a3068ea-38e5-42cc-a2d6-b2881712666b_1006x898.png 424w, https://substackcdn.com/image/fetch/$s_!lzuj!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1a3068ea-38e5-42cc-a2d6-b2881712666b_1006x898.png 848w, https://substackcdn.com/image/fetch/$s_!lzuj!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1a3068ea-38e5-42cc-a2d6-b2881712666b_1006x898.png 1272w, https://substackcdn.com/image/fetch/$s_!lzuj!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1a3068ea-38e5-42cc-a2d6-b2881712666b_1006x898.png 1456w" sizes="100vw" loading="lazy"></picture></div></a></figure></div><p>These are mostly in line with what we expected. The Q side scores highly with the embedding. The K side scores highly with L0.H7 in heads 4 and 10, which are the two induction heads. Interestingly, though, they also incorporate information from L0.H4, in both the query and key scores. I wonder what this head is doing! The V side is mostly aligned with the embedding, as expected.</p><h2>Conclusion</h2><p>Hopefully now you have some better intuition for how different components in a transformer interact with each other through the residual stream. Obviously we just looked at simplified models. But I think that the mental model of &#8220;residual stream as shared memory&#8221; is a useful one to begin thinking about this stuff. 
And if the residual stream is a shared memory, then understanding how that memory is addressed is a reasonable next step. </p><p>One point of clarification on the <code>token:subspace</code> address. In the attention section above, I said that attention computes the token part of the <code>token:subspace</code> address. Strictly speaking, this applies only to the OV circuit&#8217;s <code>token</code>. The query and key sides of the QK circuit both use an implicit <code>token</code>: whatever the &#8220;current&#8221; token is, with every position computed in parallel. The OV circuit, by contrast, doesn&#8217;t know which tokens to look at, so its <code>token</code> part of the address is supplied by the attention pattern from the QK circuit. Meanwhile, the Q, K, and V inputs of each head all learn their optimal <code>subspace</code> scores independently, completing the full two-part address needed to perform the head&#8217;s overall operation.</p>]]></content:encoded></item><item><title><![CDATA[Language Modeling, Part 7: BPE Tokenization]]></title><description><![CDATA[Welcome to Part 7 of this series on language modeling.]]></description><link>https://www.connorjdavis.com/p/language-modeling-part-7-bpe-tokenization</link><guid isPermaLink="false">https://www.connorjdavis.com/p/language-modeling-part-7-bpe-tokenization</guid><dc:creator><![CDATA[Connor Davis]]></dc:creator><pubDate>Tue, 24 Feb 2026 02:00:45 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!sEAw!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd0a138fc-5046-4c7e-a75c-d9dcdce69258_597x455.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>Welcome to Part 7 of this series on language modeling. In this post, we will implement tokenization. In the <a href="https://connorjdavis.substack.com/p/language-modeling-part-6-transformers">previous post</a>, we trained a one-layer transformer to maximize the likelihood of the next token, conditioned on the previous tokens, on the TinyStories dataset. Here is an example story sampled from the transformer:</p><blockquote><p>Story time: Once upon a time, there was a little boy named Timmy. Timmy loved to play with his toys and run around in the woods. One day, Timmy&#8217;s mom told him to his mommy and said, &#8220;Mommy, what&#8217;s that it&#8217;s important to be careful.&#8221;</p><p>Mommy said, &#8220;Okay, Timmy, you&#8217;re car book will be careful picture. 
They both went to the beach and said, &#8220;Mommy, I help you.&#8221; His mom wash the candy. He could do it and said, &#8220;That&#8217;s a word. Mommy said, &#8220;Okay, I will help you do it!&#8221; Timmy was scared and said, &#8220;Thank you, Timmy. You can use your dress and carrots of cool!&#8221;</p><p>Timmy was so proud of himself for his dark and played with Max. They went back and playing with his mom. They were happy to see Mr. From that day on, Timmy was glad went on the big wave came back and he was safe. Timmy learned that it&#8217;s important to go away to the forest. They were playing until the air and he would always very happy. They all lived happily ever after. Timmy was so happy to have fun in the sky and Timmy went outside</p></blockquote><p>The story has decent structure, but is still mostly an incoherent mess. One idea to improve on the performance is to use a tokenization algorithm called byte-pair encoding (BPE). BPE is commonly used in pre-training of real large language models today. </p><p>Note &#8220;pre-training&#8221; is jargon for the training objective of maximizing the next-token probabilities used in the context of large language models. This is the same training objective used for every post in this series.</p><p>In practical setups, pre-training doesn&#8217;t use character-level tokens (where 1 token = 1 character). Instead they use chunks of characters, words, or even multiple words to represent one token. BPE tokenization is the main algorithm used for creating these chunked tokens.</p><p>So what is BPE and why should we suspect it may help performance? </p><p>The main idea behind BPE is to leverage the &#8220;low-level&#8221; statistical structure that is pervasive in text in order to reduce the amount that our model has to learn from scratch. For example, in English, the character &#8216;t&#8217; is frequently followed by the character &#8216;h&#8217;. 
And the characters &#8216;th&#8217; are much more common than &#8216;xz&#8217;. Knowing these statistics of frequently-paired characters<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-1" href="#footnote-1" target="_self">1</a> directly supports our objective: predicting the next token given the tokens we have already seen. BPE tokenization iteratively computes many of these initial, low-level statistical pairings (called &#8220;bi-grams&#8221;), relieving our model of the burden of learning them itself. This frees up model capacity for learning higher-level patterns.</p><p>The BPE tokenizer computes these frequency statistics, starting from the raw characters. The individual characters comprise the initial <em>vocabulary</em>. Each iteration of the tokenizer creates a new token that is the result of merging two smaller tokens (e.g. &#8216;t&#8217; + &#8216;h&#8217; &#8594; &#8216;th&#8217;). The new token is added to the vocabulary, and the process is repeated until a target vocabulary size is reached.</p><p>The high-level flow of BPE is:</p><ol><li><p>Compute the frequencies of each pair of adjacent tokens in the current token list</p></li><li><p>Merge the most frequent pair of tokens into one new token</p></li><li><p>Add the new token to the vocabulary</p></li><li><p>If the vocabulary size equals the target vocabulary size, stop. 
Else proceed to step 1.</p></li></ol><h2>Implementing the BPE Tokenizer</h2><p>You can find all the code for this post <a href="https://colab.research.google.com/drive/1WY_WMXqcy44UJeetAIn8Dg4sjQzSXpYq?usp=sharing">here</a>. </p><p>The input of the tokenizer is the base vocabulary and a representative collection of training samples. The output is an updated, expanded vocabulary. The base vocabulary (including the stop token) for TinyStories is 175 tokens. I performed tokenization on the first 40% of the training set, containing 847K sample stories:</p><div class="highlighted_code_block" data-attrs="{&quot;language&quot;:&quot;python&quot;,&quot;nodeId&quot;:&quot;a345c267-db81-4829-bff1-9575a223cda1&quot;}" data-component-name="HighlightedCodeBlockToDOM"><pre class="shiki"><code class="language-python"># data_trn, nr_trn, stoi/itos, and stop_char are defined in earlier notebook cells
cur_tokens = []

for s in data_trn[:nr_trn]:
    cur_tokens += list(s)

init_token_count = len(cur_tokens)
nr_actual_tokens = len(stoi)
nr_desired_tokens = 5000

print(f"Initial token count: {init_token_count}")

while nr_actual_tokens &lt; nr_desired_tokens:
    new_tokens = []
    counts = {}
    candidates = {}

    for i in range(len(cur_tokens) - 1):

        left = cur_tokens[i]
        right = cur_tokens[i+1]

        # Prevent merging on the stop_char so it is easier to deliminate
        # each sample from the dataset
        if left == stop_char or right == stop_char:
            continue

        # NOTE: keying by the concatenated string means two different pairs
        # that concatenate to the same string (e.g. 'a'+'bc' vs 'ab'+'c')
        # would share a single count; a (left, right) tuple key would be
        # more precise, but such collisions are rare in practice.
        tok = left + right

        if tok not in counts:
            counts[tok] = 1
            candidates[tok] = {"left": left, "right": right}
        else:
            counts[tok] += 1

    # Pluck out the token which has highest count
    new_token = max(counts, key=counts.get)
    left = candidates[new_token]['left']
    right = candidates[new_token]['right']
    cursor = 0

    for i in range(1, len(cur_tokens)):

        # Check for merge condition
        #
        # We merge if the left and right tokens match, and the cursor is not i.
        # If the cursor is i, it means we just merged, and could merge again (two
        # overlapping merges), but merges are non-overlapping for book-keeping purposes,
        # so we just skip over the current token, keeping cursor where it is.
        #
        if cur_tokens[i-1] == left and cur_tokens[i] == right and cursor != i:
            if cursor &lt; i - 1:
                # Cursor is behind the left token, so copy [cursor, left token)
                new_tokens += cur_tokens[cursor:i-1] + [new_token]
            else:
                new_tokens += [new_token]

            # anytime we merge, we move the cursor to the immediate
            # right of the right-merge token
            cursor = i + 1

    # Grab the rest. this also gracefully covers the degenerate case where no merges happened
    if cursor &lt;= len(cur_tokens) - 1:
        new_tokens += cur_tokens[cursor:]

    cur_tokens = new_tokens
    new_len = len(cur_tokens)

    # Now add the new token to the token dictionary stoi and itos
    stoi[new_token] = nr_actual_tokens
    itos[nr_actual_tokens] = new_token
    nr_actual_tokens += 1
    
    print(f"Merged {new_token}")
    
    if nr_actual_tokens % 1000 == 0:
        print(f"Tokens: {nr_actual_tokens} ({(init_token_count - new_len) / init_token_count:.2%} reduction)")
        
    if len(new_token) &gt; 40:
        print(f"Max token length is {len(new_token)}: {new_token}   (stopping)")
        print(f"Number of tokens: {nr_actual_tokens}")
break</code></pre></div><p>The first step is to convert each sample story into a list of individual characters. That gives you your initial token list. The algorithm then iterates over the token list until the actual vocabulary size meets the target size (or another stopping condition is met, like maximum token length). My target size was 5000. For context, the GPT family of LLMs has a vocab size in the ballpark of 60k<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-2" href="#footnote-2" target="_self">2</a>. </p><p>On each iteration, the tokenizer first computes the frequency of each adjacent pair of tokens in a single pass over <code>cur_tokens</code>. The pair that occurs with the highest frequency is plucked out via <code>max(counts, key=counts.get)</code>. The tokenizer then iterates over the token list again, this time looking for adjacent token pairs whose merged result matches the highest-frequency token. Each matching pair is replaced with the merged token, with the exception of pairs overlapping a just-completed merge (i.e. a token can be &#8220;consumed&#8221; in at most one merge per iteration). Finally, the new token is added to the vocabulary by updating <code>stoi</code> and <code>itos</code>. This process repeats until we reach 5000 tokens, or we produce a token longer than 40 characters. In practice, the latter condition is what triggered first in my setup. Both of these thresholds were set arbitrarily by me. 
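The counting-and-merging step above can be sketched as a minimal, self-contained toy. The function name `merge_once` and the toy data are illustrative only, not from the notebook, and the sketch skips the stop-character handling used above:

```python
from collections import Counter

def merge_once(tokens):
    """Perform one BPE iteration: count adjacent pairs, merge the most frequent."""
    # Count adjacent pairs (bi-grams) in a single pass.
    pairs = Counter(zip(tokens, tokens[1:]))
    (left, right), _ = pairs.most_common(1)[0]
    merged = left + right

    # Replace non-overlapping occurrences, scanning left to right.
    out, i = [], 0
    while i < len(tokens):
        if i + 1 < len(tokens) and tokens[i] == left and tokens[i + 1] == right:
            out.append(merged)
            i += 2  # consume both halves so merges never overlap
        else:
            out.append(tokens[i])
            i += 1
    return merged, out

new_token, toks = merge_once(list("ababab"))
print(new_token, toks)  # ab ['ab', 'ab', 'ab']
```

Advancing the index by two after a merge is what makes merges non-overlapping, mirroring the cursor bookkeeping in the real implementation.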
It would be interesting to understand what the &#8220;optimal&#8221; tokenization hyperparameters are for downstream performance, including the minimum number of samples needed, max token size, and number of tokens.</p><p>After running the tokenizer, the vocab size grew from 175 to 869. We can get a feel for the tokenizer&#8217;s effect by sorting the new vocabulary by token length and looking at the first few entries:</p><div class="highlighted_code_block" data-attrs="{&quot;language&quot;:&quot;python&quot;,&quot;nodeId&quot;:&quot;66901429-7f90-4547-a211-ae2606df3866&quot;}" data-component-name="HighlightedCodeBlockToDOM"><pre class="shiki"><code class="language-python">print(sorted_stoi)

{
 'Once upon a time, there was a little girl named ': 868,
 'Once upon a time, there was a ': 449,
 'Once upon a time, ': 430,
 'little girl named ': 720,
 'Once upon a time ': 843,
 'Once upon a tim': 375,
 'there was a ': 360,
 'Once upon a ': 374,
 'little girl ': 475,
... [snip]
}</code></pre></div><p>Each entry above is a single token and will be represented by a single embedding vector in the model. This effectively compresses the sequences our model has to learn from: before BPE, the training corpus was 1.1 billion (character-level) tokens; after tokenization, the total drops to 359 million. That is a 67% reduction. </p><p>Now that we&#8217;ve run the tokenizer, we can tokenize the stories in the training and validation sets:</p><div class="highlighted_code_block" data-attrs="{&quot;language&quot;:&quot;python&quot;,&quot;nodeId&quot;:&quot;c93e4be9-d595-4810-8780-9415c46694a2&quot;}" data-component-name="HighlightedCodeBlockToDOM"><pre class="shiki"><code class="language-python">def tokenize_string(s, sorted_stoi):
    tokens = []
    cursor = 0

    while cursor &lt; len(s):
        for k, _ in sorted_stoi.items():
            if s[cursor:].startswith(k):
                tokens.append(k)
                cursor += len(k)
                break
    
    return tokens
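# A quick sanity check of the greedy longest-match behavior (toy vocabulary,
# illustrative only): because the keys are ordered longest-first, the
# two-character token 'th' wins over the single character 't'.
_toy_vocab = {'th': 0, 't': 1, 'h': 2, 'e': 3}  # already sorted longest-first
assert tokenize_string('the', _toy_vocab) == ['th', 'e']
# Caveat: if no vocabulary key matched at the cursor, the while loop above would
# never advance; this is safe here because every base character is in stoi.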

sorted_stoi = {k : stoi[k] for k in sorted(stoi, key=len, reverse=True)}

# We have the tokenization map stoi. Now we need to tokenize the training and validation sets
nr_trn = len(data_trn) * 6 // 10
nr_val = len(data_trn) // 10

trn_stories = data_trn[:nr_trn]
val_stories = data_trn[nr_trn:nr_trn+nr_val]

trn_tokenized = []
val_tokenized = []

trn_lens = 0
trn_tokenized_lens = 0

for s in trn_stories:
    tokens = tokenize_string(s, sorted_stoi)
    trn_tokenized.append(tokens)
    trn_lens += len(s)
    trn_tokenized_lens += len(tokens)

for s in val_stories:
    tokens = tokenize_string(s, sorted_stoi)
    val_tokenized.append(tokens)</code></pre></div><p>Note the above algorithm is completely serial and very sluggish. It could be sped up a lot with multiple threads, but I didn&#8217;t want to overcomplicate it.</p><p>Now, to inform the maximum sequence length used during training, we can look at the distribution of tokenized story lengths in the training set:</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!sEAw!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd0a138fc-5046-4c7e-a75c-d9dcdce69258_597x455.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!sEAw!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd0a138fc-5046-4c7e-a75c-d9dcdce69258_597x455.png 424w, https://substackcdn.com/image/fetch/$s_!sEAw!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd0a138fc-5046-4c7e-a75c-d9dcdce69258_597x455.png 848w, https://substackcdn.com/image/fetch/$s_!sEAw!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd0a138fc-5046-4c7e-a75c-d9dcdce69258_597x455.png 1272w, https://substackcdn.com/image/fetch/$s_!sEAw!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd0a138fc-5046-4c7e-a75c-d9dcdce69258_597x455.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!sEAw!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd0a138fc-5046-4c7e-a75c-d9dcdce69258_597x455.png" width="597" height="455"
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/d0a138fc-5046-4c7e-a75c-d9dcdce69258_597x455.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:455,&quot;width&quot;:597,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:21436,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://connorjdavis.substack.com/i/188668806?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd0a138fc-5046-4c7e-a75c-d9dcdce69258_597x455.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!sEAw!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd0a138fc-5046-4c7e-a75c-d9dcdce69258_597x455.png 424w, https://substackcdn.com/image/fetch/$s_!sEAw!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd0a138fc-5046-4c7e-a75c-d9dcdce69258_597x455.png 848w, https://substackcdn.com/image/fetch/$s_!sEAw!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd0a138fc-5046-4c7e-a75c-d9dcdce69258_597x455.png 1272w, https://substackcdn.com/image/fetch/$s_!sEAw!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd0a138fc-5046-4c7e-a75c-d9dcdce69258_597x455.png 1456w" sizes="100vw" loading="lazy"></picture></div></a></figure></div><p>Based on this I chose a sequence length of 512. Stories longer than 512 tokens are filtered out, and since many of the remaining stories are shorter than 512 tokens, we pad them with <code>seq_len - len(story)</code> stop characters so that they all have length 512 for batched training runs. While we&#8217;re at it, we record the valid length of each story, i.e. its length before any padding:</p><div class="highlighted_code_block" data-attrs="{&quot;language&quot;:&quot;python&quot;,&quot;nodeId&quot;:&quot;b9ceff66-23fc-4ffe-9f70-c5bd41a7ccd2&quot;}" data-component-name="HighlightedCodeBlockToDOM"><pre class="shiki"><code class="language-python">seq_len = 512

trn_tokenized = [s for s in trn_tokenized if len(s) &lt;= seq_len]
val_tokenized = [s for s in val_tokenized if len(s) &lt;= seq_len]
trn_tokenized_lens = []
val_tokenized_lens = []

pads = []
for s in trn_tokenized:
    trn_tokenized_lens.append(len(s))

    if seq_len == len(s):
        pads.append([])
    else:
        pads.append([stop_char] * (seq_len - len(s)))

trn_tokenized = [s+p for s, p in zip(trn_tokenized, pads)]

pads = []
for s in val_tokenized:
    val_tokenized_lens.append(len(s))

    if seq_len == len(s):
        pads.append([])
    else:
        pads.append([stop_char] * (seq_len - len(s)))

val_tokenized = [s+p for s, p in zip(val_tokenized, pads)]

for s, l in zip(trn_tokenized, trn_tokenized_lens):
    assert len(s) == seq_len
    assert l &lt;= len(s)

for s, l in zip(val_tokenized, val_tokenized_lens):
    assert len(s) == seq_len
assert l &lt;= len(s)</code></pre></div><p>The valid lengths are used to build the attention mask, which covers both the padding characters and the future tokens in multi-head attention. This means that the attention weights on padding characters and future tokens will be zero.</p><h2>Training and Performance</h2><p>At this point we are ready to train the one-layer transformer again, this time with our tokenized samples. There is a slight difference between the model we trained in the last post and this one. Since the vocabulary is larger, the embedding matrix and the final linear projection will be larger. This also means the predicted distribution that the model produces will have 869 choices instead of just 175. This creates an apples-to-oranges situation when directly comparing the perplexity between models. Instead we can use bits-per-character (BPC), which normalizes the total negative log-likelihood (in bits) by the total number of characters. The total characters are the same between models. </p><p>The larger vocabulary increased the total parameters to over 10 million. It took about 38 hours to train on my system. The BPC of the prior model was 0.82 and the BPC of the model with tokenization was 0.55. That is a 32% performance boost!</p><p>Let&#8217;s sample a story from our tokenized model:</p><blockquote><p>Story time: Once upon a time there was a little girl named Lucy. She was 3 years old and she loved to explore. One day, Lucy decided to go outside and explore the world. Lucy was excited and she ran outside to explore.</p><p>As she walked, she noticed a big puddle of water. She wanted to take a closer look and thought it was so cool! She ran towards the door and looked around. She saw a big, green tree with a lot of smoke coming from inside.</p><p>&#8220;What is that?&#8221; she asked him.</p><p>Her mum smiled and said, &#8220;It&#8217;s an earthquake. You can open the door and see what will happen.&#8221;</p><p>And so, Lucy spent the whole day exploring the world. 
She found lots of exciting things in the puddles and putting them out of the open field. She was so happy and excited that she jumped up and down in excitement.</p><p>She ran around the puddle and laughed, and she ran around the house with her fingers. When she reached home, she was so excited to explore and make new friends. She had so much fun exploring the world around her.</p><p>The End.</p></blockquote><p>I think this is the best story yet! It isn&#8217;t perfect; there are still logical inconsistencies and issues with pronoun agreement. The story doesn&#8217;t follow a unified, coherent progression of thought. However, it does seem to have a beginning, middle, and end, and overall it is the most coherent story we&#8217;ve seen so far.</p><p>For reference, here is the sampled story from the very first model we trained in <a href="https://connorjdavis.substack.com/p/tour-de-language-modeling-part-1">Part 1</a> using Bengio et al.&#8217;s &#8220;Neural Probabilistic Language Model&#8221;, which is essentially just a feed-forward network (FFN):</p><blockquote><p>Story time: Once upon a timk.e an ter pere eo hire sores the caed va boanshThr witlin. HtA. ThengerDpg, and ker ditgy und g nit, tayeur Vag anddbT&#201;dbvjogd isswarp!e wow,e. 
ouancs.&#8221;Tneyd-4%un6&#184;&#164;&#338;&#194;&#183;&#175; } Iy&#382;+&#8225;+&#8250;&#180;&#162;&#191;D&#187;&#225;jf&#201;&#381;&#176;&#233;G&#173;&#8482;yz&#8250;1&#338;&#194;&#353;&#175;&#187;{U9&#172;#&#179;&#8217;} %&gt;&#178;)&#184;&#8216;&#172;#&#339;j;&#202;q&gt;&#8216;&#230;&#201;&#181;Lb&#230;&#228;c&#174;&#232;.c&#381;39&#176;zc&#183;dxnomd.&#402;&gt;o&#166;t.mTe su&#338;lmvcyI&#162;&#8221;D&#225;&#339;&#8211;j&#339;&#179;;&#191;&#228;X&#233;cv&#8482;&#166;R&#402;&#184;2&#8217;F&#8249; @&#8250;&#8221;&#402;&#195;&#8250;6&#177;z&#353;&lt;&#176;b&#201;;&#174;&#174;&#210;`0 ?.&#196;#2&#187;&#225;B&#8221;&#183;&#8221;&#226;2&#180;F&#185;&#8230;&#165;&#174;@12&#167;9\&gt;&#710;&#167;&#163;V}&#229;4&#185;&#8364;F&#233;Q}&#166;&#169;&#161;&#168;&#177;&#188;&#175;&#8224;&#195;))`&#188;&#201;\Rz&#228;&#161;\&#172;#;&#179;Y&#376;&#176;vVL&#226;%&#196;&lt;Z&#230;&#179;&#175;&#233;O&#8218;&#195;M&#382;+`[&#8221;&#230;C&#226;j,C&#209;S&#352;\,&#185; ]O&#226;&#172;&#732;&lt;!&#230;&#230;&#210;&#175;Y&#230;&#161;&#710;9&#202;&#239;4g$&#189;?&#196;b&#239;&#201;?oBH&#732;&#228;&#177; ;&#227;R&gt;@)&#402;&#8240;&#710;=X&#240;&#165;&#185;P,?0=&gt;&#381;&#240;:&#8221;QW&#176;JFxQ(3\h&#8222;&#352;&#240;&#201;)X&#732;&#180;QD&#181;xj&#187;.&#162;&#201;?&#353;&#172;&#170;Rc&#179;&#352;&#239;&#352;&#172;&#173;qU&#162;E&#185;&#162;&#339;R0&#8240;2&#376;&#240;:&#381;+&#197;4&#161;&#186;^</p></blockquote><p>Again we can&#8217;t quite compare this directly with the tokenized transformer, since the FFN I trained was using character-level tokens. Regardless the improvements that were made in the 14 years between Bengio&#8217;s FFN and Vaswani&#8217;s Transformer are astonishing. And interestingly, the FFN is one of the key components inside the Transformer block. Now that we have BPE tokenization, the only major difference between this model and the GPT family of models from OpenAI is scale. GPT-2 has 12 layers. 
GPT-3 has 96 layers. The addition of two other major components, self-attention and the residual stream, combined with the scale of stacked Transformer blocks, is what has elevated modern LLMs into the mainstream.</p><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-1" href="#footnote-anchor-1" class="footnote-number" contenteditable="false" target="_self">1</a><div class="footnote-content"><p>This is a form of inductive bias that we &#8220;offload&#8221; onto the BPE tokenizer to free up capacity of our model for more high-level tasks.</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-2" href="#footnote-anchor-2" class="footnote-number" contenteditable="false" target="_self">2</a><div class="footnote-content"><p>TinyStories has a smaller vocab since it just contains simple short stories, whereas GPT is trained on the Internet and needs a bigger vocab due to the higher entropy inherent to the dataset.</p></div></div>]]></content:encoded></item><item><title><![CDATA[Language Modeling, Part 6: Transformers]]></title><description><![CDATA[In Part 6 of this series on language modeling, we upgrade our model architecture from the LSTM to the 
Transformer.]]></description><link>https://www.connorjdavis.com/p/language-modeling-part-6-transformers</link><guid isPermaLink="false">https://www.connorjdavis.com/p/language-modeling-part-6-transformers</guid><dc:creator><![CDATA[Connor Davis]]></dc:creator><pubDate>Sat, 14 Feb 2026 23:36:56 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!D530!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5309d40b-37fb-43d9-b72d-f90ef99cbf47_1698x2745.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>In Part 6 of this series on language modeling, we upgrade our model architecture from the LSTM to the Transformer. You can see Part 5 of this series <a href="https://substack.com/@connorjdavis/note/p-186023312?r=1nb12u&amp;utm_source=notes-share-action&amp;utm_medium=web">here</a>. Our latest LSTM gave us a sample story like this:</p><blockquote><p>Once upon a time there were two fearful of many toys. They do not notice their fight. They liked to give the chimding into his room. There, they had a doll, I cut the brush to go away. Let&#8217;s decide it rown in your bones and your bike, Ben. You are brave and selfish.&#8221; They ask Mom and Dad.</p><p>&#8220;Go?&#8221; Lily said, pointing at the balloon. She hugged the doll bitter. She opened her around with her window. One day, she noticed something giragain and the airport. The little bird flew away, curious, and told her family for being so much fun.</p><p>Timmy felt happy with his game and went to her mom and stayed because no one wanted to see the flower. Lily realized that being happy she and Lily, was very surprise</p></blockquote><p>with a perplexity of 2.32. Most of the structure is correct, but the semantics of the story are still poor. 
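As a refresher, perplexity is commonly computed as the exponentiated mean negative log-likelihood per token; a minimal sketch (assuming natural-log cross-entropy, with illustrative numbers):

```python
import math

def perplexity(nll_per_token):
    """Perplexity = exp(mean negative log-likelihood per token)."""
    return math.exp(sum(nll_per_token) / len(nll_per_token))

# as a pure math fact, an average loss of ~0.8416 nats/token
# corresponds to a perplexity of ~2.32
losses = [0.8416] * 4
print(round(perplexity(losses), 2))
```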
In this post I want to train a simple transformer on the TinyStories dataset to try to improve performance.</p><h2>Transformer Architecture</h2><p>The Transformer was introduced in the seminal paper <a href="https://arxiv.org/abs/1706.03762">Attention Is All You Need</a>. It was the first instance of using attention in a non-recurrent fashion for language modeling tasks. You can think of attention as computing a weighted average that is differentiable. This gives attention a nice interpretation: a token that is given higher weight is being &#8220;attended to&#8221; more intently. The weights are learned automatically through backpropagation; in theory they could be anything, but in practice they are the output of a softmax function. This leads to a probabilistic interpretation where each weight is the probability of selecting a given value. Here is the attention calculation:</p><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;\\begin{align}\nattn(q, k, v) &amp;= \\sum_{k,v} \\alpha(q, k)v \\\\\n\\alpha(q, k) &amp;= softmax(qk^{T})\n\\end{align}&quot;,&quot;id&quot;:&quot;QKABVHLDGJ&quot;}" data-component-name="LatexBlockToDOM"></div><p>In <a href="https://arxiv.org/abs/1706.03762">Attention Is All You Need</a>, Vaswani et al. used a special kind of attention called self-attention. 
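Concretely, the equations above boil down to a few lines of code. Here is a sketch in NumPy (the array shapes are my own choice for illustration, not from the post's model):

```python
import numpy as np

def softmax(x, axis=-1):
    # subtract the row max for numerical stability; the result is unchanged
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attn(q, k, v):
    # alpha(q, k) = softmax(q k^T): one weight per (query, key) pair
    alpha = softmax(q @ k.T, axis=-1)
    # the output is a weighted average of the value vectors
    return alpha @ v

rng = np.random.default_rng(0)
q, k, v = (rng.normal(size=(4, 8)) for _ in range(3))  # 4 tokens, dim 8
out = attn(q, k, v)
assert out.shape == (4, 8)
```

Because the weights in each row sum to 1, every output vector is a convex combination of the rows of `v`.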
In self-attention, each token<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-1" href="#footnote-1" target="_self">1</a> directly computes attention with itself and every other token in the sequence (except for some masking to prevent lookahead bias). These direct connections are the reason why the Transformer was such a breakthrough; they allow for efficient learning of long-term dependencies. This gives Transformers a massive advantage over prior recurrent-based architectures like the LSTM.</p><p>The original Transformer paper also used multiple &#8220;heads&#8221; of self-attention. Each head has its own attention weights and attends to the same input. The idea is to let the model learn different high-level features/concepts from the input via divide-and-conquer. One head for syntax, one head for co-reference, another for long-range dependencies, etc<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-2" href="#footnote-2" target="_self">2</a>. The output from each head is concatenated and projected into the &#8220;model dimension&#8221;, which is the dimensionality of the token embedding space and the model as a whole. The output of this self-attention block is then added to the residual stream of the input (an idea borrowed from <a href="https://arxiv.org/abs/1512.03385">ResNet</a>) and normalized before being fed into a 2-layer feed-forward network with a ReLU in the first layer. Finally, the output of the FFN is added to the residual stream and normalized before a final projection into the logit space (i.e. back to the vocabulary dimension). The logits are fed through softmax for the final probabilities of the next token.</p><p>The final piece to make this all work is positional encoding. This is computed at the beginning and added to the initial token embedding to let the model learn relative and absolute token positions. 
Positional encoding isn&#8217;t really needed in sequential models like the LSTM because the computational graph already inherently encodes the token positions. However, self-attention has no knowledge of which tokens come before which, since it unconditionally attends to every token at once. Vaswani et al. used a fixed sinusoidal encoding, but later papers like GPT-1 used learned positional embeddings.</p><h2>Let&#8217;s Build It</h2><p>You can find the code for this post <a href="https://colab.research.google.com/drive/1sBDi6BkQwT4HoGe_0H8YAF5o5VerJ1jT?usp=sharing">here</a>.</p><p>The original Transformer model was applied to machine translation, so it had an encoder and a decoder. The encoder mapped tokens from the source language into the model&#8217;s representation space (whose dimensionality is called &#8220;d_model&#8221;), and the decoder decoded the representation to the destination language vocabulary. Since we are just doing next-token prediction from one input sequence, we can get away with just a decoder. 
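As an aside, the fixed sinusoidal encoding from Vaswani et al. mentioned in the last section can be sketched as follows (a simplified version with illustrative dimensions):

```python
import numpy as np

def sinusoidal_positions(max_len, d_model):
    """PE[pos, 2i] = sin(pos / 10000^(2i/d_model)); PE[pos, 2i+1] = cos(same)."""
    pos = np.arange(max_len)[:, None]      # (max_len, 1)
    i = np.arange(0, d_model, 2)[None, :]  # (1, d_model/2)
    angles = pos / np.power(10000.0, i / d_model)
    pe = np.zeros((max_len, d_model))
    pe[:, 0::2] = np.sin(angles)  # even dimensions
    pe[:, 1::2] = np.cos(angles)  # odd dimensions
    return pe

# the encoding is added to the token embeddings before the first attention block
pe = sinusoidal_positions(max_len=32, d_model=16)
assert pe.shape == (32, 16)
```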
Our decoder-only model has the following structure:</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!D530!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5309d40b-37fb-43d9-b72d-f90ef99cbf47_1698x2745.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!D530!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5309d40b-37fb-43d9-b72d-f90ef99cbf47_1698x2745.png 424w, https://substackcdn.com/image/fetch/$s_!D530!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5309d40b-37fb-43d9-b72d-f90ef99cbf47_1698x2745.png 848w, https://substackcdn.com/image/fetch/$s_!D530!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5309d40b-37fb-43d9-b72d-f90ef99cbf47_1698x2745.png 1272w, https://substackcdn.com/image/fetch/$s_!D530!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5309d40b-37fb-43d9-b72d-f90ef99cbf47_1698x2745.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!D530!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5309d40b-37fb-43d9-b72d-f90ef99cbf47_1698x2745.png" width="626" height="1012.0906593406594" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/5309d40b-37fb-43d9-b72d-f90ef99cbf47_1698x2745.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:2354,&quot;width&quot;:1456,&quot;resizeWidth&quot;:626,&quot;bytes&quot;:282628,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://connorjdavis.substack.com/i/187784909?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5309d40b-37fb-43d9-b72d-f90ef99cbf47_1698x2745.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!D530!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5309d40b-37fb-43d9-b72d-f90ef99cbf47_1698x2745.png 424w, https://substackcdn.com/image/fetch/$s_!D530!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5309d40b-37fb-43d9-b72d-f90ef99cbf47_1698x2745.png 848w, https://substackcdn.com/image/fetch/$s_!D530!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5309d40b-37fb-43d9-b72d-f90ef99cbf47_1698x2745.png 1272w, https://substackcdn.com/image/fetch/$s_!D530!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5309d40b-37fb-43d9-b72d-f90ef99cbf47_1698x2745.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" 
height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>This model is a single-layer, decoder-only Transformer with a final linear projection at the end into the vocabulary space. The softmax gives us the next-token probabilities. </p><h3>Multi-Head Attention</h3><p>The multi-head attention block is in blue. Each <code>attn</code> module computes scaled dot-product self-attention on identical inputs (the output of the positional encoding block) and linearly projects the attention-weighted values to a subspace of the model dimension. The model I trained for this post has a model dimension of 1024 and eight attention heads, each projecting into a 128-dimensional subspace. 
Each attention head computes the following:</p><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;softmax(QK^{T}/\\sqrt{128}) V&quot;,&quot;id&quot;:&quot;ZDSQTDFPOJ&quot;}" data-component-name="LatexBlockToDOM"></div><p>Q, K, and V are all linear projections of the same token sequence into 128-dimensional subspaces. The output of each head is concatenated and projected back to the model dimension of 1024. Since this is a decoder-only Transformer, self-attention must be &#8220;causal&#8221;, meaning we can&#8217;t allow it to attend to tokens in the future. So the first token can only attend to itself, the second token can attend to itself and the first token, and so on. We can implement this by &#8220;masking&#8221; the softmax operation on a per-token basis: for each row (token), the columns with an index greater than the token index are set to negative infinity (<code>-inf</code> in PyTorch) before the softmax. Since the softmax is taken over the column (key) dimension and each <code>-inf</code> entry softmaxes to exactly zero, those positions receive zero attention, which effectively prevents the attention from &#8220;looking ahead&#8221; to future tokens. 
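Here is a minimal NumPy sketch of that masking idea (in PyTorch one would typically build the mask with <code>torch.triu</code> and apply it with <code>masked_fill</code>):

```python
import numpy as np

def causal_attention_weights(scores):
    """scores: (seq_len, seq_len) raw QK^T values; rows = queries, cols = keys."""
    seq_len = scores.shape[0]
    # mask out columns whose index is greater than the row (token) index
    mask = np.triu(np.ones((seq_len, seq_len), dtype=bool), k=1)
    masked = np.where(mask, -np.inf, scores)
    # softmax over the key (column) dimension; -inf entries become exactly 0
    e = np.exp(masked - masked.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

# uniform scores: token i spreads its attention evenly over tokens 0..i
w = causal_attention_weights(np.zeros((4, 4)))
assert np.allclose(w, np.tril(np.full((4, 4), 1.0)) / np.arange(1, 5)[:, None])
```

The lower-triangular pattern this produces is exactly the shape visible in the attention plots below.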
Here is a plot of the attention scores from each of the 8 heads, conditioned on the input:</p><blockquote><p>One day, Tim said &#8220;let&#8217;s go to the park to play&#8221;.</p></blockquote><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!aSmA!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8647d86d-b167-4e18-b0a9-22e16bfb9565_624x1180.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!aSmA!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8647d86d-b167-4e18-b0a9-22e16bfb9565_624x1180.png 424w, https://substackcdn.com/image/fetch/$s_!aSmA!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8647d86d-b167-4e18-b0a9-22e16bfb9565_624x1180.png 848w, https://substackcdn.com/image/fetch/$s_!aSmA!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8647d86d-b167-4e18-b0a9-22e16bfb9565_624x1180.png 1272w, https://substackcdn.com/image/fetch/$s_!aSmA!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8647d86d-b167-4e18-b0a9-22e16bfb9565_624x1180.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!aSmA!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8647d86d-b167-4e18-b0a9-22e16bfb9565_624x1180.png" width="624" height="1180" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/8647d86d-b167-4e18-b0a9-22e16bfb9565_624x1180.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1180,&quot;width&quot;:624,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:100705,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://connorjdavis.substack.com/i/187784909?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8647d86d-b167-4e18-b0a9-22e16bfb9565_624x1180.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!aSmA!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8647d86d-b167-4e18-b0a9-22e16bfb9565_624x1180.png 424w, https://substackcdn.com/image/fetch/$s_!aSmA!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8647d86d-b167-4e18-b0a9-22e16bfb9565_624x1180.png 848w, https://substackcdn.com/image/fetch/$s_!aSmA!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8647d86d-b167-4e18-b0a9-22e16bfb9565_624x1180.png 1272w, https://substackcdn.com/image/fetch/$s_!aSmA!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8647d86d-b167-4e18-b0a9-22e16bfb9565_624x1180.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" 
height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>In each of these, &#8216;O&#8217; is at the top of the y-axis and left of the x-axis, and the input proceeds from there. Masking is what creates the lower-triangular matrix effect. </p><p>If you squint you can see that head 3 (and head 4 as well for the most part) learned to strongly attend exclusively to the previous character, while head 1 learned to attend to the spaces prior to the quote and the opening quote itself. The others are a bit harder to interpret. It is interesting that the model seems to have learned to &#8220;delegate&#8221; different tasks to different heads.</p><h3>Residual Stream and FFN</h3><p>The output of the multi-head attention is added back to the residual stream, then layer normalized. The output of the first layer normalization is then fed into a 2-layer feed-forward network with a ReLU nonlinearity. 
The output of the FFN is added back to the residual stream and layer normalized. The result is the output of the transformer decoder layer. Then there is a final linear projection into the vocab dimension, which gives the logits of the next-token prediction. Finally, these are normalized with softmax to get the predicted probability distribution over the next token.</p><h2>Training and Performance</h2><p>This 1-layer model has 9.8 million parameters and took about 1.5 days to train on my system. For reference, GPT-2, the precursor to ChatGPT, has 48 layers and 1.5 billion parameters in its largest configuration. So this model is still very small by today&#8217;s standards. Despite that, we still saw some performance gains over the LSTM.</p><p>The final validation loss was 0.76, beating the 1-layer LSTM by 7%. The perplexity was 2.30, edging slightly ahead of the 2.32 from the 1-layer LSTM. </p><p>The sampled story quality is interesting. In previous posts, we have sampled the most likely character every time. This is a special case of the more general <em>top-k</em> sampling with <em>temperature</em> equal to 1.0. Top-k sampling draws from the k most likely tokens according to the predicted distribution. When k=1, we have the greedy approach from previous posts. Temperature controls the shape of the distribution. As temperature increases, the distribution flattens towards the uniform distribution. 
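Top-k sampling with temperature can be sketched in a few lines (the function and variable names here are illustrative, not from my notebook):

```python
import numpy as np

def sample_next(logits, temperature=1.0, top_k=1, rng=None):
    rng = rng or np.random.default_rng()
    # higher temperature flattens the distribution; lower temperature sharpens it
    scaled = np.asarray(logits, dtype=float) / temperature
    # keep only the k most likely tokens and renormalize over them
    top = np.argsort(scaled)[-top_k:]
    probs = np.exp(scaled[top] - scaled[top].max())
    probs /= probs.sum()
    return int(rng.choice(top, p=probs))

# top_k=1 is the greedy decoding used in earlier posts: always the argmax
assert sample_next([2.0, 1.0, 0.5, -1.0], top_k=1) == 0
```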
Higher temperature means the model spreads out the probability more evenly across each token:</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!I5B-!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F032150c0-4ca4-4383-954a-4d637e986739_826x451.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!I5B-!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F032150c0-4ca4-4383-954a-4d637e986739_826x451.png 424w, https://substackcdn.com/image/fetch/$s_!I5B-!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F032150c0-4ca4-4383-954a-4d637e986739_826x451.png 848w, https://substackcdn.com/image/fetch/$s_!I5B-!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F032150c0-4ca4-4383-954a-4d637e986739_826x451.png 1272w, https://substackcdn.com/image/fetch/$s_!I5B-!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F032150c0-4ca4-4383-954a-4d637e986739_826x451.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!I5B-!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F032150c0-4ca4-4383-954a-4d637e986739_826x451.png" width="826" height="451" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/032150c0-4ca4-4383-954a-4d637e986739_826x451.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:451,&quot;width&quot;:826,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:21245,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://connorjdavis.substack.com/i/187784909?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F032150c0-4ca4-4383-954a-4d637e986739_826x451.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!I5B-!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F032150c0-4ca4-4383-954a-4d637e986739_826x451.png 424w, https://substackcdn.com/image/fetch/$s_!I5B-!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F032150c0-4ca4-4383-954a-4d637e986739_826x451.png 848w, https://substackcdn.com/image/fetch/$s_!I5B-!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F032150c0-4ca4-4383-954a-4d637e986739_826x451.png 1272w, https://substackcdn.com/image/fetch/$s_!I5B-!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F032150c0-4ca4-4383-954a-4d637e986739_826x451.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" 
viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>Sampling with temp=1.0 and top-k=1 (as in prior posts), we get a story that is coherent at first, then interestingly lapses into repetition:</p><blockquote><p>Story time: Once upon a time, there was a little girl named Lily. She loved to play outside in the sunshine. One day, she saw a big box of colorful flowers and she was so excited to see what was inside. </p><p>Lily was so excited to see what was wrong. She was so excited to see what was inside. She wanted to help her mom and said, &#8220;Lily, you can still she was so happy to have a new friend.</p><p>Lily was so happy to have a new friend. She was so happy to have a new friend. She was so happy to have her a big hug. She was so happy to have her a big hug. She was so happy to have her a big hug. She was so happy to have her a big hug. She was so happy to have her a big hug. 
From that day on, Lily always remembered to be careful when she was always be careful when she was so happy to have found the park. She was so happy to have her a big hug. From that day on, Lily always remembered to be careful when she was able to be careful when she was able to help her friends. They were happy to have a new friend.</p></blockquote><p>We can reduce the repetition by increasing the top-k value and decreasing the temperature slightly. Here is a story with temp=0.7 and top-k=20:</p><blockquote><p>Story time: Once upon a time, there was a little boy named Timmy. Timmy loved to play with his toys and run around in the woods. One day, Timmy&#8217;s mom told him to his mommy and said, &#8220;Mommy, what&#8217;s that it&#8217;s important to be careful.&#8221;</p><p>Mommy said, &#8220;Okay, Timmy, you&#8217;re car book will be careful picture. They both went to the beach and said, &#8220;Mommy, I help you.&#8221; His mom wash the candy. He could do it and said, &#8220;That&#8217;s a word. Mommy said, &#8220;Okay, I will help you do it!&#8221; Timmy was scared and said, &#8220;Thank you, Timmy. You can use your dress and carrots of cool!&#8221;</p><p>Timmy was so proud of himself for his dark and played with Max. They went back and playing with his mom. They were happy to see Mr. From that day on, Timmy was glad went on the big wave came back and he was safe. Timmy learned that it&#8217;s important to go away to the forest. They were playing until the air and he would always very happy. They all lived happily ever after. Timmy was so happy to have fun in the sky and Timmy went outside </p></blockquote><p>This version is much more story like. It also has much better coherency than the LSTM version featured at the top of this post. Though still not perfect, it is approaching a representative story from the underlying dataset. </p><p>It is also fun to see what happens when the temperature is increased substantially. 
Here is a &#8220;story&#8221; with temp=10 and top-k=20:</p><blockquote><p>Story time: Onmy.b.lMiaa tw pysitzm olwb,&#226;Wtb,! O a llam ub agodh?! NEdor,.v at!o is&#8217;-migat,wy.M-gete! Bme dttzoeslar?go.llss&#8217;,</p><p>WDu scrofm?!</p><p>.Holdyr,n,::! </p><p>Oe.&#8217;l</p><p>nuim yte can&#226;&#8364;-miff;.Jx&#8217;b peer!. Shiopr.Mors,s,s!tVe&#8217;:?&#226;&#8364;&#339;Ge rea ppcbalins&#226;drus!&#226;&#8364;ic emter.</p><p>Meff.&#226;ms.Te,tbcdusnnsq</p><p>afl jdb, it tmysqeoul&#8217;&#226;Soppo&#226;&#8364;</p></blockquote><p>Increasing the temperature increases the entropy of the output, since the sampling distribution approaches uniform.</p><p>In the next post, I will experiment with a couple of things to try to increase performance further. </p><p>The first is the tokenization scheme. The current scheme maps one character to one token. One alternative, used heavily in real language models, is byte-pair encoding (BPE). BPE sits between character-level and word-level tokenization. It effectively compresses the token space, which means that for a given sequence length, the model can learn longer-range dependencies from the source dataset. This should result in better coherence, especially for a fixed compute budget.</p><p>The second is scaling. The transformer in this post is just one layer and is fairly narrow. However, the transformer architecture is inherently scalable, which is why today&#8217;s frontier models use many tens of layers with wide capacities. We&#8217;ll train a bigger model to see how it affects performance.</p><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-1" href="#footnote-anchor-1" class="footnote-number" contenteditable="false" target="_self">1</a><div class="footnote-content"><p>In the context of this series, a token is just a single character (for now).</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-2" href="#footnote-anchor-2" class="footnote-number" contenteditable="false" target="_self">2</a><div class="footnote-content"><p>At the moment it is highly non-obvious to me why and how the model would be able to delegate these different functionalities to different heads, other than just &#8220;because gradient descent&#8221;. 
A topic for further exploration.</p></div></div>]]></content:encoded></item><item><title><![CDATA[Language Modeling, Part 5: Reverse Engineering LSTM Cells]]></title><description><![CDATA[In Part 5 of this series on language modeling, we linger a bit on the LSTM to peek under the hood in order to better understand the network&#8217;s internals.]]></description><link>https://www.connorjdavis.com/p/language-modeling-part-5-reverse</link><guid isPermaLink="false">https://www.connorjdavis.com/p/language-modeling-part-5-reverse</guid><dc:creator><![CDATA[Connor Davis]]></dc:creator><pubDate>Thu, 05 Feb 2026 00:23:18 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!LZML!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3d3e9595-155b-4fcb-9a25-df66dc332613_900x900.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>In Part 5 of this series on language modeling, we linger a bit on the LSTM to peek under the hood in order to better understand the network&#8217;s internals. If you&#8217;re just joining, you can read Part 4 <a href="https://connorjdavis.substack.com/p/lanugage-modeling-part-4-lstms">here</a>. Our latest 1-layer LSTM we trained from Part 4 on the <a href="https://huggingface.co/datasets/roneneldan/TinyStories">TinyStories</a> dataset generates stories such as:</p><blockquote><p>Once upon a time there were two fearful of many toys. They do not notice their fight. They liked to give the chimding into his room. There, they had a doll, I cut the brush to go away. Let&#8217;s decide it rown in your bones and your bike, Ben. You are brave and selfish.&#8221; They ask Mom and Dad.</p><p>&#8220;Go?&#8221; Lily said, pointing at the balloon. She hugged the doll bitter. She opened her around with her window. One day, she noticed something giragain and the airport. 
The little bird flew away, curious, and told her family for being so much fun.</p><p>Timmy felt happy with his game and went to her mom and stayed because no one wanted to see the flower. Lily realized that being happy she and Lily, was very surprise</p></blockquote><p>These stories are so good I&#8217;m going to have to start charging a subscription! Kidding, and if you decide to tell this one to your kids I&#8217;m not responsible for their resulting nightmares or subsequently poor English proficiency.</p><p>Ok so the story has a few invalid words and is incoherent nonsense overall, but the syntactic structure the model has learned to generate is pretty good. In particular you can see it has mostly figured out quotations as well as punctuation, spacing, and some subject-verb agreement.</p><p>What I want to do in this post is probe the internals of the model to map out where different capabilities live. For example, can we find the hidden unit(s) responsible for recognizing quotation marks? What about other punctuation, or particular words? Can we control the expression of these capabilities by modifying the hidden units in a particular way? Most of these questions are inspired by Karpathy&#8217;s excellent paper <a href="https://arxiv.org/pdf/1506.02078">Visualizing and Understanding Recurrent Networks</a>. I recommend that paper for good background material. 
Let&#8217;s explore these questions in the next section.</p><h2>Visualizing LSTM Cells</h2><p>Recall the structure of the LSTM:</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!TOLG!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdcddf14b-72c0-4097-90eb-230f56e83ff3_982x888.jpeg" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!TOLG!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdcddf14b-72c0-4097-90eb-230f56e83ff3_982x888.jpeg 424w, https://substackcdn.com/image/fetch/$s_!TOLG!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdcddf14b-72c0-4097-90eb-230f56e83ff3_982x888.jpeg 848w, https://substackcdn.com/image/fetch/$s_!TOLG!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdcddf14b-72c0-4097-90eb-230f56e83ff3_982x888.jpeg 1272w, https://substackcdn.com/image/fetch/$s_!TOLG!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdcddf14b-72c0-4097-90eb-230f56e83ff3_982x888.jpeg 1456w" 
sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!TOLG!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdcddf14b-72c0-4097-90eb-230f56e83ff3_982x888.jpeg" width="724" height="654.6965376782077" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/dcddf14b-72c0-4097-90eb-230f56e83ff3_982x888.jpeg&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:888,&quot;width&quot;:982,&quot;resizeWidth&quot;:724,&quot;bytes&quot;:74875,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/jpeg&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:&quot;https://connorjdavis.substack.com/i/186023312?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdcddf14b-72c0-4097-90eb-230f56e83ff3_982x888.jpeg&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!TOLG!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdcddf14b-72c0-4097-90eb-230f56e83ff3_982x888.jpeg 424w, https://substackcdn.com/image/fetch/$s_!TOLG!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdcddf14b-72c0-4097-90eb-230f56e83ff3_982x888.jpeg 848w, https://substackcdn.com/image/fetch/$s_!TOLG!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdcddf14b-72c0-4097-90eb-230f56e83ff3_982x888.jpeg 1272w, 
https://substackcdn.com/image/fetch/$s_!TOLG!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdcddf14b-72c0-4097-90eb-230f56e83ff3_982x888.jpeg 1456w" sizes="100vw" fetchpriority="high"></picture></div></a></figure></div><p>The example cell above has only 8 hidden units (hidden_dim=8) for drawing purposes. 
The LSTM that we trained in Part 4 uses a cell with 256 hidden units (note I&#8217;m going to refer to &#8220;hidden unit of the cell&#8221; as just &#8220;cell&#8221;).</p><p>If we want to attribute different cells to various aspects of the input, we can generate a trace of the activation values of each cell for a given string. There is one activation value per token of the input string. This string can contain tokens of interest, such as vowels, quotes, and other punctuation. Then we can scan the activation maps of each cell to look for patterns. </p><p>Here is the code for generating a trace: </p><pre><code># Collect all activation values of cell @cell_idx. In the
# LSTM there are 256 cells in total. (Note that "cell" usually refers
# to the entire memory cell of the LSTM; I'm abusing the terminology
# slightly here by calling the single hidden unit selected by
# @cell_idx a "cell".)

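# Assumed context, defined earlier in the series' notebook (restated
# here as I read it from the snippet, so it stands alone; treat these
# as my descriptions, not the exact definitions):
#   stoi   - dict mapping a character to its integer token id
#   device - the torch device the model lives on
#   lstm   - two-part model: lstm[0] is the embedding layer and
#            lstm[1] is the LSTM wrapper (hidden_dim = 256)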
def trace_cell(lstm, string, cell_idx):
    assert cell_idx &lt; lstm[1].hidden_dim # 256
    hidden = cell = None
    activation_trace = []

    for char in string:
        c = torch.tensor(stoi[char], device=device)
        
        # Embed the token, then add batch and sequence dims:
        # (embed_dim,) -> (1, 1, embed_dim)
        x = lstm[0](c)
        x = x.unsqueeze(dim=0).unsqueeze(dim=0)

        # Generate prediction
        x, hidden, cell = lstm[1](x, h=hidden, c=cell)

        # assumes cell is (1, hidden_dim)
        activation_trace.append(cell[0][cell_idx].item())

    return activation_trace</code></pre><p>You can see the full code for this <a href="https://colab.research.google.com/drive/1h1N6poK-bguX0S9hzesJL9YxSqWdWovr?usp=sharing">here</a>. When you run this for each cell with the input string:</p><blockquote><p>Once upon a time, there was a girl named Lucy. Lucy asked Bob, &#8220;Why did the chicken cross the road?&#8221;</p></blockquote><p>we see some interesting patterns emerge. Note in the activation heatmaps below, purple means the cell is not activated, greenish is neutral, and yellow is highly activated. The corresponding line plot of the activation values versus token is below each heatmap.</p><p>Here is Cell 48:</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!Y1vL!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd0ee441d-0ac0-4bd1-99bc-0c6d54ce99a7_900x900.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!Y1vL!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd0ee441d-0ac0-4bd1-99bc-0c6d54ce99a7_900x900.png 424w, https://substackcdn.com/image/fetch/$s_!Y1vL!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd0ee441d-0ac0-4bd1-99bc-0c6d54ce99a7_900x900.png 848w, https://substackcdn.com/image/fetch/$s_!Y1vL!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd0ee441d-0ac0-4bd1-99bc-0c6d54ce99a7_900x900.png 1272w, 
https://substackcdn.com/image/fetch/$s_!Y1vL!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd0ee441d-0ac0-4bd1-99bc-0c6d54ce99a7_900x900.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!Y1vL!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd0ee441d-0ac0-4bd1-99bc-0c6d54ce99a7_900x900.png" width="725" height="725" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/d0ee441d-0ac0-4bd1-99bc-0c6d54ce99a7_900x900.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:900,&quot;width&quot;:900,&quot;resizeWidth&quot;:725,&quot;bytes&quot;:36956,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://connorjdavis.substack.com/i/186023312?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd0ee441d-0ac0-4bd1-99bc-0c6d54ce99a7_900x900.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!Y1vL!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd0ee441d-0ac0-4bd1-99bc-0c6d54ce99a7_900x900.png 424w, https://substackcdn.com/image/fetch/$s_!Y1vL!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd0ee441d-0ac0-4bd1-99bc-0c6d54ce99a7_900x900.png 848w, https://substackcdn.com/image/fetch/$s_!Y1vL!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd0ee441d-0ac0-4bd1-99bc-0c6d54ce99a7_900x900.png 
1272w, https://substackcdn.com/image/fetch/$s_!Y1vL!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd0ee441d-0ac0-4bd1-99bc-0c6d54ce99a7_900x900.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!evpO!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F695ceb52-650c-40be-8669-c86e25efd07c_900x900.png" data-component-name="Image2ToDOM"><div 
class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!evpO!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F695ceb52-650c-40be-8669-c86e25efd07c_900x900.png 424w, https://substackcdn.com/image/fetch/$s_!evpO!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F695ceb52-650c-40be-8669-c86e25efd07c_900x900.png 848w, https://substackcdn.com/image/fetch/$s_!evpO!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F695ceb52-650c-40be-8669-c86e25efd07c_900x900.png 1272w, https://substackcdn.com/image/fetch/$s_!evpO!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F695ceb52-650c-40be-8669-c86e25efd07c_900x900.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!evpO!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F695ceb52-650c-40be-8669-c86e25efd07c_900x900.png" width="724" height="724" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/695ceb52-650c-40be-8669-c86e25efd07c_900x900.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:900,&quot;width&quot;:900,&quot;resizeWidth&quot;:724,&quot;bytes&quot;:91469,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://connorjdavis.substack.com/i/186023312?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F695ceb52-650c-40be-8669-c86e25efd07c_900x900.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!evpO!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F695ceb52-650c-40be-8669-c86e25efd07c_900x900.png 424w, https://substackcdn.com/image/fetch/$s_!evpO!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F695ceb52-650c-40be-8669-c86e25efd07c_900x900.png 848w, https://substackcdn.com/image/fetch/$s_!evpO!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F695ceb52-650c-40be-8669-c86e25efd07c_900x900.png 1272w, https://substackcdn.com/image/fetch/$s_!evpO!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F695ceb52-650c-40be-8669-c86e25efd07c_900x900.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" 
viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>Cell 48 is fairly noisy, but appears to be excited by sequences ending in &#8216;e&#8217; and &#8216;y&#8217; like &#8216;Once&#8217;, &#8216;the&#8217;, and &#8216;Why&#8217;.</p><p>Let&#8217;s look at Cell 46:</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" 
href="https://substackcdn.com/image/fetch/$s_!ZZLZ!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fde84b557-3bb8-472c-9470-ec0c1a355439_900x900.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!ZZLZ!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fde84b557-3bb8-472c-9470-ec0c1a355439_900x900.png 424w, https://substackcdn.com/image/fetch/$s_!ZZLZ!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fde84b557-3bb8-472c-9470-ec0c1a355439_900x900.png 848w, https://substackcdn.com/image/fetch/$s_!ZZLZ!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fde84b557-3bb8-472c-9470-ec0c1a355439_900x900.png 1272w, https://substackcdn.com/image/fetch/$s_!ZZLZ!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fde84b557-3bb8-472c-9470-ec0c1a355439_900x900.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!ZZLZ!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fde84b557-3bb8-472c-9470-ec0c1a355439_900x900.png" width="725" height="725" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/de84b557-3bb8-472c-9470-ec0c1a355439_900x900.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:900,&quot;width&quot;:900,&quot;resizeWidth&quot;:725,&quot;bytes&quot;:37490,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://connorjdavis.substack.com/i/186023312?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fde84b557-3bb8-472c-9470-ec0c1a355439_900x900.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!ZZLZ!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fde84b557-3bb8-472c-9470-ec0c1a355439_900x900.png 424w, https://substackcdn.com/image/fetch/$s_!ZZLZ!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fde84b557-3bb8-472c-9470-ec0c1a355439_900x900.png 848w, https://substackcdn.com/image/fetch/$s_!ZZLZ!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fde84b557-3bb8-472c-9470-ec0c1a355439_900x900.png 1272w, https://substackcdn.com/image/fetch/$s_!ZZLZ!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fde84b557-3bb8-472c-9470-ec0c1a355439_900x900.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" 
viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!RDbm!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2d9a989d-6279-4ebf-a312-78a2b442b458_900x900.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!RDbm!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2d9a989d-6279-4ebf-a312-78a2b442b458_900x900.png 424w, 
https://substackcdn.com/image/fetch/$s_!RDbm!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2d9a989d-6279-4ebf-a312-78a2b442b458_900x900.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!RDbm!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2d9a989d-6279-4ebf-a312-78a2b442b458_900x900.png" width="727" height="727" class="sizing-normal" alt="" loading="lazy"></picture></div></a></figure></div><p>It seems to get excited only about &#8216;girl&#8217;. Of course, to verify this we would need to test other input strings containing &#8216;girl&#8217; in different positions, as well as strings that do not contain &#8216;girl&#8217; at all.</p><p>Now check out Cell 253:</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!l38R!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdac668dd-5086-4073-b423-c0eb8e5d2444_900x900.png"><img src="https://substackcdn.com/image/fetch/$s_!l38R!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdac668dd-5086-4073-b423-c0eb8e5d2444_900x900.png" width="724" height="724" class="sizing-normal" alt="" loading="lazy"></a></figure></div><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!LZML!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3d3e9595-155b-4fcb-9a25-df66dc332613_900x900.png"><img src="https://substackcdn.com/image/fetch/$s_!LZML!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3d3e9595-155b-4fcb-9a25-df66dc332613_900x900.png" width="725" height="725" class="sizing-normal" alt="" loading="lazy"></a></figure></div><p>It looks like a cell that signals quoted strings: the quote characters themselves are as active as the quoted content. The activation is negative until the cell sees the first quote, then stays positive for every token in the quoted string, including the closing quote. The large dip just before the opening quote is quite puzzling!</p><p>At this point there are more questions than answers. For example, if we retrain the model, will we see the same cell patterns at the same cell index? How do the input, forget, and output gates behave in relation to the activations we are seeing? What would happen to the model&#8217;s capabilities if we nerf a cell? Let&#8217;s double-click on this last cell to explore some of these questions.</p><h2>Digging Deeper Into the Quote Signaler</h2><p>I retrained the single-layer model to see if the cell patterns were stable.
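</p><p>For concreteness, here is a minimal sketch of how such per-token cell activations can be recorded by unrolling an LSTM step by hand. Everything here is an illustrative stand-in rather than the actual model from the post: the weights are random, the sizes are made up, and <code>lstm_step</code> simply implements the standard LSTM equations.</p>

```python
import numpy as np

rng = np.random.default_rng(0)  # constant seed so reruns are reproducible

d_in, d_hidden = 64, 512
# Illustrative random weights standing in for a trained model's parameters.
W = rng.normal(0.0, 0.1, (4 * d_hidden, d_in + d_hidden))
b = np.zeros(4 * d_hidden)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x, h, c):
    """One LSTM step: returns the new (h, c) plus the gate activations."""
    z = W @ np.concatenate([x, h]) + b
    i, f, g, o = np.split(z, 4)
    i, f, o = sigmoid(i), sigmoid(f), sigmoid(o)  # input / forget / output gates
    g = np.tanh(g)                                # candidate cell update
    c_new = f * c + i * g
    h_new = o * np.tanh(c_new)
    return h_new, c_new, (i, f, o)

def cell_trace(xs, cell_idx):
    """Record one cell's activation after each input vector."""
    h, c = np.zeros(d_hidden), np.zeros(d_hidden)
    trace = []
    for x in xs:
        h, c, _ = lstm_step(x, h, c)
        trace.append(c[cell_idx])
    return np.array(trace)

xs = rng.normal(size=(13, d_in))  # stand-in embeddings for 13 input tokens
trace = cell_trace(xs, 253)       # per-token activation of cell 253
```

<p>Plotting <code>trace</code> against the input tokens gives the kind of per-cell line plot shown above, and the same loop also exposes the gate activations for later analysis.</p><p>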
Here is the plot of Cell 253:</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!WoKG!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7d7a6c12-fefc-4884-a8e3-10842f41d7f0_900x900.png"><img src="https://substackcdn.com/image/fetch/$s_!WoKG!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7d7a6c12-fefc-4884-a8e3-10842f41d7f0_900x900.png" width="900" height="900" class="sizing-normal" alt="" loading="lazy"></a></figure></div><p>It&#8217;s different! What happened? I had to look through every cell in the new model to see whether the pattern had moved.
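</p><p>The search over cells can be sketched as scoring every cell&#8217;s activation trace against a 0/1 &#8220;inside a quote&#8221; mask and taking the best match. The arrays below are random placeholders; in the real experiment <code>acts</code> would hold the retrained model&#8217;s recorded activations on the probe string.</p>

```python
import numpy as np

rng = np.random.default_rng(0)

# Placeholder data: acts[t, j] is the activation of cell j at token t, and
# in_quote marks which tokens of the probe string fall inside the quotes.
seq_len, n_cells = 40, 512
acts = rng.normal(size=(seq_len, n_cells))
in_quote = np.zeros(seq_len)
in_quote[15:25] = 1.0  # tokens 15..24 are inside the quoted span

def quote_scores(acts, in_quote):
    """Correlation of every cell's activation trace with the in-quote mask."""
    a = acts - acts.mean(axis=0)
    m = in_quote - in_quote.mean()
    denom = np.linalg.norm(a, axis=0) * np.linalg.norm(m) + 1e-8
    return (a * m[:, None]).sum(axis=0) / denom

scores = quote_scores(acts, in_quote)
best = int(np.abs(scores).argmax())  # index of the candidate quote cell
```

<p>A scorer like this lets a moved cell be tracked down without eyeballing every plot by hand.</p><p>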
Sure enough, I found it, but this time at cell 178:</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!fwEp!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F57e9a6bb-6620-4316-8fb2-fc71080639ec_900x900.png"><img src="https://substackcdn.com/image/fetch/$s_!fwEp!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F57e9a6bb-6620-4316-8fb2-fc71080639ec_900x900.png" width="900" height="900" class="sizing-normal" alt="" loading="lazy"></a></figure></div><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!2XxT!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F42c63bf5-a903-47cf-9119-571b8cdd65d4_900x900.png"><img src="https://substackcdn.com/image/fetch/$s_!2XxT!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F42c63bf5-a903-47cf-9119-571b8cdd65d4_900x900.png" width="900" height="900" class="sizing-normal" alt="" loading="lazy"></a></figure></div><p>Why did it move? It turns out I had accidentally initialized the model&#8217;s weights without a constant-seeded random number generator, so the weights started from different initial values than in the first run. That is the only difference I can think of. It is not obvious exactly <em>why</em> this would cause the pattern to move from cell 253 to cell 178, but it is very interesting that the model learned it regardless. Roughly the same pattern holds: the activation is negative outside the quote and positive inside it.</p><h3>Analyzing Gates</h3><p>Now let&#8217;s turn to the gates&#8217; behavior. We can create the same heatmap and line activation plots to see how the gates react as the input proceeds. Recall that the input gate controls how much of the current token&#8217;s representation to let into the cell. So an activated input gate is excited about the current token and wants to ensure it is represented in the cell.
An input gate that is deactivated doesn&#8217;t want to let that input into the cell.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!2-m3!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3f561dd8-2b8a-4c06-88e3-fe8e0d49e20f_713x735.png"><img src="https://substackcdn.com/image/fetch/$s_!2-m3!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3f561dd8-2b8a-4c06-88e3-fe8e0d49e20f_713x735.png" width="713" height="735" class="sizing-normal" alt="" loading="lazy"></a></figure></div><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!xGvq!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F542847be-6bad-4df5-84a6-127c4eb2e6e8_900x900.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!xGvq!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F542847be-6bad-4df5-84a6-127c4eb2e6e8_900x900.png 424w,
https://substackcdn.com/image/fetch/$s_!xGvq!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F542847be-6bad-4df5-84a6-127c4eb2e6e8_900x900.png 848w, https://substackcdn.com/image/fetch/$s_!xGvq!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F542847be-6bad-4df5-84a6-127c4eb2e6e8_900x900.png 1272w, https://substackcdn.com/image/fetch/$s_!xGvq!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F542847be-6bad-4df5-84a6-127c4eb2e6e8_900x900.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!xGvq!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F542847be-6bad-4df5-84a6-127c4eb2e6e8_900x900.png" width="900" height="900" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/542847be-6bad-4df5-84a6-127c4eb2e6e8_900x900.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:900,&quot;width&quot;:900,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:110163,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://connorjdavis.substack.com/i/186023312?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F542847be-6bad-4df5-84a6-127c4eb2e6e8_900x900.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!xGvq!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F542847be-6bad-4df5-84a6-127c4eb2e6e8_900x900.png 
424w, https://substackcdn.com/image/fetch/$s_!xGvq!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F542847be-6bad-4df5-84a6-127c4eb2e6e8_900x900.png 848w, https://substackcdn.com/image/fetch/$s_!xGvq!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F542847be-6bad-4df5-84a6-127c4eb2e6e8_900x900.png 1272w, https://substackcdn.com/image/fetch/$s_!xGvq!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F542847be-6bad-4df5-84a6-127c4eb2e6e8_900x900.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" 
y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>The input gate is fairly noisy, especially at the beginning. In general though it seems to activate for spaces and punctuation. This would make sense for a gate trying to signal for quotes, since for this particular dataset, a large percentage of quotations are preceded by spaces and punctuation. You can see this by calculating the frequency statistics of all characters that precede a quote. For TinyStories, this distribution follows a power law:</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!eEyD!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7917c5a3-5c5c-44e8-a0ce-dd2f3edaa1d4_764x759.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!eEyD!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7917c5a3-5c5c-44e8-a0ce-dd2f3edaa1d4_764x759.png 424w, https://substackcdn.com/image/fetch/$s_!eEyD!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7917c5a3-5c5c-44e8-a0ce-dd2f3edaa1d4_764x759.png 848w, https://substackcdn.com/image/fetch/$s_!eEyD!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7917c5a3-5c5c-44e8-a0ce-dd2f3edaa1d4_764x759.png 1272w, https://substackcdn.com/image/fetch/$s_!eEyD!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7917c5a3-5c5c-44e8-a0ce-dd2f3edaa1d4_764x759.png 1456w" sizes="100vw"><img 
src="https://substackcdn.com/image/fetch/$s_!eEyD!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7917c5a3-5c5c-44e8-a0ce-dd2f3edaa1d4_764x759.png" width="764" height="759" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/7917c5a3-5c5c-44e8-a0ce-dd2f3edaa1d4_764x759.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:759,&quot;width&quot;:764,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:17890,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://connorjdavis.substack.com/i/186023312?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7917c5a3-5c5c-44e8-a0ce-dd2f3edaa1d4_764x759.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!eEyD!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7917c5a3-5c5c-44e8-a0ce-dd2f3edaa1d4_764x759.png 424w, https://substackcdn.com/image/fetch/$s_!eEyD!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7917c5a3-5c5c-44e8-a0ce-dd2f3edaa1d4_764x759.png 848w, https://substackcdn.com/image/fetch/$s_!eEyD!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7917c5a3-5c5c-44e8-a0ce-dd2f3edaa1d4_764x759.png 1272w, https://substackcdn.com/image/fetch/$s_!eEyD!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7917c5a3-5c5c-44e8-a0ce-dd2f3edaa1d4_764x759.png 1456w" sizes="100vw" 
loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>What is really fascinating is that initially the gate is interested in spaces, but that interest fades as the sequence gets longer. However the interest in punctuation remains elevated throughout. </p><p>Let move on to the forget gate. Remember the forget gate is &#8220;active low&#8221;. When the forget gate is high, it wants to retain the previous cell&#8217;s representation in the current cell. If it is low, it wants to remove the previous cell&#8217;s representation from the cell. 
Here is the heatmap:</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!jSh-!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F309a4e52-f204-48b1-b416-35827957195f_713x735.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!jSh-!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F309a4e52-f204-48b1-b416-35827957195f_713x735.png 424w, https://substackcdn.com/image/fetch/$s_!jSh-!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F309a4e52-f204-48b1-b416-35827957195f_713x735.png 848w, https://substackcdn.com/image/fetch/$s_!jSh-!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F309a4e52-f204-48b1-b416-35827957195f_713x735.png 1272w, https://substackcdn.com/image/fetch/$s_!jSh-!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F309a4e52-f204-48b1-b416-35827957195f_713x735.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!jSh-!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F309a4e52-f204-48b1-b416-35827957195f_713x735.png" width="724" height="746.3394109396914" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/309a4e52-f204-48b1-b416-35827957195f_713x735.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:735,&quot;width&quot;:713,&quot;resizeWidth&quot;:724,&quot;bytes&quot;:31194,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://connorjdavis.substack.com/i/186023312?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F309a4e52-f204-48b1-b416-35827957195f_713x735.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!jSh-!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F309a4e52-f204-48b1-b416-35827957195f_713x735.png 424w, https://substackcdn.com/image/fetch/$s_!jSh-!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F309a4e52-f204-48b1-b416-35827957195f_713x735.png 848w, https://substackcdn.com/image/fetch/$s_!jSh-!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F309a4e52-f204-48b1-b416-35827957195f_713x735.png 1272w, https://substackcdn.com/image/fetch/$s_!jSh-!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F309a4e52-f204-48b1-b416-35827957195f_713x735.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" 
viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!OC5o!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F275d55ab-1604-47be-855c-8f3d82a7896d_773x735.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!OC5o!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F275d55ab-1604-47be-855c-8f3d82a7896d_773x735.png 424w, 
https://substackcdn.com/image/fetch/$s_!OC5o!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F275d55ab-1604-47be-855c-8f3d82a7896d_773x735.png 848w, https://substackcdn.com/image/fetch/$s_!OC5o!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F275d55ab-1604-47be-855c-8f3d82a7896d_773x735.png 1272w, https://substackcdn.com/image/fetch/$s_!OC5o!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F275d55ab-1604-47be-855c-8f3d82a7896d_773x735.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!OC5o!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F275d55ab-1604-47be-855c-8f3d82a7896d_773x735.png" width="773" height="735" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/275d55ab-1604-47be-855c-8f3d82a7896d_773x735.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:735,&quot;width&quot;:773,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:61861,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://connorjdavis.substack.com/i/186023312?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F275d55ab-1604-47be-855c-8f3d82a7896d_773x735.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!OC5o!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F275d55ab-1604-47be-855c-8f3d82a7896d_773x735.png 
424w, https://substackcdn.com/image/fetch/$s_!OC5o!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F275d55ab-1604-47be-855c-8f3d82a7896d_773x735.png 848w, https://substackcdn.com/image/fetch/$s_!OC5o!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F275d55ab-1604-47be-855c-8f3d82a7896d_773x735.png 1272w, https://substackcdn.com/image/fetch/$s_!OC5o!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F275d55ab-1604-47be-855c-8f3d82a7896d_773x735.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" 
y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>You can see it partially retains the beginning character and in general holds high, retaining the cell&#8217;s state until it reaches the first quote, where it then completely clears the prior cell&#8217;s representation from the current cell value. This means the cell&#8217;s value for the quote token is whatever is passed through the input gate, which happens to be a fairly strong representation of the quote itself, based on the input gate activation above. It then immediately activates near 1 on the next token and holds high until the next quote. The forget gate seems to be implementing a primitive state machine of {inside quote, outside quote}, where transitions occur whenever a quote is encountered.</p><p>Finally let&#8217;s look at the output gate. The output gate controls how &#8220;strongly&#8221; the cell&#8217;s representation is written into the hidden state. A value of zero resets the hidden state, whereas a value of one copies the cell verbatim into the hidden state.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!zneY!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb3abffdd-d4f2-4ebe-8e2d-4c2da1de22b6_713x735.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!zneY!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb3abffdd-d4f2-4ebe-8e2d-4c2da1de22b6_713x735.png 424w, https://substackcdn.com/image/fetch/$s_!zneY!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb3abffdd-d4f2-4ebe-8e2d-4c2da1de22b6_713x735.png 848w, 
https://substackcdn.com/image/fetch/$s_!zneY!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb3abffdd-d4f2-4ebe-8e2d-4c2da1de22b6_713x735.png 1272w, https://substackcdn.com/image/fetch/$s_!zneY!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb3abffdd-d4f2-4ebe-8e2d-4c2da1de22b6_713x735.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!zneY!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb3abffdd-d4f2-4ebe-8e2d-4c2da1de22b6_713x735.png" width="724" height="746.3394109396914" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/b3abffdd-d4f2-4ebe-8e2d-4c2da1de22b6_713x735.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:735,&quot;width&quot;:713,&quot;resizeWidth&quot;:724,&quot;bytes&quot;:35425,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://connorjdavis.substack.com/i/186023312?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb3abffdd-d4f2-4ebe-8e2d-4c2da1de22b6_713x735.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!zneY!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb3abffdd-d4f2-4ebe-8e2d-4c2da1de22b6_713x735.png 424w, 
https://substackcdn.com/image/fetch/$s_!zneY!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb3abffdd-d4f2-4ebe-8e2d-4c2da1de22b6_713x735.png 848w, https://substackcdn.com/image/fetch/$s_!zneY!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb3abffdd-d4f2-4ebe-8e2d-4c2da1de22b6_713x735.png 1272w, https://substackcdn.com/image/fetch/$s_!zneY!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb3abffdd-d4f2-4ebe-8e2d-4c2da1de22b6_713x735.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line 
x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!mDFD!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F04b38ad3-a053-4c37-937f-c4ac83dd41b3_900x900.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!mDFD!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F04b38ad3-a053-4c37-937f-c4ac83dd41b3_900x900.png 424w, https://substackcdn.com/image/fetch/$s_!mDFD!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F04b38ad3-a053-4c37-937f-c4ac83dd41b3_900x900.png 848w, https://substackcdn.com/image/fetch/$s_!mDFD!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F04b38ad3-a053-4c37-937f-c4ac83dd41b3_900x900.png 1272w, https://substackcdn.com/image/fetch/$s_!mDFD!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F04b38ad3-a053-4c37-937f-c4ac83dd41b3_900x900.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!mDFD!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F04b38ad3-a053-4c37-937f-c4ac83dd41b3_900x900.png" width="727" height="727" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/04b38ad3-a053-4c37-937f-c4ac83dd41b3_900x900.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:900,&quot;width&quot;:900,&quot;resizeWidth&quot;:727,&quot;bytes&quot;:98923,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://connorjdavis.substack.com/i/186023312?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F04b38ad3-a053-4c37-937f-c4ac83dd41b3_900x900.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!mDFD!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F04b38ad3-a053-4c37-937f-c4ac83dd41b3_900x900.png 424w, https://substackcdn.com/image/fetch/$s_!mDFD!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F04b38ad3-a053-4c37-937f-c4ac83dd41b3_900x900.png 848w, https://substackcdn.com/image/fetch/$s_!mDFD!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F04b38ad3-a053-4c37-937f-c4ac83dd41b3_900x900.png 1272w, https://substackcdn.com/image/fetch/$s_!mDFD!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F04b38ad3-a053-4c37-937f-c4ac83dd41b3_900x900.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" 
viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>This one is a bit harder to interpret than the forget gate. One thing to notice is that none of the activations are zero. So every token has at least some of the cell being written to the hidden state. We also see that on average, the output gate is more activated before the first quote. However there are &#8220;near clearing&#8221; events before and after the quote, so it isn&#8217;t clear how that is being used. It is notable that the punctuation activations are high. Combined this with the high punctuation activations for the input gate and forget gate, it suggests the output gate has learned to let the punctuation flow into the hidden state. </p><p>The running hypothesis is that this cell encodes a state machine, where the cell activation is negative when outside of quotes, and positive when inside quotes. 
We can re-trace the cell on different inputs to see if this holds up. I created two sets of strings from the validation set - 100 quoted and 100 unquoted. I then traced cell 178 on each string.</p><p>Here is a plot of the maximum activation values across all strings:</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!yfb6!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fce1f5704-355c-4eca-bcb2-b39f39c53b4d_547x435.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!yfb6!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fce1f5704-355c-4eca-bcb2-b39f39c53b4d_547x435.png 424w, https://substackcdn.com/image/fetch/$s_!yfb6!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fce1f5704-355c-4eca-bcb2-b39f39c53b4d_547x435.png 848w, https://substackcdn.com/image/fetch/$s_!yfb6!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fce1f5704-355c-4eca-bcb2-b39f39c53b4d_547x435.png 1272w, https://substackcdn.com/image/fetch/$s_!yfb6!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fce1f5704-355c-4eca-bcb2-b39f39c53b4d_547x435.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!yfb6!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fce1f5704-355c-4eca-bcb2-b39f39c53b4d_547x435.png" width="727" height="578.144424131627" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/ce1f5704-355c-4eca-bcb2-b39f39c53b4d_547x435.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:435,&quot;width&quot;:547,&quot;resizeWidth&quot;:727,&quot;bytes&quot;:38843,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://connorjdavis.substack.com/i/186023312?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fce1f5704-355c-4eca-bcb2-b39f39c53b4d_547x435.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!yfb6!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fce1f5704-355c-4eca-bcb2-b39f39c53b4d_547x435.png 424w, https://substackcdn.com/image/fetch/$s_!yfb6!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fce1f5704-355c-4eca-bcb2-b39f39c53b4d_547x435.png 848w, https://substackcdn.com/image/fetch/$s_!yfb6!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fce1f5704-355c-4eca-bcb2-b39f39c53b4d_547x435.png 1272w, https://substackcdn.com/image/fetch/$s_!yfb6!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fce1f5704-355c-4eca-bcb2-b39f39c53b4d_547x435.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" 
viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>So we&#8217;ve falsified the hypothesis, since the unquoted strings have positive values despite not having any quotes. However the quoted strings consistently (except for one string) have a max that is greater than the unquoted strings. 
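This kind of comparison is easy to script. Here is a minimal sketch of the logic, with a toy `trace_cell` stub standing in for actually running the LSTM and recording cell 178's activation per character (the real traces come from the model, not this stub):

```python
# Sketch of the quoted-vs-unquoted max-activation comparison.
# `trace_cell` is a hypothetical stand-in for running the LSTM over `text`
# and recording cell 178's activation at each character; it is stubbed with
# toy values here so the sketch is self-contained.
def trace_cell(text):
    # Toy behavior loosely mimicking the plots: spike on quotes,
    # mild activation on punctuation, low otherwise.
    return [0.9 if ch == '"' else 0.3 if ch in '.,!?' else 0.1 for ch in text]

def max_activation(text):
    """Return (max activation, character at the argmax) for one string."""
    trace = trace_cell(text)
    peak = max(trace)
    return peak, text[trace.index(peak)]

quoted = ['He said, "hi" to her.', '"Run!" she yelled.']
unquoted = ['He waved to her.', 'She ran away.']

quoted_maxes = [max_activation(s) for s in quoted]
unquoted_maxes = [max_activation(s) for s in unquoted]

# Fraction of quoted strings whose argmax character is the quote itself
frac_quote_argmax = sum(ch == '"' for _, ch in quoted_maxes) / len(quoted_maxes)
```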
The majority of the time (96%), the character that gives the max activation is the quote:</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!mKYh!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbbe619f3-cb51-474b-b829-9759eb6342b4_552x441.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!mKYh!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbbe619f3-cb51-474b-b829-9759eb6342b4_552x441.png 424w, https://substackcdn.com/image/fetch/$s_!mKYh!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbbe619f3-cb51-474b-b829-9759eb6342b4_552x441.png 848w, https://substackcdn.com/image/fetch/$s_!mKYh!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbbe619f3-cb51-474b-b829-9759eb6342b4_552x441.png 1272w, https://substackcdn.com/image/fetch/$s_!mKYh!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbbe619f3-cb51-474b-b829-9759eb6342b4_552x441.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!mKYh!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbbe619f3-cb51-474b-b829-9759eb6342b4_552x441.png" width="724" height="578.4130434782609" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/bbe619f3-cb51-474b-b829-9759eb6342b4_552x441.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:441,&quot;width&quot;:552,&quot;resizeWidth&quot;:724,&quot;bytes&quot;:12597,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://connorjdavis.substack.com/i/186023312?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbbe619f3-cb51-474b-b829-9759eb6342b4_552x441.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!mKYh!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbbe619f3-cb51-474b-b829-9759eb6342b4_552x441.png 424w, https://substackcdn.com/image/fetch/$s_!mKYh!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbbe619f3-cb51-474b-b829-9759eb6342b4_552x441.png 848w, https://substackcdn.com/image/fetch/$s_!mKYh!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbbe619f3-cb51-474b-b829-9759eb6342b4_552x441.png 1272w, https://substackcdn.com/image/fetch/$s_!mKYh!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbbe619f3-cb51-474b-b829-9759eb6342b4_552x441.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" 
viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>Here is an example heatmap and activation line plot from an example quoted string:</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!Szet!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F60df90ca-cedf-449b-ac51-d5194efbd93c_713x735.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!Szet!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F60df90ca-cedf-449b-ac51-d5194efbd93c_713x735.png 424w, 
https://substackcdn.com/image/fetch/$s_!Szet!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F60df90ca-cedf-449b-ac51-d5194efbd93c_713x735.png 848w, https://substackcdn.com/image/fetch/$s_!Szet!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F60df90ca-cedf-449b-ac51-d5194efbd93c_713x735.png 1272w, https://substackcdn.com/image/fetch/$s_!Szet!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F60df90ca-cedf-449b-ac51-d5194efbd93c_713x735.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!Szet!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F60df90ca-cedf-449b-ac51-d5194efbd93c_713x735.png" width="713" height="735" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/60df90ca-cedf-449b-ac51-d5194efbd93c_713x735.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:735,&quot;width&quot;:713,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:188050,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://connorjdavis.substack.com/i/186023312?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F60df90ca-cedf-449b-ac51-d5194efbd93c_713x735.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!Szet!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F60df90ca-cedf-449b-ac51-d5194efbd93c_713x735.png 
424w, https://substackcdn.com/image/fetch/$s_!Szet!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F60df90ca-cedf-449b-ac51-d5194efbd93c_713x735.png 848w, https://substackcdn.com/image/fetch/$s_!Szet!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F60df90ca-cedf-449b-ac51-d5194efbd93c_713x735.png 1272w, https://substackcdn.com/image/fetch/$s_!Szet!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F60df90ca-cedf-449b-ac51-d5194efbd93c_713x735.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" 
y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!H8my!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F94fb35ca-767e-4d20-935a-9ffecea140a4_785x735.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!H8my!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F94fb35ca-767e-4d20-935a-9ffecea140a4_785x735.png 424w, https://substackcdn.com/image/fetch/$s_!H8my!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F94fb35ca-767e-4d20-935a-9ffecea140a4_785x735.png 848w, https://substackcdn.com/image/fetch/$s_!H8my!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F94fb35ca-767e-4d20-935a-9ffecea140a4_785x735.png 1272w, https://substackcdn.com/image/fetch/$s_!H8my!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F94fb35ca-767e-4d20-935a-9ffecea140a4_785x735.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!H8my!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F94fb35ca-767e-4d20-935a-9ffecea140a4_785x735.png" width="785" height="735" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/94fb35ca-767e-4d20-935a-9ffecea140a4_785x735.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:735,&quot;width&quot;:785,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:209972,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://connorjdavis.substack.com/i/186023312?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F94fb35ca-767e-4d20-935a-9ffecea140a4_785x735.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!H8my!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F94fb35ca-767e-4d20-935a-9ffecea140a4_785x735.png 424w, https://substackcdn.com/image/fetch/$s_!H8my!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F94fb35ca-767e-4d20-935a-9ffecea140a4_785x735.png 848w, https://substackcdn.com/image/fetch/$s_!H8my!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F94fb35ca-767e-4d20-935a-9ffecea140a4_785x735.png 1272w, https://substackcdn.com/image/fetch/$s_!H8my!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F94fb35ca-767e-4d20-935a-9ffecea140a4_785x735.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" 
viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>These plots suggest the cell is most excited for quotes, followed by punctuation, rather than being a state machine indicating inside quotes / outside quotes. </p><h3>Nerfing Cell 178</h3><p>Suppose we really don&#8217;t want quotes in our generated stories. For some reason, the stories with quotes keep the kid up at night, instead of putting them to sleep like a proper bedtime story should (probably due to the anticipation of the closing quote!).</p><p>Could we use our knowledge of cell 178 to prevent the model from generating stories with quotes? The tricky part is we don&#8217;t want to hurt model performance too bad (we can&#8217;t afford that!); we just don&#8217;t want quotes in the output. 
The problem is that cell 178 is <em>influenced by</em> and <em>influences</em> every other cell through the recurrent matrices in the gates and candidate cell state. Some of these cells, which I haven&#8217;t listed here (due to time and space constraints), seem &#8220;interested&#8221; in quotes, suggesting that there is a quote <a href="https://distill.pub/2020/circuits/">circuit</a>: a graph of cells that work together to deliver the overall quote capability. But perhaps we can get lucky by clipping the value of cell 178 just right so that quotes are generated less often, or not at all, when we sample. If we clip the activation value so that it doesn&#8217;t exceed 0.8 (based on the max values above), then the cell should behave similarly to the un-nerfed version for the majority of tokens except quotes. Hopefully this will limit the second-order effects on cells which depend on cell 178.</p><p>We can generate 100 stories without modifying cell 178 to get a baseline quote count, then generate new stories after clipping any value greater than 0.8 to 0.8. In each scenario we can measure the perplexity to see the total impact on performance, as well as the total number of quotes in the output.</p><p>The baseline 100 stories without cell modifications gave a total of 363 quotes. The perplexity was 2.2271.</p><p>Clamping the activation of cell 178 to 0.8 resulted in 298 quotes, which is about an 18% reduction. The perplexity actually fell slightly, to 2.2114. We can try to clamp with smaller values to see the effect on quote generation. 
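The clamp itself is tiny. A sketch of the idea (plain Python lists and an arbitrary 512-dimensional state purely for illustration; in the real model this would modify the cell-state tensor every timestep, right after the new cell state is computed and before it feeds the hidden state):

```python
CELL = 178  # the cell identified above

def clamp_cell(cell_state, limit, idx=CELL):
    """Return a copy of the cell state with one activation clipped at `limit`.

    In the model this runs every timestep: clip c_t[idx] before it flows
    into h_t = o_t * tanh(c_t) and into the next step's recurrence."""
    clamped = list(cell_state)
    if clamped[idx] > limit:
        clamped[idx] = limit
    return clamped

# Toy cell state where cell 178 has spiked (e.g. on a quote character);
# the dimension 512 is illustrative, not the actual model size.
state = [0.1] * 512
state[CELL] = 3.2

nerfed = clamp_cell(state, limit=0.8)
```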
Here are some plots for number of quotes generated and perplexity versus activation limit (0.8 is on the far right at 298):</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!eQYT!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F795e00e4-a66f-44f2-8f86-dc49b1263fbb_571x455.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!eQYT!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F795e00e4-a66f-44f2-8f86-dc49b1263fbb_571x455.png 424w, https://substackcdn.com/image/fetch/$s_!eQYT!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F795e00e4-a66f-44f2-8f86-dc49b1263fbb_571x455.png 848w, https://substackcdn.com/image/fetch/$s_!eQYT!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F795e00e4-a66f-44f2-8f86-dc49b1263fbb_571x455.png 1272w, https://substackcdn.com/image/fetch/$s_!eQYT!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F795e00e4-a66f-44f2-8f86-dc49b1263fbb_571x455.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!eQYT!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F795e00e4-a66f-44f2-8f86-dc49b1263fbb_571x455.png" width="724" height="576.9176882661997" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/795e00e4-a66f-44f2-8f86-dc49b1263fbb_571x455.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:455,&quot;width&quot;:571,&quot;resizeWidth&quot;:724,&quot;bytes&quot;:29268,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://connorjdavis.substack.com/i/186023312?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F795e00e4-a66f-44f2-8f86-dc49b1263fbb_571x455.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!eQYT!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F795e00e4-a66f-44f2-8f86-dc49b1263fbb_571x455.png 424w, https://substackcdn.com/image/fetch/$s_!eQYT!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F795e00e4-a66f-44f2-8f86-dc49b1263fbb_571x455.png 848w, https://substackcdn.com/image/fetch/$s_!eQYT!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F795e00e4-a66f-44f2-8f86-dc49b1263fbb_571x455.png 1272w, https://substackcdn.com/image/fetch/$s_!eQYT!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F795e00e4-a66f-44f2-8f86-dc49b1263fbb_571x455.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" 
viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!h7oB!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa316b8f2-bd3a-499b-b57c-36f001b88a0e_576x455.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!h7oB!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa316b8f2-bd3a-499b-b57c-36f001b88a0e_576x455.png 424w, 
https://substackcdn.com/image/fetch/$s_!h7oB!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa316b8f2-bd3a-499b-b57c-36f001b88a0e_576x455.png 848w, https://substackcdn.com/image/fetch/$s_!h7oB!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa316b8f2-bd3a-499b-b57c-36f001b88a0e_576x455.png 1272w, https://substackcdn.com/image/fetch/$s_!h7oB!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa316b8f2-bd3a-499b-b57c-36f001b88a0e_576x455.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!h7oB!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa316b8f2-bd3a-499b-b57c-36f001b88a0e_576x455.png" width="725" height="572.6996527777778" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/a316b8f2-bd3a-499b-b57c-36f001b88a0e_576x455.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:455,&quot;width&quot;:576,&quot;resizeWidth&quot;:725,&quot;bytes&quot;:28448,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://connorjdavis.substack.com/i/186023312?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa316b8f2-bd3a-499b-b57c-36f001b88a0e_576x455.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" 
srcset="https://substackcdn.com/image/fetch/$s_!h7oB!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa316b8f2-bd3a-499b-b57c-36f001b88a0e_576x455.png 424w, https://substackcdn.com/image/fetch/$s_!h7oB!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa316b8f2-bd3a-499b-b57c-36f001b88a0e_576x455.png 848w, https://substackcdn.com/image/fetch/$s_!h7oB!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa316b8f2-bd3a-499b-b57c-36f001b88a0e_576x455.png 1272w, https://substackcdn.com/image/fetch/$s_!h7oB!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa316b8f2-bd3a-499b-b57c-36f001b88a0e_576x455.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" 
stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>It is interesting that we can clamp the cell all the way down to -5 while seeing hardly any change in perplexity. Below -5 we do begin trading away perplexity for fewer quotes, but otherwise it seems we&#8217;ve definitely found at least one of the cells responsible for quote generation! Completely erasing quotes would take more work tracking down the other cells that are working with cell 178.</p><p>Not bad for a quick peek under the hood of our LSTM. There are other things that would be interesting that I didn&#8217;t cover, like tracking down the cells that influence cell 178 the most and vice versa, or understanding what the most &#8220;exciting&#8221; input would be for a given cell. The latter is tricky since we are dealing with a sequence of inputs rather than just one. It would require carefully constructing a loss function to backpropagate into the input sequence, perhaps maximizing the autocorrelation (i.e. &#8220;trendiness&#8221;) of the cell activation curve in addition to the magnitude.</p>]]></content:encoded></item><item><title><![CDATA[Language Modeling, Part 4: LSTMs]]></title><description><![CDATA[Welcome to Part 4 of a series on language modeling.]]></description><link>https://www.connorjdavis.com/p/lanugage-modeling-part-4-lstms</link><guid isPermaLink="false">https://www.connorjdavis.com/p/lanugage-modeling-part-4-lstms</guid><dc:creator><![CDATA[Connor Davis]]></dc:creator><pubDate>Sat, 24 Jan 2026 18:14:56 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!-I68!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8129b6d7-52b5-4657-aee8-082f239e8df7_982x888.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>Welcome to Part 4 of a series on language modeling. In <a href="https://connorjdavis.substack.com/p/language-modeling-part-3-vanilla?r=1nb12u">Part 3</a>, we got familiar with using vanilla RNNs for the next-character prediction task. We saw that the main advantage of adding recurrence to the network is to keep track of long-term dependencies in the input stream. For example, predicting the closing quote in </p><blockquote><p>John said, &#8220;I can&#8217;t make it to the gym today because I have to work&#8221;</p></blockquote><p>requires the model to remember the opening quote at the beginning. Recurrence allows the model to remember previously seen tokens so that it can consider them for the current prediction.</p><p>This is the sample story from the latest model:</p><blockquote><p>Story time: Once upon a little eazings ont Day, I kidgo preate it said. 
At and you were ittlied. The dreing frien fell backy. She is a camed an. Dadry, Tommy said, &#8220;Now, Timmy. He good they fate ause wime caplo. Shew it a with and purnt wike to na;e and hoar. Her and it was he a wourhan. Bobse time, tak mom in the mom togetecide is the bout his mommy, they prele big to think. The rail a citcus but the ponf the she friendsn. The finger ick will and do sturted the tallo!&#8221; Biknywnedded usever,o gand and ce oven tore, girl flew the parn.</p></blockquote><p>Obviously still not performing very well.</p><p>The problem with the vanilla RNN is that the recurrent hidden state leads to instability during training - gradients tend to vanish or explode. This instability led to the development of initialization techniques that ensure the hidden weights have a <a href="https://connorjdavis.substack.com/i/184210105/the-curse-of-recurrence">principal eigenvalue of 1</a>. This somewhat mitigates the gradient problem; however, due to numerical drift over long sequence lengths and the other terms in the gradient, gradients as a whole still struggle to propagate.</p><p>This motivated the development of alternative architectures which allow for better gradient flow through the sequence. 
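</p><p>To make the instability concrete, here is a toy numerical sketch (my own, not from the post&#8217;s notebooks): repeatedly multiplying a gradient by a recurrent weight matrix whose singular values sit slightly below or above 1 makes its norm collapse or blow up over the sequence dimension.</p>

```python
import numpy as np

rng = np.random.default_rng(0)
hidden_dim, seq_len = 64, 100

# Unit-norm stand-in for the gradient arriving at the last timestep
grad = rng.normal(size=hidden_dim)
grad /= np.linalg.norm(grad)

results = {}
for scale in (0.9, 1.1):
    # Orthogonal matrix scaled so every singular value equals `scale`
    W_h = scale * np.linalg.qr(rng.normal(size=(hidden_dim, hidden_dim)))[0]
    g = grad.copy()
    for _ in range(seq_len):
        g = W_h.T @ g  # one backprop step through the recurrence (tanh term omitted)
    results[scale] = np.linalg.norm(g)

print(results)  # scale 0.9 -> ~2.7e-05 (vanished), scale 1.1 -> ~1.4e+04 (exploded)
```

<p>With all singular values pinned to exactly 1 the norm would be preserved, which is the motivation for the eigenvalue-1 initialization mentioned above; in a real backward pass, though, the tanh derivative terms and numerical drift keep the problem from fully disappearing.</p><p>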
In this post we will look at the most prominent recurrent successor<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-1" href="#footnote-1" target="_self">1</a> to the vanilla RNN, the <em>long short-term memory </em>(LSTM).</p><h2>Long Short-Term Memory</h2><p>Both RNNs and LSTMs have a hidden state <code>h_t</code>. The hidden state evolves over the sequence dimension, and represents the model&#8217;s output at the end. In order to track long-term dependencies, the LSTM adds a new state element called the <em>memory cell</em> (or just <em>cell</em> for short, denoted <code>c_t</code>). </p><p>The terminology here was confusing to me when I learned this. The cell <code>c_t</code> is completely internal to the LSTM block, so in a sense it is completely hidden. The hidden state <code>h_t</code> is updated internally and <em>eventually output from the model</em>, so in a sense the hidden state is quasi-hidden. 
</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!-I68!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8129b6d7-52b5-4657-aee8-082f239e8df7_982x888.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!-I68!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8129b6d7-52b5-4657-aee8-082f239e8df7_982x888.png 424w, https://substackcdn.com/image/fetch/$s_!-I68!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8129b6d7-52b5-4657-aee8-082f239e8df7_982x888.png 848w, https://substackcdn.com/image/fetch/$s_!-I68!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8129b6d7-52b5-4657-aee8-082f239e8df7_982x888.png 1272w, https://substackcdn.com/image/fetch/$s_!-I68!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8129b6d7-52b5-4657-aee8-082f239e8df7_982x888.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!-I68!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8129b6d7-52b5-4657-aee8-082f239e8df7_982x888.png" width="982" height="888" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/8129b6d7-52b5-4657-aee8-082f239e8df7_982x888.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:888,&quot;width&quot;:982,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:74875,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://connorjdavis.substack.com/i/184826287?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8129b6d7-52b5-4657-aee8-082f239e8df7_982x888.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!-I68!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8129b6d7-52b5-4657-aee8-082f239e8df7_982x888.png 424w, https://substackcdn.com/image/fetch/$s_!-I68!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8129b6d7-52b5-4657-aee8-082f239e8df7_982x888.png 848w, https://substackcdn.com/image/fetch/$s_!-I68!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8129b6d7-52b5-4657-aee8-082f239e8df7_982x888.png 1272w, https://substackcdn.com/image/fetch/$s_!-I68!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8129b6d7-52b5-4657-aee8-082f239e8df7_982x888.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" 
viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">A zoomed-in look at an LSTM cell. Each cell hidden unit is a function of the current token input and the value of the hidden unit&#8217;s previous state</figcaption></figure></div><p>You can think of the cell as a chunk of volatile RAM. Each element in the cell is a hidden unit that can be read from, written to, and reset to zero, analogous to DRAM. These operations are controlled via the output gate (read), input gate (write), and the forget gate (reset). Each of these gates is implemented with a sigmoid non-linearity to provide a differentiable operation of a binary choice. </p><p>When the input gate is 0, the cell is effectively ignoring the current token. When it is 1, the cell is prioritizing the current token. The forget gate is &#8220;active low&#8221;. 
When it is 0, the previous cell state is forgotten; when it is 1, the previous cell state is remembered. The output gate controls the degree to which the internal cell state is &#8220;released&#8221; into the hidden state <code>h_t</code> as output.</p><p>The other key ingredient that enables the LSTM to keep track of long-range dependencies like quotes, braces, etc. is to feed the hidden state into the gates. The hidden state provides the historical context the gates need to encode state like &#8220;we are currently inside a double quote&#8221;.</p><p>Take the following string:</p><blockquote><p>The writer said, &#8220;you need to use double quotes more.&#8221;</p></blockquote><p>Consider what the input and the forget gates need to do for a hidden unit in the cell that tracks quotes. At the first quote, the input gate needs to saturate to one and the forget gate needs to saturate to zero. This effectively latches the &#8220;inside quote&#8221; state representation into the particular hidden unit in the cell. Further, the output gate needs to saturate to one, so that the hidden state encodes &#8220;inside quote&#8221;. Then for the subsequent characters, this hidden state is fed into each of the three gates, effectively encoding a state transition to &#8220;inside quote&#8221;. For the non-quote characters, the input gate now saturates to zero and the forget gate saturates to one (i.e. to keep the hidden unit cell state the same). The output gate is still saturated to one to preserve the hidden state. Finally, once the closing quote arrives, the input gate transitions towards one and the forget gate resets towards zero. This transitions the cell&#8217;s hidden unit to &#8220;outside quote&#8221; again, which is fed through the output gate into the hidden state.</p><p>Now you may be wondering, what is the point of creating this new cell state? 
Why can&#8217;t we just use the hidden state as in the vanilla RNN?</p><p>The reason is that the LSTM construction enables better gradient flow backwards through  the sequence dimension. To see why we can look at the code snippet around the <code>cell</code> calculation (you can find all the code for this post <a href="https://colab.research.google.com/drive/10sWTbwY02wpolSnuhl2ASelbCe8wiAkt?usp=sharing">here</a>):</p><pre><code># LSTM forward pass function
def __call__(self, x, h=None, c=None):
    ...
    new_cell = torch.tanh(batch @ self.W_xc + self.hidden @ self.W_hc)
   
    # (1) cell is child of sum operation
    self.cell = f * self.cell + i * new_cell

    # (2) cell is child of non-linearity tanh
    self.hidden = o * torch.tanh(self.cell)

    y = self.hidden @ self.W_hy + self.b_y
    outputs.append(y)
    ...</code></pre><p>You can see there are two instances of <code>cell</code> on the right-hand side. This means the gradient of the loss with respect to <code>cell</code> will ultimately be a sum of two terms. The second instance is similar to the calculation of the hidden state in the vanilla RNN in that it is a child of the tanh in the computational graph. The first instance is the critical part, as it is the child of an addition. Since addition directly distributes gradients, this provides an alternative route for gradient to flow to <code>cell</code> across the full sequence dimension. This route provides a bypass around the potentially saturating tanh non-linearity.</p><h2>LSTM Performance</h2><p>Let&#8217;s train the LSTM to compare with the vanilla RNN. Note that I added an instance of the <a href="https://arxiv.org/abs/1412.6980">Adam optimizer</a> in the training loop. This was used to train both the RNN and LSTM with a sequence length of 32 and hidden dimension of 256. Adam is outside the scope of this post, but I highly recommend reading <a href="https://distill.pub/2017/momentum/">Why Momentum Really Works</a> in case you are curious to learn more.</p><p>The RNN ended with a loss of 1.058 and perplexity of 2.86. This means simply adding Adam provided roughly 33% improvement in the loss and 25% improvement in perplexity!</p><p>The LSTM ended with a loss of 0.9177 and perplexity of 2.45. So even with a fairly small sequence length of 32, the LSTM outperforms the vanilla RNN.</p><p>Here is a sample story from the RNN:</p><blockquote><p>Story time: Once upon a time, there was a little boy named Timmy. You don&#8217;t move. And he reached the grass her toys.</p><p>&#8220;Of cou did not know. She wants to go home back in the cells. They did it was too enter and said, &#8220;No,&#8221; Ben got angry angry. Sarate was so happy. They be okay. You are wrong cricket. The bott. She saw lots of lion teaches lesson to have a big bed and a big numbers. 
They liked to lay.</p><p>One day, she watched and said, &#8220;That was a room, Ben looked zigingly. The moral of toys and decided to live energe.&#8221;</p><p>Lily got theard something scarf.</p></blockquote><p>Now that is much better. We can see most of the words are valid English, though still a few that aren&#8217;t. We can also see that the model has learned order and structure of quoted phrases in relation to the subject that is saying them.</p><p>Here is the LSTM:</p><blockquote><p>Story time: Once upon a time, there was a small bird playing tight he would always made a mess.</p><p>Suddenly, their mom smiled and said, &#8220;No, new and how to urge her mom. She did not want to send it in his room. He wanted to play in the park with.</p><p>Aman was so happy that Lily was walking in the park. She wondered why he was old twigs, but one day it said. &#8220;Okay, but you are curious. She went to recorded his wings and answer. Afterwards, he shouted on the way. They do not know that a cupboard before in the bench. Tim determined to quarrel have seen the water. Then asked her dad, &#8220;I&#8217;m sorry, Ben,&#8221; Ben says.</p><p>&#8220;F&#8217;m cool!&#8221; She wished he tried to talk to the store and put some strong store on a piece. He played in</p></blockquote><p>This one is even better! All the words except for &#8220;F&#8217;m&#8221; are valid. Now the story still doesn&#8217;t have much coherency, but the structure of the story is much better.  The main issue is the subjects are coming and going with no real connection to an underlying narrative. Keep in mind that this LSTM used a sequence length of 32 for training. If we want to have coherency across larger portions of the text, perhaps we should have a longer sequence length to capture longer range dependencies.</p><p>When we re-train with a sequence length of 128, the loss drops from 0.9177 to 0.8285 and perplexity decreases from 2.45 to 2.32. 
Here is a story:</p><blockquote><p>Story time: Once upon a time there were two fearful of many toys. They do not notice their fight. They liked to give the chimding into his room. There, they had a doll, I cut the brush to go away. Let&#8217;s decide it rown in your bones and your bike, Ben. You are brave and selfish.&#8221; They ask Mom and Dad.</p><p>&#8220;Go?&#8221; Lily said, pointing at the balloon. She hugged the doll bitter. She opened her around with her window. One day, she noticed something giragain and the airport. The little bird flew away, curious, and told her family for being so much fun.</p><p>Timmy felt happy with his game and went to her mom and stayed because no one wanted to see the flower. Lily realized that being happy she and Lily, was very surprise</p></blockquote><p>It seems performance is roughly the same - the quality of the stories is within the same neighborhood. Most of the words are valid, but the coherence is still lacking.</p><p>One more thing we can do is stack LSTMs into multiple layers. In this stacked arrangement, the hidden state outputs of the first layer are fed into the input of the second layer.</p><p>The loss from a 2-layer LSTM drops down to 0.7964 with perplexity 2.22. Here is a story sample:</p><blockquote><p>Story time: Once upon a time there was a hunter. She liked to put away them in her little kitten. On the beach winter and was walking home, there were many fish!</p><p>Jack made sure to keep him available felt feeling helpless and sparkly, treat, and was getting tracks. The post was the most long in the wheel said, &#8220;Thank you should never have to stay up for a while. What do you tag your thome?&#8221;</p><p>At the store, her mom wanted the new glove and put it in the riverom next door. Then his owner told him about the been before dinner. He ran home and began to joke. He waved goodbye to the loud of the volcano. 
The sailor said, &#8220;This is enterice, me!&#8221; and warned the best joke finish his favourite sleep.</p></blockquote><p>Despite the slightly better loss and perplexity, the quality of the story from the 2-layer model seems about the same as the one-layer version. However, the model has improved significantly over the vanilla RNN. It has learned to balance out quotations and the placement of punctuation, spaces, and line breaks. Overall the stories still lack coherency though, so there is more work to be done.</p><p>Before we try to improve upon the LSTM, I want to take some time to visualize the internals of the LSTM, inspired by Karpathy et al.&#8217;s <a href="https://arxiv.org/pdf/1506.02078">Visualizing and Understanding Recurrent Networks</a>. We will take a look at this in the next post.</p><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-1" href="#footnote-anchor-1" class="footnote-number" contenteditable="false" target="_self">1</a><div class="footnote-content"><p>The most prominent <em>non-recurrent</em> successor is the transformer, which we will see later in this series.</p></div></div>]]></content:encoded></item><item><title><![CDATA[Language Modeling, Part 3: Vanilla RNNs]]></title><description><![CDATA[This is Part 3 of a series on language modeling.]]></description><link>https://www.connorjdavis.com/p/language-modeling-part-3-vanilla</link><guid isPermaLink="false">https://www.connorjdavis.com/p/language-modeling-part-3-vanilla</guid><dc:creator><![CDATA[Connor Davis]]></dc:creator><pubDate>Fri, 16 Jan 2026 23:38:12 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!cPH9!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F850368df-9381-44f2-89fe-c0e86919d71e_767x809.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>This is Part 3 of a series on language modeling. You can read Part 2 <a href="https://substack.com/home/post/p-183195738?source=queue">here</a>. In Part 2 we finished up with a couple of techniques for stabilizing training - Xavier initialization and LayerNorm. These improvements brought our <a href="https://colab.research.google.com/drive/1BjDK0nVW5J9XkmQuInvp9X4KdZmvtz6x#scrollTo=kgXBEW_aunm6&amp;line=2&amp;uniqifier=1">perplexity down</a> from 5.73 to 4.50. 
Here is a sampled story from that model:</p><blockquote><p>Story time: Once dook had rine. It was tore on. They fubbe= ef&#173;, storisk &#226;un wookgar. The fabiing. He soulld eed sou jrean frea toy tayt their to so she prcedadn- and suym haw seare a vene ily. Polly wari. They were mad a wert to xime, the veasen" and grien to but furnyt want to sed talked a may.<br><br>The grid a foll. The "iling. %o loon a with ray fft. Tre sly call. He dayz away hoff the perter. The whone her to saibry. The smuny. She by timpy something to seve<br><br>fee the ground a cime then gratest ands and a with his wood! One day plays garond to curprasy was so ine of the back to take a smiled. Ore lefyor her!" Soras ecried. Jon the will and adpoly toor lantend his and story."</p></blockquote><p>In this post we will try to improve on this by introducing the recurrent neural network (RNN).  Before we do, it helps to remember how the model is working up to this point:</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!yvs7!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa5b28924-31d8-47aa-acb4-2b2d4c1fad9a_850x654.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!yvs7!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa5b28924-31d8-47aa-acb4-2b2d4c1fad9a_850x654.png 424w, https://substackcdn.com/image/fetch/$s_!yvs7!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa5b28924-31d8-47aa-acb4-2b2d4c1fad9a_850x654.png 848w, 
https://substackcdn.com/image/fetch/$s_!yvs7!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa5b28924-31d8-47aa-acb4-2b2d4c1fad9a_850x654.png 1272w, https://substackcdn.com/image/fetch/$s_!yvs7!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa5b28924-31d8-47aa-acb4-2b2d4c1fad9a_850x654.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!yvs7!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa5b28924-31d8-47aa-acb4-2b2d4c1fad9a_850x654.png" width="850" height="654" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/a5b28924-31d8-47aa-acb4-2b2d4c1fad9a_850x654.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:654,&quot;width&quot;:850,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:42387,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:&quot;https://connorjdavis.substack.com/i/184210105?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa5b28924-31d8-47aa-acb4-2b2d4c1fad9a_850x654.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!yvs7!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa5b28924-31d8-47aa-acb4-2b2d4c1fad9a_850x654.png 424w, https://substackcdn.com/image/fetch/$s_!yvs7!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa5b28924-31d8-47aa-acb4-2b2d4c1fad9a_850x654.png 
848w, https://substackcdn.com/image/fetch/$s_!yvs7!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa5b28924-31d8-47aa-acb4-2b2d4c1fad9a_850x654.png 1272w, https://substackcdn.com/image/fetch/$s_!yvs7!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa5b28924-31d8-47aa-acb4-2b2d4c1fad9a_850x654.png 1456w" sizes="100vw" fetchpriority="high"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>In this model, we are concatenating <code>ctx_window</code> character embeddings into a single feature vector, then passing 
it through the fully connected layers. One problem with this design is the impact the <code>ctx_window</code> has on the number of trainable parameters. With 4 linear+layernorm+tanh layers, our model has 313,743 parameters. Currently, our <code>ctx_window</code> is only 8, meaning we only have a &#8220;history&#8221; of 8 characters. So we can&#8217;t really expect the model to be able to track dependencies in the character stream beyond the most recent 8 characters. If we want to increase the <code>ctx_window</code> to have a deeper &#8220;history&#8221;, the problem is that this blows up the number of parameters in the first linear layer, since its dimensions are (<code>ctx_window * embed_dim</code>, <code>hidden_dim</code>). For example, increasing <code>ctx_window</code> from 8 to just 16 doubles the number of parameters of the first linear layer from 65536 to 131072. In other words, our current model doesn&#8217;t scale with the input length.</p><p>It would be better if the architecture were designed to exploit the fact that text is sequential. When we want to generate the next character, ideally we can make that decision based on all the previous characters that we&#8217;ve generated so far, not just the 8 most recent characters. This is what recurrent neural networks (RNNs) help us do. They are explicitly designed to exploit the fact that text (among many other types of data) can be considered as a sequential stream of tokens. This is an example of an <em>inductive bias </em>that aids in learning. 
Inductive bias is another word for the assumptions we make about the data we are working with that help the model learn.</p><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://www.connorjdavis.com/subscribe?&quot;,&quot;text&quot;:&quot;Subscribe now&quot;,&quot;action&quot;:null,&quot;class&quot;:null}" data-component-name="ButtonCreateButton"><a class="button primary" href="https://www.connorjdavis.com/subscribe?"><span>Subscribe now</span></a></p><h2>Vanilla RNNs</h2><p>There are many types of RNNs. Here we will just be looking at vanilla RNNs to get our feet wet. The main idea behind all RNNs is to include a loop in the architecture. Whereas the architecture above is a directed acyclic graph (DAG), an RNN has a loop. The loop keeps track of a so-called <em>hidden state. </em>The hidden state is the model&#8217;s representation of everything it has seen so far. The hidden state, along with the current input, are both included in the output of the RNN. Here is a high level view:</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!cPH9!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F850368df-9381-44f2-89fe-c0e86919d71e_767x809.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!cPH9!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F850368df-9381-44f2-89fe-c0e86919d71e_767x809.png 424w, https://substackcdn.com/image/fetch/$s_!cPH9!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F850368df-9381-44f2-89fe-c0e86919d71e_767x809.png 848w, 
https://substackcdn.com/image/fetch/$s_!cPH9!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F850368df-9381-44f2-89fe-c0e86919d71e_767x809.png 1272w, https://substackcdn.com/image/fetch/$s_!cPH9!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F850368df-9381-44f2-89fe-c0e86919d71e_767x809.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!cPH9!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F850368df-9381-44f2-89fe-c0e86919d71e_767x809.png" width="767" height="809" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/850368df-9381-44f2-89fe-c0e86919d71e_767x809.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:809,&quot;width&quot;:767,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:53004,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://connorjdavis.substack.com/i/184210105?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F850368df-9381-44f2-89fe-c0e86919d71e_767x809.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!cPH9!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F850368df-9381-44f2-89fe-c0e86919d71e_767x809.png 424w, https://substackcdn.com/image/fetch/$s_!cPH9!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F850368df-9381-44f2-89fe-c0e86919d71e_767x809.png 
848w, https://substackcdn.com/image/fetch/$s_!cPH9!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F850368df-9381-44f2-89fe-c0e86919d71e_767x809.png 1272w, https://substackcdn.com/image/fetch/$s_!cPH9!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F850368df-9381-44f2-89fe-c0e86919d71e_767x809.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>Two things to note. 
One is that the sequence is processed one token at a time <code>t=0,t=1,t=2</code> instead of being concatenated. Two is that the hidden state <code>h_t</code> is a function of the previous hidden state <code>h_{t-1}</code> and the current token <code>x_t</code>. The output of the RNN <code>o_t</code> is produced by a final linear layer that maps the hidden state back to the vocab_size dimension. Mathematically, we can write the hidden state <code>h_t</code> and output <code>o_t</code> as:</p><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;\\begin{align*}\nh_t &amp;= \\tanh(x_tW_x + h_{t-1}W_h) \\\\\no_t &amp;= h_t W_o\n\\end{align*}&quot;,&quot;id&quot;:&quot;PQERERUKAT&quot;}" data-component-name="LatexBlockToDOM"></div><p>where <code>x_t</code> is the embedding of the current token and <code>W_x</code> and <code>W_h</code> are learned weight matrices. Of course other activations can be used besides tanh, and you can add a bias to the input and/or hidden state terms if you want. The important thing to note is that the weights <code>W_x</code>, <code>W_h</code>, and <code>W_o</code> are shared across every token <code>t</code>. This enables us to crank up the <code>ctx_window</code> without increasing the parameter count. </p><p>Even though in theory we can increase the <code>ctx_window</code> to any value, we still can&#8217;t in practice. One reason is that we need a constant value for fixed-width batches during training. Another is related to the training dynamics from Part 2. The main issue is that the recurrent multiplication across the time dimension causes gradient instability. 
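To make the shared-weight recurrence concrete, here is a minimal sketch in plain PyTorch; the dimensions and variable names are illustrative, not taken from the linked implementation:

```python
import torch

# Illustrative sizes; the real model's dims live in the linked notebook.
vocab_size, embed_dim, hidden_dim = 100, 16, 32

torch.manual_seed(0)
W_x = torch.randn(embed_dim, hidden_dim) * 0.1
W_h = torch.randn(hidden_dim, hidden_dim) * 0.1
W_o = torch.randn(hidden_dim, vocab_size) * 0.1

def rnn_forward(x_embeds):
    """x_embeds: (seq_len, embed_dim). The same W_x, W_h, W_o are reused at every step."""
    h = torch.zeros(hidden_dim)
    outputs = []
    for x_t in x_embeds:                     # one token at a time: t = 0, 1, 2, ...
        h = torch.tanh(x_t @ W_x + h @ W_h)  # h_t = tanh(x_t W_x + h_{t-1} W_h)
        outputs.append(h @ W_o)              # o_t = h_t W_o
    return torch.stack(outputs)              # (seq_len, vocab_size)

logits = rnn_forward(torch.randn(8, embed_dim))
print(logits.shape)  # torch.Size([8, 100])
```

Because each step reuses the same three matrices, the parameter count is independent of the sequence length.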
We will see exactly why that is later.</p><p>You can find the implementation for the RNN <a href="https://colab.research.google.com/drive/1UDS14UWxNW993TbkvaLEP8nAfZimiaDa?usp=sharing">here</a>. To see the impact of scaling up the context window (also called the sequence length; <code>seq_len</code> in the code), we can train the model at several context windows and compare loss, perplexity, and story generation.</p><p>Here are the validation losses for sequence lengths 16, 32, and 64:</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!2d9M!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F063f53d5-32e5-43cb-94fc-4f971dafabd1_1242x787.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!2d9M!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F063f53d5-32e5-43cb-94fc-4f971dafabd1_1242x787.png 424w, https://substackcdn.com/image/fetch/$s_!2d9M!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F063f53d5-32e5-43cb-94fc-4f971dafabd1_1242x787.png 848w, 
https://substackcdn.com/image/fetch/$s_!2d9M!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F063f53d5-32e5-43cb-94fc-4f971dafabd1_1242x787.png 1272w, https://substackcdn.com/image/fetch/$s_!2d9M!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F063f53d5-32e5-43cb-94fc-4f971dafabd1_1242x787.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!2d9M!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F063f53d5-32e5-43cb-94fc-4f971dafabd1_1242x787.png" width="1242" height="787" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/063f53d5-32e5-43cb-94fc-4f971dafabd1_1242x787.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:787,&quot;width&quot;:1242,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:98798,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://connorjdavis.substack.com/i/184210105?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F063f53d5-32e5-43cb-94fc-4f971dafabd1_1242x787.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!2d9M!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F063f53d5-32e5-43cb-94fc-4f971dafabd1_1242x787.png 424w, 
https://substackcdn.com/image/fetch/$s_!2d9M!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F063f53d5-32e5-43cb-94fc-4f971dafabd1_1242x787.png 848w, https://substackcdn.com/image/fetch/$s_!2d9M!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F063f53d5-32e5-43cb-94fc-4f971dafabd1_1242x787.png 1272w, https://substackcdn.com/image/fetch/$s_!2d9M!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F063f53d5-32e5-43cb-94fc-4f971dafabd1_1242x787.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line 
x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>The model with seq_len=16 had a final loss of 1.59, perplexity of 4.53. The model with seq_len=32 had a final loss of 1.47 and perplexity 4.47. The model with seq_len=64 had a final loss of 1.54 and perplexity of 4.48. Here are some example stories from each:</p><h4>seq_len=16</h4><blockquote><p>Story time: Once upon a little Tom and his Day had dogo saed Aaded soog other to see sadint it she day liked redg feall She with his mack the funny, You of frold. He stark. They so the was byathinks fun decape iserswing an was and asked wike to nameed for a smill!&#8221; veiave hard a with a liendss time, jat man in the upon you west fish as?&#8221;</p><p>AThan&#8217;ur work. He park and the is hear to him was s ruchid. Everyone the phe fax friendsing in and cparial and dogethereot dact get loved happy. She girl ong and meace ave there agall find the park. </p></blockquote><h4>seq_len=32</h4><blockquote><p>Story time: Once upon a litt. They ging, to gelt did not ie so her saig.</p><p>A&#168;n the little they her and reing. red to all she with his mach the fun. They cound shink inst I cur they go bot he fathin?&#8221; </p><p>Hed cap on hasse for with and asket with then she with a beecsted to see are a wase af fellss is histar was so had ut him the sto telt the be his und scart to dell big to think.&#8221; Lily like cid nory to tigen a toy.</p></blockquote><h4>seq_len=64</h4><blockquote><p>Story time: Once upon a little eazings ont Day, I kidgo preate it said. At and you were ittlied. The dreing frien fell backy. She is a camed an. Dadry, Tommy said, &#8220;Now, Timmy. He good they fate ause wime caplo. Shew it a with and purnt wike to na;e and hoar. Her and it was he a wourhan. Bobse time, tak mom in the mom togetecide is the bout his mommy, they prele big to think. The rail a citcus but the ponf the she friendsn. 
The finger ick will and do sturted the tallo!&#8221; Biknywnedded usever,o gand and ce oven tore, girl flew the parn. </p></blockquote><p>Clearly we aren&#8217;t seeing a performance advantage from increasing the sequence length. The losses and perplexities are still within a small range of each other, and the sampled stories are all within the same neighborhood of not good. What&#8217;s going on?</p><h2>The Curse of Recurrence</h2><p>In theory, RNNs are supposed to provide a nice inductive bias for sequential data by maintaining a state that records the &#8220;history&#8221; of the sequence seen so far. In practice, though, they are difficult to scale to long sequence lengths due to unstable gradients. The root cause is in the history itself, specifically the recurrent weight multiplication. To see this, we can derive the gradient of the loss with respect to the hidden weights <code>W_h</code>:</p><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;\\begin{align*}\n\\frac{\\partial L}{\\partial W_h} &amp;= \\sum_{t=1}^{T}\\frac{\\partial L}{\\partial h_t}h_{t-1}^{\\top} \\\\\n\\frac{\\partial L}{\\partial h_t} &amp;= \\sum_{i=t}^{T}(W_{h}^{\\top})^{T - i}W_{o}^{\\top}\\frac{\\partial L}{\\partial o_{T+t-i}}\n\\end{align*}&quot;,&quot;id&quot;:&quot;BVBQWHVDAF&quot;}" data-component-name="LatexBlockToDOM"></div><p>You can find the full derivation <a href="https://d2l.ai/chapter_recurrent-neural-networks/bptt.html">here</a>. The important term in the above is <code>W_h</code>, which is raised to the power of <code>T-i</code>. 
If we assume that <code>W_h</code> is diagonalizable<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-1" href="#footnote-1" target="_self">1</a>, then we can express the <code>T-i</code>th power of <code>W_h</code> via the eigendecomposition</p><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;W_h^{T-i} = P\\Sigma^{T-i} P^{-1}&quot;,&quot;id&quot;:&quot;ASEGQLYWHD&quot;}" data-component-name="LatexBlockToDOM"></div><p>where the middle matrix Σ is the diagonal matrix containing the eigenvalues of <code>W_h</code>. The effect that this has on an input is to stretch it in the direction of each eigenvector by <code>T-i</code> factors of the corresponding eigenvalue. This means that as <code>T</code> increases, the matrix <code>W_h</code> will pull the input towards the principal eigenvector, and the magnitude will blow up (if the largest eigenvalue is greater than 1), shrink to zero (if the largest eigenvalue is less than 1), or stay about the same (if the largest eigenvalue is equal to one). </p><p>The first two cases explain the exploding and vanishing gradient phenomena we encounter as we increase the sequence length <code>T</code>. 
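Both claims are easy to check numerically. The sketch below (my own illustration on an arbitrary 4x4 example, not code from the post) verifies the eigendecomposition identity for matrix powers and shows repeated multiplication aligning a vector with the principal eigenvector:

```python
import torch

torch.manual_seed(0)
W = torch.rand(4, 4, dtype=torch.float64)  # entrywise positive, so the principal eigenvalue is real
k = 10

# The k-th power computed through the eigendecomposition W^k = P Sigma^k P^{-1}
# matches the direct matrix power.
evals, P = torch.linalg.eig(W)
Wk_eig = (P @ torch.diag(evals ** k) @ torch.linalg.inv(P)).real
assert torch.allclose(Wk_eig, torch.linalg.matrix_power(W, k))

# Repeated multiplication pulls a (generic) vector toward the principal eigenvector.
v = torch.rand(4, 1, dtype=torch.float64)
for _ in range(50):
    v = W @ v
    v = v / torch.norm(v)  # renormalize: only the direction matters here
principal = P[:, torch.argmax(evals.abs())].real.reshape(-1, 1)
principal = principal / torch.norm(principal)
print(torch.abs(v.T @ principal).item())  # close to 1.0: fully aligned
```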
We can see this in the gradient histograms of the hidden weight <code>W_h</code> measured at the last token index T halfway through each training run:</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!bEr0!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7763eb64-987c-4b61-a270-e17467124db1_1500x800.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!bEr0!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7763eb64-987c-4b61-a270-e17467124db1_1500x800.png 424w, https://substackcdn.com/image/fetch/$s_!bEr0!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7763eb64-987c-4b61-a270-e17467124db1_1500x800.png 848w, https://substackcdn.com/image/fetch/$s_!bEr0!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7763eb64-987c-4b61-a270-e17467124db1_1500x800.png 1272w, https://substackcdn.com/image/fetch/$s_!bEr0!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7763eb64-987c-4b61-a270-e17467124db1_1500x800.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!bEr0!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7763eb64-987c-4b61-a270-e17467124db1_1500x800.png" width="1456" height="777" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/7763eb64-987c-4b61-a270-e17467124db1_1500x800.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:777,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:42897,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://connorjdavis.substack.com/i/184210105?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7763eb64-987c-4b61-a270-e17467124db1_1500x800.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!bEr0!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7763eb64-987c-4b61-a270-e17467124db1_1500x800.png 424w, https://substackcdn.com/image/fetch/$s_!bEr0!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7763eb64-987c-4b61-a270-e17467124db1_1500x800.png 848w, https://substackcdn.com/image/fetch/$s_!bEr0!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7763eb64-987c-4b61-a270-e17467124db1_1500x800.png 1272w, https://substackcdn.com/image/fetch/$s_!bEr0!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7763eb64-987c-4b61-a270-e17467124db1_1500x800.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" 
height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>As the sequence length increases, the distribution of the hidden gradients converges to a sharper peak at zero. This problem is the primary motivation behind <a href="https://docs.pytorch.org/cppdocs/api/function_namespacetorch_1_1nn_1_1init_1a5978fcc257460475f635b5960e892a8e.html#_CPPv4N5torch2nn4init11orthogonal_E6Tensord">orthogonal weight initialization</a>. Initializing the recurrent weights as an orthogonal matrix ensures that all of its eigenvalues have norm 1, preventing an exploding or vanishing magnitude in the transformation of the input. </p><p>We can visualize this with a simple example to gain some more intuition. We will initialize a random square matrix <code>W</code> and a vector <code>v</code>, multiply <code>v</code> by <code>W</code> for several iterations, and plot the norm of <code>v</code> across the iterations:</p><pre><code>import torch

W = torch.rand(3, 3, dtype=torch.float64)
v = torch.rand(3, 1, dtype=torch.float64)
eigvals = torch.linalg.eigvals(W)
eignorms = torch.abs(eigvals)
eignorms
...
tensor([1.1060, 1.1060, 1.6240], dtype=torch.float64)</code></pre><p>The principal eigenvalue has magnitude 1.62, so we expect the norms to explode after many iterations:</p><pre><code>import matplotlib.pyplot as plt

v_norms = []

for i in range(100):
    v = W @ v
    v_norms.append(torch.norm(v).item())

plt.plot(torch.arange(0, 100), v_norms)</code></pre><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!QRd3!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4a474957-13f9-43d3-912e-053e3e008d8b_815x626.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!QRd3!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4a474957-13f9-43d3-912e-053e3e008d8b_815x626.png 424w, https://substackcdn.com/image/fetch/$s_!QRd3!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4a474957-13f9-43d3-912e-053e3e008d8b_815x626.png 848w, https://substackcdn.com/image/fetch/$s_!QRd3!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4a474957-13f9-43d3-912e-053e3e008d8b_815x626.png 1272w, https://substackcdn.com/image/fetch/$s_!QRd3!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4a474957-13f9-43d3-912e-053e3e008d8b_815x626.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!QRd3!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4a474957-13f9-43d3-912e-053e3e008d8b_815x626.png" width="556" height="427.0625766871166" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/4a474957-13f9-43d3-912e-053e3e008d8b_815x626.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:626,&quot;width&quot;:815,&quot;resizeWidth&quot;:556,&quot;bytes&quot;:31907,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://connorjdavis.substack.com/i/184210105?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4a474957-13f9-43d3-912e-053e3e008d8b_815x626.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!QRd3!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4a474957-13f9-43d3-912e-053e3e008d8b_815x626.png 424w, https://substackcdn.com/image/fetch/$s_!QRd3!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4a474957-13f9-43d3-912e-053e3e008d8b_815x626.png 848w, https://substackcdn.com/image/fetch/$s_!QRd3!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4a474957-13f9-43d3-912e-053e3e008d8b_815x626.png 1272w, https://substackcdn.com/image/fetch/$s_!QRd3!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4a474957-13f9-43d3-912e-053e3e008d8b_815x626.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" 
viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>Now if we use orthogonal initialization on W, we can perform the same iteration, and the magnitude will be constant:</p><pre><code>W = torch.randn(3, 3, dtype=torch.float64)
W = torch.nn.init.orthogonal_(W)
eigvals = torch.linalg.eigvals(W)
eignorms = torch.abs(eigvals)
...
tensor([1.0000, 1.0000, 1.0000], dtype=torch.float64)</code></pre><p>And the plot of the norms is flat as expected:</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!MLDV!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd585bc2c-5156-4b0f-96f3-dc9fbf8fd498_815x626.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!MLDV!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd585bc2c-5156-4b0f-96f3-dc9fbf8fd498_815x626.png 424w, https://substackcdn.com/image/fetch/$s_!MLDV!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd585bc2c-5156-4b0f-96f3-dc9fbf8fd498_815x626.png 848w, https://substackcdn.com/image/fetch/$s_!MLDV!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd585bc2c-5156-4b0f-96f3-dc9fbf8fd498_815x626.png 1272w, https://substackcdn.com/image/fetch/$s_!MLDV!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd585bc2c-5156-4b0f-96f3-dc9fbf8fd498_815x626.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!MLDV!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd585bc2c-5156-4b0f-96f3-dc9fbf8fd498_815x626.png" width="559" height="429.36687116564417" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/d585bc2c-5156-4b0f-96f3-dc9fbf8fd498_815x626.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:626,&quot;width&quot;:815,&quot;resizeWidth&quot;:559,&quot;bytes&quot;:29628,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://connorjdavis.substack.com/i/184210105?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd585bc2c-5156-4b0f-96f3-dc9fbf8fd498_815x626.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!MLDV!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd585bc2c-5156-4b0f-96f3-dc9fbf8fd498_815x626.png 424w, https://substackcdn.com/image/fetch/$s_!MLDV!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd585bc2c-5156-4b0f-96f3-dc9fbf8fd498_815x626.png 848w, https://substackcdn.com/image/fetch/$s_!MLDV!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd585bc2c-5156-4b0f-96f3-dc9fbf8fd498_815x626.png 1272w, https://substackcdn.com/image/fetch/$s_!MLDV!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd585bc2c-5156-4b0f-96f3-dc9fbf8fd498_815x626.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" 
viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>Finally, whenever the principal eigenvalue has magnitude less than one, we see the norm vanish:</p><pre><code>W = torch.randn(3, 3, dtype=torch.float64)
W = W * 0.1 # scale down to get smaller eigenvalues
eigvals = torch.linalg.eigvals(W)
eignorms = torch.abs(eigvals)
...
tensor([0.2711, 0.0049, 0.0449], dtype=torch.float64)</code></pre><p>The principal eigenvalue is 0.2711, so every eigenvalue magnitude is well below 1. Here is the plot:</p><div class="captioned-image-container"><figure><img src="https://substackcdn.com/image/fetch/$s_!28jX!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F49e7de32-88b6-4a1e-a4c3-bb984483df0d_832x608.png" width="547" height="400" alt=""></figure></div><p>Of course, this is a simplified scenario: real RNNs have much noisier dynamics than these. Still, the plots give a good intuition for what is happening. Most of the noise comes from the other terms of the gradient, plus the fact that the weight matrices themselves are modified after each backward pass (assuming they receive a gradient). As we saw in <a href="https://substack.com/@connorjdavis/p-183195738">Part 2</a>, the values of the weights are dynamic, so special initialization techniques like orthogonal init only take us so far. 
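</p>

<p>To make this concrete, here is a minimal numpy sketch (illustrative only, not the post&#8217;s code) of the geometric decay that a spectral radius below 1 produces under repeated multiplication:</p>

```python
import numpy as np

rng = np.random.default_rng(0)

# A random 3x3 "recurrent" matrix, rescaled so its spectral radius
# (largest eigenvalue magnitude) matches the 0.27 from the example above.
W = rng.standard_normal((3, 3))
W *= 0.27 / np.max(np.abs(np.linalg.eigvals(W)))

# Backprop through time multiplies by this Jacobian once per step, so
# the signal shrinks roughly geometrically when the radius is below 1.
v = np.ones(3)
norms = [float(np.linalg.norm(v))]
for _ in range(10):
    v = W @ v
    norms.append(float(np.linalg.norm(v)))

print(norms[0], norms[-1])
```

<p>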
This is why architectural improvements to the vanilla RNN have been pursued, the most notable of which is the <em>long short-term memory</em> (LSTM), which we will cover in the next post.</p><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-1" href="#footnote-anchor-1" class="footnote-number" contenteditable="false" target="_self">1</a><div class="footnote-content"><p>If it is not, then a similar eigenvalue analysis applies, but with the Jordan normal form rather than the diagonal form.</p></div></div>]]></content:encoded></item><item><title><![CDATA[Language Modeling, Part 2: Training Dynamics]]></title><description><![CDATA[Part 2 of a series on language modeling with neural networks.]]></description><link>https://www.connorjdavis.com/p/language-modeling-part-2-training</link><guid isPermaLink="false">https://www.connorjdavis.com/p/language-modeling-part-2-training</guid><dc:creator><![CDATA[Connor Davis]]></dc:creator><pubDate>Thu, 08 Jan 2026 17:35:45 GMT</pubDate><enclosure 
url="https://substackcdn.com/image/fetch/$s_!4ZDZ!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F993e8efc-f887-431d-8f2f-27aabcd904e0_1500x800.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>This is Part 2 of a series on language modeling with neural networks. You can find Part 1 <a href="https://connorjdavis.substack.com/p/tour-de-language-modeling-part-1">here</a>. If you recall from Part 1, our initial model had a perplexity of 434 on the hold-out set from TinyStories. This was a story sampled from the model:</p><blockquote><p>Story time: Once upon a timk.e an ter pere eo hire sores the caed va boanshThr witlin. HtA. ThengerDpg, and ker ditgy und g nit, tayeur Vag anddbT&#201;dbvjogd isswarp!e wow,e. ouancs.&#8221;Tneyd-4%un6&#184;&#164;&#338;&#194;&#183;&#175; } Iy&#382;+&#8225;+&#8250;&#180;&#162;&#191;D&#187;&#225;jf&#201;&#381;&#176;&#233;G&#173;&#8482;yz&#8250;1&#338;&#194;&#353;&#175;&#187;{U9&#172;#&#179;&#8217;} %&gt;&#178;)&#184;&#8216;&#172;#&#339;j;&#202;q&gt;&#8216;&#230;&#201;&#181;Lb&#230;&#228;c&#174;&#232;.c&#381;39&#176;zc&#183;dxnomd.&#402;&gt;o&#166;t.mTe su&#338;lmvcyI&#162;&#8221;D&#225;&#339;&#8211;j&#339;&#179;;&#191;&#228;X&#233;cv&#8482;&#166;R&#402;&#184;2&#8217;F&#8249; @&#8250;&#8221;&#402;&#195;&#8250;6&#177;z&#353;&lt;&#176;b&#201;;&#174;&#174;&#210;`0 ?.&#196;#2&#187;&#225;B&#8221;&#183;&#8221;&#226;2&#180;F&#185;&#8230;&#165;&#174;@12&#167;9\&gt;&#710;&#167;&#163;V}&#229;4&#185;&#8364;F&#233;Q}&#166;&#169;&#161;&#168;&#177;&#188;&#175;&#8224;&#195;))`&#188;&#201;\Rz&#228;&#161;\&#172;#;&#179;Y&#376;&#176;vVL&#226;%&#196;&lt;Z&#230;&#179;&#175;&#233;O&#8218;&#195;M&#382;+`[&#8221;&#230;C&#226;j,C&#209;S&#352;\,&#185; ]O&#226;&#172;&#732;&lt;!&#230;&#230;&#210;&#175;Y&#230;&#161;&#710;9&#202;&#239;4g$&#189;?&#196;b&#239;&#201;?oBH&#732;&#228;&#177; 
;&#227;R&gt;@)&#402;&#8240;&#710;=X&#240;&#165;&#185;P,?0=&gt;&#381;&#240;:&#8221;QW&#176;JFxQ(3\h&#8222;&#352;&#240;&#201;)X&#732;&#180;QD&#181;xj&#187;.&#162;&#201;?&#353;&#172;&#170;Rc&#179;&#352;&#239;&#352;&#172;&#173;qU&#162;E&#185;&#162;&#339;R0&#8240;2&#376;&#240;:&#381;+&#197;4&#161;&#186;^</p></blockquote><p>As you can see, the model is pretty terrible right now. In this post, we will try to improve its performance by scaling it up: adding more layers, and with them more representational capacity. However, we will see that simply adding layers can destabilize training. We will look at some basic techniques for addressing this instability and improving performance. The notebook for this post is <a href="https://colab.research.google.com/drive/1r4sDMqphipA4y4WYRJz2fzyJ8FV9qPOa?usp=sharing">char-bengio-dynamics.ipynb</a> if you want to follow along with code.</p><h2>Adding Layers</h2><p>The model from Part 1 uses just <a href="https://connorjdavis.substack.com/i/183082981/building-the-training-loop">a single Linear layer</a> with tanh activation. By adding more layers, we add parameters and give the network more representational capacity. In theory this should let our model learn better (at the risk of overfitting to the training set). 
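</p>

<p>Before scaling up, it is worth a quick look at why depth alone can cause trouble. Below is a small numpy sketch (illustrative sizes, not the notebook&#8217;s code): with naively scaled weights, stacked tanh layers saturate, and saturated units pass almost no gradient:</p>

```python
import numpy as np

rng = np.random.default_rng(0)
batch, hidden = 32, 128  # illustrative sizes

# Naive init: weight entries with std 1.  Pre-activations then have
# std ~ sqrt(hidden), which lands deep in tanh's flat tails where the
# gradient is nearly zero.
x = rng.standard_normal((batch, hidden))
saturated = []
for _ in range(4):
    W = rng.standard_normal((hidden, hidden))
    x = np.tanh(x @ W)
    saturated.append(float(np.mean(np.abs(x) > 0.99)))

print(saturated)  # fraction of saturated units after each layer
```

<p>Scaling the weights down (for example by 1/sqrt(fan_in), in the spirit of Xavier init) keeps the pre-activations in tanh&#8217;s responsive region.</p>

<p>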
Let&#8217;s try using four Linear+Tanh layers in the base model:</p><div class="github-gist" data-component-name="GitgistToDOM"><div id="gist144136806" class="gist">
<pre><code class="language-python">model = [
    Embedding(device=device, num_embeddings=vocab_size, embedding_dim=embed_dim),
    Flatten(input_dim1=ctx_window, input_dim2=embed_dim),
    Linear(device=device, in_features=ctx_window*embed_dim, out_features=hidden_size, bias=True),
    Tanh(),
    Linear(device=device, in_features=hidden_size, out_features=hidden_size, bias=True),
    Tanh(),
    Linear(device=device, in_features=hidden_size, out_features=hidden_size, bias=True),
    Tanh(),
    Linear(device=device, in_features=hidden_size, out_features=hidden_size, bias=True),
    Tanh(),
    Linear(device=device, in_features=hidden_size, out_features=vocab_size, bias=True)
]

params = [p for layer in model for p in layer.params()]

# Enable gradients for the learnable parameters
for p in params:
    p.requires_grad = True

# Create RNG
g = torch.Generator(device=device).manual_seed(42)</code></pre>
<div class="gist-meta">
  <a href="https://gist.github.com/cjams/420e2f07093bd44385a035476cea11e1#file-bengio-deep-py" class="Link--inTextBlock">bengio-deep.py</a>
</div>
</div>
</div><p>This brings the number of parameters up from 116,912 to 313,568. I modified the training loop slightly to log the gradient values of each layer every 20,000 training steps:</p><div class="github-gist" data-attrs="{&quot;innerHTML&quot;:&quot;<div id=\&quot;gist144136881\&quot; class=\&quot;gist\&quot;>\n    <div class=\&quot;gist-file\&quot; translate=\&quot;no\&quot; data-color-mode=\&quot;light\&quot; data-light-theme=\&quot;light\&quot;>\n      <div class=\&quot;gist-data\&quot;>\n        <div class=\&quot;js-gist-file-update-container js-task-list-container\&quot;>\n  <div id=\&quot;file-bengio-train-with-grads-py\&quot; class=\&quot;file my-2\&quot;>\n    \n    <div itemprop=\&quot;text\&quot;\n      class=\&quot;Box-body p-0 blob-wrapper data type-python  \&quot;\n      style=\&quot;overflow: auto\&quot; tabindex=\&quot;0\&quot; role=\&quot;region\&quot;\n      aria-label=\&quot;bengio-train-with-grads.py content, created by cjams on 09:45PM today.\&quot;\n    >\n\n        \n<div class=\&quot;js-check-hidden-unicode js-blob-code-container blob-code-content\&quot;>\n\n  <template class=\&quot;js-file-alert-template\&quot;>\n  <div data-view-component=\&quot;true\&quot; class=\&quot;flash flash-warn flash-full d-flex flex-items-center\&quot;>\n  <svg aria-hidden=\&quot;true\&quot; height=\&quot;16\&quot; viewBox=\&quot;0 0 16 16\&quot; version=\&quot;1.1\&quot; width=\&quot;16\&quot; data-view-component=\&quot;true\&quot; class=\&quot;octicon octicon-alert\&quot;>\n    <path d=\&quot;M6.457 1.047c.659-1.234 2.427-1.234 3.086 0l6.082 11.378A1.75 1.75 0 0 1 14.082 15H1.918a1.75 1.75 0 0 1-1.543-2.575Zm1.763.707a.25.25 0 0 0-.44 0L1.698 13.132a.25.25 0 0 0 .22.368h12.164a.25.25 0 0 0 .22-.368Zm.53 3.996v2.5a.75.75 0 0 1-1.5 0v-2.5a.75.75 0 0 1 1.5 0ZM9 11a1 1 0 1 1-2 0 1 1 0 0 1 2 0Z\&quot;></path>\n</svg>\n    <span>\n      This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. 
To review, open the file in an editor that reveals hidden Unicode characters.\n      <a class=\&quot;Link--inTextBlock\&quot; href=\&quot;https://github.co/hiddenchars\&quot; target=\&quot;_blank\&quot;>Learn more about bidirectional Unicode characters</a>\n    </span>\n\n\n  <div data-view-component=\&quot;true\&quot; class=\&quot;flash-action\&quot;>        <a href=\&quot;{{ revealButtonHref }}\&quot; data-view-component=\&quot;true\&quot; class=\&quot;btn-sm btn\&quot;>    Show hidden characters\n</a>\n</div>\n</div></template>\n<template class=\&quot;js-line-alert-template\&quot;>\n  <span aria-label=\&quot;This line has hidden Unicode characters\&quot; data-view-component=\&quot;true\&quot; class=\&quot;line-alert tooltipped tooltipped-e\&quot;>\n    <svg aria-hidden=\&quot;true\&quot; height=\&quot;16\&quot; viewBox=\&quot;0 0 16 16\&quot; version=\&quot;1.1\&quot; width=\&quot;16\&quot; data-view-component=\&quot;true\&quot; class=\&quot;octicon octicon-alert\&quot;>\n    <path d=\&quot;M6.457 1.047c.659-1.234 2.427-1.234 3.086 0l6.082 11.378A1.75 1.75 0 0 1 14.082 15H1.918a1.75 1.75 0 0 1-1.543-2.575Zm1.763.707a.25.25 0 0 0-.44 0L1.698 13.132a.25.25 0 0 0 .22.368h12.164a.25.25 0 0 0 .22-.368Zm.53 3.996v2.5a.75.75 0 0 1-1.5 0v-2.5a.75.75 0 0 1 1.5 0ZM9 11a1 1 0 1 1-2 0 1 1 0 0 1 2 0Z\&quot;></path>\n</svg>\n</span></template>\n\n  <table data-hpc class=\&quot;highlight tab-size js-file-line-container\&quot; data-tab-size=\&quot;4\&quot; data-paste-markdown-skip data-tagsearch-path=\&quot;bengio-train-with-grads.py\&quot;>\n        <tr>\n          <td id=\&quot;file-bengio-train-with-grads-py-L1\&quot; class=\&quot;blob-num js-line-number js-blob-rnum\&quot; data-line-number=\&quot;1\&quot;></td>\n          <td id=\&quot;file-bengio-train-with-grads-py-LC1\&quot; class=\&quot;blob-code blob-code-inner js-file-line\&quot;><span class=pl-c1>%</span><span class=pl-s1>matplotlib</span> <span class=pl-s1>inline</span></td>\n        </tr>\n        <tr>\n          
<td id=\&quot;file-bengio-train-with-grads-py-L2\&quot; class=\&quot;blob-num js-line-number js-blob-rnum\&quot; data-line-number=\&quot;2\&quot;></td>\n          <td id=\&quot;file-bengio-train-with-grads-py-LC2\&quot; class=\&quot;blob-code blob-code-inner js-file-line\&quot;>\n</td>\n        </tr>\n        <tr>\n          <td id=\&quot;file-bengio-train-with-grads-py-L3\&quot; class=\&quot;blob-num js-line-number js-blob-rnum\&quot; data-line-number=\&quot;3\&quot;></td>\n          <td id=\&quot;file-bengio-train-with-grads-py-LC3\&quot; class=\&quot;blob-code blob-code-inner js-file-line\&quot;><span class=pl-k>def</span> <span class=pl-en>plot_loss</span>(<span class=pl-s1>trn_loss</span>, <span class=pl-s1>val_loss</span><span class=pl-c1>=</span><span class=pl-c1>None</span>, <span class=pl-s1>title</span><span class=pl-c1>=</span><span class=pl-s>&amp;quot;Loss Curves&amp;quot;</span>):</td>\n        </tr>\n        <tr>\n          <td id=\&quot;file-bengio-train-with-grads-py-L4\&quot; class=\&quot;blob-num js-line-number js-blob-rnum\&quot; data-line-number=\&quot;4\&quot;></td>\n          <td id=\&quot;file-bengio-train-with-grads-py-LC4\&quot; class=\&quot;blob-code blob-code-inner js-file-line\&quot;>    <span class=pl-s1>plt</span>.<span class=pl-c1>figure</span>(<span class=pl-s1>figsize</span><span class=pl-c1>=</span>(<span class=pl-c1>10</span>, <span class=pl-c1>6</span>))</td>\n        </tr>\n        <tr>\n          <td id=\&quot;file-bengio-train-with-grads-py-L5\&quot; class=\&quot;blob-num js-line-number js-blob-rnum\&quot; data-line-number=\&quot;5\&quot;></td>\n          <td id=\&quot;file-bengio-train-with-grads-py-LC5\&quot; class=\&quot;blob-code blob-code-inner js-file-line\&quot;>    <span class=pl-s1>plt</span>.<span class=pl-c1>xticks</span>(<span class=pl-s1>fontsize</span><span class=pl-c1>=</span><span class=pl-c1>12</span>)</td>\n        </tr>\n        <tr>\n          <td id=\&quot;file-bengio-train-with-grads-py-L6\&quot; 
class=\&quot;blob-num js-line-number js-blob-rnum\&quot; data-line-number=\&quot;6\&quot;></td>\n          <td id=\&quot;file-bengio-train-with-grads-py-LC6\&quot; class=\&quot;blob-code blob-code-inner js-file-line\&quot;>    <span class=pl-s1>plt</span>.<span class=pl-c1>yticks</span>(<span class=pl-s1>fontsize</span><span class=pl-c1>=</span><span class=pl-c1>12</span>)</td>\n        </tr>\n        <tr>\n          <td id=\&quot;file-bengio-train-with-grads-py-L7\&quot; class=\&quot;blob-num js-line-number js-blob-rnum\&quot; data-line-number=\&quot;7\&quot;></td>\n          <td id=\&quot;file-bengio-train-with-grads-py-LC7\&quot; class=\&quot;blob-code blob-code-inner js-file-line\&quot;>    <span class=pl-s1>plt</span>.<span class=pl-c1>title</span>(<span class=pl-s1>title</span>)</td>\n        </tr>\n        <tr>\n          <td id=\&quot;file-bengio-train-with-grads-py-L8\&quot; class=\&quot;blob-num js-line-number js-blob-rnum\&quot; data-line-number=\&quot;8\&quot;></td>\n          <td id=\&quot;file-bengio-train-with-grads-py-LC8\&quot; class=\&quot;blob-code blob-code-inner js-file-line\&quot;>\n</td>\n        </tr>\n        <tr>\n          <td id=\&quot;file-bengio-train-with-grads-py-L9\&quot; class=\&quot;blob-num js-line-number js-blob-rnum\&quot; data-line-number=\&quot;9\&quot;></td>\n          <td id=\&quot;file-bengio-train-with-grads-py-LC9\&quot; class=\&quot;blob-code blob-code-inner js-file-line\&quot;>    <span class=pl-s1>legends</span> <span class=pl-c1>=</span> []</td>\n        </tr>\n        <tr>\n          <td id=\&quot;file-bengio-train-with-grads-py-L10\&quot; class=\&quot;blob-num js-line-number js-blob-rnum\&quot; data-line-number=\&quot;10\&quot;></td>\n          <td id=\&quot;file-bengio-train-with-grads-py-LC10\&quot; class=\&quot;blob-code blob-code-inner js-file-line\&quot;>\n</td>\n        </tr>\n        <tr>\n          <td id=\&quot;file-bengio-train-with-grads-py-L11\&quot; class=\&quot;blob-num js-line-number 
js-blob-rnum\&quot; data-line-number=\&quot;11\&quot;></td>\n          <td id=\&quot;file-bengio-train-with-grads-py-LC11\&quot; class=\&quot;blob-code blob-code-inner js-file-line\&quot;>    <span class=pl-k>assert</span> <span class=pl-en>len</span>(<span class=pl-s1>trn_loss</span>) <span class=pl-c1>%</span> <span class=pl-c1>1000</span> <span class=pl-c1>==</span> <span class=pl-c1>0</span></td>\n        </tr>\n        <tr>\n          <td id=\&quot;file-bengio-train-with-grads-py-L12\&quot; class=\&quot;blob-num js-line-number js-blob-rnum\&quot; data-line-number=\&quot;12\&quot;></td>\n          <td id=\&quot;file-bengio-train-with-grads-py-LC12\&quot; class=\&quot;blob-code blob-code-inner js-file-line\&quot;>    <span class=pl-s1>plt</span>.<span class=pl-c1>plot</span>(<span class=pl-s1>torch</span>.<span class=pl-c1>tensor</span>(<span class=pl-s1>trn_loss</span>).<span class=pl-c1>view</span>(<span class=pl-c1>-</span><span class=pl-c1>1</span>, <span class=pl-c1>1000</span>).<span class=pl-c1>mean</span>(<span class=pl-s1>dim</span><span class=pl-c1>=</span><span class=pl-c1>1</span>))</td>\n        </tr>\n        <tr>\n          <td id=\&quot;file-bengio-train-with-grads-py-L13\&quot; class=\&quot;blob-num js-line-number js-blob-rnum\&quot; data-line-number=\&quot;13\&quot;></td>\n          <td id=\&quot;file-bengio-train-with-grads-py-LC13\&quot; class=\&quot;blob-code blob-code-inner js-file-line\&quot;>    <span class=pl-s1>legends</span>.<span class=pl-c1>append</span>(<span class=pl-s>&amp;quot;train loss&amp;quot;</span>)</td>\n        </tr>\n        <tr>\n          <td id=\&quot;file-bengio-train-with-grads-py-L14\&quot; class=\&quot;blob-num js-line-number js-blob-rnum\&quot; data-line-number=\&quot;14\&quot;></td>\n          <td id=\&quot;file-bengio-train-with-grads-py-LC14\&quot; class=\&quot;blob-code blob-code-inner js-file-line\&quot;>\n</td>\n        </tr>\n        <tr>\n          <td id=\&quot;file-bengio-train-with-grads-py-L15\&quot; 
class=\&quot;blob-num js-line-number js-blob-rnum\&quot; data-line-number=\&quot;15\&quot;></td>\n          <td id=\&quot;file-bengio-train-with-grads-py-LC15\&quot; class=\&quot;blob-code blob-code-inner js-file-line\&quot;>    <span class=pl-k>if</span> <span class=pl-s1>val_loss</span>:</td>\n        </tr>\n        <tr>\n          <td id=\&quot;file-bengio-train-with-grads-py-L16\&quot; class=\&quot;blob-num js-line-number js-blob-rnum\&quot; data-line-number=\&quot;16\&quot;></td>\n          <td id=\&quot;file-bengio-train-with-grads-py-LC16\&quot; class=\&quot;blob-code blob-code-inner js-file-line\&quot;>        <span class=pl-s1>plt</span>.<span class=pl-c1>plot</span>(<span class=pl-s1>torch</span>.<span class=pl-c1>tensor</span>(<span class=pl-s1>val_loss</span>).<span class=pl-c1>view</span>(<span class=pl-c1>-</span><span class=pl-c1>1</span>, <span class=pl-c1>1000</span>).<span class=pl-c1>mean</span>(<span class=pl-s1>dim</span><span class=pl-c1>=</span><span class=pl-c1>1</span>))</td>\n        </tr>\n        <tr>\n          <td id=\&quot;file-bengio-train-with-grads-py-L17\&quot; class=\&quot;blob-num js-line-number js-blob-rnum\&quot; data-line-number=\&quot;17\&quot;></td>\n          <td id=\&quot;file-bengio-train-with-grads-py-LC17\&quot; class=\&quot;blob-code blob-code-inner js-file-line\&quot;>        <span class=pl-s1>legends</span>.<span class=pl-c1>append</span>(<span class=pl-s>&amp;quot;val loss&amp;quot;</span>)</td>\n        </tr>\n        <tr>\n          <td id=\&quot;file-bengio-train-with-grads-py-L18\&quot; class=\&quot;blob-num js-line-number js-blob-rnum\&quot; data-line-number=\&quot;18\&quot;></td>\n          <td id=\&quot;file-bengio-train-with-grads-py-LC18\&quot; class=\&quot;blob-code blob-code-inner js-file-line\&quot;>\n</td>\n        </tr>\n        <tr>\n          <td id=\&quot;file-bengio-train-with-grads-py-L19\&quot; class=\&quot;blob-num js-line-number js-blob-rnum\&quot; data-line-number=\&quot;19\&quot;></td>\n     
     <td id=\&quot;file-bengio-train-with-grads-py-LC19\&quot; class=\&quot;blob-code blob-code-inner js-file-line\&quot;>    <span class=pl-s1>plt</span>.<span class=pl-c1>legend</span>(<span class=pl-s1>legends</span>)</td>\n        </tr>\n        <tr>\n          <td id=\&quot;file-bengio-train-with-grads-py-L20\&quot; class=\&quot;blob-num js-line-number js-blob-rnum\&quot; data-line-number=\&quot;20\&quot;></td>\n          <td id=\&quot;file-bengio-train-with-grads-py-LC20\&quot; class=\&quot;blob-code blob-code-inner js-file-line\&quot;>    <span class=pl-s1>plt</span>.<span class=pl-c1>show</span>()</td>\n        </tr>\n        <tr>\n          <td id=\&quot;file-bengio-train-with-grads-py-L21\&quot; class=\&quot;blob-num js-line-number js-blob-rnum\&quot; data-line-number=\&quot;21\&quot;></td>\n          <td id=\&quot;file-bengio-train-with-grads-py-LC21\&quot; class=\&quot;blob-code blob-code-inner js-file-line\&quot;>\n</td>\n        </tr>\n        <tr>\n          <td id=\&quot;file-bengio-train-with-grads-py-L22\&quot; class=\&quot;blob-num js-line-number js-blob-rnum\&quot; data-line-number=\&quot;22\&quot;></td>\n          <td id=\&quot;file-bengio-train-with-grads-py-LC22\&quot; class=\&quot;blob-code blob-code-inner js-file-line\&quot;><span class=pl-c># Training loop</span></td>\n        </tr>\n        <tr>\n          <td id=\&quot;file-bengio-train-with-grads-py-L23\&quot; class=\&quot;blob-num js-line-number js-blob-rnum\&quot; data-line-number=\&quot;23\&quot;></td>\n          <td id=\&quot;file-bengio-train-with-grads-py-LC23\&quot; class=\&quot;blob-code blob-code-inner js-file-line\&quot;><span class=pl-s1>trn_loss</span> <span class=pl-c1>=</span> []</td>\n        </tr>\n        <tr>\n          <td id=\&quot;file-bengio-train-with-grads-py-L24\&quot; class=\&quot;blob-num js-line-number js-blob-rnum\&quot; data-line-number=\&quot;24\&quot;></td>\n          <td id=\&quot;file-bengio-train-with-grads-py-LC24\&quot; class=\&quot;blob-code 
blob-code-inner js-file-line\&quot;><span class=pl-s1>val_loss</span> <span class=pl-c1>=</span> []</td>\n        </tr>\n        <tr>\n          <td id=\&quot;file-bengio-train-with-grads-py-L25\&quot; class=\&quot;blob-num js-line-number js-blob-rnum\&quot; data-line-number=\&quot;25\&quot;></td>\n          <td id=\&quot;file-bengio-train-with-grads-py-LC25\&quot; class=\&quot;blob-code blob-code-inner js-file-line\&quot;>\n</td>\n        </tr>\n        <tr>\n          <td id=\&quot;file-bengio-train-with-grads-py-L26\&quot; class=\&quot;blob-num js-line-number js-blob-rnum\&quot; data-line-number=\&quot;26\&quot;></td>\n          <td id=\&quot;file-bengio-train-with-grads-py-LC26\&quot; class=\&quot;blob-code blob-code-inner js-file-line\&quot;><span class=pl-k>def</span> <span class=pl-en>train</span>(<span class=pl-s1>device</span>, <span class=pl-s1>max_step</span>, <span class=pl-v>X_trn</span>, <span class=pl-v>Y_trn</span>, <span class=pl-v>X_val</span>, <span class=pl-v>Y_val</span>, <span class=pl-s1>batch_size</span>, <span class=pl-s1>g</span>, <span class=pl-s1>model</span>, <span class=pl-s1>params</span>, <span class=pl-s1>lr</span>, <span class=pl-s1>trn_loss</span>, <span class=pl-s1>val_loss</span>, <span class=pl-s1>with_grad</span><span class=pl-c1>=</span><span class=pl-c1>False</span>):</td>\n        </tr>\n        <tr>\n          <td id=\&quot;file-bengio-train-with-grads-py-L27\&quot; class=\&quot;blob-num js-line-number js-blob-rnum\&quot; data-line-number=\&quot;27\&quot;></td>\n          <td id=\&quot;file-bengio-train-with-grads-py-LC27\&quot; class=\&quot;blob-code blob-code-inner js-file-line\&quot;>    <span class=pl-s1>grads</span> <span class=pl-c1>=</span> {}</td>\n        </tr>\n        <tr>\n          <td id=\&quot;file-bengio-train-with-grads-py-L28\&quot; class=\&quot;blob-num js-line-number js-blob-rnum\&quot; data-line-number=\&quot;28\&quot;></td>\n          <td id=\&quot;file-bengio-train-with-grads-py-LC28\&quot; 
class=\&quot;blob-code blob-code-inner js-file-line\&quot;>\n</td>\n        </tr>\n        <tr>\n          <td id=\&quot;file-bengio-train-with-grads-py-L29\&quot; class=\&quot;blob-num js-line-number js-blob-rnum\&quot; data-line-number=\&quot;29\&quot;></td>\n          <td id=\&quot;file-bengio-train-with-grads-py-LC29\&quot; class=\&quot;blob-code blob-code-inner js-file-line\&quot;>    <span class=pl-k>for</span> <span class=pl-s1>i</span> <span class=pl-c1>in</span> <span class=pl-en>range</span>(<span class=pl-s1>max_step</span>):</td>\n        </tr>\n        <tr>\n          <td id=\&quot;file-bengio-train-with-grads-py-L30\&quot; class=\&quot;blob-num js-line-number js-blob-rnum\&quot; data-line-number=\&quot;30\&quot;></td>\n          <td id=\&quot;file-bengio-train-with-grads-py-LC30\&quot; class=\&quot;blob-code blob-code-inner js-file-line\&quot;>        <span class=pl-s1>ix</span> <span class=pl-c1>=</span> <span class=pl-s1>torch</span>.<span class=pl-c1>randint</span>(<span class=pl-c1>0</span>, <span class=pl-v>X_trn</span>.<span class=pl-c1>shape</span>[<span class=pl-c1>0</span>], (<span class=pl-s1>batch_size</span>,), <span class=pl-s1>generator</span><span class=pl-c1>=</span><span class=pl-s1>g</span>, <span class=pl-s1>device</span><span class=pl-c1>=</span><span class=pl-s1>device</span>)</td>\n        </tr>\n        <tr>\n          <td id=\&quot;file-bengio-train-with-grads-py-L31\&quot; class=\&quot;blob-num js-line-number js-blob-rnum\&quot; data-line-number=\&quot;31\&quot;></td>\n          <td id=\&quot;file-bengio-train-with-grads-py-LC31\&quot; class=\&quot;blob-code blob-code-inner js-file-line\&quot;>        <span class=pl-s1>x</span> <span class=pl-c1>=</span> <span class=pl-v>X_trn</span>[<span class=pl-s1>ix</span>]</td>\n        </tr>\n        <tr>\n          <td id=\&quot;file-bengio-train-with-grads-py-L32\&quot; class=\&quot;blob-num js-line-number js-blob-rnum\&quot; data-line-number=\&quot;32\&quot;></td>\n          <td 
id=\&quot;file-bengio-train-with-grads-py-LC32\&quot; class=\&quot;blob-code blob-code-inner js-file-line\&quot;>\n</td>\n        </tr>\n        <tr>\n          <td id=\&quot;file-bengio-train-with-grads-py-L33\&quot; class=\&quot;blob-num js-line-number js-blob-rnum\&quot; data-line-number=\&quot;33\&quot;></td>\n          <td id=\&quot;file-bengio-train-with-grads-py-LC33\&quot; class=\&quot;blob-code blob-code-inner js-file-line\&quot;>        <span class=pl-c># Forward pass</span></td>\n        </tr>\n        <tr>\n          <td id=\&quot;file-bengio-train-with-grads-py-L34\&quot; class=\&quot;blob-num js-line-number js-blob-rnum\&quot; data-line-number=\&quot;34\&quot;></td>\n          <td id=\&quot;file-bengio-train-with-grads-py-LC34\&quot; class=\&quot;blob-code blob-code-inner js-file-line\&quot;>        <span class=pl-k>for</span> <span class=pl-s1>layer</span> <span class=pl-c1>in</span> <span class=pl-s1>model</span>:</td>\n        </tr>\n        <tr>\n          <td id=\&quot;file-bengio-train-with-grads-py-L35\&quot; class=\&quot;blob-num js-line-number js-blob-rnum\&quot; data-line-number=\&quot;35\&quot;></td>\n          <td id=\&quot;file-bengio-train-with-grads-py-LC35\&quot; class=\&quot;blob-code blob-code-inner js-file-line\&quot;>            <span class=pl-s1>x</span> <span class=pl-c1>=</span> <span class=pl-en>layer</span>(<span class=pl-s1>x</span>)</td>\n        </tr>\n        <tr>\n          <td id=\&quot;file-bengio-train-with-grads-py-L36\&quot; class=\&quot;blob-num js-line-number js-blob-rnum\&quot; data-line-number=\&quot;36\&quot;></td>\n          <td id=\&quot;file-bengio-train-with-grads-py-LC36\&quot; class=\&quot;blob-code blob-code-inner js-file-line\&quot;>\n</td>\n        </tr>\n        <tr>\n          <td id=\&quot;file-bengio-train-with-grads-py-L37\&quot; class=\&quot;blob-num js-line-number js-blob-rnum\&quot; data-line-number=\&quot;37\&quot;></td>\n          <td id=\&quot;file-bengio-train-with-grads-py-LC37\&quot; 
class=\&quot;blob-code blob-code-inner js-file-line\&quot;>        <span class=pl-c># Compute loss</span></td>\n        </tr>\n        <tr>\n          <td id=\&quot;file-bengio-train-with-grads-py-L38\&quot; class=\&quot;blob-num js-line-number js-blob-rnum\&quot; data-line-number=\&quot;38\&quot;></td>\n          <td id=\&quot;file-bengio-train-with-grads-py-LC38\&quot; class=\&quot;blob-code blob-code-inner js-file-line\&quot;>        <span class=pl-s1>loss</span> <span class=pl-c1>=</span> <span class=pl-c1>F</span>.<span class=pl-c1>cross_entropy</span>(<span class=pl-s1>x</span>, <span class=pl-v>Y_trn</span>[<span class=pl-s1>ix</span>])</td>\n        </tr>\n        <tr>\n          <td id=\&quot;file-bengio-train-with-grads-py-L39\&quot; class=\&quot;blob-num js-line-number js-blob-rnum\&quot; data-line-number=\&quot;39\&quot;></td>\n          <td id=\&quot;file-bengio-train-with-grads-py-LC39\&quot; class=\&quot;blob-code blob-code-inner js-file-line\&quot;>        <span class=pl-s1>trn_loss</span>.<span class=pl-c1>append</span>(<span class=pl-s1>loss</span>.<span class=pl-c1>item</span>())</td>\n        </tr>\n        <tr>\n          <td id=\&quot;file-bengio-train-with-grads-py-L40\&quot; class=\&quot;blob-num js-line-number js-blob-rnum\&quot; data-line-number=\&quot;40\&quot;></td>\n          <td id=\&quot;file-bengio-train-with-grads-py-LC40\&quot; class=\&quot;blob-code blob-code-inner js-file-line\&quot;>\n</td>\n        </tr>\n        <tr>\n          <td id=\&quot;file-bengio-train-with-grads-py-L41\&quot; class=\&quot;blob-num js-line-number js-blob-rnum\&quot; data-line-number=\&quot;41\&quot;></td>\n          <td id=\&quot;file-bengio-train-with-grads-py-LC41\&quot; class=\&quot;blob-code blob-code-inner js-file-line\&quot;>        <span class=pl-c># Retain all gradients for visualization</span></td>\n        </tr>\n        <tr>\n          <td id=\&quot;file-bengio-train-with-grads-py-L42\&quot; class=\&quot;blob-num js-line-number 
js-blob-rnum\&quot; data-line-number=\&quot;42\&quot;></td>\n          <td id=\&quot;file-bengio-train-with-grads-py-LC42\&quot; class=\&quot;blob-code blob-code-inner js-file-line\&quot;>        <span class=pl-k>if</span> <span class=pl-s1>with_grad</span>:</td>\n        </tr>\n        <tr>\n          <td id=\&quot;file-bengio-train-with-grads-py-L43\&quot; class=\&quot;blob-num js-line-number js-blob-rnum\&quot; data-line-number=\&quot;43\&quot;></td>\n          <td id=\&quot;file-bengio-train-with-grads-py-LC43\&quot; class=\&quot;blob-code blob-code-inner js-file-line\&quot;>            <span class=pl-k>for</span> <span class=pl-s1>layer</span> <span class=pl-c1>in</span> <span class=pl-s1>model</span>:</td>\n        </tr>\n        <tr>\n          <td id=\&quot;file-bengio-train-with-grads-py-L44\&quot; class=\&quot;blob-num js-line-number js-blob-rnum\&quot; data-line-number=\&quot;44\&quot;></td>\n          <td id=\&quot;file-bengio-train-with-grads-py-LC44\&quot; class=\&quot;blob-code blob-code-inner js-file-line\&quot;>                <span class=pl-s1>layer</span>.<span class=pl-c1>out</span>.<span class=pl-c1>retain_grad</span>()</td>\n        </tr>\n        <tr>\n          <td id=\&quot;file-bengio-train-with-grads-py-L45\&quot; class=\&quot;blob-num js-line-number js-blob-rnum\&quot; data-line-number=\&quot;45\&quot;></td>\n          <td id=\&quot;file-bengio-train-with-grads-py-LC45\&quot; class=\&quot;blob-code blob-code-inner js-file-line\&quot;>\n</td>\n        </tr>\n        <tr>\n          <td id=\&quot;file-bengio-train-with-grads-py-L46\&quot; class=\&quot;blob-num js-line-number js-blob-rnum\&quot; data-line-number=\&quot;46\&quot;></td>\n          <td id=\&quot;file-bengio-train-with-grads-py-LC46\&quot; class=\&quot;blob-code blob-code-inner js-file-line\&quot;>        <span class=pl-c># Zero gradients to prevent accumulation</span></td>\n        </tr>\n        <tr>\n          <td id=\&quot;file-bengio-train-with-grads-py-L47\&quot; 
class=\&quot;blob-num js-line-number js-blob-rnum\&quot; data-line-number=\&quot;47\&quot;></td>\n          <td id=\&quot;file-bengio-train-with-grads-py-LC47\&quot; class=\&quot;blob-code blob-code-inner js-file-line\&quot;>        <span class=pl-k>for</span> <span class=pl-s1>p</span> <span class=pl-c1>in</span> <span class=pl-s1>params</span>:</td>\n        </tr>\n        <tr>\n          <td id=\&quot;file-bengio-train-with-grads-py-L48\&quot; class=\&quot;blob-num js-line-number js-blob-rnum\&quot; data-line-number=\&quot;48\&quot;></td>\n          <td id=\&quot;file-bengio-train-with-grads-py-LC48\&quot; class=\&quot;blob-code blob-code-inner js-file-line\&quot;>            <span class=pl-s1>p</span>.<span class=pl-c1>grad</span> <span class=pl-c1>=</span> <span class=pl-c1>None</span></td>\n        </tr>\n        <tr>\n          <td id=\&quot;file-bengio-train-with-grads-py-L49\&quot; class=\&quot;blob-num js-line-number js-blob-rnum\&quot; data-line-number=\&quot;49\&quot;></td>\n          <td id=\&quot;file-bengio-train-with-grads-py-LC49\&quot; class=\&quot;blob-code blob-code-inner js-file-line\&quot;>\n</td>\n        </tr>\n        <tr>\n          <td id=\&quot;file-bengio-train-with-grads-py-L50\&quot; class=\&quot;blob-num js-line-number js-blob-rnum\&quot; data-line-number=\&quot;50\&quot;></td>\n          <td id=\&quot;file-bengio-train-with-grads-py-LC50\&quot; class=\&quot;blob-code blob-code-inner js-file-line\&quot;>        <span class=pl-c># Backpropagation</span></td>\n        </tr>\n        <tr>\n          <td id=\&quot;file-bengio-train-with-grads-py-L51\&quot; class=\&quot;blob-num js-line-number js-blob-rnum\&quot; data-line-number=\&quot;51\&quot;></td>\n          <td id=\&quot;file-bengio-train-with-grads-py-LC51\&quot; class=\&quot;blob-code blob-code-inner js-file-line\&quot;>        <span class=pl-s1>loss</span>.<span class=pl-c1>backward</span>()</td>\n        </tr>\n        <tr>\n          <td 
id=\&quot;file-bengio-train-with-grads-py-L52\&quot; class=\&quot;blob-num js-line-number js-blob-rnum\&quot; data-line-number=\&quot;52\&quot;></td>\n          <td id=\&quot;file-bengio-train-with-grads-py-LC52\&quot; class=\&quot;blob-code blob-code-inner js-file-line\&quot;>\n</td>\n        </tr>\n        <tr>\n          <td id=\&quot;file-bengio-train-with-grads-py-L53\&quot; class=\&quot;blob-num js-line-number js-blob-rnum\&quot; data-line-number=\&quot;53\&quot;></td>\n          <td id=\&quot;file-bengio-train-with-grads-py-LC53\&quot; class=\&quot;blob-code blob-code-inner js-file-line\&quot;>        <span class=pl-k>if</span> <span class=pl-s1>i</span> <span class=pl-c1>&amp;gt;</span> <span class=pl-c1>80000</span>:</td>\n        </tr>\n        <tr>\n          <td id=\&quot;file-bengio-train-with-grads-py-L54\&quot; class=\&quot;blob-num js-line-number js-blob-rnum\&quot; data-line-number=\&quot;54\&quot;></td>\n          <td id=\&quot;file-bengio-train-with-grads-py-LC54\&quot; class=\&quot;blob-code blob-code-inner js-file-line\&quot;>            <span class=pl-s1>lr</span> <span class=pl-c1>=</span> <span class=pl-c1>1e-4</span></td>\n        </tr>\n        <tr>\n          <td id=\&quot;file-bengio-train-with-grads-py-L55\&quot; class=\&quot;blob-num js-line-number js-blob-rnum\&quot; data-line-number=\&quot;55\&quot;></td>\n          <td id=\&quot;file-bengio-train-with-grads-py-LC55\&quot; class=\&quot;blob-code blob-code-inner js-file-line\&quot;>\n</td>\n        </tr>\n        <tr>\n          <td id=\&quot;file-bengio-train-with-grads-py-L56\&quot; class=\&quot;blob-num js-line-number js-blob-rnum\&quot; data-line-number=\&quot;56\&quot;></td>\n          <td id=\&quot;file-bengio-train-with-grads-py-LC56\&quot; class=\&quot;blob-code blob-code-inner js-file-line\&quot;>        <span class=pl-c># Update params</span></td>\n        </tr>\n        <tr>\n          <td id=\&quot;file-bengio-train-with-grads-py-L57\&quot; class=\&quot;blob-num 
js-line-number js-blob-rnum\&quot; data-line-number=\&quot;57\&quot;></td>\n          <td id=\&quot;file-bengio-train-with-grads-py-LC57\&quot; class=\&quot;blob-code blob-code-inner js-file-line\&quot;>        <span class=pl-k>for</span> <span class=pl-s1>p</span> <span class=pl-c1>in</span> <span class=pl-s1>params</span>:</td>\n        </tr>\n        <tr>\n          <td id=\&quot;file-bengio-train-with-grads-py-L58\&quot; class=\&quot;blob-num js-line-number js-blob-rnum\&quot; data-line-number=\&quot;58\&quot;></td>\n          <td id=\&quot;file-bengio-train-with-grads-py-LC58\&quot; class=\&quot;blob-code blob-code-inner js-file-line\&quot;>            <span class=pl-s1>p</span>.<span class=pl-c1>data</span> <span class=pl-c1>+=</span> <span class=pl-c1>-</span><span class=pl-s1>lr</span> <span class=pl-c1>*</span> <span class=pl-s1>p</span>.<span class=pl-c1>grad</span></td>\n        </tr>\n        <tr>\n          <td id=\&quot;file-bengio-train-with-grads-py-L59\&quot; class=\&quot;blob-num js-line-number js-blob-rnum\&quot; data-line-number=\&quot;59\&quot;></td>\n          <td id=\&quot;file-bengio-train-with-grads-py-LC59\&quot; class=\&quot;blob-code blob-code-inner js-file-line\&quot;>\n</td>\n        </tr>\n        <tr>\n          <td id=\&quot;file-bengio-train-with-grads-py-L60\&quot; class=\&quot;blob-num js-line-number js-blob-rnum\&quot; data-line-number=\&quot;60\&quot;></td>\n          <td id=\&quot;file-bengio-train-with-grads-py-LC60\&quot; class=\&quot;blob-code blob-code-inner js-file-line\&quot;>        <span class=pl-c># Copy gradients for visualizations</span></td>\n        </tr>\n        <tr>\n          <td id=\&quot;file-bengio-train-with-grads-py-L61\&quot; class=\&quot;blob-num js-line-number js-blob-rnum\&quot; data-line-number=\&quot;61\&quot;></td>\n          <td id=\&quot;file-bengio-train-with-grads-py-LC61\&quot; class=\&quot;blob-code blob-code-inner js-file-line\&quot;>        <span class=pl-k>if</span> <span 
class=pl-s1>with_grad</span> <span class=pl-c1>and</span> <span class=pl-s1>i</span> <span class=pl-c1>%</span> <span class=pl-c1>20000</span> <span class=pl-c1>==</span> <span class=pl-c1>0</span>:</td>\n        </tr>\n        <tr>\n          <td id=\&quot;file-bengio-train-with-grads-py-L62\&quot; class=\&quot;blob-num js-line-number js-blob-rnum\&quot; data-line-number=\&quot;62\&quot;></td>\n          <td id=\&quot;file-bengio-train-with-grads-py-LC62\&quot; class=\&quot;blob-code blob-code-inner js-file-line\&quot;>            <span class=pl-k>for</span> <span class=pl-s1>j</span>, <span class=pl-s1>layer</span> <span class=pl-c1>in</span> <span class=pl-en>enumerate</span>(<span class=pl-s1>model</span>):</td>\n        </tr>\n        <tr>\n          <td id=\&quot;file-bengio-train-with-grads-py-L63\&quot; class=\&quot;blob-num js-line-number js-blob-rnum\&quot; data-line-number=\&quot;63\&quot;></td>\n          <td id=\&quot;file-bengio-train-with-grads-py-LC63\&quot; class=\&quot;blob-code blob-code-inner js-file-line\&quot;>                <span class=pl-k>if</span> <span class=pl-s1>j</span> <span class=pl-c1><span class=pl-c1>not</span> <span class=pl-c1>in</span></span> <span class=pl-s1>grads</span>:</td>\n        </tr>\n        <tr>\n          <td id=\&quot;file-bengio-train-with-grads-py-L64\&quot; class=\&quot;blob-num js-line-number js-blob-rnum\&quot; data-line-number=\&quot;64\&quot;></td>\n          <td id=\&quot;file-bengio-train-with-grads-py-LC64\&quot; class=\&quot;blob-code blob-code-inner js-file-line\&quot;>                    <span class=pl-s1>grads</span>[<span class=pl-s1>j</span>] <span class=pl-c1>=</span> {}</td>\n        </tr>\n        <tr>\n          <td id=\&quot;file-bengio-train-with-grads-py-L65\&quot; class=\&quot;blob-num js-line-number js-blob-rnum\&quot; data-line-number=\&quot;65\&quot;></td>\n          <td id=\&quot;file-bengio-train-with-grads-py-LC65\&quot; class=\&quot;blob-code blob-code-inner js-file-line\&quot;>     
           <span class=pl-s1>grads</span>[<span class=pl-s1>j</span>][<span class=pl-s1>i</span>] <span class=pl-c1>=</span> <span class=pl-s1>layer</span>.<span class=pl-c1>out</span>.<span class=pl-c1>grad</span>.<span class=pl-c1>cpu</span>().<span class=pl-c1>tolist</span>()</td>\n        </tr>\n        <tr>\n          <td id=\&quot;file-bengio-train-with-grads-py-L66\&quot; class=\&quot;blob-num js-line-number js-blob-rnum\&quot; data-line-number=\&quot;66\&quot;></td>\n          <td id=\&quot;file-bengio-train-with-grads-py-LC66\&quot; class=\&quot;blob-code blob-code-inner js-file-line\&quot;>\n</td>\n        </tr>\n        <tr>\n          <td id=\&quot;file-bengio-train-with-grads-py-L67\&quot; class=\&quot;blob-num js-line-number js-blob-rnum\&quot; data-line-number=\&quot;67\&quot;></td>\n          <td id=\&quot;file-bengio-train-with-grads-py-LC67\&quot; class=\&quot;blob-code blob-code-inner js-file-line\&quot;>        <span class=pl-c># Validation</span></td>\n        </tr>\n        <tr>\n          <td id=\&quot;file-bengio-train-with-grads-py-L68\&quot; class=\&quot;blob-num js-line-number js-blob-rnum\&quot; data-line-number=\&quot;68\&quot;></td>\n          <td id=\&quot;file-bengio-train-with-grads-py-LC68\&quot; class=\&quot;blob-code blob-code-inner js-file-line\&quot;>        <span class=pl-k>with</span> <span class=pl-s1>torch</span>.<span class=pl-c1>no_grad</span>():</td>\n        </tr>\n        <tr>\n          <td id=\&quot;file-bengio-train-with-grads-py-L69\&quot; class=\&quot;blob-num js-line-number js-blob-rnum\&quot; data-line-number=\&quot;69\&quot;></td>\n          <td id=\&quot;file-bengio-train-with-grads-py-LC69\&quot; class=\&quot;blob-code blob-code-inner js-file-line\&quot;>            <span class=pl-s1>ix</span> <span class=pl-c1>=</span> <span class=pl-s1>torch</span>.<span class=pl-c1>randint</span>(<span class=pl-c1>0</span>, <span class=pl-v>X_val</span>.<span class=pl-c1>shape</span>[<span class=pl-c1>0</span>], (<span 
class=pl-s1>batch_size</span>,), <span class=pl-s1>generator</span><span class=pl-c1>=</span><span class=pl-s1>g</span>, <span class=pl-s1>device</span><span class=pl-c1>=</span><span class=pl-s1>device</span>)</td>\n        </tr>\n        <tr>\n          <td id=\&quot;file-bengio-train-with-grads-py-L70\&quot; class=\&quot;blob-num js-line-number js-blob-rnum\&quot; data-line-number=\&quot;70\&quot;></td>\n          <td id=\&quot;file-bengio-train-with-grads-py-LC70\&quot; class=\&quot;blob-code blob-code-inner js-file-line\&quot;>            <span class=pl-s1>x</span> <span class=pl-c1>=</span> <span class=pl-v>X_val</span>[<span class=pl-s1>ix</span>]</td>\n        </tr>\n        <tr>\n          <td id=\&quot;file-bengio-train-with-grads-py-L71\&quot; class=\&quot;blob-num js-line-number js-blob-rnum\&quot; data-line-number=\&quot;71\&quot;></td>\n          <td id=\&quot;file-bengio-train-with-grads-py-LC71\&quot; class=\&quot;blob-code blob-code-inner js-file-line\&quot;>\n</td>\n        </tr>\n        <tr>\n          <td id=\&quot;file-bengio-train-with-grads-py-L72\&quot; class=\&quot;blob-num js-line-number js-blob-rnum\&quot; data-line-number=\&quot;72\&quot;></td>\n          <td id=\&quot;file-bengio-train-with-grads-py-LC72\&quot; class=\&quot;blob-code blob-code-inner js-file-line\&quot;>            <span class=pl-k>for</span> <span class=pl-s1>layer</span> <span class=pl-c1>in</span> <span class=pl-s1>model</span>:</td>\n        </tr>\n        <tr>\n          <td id=\&quot;file-bengio-train-with-grads-py-L73\&quot; class=\&quot;blob-num js-line-number js-blob-rnum\&quot; data-line-number=\&quot;73\&quot;></td>\n          <td id=\&quot;file-bengio-train-with-grads-py-LC73\&quot; class=\&quot;blob-code blob-code-inner js-file-line\&quot;>                <span class=pl-s1>x</span> <span class=pl-c1>=</span> <span class=pl-en>layer</span>(<span class=pl-s1>x</span>)</td>\n        </tr>\n        <tr>\n          <td 
id=\&quot;file-bengio-train-with-grads-py-L74\&quot; class=\&quot;blob-num js-line-number js-blob-rnum\&quot; data-line-number=\&quot;74\&quot;></td>\n          <td id=\&quot;file-bengio-train-with-grads-py-LC74\&quot; class=\&quot;blob-code blob-code-inner js-file-line\&quot;>\n</td>\n        </tr>\n        <tr>\n          <td id=\&quot;file-bengio-train-with-grads-py-L75\&quot; class=\&quot;blob-num js-line-number js-blob-rnum\&quot; data-line-number=\&quot;75\&quot;></td>\n          <td id=\&quot;file-bengio-train-with-grads-py-LC75\&quot; class=\&quot;blob-code blob-code-inner js-file-line\&quot;>            <span class=pl-s1>loss</span> <span class=pl-c1>=</span> <span class=pl-c1>F</span>.<span class=pl-c1>cross_entropy</span>(<span class=pl-s1>x</span>, <span class=pl-v>Y_val</span>[<span class=pl-s1>ix</span>])</td>\n        </tr>\n        <tr>\n          <td id=\&quot;file-bengio-train-with-grads-py-L76\&quot; class=\&quot;blob-num js-line-number js-blob-rnum\&quot; data-line-number=\&quot;76\&quot;></td>\n          <td id=\&quot;file-bengio-train-with-grads-py-LC76\&quot; class=\&quot;blob-code blob-code-inner js-file-line\&quot;>            <span class=pl-s1>val_loss</span>.<span class=pl-c1>append</span>(<span class=pl-s1>loss</span>.<span class=pl-c1>item</span>())</td>\n        </tr>\n        <tr>\n          <td id=\&quot;file-bengio-train-with-grads-py-L77\&quot; class=\&quot;blob-num js-line-number js-blob-rnum\&quot; data-line-number=\&quot;77\&quot;></td>\n          <td id=\&quot;file-bengio-train-with-grads-py-LC77\&quot; class=\&quot;blob-code blob-code-inner js-file-line\&quot;>\n</td>\n        </tr>\n        <tr>\n          <td id=\&quot;file-bengio-train-with-grads-py-L78\&quot; class=\&quot;blob-num js-line-number js-blob-rnum\&quot; data-line-number=\&quot;78\&quot;></td>\n          <td id=\&quot;file-bengio-train-with-grads-py-LC78\&quot; class=\&quot;blob-code blob-code-inner js-file-line\&quot;>        <span class=pl-k>if</span> <span 
class=pl-s1>i</span> <span class=pl-c1>%</span> <span class=pl-c1>10000</span> <span class=pl-c1>==</span> <span class=pl-c1>0</span>:</td>\n        </tr>\n        <tr>\n          <td id=\&quot;file-bengio-train-with-grads-py-L79\&quot; class=\&quot;blob-num js-line-number js-blob-rnum\&quot; data-line-number=\&quot;79\&quot;></td>\n          <td id=\&quot;file-bengio-train-with-grads-py-LC79\&quot; class=\&quot;blob-code blob-code-inner js-file-line\&quot;>            <span class=pl-en>print</span>(<span class=pl-s>f&amp;quot;step <span class=pl-s1><span class=pl-kos>{</span><span class=pl-s1>i</span>:7d<span class=pl-kos>}</span></span> | train loss <span class=pl-s1><span class=pl-kos>{</span><span class=pl-s1>trn_loss</span>[<span class=pl-c1>-</span><span class=pl-c1>1</span>]:.4f<span class=pl-kos>}</span></span> | val loss <span class=pl-s1><span class=pl-kos>{</span><span class=pl-s1>val_loss</span>[<span class=pl-c1>-</span><span class=pl-c1>1</span>]:.4f<span class=pl-kos>}</span></span>&amp;quot;</span>)</td>\n        </tr>\n        <tr>\n          <td id=\&quot;file-bengio-train-with-grads-py-L80\&quot; class=\&quot;blob-num js-line-number js-blob-rnum\&quot; data-line-number=\&quot;80\&quot;></td>\n          <td id=\&quot;file-bengio-train-with-grads-py-LC80\&quot; class=\&quot;blob-code blob-code-inner js-file-line\&quot;>\n</td>\n        </tr>\n        <tr>\n          <td id=\&quot;file-bengio-train-with-grads-py-L81\&quot; class=\&quot;blob-num js-line-number js-blob-rnum\&quot; data-line-number=\&quot;81\&quot;></td>\n          <td id=\&quot;file-bengio-train-with-grads-py-LC81\&quot; class=\&quot;blob-code blob-code-inner js-file-line\&quot;>    <span class=pl-k>return</span> <span class=pl-s1>trn_loss</span>, <span class=pl-s1>val_loss</span>, <span class=pl-s1>grads</span></td>\n        </tr>\n  </table>\n</div>\n\n\n    </div>\n\n  </div>\n</div>\n\n      </div>\n      <div class=\&quot;gist-meta\&quot;>\n        <a 
href=\&quot;https://gist.github.com/cjams/b9c6a1f614f2986d462eddfdfc457d50/raw/c5d637dbfc30ffe03211cae4ea506752577c8a38/bengio-train-with-grads.py\&quot; style=\&quot;float:right\&quot; class=\&quot;Link--inTextBlock\&quot;>view raw</a>\n        <a href=\&quot;https://gist.github.com/cjams/b9c6a1f614f2986d462eddfdfc457d50#file-bengio-train-with-grads-py\&quot; class=\&quot;Link--inTextBlock\&quot;>\n          bengio-train-with-grads.py\n        </a>\n        hosted with &amp;#10084; by <a class=\&quot;Link--inTextBlock\&quot; href=\&quot;https://github.com\&quot;>GitHub</a>\n      </div>\n    </div>\n</div>\n&quot;,&quot;stylesheet&quot;:&quot;https://github.githubassets.com/assets/gist-embed-ed91f9610ae6.css&quot;}" data-component-name="GitgistToDOM"><link rel="stylesheet" href="https://github.githubassets.com/assets/gist-embed-ed91f9610ae6.css"><div id="gist144136881" class="gist">
    <div class="gist-file" data-color-mode="light" data-light-theme="light">
      <div class="gist-data">
        <div class="js-gist-file-update-container js-task-list-container">
  <div id="file-bengio-train-with-grads-py" class="file my-2">
    
    <div itemprop="text" class="Box-body p-0 blob-wrapper data type-python  " style="overflow:auto">

        
<div class="js-check-hidden-unicode js-blob-code-container blob-code-content">

  


  <table data-hpc="" class="highlight tab-size js-file-line-container" data-tab-size="4" data-paste-markdown-skip="" data-tagsearch-path="bengio-train-with-grads.py">
        <tbody><tr>
          <td id="file-bengio-train-with-grads-py-L1" class="blob-num js-line-number js-blob-rnum" data-line-number="1"></td>
          <td id="file-bengio-train-with-grads-py-LC1" class="blob-code blob-code-inner js-file-line"><span class="pl-c1">%</span><span class="pl-s1">matplotlib</span> <span class="pl-s1">inline</span></td>
        </tr>
        <tr>
          <td id="file-bengio-train-with-grads-py-L2" class="blob-num js-line-number js-blob-rnum" data-line-number="2"></td>
          <td id="file-bengio-train-with-grads-py-LC2" class="blob-code blob-code-inner js-file-line">
</td>
        </tr>
        <tr>
          <td id="file-bengio-train-with-grads-py-L3" class="blob-num js-line-number js-blob-rnum" data-line-number="3"></td>
          <td id="file-bengio-train-with-grads-py-LC3" class="blob-code blob-code-inner js-file-line"><span class="pl-k">def</span> <span class="pl-en">plot_loss</span>(<span class="pl-s1">trn_loss</span>, <span class="pl-s1">val_loss</span><span class="pl-c1">=</span><span class="pl-c1">None</span>, <span class="pl-s1">title</span><span class="pl-c1">=</span><span class="pl-s">"Loss Curves"</span>):</td>
        </tr>
        <tr>
          <td id="file-bengio-train-with-grads-py-L4" class="blob-num js-line-number js-blob-rnum" data-line-number="4"></td>
          <td id="file-bengio-train-with-grads-py-LC4" class="blob-code blob-code-inner js-file-line">    <span class="pl-s1">plt</span>.<span class="pl-c1">figure</span>(<span class="pl-s1">figsize</span><span class="pl-c1">=</span>(<span class="pl-c1">10</span>, <span class="pl-c1">6</span>))</td>
        </tr>
        <tr>
          <td id="file-bengio-train-with-grads-py-L5" class="blob-num js-line-number js-blob-rnum" data-line-number="5"></td>
          <td id="file-bengio-train-with-grads-py-LC5" class="blob-code blob-code-inner js-file-line">    <span class="pl-s1">plt</span>.<span class="pl-c1">xticks</span>(<span class="pl-s1">fontsize</span><span class="pl-c1">=</span><span class="pl-c1">12</span>)</td>
        </tr>
        <tr>
          <td id="file-bengio-train-with-grads-py-L6" class="blob-num js-line-number js-blob-rnum" data-line-number="6"></td>
          <td id="file-bengio-train-with-grads-py-LC6" class="blob-code blob-code-inner js-file-line">    <span class="pl-s1">plt</span>.<span class="pl-c1">yticks</span>(<span class="pl-s1">fontsize</span><span class="pl-c1">=</span><span class="pl-c1">12</span>)</td>
        </tr>
        <tr>
          <td id="file-bengio-train-with-grads-py-L7" class="blob-num js-line-number js-blob-rnum" data-line-number="7"></td>
          <td id="file-bengio-train-with-grads-py-LC7" class="blob-code blob-code-inner js-file-line">    <span class="pl-s1">plt</span>.<span class="pl-c1">title</span>(<span class="pl-s1">title</span>)</td>
        </tr>
        <tr>
          <td id="file-bengio-train-with-grads-py-L8" class="blob-num js-line-number js-blob-rnum" data-line-number="8"></td>
          <td id="file-bengio-train-with-grads-py-LC8" class="blob-code blob-code-inner js-file-line">
</td>
        </tr>
        <tr>
          <td id="file-bengio-train-with-grads-py-L9" class="blob-num js-line-number js-blob-rnum" data-line-number="9"></td>
          <td id="file-bengio-train-with-grads-py-LC9" class="blob-code blob-code-inner js-file-line">    <span class="pl-s1">legends</span> <span class="pl-c1">=</span> []</td>
        </tr>
        <tr>
          <td id="file-bengio-train-with-grads-py-L10" class="blob-num js-line-number js-blob-rnum" data-line-number="10"></td>
          <td id="file-bengio-train-with-grads-py-LC10" class="blob-code blob-code-inner js-file-line">
</td>
        </tr>
        <tr>
          <td id="file-bengio-train-with-grads-py-L11" class="blob-num js-line-number js-blob-rnum" data-line-number="11"></td>
          <td id="file-bengio-train-with-grads-py-LC11" class="blob-code blob-code-inner js-file-line">    <span class="pl-k">assert</span> <span class="pl-en">len</span>(<span class="pl-s1">trn_loss</span>) <span class="pl-c1">%</span> <span class="pl-c1">1000</span> <span class="pl-c1">==</span> <span class="pl-c1">0</span></td>
        </tr>
        <tr>
          <td id="file-bengio-train-with-grads-py-L12" class="blob-num js-line-number js-blob-rnum" data-line-number="12"></td>
          <td id="file-bengio-train-with-grads-py-LC12" class="blob-code blob-code-inner js-file-line">    <span class="pl-s1">plt</span>.<span class="pl-c1">plot</span>(<span class="pl-s1">torch</span>.<span class="pl-c1">tensor</span>(<span class="pl-s1">trn_loss</span>).<span class="pl-c1">view</span>(<span class="pl-c1">-</span><span class="pl-c1">1</span>, <span class="pl-c1">1000</span>).<span class="pl-c1">mean</span>(<span class="pl-s1">dim</span><span class="pl-c1">=</span><span class="pl-c1">1</span>))</td>
        </tr>
        <tr>
          <td id="file-bengio-train-with-grads-py-L13" class="blob-num js-line-number js-blob-rnum" data-line-number="13"></td>
          <td id="file-bengio-train-with-grads-py-LC13" class="blob-code blob-code-inner js-file-line">    <span class="pl-s1">legends</span>.<span class="pl-c1">append</span>(<span class="pl-s">"train loss"</span>)</td>
        </tr>
        <tr>
          <td id="file-bengio-train-with-grads-py-L14" class="blob-num js-line-number js-blob-rnum" data-line-number="14"></td>
          <td id="file-bengio-train-with-grads-py-LC14" class="blob-code blob-code-inner js-file-line">
</td>
        </tr>
        <tr>
          <td id="file-bengio-train-with-grads-py-L15" class="blob-num js-line-number js-blob-rnum" data-line-number="15"></td>
          <td id="file-bengio-train-with-grads-py-LC15" class="blob-code blob-code-inner js-file-line">    <span class="pl-k">if</span> <span class="pl-s1">val_loss</span>:</td>
        </tr>
        <tr>
          <td id="file-bengio-train-with-grads-py-L16" class="blob-num js-line-number js-blob-rnum" data-line-number="16"></td>
          <td id="file-bengio-train-with-grads-py-LC16" class="blob-code blob-code-inner js-file-line">        <span class="pl-s1">plt</span>.<span class="pl-c1">plot</span>(<span class="pl-s1">torch</span>.<span class="pl-c1">tensor</span>(<span class="pl-s1">val_loss</span>).<span class="pl-c1">view</span>(<span class="pl-c1">-</span><span class="pl-c1">1</span>, <span class="pl-c1">1000</span>).<span class="pl-c1">mean</span>(<span class="pl-s1">dim</span><span class="pl-c1">=</span><span class="pl-c1">1</span>))</td>
        </tr>
        <tr>
          <td id="file-bengio-train-with-grads-py-L17" class="blob-num js-line-number js-blob-rnum" data-line-number="17"></td>
          <td id="file-bengio-train-with-grads-py-LC17" class="blob-code blob-code-inner js-file-line">        <span class="pl-s1">legends</span>.<span class="pl-c1">append</span>(<span class="pl-s">"val loss"</span>)</td>
        </tr>
        <tr>
          <td id="file-bengio-train-with-grads-py-L18" class="blob-num js-line-number js-blob-rnum" data-line-number="18"></td>
          <td id="file-bengio-train-with-grads-py-LC18" class="blob-code blob-code-inner js-file-line">
</td>
        </tr>
        <tr>
          <td id="file-bengio-train-with-grads-py-L19" class="blob-num js-line-number js-blob-rnum" data-line-number="19"></td>
          <td id="file-bengio-train-with-grads-py-LC19" class="blob-code blob-code-inner js-file-line">    <span class="pl-s1">plt</span>.<span class="pl-c1">legend</span>(<span class="pl-s1">legends</span>)</td>
        </tr>
        <tr>
          <td id="file-bengio-train-with-grads-py-L20" class="blob-num js-line-number js-blob-rnum" data-line-number="20"></td>
          <td id="file-bengio-train-with-grads-py-LC20" class="blob-code blob-code-inner js-file-line">    <span class="pl-s1">plt</span>.<span class="pl-c1">show</span>()</td>
        </tr>
        <tr>
          <td id="file-bengio-train-with-grads-py-L21" class="blob-num js-line-number js-blob-rnum" data-line-number="21"></td>
          <td id="file-bengio-train-with-grads-py-LC21" class="blob-code blob-code-inner js-file-line">
</td>
        </tr>
        <tr>
          <td id="file-bengio-train-with-grads-py-L22" class="blob-num js-line-number js-blob-rnum" data-line-number="22"></td>
          <td id="file-bengio-train-with-grads-py-LC22" class="blob-code blob-code-inner js-file-line"><span class="pl-c"># Training loop</span></td>
        </tr>
        <tr>
          <td id="file-bengio-train-with-grads-py-L23" class="blob-num js-line-number js-blob-rnum" data-line-number="23"></td>
          <td id="file-bengio-train-with-grads-py-LC23" class="blob-code blob-code-inner js-file-line"><span class="pl-s1">trn_loss</span> <span class="pl-c1">=</span> []</td>
        </tr>
        <tr>
          <td id="file-bengio-train-with-grads-py-L24" class="blob-num js-line-number js-blob-rnum" data-line-number="24"></td>
          <td id="file-bengio-train-with-grads-py-LC24" class="blob-code blob-code-inner js-file-line"><span class="pl-s1">val_loss</span> <span class="pl-c1">=</span> []</td>
        </tr>
        <tr>
          <td id="file-bengio-train-with-grads-py-L25" class="blob-num js-line-number js-blob-rnum" data-line-number="25"></td>
          <td id="file-bengio-train-with-grads-py-LC25" class="blob-code blob-code-inner js-file-line">
</td>
        </tr>
        <tr>
          <td id="file-bengio-train-with-grads-py-L26" class="blob-num js-line-number js-blob-rnum" data-line-number="26"></td>
          <td id="file-bengio-train-with-grads-py-LC26" class="blob-code blob-code-inner js-file-line"><span class="pl-k">def</span> <span class="pl-en">train</span>(<span class="pl-s1">device</span>, <span class="pl-s1">max_step</span>, <span class="pl-v">X_trn</span>, <span class="pl-v">Y_trn</span>, <span class="pl-v">X_val</span>, <span class="pl-v">Y_val</span>, <span class="pl-s1">batch_size</span>, <span class="pl-s1">g</span>, <span class="pl-s1">model</span>, <span class="pl-s1">params</span>, <span class="pl-s1">lr</span>, <span class="pl-s1">trn_loss</span>, <span class="pl-s1">val_loss</span>, <span class="pl-s1">with_grad</span><span class="pl-c1">=</span><span class="pl-c1">False</span>):</td>
        </tr>
        <tr>
          <td id="file-bengio-train-with-grads-py-L27" class="blob-num js-line-number js-blob-rnum" data-line-number="27"></td>
          <td id="file-bengio-train-with-grads-py-LC27" class="blob-code blob-code-inner js-file-line">    <span class="pl-s1">grads</span> <span class="pl-c1">=</span> {}</td>
        </tr>
        <tr>
          <td id="file-bengio-train-with-grads-py-L28" class="blob-num js-line-number js-blob-rnum" data-line-number="28"></td>
          <td id="file-bengio-train-with-grads-py-LC28" class="blob-code blob-code-inner js-file-line">
</td>
        </tr>
        <tr>
          <td id="file-bengio-train-with-grads-py-L29" class="blob-num js-line-number js-blob-rnum" data-line-number="29"></td>
          <td id="file-bengio-train-with-grads-py-LC29" class="blob-code blob-code-inner js-file-line">    <span class="pl-k">for</span> <span class="pl-s1">i</span> <span class="pl-c1">in</span> <span class="pl-en">range</span>(<span class="pl-s1">max_step</span>):</td>
        </tr>
        <tr>
          <td id="file-bengio-train-with-grads-py-L30" class="blob-num js-line-number js-blob-rnum" data-line-number="30"></td>
          <td id="file-bengio-train-with-grads-py-LC30" class="blob-code blob-code-inner js-file-line">        <span class="pl-s1">ix</span> <span class="pl-c1">=</span> <span class="pl-s1">torch</span>.<span class="pl-c1">randint</span>(<span class="pl-c1">0</span>, <span class="pl-v">X_trn</span>.<span class="pl-c1">shape</span>[<span class="pl-c1">0</span>], (<span class="pl-s1">batch_size</span>,), <span class="pl-s1">generator</span><span class="pl-c1">=</span><span class="pl-s1">g</span>, <span class="pl-s1">device</span><span class="pl-c1">=</span><span class="pl-s1">device</span>)</td>
        </tr>
        <tr>
          <td id="file-bengio-train-with-grads-py-L31" class="blob-num js-line-number js-blob-rnum" data-line-number="31"></td>
          <td id="file-bengio-train-with-grads-py-LC31" class="blob-code blob-code-inner js-file-line">        <span class="pl-s1">x</span> <span class="pl-c1">=</span> <span class="pl-v">X_trn</span>[<span class="pl-s1">ix</span>]</td>
        </tr>
        <tr>
          <td id="file-bengio-train-with-grads-py-L32" class="blob-num js-line-number js-blob-rnum" data-line-number="32"></td>
          <td id="file-bengio-train-with-grads-py-LC32" class="blob-code blob-code-inner js-file-line">
</td>
        </tr>
        <tr>
          <td id="file-bengio-train-with-grads-py-L33" class="blob-num js-line-number js-blob-rnum" data-line-number="33"></td>
          <td id="file-bengio-train-with-grads-py-LC33" class="blob-code blob-code-inner js-file-line">        <span class="pl-c"># Forward pass</span></td>
        </tr>
        <tr>
          <td id="file-bengio-train-with-grads-py-L34" class="blob-num js-line-number js-blob-rnum" data-line-number="34"></td>
          <td id="file-bengio-train-with-grads-py-LC34" class="blob-code blob-code-inner js-file-line">        <span class="pl-k">for</span> <span class="pl-s1">layer</span> <span class="pl-c1">in</span> <span class="pl-s1">model</span>:</td>
        </tr>
        <tr>
          <td id="file-bengio-train-with-grads-py-L35" class="blob-num js-line-number js-blob-rnum" data-line-number="35"></td>
          <td id="file-bengio-train-with-grads-py-LC35" class="blob-code blob-code-inner js-file-line">            <span class="pl-s1">x</span> <span class="pl-c1">=</span> <span class="pl-en">layer</span>(<span class="pl-s1">x</span>)</td>
        </tr>
        <tr>
          <td id="file-bengio-train-with-grads-py-L36" class="blob-num js-line-number js-blob-rnum" data-line-number="36"></td>
          <td id="file-bengio-train-with-grads-py-LC36" class="blob-code blob-code-inner js-file-line">
</td>
        </tr>
        <tr>
          <td id="file-bengio-train-with-grads-py-L37" class="blob-num js-line-number js-blob-rnum" data-line-number="37"></td>
          <td id="file-bengio-train-with-grads-py-LC37" class="blob-code blob-code-inner js-file-line">        <span class="pl-c"># Compute loss</span></td>
        </tr>
        <tr>
          <td id="file-bengio-train-with-grads-py-L38" class="blob-num js-line-number js-blob-rnum" data-line-number="38"></td>
          <td id="file-bengio-train-with-grads-py-LC38" class="blob-code blob-code-inner js-file-line">        <span class="pl-s1">loss</span> <span class="pl-c1">=</span> <span class="pl-c1">F</span>.<span class="pl-c1">cross_entropy</span>(<span class="pl-s1">x</span>, <span class="pl-v">Y_trn</span>[<span class="pl-s1">ix</span>])</td>
        </tr>
        <tr>
          <td id="file-bengio-train-with-grads-py-L39" class="blob-num js-line-number js-blob-rnum" data-line-number="39"></td>
          <td id="file-bengio-train-with-grads-py-LC39" class="blob-code blob-code-inner js-file-line">        <span class="pl-s1">trn_loss</span>.<span class="pl-c1">append</span>(<span class="pl-s1">loss</span>.<span class="pl-c1">item</span>())</td>
        </tr>
        <tr>
          <td id="file-bengio-train-with-grads-py-L40" class="blob-num js-line-number js-blob-rnum" data-line-number="40"></td>
          <td id="file-bengio-train-with-grads-py-LC40" class="blob-code blob-code-inner js-file-line">
</td>
        </tr>
        <tr>
          <td id="file-bengio-train-with-grads-py-L41" class="blob-num js-line-number js-blob-rnum" data-line-number="41"></td>
          <td id="file-bengio-train-with-grads-py-LC41" class="blob-code blob-code-inner js-file-line">        <span class="pl-c"># Retain all gradients for visualization</span></td>
        </tr>
        <tr>
          <td id="file-bengio-train-with-grads-py-L42" class="blob-num js-line-number js-blob-rnum" data-line-number="42"></td>
          <td id="file-bengio-train-with-grads-py-LC42" class="blob-code blob-code-inner js-file-line">        <span class="pl-k">if</span> <span class="pl-s1">with_grad</span>:</td>
        </tr>
        <tr>
          <td id="file-bengio-train-with-grads-py-L43" class="blob-num js-line-number js-blob-rnum" data-line-number="43"></td>
          <td id="file-bengio-train-with-grads-py-LC43" class="blob-code blob-code-inner js-file-line">            <span class="pl-k">for</span> <span class="pl-s1">layer</span> <span class="pl-c1">in</span> <span class="pl-s1">model</span>:</td>
        </tr>
        <tr>
          <td id="file-bengio-train-with-grads-py-L44" class="blob-num js-line-number js-blob-rnum" data-line-number="44"></td>
          <td id="file-bengio-train-with-grads-py-LC44" class="blob-code blob-code-inner js-file-line">                <span class="pl-s1">layer</span>.<span class="pl-c1">out</span>.<span class="pl-c1">retain_grad</span>()</td>
        </tr>
        <tr>
          <td id="file-bengio-train-with-grads-py-L45" class="blob-num js-line-number js-blob-rnum" data-line-number="45"></td>
          <td id="file-bengio-train-with-grads-py-LC45" class="blob-code blob-code-inner js-file-line">
</td>
        </tr>
        <tr>
          <td id="file-bengio-train-with-grads-py-L46" class="blob-num js-line-number js-blob-rnum" data-line-number="46"></td>
          <td id="file-bengio-train-with-grads-py-LC46" class="blob-code blob-code-inner js-file-line">        <span class="pl-c"># Zero gradients to prevent accumulation</span></td>
        </tr>
        <tr>
          <td id="file-bengio-train-with-grads-py-L47" class="blob-num js-line-number js-blob-rnum" data-line-number="47"></td>
          <td id="file-bengio-train-with-grads-py-LC47" class="blob-code blob-code-inner js-file-line">        <span class="pl-k">for</span> <span class="pl-s1">p</span> <span class="pl-c1">in</span> <span class="pl-s1">params</span>:</td>
        </tr>
        <tr>
          <td id="file-bengio-train-with-grads-py-L48" class="blob-num js-line-number js-blob-rnum" data-line-number="48"></td>
          <td id="file-bengio-train-with-grads-py-LC48" class="blob-code blob-code-inner js-file-line">            <span class="pl-s1">p</span>.<span class="pl-c1">grad</span> <span class="pl-c1">=</span> <span class="pl-c1">None</span></td>
        </tr>
        <tr>
          <td id="file-bengio-train-with-grads-py-L49" class="blob-num js-line-number js-blob-rnum" data-line-number="49"></td>
          <td id="file-bengio-train-with-grads-py-LC49" class="blob-code blob-code-inner js-file-line">
</td>
        </tr>
        <tr>
          <td id="file-bengio-train-with-grads-py-L50" class="blob-num js-line-number js-blob-rnum" data-line-number="50"></td>
          <td id="file-bengio-train-with-grads-py-LC50" class="blob-code blob-code-inner js-file-line">        <span class="pl-c"># Backpropagation</span></td>
        </tr>
        <tr>
          <td id="file-bengio-train-with-grads-py-L51" class="blob-num js-line-number js-blob-rnum" data-line-number="51"></td>
          <td id="file-bengio-train-with-grads-py-LC51" class="blob-code blob-code-inner js-file-line">        <span class="pl-s1">loss</span>.<span class="pl-c1">backward</span>()</td>
        </tr>
        <tr>
          <td id="file-bengio-train-with-grads-py-L52" class="blob-num js-line-number js-blob-rnum" data-line-number="52"></td>
          <td id="file-bengio-train-with-grads-py-LC52" class="blob-code blob-code-inner js-file-line">
</td>
        </tr>
        <tr>
          <td id="file-bengio-train-with-grads-py-L53" class="blob-num js-line-number js-blob-rnum" data-line-number="53"></td>
          <td id="file-bengio-train-with-grads-py-LC53" class="blob-code blob-code-inner js-file-line">        <span class="pl-k">if</span> <span class="pl-s1">i</span> <span class="pl-c1">&gt;</span> <span class="pl-c1">80000</span>:</td>
        </tr>
        <tr>
          <td id="file-bengio-train-with-grads-py-L54" class="blob-num js-line-number js-blob-rnum" data-line-number="54"></td>
          <td id="file-bengio-train-with-grads-py-LC54" class="blob-code blob-code-inner js-file-line">            <span class="pl-s1">lr</span> <span class="pl-c1">=</span> <span class="pl-c1">1e-4</span></td>
        </tr>
        <tr>
          <td id="file-bengio-train-with-grads-py-L55" class="blob-num js-line-number js-blob-rnum" data-line-number="55"></td>
          <td id="file-bengio-train-with-grads-py-LC55" class="blob-code blob-code-inner js-file-line">
</td>
        </tr>
        <tr>
          <td id="file-bengio-train-with-grads-py-L56" class="blob-num js-line-number js-blob-rnum" data-line-number="56"></td>
          <td id="file-bengio-train-with-grads-py-LC56" class="blob-code blob-code-inner js-file-line">        <span class="pl-c"># Update params</span></td>
        </tr>
        <tr>
          <td id="file-bengio-train-with-grads-py-L57" class="blob-num js-line-number js-blob-rnum" data-line-number="57"></td>
          <td id="file-bengio-train-with-grads-py-LC57" class="blob-code blob-code-inner js-file-line">        <span class="pl-k">for</span> <span class="pl-s1">p</span> <span class="pl-c1">in</span> <span class="pl-s1">params</span>:</td>
        </tr>
        <tr>
          <td id="file-bengio-train-with-grads-py-L58" class="blob-num js-line-number js-blob-rnum" data-line-number="58"></td>
          <td id="file-bengio-train-with-grads-py-LC58" class="blob-code blob-code-inner js-file-line">            <span class="pl-s1">p</span>.<span class="pl-c1">data</span> <span class="pl-c1">+=</span> <span class="pl-c1">-</span><span class="pl-s1">lr</span> <span class="pl-c1">*</span> <span class="pl-s1">p</span>.<span class="pl-c1">grad</span></td>
        </tr>
        <tr>
          <td id="file-bengio-train-with-grads-py-L59" class="blob-num js-line-number js-blob-rnum" data-line-number="59"></td>
          <td id="file-bengio-train-with-grads-py-LC59" class="blob-code blob-code-inner js-file-line">
</td>
        </tr>
        <tr>
          <td id="file-bengio-train-with-grads-py-L60" class="blob-num js-line-number js-blob-rnum" data-line-number="60"></td>
          <td id="file-bengio-train-with-grads-py-LC60" class="blob-code blob-code-inner js-file-line">        <span class="pl-c"># Copy gradients for visualizations</span></td>
        </tr>
        <tr>
          <td id="file-bengio-train-with-grads-py-L61" class="blob-num js-line-number js-blob-rnum" data-line-number="61"></td>
          <td id="file-bengio-train-with-grads-py-LC61" class="blob-code blob-code-inner js-file-line">        <span class="pl-k">if</span> <span class="pl-s1">with_grad</span> <span class="pl-c1">and</span> <span class="pl-s1">i</span> <span class="pl-c1">%</span> <span class="pl-c1">20000</span> <span class="pl-c1">==</span> <span class="pl-c1">0</span>:</td>
        </tr>
        <tr>
          <td id="file-bengio-train-with-grads-py-L62" class="blob-num js-line-number js-blob-rnum" data-line-number="62"></td>
          <td id="file-bengio-train-with-grads-py-LC62" class="blob-code blob-code-inner js-file-line">            <span class="pl-k">for</span> <span class="pl-s1">j</span>, <span class="pl-s1">layer</span> <span class="pl-c1">in</span> <span class="pl-en">enumerate</span>(<span class="pl-s1">model</span>):</td>
        </tr>
        <tr>
          <td id="file-bengio-train-with-grads-py-L63" class="blob-num js-line-number js-blob-rnum" data-line-number="63"></td>
          <td id="file-bengio-train-with-grads-py-LC63" class="blob-code blob-code-inner js-file-line">                <span class="pl-k">if</span> <span class="pl-s1">j</span> <span class="pl-c1"><span class="pl-c1">not</span> <span class="pl-c1">in</span></span> <span class="pl-s1">grads</span>:</td>
        </tr>
        <tr>
          <td id="file-bengio-train-with-grads-py-L64" class="blob-num js-line-number js-blob-rnum" data-line-number="64"></td>
          <td id="file-bengio-train-with-grads-py-LC64" class="blob-code blob-code-inner js-file-line">                    <span class="pl-s1">grads</span>[<span class="pl-s1">j</span>] <span class="pl-c1">=</span> {}</td>
        </tr>
        <tr>
          <td id="file-bengio-train-with-grads-py-L65" class="blob-num js-line-number js-blob-rnum" data-line-number="65"></td>
          <td id="file-bengio-train-with-grads-py-LC65" class="blob-code blob-code-inner js-file-line">                <span class="pl-s1">grads</span>[<span class="pl-s1">j</span>][<span class="pl-s1">i</span>] <span class="pl-c1">=</span> <span class="pl-s1">layer</span>.<span class="pl-c1">out</span>.<span class="pl-c1">grad</span>.<span class="pl-c1">cpu</span>().<span class="pl-c1">tolist</span>()</td>
        </tr>
        <tr>
          <td id="file-bengio-train-with-grads-py-L66" class="blob-num js-line-number js-blob-rnum" data-line-number="66"></td>
          <td id="file-bengio-train-with-grads-py-LC66" class="blob-code blob-code-inner js-file-line">
</td>
        </tr>
        <tr>
          <td id="file-bengio-train-with-grads-py-L67" class="blob-num js-line-number js-blob-rnum" data-line-number="67"></td>
          <td id="file-bengio-train-with-grads-py-LC67" class="blob-code blob-code-inner js-file-line">        <span class="pl-c"># Validation</span></td>
        </tr>
        <tr>
          <td id="file-bengio-train-with-grads-py-L68" class="blob-num js-line-number js-blob-rnum" data-line-number="68"></td>
          <td id="file-bengio-train-with-grads-py-LC68" class="blob-code blob-code-inner js-file-line">        <span class="pl-k">with</span> <span class="pl-s1">torch</span>.<span class="pl-c1">no_grad</span>():</td>
        </tr>
        <tr>
          <td id="file-bengio-train-with-grads-py-L69" class="blob-num js-line-number js-blob-rnum" data-line-number="69"></td>
          <td id="file-bengio-train-with-grads-py-LC69" class="blob-code blob-code-inner js-file-line">            <span class="pl-s1">ix</span> <span class="pl-c1">=</span> <span class="pl-s1">torch</span>.<span class="pl-c1">randint</span>(<span class="pl-c1">0</span>, <span class="pl-v">X_val</span>.<span class="pl-c1">shape</span>[<span class="pl-c1">0</span>], (<span class="pl-s1">batch_size</span>,), <span class="pl-s1">generator</span><span class="pl-c1">=</span><span class="pl-s1">g</span>, <span class="pl-s1">device</span><span class="pl-c1">=</span><span class="pl-s1">device</span>)</td>
        </tr>
        <tr>
          <td id="file-bengio-train-with-grads-py-L70" class="blob-num js-line-number js-blob-rnum" data-line-number="70"></td>
          <td id="file-bengio-train-with-grads-py-LC70" class="blob-code blob-code-inner js-file-line">            <span class="pl-s1">x</span> <span class="pl-c1">=</span> <span class="pl-v">X_val</span>[<span class="pl-s1">ix</span>]</td>
        </tr>
        <tr>
          <td id="file-bengio-train-with-grads-py-L71" class="blob-num js-line-number js-blob-rnum" data-line-number="71"></td>
          <td id="file-bengio-train-with-grads-py-LC71" class="blob-code blob-code-inner js-file-line">
</td>
        </tr>
        <tr>
          <td id="file-bengio-train-with-grads-py-L72" class="blob-num js-line-number js-blob-rnum" data-line-number="72"></td>
          <td id="file-bengio-train-with-grads-py-LC72" class="blob-code blob-code-inner js-file-line">            <span class="pl-k">for</span> <span class="pl-s1">layer</span> <span class="pl-c1">in</span> <span class="pl-s1">model</span>:</td>
        </tr>
        <tr>
          <td id="file-bengio-train-with-grads-py-L73" class="blob-num js-line-number js-blob-rnum" data-line-number="73"></td>
          <td id="file-bengio-train-with-grads-py-LC73" class="blob-code blob-code-inner js-file-line">                <span class="pl-s1">x</span> <span class="pl-c1">=</span> <span class="pl-en">layer</span>(<span class="pl-s1">x</span>)</td>
        </tr>
        <tr>
          <td id="file-bengio-train-with-grads-py-L74" class="blob-num js-line-number js-blob-rnum" data-line-number="74"></td>
          <td id="file-bengio-train-with-grads-py-LC74" class="blob-code blob-code-inner js-file-line">
</td>
        </tr>
        <tr>
          <td id="file-bengio-train-with-grads-py-L75" class="blob-num js-line-number js-blob-rnum" data-line-number="75"></td>
          <td id="file-bengio-train-with-grads-py-LC75" class="blob-code blob-code-inner js-file-line">            <span class="pl-s1">loss</span> <span class="pl-c1">=</span> <span class="pl-c1">F</span>.<span class="pl-c1">cross_entropy</span>(<span class="pl-s1">x</span>, <span class="pl-v">Y_val</span>[<span class="pl-s1">ix</span>])</td>
        </tr>
        <tr>
          <td id="file-bengio-train-with-grads-py-L76" class="blob-num js-line-number js-blob-rnum" data-line-number="76"></td>
          <td id="file-bengio-train-with-grads-py-LC76" class="blob-code blob-code-inner js-file-line">            <span class="pl-s1">val_loss</span>.<span class="pl-c1">append</span>(<span class="pl-s1">loss</span>.<span class="pl-c1">item</span>())</td>
        </tr>
        <tr>
          <td id="file-bengio-train-with-grads-py-L77" class="blob-num js-line-number js-blob-rnum" data-line-number="77"></td>
          <td id="file-bengio-train-with-grads-py-LC77" class="blob-code blob-code-inner js-file-line">
</td>
        </tr>
        <tr>
          <td id="file-bengio-train-with-grads-py-L78" class="blob-num js-line-number js-blob-rnum" data-line-number="78"></td>
          <td id="file-bengio-train-with-grads-py-LC78" class="blob-code blob-code-inner js-file-line">        <span class="pl-k">if</span> <span class="pl-s1">i</span> <span class="pl-c1">%</span> <span class="pl-c1">10000</span> <span class="pl-c1">==</span> <span class="pl-c1">0</span>:</td>
        </tr>
        <tr>
          <td id="file-bengio-train-with-grads-py-L79" class="blob-num js-line-number js-blob-rnum" data-line-number="79"></td>
          <td id="file-bengio-train-with-grads-py-LC79" class="blob-code blob-code-inner js-file-line">            <span class="pl-en">print</span>(<span class="pl-s">f"step <span class="pl-s1"><span class="pl-kos">{</span><span class="pl-s1">i</span>:7d<span class="pl-kos">}</span></span> | train loss <span class="pl-s1"><span class="pl-kos">{</span><span class="pl-s1">trn_loss</span>[<span class="pl-c1">-</span><span class="pl-c1">1</span>]:.4f<span class="pl-kos">}</span></span> | val loss <span class="pl-s1"><span class="pl-kos">{</span><span class="pl-s1">val_loss</span>[<span class="pl-c1">-</span><span class="pl-c1">1</span>]:.4f<span class="pl-kos">}</span></span>"</span>)</td>
        </tr>
        <tr>
          <td id="file-bengio-train-with-grads-py-L80" class="blob-num js-line-number js-blob-rnum" data-line-number="80"></td>
          <td id="file-bengio-train-with-grads-py-LC80" class="blob-code blob-code-inner js-file-line">
</td>
        </tr>
        <tr>
          <td id="file-bengio-train-with-grads-py-L81" class="blob-num js-line-number js-blob-rnum" data-line-number="81"></td>
          <td id="file-bengio-train-with-grads-py-LC81" class="blob-code blob-code-inner js-file-line">    <span class="pl-k">return</span> <span class="pl-s1">trn_loss</span>, <span class="pl-s1">val_loss</span>, <span class="pl-s1">grads</span></td>
        </tr>
  </tbody></table>
</div>


    </div>

  </div>
</div>

      </div>
      <div class="gist-meta">
        <a href="https://gist.github.com/cjams/b9c6a1f614f2986d462eddfdfc457d50/raw/c5d637dbfc30ffe03211cae4ea506752577c8a38/bengio-train-with-grads.py" style="float:right" class="Link--inTextBlock">view raw</a>
        <a href="https://gist.github.com/cjams/b9c6a1f614f2986d462eddfdfc457d50#file-bengio-train-with-grads-py" class="Link--inTextBlock">
          bengio-train-with-grads.py
        </a>
        hosted with &#10084; by <a class="Link--inTextBlock" href="https://github.com">GitHub</a>
      </div>
    </div>
</div>
</div><p>Running this for 100k training steps brings the final validation loss down from 6.4 to 4.4. Here are the loss curves:</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!_LZe!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F00c1a318-4dca-4867-8289-a9ab3e8a137d_817x533.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!_LZe!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F00c1a318-4dca-4867-8289-a9ab3e8a137d_817x533.png 424w, https://substackcdn.com/image/fetch/$s_!_LZe!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F00c1a318-4dca-4867-8289-a9ab3e8a137d_817x533.png 848w, https://substackcdn.com/image/fetch/$s_!_LZe!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F00c1a318-4dca-4867-8289-a9ab3e8a137d_817x533.png 1272w, https://substackcdn.com/image/fetch/$s_!_LZe!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F00c1a318-4dca-4867-8289-a9ab3e8a137d_817x533.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!_LZe!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F00c1a318-4dca-4867-8289-a9ab3e8a137d_817x533.png" width="817" height="533" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/00c1a318-4dca-4867-8289-a9ab3e8a137d_817x533.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:533,&quot;width&quot;:817,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:32804,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://connorjdavis.substack.com/i/183195738?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F00c1a318-4dca-4867-8289-a9ab3e8a137d_817x533.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!_LZe!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F00c1a318-4dca-4867-8289-a9ab3e8a137d_817x533.png 424w, https://substackcdn.com/image/fetch/$s_!_LZe!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F00c1a318-4dca-4867-8289-a9ab3e8a137d_817x533.png 848w, https://substackcdn.com/image/fetch/$s_!_LZe!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F00c1a318-4dca-4867-8289-a9ab3e8a137d_817x533.png 1272w, https://substackcdn.com/image/fetch/$s_!_LZe!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F00c1a318-4dca-4867-8289-a9ab3e8a137d_817x533.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" 
viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>It also lowered the perplexity from 424 to 22. 
Now let&#8217;s check a sampled story:</p><blockquote><p>Story time: &#8222;h3I\v&#352;&#226;&#170;&#226;&#164;&#353;&#176;FA7G<br>'&#189;ndS4 m tt  .elhIeidoioms e  oaiydsseJgd<br>tceo,ty,<br>wSns  itt h hs  ud  rli eh rndg ee l  S ,yed dp i ra cam d"ae.Sodtro,eo ehoh h ahd lstbr tuv t,yeo t ho"<br>hlbLsth  usrnwymdecr  oi erswiii Tdw.we onhaourdh wtko temn epea  shoaws ecstSapniade   rg a   ,ehaf  e  scn ehdg dat  me i'hg osutah yaaaetsds<br>het atw obn h usuyhwuknnlnus  hla  isttctioih oer T r olwa hi  uuryTt m twannaw ee.ahem, rsf  nta snioe  as iep hirigyn!e o ut tSeos dcstagbui"bSn"tyawlud<br>h ueyi rtocg ar emtaae aoni   rl"ag l  fhn Tt b pte mbmah ahwt"aueEiva ai rv" hswr!eA .ekga ote edu, hyw ohoee?T wtt lrthnecmb'au hlend tytiuwdld E a,snaus.tc lddd otnHsbgre"tasdhidrMdtr ebtrhasa esc aa <br>meT mntyldmmpu s" <br>esnty.csye&#226;a itielt nh a tp ol oesn iie  obgfeS.sogskB  cen<br>  hdiaes e hssuo. </p></blockquote><p>As we can see, our dream of Skynet is still in the far distant future; however, we are making progress: the model is starting to learn what appears to be the ASCII character range, word boundaries, and punctuation. Since our validation loss is still close to the training loss, we have room to simply add more parameters before overfitting takes over. We could do this by adding additional layers or increasing the embedding and/or hidden dimensions.</p><p>But before we throw even more capacity at the problem, I want to dig into the flow of various data streams throughout the network. The behaviors of these streams of data, both in the forward and backward direction, are called the <em>training dynamics</em>. In order to train deep networks, we have to ensure that the training dynamics are stable; otherwise, gradient descent will diverge. If gradient descent diverges, then our model fails to learn anything useful. 
The most common culprit of unstable training dynamics is unstable gradients.</p><h2>The Gradient Must Flow</h2><p>Like the spice of Arrakis, the gradients within our network have to flow to each of the parameters in order for them to learn. Recall the update formula:</p><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;p.data = p.data + lr * -p.grad&quot;,&quot;id&quot;:&quot;FLODAMAOIH&quot;}" data-component-name="LatexBlockToDOM"></div><p>If <code>p.grad</code> is zero, the parameter value (<code>p.data</code>) doesn&#8217;t change. Likewise, if <code>p.grad</code> is too large, it can teleport us to a completely different part of the loss landscape without ever settling down into a minimum. So we need to understand and monitor the gradient values for each layer to ensure they don&#8217;t become degenerate.</p><p>For our model, we are using linear layers followed by a tanh activation. 
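</p><p>As a quick illustration of what &#8220;monitoring&#8221; can look like in practice, here is a minimal sketch (my own toy helper, not code from the notebook; the name <code>grad_stats</code> is made up) that logs the mean absolute gradient of every parameter after a backward pass:</p>

```python
import torch
import torch.nn as nn

def grad_stats(model: nn.Module) -> dict:
    """Mean absolute gradient per parameter tensor."""
    return {
        name: p.grad.abs().mean().item()
        for name, p in model.named_parameters()
        if p.grad is not None
    }

# Toy model in the same spirit as ours: linear layers with tanh in between
model = nn.Sequential(nn.Linear(8, 8), nn.Tanh(), nn.Linear(8, 2))
loss = model(torch.randn(4, 8)).pow(2).mean()
loss.backward()
for name, g in grad_stats(model).items():
    print(f"{name}: {g:.2e}")  # values near 0 across the board signal trouble
```

<p>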
To understand the gradients through these layers, we can analyze the graphs of tanh and its derivative:</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!bWrt!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F931841da-5010-471f-ac8b-09b04107b21a_1059x704.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!bWrt!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F931841da-5010-471f-ac8b-09b04107b21a_1059x704.png 424w, https://substackcdn.com/image/fetch/$s_!bWrt!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F931841da-5010-471f-ac8b-09b04107b21a_1059x704.png 848w, https://substackcdn.com/image/fetch/$s_!bWrt!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F931841da-5010-471f-ac8b-09b04107b21a_1059x704.png 1272w, https://substackcdn.com/image/fetch/$s_!bWrt!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F931841da-5010-471f-ac8b-09b04107b21a_1059x704.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!bWrt!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F931841da-5010-471f-ac8b-09b04107b21a_1059x704.png" width="1059" height="704" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/931841da-5010-471f-ac8b-09b04107b21a_1059x704.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:704,&quot;width&quot;:1059,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:90110,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://connorjdavis.substack.com/i/183195738?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F931841da-5010-471f-ac8b-09b04107b21a_1059x704.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!bWrt!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F931841da-5010-471f-ac8b-09b04107b21a_1059x704.png 424w, https://substackcdn.com/image/fetch/$s_!bWrt!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F931841da-5010-471f-ac8b-09b04107b21a_1059x704.png 848w, https://substackcdn.com/image/fetch/$s_!bWrt!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F931841da-5010-471f-ac8b-09b04107b21a_1059x704.png 1272w, https://substackcdn.com/image/fetch/$s_!bWrt!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F931841da-5010-471f-ac8b-09b04107b21a_1059x704.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" 
height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>We can see the derivative (in red) vanishes to 0 as the absolute value of the input increases beyond roughly 3. This region beyond |3| is the <em>saturating region</em> of the tanh activation. What is the input to our tanh? The input is the output of the linear layer <code>y</code> (also called the preactivation):</p><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;\\begin{align*}\ny &amp;= x W + b \\\\\nz &amp;= tanh(y) \\\\\n\\end{align*}&quot;,&quot;id&quot;:&quot;HLIISQZKKS&quot;}" data-component-name="LatexBlockToDOM"></div><p>So as soon as the absolute value of <code>y</code> gets larger than 3, tanh itself becomes saturated, and the gradient on the tanh for that value will be essentially 0. 
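</p><p>We can verify this numerically with a tiny sketch (my own check, not from the post):</p>

```python
import torch

# dz/dy = 1 - tanh(y)^2: watch it collapse as |y| grows past ~3
y = torch.tensor([0.0, 1.0, 3.0, 5.0], requires_grad=True)
torch.tanh(y).sum().backward()
print(y.grad)  # roughly [1.0, 0.42, 0.0099, 0.00018]
```

<p>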
This problem compounds in backpropagation with each additional layer because of the chain rule, which is multiplicative:</p><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;\\begin{align*}\n\\frac{\\partial L}{\\partial W} = \\frac{\\partial L}{\\partial z}\\cdot\\frac{\\partial z}{\\partial y}\\cdot\\frac{\\partial y}{\\partial W}\n\\end{align*}&quot;,&quot;id&quot;:&quot;WDZZNHZTVL&quot;}" data-component-name="LatexBlockToDOM"></div><p>When the gradient of tanh is 0, the middle term <code>dz/dy</code> is 0, causing the entire expression to be 0. This 0 gets propagated down to the child nodes in the computational graph (i.e. layers closer to the beginning of the network), causing learning to stagnate in these earlier layers as well.</p><h3>Visualizing Gradients</h3><p>To gain more intuition for this problem, it helps to <a href="https://colab.research.google.com/drive/1r4sDMqphipA4y4WYRJz2fzyJ8FV9qPOa#scrollTo=n7uwhg3UbRbO&amp;line=1&amp;uniqifier=1">visualize the gradient values</a> of different layers at various steps during training. Below is a histogram of each layer&#8217;s gradient after training for 100k steps. 
You can see from the spikes in the middle that several layers&#8217; gradients have vanished substantially:</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!mq_L!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff1541a23-09c7-4f12-878a-3af541ab0f34_1500x800.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!mq_L!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff1541a23-09c7-4f12-878a-3af541ab0f34_1500x800.png 424w, https://substackcdn.com/image/fetch/$s_!mq_L!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff1541a23-09c7-4f12-878a-3af541ab0f34_1500x800.png 848w, https://substackcdn.com/image/fetch/$s_!mq_L!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff1541a23-09c7-4f12-878a-3af541ab0f34_1500x800.png 1272w, https://substackcdn.com/image/fetch/$s_!mq_L!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff1541a23-09c7-4f12-878a-3af541ab0f34_1500x800.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!mq_L!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff1541a23-09c7-4f12-878a-3af541ab0f34_1500x800.png" width="1456" height="777" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/f1541a23-09c7-4f12-878a-3af541ab0f34_1500x800.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:777,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:51131,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://connorjdavis.substack.com/i/183195738?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff1541a23-09c7-4f12-878a-3af541ab0f34_1500x800.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!mq_L!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff1541a23-09c7-4f12-878a-3af541ab0f34_1500x800.png 424w, https://substackcdn.com/image/fetch/$s_!mq_L!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff1541a23-09c7-4f12-878a-3af541ab0f34_1500x800.png 848w, https://substackcdn.com/image/fetch/$s_!mq_L!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff1541a23-09c7-4f12-878a-3af541ab0f34_1500x800.png 1272w, https://substackcdn.com/image/fetch/$s_!mq_L!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff1541a23-09c7-4f12-878a-3af541ab0f34_1500x800.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" 
height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>We can double-click on these layers to see the evolution of the gradient for each neuron at various training steps. Below is the evolution of the gradient of the first linear layer. As the gradient tends to 0, the color of the corresponding point in the heatmap tends to purple. A vertical column of purple indicates a completely dead neuron - no gradient is flowing through it. The x-axis indicates the neuron and the y-axis is the batch:</p><div class="native-video-embed" data-component-name="VideoPlaceholder" data-attrs="{&quot;mediaUploadId&quot;:&quot;c4c16e8e-695d-48e6-b47c-d8e1a5029de3&quot;,&quot;duration&quot;:null}"></div><p>Below is the evolution of the last linear layer, i.e. the output layer. 
You can see it fares a little better than the first linear layer, yet still only has a few surviving neurons:</p><div class="native-video-embed" data-component-name="VideoPlaceholder" data-attrs="{&quot;mediaUploadId&quot;:&quot;0a199c82-9550-4ad0-b690-616025b382a0&quot;,&quot;duration&quot;:null}"></div><p>So what can we do to ensure better gradient flow? There are many options, more than I can cover in this post. However, they mostly revolve around the same fundamental idea: we need to avoid inputs which lead to saturated non-linearities. In our example, this means avoiding too many inputs whose absolute value is greater than roughly 3, beyond which the derivative of the non-linearity, tanh, is essentially zero. Other non-linearities will have different bounds, but the need to avoid those bounds is the same. Below we will look at a few techniques that are useful for mitigating excessive saturation.</p><h3>Xavier Initialization</h3><p>Glorot and Bengio <a href="https://proceedings.mlr.press/v9/glorot10a/glorot10a.pdf">figured out</a> that for tanh activations<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-1" href="#footnote-1" target="_self">1</a>, keeping the variance of the preactivations close to one for every layer helps to create more stable gradient flow. The reason is that if the preactivations have mean 0 and variance 1, approximately 95% of all values lie within the interval [-2, 2], and thus 95% of the values stay within the non-saturated region of tanh. 
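</p><p>A quick empirical check of that 95% figure (my own sketch):</p>

```python
import torch

torch.manual_seed(0)
y = torch.randn(1_000_000)  # simulated preactivations: mean 0, variance 1
frac = (y.abs() <= 2).float().mean().item()
print(f"{frac:.3f} of preactivations fall in [-2, 2]")  # ~0.954
```

<p>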
</p><p>If we assume the inputs and weights are independent with mean 0, then we can calculate the variance of the jth neuron (assuming zero bias for simplicity):</p><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;\\begin{align*}\ny_j &amp;= \\sum_{i=1}^{n}x_iw_{ij} \\\\\nVar(y_j) &amp;= Var(\\sum_{i=1}^{n}x_iw_{ij}) \\\\\n&amp;= \\sum_{i=1}^{n}Var(x_iw_{ij}) \\\\\n&amp;= \\sum_{i=1}^{n}\\mathbb{E}((x_iw_{ij})^2) - (\\mathbb{E}(x_iw_{ij}))^2 \\\\\n&amp;= \\sum_{i=1}^{n}\\mathbb{E}(x_i^2)\\mathbb{E}(w_{ij}^2) - (\\mathbb{E}(x_i))^2(\\mathbb{E}(w_{ij}))^2 \\\\\n&amp;= \\sum_{i=1}^{n}Var(x_i)Var(w_{ij}) \\\\\n\\end{align*}&quot;,&quot;id&quot;:&quot;DEQTHIKHVF&quot;}" data-component-name="LatexBlockToDOM"></div><p>If we further assume the weights and the inputs have a constant variance <code>Var(W)</code> and <code>Var(x)</code>, respectively, then we can derive an expression for <code>Var(W)</code> which yields a variance of 1 for <code>yj</code>:</p><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;Var(y_j) = nVar(x)Var(W) = 1 \\implies Var(W) = \\frac{1}{ nVar(x)}&quot;,&quot;id&quot;:&quot;HGNLDDOIZM&quot;}" data-component-name="LatexBlockToDOM"></div><p>So for initialization we can choose a W such that the variance of W depends on the fan-in (<code>n</code>) of the layer, as well as the variance of the input. The input could be the actual input, or it could be the activation from a previous layer. This underscores the importance of normalizing the data input to the network - doing so ensures the assumptions of the above calculations are met and that the variance can remain close to one throughout the network. This is essentially what so-called &#8220;Xavier initialization&#8221; does. 
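As a quick numerical sanity check on this derivation (my own standalone sketch, not code from the post), we can draw weights with Var(W) = 1/n for unit-variance inputs and confirm the preactivation variance lands near one:

```python
import random
import statistics

random.seed(0)
n = 256          # fan-in of the layer
samples = 1500   # number of simulated preactivations y_j

# Inputs ~ N(0, 1); weights ~ N(0, 1/n), matching the Xavier
# condition Var(W) = 1 / (n * Var(x)) with Var(x) = 1.
w_std = (1 / n) ** 0.5
ys = [
    sum(random.gauss(0, 1) * random.gauss(0, w_std) for _ in range(n))
    for _ in range(samples)
]

var_y = statistics.pvariance(ys)
print(f"Var(y) ~= {var_y:.3f}")  # should land close to 1
```

With the weight variance scaled by the fan-in, the sum of n products keeps unit variance regardless of how wide the layer is, which is exactly the property the derivation above promises.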
</p><p>After <a href="https://colab.research.google.com/drive/1r4sDMqphipA4y4WYRJz2fzyJ8FV9qPOa#scrollTo=P6x3cX6jcP-K&amp;line=28&amp;uniqifier=1">re-training the model</a> with PyTorch&#8217;s <code>xavier_uniform_ </code>on each linear layer, we get the following loss curve:</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!H-Gg!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F16404673-e28c-44c7-bd15-714eb812ede6_1243x788.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!H-Gg!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F16404673-e28c-44c7-bd15-714eb812ede6_1243x788.png 424w, https://substackcdn.com/image/fetch/$s_!H-Gg!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F16404673-e28c-44c7-bd15-714eb812ede6_1243x788.png 848w, https://substackcdn.com/image/fetch/$s_!H-Gg!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F16404673-e28c-44c7-bd15-714eb812ede6_1243x788.png 1272w, https://substackcdn.com/image/fetch/$s_!H-Gg!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F16404673-e28c-44c7-bd15-714eb812ede6_1243x788.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!H-Gg!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F16404673-e28c-44c7-bd15-714eb812ede6_1243x788.png" width="1243" height="788" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/16404673-e28c-44c7-bd15-714eb812ede6_1243x788.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:788,&quot;width&quot;:1243,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:82049,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://connorjdavis.substack.com/i/183195738?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F16404673-e28c-44c7-bd15-714eb812ede6_1243x788.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!H-Gg!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F16404673-e28c-44c7-bd15-714eb812ede6_1243x788.png 424w, https://substackcdn.com/image/fetch/$s_!H-Gg!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F16404673-e28c-44c7-bd15-714eb812ede6_1243x788.png 848w, https://substackcdn.com/image/fetch/$s_!H-Gg!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F16404673-e28c-44c7-bd15-714eb812ede6_1243x788.png 1272w, https://substackcdn.com/image/fetch/$s_!H-Gg!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F16404673-e28c-44c7-bd15-714eb812ede6_1243x788.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" 
height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>Notice the initial loss is much closer now to the expected baseline of <code>ln(nr_classes) = ln(174) = 5.15.</code> This change also brought the final validation loss down from 4.4 to 1.9 and the perplexity down from 22 to 5.73. Let&#8217;s sample a story:</p><blockquote><p>Story time: 1ap ue was laghoned thain hit wayl Ieind om tearouly ss og otciout the ton ittligher and reined veru&#8224;ne low at wis pois, shay dove. Eher weorej Io hatd ast Iraw. RWy and her hebfat hemstily hecrpeneser with he was and and hewtur to nacked itho<br><br>" brat. He ane sard a dore aflyen sconed. That was in tad, thery are sus is the boun hit'ur wor't beap ala bimtterigit on the bolea ci'rus. Tthe twand.<br><br>We. 
hem, tharent Ting to pave puririgen.<br><br>Aut theor dact gerilbeghty was ho sey herocghtryemeree to hackri" gily fin the dparpmbmas and "aver Iaw iore" haw so loqk ver"eded they dobee tike therthe cablaw hernd tearpund a hays. Th. The ded to shire raschilred boZ theas to chaaned Timmyy whipg saine. They yin tigingt nog her oldong to he cogie took the cent oudines a histo. <br><br>Woge hiund have hake, "athee lilly can sever bucher. Th they jaly Sveol.</p></blockquote><p>Still nonsense! But also much better. We see some actual English words now and both sentence and paragraph structure starting to emerge. All that just from specially chosen initialization values. Now let&#8217;s check the gradients. Here is the overall distribution:</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!F2Xp!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0b036e46-9a4f-4082-9ad1-13aaaff45d34_1500x800.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!F2Xp!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0b036e46-9a4f-4082-9ad1-13aaaff45d34_1500x800.png 424w, https://substackcdn.com/image/fetch/$s_!F2Xp!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0b036e46-9a4f-4082-9ad1-13aaaff45d34_1500x800.png 848w, https://substackcdn.com/image/fetch/$s_!F2Xp!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0b036e46-9a4f-4082-9ad1-13aaaff45d34_1500x800.png 1272w, 
https://substackcdn.com/image/fetch/$s_!F2Xp!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0b036e46-9a4f-4082-9ad1-13aaaff45d34_1500x800.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!F2Xp!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0b036e46-9a4f-4082-9ad1-13aaaff45d34_1500x800.png" width="1456" height="777" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/0b036e46-9a4f-4082-9ad1-13aaaff45d34_1500x800.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:777,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:46671,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://connorjdavis.substack.com/i/183195738?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0b036e46-9a4f-4082-9ad1-13aaaff45d34_1500x800.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!F2Xp!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0b036e46-9a4f-4082-9ad1-13aaaff45d34_1500x800.png 424w, https://substackcdn.com/image/fetch/$s_!F2Xp!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0b036e46-9a4f-4082-9ad1-13aaaff45d34_1500x800.png 848w, 
https://substackcdn.com/image/fetch/$s_!F2Xp!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0b036e46-9a4f-4082-9ad1-13aaaff45d34_1500x800.png 1272w, https://substackcdn.com/image/fetch/$s_!F2Xp!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0b036e46-9a4f-4082-9ad1-13aaaff45d34_1500x800.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>We can see the peak is lower (from around 15000 to roughly 12000), and there appears to be just one layer, the embedding layer, which 
has significantly vanishing gradients. This perhaps isn&#8217;t surprising, because it is the last layer reached during backpropagation, so any near-zero multiplicative effects from its ancestors in the computational graph are magnified there. Here is the evolution of the gradients for layer 2, the first linear layer:</p><div class="native-video-embed" data-component-name="VideoPlaceholder" data-attrs="{&quot;mediaUploadId&quot;:&quot;ba2277a9-d470-4bab-a36a-388eb5b4a670&quot;,&quot;duration&quot;:null}"></div><p>Interestingly, the gradients start out closer to zero in the first step relative to the non-Xavier init model. However, as training continues, the gradients&#8217; magnitudes appear to get slightly larger on average, increasing at each step. If you look at the last frame, it looks like no neurons are completely dead - there are at least a few batches providing non-zero gradients.</p><p>There is a lot more to initialization than was covered here. Xavier init is good for activations that are symmetric around 0, whereas Kaiming init is better suited to asymmetric activations like the ReLU and its relatives.</p><p>One shortcoming of these special initialization techniques is that they are only applied once, at the start of training. This can cause networks, especially deep ones, to have drifting activation distributions over time. This drift can push activations into saturating or exploding regions, causing unstable gradients. One mitigation is to add normalization dynamically at each layer so that the inputs to the subsequent non-linearity stay in the activated region.</p><h3>Layer Normalization</h3><p>Many different normalization techniques exist, and you can apply normalization to different types of data - weights, preactivations, gradients, input data, etc. 
Layer normalization typically applies to the preactivations, i.e., the output y of the linear layer:</p><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;\\begin{align*}\ny &amp;= x W + b \\\\\n\\end{align*}&quot;,&quot;id&quot;:&quot;DGFSVRNFUC&quot;}" data-component-name="LatexBlockToDOM"></div><p><a href="https://arxiv.org/pdf/1607.06450">Layer Normalization</a> applies this normalization across the hidden unit dimension. This means the average and variance are computed across the neurons of each individual example. This is opposed to <a href="https://arxiv.org/abs/1502.03167">Batch Normalization</a>, which normalizes each neuron across the batch dimension:</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!Z7UN!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F71a964b9-f0a1-4aff-9e9b-9946863a18f0_793x392.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!Z7UN!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F71a964b9-f0a1-4aff-9e9b-9946863a18f0_793x392.png 424w, https://substackcdn.com/image/fetch/$s_!Z7UN!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F71a964b9-f0a1-4aff-9e9b-9946863a18f0_793x392.png 848w, https://substackcdn.com/image/fetch/$s_!Z7UN!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F71a964b9-f0a1-4aff-9e9b-9946863a18f0_793x392.png 1272w, 
https://substackcdn.com/image/fetch/$s_!Z7UN!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F71a964b9-f0a1-4aff-9e9b-9946863a18f0_793x392.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!Z7UN!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F71a964b9-f0a1-4aff-9e9b-9946863a18f0_793x392.png" width="793" height="392" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/71a964b9-f0a1-4aff-9e9b-9946863a18f0_793x392.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:392,&quot;width&quot;:793,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:33104,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://connorjdavis.substack.com/i/183195738?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F71a964b9-f0a1-4aff-9e9b-9946863a18f0_793x392.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!Z7UN!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F71a964b9-f0a1-4aff-9e9b-9946863a18f0_793x392.png 424w, https://substackcdn.com/image/fetch/$s_!Z7UN!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F71a964b9-f0a1-4aff-9e9b-9946863a18f0_793x392.png 848w, https://substackcdn.com/image/fetch/$s_!Z7UN!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F71a964b9-f0a1-4aff-9e9b-9946863a18f0_793x392.png 
1272w, https://substackcdn.com/image/fetch/$s_!Z7UN!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F71a964b9-f0a1-4aff-9e9b-9946863a18f0_793x392.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>The advantage of layer norm over batch norm is that batch norm requires keeping a running average and variance throughout training so that it can be used later during inference. The reason is that inference has to gracefully handle a batch size of 1, but the average and variance over one element is at best noisy and at worst meaningless. 
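That bookkeeping can be sketched as follows - a minimal illustration of my own, not the post's code, using plain Python lists rather than tensors - showing how batch norm folds each training batch's statistics into running averages so that inference on a single example can reuse them:

```python
class BatchNorm1dSketch:
    """Minimal batch norm sketch showing why running statistics are needed."""

    def __init__(self, num_features, momentum=0.1, eps=1e-5):
        self.eps = eps
        self.momentum = momentum
        self.training = True
        # Running statistics accumulated during training, used at inference
        self.running_mean = [0.0] * num_features
        self.running_var = [1.0] * num_features

    def __call__(self, batch):
        # batch: list of rows, each row a list of num_features floats
        n = len(batch)
        k = len(batch[0])
        if self.training:
            # Normalize with the current batch's statistics...
            mean = [sum(row[j] for row in batch) / n for j in range(k)]
            var = [sum((row[j] - mean[j]) ** 2 for row in batch) / n
                   for j in range(k)]
            # ...and fold them into the running averages for inference.
            for j in range(k):
                self.running_mean[j] += self.momentum * (mean[j] - self.running_mean[j])
                self.running_var[j] += self.momentum * (var[j] - self.running_var[j])
        else:
            # Inference: a batch of 1 has no meaningful statistics of its
            # own, so reuse the stored running mean/variance instead.
            mean, var = self.running_mean, self.running_var
        return [[(row[j] - mean[j]) / (var[j] + self.eps) ** 0.5
                 for j in range(k)] for row in batch]
```

After training, flipping `self.training` to `False` lets a single example be normalized with the stored statistics rather than its own (degenerate) ones.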
So we have to retain the batch statistics computed during training - where batches are large enough to give reliable estimates - and reuse them during inference. On the other hand, layer norm works with any batch size, including a batch size of 1 during inference, and doesn&#8217;t require any special state between training and inference. The implementation is straightforward:</p><div class="github-gist" data-attrs="{&quot;innerHTML&quot;:&quot;<div id=\&quot;gist144283907\&quot; class=\&quot;gist\&quot;>\n    <div class=\&quot;gist-file\&quot; translate=\&quot;no\&quot; data-color-mode=\&quot;light\&quot; data-light-theme=\&quot;light\&quot;>\n      <div class=\&quot;gist-data\&quot;>\n        <div class=\&quot;js-gist-file-update-container js-task-list-container\&quot;>\n  <div id=\&quot;file-layernorm-py\&quot; class=\&quot;file my-2\&quot;>\n    \n    <div itemprop=\&quot;text\&quot;\n      class=\&quot;Box-body p-0 blob-wrapper data type-python  \&quot;\n      style=\&quot;overflow: auto\&quot; tabindex=\&quot;0\&quot; role=\&quot;region\&quot;\n      aria-label=\&quot;layernorm.py content, created by cjams on 02:37AM today.\&quot;\n    >\n\n        \n<div class=\&quot;js-check-hidden-unicode js-blob-code-container blob-code-content\&quot;>\n\n  <template class=\&quot;js-file-alert-template\&quot;>\n  <div data-view-component=\&quot;true\&quot; class=\&quot;flash flash-warn flash-full d-flex flex-items-center\&quot;>\n  <svg aria-hidden=\&quot;true\&quot; height=\&quot;16\&quot; viewBox=\&quot;0 0 16 16\&quot; version=\&quot;1.1\&quot; width=\&quot;16\&quot; data-view-component=\&quot;true\&quot; class=\&quot;octicon octicon-alert\&quot;>\n    <path d=\&quot;M6.457 1.047c.659-1.234 2.427-1.234 3.086 0l6.082 11.378A1.75 1.75 0 0 1 14.082 15H1.918a1.75 1.75 0 0 1-1.543-2.575Zm1.763.707a.25.25 0 0 0-.44 0L1.698 13.132a.25.25 0 0 0 .22.368h12.164a.25.25 0 0 0 .22-.368Zm.53 3.996v2.5a.75.75 0 0 1-1.5 0v-2.5a.75.75 0 0 1 1.5 0ZM9 11a1 1 0 1 1-2 0 1 1 0 0 1 2 0Z\&quot;></path>\n</svg>\n    <span>\n      
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.\n      <a class=\&quot;Link--inTextBlock\&quot; href=\&quot;https://github.co/hiddenchars\&quot; target=\&quot;_blank\&quot;>Learn more about bidirectional Unicode characters</a>\n    </span>\n\n\n  <div data-view-component=\&quot;true\&quot; class=\&quot;flash-action\&quot;>        <a href=\&quot;{{ revealButtonHref }}\&quot; data-view-component=\&quot;true\&quot; class=\&quot;btn-sm btn\&quot;>    Show hidden characters\n</a>\n</div>\n</div></template>\n<template class=\&quot;js-line-alert-template\&quot;>\n  <span aria-label=\&quot;This line has hidden Unicode characters\&quot; data-view-component=\&quot;true\&quot; class=\&quot;line-alert tooltipped tooltipped-e\&quot;>\n    <svg aria-hidden=\&quot;true\&quot; height=\&quot;16\&quot; viewBox=\&quot;0 0 16 16\&quot; version=\&quot;1.1\&quot; width=\&quot;16\&quot; data-view-component=\&quot;true\&quot; class=\&quot;octicon octicon-alert\&quot;>\n    <path d=\&quot;M6.457 1.047c.659-1.234 2.427-1.234 3.086 0l6.082 11.378A1.75 1.75 0 0 1 14.082 15H1.918a1.75 1.75 0 0 1-1.543-2.575Zm1.763.707a.25.25 0 0 0-.44 0L1.698 13.132a.25.25 0 0 0 .22.368h12.164a.25.25 0 0 0 .22-.368Zm.53 3.996v2.5a.75.75 0 0 1-1.5 0v-2.5a.75.75 0 0 1 1.5 0ZM9 11a1 1 0 1 1-2 0 1 1 0 0 1 2 0Z\&quot;></path>\n</svg>\n</span></template>\n\n  <table data-hpc class=\&quot;highlight tab-size js-file-line-container\&quot; data-tab-size=\&quot;4\&quot; data-paste-markdown-skip data-tagsearch-path=\&quot;layernorm.py\&quot;>\n        <tr>\n          <td id=\&quot;file-layernorm-py-L1\&quot; class=\&quot;blob-num js-line-number js-blob-rnum\&quot; data-line-number=\&quot;1\&quot;></td>\n          <td id=\&quot;file-layernorm-py-LC1\&quot; class=\&quot;blob-code blob-code-inner js-file-line\&quot;><span class=pl-k>class</span> <span 
class=pl-v>LayerNorm</span>():</td>\n        </tr>\n        <tr>\n          <td id=\&quot;file-layernorm-py-L2\&quot; class=\&quot;blob-num js-line-number js-blob-rnum\&quot; data-line-number=\&quot;2\&quot;></td>\n          <td id=\&quot;file-layernorm-py-LC2\&quot; class=\&quot;blob-code blob-code-inner js-file-line\&quot;>    <span class=pl-k>def</span> <span class=pl-en>__init__</span>(<span class=pl-s1>self</span>, <span class=pl-s1>device</span>, <span class=pl-s1>num_features</span>):</td>\n        </tr>\n        <tr>\n          <td id=\&quot;file-layernorm-py-L3\&quot; class=\&quot;blob-num js-line-number js-blob-rnum\&quot; data-line-number=\&quot;3\&quot;></td>\n          <td id=\&quot;file-layernorm-py-LC3\&quot; class=\&quot;blob-code blob-code-inner js-file-line\&quot;>        <span class=pl-s1>self</span>.<span class=pl-c1>out</span> <span class=pl-c1>=</span> <span class=pl-c1>None</span></td>\n        </tr>\n        <tr>\n          <td id=\&quot;file-layernorm-py-L4\&quot; class=\&quot;blob-num js-line-number js-blob-rnum\&quot; data-line-number=\&quot;4\&quot;></td>\n          <td id=\&quot;file-layernorm-py-LC4\&quot; class=\&quot;blob-code blob-code-inner js-file-line\&quot;>        <span class=pl-s1>self</span>.<span class=pl-c1>gamma</span> <span class=pl-c1>=</span> <span class=pl-s1>torch</span>.<span class=pl-c1>ones</span>(<span class=pl-s1>num_features</span>, <span class=pl-s1>device</span><span class=pl-c1>=</span><span class=pl-s1>device</span>)</td>\n        </tr>\n        <tr>\n          <td id=\&quot;file-layernorm-py-L5\&quot; class=\&quot;blob-num js-line-number js-blob-rnum\&quot; data-line-number=\&quot;5\&quot;></td>\n          <td id=\&quot;file-layernorm-py-LC5\&quot; class=\&quot;blob-code blob-code-inner js-file-line\&quot;>        <span class=pl-s1>self</span>.<span class=pl-c1>bias</span> <span class=pl-c1>=</span> <span class=pl-s1>torch</span>.<span class=pl-c1>zeros</span>(<span class=pl-s1>num_features</span>, <span 
class=pl-s1>device</span><span class=pl-c1>=</span><span class=pl-s1>device</span>)</td>\n        </tr>\n        <tr>\n          <td id=\&quot;file-layernorm-py-L6\&quot; class=\&quot;blob-num js-line-number js-blob-rnum\&quot; data-line-number=\&quot;6\&quot;></td>\n          <td id=\&quot;file-layernorm-py-LC6\&quot; class=\&quot;blob-code blob-code-inner js-file-line\&quot;>\n</td>\n        </tr>\n        <tr>\n          <td id=\&quot;file-layernorm-py-L7\&quot; class=\&quot;blob-num js-line-number js-blob-rnum\&quot; data-line-number=\&quot;7\&quot;></td>\n          <td id=\&quot;file-layernorm-py-LC7\&quot; class=\&quot;blob-code blob-code-inner js-file-line\&quot;>    <span class=pl-k>def</span> <span class=pl-en>__call__</span>(<span class=pl-s1>self</span>, <span class=pl-s1>x</span>: <span class=pl-s1>torch</span>.<span class=pl-c1>Tensor</span>):</td>\n        </tr>\n        <tr>\n          <td id=\&quot;file-layernorm-py-L8\&quot; class=\&quot;blob-num js-line-number js-blob-rnum\&quot; data-line-number=\&quot;8\&quot;></td>\n          <td id=\&quot;file-layernorm-py-LC8\&quot; class=\&quot;blob-code blob-code-inner js-file-line\&quot;>        <span class=pl-k>assert</span> <span class=pl-s1>x</span>.<span class=pl-c1>ndim</span> <span class=pl-c1>==</span> <span class=pl-c1>2</span></td>\n        </tr>\n        <tr>\n          <td id=\&quot;file-layernorm-py-L9\&quot; class=\&quot;blob-num js-line-number js-blob-rnum\&quot; data-line-number=\&quot;9\&quot;></td>\n          <td id=\&quot;file-layernorm-py-LC9\&quot; class=\&quot;blob-code blob-code-inner js-file-line\&quot;>\n</td>\n        </tr>\n        <tr>\n          <td id=\&quot;file-layernorm-py-L10\&quot; class=\&quot;blob-num js-line-number js-blob-rnum\&quot; data-line-number=\&quot;10\&quot;></td>\n          <td id=\&quot;file-layernorm-py-LC10\&quot; class=\&quot;blob-code blob-code-inner js-file-line\&quot;>        <span class=pl-c1>H</span> <span class=pl-c1>=</span> <span 
class=pl-s1>x</span>.<span class=pl-c1>shape</span>[<span class=pl-c1>1</span>]</td>\n        </tr>\n        <tr>\n          <td id=\&quot;file-layernorm-py-L11\&quot; class=\&quot;blob-num js-line-number js-blob-rnum\&quot; data-line-number=\&quot;11\&quot;></td>\n          <td id=\&quot;file-layernorm-py-LC11\&quot; class=\&quot;blob-code blob-code-inner js-file-line\&quot;>        <span class=pl-s1>avg</span> <span class=pl-c1>=</span> <span class=pl-s1>x</span>.<span class=pl-c1>mean</span>(<span class=pl-s1>dim</span><span class=pl-c1>=</span><span class=pl-c1>1</span>, <span class=pl-s1>keepdim</span><span class=pl-c1>=</span><span class=pl-c1>True</span>)</td>\n        </tr>\n        <tr>\n          <td id=\&quot;file-layernorm-py-L12\&quot; class=\&quot;blob-num js-line-number js-blob-rnum\&quot; data-line-number=\&quot;12\&quot;></td>\n          <td id=\&quot;file-layernorm-py-LC12\&quot; class=\&quot;blob-code blob-code-inner js-file-line\&quot;>        <span class=pl-s1>std</span> <span class=pl-c1>=</span> <span class=pl-s1>torch</span>.<span class=pl-c1>sqrt</span>(<span class=pl-c1>1</span> <span class=pl-c1>/</span> <span class=pl-c1>H</span> <span class=pl-c1>*</span> ((<span class=pl-s1>x</span> <span class=pl-c1>-</span> <span class=pl-s1>avg</span>)<span class=pl-c1>**</span><span class=pl-c1>2</span>).<span class=pl-c1>sum</span>(<span class=pl-s1>dim</span><span class=pl-c1>=</span><span class=pl-c1>1</span>, <span class=pl-s1>keepdim</span><span class=pl-c1>=</span><span class=pl-c1>True</span>))</td>\n        </tr>\n        <tr>\n          <td id=\&quot;file-layernorm-py-L13\&quot; class=\&quot;blob-num js-line-number js-blob-rnum\&quot; data-line-number=\&quot;13\&quot;></td>\n          <td id=\&quot;file-layernorm-py-LC13\&quot; class=\&quot;blob-code blob-code-inner js-file-line\&quot;>\n</td>\n        </tr>\n        <tr>\n          <td id=\&quot;file-layernorm-py-L14\&quot; class=\&quot;blob-num js-line-number js-blob-rnum\&quot; 
data-line-number=\&quot;14\&quot;></td>\n          <td id=\&quot;file-layernorm-py-LC14\&quot; class=\&quot;blob-code blob-code-inner js-file-line\&quot;>        <span class=pl-s1>self</span>.<span class=pl-c1>out</span> <span class=pl-c1>=</span> (<span class=pl-s1>x</span> <span class=pl-c1>-</span> <span class=pl-s1>avg</span>) <span class=pl-c1>/</span> <span class=pl-s1>std</span> <span class=pl-c1>*</span> <span class=pl-s1>self</span>.<span class=pl-c1>gamma</span> <span class=pl-c1>+</span> <span class=pl-s1>self</span>.<span class=pl-c1>bias</span></td>\n        </tr>\n        <tr>\n          <td id=\&quot;file-layernorm-py-L15\&quot; class=\&quot;blob-num js-line-number js-blob-rnum\&quot; data-line-number=\&quot;15\&quot;></td>\n          <td id=\&quot;file-layernorm-py-LC15\&quot; class=\&quot;blob-code blob-code-inner js-file-line\&quot;>        <span class=pl-k>return</span> <span class=pl-s1>self</span>.<span class=pl-c1>out</span></td>\n        </tr>\n        <tr>\n          <td id=\&quot;file-layernorm-py-L16\&quot; class=\&quot;blob-num js-line-number js-blob-rnum\&quot; data-line-number=\&quot;16\&quot;></td>\n          <td id=\&quot;file-layernorm-py-LC16\&quot; class=\&quot;blob-code blob-code-inner js-file-line\&quot;>\n</td>\n        </tr>\n        <tr>\n          <td id=\&quot;file-layernorm-py-L17\&quot; class=\&quot;blob-num js-line-number js-blob-rnum\&quot; data-line-number=\&quot;17\&quot;></td>\n          <td id=\&quot;file-layernorm-py-LC17\&quot; class=\&quot;blob-code blob-code-inner js-file-line\&quot;>    <span class=pl-k>def</span> <span class=pl-en>params</span>(<span class=pl-s1>self</span>):</td>\n        </tr>\n        <tr>\n          <td id=\&quot;file-layernorm-py-L18\&quot; class=\&quot;blob-num js-line-number js-blob-rnum\&quot; data-line-number=\&quot;18\&quot;></td>\n          <td id=\&quot;file-layernorm-py-LC18\&quot; class=\&quot;blob-code blob-code-inner js-file-line\&quot;>        <span class=pl-k>return</span> 
[<span class=pl-s1>self</span>.<span class=pl-c1>gamma</span>, <span class=pl-s1>self</span>.<span class=pl-c1>bias</span>]</td>\n        </tr>\n  </table>\n</div>\n\n\n    </div>\n\n  </div>\n</div>\n\n      </div>\n      <div class=\&quot;gist-meta\&quot;>\n        <a href=\&quot;https://gist.github.com/cjams/fa490c245bc00bb239970ecfd6737591/raw/07b5ad40595f09d92b102a0bf04d9e203bbf0e42/layernorm.py\&quot; style=\&quot;float:right\&quot; class=\&quot;Link--inTextBlock\&quot;>view raw</a>\n        <a href=\&quot;https://gist.github.com/cjams/fa490c245bc00bb239970ecfd6737591#file-layernorm-py\&quot; class=\&quot;Link--inTextBlock\&quot;>\n          layernorm.py\n        </a>\n        hosted with &amp;#10084; by <a class=\&quot;Link--inTextBlock\&quot; href=\&quot;https://github.com\&quot;>GitHub</a>\n      </div>\n    </div>\n</div>\n&quot;,&quot;stylesheet&quot;:&quot;https://github.githubassets.com/assets/gist-embed-68783a026c0c.css&quot;}" data-component-name="GitgistToDOM"><link rel="stylesheet" href="https://github.githubassets.com/assets/gist-embed-68783a026c0c.css"><div id="gist144283907" class="gist">
    <div class="gist-file" data-color-mode="light" data-light-theme="light">
      <div class="gist-data">
        <div class="js-gist-file-update-container js-task-list-container">
  <div id="file-layernorm-py" class="file my-2">
    
    <div itemprop="text" class="Box-body p-0 blob-wrapper data type-python  " style="overflow:auto">

        
<div class="js-check-hidden-unicode js-blob-code-container blob-code-content">

  

  <table data-hpc="" class="highlight tab-size js-file-line-container" data-tab-size="4" data-paste-markdown-skip="" data-tagsearch-path="layernorm.py">
        <tbody><tr>
          <td id="file-layernorm-py-L1" class="blob-num js-line-number js-blob-rnum" data-line-number="1"></td>
          <td id="file-layernorm-py-LC1" class="blob-code blob-code-inner js-file-line"><span class="pl-k">class</span> <span class="pl-v">LayerNorm</span>():</td>
        </tr>
        <tr>
          <td id="file-layernorm-py-L2" class="blob-num js-line-number js-blob-rnum" data-line-number="2"></td>
          <td id="file-layernorm-py-LC2" class="blob-code blob-code-inner js-file-line">    <span class="pl-k">def</span> <span class="pl-en">__init__</span>(<span class="pl-s1">self</span>, <span class="pl-s1">device</span>, <span class="pl-s1">num_features</span>):</td>
        </tr>
        <tr>
          <td id="file-layernorm-py-L3" class="blob-num js-line-number js-blob-rnum" data-line-number="3"></td>
          <td id="file-layernorm-py-LC3" class="blob-code blob-code-inner js-file-line">        <span class="pl-s1">self</span>.<span class="pl-c1">out</span> <span class="pl-c1">=</span> <span class="pl-c1">None</span></td>
        </tr>
        <tr>
          <td id="file-layernorm-py-L4" class="blob-num js-line-number js-blob-rnum" data-line-number="4"></td>
          <td id="file-layernorm-py-LC4" class="blob-code blob-code-inner js-file-line">        <span class="pl-s1">self</span>.<span class="pl-c1">gamma</span> <span class="pl-c1">=</span> <span class="pl-s1">torch</span>.<span class="pl-c1">ones</span>(<span class="pl-s1">num_features</span>, <span class="pl-s1">device</span><span class="pl-c1">=</span><span class="pl-s1">device</span>)</td>
        </tr>
        <tr>
          <td id="file-layernorm-py-L5" class="blob-num js-line-number js-blob-rnum" data-line-number="5"></td>
          <td id="file-layernorm-py-LC5" class="blob-code blob-code-inner js-file-line">        <span class="pl-s1">self</span>.<span class="pl-c1">bias</span> <span class="pl-c1">=</span> <span class="pl-s1">torch</span>.<span class="pl-c1">zeros</span>(<span class="pl-s1">num_features</span>, <span class="pl-s1">device</span><span class="pl-c1">=</span><span class="pl-s1">device</span>)</td>
        </tr>
        <tr>
          <td id="file-layernorm-py-L6" class="blob-num js-line-number js-blob-rnum" data-line-number="6"></td>
          <td id="file-layernorm-py-LC6" class="blob-code blob-code-inner js-file-line">
</td>
        </tr>
        <tr>
          <td id="file-layernorm-py-L7" class="blob-num js-line-number js-blob-rnum" data-line-number="7"></td>
          <td id="file-layernorm-py-LC7" class="blob-code blob-code-inner js-file-line">    <span class="pl-k">def</span> <span class="pl-en">__call__</span>(<span class="pl-s1">self</span>, <span class="pl-s1">x</span>: <span class="pl-s1">torch</span>.<span class="pl-c1">Tensor</span>):</td>
        </tr>
        <tr>
          <td id="file-layernorm-py-L8" class="blob-num js-line-number js-blob-rnum" data-line-number="8"></td>
          <td id="file-layernorm-py-LC8" class="blob-code blob-code-inner js-file-line">        <span class="pl-k">assert</span> <span class="pl-s1">x</span>.<span class="pl-c1">ndim</span> <span class="pl-c1">==</span> <span class="pl-c1">2</span></td>
        </tr>
        <tr>
          <td id="file-layernorm-py-L9" class="blob-num js-line-number js-blob-rnum" data-line-number="9"></td>
          <td id="file-layernorm-py-LC9" class="blob-code blob-code-inner js-file-line">
</td>
        </tr>
        <tr>
          <td id="file-layernorm-py-L10" class="blob-num js-line-number js-blob-rnum" data-line-number="10"></td>
          <td id="file-layernorm-py-LC10" class="blob-code blob-code-inner js-file-line">        <span class="pl-c1">H</span> <span class="pl-c1">=</span> <span class="pl-s1">x</span>.<span class="pl-c1">shape</span>[<span class="pl-c1">1</span>]</td>
        </tr>
        <tr>
          <td id="file-layernorm-py-L11" class="blob-num js-line-number js-blob-rnum" data-line-number="11"></td>
          <td id="file-layernorm-py-LC11" class="blob-code blob-code-inner js-file-line">        <span class="pl-s1">avg</span> <span class="pl-c1">=</span> <span class="pl-s1">x</span>.<span class="pl-c1">mean</span>(<span class="pl-s1">dim</span><span class="pl-c1">=</span><span class="pl-c1">1</span>, <span class="pl-s1">keepdim</span><span class="pl-c1">=</span><span class="pl-c1">True</span>)</td>
        </tr>
        <tr>
          <td id="file-layernorm-py-L12" class="blob-num js-line-number js-blob-rnum" data-line-number="12"></td>
          <td id="file-layernorm-py-LC12" class="blob-code blob-code-inner js-file-line">        <span class="pl-s1">std</span> <span class="pl-c1">=</span> <span class="pl-s1">torch</span>.<span class="pl-c1">sqrt</span>(<span class="pl-c1">1</span> <span class="pl-c1">/</span> <span class="pl-c1">H</span> <span class="pl-c1">*</span> ((<span class="pl-s1">x</span> <span class="pl-c1">-</span> <span class="pl-s1">avg</span>)<span class="pl-c1">**</span><span class="pl-c1">2</span>).<span class="pl-c1">sum</span>(<span class="pl-s1">dim</span><span class="pl-c1">=</span><span class="pl-c1">1</span>, <span class="pl-s1">keepdim</span><span class="pl-c1">=</span><span class="pl-c1">True</span>))</td>
        </tr>
        <tr>
          <td id="file-layernorm-py-L13" class="blob-num js-line-number js-blob-rnum" data-line-number="13"></td>
          <td id="file-layernorm-py-LC13" class="blob-code blob-code-inner js-file-line">
</td>
        </tr>
        <tr>
          <td id="file-layernorm-py-L14" class="blob-num js-line-number js-blob-rnum" data-line-number="14"></td>
          <td id="file-layernorm-py-LC14" class="blob-code blob-code-inner js-file-line">        <span class="pl-s1">self</span>.<span class="pl-c1">out</span> <span class="pl-c1">=</span> (<span class="pl-s1">x</span> <span class="pl-c1">-</span> <span class="pl-s1">avg</span>) <span class="pl-c1">/</span> <span class="pl-s1">std</span> <span class="pl-c1">*</span> <span class="pl-s1">self</span>.<span class="pl-c1">gamma</span> <span class="pl-c1">+</span> <span class="pl-s1">self</span>.<span class="pl-c1">bias</span></td>
        </tr>
        <tr>
          <td id="file-layernorm-py-L15" class="blob-num js-line-number js-blob-rnum" data-line-number="15"></td>
          <td id="file-layernorm-py-LC15" class="blob-code blob-code-inner js-file-line">        <span class="pl-k">return</span> <span class="pl-s1">self</span>.<span class="pl-c1">out</span></td>
        </tr>
        <tr>
          <td id="file-layernorm-py-L16" class="blob-num js-line-number js-blob-rnum" data-line-number="16"></td>
          <td id="file-layernorm-py-LC16" class="blob-code blob-code-inner js-file-line">
</td>
        </tr>
        <tr>
          <td id="file-layernorm-py-L17" class="blob-num js-line-number js-blob-rnum" data-line-number="17"></td>
          <td id="file-layernorm-py-LC17" class="blob-code blob-code-inner js-file-line">    <span class="pl-k">def</span> <span class="pl-en">params</span>(<span class="pl-s1">self</span>):</td>
        </tr>
        <tr>
          <td id="file-layernorm-py-L18" class="blob-num js-line-number js-blob-rnum" data-line-number="18"></td>
          <td id="file-layernorm-py-LC18" class="blob-code blob-code-inner js-file-line">        <span class="pl-k">return</span> [<span class="pl-s1">self</span>.<span class="pl-c1">gamma</span>, <span class="pl-s1">self</span>.<span class="pl-c1">bias</span>]</td>
        </tr>
  </tbody></table>
</div>


    </div>

  </div>
</div>

      </div>
      <div class="gist-meta">
        <a href="https://gist.github.com/cjams/fa490c245bc00bb239970ecfd6737591/raw/07b5ad40595f09d92b102a0bf04d9e203bbf0e42/layernorm.py" style="float:right" class="Link--inTextBlock">view raw</a>
        <a href="https://gist.github.com/cjams/fa490c245bc00bb239970ecfd6737591#file-layernorm-py" class="Link--inTextBlock">
          layernorm.py
        </a>
        hosted with &#10084; by <a class="Link--inTextBlock" href="https://github.com">GitHub</a>
      </div>
    </div>
</div>
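<p>As a quick sanity check (a sketch, not part of the gist above), the same normalization can be written as a standalone function: before gamma and bias are applied, every row of the output should come out with mean 0 and standard deviation 1.</p>

```python
import torch

def layernorm(x: torch.Tensor, gamma: torch.Tensor, bias: torch.Tensor):
    # Same math as the class above: per-row mean and population std
    avg = x.mean(dim=1, keepdim=True)
    std = torch.sqrt(((x - avg) ** 2).mean(dim=1, keepdim=True))
    return (x - avg) / std * gamma + bias

# A batch of 4 rows with 8 features, deliberately shifted and scaled
x = torch.randn(4, 8) * 3 + 5
out = layernorm(x, torch.ones(8), torch.zeros(8))
print(out.mean(dim=1))                 # each row's mean is ~0
print(out.std(dim=1, unbiased=False))  # each row's std is ~1
```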
</div><p>Let&#8217;s see what happens when we <a href="https://colab.research.google.com/drive/1r4sDMqphipA4y4WYRJz2fzyJ8FV9qPOa#scrollTo=pZyaqgZ0bYL8&amp;line=1&amp;uniqifier=1">add LayerNorm after each linear layer</a> and retrain. The loss curves are reaching a slightly lower minimum (and also slightly overfitting):</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!oTS3!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe5acd2c8-e891-4880-bc9d-5689dc5d0347_1243x802.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!oTS3!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe5acd2c8-e891-4880-bc9d-5689dc5d0347_1243x802.png 424w, https://substackcdn.com/image/fetch/$s_!oTS3!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe5acd2c8-e891-4880-bc9d-5689dc5d0347_1243x802.png 848w, https://substackcdn.com/image/fetch/$s_!oTS3!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe5acd2c8-e891-4880-bc9d-5689dc5d0347_1243x802.png 1272w, https://substackcdn.com/image/fetch/$s_!oTS3!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe5acd2c8-e891-4880-bc9d-5689dc5d0347_1243x802.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!oTS3!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe5acd2c8-e891-4880-bc9d-5689dc5d0347_1243x802.png" width="1243" height="802" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/e5acd2c8-e891-4880-bc9d-5689dc5d0347_1243x802.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:802,&quot;width&quot;:1243,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:85726,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://connorjdavis.substack.com/i/183195738?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe5acd2c8-e891-4880-bc9d-5689dc5d0347_1243x802.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!oTS3!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe5acd2c8-e891-4880-bc9d-5689dc5d0347_1243x802.png 424w, https://substackcdn.com/image/fetch/$s_!oTS3!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe5acd2c8-e891-4880-bc9d-5689dc5d0347_1243x802.png 848w, https://substackcdn.com/image/fetch/$s_!oTS3!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe5acd2c8-e891-4880-bc9d-5689dc5d0347_1243x802.png 1272w, https://substackcdn.com/image/fetch/$s_!oTS3!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe5acd2c8-e891-4880-bc9d-5689dc5d0347_1243x802.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" 
height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>The final validation loss improved from 1.9 to 1.7. Perplexity improved from 5.73 to 4.50. Ok, now grab some coffee! It&#8217;s time for another story:</p><blockquote><p>Story time: Once dook had rine. It was tore on. They fubbe= ef&#173;, storisk &#226;un wookgar. The fabiing. He soulld eed sou jrean frea toy tayt their to so she prcedadn- and suym haw seare a vene ily. Polly wari. They were mad a wert to xime, the veasen" and grien to but furnyt want to sed talked a may.<br><br>The grid a foll. The "iling. %o loon a with ray fft. Tre sly call. He dayz away hoff the perter. The whone her to saibry. The smuny. She by timpy something to seve<br><br>fee the ground a cime then gratest ands and a with his wood! One day plays garond to curprasy was so ine of the back to take a smiled. Ore lefyor her!" Soras ecried. 
Jon the will and adpoly toor lantend his and story."</p></blockquote><p>Hey look at that, dook had rine! We&#8217;re getting more words now. And the early hints of a coherent story are taking shape.  A lot of the words are nonsense still, but the sentences appear to have a structure with basic subject-verb agreement.</p><p>Here are the gradients at the end of training:</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!4ZDZ!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F993e8efc-f887-431d-8f2f-27aabcd904e0_1500x800.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!4ZDZ!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F993e8efc-f887-431d-8f2f-27aabcd904e0_1500x800.png 424w, https://substackcdn.com/image/fetch/$s_!4ZDZ!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F993e8efc-f887-431d-8f2f-27aabcd904e0_1500x800.png 848w, https://substackcdn.com/image/fetch/$s_!4ZDZ!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F993e8efc-f887-431d-8f2f-27aabcd904e0_1500x800.png 1272w, https://substackcdn.com/image/fetch/$s_!4ZDZ!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F993e8efc-f887-431d-8f2f-27aabcd904e0_1500x800.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!4ZDZ!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F993e8efc-f887-431d-8f2f-27aabcd904e0_1500x800.png" width="1456" height="777" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/993e8efc-f887-431d-8f2f-27aabcd904e0_1500x800.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:777,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:62674,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://connorjdavis.substack.com/i/183195738?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F993e8efc-f887-431d-8f2f-27aabcd904e0_1500x800.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!4ZDZ!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F993e8efc-f887-431d-8f2f-27aabcd904e0_1500x800.png 424w, https://substackcdn.com/image/fetch/$s_!4ZDZ!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F993e8efc-f887-431d-8f2f-27aabcd904e0_1500x800.png 848w, https://substackcdn.com/image/fetch/$s_!4ZDZ!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F993e8efc-f887-431d-8f2f-27aabcd904e0_1500x800.png 1272w, https://substackcdn.com/image/fetch/$s_!4ZDZ!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F993e8efc-f887-431d-8f2f-27aabcd904e0_1500x800.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" 
height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>The peak is lower, down to 8000 from around 12000. It looks like many of the layers have a decent spread around 0 as well. Here is the evolution of the gradient of layer 2:</p><div class="native-video-embed" data-component-name="VideoPlaceholder" data-attrs="{&quot;mediaUploadId&quot;:&quot;e70f2470-a253-414e-a8f3-3955091082ac&quot;,&quot;duration&quot;:null}"></div><p>Once again we see the gradients slightly increasing in a controlled way, and no neurons appear dead on any of the steps.</p><h2>Conclusion</h2><p>In this post we&#8217;ve focused on understanding gradient flow and deployed two techniques for stabilizing gradients during training: Xavier initialization and LayerNorm. Of course there are other aspects of the training dynamics we could look at, especially the parameter update size and learning rate. 
There are also architectural tricks like <a href="https://arxiv.org/pdf/1512.03385">residual connections</a>. These connections provide an additional path for gradients to flow back to deep layers. They work by providing a direct linear path from one deep layer to another closer to the output, thereby bypassing the sequence of non-linearities and repeated matrix multiplications that can lead to instabilities. We applied these techniques to language modeling, but they apply generally to all neural networks. In the next post we will further improve the performance of our model by leveraging recurrence.</p><div class="subscription-widget-wrap-editor" data-attrs="{&quot;url&quot;:&quot;https://www.connorjdavis.com/subscribe?&quot;,&quot;text&quot;:&quot;Subscribe&quot;,&quot;language&quot;:&quot;en&quot;}" data-component-name="SubscribeWidgetToDOM"><div class="subscription-widget show-subscribe"><div class="preamble"><p class="cta-caption">Thanks for reading Connor's Substack! Subscribe for free to receive new posts and support my work.</p></div><form class="subscription-widget-subscribe"><input type="email" class="email-input" name="email" placeholder="Type your email&#8230;" tabindex="-1"><input type="submit" class="button primary" value="Subscribe"><div class="fake-input-wrapper"><div class="fake-input"></div><div class="fake-button"></div></div></form></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-1" href="#footnote-anchor-1" class="footnote-number" contenteditable="false" target="_self">1</a><div class="footnote-content"><p>Their findings applied to activation functions that are symmetric around zero, not just tanh</p></div></div>]]></content:encoded></item><item><title><![CDATA[Language Modeling, Part 1: Neural Probabilistic Language Model]]></title><description><![CDATA[I&#8217;ve a goal to understand how language models work, so I&#8217;ve been ramping up on their academic lineage and implementing what I learn in 
code.]]></description><link>https://www.connorjdavis.com/p/tour-de-language-modeling-part-1</link><guid isPermaLink="false">https://www.connorjdavis.com/p/tour-de-language-modeling-part-1</guid><dc:creator><![CDATA[Connor Davis]]></dc:creator><pubDate>Wed, 31 Dec 2025 19:54:35 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!bp5j!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6ff1632d-e32e-4ce4-b748-ae22370cc1f7_800x800.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>I&#8217;ve a goal to understand how language models work, so I&#8217;ve been ramping up on their academic lineage and implementing what I learn in code. </p><p>This post is Part 1 of an n-part series on language modeling. Language modeling has a diverse range of problems within it, such as translation, semantic analysis, and prediction. For this series we will be looking at next token prediction. Specifically, given a sequence of tokens (like characters or words), we want to create a model which predicts the next token. Mathematically this amounts to finding a model which maximizes the likelihood of the next token conditioned on the previous tokens.</p><p>In this series, we are going to start from simple models and work our way up to the modern state of the art like transformers and also some more exotic architectures like text diffusion models.</p><p>For more background on these topics, I highly recommend Andrej Karpathy&#8217;s <a href="https://www.youtube.com/playlist?list=PLAqhIrjkxbuWI23v9cThsA9GvCAUhRvKZ">Zero to Hero</a> series on YouTube. His lectures were motivation for this series. For mathematical background, I recommend <a href="https://mml-book.github.io/book/mml-book.pdf">Mathematics of Machine Learning</a>. 
For detailed explanations of sequence modeling with practical examples, I recommend <a href="https://d2l.ai/index.html">Dive Into Deep Learning</a>.</p><h2>Neural Probabilistic Language Model</h2><p>Some of the first attempts of language prediction used frequency-based n-gram models. With n-gram models, the predictions are based on the frequency of n-grams in the training corpus. The work from <a href="https://www.jmlr.org/papers/volume3/bengio03a/bengio03a.pdf">Bengio et al.</a> improved upon the existing state of the art n-gram models way back in 2003 with their &#8220;Neural Probabilistic Language Model&#8221;. They were one of the first to successfully apply neural networks to the language prediction task. The main contribution from this paper was a solution to the curse of dimensionality problem via learned embeddings.</p><p>To understand the curse of dimensionality, consider the following sentences:</p><blockquote><p>The dog is running in the yard</p><p>The cat is drinking in the kitchen</p><p>The man is running in the street</p></blockquote><p>If only the first two are in the training set, then a classical n-gram approach, where the probabilities are assigned based on the frequency of their occurrence in the training data, will fail to predict the last sentence as very likely, since it never appeared in training. This seems wrong, since all three sentences are grammatically and semantically similar.</p><p>The root cause of this is the set of possible sentences is an extremely high dimensional space (and hence is cursed :)), and the n-gram approach has no mechanism to transfer probability mass found during training to unseen sentences.</p><h3>Overcoming the Curse of Dimensionality</h3><p>Bengio found a way around this using learned word embeddings. Embeddings are real-valued vectors that are used to numerically represent words. The vectors encode the syntax and semantics of each word. 
What is neat about this encoding is that it allows us to perform arithmetic on words, and to measure mathematically how close they are to each other via the geometric interpretation of vectors. So words with similar meaning like &#8220;mammal&#8221;, &#8220;dog&#8221;, &#8220;human&#8221; will have similar vectors and words like &#8220;it&#8221;, &#8220;the&#8221;, &#8220;a&#8221;, &#8220;and&#8221; will have similar vectors, however those two groups of vectors would not be close to each other in the embedding space.</p><p>As a simple example, we can suppose the embedding vector is three dimensional as seen below:</p><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;\n\\vec{x} = \\begin{bmatrix}\n\n\\text{part of speech} \\\\\n\n\\text{inanimate} \\\\\n\n\\text{color} \\\\\n\n\\end{bmatrix}\n&quot;,&quot;id&quot;:&quot;YIKKNDMNSZ&quot;}" data-component-name="LatexBlockToDOM"></div><p></p><p>Each word would have a vector of its own, for example mammal might be (3.6, -2.4, 0.1) and dog might be (3.7, -2.3, 0.5). We can then compute the Euclidean distance between them. Semantically similar words will  have a distance near zero.</p><p>So you may be thinking, OK then we just have to somehow come up with the &#8220;most important&#8221; features of a word, like part of speech and its semantic usage and create vectors that encode these that we come up with. But Bengio found that we can let the model learn what &#8220;features&#8221; are important based on the statistics of the training data. We don&#8217;t have to explicitly say that the first element of each vector represents part of speech, the second whether it represents something inanimate, etc. The model learns these values for us via backpropagation. 
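With the made-up vectors above, the distance computation is a one-liner (the "the" vector below is equally made up, just to show a word from a far-away cluster):

```python
import math

# Illustrative embeddings from the text; "the" is a hypothetical function-word vector
mammal = (3.6, -2.4, 0.1)
dog    = (3.7, -2.3, 0.5)
the    = (-1.2, 4.0, 0.0)

print(math.dist(mammal, dog))  # small distance: semantically close
print(math.dist(mammal, the))  # much larger: a different cluster
```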
</p><p>The reason this technique overcomes the curse of dimensionality is that predicting the next word given a sequence of words is likely to pick any word with a vector which is similar to that seen in the training set, not just the word that was seen in the training set. So predicting the next word of</p><blockquote><p>The man was hungry so he ate the ____.</p></blockquote><p>would assign a high likelihood to any type of word that semantically describes food, even if the only instance of the phrase in training was e.g. &#8220;The man was hungry so he ate the banana&#8221;. This is what it means to &#8220;transfer&#8221; probability mass to similar words. This transfer allows for an exponential increase in the sequences which the model finds likely, leading to better generalization.</p><h2>Visualizing Embeddings with t-SNE</h2><p>Unfortunately, humans can&#8217;t visualize spaces with a high number of dimensions, given our senses have evolved in three dimensional space. This makes gaining intuition of high-dimensional objects like embedding vectors difficult.</p><p>Fortunately, there are algorithms such as t-SNE for projecting high-dimension vectors into two dimensions. This allows us some sense of the clustering that is present among the embedding vectors and gives us intuition for the similarity between the words. </p><p>You can see an example of this below that uses the <a href="https://huggingface.co/datasets/roneneldan/TinyStories">TinyStories</a> dataset. It uses t-SNE to project the 32-dimensional embeddings of lower case letters to two dimensions. 
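For a sense of how such a projection is produced (this is a sketch, not the exact code behind the figure, and the embeddings here are random stand-ins rather than learned ones), scikit-learn's TSNE does the heavy lifting:

```python
import numpy as np
from sklearn.manifold import TSNE

# Random stand-ins for learned character embeddings: 26 letters x 32 dims
rng = np.random.default_rng(0)
emb = rng.normal(size=(26, 32))

# perplexity must be smaller than the number of points
xy = TSNE(n_components=2, perplexity=5.0, random_state=0).fit_transform(emb)
print(xy.shape)  # one 2D point per letter, ready to scatter-plot
```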
</p><div class="captioned-image-container"><figure><img src="https://substackcdn.com/image/fetch/$s_!bp5j!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6ff1632d-e32e-4ce4-b748-ae22370cc1f7_800x800.png" width="800" height="800" alt="t-SNE projection of the character embeddings at random initialization" loading="lazy"></figure></div><p>The vowels are in red and consonants in blue. 
The image above is the t-SNE projection from the random initialization of the embedding vector.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!ic9E!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa3c562af-dea6-4839-8afc-bc8662857ff4_800x800.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!ic9E!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa3c562af-dea6-4839-8afc-bc8662857ff4_800x800.png 424w, https://substackcdn.com/image/fetch/$s_!ic9E!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa3c562af-dea6-4839-8afc-bc8662857ff4_800x800.png 848w, https://substackcdn.com/image/fetch/$s_!ic9E!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa3c562af-dea6-4839-8afc-bc8662857ff4_800x800.png 1272w, https://substackcdn.com/image/fetch/$s_!ic9E!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa3c562af-dea6-4839-8afc-bc8662857ff4_800x800.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!ic9E!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa3c562af-dea6-4839-8afc-bc8662857ff4_800x800.png" width="800" height="800" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/a3c562af-dea6-4839-8afc-bc8662857ff4_800x800.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:800,&quot;width&quot;:800,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:34759,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://connorjdavis.substack.com/i/183082981?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa3c562af-dea6-4839-8afc-bc8662857ff4_800x800.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!ic9E!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa3c562af-dea6-4839-8afc-bc8662857ff4_800x800.png 424w, https://substackcdn.com/image/fetch/$s_!ic9E!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa3c562af-dea6-4839-8afc-bc8662857ff4_800x800.png 848w, https://substackcdn.com/image/fetch/$s_!ic9E!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa3c562af-dea6-4839-8afc-bc8662857ff4_800x800.png 1272w, https://substackcdn.com/image/fetch/$s_!ic9E!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa3c562af-dea6-4839-8afc-bc8662857ff4_800x800.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" 
viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>The image above is the projection after 30,000 training iterations of a character-level language model that uses Bengio-style embeddings for each character. You can see as the training progresses, the embeddings of semantically similar characters like vowels begin to cluster together.</p><h2>Building the Model</h2><p>Let&#8217;s implement a model similar to the one from the Bengio paper. We will use the <a href="https://huggingface.co/datasets/roneneldan/TinyStories">TinyStories</a> dataset. This dataset features a collection of short stories generated from ChatGPT. 
We will train a character-level language model using PyTorch that will be able to generate new short stories for us.</p><p>Note that below will just have snippets; here is a <a href="https://colab.research.google.com/drive/1BjDK0nVW5J9XkmQuInvp9X4KdZmvtz6x?usp=sharing">colab notebook</a> if you want to have a complete picture.</p><h3>Prepping the Dataset</h3><p>The first step is to get familiar with the data. We need to download the dataset and convert it into a format that we can use for training:</p><div class="github-gist" data-attrs="{&quot;innerHTML&quot;:&quot;<div id=\&quot;gist144082181\&quot; class=\&quot;gist\&quot;>\n    <div class=\&quot;gist-file\&quot; translate=\&quot;no\&quot; data-color-mode=\&quot;light\&quot; data-light-theme=\&quot;light\&quot;>\n      <div class=\&quot;gist-data\&quot;>\n        <div class=\&quot;js-gist-file-update-container js-task-list-container\&quot;>\n  <div id=\&quot;file-dataprep-py\&quot; class=\&quot;file my-2\&quot;>\n    \n    <div itemprop=\&quot;text\&quot;\n      class=\&quot;Box-body p-0 blob-wrapper data type-python  \&quot;\n      style=\&quot;overflow: auto\&quot; tabindex=\&quot;0\&quot; role=\&quot;region\&quot;\n      aria-label=\&quot;dataprep.py content, created by cjams on 07:21PM today.\&quot;\n    >\n\n        \n<div class=\&quot;js-check-hidden-unicode js-blob-code-container blob-code-content\&quot;>\n\n  <template class=\&quot;js-file-alert-template\&quot;>\n  <div data-view-component=\&quot;true\&quot; class=\&quot;flash flash-warn flash-full d-flex flex-items-center\&quot;>\n  <svg aria-hidden=\&quot;true\&quot; height=\&quot;16\&quot; viewBox=\&quot;0 0 16 16\&quot; version=\&quot;1.1\&quot; width=\&quot;16\&quot; data-view-component=\&quot;true\&quot; class=\&quot;octicon octicon-alert\&quot;>\n    <path d=\&quot;M6.457 1.047c.659-1.234 2.427-1.234 3.086 0l6.082 11.378A1.75 1.75 0 0 1 14.082 15H1.918a1.75 1.75 0 0 1-1.543-2.575Zm1.763.707a.25.25 0 0 0-.44 0L1.698 13.132a.25.25 0 0 0 
.22.368h12.164a.25.25 0 0 0 .22-.368Zm.53 3.996v2.5a.75.75 0 0 1-1.5 0v-2.5a.75.75 0 0 1 1.5 0ZM9 11a1 1 0 1 1-2 0 1 1 0 0 1 2 0Z\&quot;></path>\n</svg>\n    <span>\n      This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.\n      <a class=\&quot;Link--inTextBlock\&quot; href=\&quot;https://github.co/hiddenchars\&quot; target=\&quot;_blank\&quot;>Learn more about bidirectional Unicode characters</a>\n    </span>\n\n\n  <div data-view-component=\&quot;true\&quot; class=\&quot;flash-action\&quot;>        <a href=\&quot;{{ revealButtonHref }}\&quot; data-view-component=\&quot;true\&quot; class=\&quot;btn-sm btn\&quot;>    Show hidden characters\n</a>\n</div>\n</div></template>\n<template class=\&quot;js-line-alert-template\&quot;>\n  <span aria-label=\&quot;This line has hidden Unicode characters\&quot; data-view-component=\&quot;true\&quot; class=\&quot;line-alert tooltipped tooltipped-e\&quot;>\n    <svg aria-hidden=\&quot;true\&quot; height=\&quot;16\&quot; viewBox=\&quot;0 0 16 16\&quot; version=\&quot;1.1\&quot; width=\&quot;16\&quot; data-view-component=\&quot;true\&quot; class=\&quot;octicon octicon-alert\&quot;>\n    <path d=\&quot;M6.457 1.047c.659-1.234 2.427-1.234 3.086 0l6.082 11.378A1.75 1.75 0 0 1 14.082 15H1.918a1.75 1.75 0 0 1-1.543-2.575Zm1.763.707a.25.25 0 0 0-.44 0L1.698 13.132a.25.25 0 0 0 .22.368h12.164a.25.25 0 0 0 .22-.368Zm.53 3.996v2.5a.75.75 0 0 1-1.5 0v-2.5a.75.75 0 0 1 1.5 0ZM9 11a1 1 0 1 1-2 0 1 1 0 0 1 2 0Z\&quot;></path>\n</svg>\n</span></template>\n\n  <table data-hpc class=\&quot;highlight tab-size js-file-line-container\&quot; data-tab-size=\&quot;4\&quot; data-paste-markdown-skip data-tagsearch-path=\&quot;dataprep.py\&quot;>\n        <tr>\n          <td id=\&quot;file-dataprep-py-L1\&quot; class=\&quot;blob-num js-line-number js-blob-rnum\&quot; 
data-line-number=\&quot;1\&quot;></td>\n          <td id=\&quot;file-dataprep-py-LC1\&quot; class=\&quot;blob-code blob-code-inner js-file-line\&quot;><span class=pl-k>import</span> <span class=pl-s1>torch</span></td>\n        </tr>\n        <tr>\n          <td id=\&quot;file-dataprep-py-L2\&quot; class=\&quot;blob-num js-line-number js-blob-rnum\&quot; data-line-number=\&quot;2\&quot;></td>\n          <td id=\&quot;file-dataprep-py-LC2\&quot; class=\&quot;blob-code blob-code-inner js-file-line\&quot;><span class=pl-k>import</span> <span class=pl-s1>torch</span>.<span class=pl-s1>nn</span>.<span class=pl-s1>functional</span> <span class=pl-k>as</span> <span class=pl-c1>F</span></td>\n        </tr>\n        <tr>\n          <td id=\&quot;file-dataprep-py-L3\&quot; class=\&quot;blob-num js-line-number js-blob-rnum\&quot; data-line-number=\&quot;3\&quot;></td>\n          <td id=\&quot;file-dataprep-py-LC3\&quot; class=\&quot;blob-code blob-code-inner js-file-line\&quot;><span class=pl-k>import</span> <span class=pl-s1>matplotlib</span>.<span class=pl-s1>pyplot</span> <span class=pl-k>as</span> <span class=pl-s1>plt</span></td>\n        </tr>\n        <tr>\n          <td id=\&quot;file-dataprep-py-L4\&quot; class=\&quot;blob-num js-line-number js-blob-rnum\&quot; data-line-number=\&quot;4\&quot;></td>\n          <td id=\&quot;file-dataprep-py-LC4\&quot; class=\&quot;blob-code blob-code-inner js-file-line\&quot;><span class=pl-k>import</span> <span class=pl-s1>numpy</span> <span class=pl-k>as</span> <span class=pl-s1>np</span></td>\n        </tr>\n        <tr>\n          <td id=\&quot;file-dataprep-py-L5\&quot; class=\&quot;blob-num js-line-number js-blob-rnum\&quot; data-line-number=\&quot;5\&quot;></td>\n          <td id=\&quot;file-dataprep-py-LC5\&quot; class=\&quot;blob-code blob-code-inner js-file-line\&quot;>\n</td>\n        </tr>\n        <tr>\n          <td id=\&quot;file-dataprep-py-L6\&quot; class=\&quot;blob-num js-line-number js-blob-rnum\&quot; 
data-line-number=\&quot;6\&quot;></td>\n          <td id=\&quot;file-dataprep-py-LC6\&quot; class=\&quot;blob-code blob-code-inner js-file-line\&quot;><span class=pl-k>from</span> <span class=pl-s1>datasets</span> <span class=pl-k>import</span> <span class=pl-s1>load_dataset</span></td>\n        </tr>\n        <tr>\n          <td id=\&quot;file-dataprep-py-L7\&quot; class=\&quot;blob-num js-line-number js-blob-rnum\&quot; data-line-number=\&quot;7\&quot;></td>\n          <td id=\&quot;file-dataprep-py-LC7\&quot; class=\&quot;blob-code blob-code-inner js-file-line\&quot;>\n</td>\n        </tr>\n        <tr>\n          <td id=\&quot;file-dataprep-py-L8\&quot; class=\&quot;blob-num js-line-number js-blob-rnum\&quot; data-line-number=\&quot;8\&quot;></td>\n          <td id=\&quot;file-dataprep-py-LC8\&quot; class=\&quot;blob-code blob-code-inner js-file-line\&quot;><span class=pl-s1>d</span> <span class=pl-c1>=</span> <span class=pl-en>load_dataset</span>(&#8217;<span class=pl-s1>roneneldan</span><span class=pl-c1>/</span><span class=pl-v>TinyStories</span>&#8217;)</td>\n        </tr>\n        <tr>\n          <td id=\&quot;file-dataprep-py-L9\&quot; class=\&quot;blob-num js-line-number js-blob-rnum\&quot; data-line-number=\&quot;9\&quot;></td>\n          <td id=\&quot;file-dataprep-py-LC9\&quot; class=\&quot;blob-code blob-code-inner js-file-line\&quot;><span class=pl-s1>data_trn</span> <span class=pl-c1>=</span> <span class=pl-en>list</span>(<span class=pl-s1>d</span>[&#8217;<span class=pl-s1>train</span>&#8217;].<span class=pl-c1>to_pandas</span>().<span class=pl-c1>to_dict</span>().<span class=pl-c1>values</span>())</td>\n        </tr>\n        <tr>\n          <td id=\&quot;file-dataprep-py-L10\&quot; class=\&quot;blob-num js-line-number js-blob-rnum\&quot; data-line-number=\&quot;10\&quot;></td>\n          <td id=\&quot;file-dataprep-py-LC10\&quot; class=\&quot;blob-code blob-code-inner js-file-line\&quot;><span class=pl-s1>data_val</span> <span 
class=pl-c1>=</span> <span class=pl-en>list</span>(<span class=pl-s1>d</span>[&#8217;<span class=pl-s1>validation</span>&#8217;].<span class=pl-c1>to_pandas</span>().<span class=pl-c1>to_dict</span>().<span class=pl-c1>values</span>())</td>\n        </tr>\n        <tr>\n          <td id=\&quot;file-dataprep-py-L11\&quot; class=\&quot;blob-num js-line-number js-blob-rnum\&quot; data-line-number=\&quot;11\&quot;></td>\n          <td id=\&quot;file-dataprep-py-LC11\&quot; class=\&quot;blob-code blob-code-inner js-file-line\&quot;>\n</td>\n        </tr>\n        <tr>\n          <td id=\&quot;file-dataprep-py-L12\&quot; class=\&quot;blob-num js-line-number js-blob-rnum\&quot; data-line-number=\&quot;12\&quot;></td>\n          <td id=\&quot;file-dataprep-py-LC12\&quot; class=\&quot;blob-code blob-code-inner js-file-line\&quot;><span class=pl-c># Get the unique characters across the training set</span></td>\n        </tr>\n        <tr>\n          <td id=\&quot;file-dataprep-py-L13\&quot; class=\&quot;blob-num js-line-number js-blob-rnum\&quot; data-line-number=\&quot;13\&quot;></td>\n          <td id=\&quot;file-dataprep-py-LC13\&quot; class=\&quot;blob-code blob-code-inner js-file-line\&quot;><span class=pl-s1>unique_chars</span> <span class=pl-c1>=</span> <span class=pl-en>set</span>()</td>\n        </tr>\n        <tr>\n          <td id=\&quot;file-dataprep-py-L14\&quot; class=\&quot;blob-num js-line-number js-blob-rnum\&quot; data-line-number=\&quot;14\&quot;></td>\n          <td id=\&quot;file-dataprep-py-LC14\&quot; class=\&quot;blob-code blob-code-inner js-file-line\&quot;>\n</td>\n        </tr>\n        <tr>\n          <td id=\&quot;file-dataprep-py-L15\&quot; class=\&quot;blob-num js-line-number js-blob-rnum\&quot; data-line-number=\&quot;15\&quot;></td>\n          <td id=\&quot;file-dataprep-py-LC15\&quot; class=\&quot;blob-code blob-code-inner js-file-line\&quot;><span class=pl-k>for</span> <span class=pl-s1>story_dict</span> <span class=pl-c1>in</span> <span 
class=pl-s1>data_trn</span>:</td>\n        </tr>\n        <tr>\n          <td id=\&quot;file-dataprep-py-L16\&quot; class=\&quot;blob-num js-line-number js-blob-rnum\&quot; data-line-number=\&quot;16\&quot;></td>\n          <td id=\&quot;file-dataprep-py-LC16\&quot; class=\&quot;blob-code blob-code-inner js-file-line\&quot;>    <span class=pl-k>for</span> <span class=pl-s1>story_text</span> <span class=pl-c1>in</span> <span class=pl-s1>story_dict</span>.<span class=pl-c1>values</span>():</td>\n        </tr>\n        <tr>\n          <td id=\&quot;file-dataprep-py-L17\&quot; class=\&quot;blob-num js-line-number js-blob-rnum\&quot; data-line-number=\&quot;17\&quot;></td>\n          <td id=\&quot;file-dataprep-py-LC17\&quot; class=\&quot;blob-code blob-code-inner js-file-line\&quot;>        <span class=pl-s1>unique_chars</span>.<span class=pl-c1>update</span>(<span class=pl-en>set</span>(<span class=pl-s1>story_text</span>))</td>\n        </tr>\n        <tr>\n          <td id=\&quot;file-dataprep-py-L18\&quot; class=\&quot;blob-num js-line-number js-blob-rnum\&quot; data-line-number=\&quot;18\&quot;></td>\n          <td id=\&quot;file-dataprep-py-LC18\&quot; class=\&quot;blob-code blob-code-inner js-file-line\&quot;>\n</td>\n        </tr>\n        <tr>\n          <td id=\&quot;file-dataprep-py-L19\&quot; class=\&quot;blob-num js-line-number js-blob-rnum\&quot; data-line-number=\&quot;19\&quot;></td>\n          <td id=\&quot;file-dataprep-py-LC19\&quot; class=\&quot;blob-code blob-code-inner js-file-line\&quot;><span class=pl-s1>unique_chars</span> <span class=pl-c1>=</span> &#8216;&#8217;.<span class=pl-en>join</span>(<span class=pl-en>sorted</span>(<span class=pl-s1>unique_chars</span>))</td>\n        </tr>\n        <tr>\n          <td id=\&quot;file-dataprep-py-L20\&quot; class=\&quot;blob-num js-line-number js-blob-rnum\&quot; data-line-number=\&quot;20\&quot;></td>\n          <td id=\&quot;file-dataprep-py-LC20\&quot; class=\&quot;blob-code blob-code-inner 
js-file-line\&quot;>\n</td>\n        </tr>\n        <tr>\n          <td id=\&quot;file-dataprep-py-L21\&quot; class=\&quot;blob-num js-line-number js-blob-rnum\&quot; data-line-number=\&quot;21\&quot;></td>\n          <td id=\&quot;file-dataprep-py-LC21\&quot; class=\&quot;blob-code blob-code-inner js-file-line\&quot;><span class=pl-c># We use &#8216;^&#8217; as a special start character in the model</span></td>\n        </tr>\n        <tr>\n          <td id=\&quot;file-dataprep-py-L22\&quot; class=\&quot;blob-num js-line-number js-blob-rnum\&quot; data-line-number=\&quot;22\&quot;></td>\n          <td id=\&quot;file-dataprep-py-LC22\&quot; class=\&quot;blob-code blob-code-inner js-file-line\&quot;><span class=pl-k>assert</span> &#8216;<span class=pl-c1>^</span>&#8217; <span class=pl-c1>not</span> <span class=pl-s1>in</span> <span class=pl-s1>unique_chars</span></td>\n        </tr>\n  </table>\n</div>\n\n\n    </div>\n\n  </div>\n</div>\n\n      </div>\n      <div class=\&quot;gist-meta\&quot;>\n        <a href=\&quot;https://gist.github.com/cjams/53ece2c93a2995f10b679405c8fbf3f1/raw/a72d6499cf32733726dc643f27a7103f3f0be8d8/dataprep.py\&quot; style=\&quot;float:right\&quot; class=\&quot;Link--inTextBlock\&quot;>view raw</a>\n        <a href=\&quot;https://gist.github.com/cjams/53ece2c93a2995f10b679405c8fbf3f1#file-dataprep-py\&quot; class=\&quot;Link--inTextBlock\&quot;>\n          dataprep.py\n        </a>\n        hosted with &amp;#10084; by <a class=\&quot;Link--inTextBlock\&quot; href=\&quot;https://github.com\&quot;>GitHub</a>\n      </div>\n    </div>\n</div>\n&quot;,&quot;stylesheet&quot;:&quot;https://github.githubassets.com/assets/gist-embed-ed91f9610ae6.css&quot;}" data-component-name="GitgistToDOM"><link rel="stylesheet" href="https://github.githubassets.com/assets/gist-embed-ed91f9610ae6.css"><div id="gist144082181" class="gist">
    <div class="gist-file" data-color-mode="light" data-light-theme="light">
      <div class="gist-data">
        <div class="js-gist-file-update-container js-task-list-container">
  <div id="file-dataprep-py" class="file my-2">
    
    <div itemprop="text" class="Box-body p-0 blob-wrapper data type-python  " style="overflow:auto">

        
<div class="js-check-hidden-unicode js-blob-code-container blob-code-content">

  
  <div data-view-component="true" class="flash flash-warn flash-full d-flex flex-items-center">
  
    

    <span>
      This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
      <a class="Link--inTextBlock" href="https://github.co/hiddenchars" target="_blank">Learn more about bidirectional Unicode characters</a>
    </span>


  <div data-view-component="true" class="flash-action">        <a href="{{ revealButtonHref }}" data-view-component="true" class="btn-sm btn">    Show hidden characters
</a>
</div>
</div>

  <span data-view-component="true" class="line-alert tooltipped tooltipped-e">
    
    

</span>

  <table data-hpc="" class="highlight tab-size js-file-line-container" data-tab-size="4" data-paste-markdown-skip="" data-tagsearch-path="dataprep.py">
        <tbody><tr>
          <td id="file-dataprep-py-L1" class="blob-num js-line-number js-blob-rnum" data-line-number="1"></td>
          <td id="file-dataprep-py-LC1" class="blob-code blob-code-inner js-file-line"><span class="pl-k">import</span> <span class="pl-s1">torch</span></td>
        </tr>
        <tr>
          <td id="file-dataprep-py-L2" class="blob-num js-line-number js-blob-rnum" data-line-number="2"></td>
          <td id="file-dataprep-py-LC2" class="blob-code blob-code-inner js-file-line"><span class="pl-k">import</span> <span class="pl-s1">torch</span>.<span class="pl-s1">nn</span>.<span class="pl-s1">functional</span> <span class="pl-k">as</span> <span class="pl-c1">F</span></td>
        </tr>
        <tr>
          <td id="file-dataprep-py-L3" class="blob-num js-line-number js-blob-rnum" data-line-number="3"></td>
          <td id="file-dataprep-py-LC3" class="blob-code blob-code-inner js-file-line"><span class="pl-k">import</span> <span class="pl-s1">matplotlib</span>.<span class="pl-s1">pyplot</span> <span class="pl-k">as</span> <span class="pl-s1">plt</span></td>
        </tr>
        <tr>
          <td id="file-dataprep-py-L4" class="blob-num js-line-number js-blob-rnum" data-line-number="4"></td>
          <td id="file-dataprep-py-LC4" class="blob-code blob-code-inner js-file-line"><span class="pl-k">import</span> <span class="pl-s1">numpy</span> <span class="pl-k">as</span> <span class="pl-s1">np</span></td>
        </tr>
        <tr>
          <td id="file-dataprep-py-L5" class="blob-num js-line-number js-blob-rnum" data-line-number="5"></td>
          <td id="file-dataprep-py-LC5" class="blob-code blob-code-inner js-file-line">
</td>
        </tr>
        <tr>
          <td id="file-dataprep-py-L6" class="blob-num js-line-number js-blob-rnum" data-line-number="6"></td>
          <td id="file-dataprep-py-LC6" class="blob-code blob-code-inner js-file-line"><span class="pl-k">from</span> <span class="pl-s1">datasets</span> <span class="pl-k">import</span> <span class="pl-s1">load_dataset</span></td>
        </tr>
        <tr>
          <td id="file-dataprep-py-L7" class="blob-num js-line-number js-blob-rnum" data-line-number="7"></td>
          <td id="file-dataprep-py-LC7" class="blob-code blob-code-inner js-file-line">
</td>
        </tr>
        <tr>
          <td id="file-dataprep-py-L8" class="blob-num js-line-number js-blob-rnum" data-line-number="8"></td>
          <td id="file-dataprep-py-LC8" class="blob-code blob-code-inner js-file-line"><span class="pl-s1">d</span> <span class="pl-c1">=</span> <span class="pl-en">load_dataset</span>(&#8217;<span class="pl-s1">roneneldan</span><span class="pl-c1">/</span><span class="pl-v">TinyStories</span>&#8217;)</td>
        </tr>
        <tr>
          <td id="file-dataprep-py-L9" class="blob-num js-line-number js-blob-rnum" data-line-number="9"></td>
          <td id="file-dataprep-py-LC9" class="blob-code blob-code-inner js-file-line"><span class="pl-s1">data_trn</span> <span class="pl-c1">=</span> <span class="pl-en">list</span>(<span class="pl-s1">d</span>[&#8217;<span class="pl-s1">train</span>&#8217;].<span class="pl-c1">to_pandas</span>().<span class="pl-c1">to_dict</span>().<span class="pl-c1">values</span>())</td>
        </tr>
        <tr>
          <td id="file-dataprep-py-L10" class="blob-num js-line-number js-blob-rnum" data-line-number="10"></td>
          <td id="file-dataprep-py-LC10" class="blob-code blob-code-inner js-file-line"><span class="pl-s1">data_val</span> <span class="pl-c1">=</span> <span class="pl-en">list</span>(<span class="pl-s1">d</span>[&#8217;<span class="pl-s1">validation</span>&#8217;].<span class="pl-c1">to_pandas</span>().<span class="pl-c1">to_dict</span>().<span class="pl-c1">values</span>())</td>
        </tr>
        <tr>
          <td id="file-dataprep-py-L11" class="blob-num js-line-number js-blob-rnum" data-line-number="11"></td>
          <td id="file-dataprep-py-LC11" class="blob-code blob-code-inner js-file-line">
</td>
        </tr>
        <tr>
          <td id="file-dataprep-py-L12" class="blob-num js-line-number js-blob-rnum" data-line-number="12"></td>
          <td id="file-dataprep-py-LC12" class="blob-code blob-code-inner js-file-line"><span class="pl-c"># Get the unique characters across the training set</span></td>
        </tr>
        <tr>
          <td id="file-dataprep-py-L13" class="blob-num js-line-number js-blob-rnum" data-line-number="13"></td>
          <td id="file-dataprep-py-LC13" class="blob-code blob-code-inner js-file-line"><span class="pl-s1">unique_chars</span> <span class="pl-c1">=</span> <span class="pl-en">set</span>()</td>
        </tr>
        <tr>
          <td id="file-dataprep-py-L14" class="blob-num js-line-number js-blob-rnum" data-line-number="14"></td>
          <td id="file-dataprep-py-LC14" class="blob-code blob-code-inner js-file-line">
</td>
        </tr>
        <tr>
          <td id="file-dataprep-py-L15" class="blob-num js-line-number js-blob-rnum" data-line-number="15"></td>
          <td id="file-dataprep-py-LC15" class="blob-code blob-code-inner js-file-line"><span class="pl-k">for</span> <span class="pl-s1">story_dict</span> <span class="pl-c1">in</span> <span class="pl-s1">data_trn</span>:</td>
        </tr>
        <tr>
          <td id="file-dataprep-py-L16" class="blob-num js-line-number js-blob-rnum" data-line-number="16"></td>
          <td id="file-dataprep-py-LC16" class="blob-code blob-code-inner js-file-line">    <span class="pl-k">for</span> <span class="pl-s1">story_text</span> <span class="pl-c1">in</span> <span class="pl-s1">story_dict</span>.<span class="pl-c1">values</span>():</td>
        </tr>
        <tr>
          <td id="file-dataprep-py-L17" class="blob-num js-line-number js-blob-rnum" data-line-number="17"></td>
          <td id="file-dataprep-py-LC17" class="blob-code blob-code-inner js-file-line">        <span class="pl-s1">unique_chars</span>.<span class="pl-c1">update</span>(<span class="pl-en">set</span>(<span class="pl-s1">story_text</span>))</td>
        </tr>
        <tr>
          <td id="file-dataprep-py-L18" class="blob-num js-line-number js-blob-rnum" data-line-number="18"></td>
          <td id="file-dataprep-py-LC18" class="blob-code blob-code-inner js-file-line">
</td>
        </tr>
        <tr>
          <td id="file-dataprep-py-L19" class="blob-num js-line-number js-blob-rnum" data-line-number="19"></td>
          <td id="file-dataprep-py-LC19" class="blob-code blob-code-inner js-file-line"><span class="pl-s1">unique_chars</span> <span class="pl-c1">=</span> &#8216;&#8217;.<span class="pl-en">join</span>(<span class="pl-en">sorted</span>(<span class="pl-s1">unique_chars</span>))</td>
        </tr>
        <tr>
          <td id="file-dataprep-py-L20" class="blob-num js-line-number js-blob-rnum" data-line-number="20"></td>
          <td id="file-dataprep-py-LC20" class="blob-code blob-code-inner js-file-line">
</td>
        </tr>
        <tr>
          <td id="file-dataprep-py-L21" class="blob-num js-line-number js-blob-rnum" data-line-number="21"></td>
          <td id="file-dataprep-py-LC21" class="blob-code blob-code-inner js-file-line"><span class="pl-c"># We use &#8216;^&#8217; as a special start character in the model</span></td>
        </tr>
        <tr>
          <td id="file-dataprep-py-L22" class="blob-num js-line-number js-blob-rnum" data-line-number="22"></td>
          <td id="file-dataprep-py-LC22" class="blob-code blob-code-inner js-file-line"><span class="pl-k">assert</span> &#8216;<span class="pl-c1">^</span>&#8217; <span class="pl-c1">not</span> <span class="pl-s1">in</span> <span class="pl-s1">unique_chars</span></td>
        </tr>
  </tbody></table>
</div>


    </div>

  </div>
</div>

      </div>
      <div class="gist-meta">
        <a href="https://gist.github.com/cjams/53ece2c93a2995f10b679405c8fbf3f1/raw/a72d6499cf32733726dc643f27a7103f3f0be8d8/dataprep.py" style="float:right" class="Link--inTextBlock">view raw</a>
        <a href="https://gist.github.com/cjams/53ece2c93a2995f10b679405c8fbf3f1#file-dataprep-py" class="Link--inTextBlock">
          dataprep.py
        </a>
        hosted with &#10084; by <a class="Link--inTextBlock" href="https://github.com">GitHub</a>
      </div>
    </div>
</div>
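<p>To make the vocabulary step above concrete, here is a self-contained sketch that builds the same sorted character vocabulary from a toy corpus; the example words are made up for illustration, while the real code operates on the actual training set:</p>

```python
# Toy stand-in for the training data (the real corpus is loaded earlier)
corpus = ["emma", "olivia", "ava"]

# Gather every distinct character and sort for a stable, reproducible order
unique_chars = ''.join(sorted(set(''.join(corpus))))
print(unique_chars)  # -> aeilmov

# '^' is reserved as the start character, so it must be absent from the data
assert '^' not in unique_chars
```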
</div><p>I validated experimentally that &#8216;^&#8217; isn&#8217;t in the training set at all, so I opted to use it as a special starting character as we will see later. Now we can create our char-to-index/index-to-char mappings as well as a dataset creation function. This function build_dataset creates sequences of character indices, each with length equal to <code>ctx_window</code>.  </p><div class="github-gist" data-attrs="{&quot;innerHTML&quot;:&quot;<div id=\&quot;gist144082364\&quot; class=\&quot;gist\&quot;>\n    <div class=\&quot;gist-file\&quot; translate=\&quot;no\&quot; data-color-mode=\&quot;light\&quot; data-light-theme=\&quot;light\&quot;>\n      <div class=\&quot;gist-data\&quot;>\n        <div class=\&quot;js-gist-file-update-container js-task-list-container\&quot;>\n  <div id=\&quot;file-dataset-py\&quot; class=\&quot;file my-2\&quot;>\n    \n    <div itemprop=\&quot;text\&quot;\n      class=\&quot;Box-body p-0 blob-wrapper data type-python  \&quot;\n      style=\&quot;overflow: auto\&quot; tabindex=\&quot;0\&quot; role=\&quot;region\&quot;\n      aria-label=\&quot;dataset.py content, created by cjams on 07:36PM today.\&quot;\n    >\n\n        \n<div class=\&quot;js-check-hidden-unicode js-blob-code-container blob-code-content\&quot;>\n\n  <template class=\&quot;js-file-alert-template\&quot;>\n  <div data-view-component=\&quot;true\&quot; class=\&quot;flash flash-warn flash-full d-flex flex-items-center\&quot;>\n  <svg aria-hidden=\&quot;true\&quot; height=\&quot;16\&quot; viewBox=\&quot;0 0 16 16\&quot; version=\&quot;1.1\&quot; width=\&quot;16\&quot; data-view-component=\&quot;true\&quot; class=\&quot;octicon octicon-alert\&quot;>\n    <path d=\&quot;M6.457 1.047c.659-1.234 2.427-1.234 3.086 0l6.082 11.378A1.75 1.75 0 0 1 14.082 15H1.918a1.75 1.75 0 0 1-1.543-2.575Zm1.763.707a.25.25 0 0 0-.44 0L1.698 13.132a.25.25 0 0 0 .22.368h12.164a.25.25 0 0 0 .22-.368Zm.53 3.996v2.5a.75.75 0 0 1-1.5 0v-2.5a.75.75 0 0 1 1.5 0ZM9 11a1 1 0 1 1-2 0 1 1 0 0 1 2 
0Z\&quot;></path>\n</svg>\n    <span>\n      This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.\n      <a class=\&quot;Link--inTextBlock\&quot; href=\&quot;https://github.co/hiddenchars\&quot; target=\&quot;_blank\&quot;>Learn more about bidirectional Unicode characters</a>\n    </span>\n\n\n  <div data-view-component=\&quot;true\&quot; class=\&quot;flash-action\&quot;>        <a href=\&quot;{{ revealButtonHref }}\&quot; data-view-component=\&quot;true\&quot; class=\&quot;btn-sm btn\&quot;>    Show hidden characters\n</a>\n</div>\n</div></template>\n<template class=\&quot;js-line-alert-template\&quot;>\n  <span aria-label=\&quot;This line has hidden Unicode characters\&quot; data-view-component=\&quot;true\&quot; class=\&quot;line-alert tooltipped tooltipped-e\&quot;>\n    <svg aria-hidden=\&quot;true\&quot; height=\&quot;16\&quot; viewBox=\&quot;0 0 16 16\&quot; version=\&quot;1.1\&quot; width=\&quot;16\&quot; data-view-component=\&quot;true\&quot; class=\&quot;octicon octicon-alert\&quot;>\n    <path d=\&quot;M6.457 1.047c.659-1.234 2.427-1.234 3.086 0l6.082 11.378A1.75 1.75 0 0 1 14.082 15H1.918a1.75 1.75 0 0 1-1.543-2.575Zm1.763.707a.25.25 0 0 0-.44 0L1.698 13.132a.25.25 0 0 0 .22.368h12.164a.25.25 0 0 0 .22-.368Zm.53 3.996v2.5a.75.75 0 0 1-1.5 0v-2.5a.75.75 0 0 1 1.5 0ZM9 11a1 1 0 1 1-2 0 1 1 0 0 1 2 0Z\&quot;></path>\n</svg>\n</span></template>\n\n  <table data-hpc class=\&quot;highlight tab-size js-file-line-container\&quot; data-tab-size=\&quot;4\&quot; data-paste-markdown-skip data-tagsearch-path=\&quot;dataset.py\&quot;>\n        <tr>\n          <td id=\&quot;file-dataset-py-L1\&quot; class=\&quot;blob-num js-line-number js-blob-rnum\&quot; data-line-number=\&quot;1\&quot;></td>\n          <td id=\&quot;file-dataset-py-LC1\&quot; class=\&quot;blob-code blob-code-inner js-file-line\&quot;><span 
class=pl-s1>stoi</span> <span class=pl-c1>=</span> {<span class=pl-s1>s</span>:<span class=pl-s1>i</span><span class=pl-c1>+</span><span class=pl-c1>1</span> <span class=pl-k>for</span> <span class=pl-s1>i</span>, <span class=pl-s1>s</span> <span class=pl-c1>in</span> <span class=pl-en>enumerate</span>(<span class=pl-s1>unique_chars</span>)}</td>\n        </tr>\n        <tr>\n          <td id=\&quot;file-dataset-py-L2\&quot; class=\&quot;blob-num js-line-number js-blob-rnum\&quot; data-line-number=\&quot;2\&quot;></td>\n          <td id=\&quot;file-dataset-py-LC2\&quot; class=\&quot;blob-code blob-code-inner js-file-line\&quot;><span class=pl-s1>stoi</span>[&#8217;<span class=pl-c1>^</span>&#8217;] <span class=pl-c1>=</span> <span class=pl-c1>0</span></td>\n        </tr>\n        <tr>\n          <td id=\&quot;file-dataset-py-L3\&quot; class=\&quot;blob-num js-line-number js-blob-rnum\&quot; data-line-number=\&quot;3\&quot;></td>\n          <td id=\&quot;file-dataset-py-LC3\&quot; class=\&quot;blob-code blob-code-inner js-file-line\&quot;><span class=pl-s1>itos</span> <span class=pl-c1>=</span> {<span class=pl-s1>i</span>:<span class=pl-s1>s</span> <span class=pl-k>for</span> <span class=pl-s1>s</span>, <span class=pl-s1>i</span> <span class=pl-c1>in</span> <span class=pl-s1>stoi</span>.<span class=pl-c1>items</span>()}</td>\n        </tr>\n        <tr>\n          <td id=\&quot;file-dataset-py-L4\&quot; class=\&quot;blob-num js-line-number js-blob-rnum\&quot; data-line-number=\&quot;4\&quot;></td>\n          <td id=\&quot;file-dataset-py-LC4\&quot; class=\&quot;blob-code blob-code-inner js-file-line\&quot;>\n</td>\n        </tr>\n        <tr>\n          <td id=\&quot;file-dataset-py-L5\&quot; class=\&quot;blob-num js-line-number js-blob-rnum\&quot; data-line-number=\&quot;5\&quot;></td>\n          <td id=\&quot;file-dataset-py-LC5\&quot; class=\&quot;blob-code blob-code-inner js-file-line\&quot;><span class=pl-s1>def</span> <span 
class=pl-en>build_dataset</span>(<span class=pl-s1>data</span>, <span class=pl-s1>ctx_window</span>):</td>\n        </tr>\n        <tr>\n          <td id=\&quot;file-dataset-py-L6\&quot; class=\&quot;blob-num js-line-number js-blob-rnum\&quot; data-line-number=\&quot;6\&quot;></td>\n          <td id=\&quot;file-dataset-py-LC6\&quot; class=\&quot;blob-code blob-code-inner js-file-line\&quot;>    <span class=pl-c1>X</span>, <span class=pl-c1>Y</span> <span class=pl-c1>=</span> [], []</td>\n        </tr>\n        <tr>\n          <td id=\&quot;file-dataset-py-L7\&quot; class=\&quot;blob-num js-line-number js-blob-rnum\&quot; data-line-number=\&quot;7\&quot;></td>\n          <td id=\&quot;file-dataset-py-LC7\&quot; class=\&quot;blob-code blob-code-inner js-file-line\&quot;>    </td>\n        </tr>\n        <tr>\n          <td id=\&quot;file-dataset-py-L8\&quot; class=\&quot;blob-num js-line-number js-blob-rnum\&quot; data-line-number=\&quot;8\&quot;></td>\n          <td id=\&quot;file-dataset-py-LC8\&quot; class=\&quot;blob-code blob-code-inner js-file-line\&quot;>    <span class=pl-k>for</span> <span class=pl-s1>s</span> <span class=pl-c1>in</span> <span class=pl-s1>data</span>:</td>\n        </tr>\n        <tr>\n          <td id=\&quot;file-dataset-py-L9\&quot; class=\&quot;blob-num js-line-number js-blob-rnum\&quot; data-line-number=\&quot;9\&quot;></td>\n          <td id=\&quot;file-dataset-py-LC9\&quot; class=\&quot;blob-code blob-code-inner js-file-line\&quot;>        <span class=pl-s1>ctx</span> <span class=pl-c1>=</span> [<span class=pl-c1>0</span>] <span class=pl-c1>*</span> <span class=pl-s1>ctx_window</span></td>\n        </tr>\n        <tr>\n          <td id=\&quot;file-dataset-py-L10\&quot; class=\&quot;blob-num js-line-number js-blob-rnum\&quot; data-line-number=\&quot;10\&quot;></td>\n          <td id=\&quot;file-dataset-py-LC10\&quot; class=\&quot;blob-code blob-code-inner js-file-line\&quot;>        </td>\n        </tr>\n        <tr>\n          <td 
id=\&quot;file-dataset-py-L11\&quot; class=\&quot;blob-num js-line-number js-blob-rnum\&quot; data-line-number=\&quot;11\&quot;></td>\n          <td id=\&quot;file-dataset-py-LC11\&quot; class=\&quot;blob-code blob-code-inner js-file-line\&quot;>        <span class=pl-k>for</span> <span class=pl-s1>c</span> <span class=pl-c1>in</span> <span class=pl-s1>s</span>:</td>\n        </tr>\n        <tr>\n          <td id=\&quot;file-dataset-py-L12\&quot; class=\&quot;blob-num js-line-number js-blob-rnum\&quot; data-line-number=\&quot;12\&quot;></td>\n          <td id=\&quot;file-dataset-py-LC12\&quot; class=\&quot;blob-code blob-code-inner js-file-line\&quot;>            <span class=pl-c1>X</span>.<span class=pl-c1>append</span>(<span class=pl-s1>ctx</span>)</td>\n        </tr>\n        <tr>\n          <td id=\&quot;file-dataset-py-L13\&quot; class=\&quot;blob-num js-line-number js-blob-rnum\&quot; data-line-number=\&quot;13\&quot;></td>\n          <td id=\&quot;file-dataset-py-LC13\&quot; class=\&quot;blob-code blob-code-inner js-file-line\&quot;>            <span class=pl-c1>Y</span>.<span class=pl-c1>append</span>(<span class=pl-s1>stoi</span>[<span class=pl-s1>c</span>])</td>\n        </tr>\n        <tr>\n          <td id=\&quot;file-dataset-py-L14\&quot; class=\&quot;blob-num js-line-number js-blob-rnum\&quot; data-line-number=\&quot;14\&quot;></td>\n          <td id=\&quot;file-dataset-py-LC14\&quot; class=\&quot;blob-code blob-code-inner js-file-line\&quot;>            <span class=pl-s1>ctx</span> <span class=pl-c1>=</span> <span class=pl-s1>ctx</span>[<span class=pl-c1>1</span>:] <span class=pl-c1>+</span> [<span class=pl-s1>stoi</span>[<span class=pl-s1>c</span>]]</td>\n        </tr>\n        <tr>\n          <td id=\&quot;file-dataset-py-L15\&quot; class=\&quot;blob-num js-line-number js-blob-rnum\&quot; data-line-number=\&quot;15\&quot;></td>\n          <td id=\&quot;file-dataset-py-LC15\&quot; class=\&quot;blob-code blob-code-inner js-file-line\&quot;>\n</td>\n  
      </tr>\n        <tr>\n          <td id=\&quot;file-dataset-py-L16\&quot; class=\&quot;blob-num js-line-number js-blob-rnum\&quot; data-line-number=\&quot;16\&quot;></td>\n          <td id=\&quot;file-dataset-py-LC16\&quot; class=\&quot;blob-code blob-code-inner js-file-line\&quot;>    <span class=pl-c1>X</span> <span class=pl-c1>=</span> <span class=pl-s1>torch</span>.<span class=pl-c1>tensor</span>(<span class=pl-c1>X</span>)</td>\n        </tr>\n        <tr>\n          <td id=\&quot;file-dataset-py-L17\&quot; class=\&quot;blob-num js-line-number js-blob-rnum\&quot; data-line-number=\&quot;17\&quot;></td>\n          <td id=\&quot;file-dataset-py-LC17\&quot; class=\&quot;blob-code blob-code-inner js-file-line\&quot;>    <span class=pl-c1>Y</span> <span class=pl-c1>=</span> <span class=pl-s1>torch</span>.<span class=pl-c1>tensor</span>(<span class=pl-c1>Y</span>)</td>\n        </tr>\n        <tr>\n          <td id=\&quot;file-dataset-py-L18\&quot; class=\&quot;blob-num js-line-number js-blob-rnum\&quot; data-line-number=\&quot;18\&quot;></td>\n          <td id=\&quot;file-dataset-py-LC18\&quot; class=\&quot;blob-code blob-code-inner js-file-line\&quot;>\n</td>\n        </tr>\n        <tr>\n          <td id=\&quot;file-dataset-py-L19\&quot; class=\&quot;blob-num js-line-number js-blob-rnum\&quot; data-line-number=\&quot;19\&quot;></td>\n          <td id=\&quot;file-dataset-py-LC19\&quot; class=\&quot;blob-code blob-code-inner js-file-line\&quot;>    <span class=pl-k>return</span> <span class=pl-c1>X</span>, <span class=pl-c1>Y</span></td>\n        </tr>\n  </table>\n</div>\n\n\n    </div>\n\n  </div>\n</div>\n\n      </div>\n      <div class=\&quot;gist-meta\&quot;>\n        <a href=\&quot;https://gist.github.com/cjams/fe9925c33307421942ba3762b7c1e769/raw/854c96812e8051e17a003f9d6af90a808582cd6c/dataset.py\&quot; style=\&quot;float:right\&quot; class=\&quot;Link--inTextBlock\&quot;>view raw</a>\n        <a 
href=\&quot;https://gist.github.com/cjams/fe9925c33307421942ba3762b7c1e769#file-dataset-py\&quot; class=\&quot;Link--inTextBlock\&quot;>\n          dataset.py\n        </a>\n        hosted with &amp;#10084; by <a class=\&quot;Link--inTextBlock\&quot; href=\&quot;https://github.com\&quot;>GitHub</a>\n      </div>\n    </div>\n</div>\n&quot;,&quot;stylesheet&quot;:&quot;https://github.githubassets.com/assets/gist-embed-ed91f9610ae6.css&quot;}" data-component-name="GitgistToDOM"><link rel="stylesheet" href="https://github.githubassets.com/assets/gist-embed-ed91f9610ae6.css"><div id="gist144082364" class="gist">
    <div class="gist-file" data-color-mode="light" data-light-theme="light">
      <div class="gist-data">
        <div class="js-gist-file-update-container js-task-list-container">
  <div id="file-dataset-py" class="file my-2">
    
    <div itemprop="text" class="Box-body p-0 blob-wrapper data type-python  " style="overflow:auto">

        
<div class="js-check-hidden-unicode js-blob-code-container blob-code-content">

  

  <table data-hpc="" class="highlight tab-size js-file-line-container" data-tab-size="4" data-paste-markdown-skip="" data-tagsearch-path="dataset.py">
        <tbody><tr>
          <td id="file-dataset-py-L1" class="blob-num js-line-number js-blob-rnum" data-line-number="1"></td>
          <td id="file-dataset-py-LC1" class="blob-code blob-code-inner js-file-line"><span class="pl-s1">stoi</span> <span class="pl-c1">=</span> {<span class="pl-s1">s</span>:<span class="pl-s1">i</span><span class="pl-c1">+</span><span class="pl-c1">1</span> <span class="pl-k">for</span> <span class="pl-s1">i</span>, <span class="pl-s1">s</span> <span class="pl-c1">in</span> <span class="pl-en">enumerate</span>(<span class="pl-s1">unique_chars</span>)}</td>
        </tr>
        <tr>
          <td id="file-dataset-py-L2" class="blob-num js-line-number js-blob-rnum" data-line-number="2"></td>
          <td id="file-dataset-py-LC2" class="blob-code blob-code-inner js-file-line"><span class="pl-s1">stoi</span>[&#39;<span class="pl-c1">^</span>&#39;] <span class="pl-c1">=</span> <span class="pl-c1">0</span></td>
        </tr>
        <tr>
          <td id="file-dataset-py-L3" class="blob-num js-line-number js-blob-rnum" data-line-number="3"></td>
          <td id="file-dataset-py-LC3" class="blob-code blob-code-inner js-file-line"><span class="pl-s1">itos</span> <span class="pl-c1">=</span> {<span class="pl-s1">i</span>:<span class="pl-s1">s</span> <span class="pl-k">for</span> <span class="pl-s1">s</span>, <span class="pl-s1">i</span> <span class="pl-c1">in</span> <span class="pl-s1">stoi</span>.<span class="pl-c1">items</span>()}</td>
        </tr>
        <tr>
          <td id="file-dataset-py-L4" class="blob-num js-line-number js-blob-rnum" data-line-number="4"></td>
          <td id="file-dataset-py-LC4" class="blob-code blob-code-inner js-file-line">
</td>
        </tr>
        <tr>
          <td id="file-dataset-py-L5" class="blob-num js-line-number js-blob-rnum" data-line-number="5"></td>
          <td id="file-dataset-py-LC5" class="blob-code blob-code-inner js-file-line"><span class="pl-s1">def</span> <span class="pl-en">build_dataset</span>(<span class="pl-s1">data</span>, <span class="pl-s1">ctx_window</span>):</td>
        </tr>
        <tr>
          <td id="file-dataset-py-L6" class="blob-num js-line-number js-blob-rnum" data-line-number="6"></td>
          <td id="file-dataset-py-LC6" class="blob-code blob-code-inner js-file-line">    <span class="pl-c1">X</span>, <span class="pl-c1">Y</span> <span class="pl-c1">=</span> [], []</td>
        </tr>
        <tr>
          <td id="file-dataset-py-L7" class="blob-num js-line-number js-blob-rnum" data-line-number="7"></td>
          <td id="file-dataset-py-LC7" class="blob-code blob-code-inner js-file-line">    </td>
        </tr>
        <tr>
          <td id="file-dataset-py-L8" class="blob-num js-line-number js-blob-rnum" data-line-number="8"></td>
          <td id="file-dataset-py-LC8" class="blob-code blob-code-inner js-file-line">    <span class="pl-k">for</span> <span class="pl-s1">s</span> <span class="pl-c1">in</span> <span class="pl-s1">data</span>:</td>
        </tr>
        <tr>
          <td id="file-dataset-py-L9" class="blob-num js-line-number js-blob-rnum" data-line-number="9"></td>
          <td id="file-dataset-py-LC9" class="blob-code blob-code-inner js-file-line">        <span class="pl-s1">ctx</span> <span class="pl-c1">=</span> [<span class="pl-c1">0</span>] <span class="pl-c1">*</span> <span class="pl-s1">ctx_window</span></td>
        </tr>
        <tr>
          <td id="file-dataset-py-L10" class="blob-num js-line-number js-blob-rnum" data-line-number="10"></td>
          <td id="file-dataset-py-LC10" class="blob-code blob-code-inner js-file-line">        </td>
        </tr>
        <tr>
          <td id="file-dataset-py-L11" class="blob-num js-line-number js-blob-rnum" data-line-number="11"></td>
          <td id="file-dataset-py-LC11" class="blob-code blob-code-inner js-file-line">        <span class="pl-k">for</span> <span class="pl-s1">c</span> <span class="pl-c1">in</span> <span class="pl-s1">s</span>:</td>
        </tr>
        <tr>
          <td id="file-dataset-py-L12" class="blob-num js-line-number js-blob-rnum" data-line-number="12"></td>
          <td id="file-dataset-py-LC12" class="blob-code blob-code-inner js-file-line">            <span class="pl-c1">X</span>.<span class="pl-c1">append</span>(<span class="pl-s1">ctx</span>)</td>
        </tr>
        <tr>
          <td id="file-dataset-py-L13" class="blob-num js-line-number js-blob-rnum" data-line-number="13"></td>
          <td id="file-dataset-py-LC13" class="blob-code blob-code-inner js-file-line">            <span class="pl-c1">Y</span>.<span class="pl-c1">append</span>(<span class="pl-s1">stoi</span>[<span class="pl-s1">c</span>])</td>
        </tr>
        <tr>
          <td id="file-dataset-py-L14" class="blob-num js-line-number js-blob-rnum" data-line-number="14"></td>
          <td id="file-dataset-py-LC14" class="blob-code blob-code-inner js-file-line">            <span class="pl-s1">ctx</span> <span class="pl-c1">=</span> <span class="pl-s1">ctx</span>[<span class="pl-c1">1</span>:] <span class="pl-c1">+</span> [<span class="pl-s1">stoi</span>[<span class="pl-s1">c</span>]]</td>
        </tr>
        <tr>
          <td id="file-dataset-py-L15" class="blob-num js-line-number js-blob-rnum" data-line-number="15"></td>
          <td id="file-dataset-py-LC15" class="blob-code blob-code-inner js-file-line">
</td>
        </tr>
        <tr>
          <td id="file-dataset-py-L16" class="blob-num js-line-number js-blob-rnum" data-line-number="16"></td>
          <td id="file-dataset-py-LC16" class="blob-code blob-code-inner js-file-line">    <span class="pl-c1">X</span> <span class="pl-c1">=</span> <span class="pl-s1">torch</span>.<span class="pl-c1">tensor</span>(<span class="pl-c1">X</span>)</td>
        </tr>
        <tr>
          <td id="file-dataset-py-L17" class="blob-num js-line-number js-blob-rnum" data-line-number="17"></td>
          <td id="file-dataset-py-LC17" class="blob-code blob-code-inner js-file-line">    <span class="pl-c1">Y</span> <span class="pl-c1">=</span> <span class="pl-s1">torch</span>.<span class="pl-c1">tensor</span>(<span class="pl-c1">Y</span>)</td>
        </tr>
        <tr>
          <td id="file-dataset-py-L18" class="blob-num js-line-number js-blob-rnum" data-line-number="18"></td>
          <td id="file-dataset-py-LC18" class="blob-code blob-code-inner js-file-line">
</td>
        </tr>
        <tr>
          <td id="file-dataset-py-L19" class="blob-num js-line-number js-blob-rnum" data-line-number="19"></td>
          <td id="file-dataset-py-LC19" class="blob-code blob-code-inner js-file-line">    <span class="pl-k">return</span> <span class="pl-c1">X</span>, <span class="pl-c1">Y</span></td>
        </tr>
  </tbody></table>
</div>


    </div>

  </div>
</div>

      </div>
      <div class="gist-meta">
        <a href="https://gist.github.com/cjams/fe9925c33307421942ba3762b7c1e769/raw/854c96812e8051e17a003f9d6af90a808582cd6c/dataset.py" style="float:right" class="Link--inTextBlock">view raw</a>
        <a href="https://gist.github.com/cjams/fe9925c33307421942ba3762b7c1e769#file-dataset-py" class="Link--inTextBlock">
          dataset.py
        </a>
        hosted with &#10084; by <a class="Link--inTextBlock" href="https://github.com">GitHub</a>
      </div>
    </div>
</div>
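<p>To see how the sliding context window behaves, here is a minimal sketch of <code>build_dataset</code> run on a tiny made-up vocabulary, with plain Python lists standing in for the torch tensors so it runs without dependencies:</p>

```python
# Tiny assumed vocabulary for illustration; the real one comes from the data
unique_chars = "ab"
stoi = {s: i + 1 for i, s in enumerate(unique_chars)}
stoi['^'] = 0                          # index 0 is the '^' start character
itos = {i: s for s, i in stoi.items()}

def build_dataset(data, ctx_window):
    X, Y = [], []
    for s in data:
        ctx = [0] * ctx_window         # context begins as all start characters
        for c in s:
            X.append(ctx)              # the current window is the input...
            Y.append(stoi[c])          # ...and the next character is the label
            ctx = ctx[1:] + [stoi[c]]  # slide the window right by one
    return X, Y                        # the original wraps these in torch.tensor

X, Y = build_dataset(["ab"], ctx_window=3)
print(X)  # -> [[0, 0, 0], [0, 0, 1]]
print(Y)  # -> [1, 2]
```

<p>Each input row pairs with exactly one target index, so a string of length <em>n</em> contributes <em>n</em> training examples.</p>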
</div><h3>Defining Layers</h3><p>Now that we have our dataset wrangled, we can think about adding the layers needed for our network. Our initial network will use the following forward pass:</p><p>1. Embedding. This maps input tokens to their corresponding embedding vectors.</p><p>2. Flatten. This is a simple reshape that concatenates the embedding vectors column-wise.</p><p>3. Linear. This applies an affine transformation to the input.</p><p>4. Tanh. This is our activation function. Performs a pointwise tanh to each value from the preceding linear layer.</p><p>5. Linear. This applies a final linear transformation to the tanh activations.</p><p>The output from the final Linear layer will then be run through <code>F.cross_entropy</code> for the loss. This implementation is pretty close to what is in the Bengio paper. In later posts we add some bells and whistles to incrementally improve the performance. Here is the code for each layer:</p><div class="github-gist" data-attrs="{&quot;innerHTML&quot;:&quot;<div id=\&quot;gist144082387\&quot; class=\&quot;gist\&quot;>\n    <div class=\&quot;gist-file\&quot; translate=\&quot;no\&quot; data-color-mode=\&quot;light\&quot; data-light-theme=\&quot;light\&quot;>\n      <div class=\&quot;gist-data\&quot;>\n        <div class=\&quot;js-gist-file-update-container js-task-list-container\&quot;>\n  <div id=\&quot;file-layers-py\&quot; class=\&quot;file my-2\&quot;>\n    \n    <div itemprop=\&quot;text\&quot;\n      class=\&quot;Box-body p-0 blob-wrapper data type-python  \&quot;\n      style=\&quot;overflow: auto\&quot; tabindex=\&quot;0\&quot; role=\&quot;region\&quot;\n      aria-label=\&quot;layers.py content, created by cjams on 07:38PM today.\&quot;\n    >\n\n        \n<div class=\&quot;js-check-hidden-unicode js-blob-code-container blob-code-content\&quot;>\n\n  <template class=\&quot;js-file-alert-template\&quot;>\n  <div data-view-component=\&quot;true\&quot; class=\&quot;flash flash-warn flash-full d-flex 
flex-items-center\&quot;>\n  <svg aria-hidden=\&quot;true\&quot; height=\&quot;16\&quot; viewBox=\&quot;0 0 16 16\&quot; version=\&quot;1.1\&quot; width=\&quot;16\&quot; data-view-component=\&quot;true\&quot; class=\&quot;octicon octicon-alert\&quot;>\n    <path d=\&quot;M6.457 1.047c.659-1.234 2.427-1.234 3.086 0l6.082 11.378A1.75 1.75 0 0 1 14.082 15H1.918a1.75 1.75 0 0 1-1.543-2.575Zm1.763.707a.25.25 0 0 0-.44 0L1.698 13.132a.25.25 0 0 0 .22.368h12.164a.25.25 0 0 0 .22-.368Zm.53 3.996v2.5a.75.75 0 0 1-1.5 0v-2.5a.75.75 0 0 1 1.5 0ZM9 11a1 1 0 1 1-2 0 1 1 0 0 1 2 0Z\&quot;></path>\n</svg>\n    <span>\n      This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.\n      <a class=\&quot;Link--inTextBlock\&quot; href=\&quot;https://github.co/hiddenchars\&quot; target=\&quot;_blank\&quot;>Learn more about bidirectional Unicode characters</a>\n    </span>\n\n\n  <div data-view-component=\&quot;true\&quot; class=\&quot;flash-action\&quot;>        <a href=\&quot;{{ revealButtonHref }}\&quot; data-view-component=\&quot;true\&quot; class=\&quot;btn-sm btn\&quot;>    Show hidden characters\n</a>\n</div>\n</div></template>\n<template class=\&quot;js-line-alert-template\&quot;>\n  <span aria-label=\&quot;This line has hidden Unicode characters\&quot; data-view-component=\&quot;true\&quot; class=\&quot;line-alert tooltipped tooltipped-e\&quot;>\n    <svg aria-hidden=\&quot;true\&quot; height=\&quot;16\&quot; viewBox=\&quot;0 0 16 16\&quot; version=\&quot;1.1\&quot; width=\&quot;16\&quot; data-view-component=\&quot;true\&quot; class=\&quot;octicon octicon-alert\&quot;>\n    <path d=\&quot;M6.457 1.047c.659-1.234 2.427-1.234 3.086 0l6.082 11.378A1.75 1.75 0 0 1 14.082 15H1.918a1.75 1.75 0 0 1-1.543-2.575Zm1.763.707a.25.25 0 0 0-.44 0L1.698 13.132a.25.25 0 0 0 .22.368h12.164a.25.25 0 0 0 .22-.368Zm.53 3.996v2.5a.75.75 0 0 1-1.5 
0v-2.5a.75.75 0 0 1 1.5 0ZM9 11a1 1 0 1 1-2 0 1 1 0 0 1 2 0Z\&quot;></path>\n</svg>\n</span></template>\n\n  <table data-hpc class=\&quot;highlight tab-size js-file-line-container\&quot; data-tab-size=\&quot;4\&quot; data-paste-markdown-skip data-tagsearch-path=\&quot;layers.py\&quot;>\n        <tr>\n          <td id=\&quot;file-layers-py-L1\&quot; class=\&quot;blob-num js-line-number js-blob-rnum\&quot; data-line-number=\&quot;1\&quot;></td>\n          <td id=\&quot;file-layers-py-LC1\&quot; class=\&quot;blob-code blob-code-inner js-file-line\&quot;><span class=pl-k>class</span> <span class=pl-v>Embedding</span>():</td>\n        </tr>\n        <tr>\n          <td id=\&quot;file-layers-py-L2\&quot; class=\&quot;blob-num js-line-number js-blob-rnum\&quot; data-line-number=\&quot;2\&quot;></td>\n          <td id=\&quot;file-layers-py-LC2\&quot; class=\&quot;blob-code blob-code-inner js-file-line\&quot;>    <span class=pl-k>def</span> <span class=pl-en>__init__</span>(<span class=pl-s1>self</span>, <span class=pl-s1>device</span>, <span class=pl-s1>num_embeddings</span>, <span class=pl-s1>embedding_dim</span>):</td>\n        </tr>\n        <tr>\n          <td id=\&quot;file-layers-py-L3\&quot; class=\&quot;blob-num js-line-number js-blob-rnum\&quot; data-line-number=\&quot;3\&quot;></td>\n          <td id=\&quot;file-layers-py-LC3\&quot; class=\&quot;blob-code blob-code-inner js-file-line\&quot;>        <span class=pl-s1>self</span>.<span class=pl-c1>out</span> <span class=pl-c1>=</span> <span class=pl-c1>None</span></td>\n        </tr>\n        <tr>\n          <td id=\&quot;file-layers-py-L4\&quot; class=\&quot;blob-num js-line-number js-blob-rnum\&quot; data-line-number=\&quot;4\&quot;></td>\n          <td id=\&quot;file-layers-py-LC4\&quot; class=\&quot;blob-code blob-code-inner js-file-line\&quot;>        <span class=pl-s1>self</span>.<span class=pl-c1>weight</span> <span class=pl-c1>=</span> <span class=pl-s1>torch</span>.<span class=pl-c1>randn</span>(<span 
class=pl-s1>num_embeddings</span>, <span class=pl-s1>embedding_dim</span>, <span class=pl-s1>device</span><span class=pl-c1>=</span><span class=pl-s1>device</span>)</td>\n        </tr>\n        <tr>\n          <td id=\&quot;file-layers-py-L5\&quot; class=\&quot;blob-num js-line-number js-blob-rnum\&quot; data-line-number=\&quot;5\&quot;></td>\n          <td id=\&quot;file-layers-py-LC5\&quot; class=\&quot;blob-code blob-code-inner js-file-line\&quot;>\n</td>\n        </tr>\n        <tr>\n          <td id=\&quot;file-layers-py-L6\&quot; class=\&quot;blob-num js-line-number js-blob-rnum\&quot; data-line-number=\&quot;6\&quot;></td>\n          <td id=\&quot;file-layers-py-LC6\&quot; class=\&quot;blob-code blob-code-inner js-file-line\&quot;>    <span class=pl-k>def</span> <span class=pl-en>__call__</span>(<span class=pl-s1>self</span>, <span class=pl-s1>x</span>):</td>\n        </tr>\n        <tr>\n          <td id=\&quot;file-layers-py-L7\&quot; class=\&quot;blob-num js-line-number js-blob-rnum\&quot; data-line-number=\&quot;7\&quot;></td>\n          <td id=\&quot;file-layers-py-LC7\&quot; class=\&quot;blob-code blob-code-inner js-file-line\&quot;>        <span class=pl-s1>self</span>.<span class=pl-c1>out</span> <span class=pl-c1>=</span> <span class=pl-c1>F</span>.<span class=pl-c1>embedding</span>(<span class=pl-s1>x</span>, <span class=pl-s1>self</span>.<span class=pl-c1>weight</span>)</td>\n        </tr>\n        <tr>\n          <td id=\&quot;file-layers-py-L8\&quot; class=\&quot;blob-num js-line-number js-blob-rnum\&quot; data-line-number=\&quot;8\&quot;></td>\n          <td id=\&quot;file-layers-py-LC8\&quot; class=\&quot;blob-code blob-code-inner js-file-line\&quot;>        <span class=pl-k>return</span> <span class=pl-s1>self</span>.<span class=pl-c1>out</span></td>\n        </tr>\n        <tr>\n          <td id=\&quot;file-layers-py-L9\&quot; class=\&quot;blob-num js-line-number js-blob-rnum\&quot; data-line-number=\&quot;9\&quot;></td>\n          <td 
id=\&quot;file-layers-py-LC9\&quot; class=\&quot;blob-code blob-code-inner js-file-line\&quot;>\n</td>\n        </tr>\n        <tr>\n          <td id=\&quot;file-layers-py-L10\&quot; class=\&quot;blob-num js-line-number js-blob-rnum\&quot; data-line-number=\&quot;10\&quot;></td>\n          <td id=\&quot;file-layers-py-LC10\&quot; class=\&quot;blob-code blob-code-inner js-file-line\&quot;>    <span class=pl-k>def</span> <span class=pl-en>params</span>(<span class=pl-s1>self</span>):</td>\n        </tr>\n        <tr>\n          <td id=\&quot;file-layers-py-L11\&quot; class=\&quot;blob-num js-line-number js-blob-rnum\&quot; data-line-number=\&quot;11\&quot;></td>\n          <td id=\&quot;file-layers-py-LC11\&quot; class=\&quot;blob-code blob-code-inner js-file-line\&quot;>        <span class=pl-k>return</span> [<span class=pl-s1>self</span>.<span class=pl-c1>weight</span>]</td>\n        </tr>\n        <tr>\n          <td id=\&quot;file-layers-py-L12\&quot; class=\&quot;blob-num js-line-number js-blob-rnum\&quot; data-line-number=\&quot;12\&quot;></td>\n          <td id=\&quot;file-layers-py-LC12\&quot; class=\&quot;blob-code blob-code-inner js-file-line\&quot;>\n</td>\n        </tr>\n        <tr>\n          <td id=\&quot;file-layers-py-L13\&quot; class=\&quot;blob-num js-line-number js-blob-rnum\&quot; data-line-number=\&quot;13\&quot;></td>\n          <td id=\&quot;file-layers-py-LC13\&quot; class=\&quot;blob-code blob-code-inner js-file-line\&quot;><span class=pl-k>class</span> <span class=pl-v>Flatten</span>():</td>\n        </tr>\n        <tr>\n          <td id=\&quot;file-layers-py-L14\&quot; class=\&quot;blob-num js-line-number js-blob-rnum\&quot; data-line-number=\&quot;14\&quot;></td>\n          <td id=\&quot;file-layers-py-LC14\&quot; class=\&quot;blob-code blob-code-inner js-file-line\&quot;>    <span class=pl-k>def</span> <span class=pl-en>__init__</span>(<span class=pl-s1>self</span>, <span class=pl-s1>input_dim1</span>, <span 
class=pl-s1>input_dim2</span>):</td>\n        </tr>\n        <tr>\n          <td id=\&quot;file-layers-py-L15\&quot; class=\&quot;blob-num js-line-number js-blob-rnum\&quot; data-line-number=\&quot;15\&quot;></td>\n          <td id=\&quot;file-layers-py-LC15\&quot; class=\&quot;blob-code blob-code-inner js-file-line\&quot;>        <span class=pl-s1>self</span>.<span class=pl-c1>out</span> <span class=pl-c1>=</span> <span class=pl-c1>None</span></td>\n        </tr>\n        <tr>\n          <td id=\&quot;file-layers-py-L16\&quot; class=\&quot;blob-num js-line-number js-blob-rnum\&quot; data-line-number=\&quot;16\&quot;></td>\n          <td id=\&quot;file-layers-py-LC16\&quot; class=\&quot;blob-code blob-code-inner js-file-line\&quot;>        <span class=pl-s1>self</span>.<span class=pl-c1>input_dim1</span> <span class=pl-c1>=</span> <span class=pl-s1>input_dim1</span></td>\n        </tr>\n        <tr>\n          <td id=\&quot;file-layers-py-L17\&quot; class=\&quot;blob-num js-line-number js-blob-rnum\&quot; data-line-number=\&quot;17\&quot;></td>\n          <td id=\&quot;file-layers-py-LC17\&quot; class=\&quot;blob-code blob-code-inner js-file-line\&quot;>        <span class=pl-s1>self</span>.<span class=pl-c1>input_dim2</span> <span class=pl-c1>=</span> <span class=pl-s1>input_dim2</span></td>\n        </tr>\n        <tr>\n          <td id=\&quot;file-layers-py-L18\&quot; class=\&quot;blob-num js-line-number js-blob-rnum\&quot; data-line-number=\&quot;18\&quot;></td>\n          <td id=\&quot;file-layers-py-LC18\&quot; class=\&quot;blob-code blob-code-inner js-file-line\&quot;>\n</td>\n        </tr>\n        <tr>\n          <td id=\&quot;file-layers-py-L19\&quot; class=\&quot;blob-num js-line-number js-blob-rnum\&quot; data-line-number=\&quot;19\&quot;></td>\n          <td id=\&quot;file-layers-py-LC19\&quot; class=\&quot;blob-code blob-code-inner js-file-line\&quot;>    <span class=pl-k>def</span> <span class=pl-en>__call__</span>(<span class=pl-s1>self</span>, 
<span class=pl-s1>x</span>):</td>\n        </tr>\n        <tr>\n          <td id=\&quot;file-layers-py-L20\&quot; class=\&quot;blob-num js-line-number js-blob-rnum\&quot; data-line-number=\&quot;20\&quot;></td>\n          <td id=\&quot;file-layers-py-LC20\&quot; class=\&quot;blob-code blob-code-inner js-file-line\&quot;>        <span class=pl-k>assert</span> <span class=pl-s1>x</span>.<span class=pl-c1>ndim</span> <span class=pl-c1>==</span> <span class=pl-c1>3</span></td>\n        </tr>\n        <tr>\n          <td id=\&quot;file-layers-py-L21\&quot; class=\&quot;blob-num js-line-number js-blob-rnum\&quot; data-line-number=\&quot;21\&quot;></td>\n          <td id=\&quot;file-layers-py-LC21\&quot; class=\&quot;blob-code blob-code-inner js-file-line\&quot;>        <span class=pl-s1>self</span>.<span class=pl-c1>out</span> <span class=pl-c1>=</span> <span class=pl-s1>x</span>.<span class=pl-c1>view</span>(<span class=pl-c1>-</span><span class=pl-c1>1</span>, <span class=pl-s1>self</span>.<span class=pl-c1>input_dim1</span> <span class=pl-c1>*</span> <span class=pl-s1>self</span>.<span class=pl-c1>input_dim2</span>)</td>\n        </tr>\n        <tr>\n          <td id=\&quot;file-layers-py-L22\&quot; class=\&quot;blob-num js-line-number js-blob-rnum\&quot; data-line-number=\&quot;22\&quot;></td>\n          <td id=\&quot;file-layers-py-LC22\&quot; class=\&quot;blob-code blob-code-inner js-file-line\&quot;>        <span class=pl-k>return</span> <span class=pl-s1>self</span>.<span class=pl-c1>out</span></td>\n        </tr>\n        <tr>\n          <td id=\&quot;file-layers-py-L23\&quot; class=\&quot;blob-num js-line-number js-blob-rnum\&quot; data-line-number=\&quot;23\&quot;></td>\n          <td id=\&quot;file-layers-py-LC23\&quot; class=\&quot;blob-code blob-code-inner js-file-line\&quot;>\n</td>\n        </tr>\n        <tr>\n          <td id=\&quot;file-layers-py-L24\&quot; class=\&quot;blob-num js-line-number js-blob-rnum\&quot; 
data-line-number=\&quot;24\&quot;></td>\n          <td id=\&quot;file-layers-py-LC24\&quot; class=\&quot;blob-code blob-code-inner js-file-line\&quot;>    <span class=pl-k>def</span> <span class=pl-en>params</span>(<span class=pl-s1>self</span>):</td>\n        </tr>\n        <tr>\n          <td id=\&quot;file-layers-py-L25\&quot; class=\&quot;blob-num js-line-number js-blob-rnum\&quot; data-line-number=\&quot;25\&quot;></td>\n          <td id=\&quot;file-layers-py-LC25\&quot; class=\&quot;blob-code blob-code-inner js-file-line\&quot;>        <span class=pl-k>return</span> []</td>\n        </tr>\n        <tr>\n          <td id=\&quot;file-layers-py-L26\&quot; class=\&quot;blob-num js-line-number js-blob-rnum\&quot; data-line-number=\&quot;26\&quot;></td>\n          <td id=\&quot;file-layers-py-LC26\&quot; class=\&quot;blob-code blob-code-inner js-file-line\&quot;>\n</td>\n        </tr>\n        <tr>\n          <td id=\&quot;file-layers-py-L27\&quot; class=\&quot;blob-num js-line-number js-blob-rnum\&quot; data-line-number=\&quot;27\&quot;></td>\n          <td id=\&quot;file-layers-py-LC27\&quot; class=\&quot;blob-code blob-code-inner js-file-line\&quot;><span class=pl-k>class</span> <span class=pl-v>Linear</span>():</td>\n        </tr>\n        <tr>\n          <td id=\&quot;file-layers-py-L28\&quot; class=\&quot;blob-num js-line-number js-blob-rnum\&quot; data-line-number=\&quot;28\&quot;></td>\n          <td id=\&quot;file-layers-py-LC28\&quot; class=\&quot;blob-code blob-code-inner js-file-line\&quot;>    <span class=pl-k>def</span> <span class=pl-en>__init__</span>(<span class=pl-s1>self</span>, <span class=pl-s1>device</span>, <span class=pl-s1>in_features</span>, <span class=pl-s1>out_features</span>, <span class=pl-s1>bias</span><span class=pl-c1>=</span><span class=pl-c1>True</span>):</td>\n        </tr>\n        <tr>\n          <td id=\&quot;file-layers-py-L29\&quot; class=\&quot;blob-num js-line-number js-blob-rnum\&quot; 
data-line-number=\&quot;29\&quot;></td>\n          <td id=\&quot;file-layers-py-LC29\&quot; class=\&quot;blob-code blob-code-inner js-file-line\&quot;>        <span class=pl-s1>self</span>.<span class=pl-c1>out</span> <span class=pl-c1>=</span> <span class=pl-c1>None</span></td>\n        </tr>\n        <tr>\n          <td id=\&quot;file-layers-py-L30\&quot; class=\&quot;blob-num js-line-number js-blob-rnum\&quot; data-line-number=\&quot;30\&quot;></td>\n          <td id=\&quot;file-layers-py-LC30\&quot; class=\&quot;blob-code blob-code-inner js-file-line\&quot;>        <span class=pl-s1>self</span>.<span class=pl-c1>weight</span> <span class=pl-c1>=</span> <span class=pl-s1>torch</span>.<span class=pl-c1>randn</span>(<span class=pl-s1>in_features</span>, <span class=pl-s1>out_features</span>, <span class=pl-s1>device</span><span class=pl-c1>=</span><span class=pl-s1>device</span>)</td>\n        </tr>\n        <tr>\n          <td id=\&quot;file-layers-py-L31\&quot; class=\&quot;blob-num js-line-number js-blob-rnum\&quot; data-line-number=\&quot;31\&quot;></td>\n          <td id=\&quot;file-layers-py-LC31\&quot; class=\&quot;blob-code blob-code-inner js-file-line\&quot;>        <span class=pl-s1>self</span>.<span class=pl-c1>bias</span> <span class=pl-c1>=</span> <span class=pl-s1>torch</span>.<span class=pl-c1>zeros</span>(<span class=pl-s1>out_features</span>, <span class=pl-s1>device</span><span class=pl-c1>=</span><span class=pl-s1>device</span>) <span class=pl-k>if</span> <span class=pl-s1>bias</span> <span class=pl-k>else</span> <span class=pl-c1>None</span></td>\n        </tr>\n        <tr>\n          <td id=\&quot;file-layers-py-L32\&quot; class=\&quot;blob-num js-line-number js-blob-rnum\&quot; data-line-number=\&quot;32\&quot;></td>\n          <td id=\&quot;file-layers-py-LC32\&quot; class=\&quot;blob-code blob-code-inner js-file-line\&quot;>\n</td>\n        </tr>\n        <tr>\n          <td id=\&quot;file-layers-py-L33\&quot; class=\&quot;blob-num 
js-line-number js-blob-rnum\&quot; data-line-number=\&quot;33\&quot;></td>\n          <td id=\&quot;file-layers-py-LC33\&quot; class=\&quot;blob-code blob-code-inner js-file-line\&quot;>    <span class=pl-k>def</span> <span class=pl-en>__call__</span>(<span class=pl-s1>self</span>, <span class=pl-s1>x</span>):</td>\n        </tr>\n        <tr>\n          <td id=\&quot;file-layers-py-L34\&quot; class=\&quot;blob-num js-line-number js-blob-rnum\&quot; data-line-number=\&quot;34\&quot;></td>\n          <td id=\&quot;file-layers-py-LC34\&quot; class=\&quot;blob-code blob-code-inner js-file-line\&quot;>        <span class=pl-s1>self</span>.<span class=pl-c1>out</span> <span class=pl-c1>=</span> <span class=pl-s1>x</span> @ <span class=pl-s1>self</span>.<span class=pl-c1>weight</span></td>\n        </tr>\n        <tr>\n          <td id=\&quot;file-layers-py-L35\&quot; class=\&quot;blob-num js-line-number js-blob-rnum\&quot; data-line-number=\&quot;35\&quot;></td>\n          <td id=\&quot;file-layers-py-LC35\&quot; class=\&quot;blob-code blob-code-inner js-file-line\&quot;>        <span class=pl-k>if</span> <span class=pl-s1>self</span>.<span class=pl-c1>bias</span> <span class=pl-c1><span class=pl-c1>is</span> <span class=pl-c1>not</span></span> <span class=pl-c1>None</span>:</td>\n        </tr>\n        <tr>\n          <td id=\&quot;file-layers-py-L36\&quot; class=\&quot;blob-num js-line-number js-blob-rnum\&quot; data-line-number=\&quot;36\&quot;></td>\n          <td id=\&quot;file-layers-py-LC36\&quot; class=\&quot;blob-code blob-code-inner js-file-line\&quot;>            <span class=pl-s1>self</span>.<span class=pl-c1>out</span> <span class=pl-c1>+=</span> <span class=pl-s1>self</span>.<span class=pl-c1>bias</span></td>\n        </tr>\n        <tr>\n          <td id=\&quot;file-layers-py-L37\&quot; class=\&quot;blob-num js-line-number js-blob-rnum\&quot; data-line-number=\&quot;37\&quot;></td>\n          <td id=\&quot;file-layers-py-LC37\&quot; class=\&quot;blob-code 
blob-code-inner js-file-line\&quot;>\n</td>\n        </tr>\n        <tr>\n          <td id=\&quot;file-layers-py-L38\&quot; class=\&quot;blob-num js-line-number js-blob-rnum\&quot; data-line-number=\&quot;38\&quot;></td>\n          <td id=\&quot;file-layers-py-LC38\&quot; class=\&quot;blob-code blob-code-inner js-file-line\&quot;>        <span class=pl-k>return</span> <span class=pl-s1>self</span>.<span class=pl-c1>out</span></td>\n        </tr>\n        <tr>\n          <td id=\&quot;file-layers-py-L39\&quot; class=\&quot;blob-num js-line-number js-blob-rnum\&quot; data-line-number=\&quot;39\&quot;></td>\n          <td id=\&quot;file-layers-py-LC39\&quot; class=\&quot;blob-code blob-code-inner js-file-line\&quot;>\n</td>\n        </tr>\n        <tr>\n          <td id=\&quot;file-layers-py-L40\&quot; class=\&quot;blob-num js-line-number js-blob-rnum\&quot; data-line-number=\&quot;40\&quot;></td>\n          <td id=\&quot;file-layers-py-LC40\&quot; class=\&quot;blob-code blob-code-inner js-file-line\&quot;>    <span class=pl-k>def</span> <span class=pl-en>params</span>(<span class=pl-s1>self</span>):</td>\n        </tr>\n        <tr>\n          <td id=\&quot;file-layers-py-L41\&quot; class=\&quot;blob-num js-line-number js-blob-rnum\&quot; data-line-number=\&quot;41\&quot;></td>\n          <td id=\&quot;file-layers-py-LC41\&quot; class=\&quot;blob-code blob-code-inner js-file-line\&quot;>        <span class=pl-k>return</span> [<span class=pl-s1>self</span>.<span class=pl-c1>weight</span>] <span class=pl-c1>+</span> ([] <span class=pl-k>if</span> <span class=pl-s1>self</span>.<span class=pl-c1>bias</span> <span class=pl-c1>is</span> <span class=pl-c1>None</span> <span class=pl-k>else</span> [<span class=pl-s1>self</span>.<span class=pl-c1>bias</span>])</td>\n        </tr>\n        <tr>\n          <td id=\&quot;file-layers-py-L42\&quot; class=\&quot;blob-num js-line-number js-blob-rnum\&quot; data-line-number=\&quot;42\&quot;></td>\n          <td 
id=\&quot;file-layers-py-LC42\&quot; class=\&quot;blob-code blob-code-inner js-file-line\&quot;>\n</td>\n        </tr>\n        <tr>\n          <td id=\&quot;file-layers-py-L43\&quot; class=\&quot;blob-num js-line-number js-blob-rnum\&quot; data-line-number=\&quot;43\&quot;></td>\n          <td id=\&quot;file-layers-py-LC43\&quot; class=\&quot;blob-code blob-code-inner js-file-line\&quot;><span class=pl-k>class</span> <span class=pl-v>Tanh</span>():</td>\n        </tr>\n        <tr>\n          <td id=\&quot;file-layers-py-L44\&quot; class=\&quot;blob-num js-line-number js-blob-rnum\&quot; data-line-number=\&quot;44\&quot;></td>\n          <td id=\&quot;file-layers-py-LC44\&quot; class=\&quot;blob-code blob-code-inner js-file-line\&quot;>    <span class=pl-k>def</span> <span class=pl-en>__init__</span>(<span class=pl-s1>self</span>):</td>\n        </tr>\n        <tr>\n          <td id=\&quot;file-layers-py-L45\&quot; class=\&quot;blob-num js-line-number js-blob-rnum\&quot; data-line-number=\&quot;45\&quot;></td>\n          <td id=\&quot;file-layers-py-LC45\&quot; class=\&quot;blob-code blob-code-inner js-file-line\&quot;>        <span class=pl-s1>self</span>.<span class=pl-c1>out</span> <span class=pl-c1>=</span> <span class=pl-c1>None</span></td>\n        </tr>\n        <tr>\n          <td id=\&quot;file-layers-py-L46\&quot; class=\&quot;blob-num js-line-number js-blob-rnum\&quot; data-line-number=\&quot;46\&quot;></td>\n          <td id=\&quot;file-layers-py-LC46\&quot; class=\&quot;blob-code blob-code-inner js-file-line\&quot;>\n</td>\n        </tr>\n        <tr>\n          <td id=\&quot;file-layers-py-L47\&quot; class=\&quot;blob-num js-line-number js-blob-rnum\&quot; data-line-number=\&quot;47\&quot;></td>\n          <td id=\&quot;file-layers-py-LC47\&quot; class=\&quot;blob-code blob-code-inner js-file-line\&quot;>    <span class=pl-k>def</span> <span class=pl-en>__call__</span>(<span class=pl-s1>self</span>, <span class=pl-s1>x</span>):</td>\n        </tr>\n  
      <tr>\n          <td id=\&quot;file-layers-py-L48\&quot; class=\&quot;blob-num js-line-number js-blob-rnum\&quot; data-line-number=\&quot;48\&quot;></td>\n          <td id=\&quot;file-layers-py-LC48\&quot; class=\&quot;blob-code blob-code-inner js-file-line\&quot;>        <span class=pl-s1>self</span>.<span class=pl-c1>out</span> <span class=pl-c1>=</span> <span class=pl-s1>torch</span>.<span class=pl-c1>tanh</span>(<span class=pl-s1>x</span>)</td>\n        </tr>\n        <tr>\n          <td id=\&quot;file-layers-py-L49\&quot; class=\&quot;blob-num js-line-number js-blob-rnum\&quot; data-line-number=\&quot;49\&quot;></td>\n          <td id=\&quot;file-layers-py-LC49\&quot; class=\&quot;blob-code blob-code-inner js-file-line\&quot;>        <span class=pl-k>return</span> <span class=pl-s1>self</span>.<span class=pl-c1>out</span></td>\n        </tr>\n        <tr>\n          <td id=\&quot;file-layers-py-L50\&quot; class=\&quot;blob-num js-line-number js-blob-rnum\&quot; data-line-number=\&quot;50\&quot;></td>\n          <td id=\&quot;file-layers-py-LC50\&quot; class=\&quot;blob-code blob-code-inner js-file-line\&quot;>\n</td>\n        </tr>\n        <tr>\n          <td id=\&quot;file-layers-py-L51\&quot; class=\&quot;blob-num js-line-number js-blob-rnum\&quot; data-line-number=\&quot;51\&quot;></td>\n          <td id=\&quot;file-layers-py-LC51\&quot; class=\&quot;blob-code blob-code-inner js-file-line\&quot;>    <span class=pl-k>def</span> <span class=pl-en>params</span>(<span class=pl-s1>self</span>):</td>\n        </tr>\n        <tr>\n          <td id=\&quot;file-layers-py-L52\&quot; class=\&quot;blob-num js-line-number js-blob-rnum\&quot; data-line-number=\&quot;52\&quot;></td>\n          <td id=\&quot;file-layers-py-LC52\&quot; class=\&quot;blob-code blob-code-inner js-file-line\&quot;>        <span class=pl-k>return</span> []</td>\n        </tr>\n  </table>\n</div>\n\n\n    </div>\n\n  </div>\n</div>\n\n      </div>\n      <div 
class=\&quot;gist-meta\&quot;>\n        <a href=\&quot;https://gist.github.com/cjams/e25cac28c0703c250739fab0bd5414b7/raw/fbbbe0784ae09785c45ae90745984161470c5943/layers.py\&quot; style=\&quot;float:right\&quot; class=\&quot;Link--inTextBlock\&quot;>view raw</a>\n        <a href=\&quot;https://gist.github.com/cjams/e25cac28c0703c250739fab0bd5414b7#file-layers-py\&quot; class=\&quot;Link--inTextBlock\&quot;>\n          layers.py\n        </a>\n        hosted with &amp;#10084; by <a class=\&quot;Link--inTextBlock\&quot; href=\&quot;https://github.com\&quot;>GitHub</a>\n      </div>\n    </div>\n</div>\n&quot;,&quot;stylesheet&quot;:&quot;https://github.githubassets.com/assets/gist-embed-ed91f9610ae6.css&quot;}" data-component-name="GitgistToDOM"><link rel="stylesheet" href="https://github.githubassets.com/assets/gist-embed-ed91f9610ae6.css"><div id="gist144082387" class="gist">
    <div class="gist-file" data-color-mode="light" data-light-theme="light">
      <div class="gist-data">
        <div class="js-gist-file-update-container js-task-list-container">
  <div id="file-layers-py" class="file my-2">
    
    <div itemprop="text" class="Box-body p-0 blob-wrapper data type-python  " style="overflow:auto">

        
<div class="js-check-hidden-unicode js-blob-code-container blob-code-content">

  


  <table data-hpc="" class="highlight tab-size js-file-line-container" data-tab-size="4" data-paste-markdown-skip="" data-tagsearch-path="layers.py">
        <tbody><tr>
          <td id="file-layers-py-L1" class="blob-num js-line-number js-blob-rnum" data-line-number="1"></td>
          <td id="file-layers-py-LC1" class="blob-code blob-code-inner js-file-line"><span class="pl-k">class</span> <span class="pl-v">Embedding</span>():</td>
        </tr>
        <tr>
          <td id="file-layers-py-L2" class="blob-num js-line-number js-blob-rnum" data-line-number="2"></td>
          <td id="file-layers-py-LC2" class="blob-code blob-code-inner js-file-line">    <span class="pl-k">def</span> <span class="pl-en">__init__</span>(<span class="pl-s1">self</span>, <span class="pl-s1">device</span>, <span class="pl-s1">num_embeddings</span>, <span class="pl-s1">embedding_dim</span>):</td>
        </tr>
        <tr>
          <td id="file-layers-py-L3" class="blob-num js-line-number js-blob-rnum" data-line-number="3"></td>
          <td id="file-layers-py-LC3" class="blob-code blob-code-inner js-file-line">        <span class="pl-s1">self</span>.<span class="pl-c1">out</span> <span class="pl-c1">=</span> <span class="pl-c1">None</span></td>
        </tr>
        <tr>
          <td id="file-layers-py-L4" class="blob-num js-line-number js-blob-rnum" data-line-number="4"></td>
          <td id="file-layers-py-LC4" class="blob-code blob-code-inner js-file-line">        <span class="pl-s1">self</span>.<span class="pl-c1">weight</span> <span class="pl-c1">=</span> <span class="pl-s1">torch</span>.<span class="pl-c1">randn</span>(<span class="pl-s1">num_embeddings</span>, <span class="pl-s1">embedding_dim</span>, <span class="pl-s1">device</span><span class="pl-c1">=</span><span class="pl-s1">device</span>)</td>
        </tr>
        <tr>
          <td id="file-layers-py-L5" class="blob-num js-line-number js-blob-rnum" data-line-number="5"></td>
          <td id="file-layers-py-LC5" class="blob-code blob-code-inner js-file-line">
</td>
        </tr>
        <tr>
          <td id="file-layers-py-L6" class="blob-num js-line-number js-blob-rnum" data-line-number="6"></td>
          <td id="file-layers-py-LC6" class="blob-code blob-code-inner js-file-line">    <span class="pl-k">def</span> <span class="pl-en">__call__</span>(<span class="pl-s1">self</span>, <span class="pl-s1">x</span>):</td>
        </tr>
        <tr>
          <td id="file-layers-py-L7" class="blob-num js-line-number js-blob-rnum" data-line-number="7"></td>
          <td id="file-layers-py-LC7" class="blob-code blob-code-inner js-file-line">        <span class="pl-s1">self</span>.<span class="pl-c1">out</span> <span class="pl-c1">=</span> <span class="pl-c1">F</span>.<span class="pl-c1">embedding</span>(<span class="pl-s1">x</span>, <span class="pl-s1">self</span>.<span class="pl-c1">weight</span>)</td>
        </tr>
        <tr>
          <td id="file-layers-py-L8" class="blob-num js-line-number js-blob-rnum" data-line-number="8"></td>
          <td id="file-layers-py-LC8" class="blob-code blob-code-inner js-file-line">        <span class="pl-k">return</span> <span class="pl-s1">self</span>.<span class="pl-c1">out</span></td>
        </tr>
        <tr>
          <td id="file-layers-py-L9" class="blob-num js-line-number js-blob-rnum" data-line-number="9"></td>
          <td id="file-layers-py-LC9" class="blob-code blob-code-inner js-file-line">
</td>
        </tr>
        <tr>
          <td id="file-layers-py-L10" class="blob-num js-line-number js-blob-rnum" data-line-number="10"></td>
          <td id="file-layers-py-LC10" class="blob-code blob-code-inner js-file-line">    <span class="pl-k">def</span> <span class="pl-en">params</span>(<span class="pl-s1">self</span>):</td>
        </tr>
        <tr>
          <td id="file-layers-py-L11" class="blob-num js-line-number js-blob-rnum" data-line-number="11"></td>
          <td id="file-layers-py-LC11" class="blob-code blob-code-inner js-file-line">        <span class="pl-k">return</span> [<span class="pl-s1">self</span>.<span class="pl-c1">weight</span>]</td>
        </tr>
        <tr>
          <td id="file-layers-py-L12" class="blob-num js-line-number js-blob-rnum" data-line-number="12"></td>
          <td id="file-layers-py-LC12" class="blob-code blob-code-inner js-file-line">
</td>
        </tr>
        <tr>
          <td id="file-layers-py-L13" class="blob-num js-line-number js-blob-rnum" data-line-number="13"></td>
          <td id="file-layers-py-LC13" class="blob-code blob-code-inner js-file-line"><span class="pl-k">class</span> <span class="pl-v">Flatten</span>():</td>
        </tr>
        <tr>
          <td id="file-layers-py-L14" class="blob-num js-line-number js-blob-rnum" data-line-number="14"></td>
          <td id="file-layers-py-LC14" class="blob-code blob-code-inner js-file-line">    <span class="pl-k">def</span> <span class="pl-en">__init__</span>(<span class="pl-s1">self</span>, <span class="pl-s1">input_dim1</span>, <span class="pl-s1">input_dim2</span>):</td>
        </tr>
        <tr>
          <td id="file-layers-py-L15" class="blob-num js-line-number js-blob-rnum" data-line-number="15"></td>
          <td id="file-layers-py-LC15" class="blob-code blob-code-inner js-file-line">        <span class="pl-s1">self</span>.<span class="pl-c1">out</span> <span class="pl-c1">=</span> <span class="pl-c1">None</span></td>
        </tr>
        <tr>
          <td id="file-layers-py-L16" class="blob-num js-line-number js-blob-rnum" data-line-number="16"></td>
          <td id="file-layers-py-LC16" class="blob-code blob-code-inner js-file-line">        <span class="pl-s1">self</span>.<span class="pl-c1">input_dim1</span> <span class="pl-c1">=</span> <span class="pl-s1">input_dim1</span></td>
        </tr>
        <tr>
          <td id="file-layers-py-L17" class="blob-num js-line-number js-blob-rnum" data-line-number="17"></td>
          <td id="file-layers-py-LC17" class="blob-code blob-code-inner js-file-line">        <span class="pl-s1">self</span>.<span class="pl-c1">input_dim2</span> <span class="pl-c1">=</span> <span class="pl-s1">input_dim2</span></td>
        </tr>
        <tr>
          <td id="file-layers-py-L18" class="blob-num js-line-number js-blob-rnum" data-line-number="18"></td>
          <td id="file-layers-py-LC18" class="blob-code blob-code-inner js-file-line">
</td>
        </tr>
        <tr>
          <td id="file-layers-py-L19" class="blob-num js-line-number js-blob-rnum" data-line-number="19"></td>
          <td id="file-layers-py-LC19" class="blob-code blob-code-inner js-file-line">    <span class="pl-k">def</span> <span class="pl-en">__call__</span>(<span class="pl-s1">self</span>, <span class="pl-s1">x</span>):</td>
        </tr>
        <tr>
          <td id="file-layers-py-L20" class="blob-num js-line-number js-blob-rnum" data-line-number="20"></td>
          <td id="file-layers-py-LC20" class="blob-code blob-code-inner js-file-line">        <span class="pl-k">assert</span> <span class="pl-s1">x</span>.<span class="pl-c1">ndim</span> <span class="pl-c1">==</span> <span class="pl-c1">3</span></td>
        </tr>
        <tr>
          <td id="file-layers-py-L21" class="blob-num js-line-number js-blob-rnum" data-line-number="21"></td>
          <td id="file-layers-py-LC21" class="blob-code blob-code-inner js-file-line">        <span class="pl-s1">self</span>.<span class="pl-c1">out</span> <span class="pl-c1">=</span> <span class="pl-s1">x</span>.<span class="pl-c1">view</span>(<span class="pl-c1">-</span><span class="pl-c1">1</span>, <span class="pl-s1">self</span>.<span class="pl-c1">input_dim1</span> <span class="pl-c1">*</span> <span class="pl-s1">self</span>.<span class="pl-c1">input_dim2</span>)</td>
        </tr>
        <tr>
          <td id="file-layers-py-L22" class="blob-num js-line-number js-blob-rnum" data-line-number="22"></td>
          <td id="file-layers-py-LC22" class="blob-code blob-code-inner js-file-line">        <span class="pl-k">return</span> <span class="pl-s1">self</span>.<span class="pl-c1">out</span></td>
        </tr>
        <tr>
          <td id="file-layers-py-L23" class="blob-num js-line-number js-blob-rnum" data-line-number="23"></td>
          <td id="file-layers-py-LC23" class="blob-code blob-code-inner js-file-line">
</td>
        </tr>
        <tr>
          <td id="file-layers-py-L24" class="blob-num js-line-number js-blob-rnum" data-line-number="24"></td>
          <td id="file-layers-py-LC24" class="blob-code blob-code-inner js-file-line">    <span class="pl-k">def</span> <span class="pl-en">params</span>(<span class="pl-s1">self</span>):</td>
        </tr>
        <tr>
          <td id="file-layers-py-L25" class="blob-num js-line-number js-blob-rnum" data-line-number="25"></td>
          <td id="file-layers-py-LC25" class="blob-code blob-code-inner js-file-line">        <span class="pl-k">return</span> []</td>
        </tr>
        <tr>
          <td id="file-layers-py-L26" class="blob-num js-line-number js-blob-rnum" data-line-number="26"></td>
          <td id="file-layers-py-LC26" class="blob-code blob-code-inner js-file-line">
</td>
        </tr>
        <tr>
          <td id="file-layers-py-L27" class="blob-num js-line-number js-blob-rnum" data-line-number="27"></td>
          <td id="file-layers-py-LC27" class="blob-code blob-code-inner js-file-line"><span class="pl-k">class</span> <span class="pl-v">Linear</span>():</td>
        </tr>
        <tr>
          <td id="file-layers-py-L28" class="blob-num js-line-number js-blob-rnum" data-line-number="28"></td>
          <td id="file-layers-py-LC28" class="blob-code blob-code-inner js-file-line">    <span class="pl-k">def</span> <span class="pl-en">__init__</span>(<span class="pl-s1">self</span>, <span class="pl-s1">device</span>, <span class="pl-s1">in_features</span>, <span class="pl-s1">out_features</span>, <span class="pl-s1">bias</span><span class="pl-c1">=</span><span class="pl-c1">True</span>):</td>
        </tr>
        <tr>
          <td id="file-layers-py-L29" class="blob-num js-line-number js-blob-rnum" data-line-number="29"></td>
          <td id="file-layers-py-LC29" class="blob-code blob-code-inner js-file-line">        <span class="pl-s1">self</span>.<span class="pl-c1">out</span> <span class="pl-c1">=</span> <span class="pl-c1">None</span></td>
        </tr>
        <tr>
          <td id="file-layers-py-L30" class="blob-num js-line-number js-blob-rnum" data-line-number="30"></td>
          <td id="file-layers-py-LC30" class="blob-code blob-code-inner js-file-line">        <span class="pl-s1">self</span>.<span class="pl-c1">weight</span> <span class="pl-c1">=</span> <span class="pl-s1">torch</span>.<span class="pl-c1">randn</span>(<span class="pl-s1">in_features</span>, <span class="pl-s1">out_features</span>, <span class="pl-s1">device</span><span class="pl-c1">=</span><span class="pl-s1">device</span>)</td>
        </tr>
        <tr>
          <td id="file-layers-py-L31" class="blob-num js-line-number js-blob-rnum" data-line-number="31"></td>
          <td id="file-layers-py-LC31" class="blob-code blob-code-inner js-file-line">        <span class="pl-s1">self</span>.<span class="pl-c1">bias</span> <span class="pl-c1">=</span> <span class="pl-s1">torch</span>.<span class="pl-c1">zeros</span>(<span class="pl-s1">out_features</span>, <span class="pl-s1">device</span><span class="pl-c1">=</span><span class="pl-s1">device</span>) <span class="pl-k">if</span> <span class="pl-s1">bias</span> <span class="pl-k">else</span> <span class="pl-c1">None</span></td>
        </tr>
        <tr>
          <td id="file-layers-py-L32" class="blob-num js-line-number js-blob-rnum" data-line-number="32"></td>
          <td id="file-layers-py-LC32" class="blob-code blob-code-inner js-file-line">
</td>
        </tr>
        <tr>
          <td id="file-layers-py-L33" class="blob-num js-line-number js-blob-rnum" data-line-number="33"></td>
          <td id="file-layers-py-LC33" class="blob-code blob-code-inner js-file-line">    <span class="pl-k">def</span> <span class="pl-en">__call__</span>(<span class="pl-s1">self</span>, <span class="pl-s1">x</span>):</td>
        </tr>
        <tr>
          <td id="file-layers-py-L34" class="blob-num js-line-number js-blob-rnum" data-line-number="34"></td>
          <td id="file-layers-py-LC34" class="blob-code blob-code-inner js-file-line">        <span class="pl-s1">self</span>.<span class="pl-c1">out</span> <span class="pl-c1">=</span> <span class="pl-s1">x</span> @ <span class="pl-s1">self</span>.<span class="pl-c1">weight</span></td>
        </tr>
        <tr>
          <td id="file-layers-py-L35" class="blob-num js-line-number js-blob-rnum" data-line-number="35"></td>
          <td id="file-layers-py-LC35" class="blob-code blob-code-inner js-file-line">        <span class="pl-k">if</span> <span class="pl-s1">self</span>.<span class="pl-c1">bias</span> <span class="pl-c1"><span class="pl-c1">is</span> <span class="pl-c1">not</span></span> <span class="pl-c1">None</span>:</td>
        </tr>
        <tr>
          <td id="file-layers-py-L36" class="blob-num js-line-number js-blob-rnum" data-line-number="36"></td>
          <td id="file-layers-py-LC36" class="blob-code blob-code-inner js-file-line">            <span class="pl-s1">self</span>.<span class="pl-c1">out</span> <span class="pl-c1">+=</span> <span class="pl-s1">self</span>.<span class="pl-c1">bias</span></td>
        </tr>
        <tr>
          <td id="file-layers-py-L37" class="blob-num js-line-number js-blob-rnum" data-line-number="37"></td>
          <td id="file-layers-py-LC37" class="blob-code blob-code-inner js-file-line">
</td>
        </tr>
        <tr>
          <td id="file-layers-py-L38" class="blob-num js-line-number js-blob-rnum" data-line-number="38"></td>
          <td id="file-layers-py-LC38" class="blob-code blob-code-inner js-file-line">        <span class="pl-k">return</span> <span class="pl-s1">self</span>.<span class="pl-c1">out</span></td>
        </tr>
        <tr>
          <td id="file-layers-py-L39" class="blob-num js-line-number js-blob-rnum" data-line-number="39"></td>
          <td id="file-layers-py-LC39" class="blob-code blob-code-inner js-file-line">
</td>
        </tr>
        <tr>
          <td id="file-layers-py-L40" class="blob-num js-line-number js-blob-rnum" data-line-number="40"></td>
          <td id="file-layers-py-LC40" class="blob-code blob-code-inner js-file-line">    <span class="pl-k">def</span> <span class="pl-en">params</span>(<span class="pl-s1">self</span>):</td>
        </tr>
        <tr>
          <td id="file-layers-py-L41" class="blob-num js-line-number js-blob-rnum" data-line-number="41"></td>
          <td id="file-layers-py-LC41" class="blob-code blob-code-inner js-file-line">        <span class="pl-k">return</span> [<span class="pl-s1">self</span>.<span class="pl-c1">weight</span>] <span class="pl-c1">+</span> ([] <span class="pl-k">if</span> <span class="pl-s1">self</span>.<span class="pl-c1">bias</span> <span class="pl-c1">is</span> <span class="pl-c1">None</span> <span class="pl-k">else</span> [<span class="pl-s1">self</span>.<span class="pl-c1">bias</span>])</td>
        </tr>
        <tr>
          <td id="file-layers-py-L42" class="blob-num js-line-number js-blob-rnum" data-line-number="42"></td>
          <td id="file-layers-py-LC42" class="blob-code blob-code-inner js-file-line">
</td>
        </tr>
        <tr>
          <td id="file-layers-py-L43" class="blob-num js-line-number js-blob-rnum" data-line-number="43"></td>
          <td id="file-layers-py-LC43" class="blob-code blob-code-inner js-file-line"><span class="pl-k">class</span> <span class="pl-v">Tanh</span>():</td>
        </tr>
        <tr>
          <td id="file-layers-py-L44" class="blob-num js-line-number js-blob-rnum" data-line-number="44"></td>
          <td id="file-layers-py-LC44" class="blob-code blob-code-inner js-file-line">    <span class="pl-k">def</span> <span class="pl-en">__init__</span>(<span class="pl-s1">self</span>):</td>
        </tr>
        <tr>
          <td id="file-layers-py-L45" class="blob-num js-line-number js-blob-rnum" data-line-number="45"></td>
          <td id="file-layers-py-LC45" class="blob-code blob-code-inner js-file-line">        <span class="pl-s1">self</span>.<span class="pl-c1">out</span> <span class="pl-c1">=</span> <span class="pl-c1">None</span></td>
        </tr>
        <tr>
          <td id="file-layers-py-L46" class="blob-num js-line-number js-blob-rnum" data-line-number="46"></td>
          <td id="file-layers-py-LC46" class="blob-code blob-code-inner js-file-line">
</td>
        </tr>
        <tr>
          <td id="file-layers-py-L47" class="blob-num js-line-number js-blob-rnum" data-line-number="47"></td>
          <td id="file-layers-py-LC47" class="blob-code blob-code-inner js-file-line">    <span class="pl-k">def</span> <span class="pl-en">__call__</span>(<span class="pl-s1">self</span>, <span class="pl-s1">x</span>):</td>
        </tr>
        <tr>
          <td id="file-layers-py-L48" class="blob-num js-line-number js-blob-rnum" data-line-number="48"></td>
          <td id="file-layers-py-LC48" class="blob-code blob-code-inner js-file-line">        <span class="pl-s1">self</span>.<span class="pl-c1">out</span> <span class="pl-c1">=</span> <span class="pl-s1">torch</span>.<span class="pl-c1">tanh</span>(<span class="pl-s1">x</span>)</td>
        </tr>
        <tr>
          <td id="file-layers-py-L49" class="blob-num js-line-number js-blob-rnum" data-line-number="49"></td>
          <td id="file-layers-py-LC49" class="blob-code blob-code-inner js-file-line">        <span class="pl-k">return</span> <span class="pl-s1">self</span>.<span class="pl-c1">out</span></td>
        </tr>
        <tr>
          <td id="file-layers-py-L50" class="blob-num js-line-number js-blob-rnum" data-line-number="50"></td>
          <td id="file-layers-py-LC50" class="blob-code blob-code-inner js-file-line">
</td>
        </tr>
        <tr>
          <td id="file-layers-py-L51" class="blob-num js-line-number js-blob-rnum" data-line-number="51"></td>
          <td id="file-layers-py-LC51" class="blob-code blob-code-inner js-file-line">    <span class="pl-k">def</span> <span class="pl-en">params</span>(<span class="pl-s1">self</span>):</td>
        </tr>
        <tr>
          <td id="file-layers-py-L52" class="blob-num js-line-number js-blob-rnum" data-line-number="52"></td>
          <td id="file-layers-py-LC52" class="blob-code blob-code-inner js-file-line">        <span class="pl-k">return</span> []</td>
        </tr>
  </tbody></table>
</div>


    </div>

  </div>
</div>

      </div>
      <div class="gist-meta">
        <a href="https://gist.github.com/cjams/e25cac28c0703c250739fab0bd5414b7/raw/fbbbe0784ae09785c45ae90745984161470c5943/layers.py" style="float:right" class="Link--inTextBlock">view raw</a>
        <a href="https://gist.github.com/cjams/e25cac28c0703c250739fab0bd5414b7#file-layers-py" class="Link--inTextBlock">
          layers.py
        </a>
        hosted with &#10084; by <a class="Link--inTextBlock" href="https://github.com">GitHub</a>
      </div>
    </div>
</div>
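<p>To make the layer interface concrete, here is a small torch-free sketch (my own illustration, not code from the post) of how a parameter-less layer such as <code>Tanh</code> plugs into it: every layer is callable for the forward pass and reports its learnable tensors via <code>params()</code>, so activation layers simply return an empty list.</p>

```python
import math

# Torch-free sketch of the layer interface used above: __call__ runs the
# forward pass and params() lists the learnable parameters (none for Tanh).
class Tanh:
    def __init__(self):
        self.out = None

    def __call__(self, x):
        # element-wise tanh over a plain list of floats (tensor stand-in)
        self.out = [math.tanh(v) for v in x]
        return self.out

    def params(self):
        return []  # activation layers have no learnable state

act = Tanh()
out = act([0.0, 1.0])
```

<p>The same pattern lets the training code treat every layer uniformly: call it, and collect whatever <code>params()</code> returns.</p>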
</div><h3>Building the Training Loop</h3><p>Now that we have our layers, we can build the training loop. The loop uses plain stochastic gradient descent with a fixed learning rate; in later posts we will swap in optimizers such as Adam for faster convergence. </p><p>First we define the hyperparameters of the training loop:</p><div class="github-gist" data-component-name="GitgistToDOM"><link rel="stylesheet" href="https://github.githubassets.com/assets/gist-embed-ed91f9610ae6.css"><div id="gist144082410" class="gist">
    <div class="gist-file" data-color-mode="light" data-light-theme="light">
      <div class="gist-data">
        <div class="js-gist-file-update-container js-task-list-container">
  <div id="file-hyperparams-py" class="file my-2">
    
    <div itemprop="text" class="Box-body p-0 blob-wrapper data type-python  " style="overflow:auto">

        
<div class="js-check-hidden-unicode js-blob-code-container blob-code-content">

  

  <table data-hpc="" class="highlight tab-size js-file-line-container" data-tab-size="4" data-paste-markdown-skip="" data-tagsearch-path="hyperparams.py">
        <tbody><tr>
          <td id="file-hyperparams-py-L1" class="blob-num js-line-number js-blob-rnum" data-line-number="1"></td>
          <td id="file-hyperparams-py-LC1" class="blob-code blob-code-inner js-file-line"><span class="pl-s1">ctx_window</span> <span class="pl-c1">=</span> <span class="pl-c1">8</span></td>
        </tr>
        <tr>
          <td id="file-hyperparams-py-L2" class="blob-num js-line-number js-blob-rnum" data-line-number="2"></td>
          <td id="file-hyperparams-py-LC2" class="blob-code blob-code-inner js-file-line"><span class="pl-s1">max_step</span> <span class="pl-c1">=</span> <span class="pl-c1">100000</span></td>
        </tr>
        <tr>
          <td id="file-hyperparams-py-L3" class="blob-num js-line-number js-blob-rnum" data-line-number="3"></td>
          <td id="file-hyperparams-py-LC3" class="blob-code blob-code-inner js-file-line"><span class="pl-s1">batch_size</span> <span class="pl-c1">=</span> <span class="pl-c1">64</span></td>
        </tr>
        <tr>
          <td id="file-hyperparams-py-L4" class="blob-num js-line-number js-blob-rnum" data-line-number="4"></td>
          <td id="file-hyperparams-py-LC4" class="blob-code blob-code-inner js-file-line"><span class="pl-s1">embed_dim</span> <span class="pl-c1">=</span> <span class="pl-c1">32</span></td>
        </tr>
        <tr>
          <td id="file-hyperparams-py-L5" class="blob-num js-line-number js-blob-rnum" data-line-number="5"></td>
          <td id="file-hyperparams-py-LC5" class="blob-code blob-code-inner js-file-line"><span class="pl-s1">hidden_size</span> <span class="pl-c1">=</span> <span class="pl-c1">256</span></td>
        </tr>
        <tr>
          <td id="file-hyperparams-py-L6" class="blob-num js-line-number js-blob-rnum" data-line-number="6"></td>
          <td id="file-hyperparams-py-LC6" class="blob-code blob-code-inner js-file-line"><span class="pl-s1">lr</span> <span class="pl-c1">=</span> <span class="pl-c1">1e-3</span></td>
        </tr>
        <tr>
          <td id="file-hyperparams-py-L7" class="blob-num js-line-number js-blob-rnum" data-line-number="7"></td>
          <td id="file-hyperparams-py-LC7" class="blob-code blob-code-inner js-file-line"><span class="pl-s1">vocab_size</span> <span class="pl-c1">=</span> <span class="pl-en">len</span>(<span class="pl-s1">stoi</span>)</td>
        </tr>
        <tr>
          <td id="file-hyperparams-py-L8" class="blob-num js-line-number js-blob-rnum" data-line-number="8"></td>
          <td id="file-hyperparams-py-LC8" class="blob-code blob-code-inner js-file-line"><span class="pl-s1">device</span> <span class="pl-c1">=</span> <span class="pl-s1">torch</span>.<span class="pl-c1">device</span>(&quot;<span class="pl-s1">cuda</span>&quot; <span class="pl-k">if</span> <span class="pl-s1">torch</span>.<span class="pl-c1">cuda</span>.<span class="pl-c1">is_available</span>() <span class="pl-k">else</span> &quot;<span class="pl-s1">cpu</span>&quot;)</td>
        </tr>
  </tbody></table>
</div>


    </div>

  </div>
</div>

      </div>
      <div class="gist-meta">
        <a href="https://gist.github.com/cjams/ac22f670285ea9ce56d19c4c4d765129/raw/754940faddd3c3faf126a62c931d88a6693795a0/hyperparams.py" style="float:right" class="Link--inTextBlock">view raw</a>
        <a href="https://gist.github.com/cjams/ac22f670285ea9ce56d19c4c4d765129#file-hyperparams-py" class="Link--inTextBlock">
          hyperparams.py
        </a>
        hosted with &#10084; by <a class="Link--inTextBlock" href="https://github.com">GitHub</a>
      </div>
    </div>
</div>
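<p>Before wiring the full loop up, note that the fixed-learning-rate update it relies on is just <code>p -= lr * p.grad</code> repeated each step. A minimal illustration (a hypothetical toy objective, not from the post) on a single parameter:</p>

```python
# SGD with a fixed learning rate on f(w) = (w - 3)^2, whose gradient is
# 2 * (w - 3). Each step moves w a fraction lr of the way toward the minimum.
lr = 0.1
w = 0.0
for _ in range(100):
    grad = 2 * (w - 3)  # analytic df/dw
    w -= lr * grad
# w ends up very close to the minimiser 3.0
```

<p>The real loop below does exactly this for every tensor in <code>params</code>, with the gradients supplied by autograd rather than computed by hand.</p>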
</div><p>I chose these values mostly arbitrarily. A <code>ctx_window</code> of 8 means the model has a fixed &#8220;history&#8221; of eight characters, and predicts the next character from those eight. The vocabulary size (i.e. the number of distinct characters) for this dataset is 175 including the special character, so an embedding dimension of 32 seemed reasonable.</p><p>Next we can define our model, initialize the parameters, and create our training dataset:</p><div class="github-gist" data-component-name="GitgistToDOM"><link rel="stylesheet" href="https://github.githubassets.com/assets/gist-embed-ed91f9610ae6.css"><div id="gist144082433" class="gist">
    <div class="gist-file" data-color-mode="light" data-light-theme="light">
      <div class="gist-data">
        <div class="js-gist-file-update-container js-task-list-container">
  <div id="file-basic-bengio-py" class="file my-2">
    
    <div itemprop="text" class="Box-body p-0 blob-wrapper data type-python  " style="overflow:auto">

        
<div class="js-check-hidden-unicode js-blob-code-container blob-code-content">

  


  <table data-hpc="" class="highlight tab-size js-file-line-container" data-tab-size="4" data-paste-markdown-skip="" data-tagsearch-path="basic-bengio.py">
        <tbody><tr>
          <td id="file-basic-bengio-py-L1" class="blob-num js-line-number js-blob-rnum" data-line-number="1"></td>
          <td id="file-basic-bengio-py-LC1" class="blob-code blob-code-inner js-file-line"><span class="pl-s1">model</span> <span class="pl-c1">=</span> [</td>
        </tr>
        <tr>
          <td id="file-basic-bengio-py-L2" class="blob-num js-line-number js-blob-rnum" data-line-number="2"></td>
          <td id="file-basic-bengio-py-LC2" class="blob-code blob-code-inner js-file-line">    <span class="pl-en">Embedding</span>(<span class="pl-s1">device</span><span class="pl-c1">=</span><span class="pl-s1">device</span>, <span class="pl-s1">num_embeddings</span><span class="pl-c1">=</span><span class="pl-s1">vocab_size</span>, <span class="pl-s1">embedding_dim</span><span class="pl-c1">=</span><span class="pl-s1">embed_dim</span>),</td>
        </tr>
        <tr>
          <td id="file-basic-bengio-py-L3" class="blob-num js-line-number js-blob-rnum" data-line-number="3"></td>
          <td id="file-basic-bengio-py-LC3" class="blob-code blob-code-inner js-file-line">    <span class="pl-en">Flatten</span>(<span class="pl-s1">device</span><span class="pl-c1">=</span><span class="pl-s1">device</span>, <span class="pl-s1">input_dim1</span><span class="pl-c1">=</span><span class="pl-s1">ctx_window</span>, <span class="pl-s1">input_dim2</span><span class="pl-c1">=</span><span class="pl-s1">embed_dim</span>),</td>
        </tr>
        <tr>
          <td id="file-basic-bengio-py-L4" class="blob-num js-line-number js-blob-rnum" data-line-number="4"></td>
          <td id="file-basic-bengio-py-LC4" class="blob-code blob-code-inner js-file-line">    <span class="pl-en">Linear</span>(<span class="pl-s1">device</span><span class="pl-c1">=</span><span class="pl-s1">device</span>, <span class="pl-s1">in_features</span><span class="pl-c1">=</span><span class="pl-s1">ctx_window</span><span class="pl-c1">*</span><span class="pl-s1">embed_dim</span>, <span class="pl-s1">out_features</span><span class="pl-c1">=</span><span class="pl-s1">hidden_size</span>, <span class="pl-s1">bias</span><span class="pl-c1">=</span><span class="pl-c1">True</span>),</td>
        </tr>
        <tr>
          <td id="file-basic-bengio-py-L5" class="blob-num js-line-number js-blob-rnum" data-line-number="5"></td>
          <td id="file-basic-bengio-py-LC5" class="blob-code blob-code-inner js-file-line">    <span class="pl-en">Tanh</span>(<span class="pl-s1">device</span><span class="pl-c1">=</span><span class="pl-s1">device</span>),</td>
        </tr>
        <tr>
          <td id="file-basic-bengio-py-L6" class="blob-num js-line-number js-blob-rnum" data-line-number="6"></td>
          <td id="file-basic-bengio-py-LC6" class="blob-code blob-code-inner js-file-line">    <span class="pl-en">Linear</span>(<span class="pl-s1">device</span><span class="pl-c1">=</span><span class="pl-s1">device</span>, <span class="pl-s1">in_features</span><span class="pl-c1">=</span><span class="pl-s1">hidden_size</span>, <span class="pl-s1">out_features</span><span class="pl-c1">=</span><span class="pl-s1">vocab_size</span>, <span class="pl-s1">bias</span><span class="pl-c1">=</span><span class="pl-c1">False</span>)</td>
        </tr>
        <tr>
          <td id="file-basic-bengio-py-L7" class="blob-num js-line-number js-blob-rnum" data-line-number="7"></td>
          <td id="file-basic-bengio-py-LC7" class="blob-code blob-code-inner js-file-line">]</td>
        </tr>
        <tr>
          <td id="file-basic-bengio-py-L8" class="blob-num js-line-number js-blob-rnum" data-line-number="8"></td>
          <td id="file-basic-bengio-py-LC8" class="blob-code blob-code-inner js-file-line">
</td>
        </tr>
        <tr>
          <td id="file-basic-bengio-py-L9" class="blob-num js-line-number js-blob-rnum" data-line-number="9"></td>
          <td id="file-basic-bengio-py-LC9" class="blob-code blob-code-inner js-file-line"><span class="pl-s1">params</span> <span class="pl-c1">=</span> [<span class="pl-s1">p</span> <span class="pl-k">for</span> <span class="pl-s1">layer</span> <span class="pl-c1">in</span> <span class="pl-s1">model</span> <span class="pl-k">for</span> <span class="pl-s1">p</span> <span class="pl-c1">in</span> <span class="pl-s1">layer</span>.<span class="pl-c1">params</span>()]</td>
        </tr>
        <tr>
          <td id="file-basic-bengio-py-L10" class="blob-num js-line-number js-blob-rnum" data-line-number="10"></td>
          <td id="file-basic-bengio-py-LC10" class="blob-code blob-code-inner js-file-line">
</td>
        </tr>
        <tr>
          <td id="file-basic-bengio-py-L11" class="blob-num js-line-number js-blob-rnum" data-line-number="11"></td>
          <td id="file-basic-bengio-py-LC11" class="blob-code blob-code-inner js-file-line"><span class="pl-c"># Enable gradients for the learnable parameters</span></td>
        </tr>
        <tr>
          <td id="file-basic-bengio-py-L12" class="blob-num js-line-number js-blob-rnum" data-line-number="12"></td>
          <td id="file-basic-bengio-py-LC12" class="blob-code blob-code-inner js-file-line"><span class="pl-k">for</span> <span class="pl-s1">p</span> <span class="pl-c1">in</span> <span class="pl-s1">params</span>:</td>
        </tr>
        <tr>
          <td id="file-basic-bengio-py-L13" class="blob-num js-line-number js-blob-rnum" data-line-number="13"></td>
          <td id="file-basic-bengio-py-LC13" class="blob-code blob-code-inner js-file-line">    <span class="pl-s1">p</span>.<span class="pl-c1">requires_grad</span> <span class="pl-c1">=</span> <span class="pl-c1">True</span></td>
        </tr>
        <tr>
          <td id="file-basic-bengio-py-L14" class="blob-num js-line-number js-blob-rnum" data-line-number="14"></td>
          <td id="file-basic-bengio-py-LC14" class="blob-code blob-code-inner js-file-line">
</td>
        </tr>
        <tr>
          <td id="file-basic-bengio-py-L15" class="blob-num js-line-number js-blob-rnum" data-line-number="15"></td>
          <td id="file-basic-bengio-py-LC15" class="blob-code blob-code-inner js-file-line"><span class="pl-c"># Construct train and validation sets (note we leave the data_val</span></td>
        </tr>
        <tr>
          <td id="file-basic-bengio-py-L16" class="blob-num js-line-number js-blob-rnum" data-line-number="16"></td>
          <td id="file-basic-bengio-py-LC16" class="blob-code blob-code-inner js-file-line"><span class="pl-c"># set as a hold-out test set after we are done tuning hyperparams)</span></td>
        </tr>
        <tr>
          <td id="file-basic-bengio-py-L17" class="blob-num js-line-number js-blob-rnum" data-line-number="17"></td>
          <td id="file-basic-bengio-py-LC17" class="blob-code blob-code-inner js-file-line"><span class="pl-v">X_trn</span>, <span class="pl-v">Y_trn</span> <span class="pl-c1">=</span> <span class="pl-en">build_dataset</span>(<span class="pl-s1">data_trn</span>[:<span class="pl-c1">20000</span>], <span class="pl-s1">ctx_window</span>)</td>
        </tr>
        <tr>
          <td id="file-basic-bengio-py-L18" class="blob-num js-line-number js-blob-rnum" data-line-number="18"></td>
          <td id="file-basic-bengio-py-LC18" class="blob-code blob-code-inner js-file-line"><span class="pl-v">X_val</span>, <span class="pl-v">Y_val</span> <span class="pl-c1">=</span> <span class="pl-en">build_dataset</span>(<span class="pl-s1">data_trn</span>[<span class="pl-c1">20000</span>:<span class="pl-c1">22000</span>], <span class="pl-s1">ctx_window</span>)</td>
        </tr>
        <tr>
          <td id="file-basic-bengio-py-L19" class="blob-num js-line-number js-blob-rnum" data-line-number="19"></td>
          <td id="file-basic-bengio-py-LC19" class="blob-code blob-code-inner js-file-line"><span class="pl-v">X_trn</span> <span class="pl-c1">=</span> <span class="pl-v">X_trn</span>.<span class="pl-c1">to</span>(<span class="pl-s1">device</span>)</td>
        </tr>
        <tr>
          <td id="file-basic-bengio-py-L20" class="blob-num js-line-number js-blob-rnum" data-line-number="20"></td>
          <td id="file-basic-bengio-py-LC20" class="blob-code blob-code-inner js-file-line"><span class="pl-v">Y_trn</span> <span class="pl-c1">=</span> <span class="pl-v">Y_trn</span>.<span class="pl-c1">to</span>(<span class="pl-s1">device</span>)</td>
        </tr>
        <tr>
          <td id="file-basic-bengio-py-L21" class="blob-num js-line-number js-blob-rnum" data-line-number="21"></td>
          <td id="file-basic-bengio-py-LC21" class="blob-code blob-code-inner js-file-line"><span class="pl-v">X_val</span> <span class="pl-c1">=</span> <span class="pl-v">X_val</span>.<span class="pl-c1">to</span>(<span class="pl-s1">device</span>)</td>
        </tr>
        <tr>
          <td id="file-basic-bengio-py-L22" class="blob-num js-line-number js-blob-rnum" data-line-number="22"></td>
          <td id="file-basic-bengio-py-LC22" class="blob-code blob-code-inner js-file-line"><span class="pl-v">Y_val</span> <span class="pl-c1">=</span> <span class="pl-v">Y_val</span>.<span class="pl-c1">to</span>(<span class="pl-s1">device</span>)</td>
        </tr>
        <tr>
          <td id="file-basic-bengio-py-L23" class="blob-num js-line-number js-blob-rnum" data-line-number="23"></td>
          <td id="file-basic-bengio-py-LC23" class="blob-code blob-code-inner js-file-line">
</td>
        </tr>
        <tr>
          <td id="file-basic-bengio-py-L24" class="blob-num js-line-number js-blob-rnum" data-line-number="24"></td>
          <td id="file-basic-bengio-py-LC24" class="blob-code blob-code-inner js-file-line"><span class="pl-c"># Create RNG</span></td>
        </tr>
        <tr>
          <td id="file-basic-bengio-py-L25" class="blob-num js-line-number js-blob-rnum" data-line-number="25"></td>
          <td id="file-basic-bengio-py-LC25" class="blob-code blob-code-inner js-file-line"><span class="pl-s1">g</span> <span class="pl-c1">=</span> <span class="pl-s1">torch</span>.<span class="pl-c1">Generator</span>(<span class="pl-s1">device</span><span class="pl-c1">=</span><span class="pl-s1">device</span>).<span class="pl-c1">manual_seed</span>(<span class="pl-c1">42</span>)</td>
        </tr>
  </tbody></table>
</div>


    </div>

  </div>
</div>

      </div>
      <div class="gist-meta">
        <a href="https://gist.github.com/cjams/cebcf41b630e7c0af64094ca5b038566/raw/b98c98c7257865f8809734200a1b29eeed438dbd/basic-bengio.py" style="float:right" class="Link--inTextBlock">view raw</a>
        <a href="https://gist.github.com/cjams/cebcf41b630e7c0af64094ca5b038566#file-basic-bengio-py" class="Link--inTextBlock">
          basic-bengio.py
        </a>
        hosted with &#10084; by <a class="Link--inTextBlock" href="https://github.com">GitHub</a>
      </div>
    </div>
</div>
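<p>To make the layer stack above concrete, here is a minimal sketch of the same architecture using stock <code>torch.nn</code> modules instead of the custom layers, tracing how tensor shapes flow through Embedding &#8594; Flatten &#8594; Linear. The hyperparameter values are illustrative assumptions, not necessarily the ones used in this post:</p>

```python
import torch
import torch.nn as nn

# Hypothetical small hyperparameters, for illustration only
vocab_size, ctx_window, embed_dim, hidden_size = 27, 3, 10, 200

# torch.nn analogues of the custom Embedding/Flatten/Linear/Tanh layers
model = nn.Sequential(
    nn.Embedding(vocab_size, embed_dim),   # (B, ctx) -> (B, ctx, embed)
    nn.Flatten(start_dim=1),               # (B, ctx, embed) -> (B, ctx*embed)
    nn.Linear(ctx_window * embed_dim, hidden_size),  # -> (B, hidden)
    nn.Tanh(),
    nn.Linear(hidden_size, vocab_size, bias=False),  # -> (B, vocab) logits
)

x = torch.randint(0, vocab_size, (4, ctx_window))  # a batch of 4 contexts
logits = model(x)
print(logits.shape)  # torch.Size([4, 27])
```

<p>Each row of <code>logits</code> is an unnormalized distribution over the vocabulary for the next token, which is exactly what <code>F.cross_entropy</code> expects in the training loop below.</p>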
</div><p>Finally we can implement the full forward and backward pass:</p><div class="github-gist" data-attrs="{&quot;innerHTML&quot;:&quot;<div id=\&quot;gist144082466\&quot; class=\&quot;gist\&quot;>\n    <div class=\&quot;gist-file\&quot; translate=\&quot;no\&quot; data-color-mode=\&quot;light\&quot; data-light-theme=\&quot;light\&quot;>\n      <div class=\&quot;gist-data\&quot;>\n        <div class=\&quot;js-gist-file-update-container js-task-list-container\&quot;>\n  <div id=\&quot;file-bengio-train-py\&quot; class=\&quot;file my-2\&quot;>\n    \n    <div itemprop=\&quot;text\&quot;\n      class=\&quot;Box-body p-0 blob-wrapper data type-python  \&quot;\n      style=\&quot;overflow: auto\&quot; tabindex=\&quot;0\&quot; role=\&quot;region\&quot;\n      aria-label=\&quot;bengio-train.py content, created by cjams on 07:43PM today.\&quot;\n    >\n\n        \n<div class=\&quot;js-check-hidden-unicode js-blob-code-container blob-code-content\&quot;>\n\n  <template class=\&quot;js-file-alert-template\&quot;>\n  <div data-view-component=\&quot;true\&quot; class=\&quot;flash flash-warn flash-full d-flex flex-items-center\&quot;>\n  <svg aria-hidden=\&quot;true\&quot; height=\&quot;16\&quot; viewBox=\&quot;0 0 16 16\&quot; version=\&quot;1.1\&quot; width=\&quot;16\&quot; data-view-component=\&quot;true\&quot; class=\&quot;octicon octicon-alert\&quot;>\n    <path d=\&quot;M6.457 1.047c.659-1.234 2.427-1.234 3.086 0l6.082 11.378A1.75 1.75 0 0 1 14.082 15H1.918a1.75 1.75 0 0 1-1.543-2.575Zm1.763.707a.25.25 0 0 0-.44 0L1.698 13.132a.25.25 0 0 0 .22.368h12.164a.25.25 0 0 0 .22-.368Zm.53 3.996v2.5a.75.75 0 0 1-1.5 0v-2.5a.75.75 0 0 1 1.5 0ZM9 11a1 1 0 1 1-2 0 1 1 0 0 1 2 0Z\&quot;></path>\n</svg>\n    <span>\n      This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. 
To review, open the file in an editor that reveals hidden Unicode characters.\n      <a class=\&quot;Link--inTextBlock\&quot; href=\&quot;https://github.co/hiddenchars\&quot; target=\&quot;_blank\&quot;>Learn more about bidirectional Unicode characters</a>\n    </span>\n\n\n  <div data-view-component=\&quot;true\&quot; class=\&quot;flash-action\&quot;>        <a href=\&quot;{{ revealButtonHref }}\&quot; data-view-component=\&quot;true\&quot; class=\&quot;btn-sm btn\&quot;>    Show hidden characters\n</a>\n</div>\n</div></template>\n<template class=\&quot;js-line-alert-template\&quot;>\n  <span aria-label=\&quot;This line has hidden Unicode characters\&quot; data-view-component=\&quot;true\&quot; class=\&quot;line-alert tooltipped tooltipped-e\&quot;>\n    <svg aria-hidden=\&quot;true\&quot; height=\&quot;16\&quot; viewBox=\&quot;0 0 16 16\&quot; version=\&quot;1.1\&quot; width=\&quot;16\&quot; data-view-component=\&quot;true\&quot; class=\&quot;octicon octicon-alert\&quot;>\n    <path d=\&quot;M6.457 1.047c.659-1.234 2.427-1.234 3.086 0l6.082 11.378A1.75 1.75 0 0 1 14.082 15H1.918a1.75 1.75 0 0 1-1.543-2.575Zm1.763.707a.25.25 0 0 0-.44 0L1.698 13.132a.25.25 0 0 0 .22.368h12.164a.25.25 0 0 0 .22-.368Zm.53 3.996v2.5a.75.75 0 0 1-1.5 0v-2.5a.75.75 0 0 1 1.5 0ZM9 11a1 1 0 1 1-2 0 1 1 0 0 1 2 0Z\&quot;></path>\n</svg>\n</span></template>\n\n  <table data-hpc class=\&quot;highlight tab-size js-file-line-container\&quot; data-tab-size=\&quot;4\&quot; data-paste-markdown-skip data-tagsearch-path=\&quot;bengio-train.py\&quot;>\n        <tr>\n          <td id=\&quot;file-bengio-train-py-L1\&quot; class=\&quot;blob-num js-line-number js-blob-rnum\&quot; data-line-number=\&quot;1\&quot;></td>\n          <td id=\&quot;file-bengio-train-py-LC1\&quot; class=\&quot;blob-code blob-code-inner js-file-line\&quot;><span class=pl-k>def</span> <span class=pl-en>plot_loss</span>(<span class=pl-s1>trn_loss</span>, <span class=pl-s1>val_loss</span><span class=pl-c1>=</span><span 
class=pl-c1>None</span>, <span class=pl-s1>title</span><span class=pl-c1>=</span>&#8221;<span class=pl-v>Loss</span> <span class=pl-v>Curves</span>&#8221;):</td>\n        </tr>\n        <tr>\n          <td id=\&quot;file-bengio-train-py-L2\&quot; class=\&quot;blob-num js-line-number js-blob-rnum\&quot; data-line-number=\&quot;2\&quot;></td>\n          <td id=\&quot;file-bengio-train-py-LC2\&quot; class=\&quot;blob-code blob-code-inner js-file-line\&quot;>    <span class=pl-s1>plt</span>.<span class=pl-c1>figure</span>(<span class=pl-s1>figsize</span><span class=pl-c1>=</span>(<span class=pl-c1>10</span>, <span class=pl-c1>6</span>)) </td>\n        </tr>\n        <tr>\n          <td id=\&quot;file-bengio-train-py-L3\&quot; class=\&quot;blob-num js-line-number js-blob-rnum\&quot; data-line-number=\&quot;3\&quot;></td>\n          <td id=\&quot;file-bengio-train-py-LC3\&quot; class=\&quot;blob-code blob-code-inner js-file-line\&quot;>    <span class=pl-s1>plt</span>.<span class=pl-c1>xticks</span>(<span class=pl-s1>fontsize</span><span class=pl-c1>=</span><span class=pl-c1>12</span>)</td>\n        </tr>\n        <tr>\n          <td id=\&quot;file-bengio-train-py-L4\&quot; class=\&quot;blob-num js-line-number js-blob-rnum\&quot; data-line-number=\&quot;4\&quot;></td>\n          <td id=\&quot;file-bengio-train-py-LC4\&quot; class=\&quot;blob-code blob-code-inner js-file-line\&quot;>    <span class=pl-s1>plt</span>.<span class=pl-c1>yticks</span>(<span class=pl-s1>fontsize</span><span class=pl-c1>=</span><span class=pl-c1>12</span>)</td>\n        </tr>\n        <tr>\n          <td id=\&quot;file-bengio-train-py-L5\&quot; class=\&quot;blob-num js-line-number js-blob-rnum\&quot; data-line-number=\&quot;5\&quot;></td>\n          <td id=\&quot;file-bengio-train-py-LC5\&quot; class=\&quot;blob-code blob-code-inner js-file-line\&quot;>    <span class=pl-s1>plt</span>.<span class=pl-c1>title</span>(<span class=pl-s1>title</span>)</td>\n        </tr>\n        <tr>\n          <td 
id=\&quot;file-bengio-train-py-L6\&quot; class=\&quot;blob-num js-line-number js-blob-rnum\&quot; data-line-number=\&quot;6\&quot;></td>\n          <td id=\&quot;file-bengio-train-py-LC6\&quot; class=\&quot;blob-code blob-code-inner js-file-line\&quot;>\n</td>\n        </tr>\n        <tr>\n          <td id=\&quot;file-bengio-train-py-L7\&quot; class=\&quot;blob-num js-line-number js-blob-rnum\&quot; data-line-number=\&quot;7\&quot;></td>\n          <td id=\&quot;file-bengio-train-py-LC7\&quot; class=\&quot;blob-code blob-code-inner js-file-line\&quot;>    <span class=pl-s1>legends</span> <span class=pl-c1>=</span> []</td>\n        </tr>\n        <tr>\n          <td id=\&quot;file-bengio-train-py-L8\&quot; class=\&quot;blob-num js-line-number js-blob-rnum\&quot; data-line-number=\&quot;8\&quot;></td>\n          <td id=\&quot;file-bengio-train-py-LC8\&quot; class=\&quot;blob-code blob-code-inner js-file-line\&quot;>\n</td>\n        </tr>\n        <tr>\n          <td id=\&quot;file-bengio-train-py-L9\&quot; class=\&quot;blob-num js-line-number js-blob-rnum\&quot; data-line-number=\&quot;9\&quot;></td>\n          <td id=\&quot;file-bengio-train-py-LC9\&quot; class=\&quot;blob-code blob-code-inner js-file-line\&quot;>    <span class=pl-k>assert</span> <span class=pl-en>len</span>(<span class=pl-s1>trn_loss</span>) <span class=pl-c1>%</span> <span class=pl-c1>1000</span> <span class=pl-c1>==</span> <span class=pl-c1>0</span></td>\n        </tr>\n        <tr>\n          <td id=\&quot;file-bengio-train-py-L10\&quot; class=\&quot;blob-num js-line-number js-blob-rnum\&quot; data-line-number=\&quot;10\&quot;></td>\n          <td id=\&quot;file-bengio-train-py-LC10\&quot; class=\&quot;blob-code blob-code-inner js-file-line\&quot;>    <span class=pl-s1>plt</span>.<span class=pl-c1>plot</span>(<span class=pl-s1>torch</span>.<span class=pl-c1>tensor</span>(<span class=pl-s1>trn_loss</span>).<span class=pl-c1>view</span>(<span class=pl-c1>-</span><span class=pl-c1>1</span>, <span 
class=pl-c1>1000</span>).<span class=pl-c1>mean</span>(<span class=pl-s1>dim</span><span class=pl-c1>=</span><span class=pl-c1>1</span>))</td>\n        </tr>\n        <tr>\n          <td id=\&quot;file-bengio-train-py-L11\&quot; class=\&quot;blob-num js-line-number js-blob-rnum\&quot; data-line-number=\&quot;11\&quot;></td>\n          <td id=\&quot;file-bengio-train-py-LC11\&quot; class=\&quot;blob-code blob-code-inner js-file-line\&quot;>    <span class=pl-s1>legends</span>.<span class=pl-c1>append</span>(&#8221;<span class=pl-s1>train</span> <span class=pl-s1>loss</span>&#8221;)</td>\n        </tr>\n        <tr>\n          <td id=\&quot;file-bengio-train-py-L12\&quot; class=\&quot;blob-num js-line-number js-blob-rnum\&quot; data-line-number=\&quot;12\&quot;></td>\n          <td id=\&quot;file-bengio-train-py-LC12\&quot; class=\&quot;blob-code blob-code-inner js-file-line\&quot;>\n</td>\n        </tr>\n        <tr>\n          <td id=\&quot;file-bengio-train-py-L13\&quot; class=\&quot;blob-num js-line-number js-blob-rnum\&quot; data-line-number=\&quot;13\&quot;></td>\n          <td id=\&quot;file-bengio-train-py-LC13\&quot; class=\&quot;blob-code blob-code-inner js-file-line\&quot;>    <span class=pl-k>if</span> <span class=pl-s1>val_loss</span>:</td>\n        </tr>\n        <tr>\n          <td id=\&quot;file-bengio-train-py-L14\&quot; class=\&quot;blob-num js-line-number js-blob-rnum\&quot; data-line-number=\&quot;14\&quot;></td>\n          <td id=\&quot;file-bengio-train-py-LC14\&quot; class=\&quot;blob-code blob-code-inner js-file-line\&quot;>        <span class=pl-s1>plt</span>.<span class=pl-c1>plot</span>(<span class=pl-s1>torch</span>.<span class=pl-c1>tensor</span>(<span class=pl-s1>val_loss</span>).<span class=pl-c1>view</span>(<span class=pl-c1>-</span><span class=pl-c1>1</span>, <span class=pl-c1>1000</span>).<span class=pl-c1>mean</span>(<span class=pl-s1>dim</span><span class=pl-c1>=</span><span class=pl-c1>1</span>))</td>\n        </tr>\n        
<tr>\n          <td id=\&quot;file-bengio-train-py-L15\&quot; class=\&quot;blob-num js-line-number js-blob-rnum\&quot; data-line-number=\&quot;15\&quot;></td>\n          <td id=\&quot;file-bengio-train-py-LC15\&quot; class=\&quot;blob-code blob-code-inner js-file-line\&quot;>        <span class=pl-s1>legends</span>.<span class=pl-c1>append</span>(&#8221;<span class=pl-s1>val</span> <span class=pl-s1>loss</span>&#8221;)</td>\n        </tr>\n        <tr>\n          <td id=\&quot;file-bengio-train-py-L16\&quot; class=\&quot;blob-num js-line-number js-blob-rnum\&quot; data-line-number=\&quot;16\&quot;></td>\n          <td id=\&quot;file-bengio-train-py-LC16\&quot; class=\&quot;blob-code blob-code-inner js-file-line\&quot;>\n</td>\n        </tr>\n        <tr>\n          <td id=\&quot;file-bengio-train-py-L17\&quot; class=\&quot;blob-num js-line-number js-blob-rnum\&quot; data-line-number=\&quot;17\&quot;></td>\n          <td id=\&quot;file-bengio-train-py-LC17\&quot; class=\&quot;blob-code blob-code-inner js-file-line\&quot;>    <span class=pl-s1>plt</span>.<span class=pl-c1>legend</span>(<span class=pl-s1>legends</span>)</td>\n        </tr>\n        <tr>\n          <td id=\&quot;file-bengio-train-py-L18\&quot; class=\&quot;blob-num js-line-number js-blob-rnum\&quot; data-line-number=\&quot;18\&quot;></td>\n          <td id=\&quot;file-bengio-train-py-LC18\&quot; class=\&quot;blob-code blob-code-inner js-file-line\&quot;>    <span class=pl-s1>plt</span>.<span class=pl-c1>close</span>()</td>\n        </tr>\n        <tr>\n          <td id=\&quot;file-bengio-train-py-L19\&quot; class=\&quot;blob-num js-line-number js-blob-rnum\&quot; data-line-number=\&quot;19\&quot;></td>\n          <td id=\&quot;file-bengio-train-py-LC19\&quot; class=\&quot;blob-code blob-code-inner js-file-line\&quot;>\n</td>\n        </tr>\n        <tr>\n          <td id=\&quot;file-bengio-train-py-L20\&quot; class=\&quot;blob-num js-line-number js-blob-rnum\&quot; 
data-line-number=\&quot;20\&quot;></td>\n          <td id=\&quot;file-bengio-train-py-LC20\&quot; class=\&quot;blob-code blob-code-inner js-file-line\&quot;><span class=pl-c># Training loop</span></td>\n        </tr>\n        <tr>\n          <td id=\&quot;file-bengio-train-py-L21\&quot; class=\&quot;blob-num js-line-number js-blob-rnum\&quot; data-line-number=\&quot;21\&quot;></td>\n          <td id=\&quot;file-bengio-train-py-LC21\&quot; class=\&quot;blob-code blob-code-inner js-file-line\&quot;><span class=pl-s1>trn_loss</span> <span class=pl-c1>=</span> []</td>\n        </tr>\n        <tr>\n          <td id=\&quot;file-bengio-train-py-L22\&quot; class=\&quot;blob-num js-line-number js-blob-rnum\&quot; data-line-number=\&quot;22\&quot;></td>\n          <td id=\&quot;file-bengio-train-py-LC22\&quot; class=\&quot;blob-code blob-code-inner js-file-line\&quot;><span class=pl-s1>val_loss</span> <span class=pl-c1>=</span> []</td>\n        </tr>\n        <tr>\n          <td id=\&quot;file-bengio-train-py-L23\&quot; class=\&quot;blob-num js-line-number js-blob-rnum\&quot; data-line-number=\&quot;23\&quot;></td>\n          <td id=\&quot;file-bengio-train-py-LC23\&quot; class=\&quot;blob-code blob-code-inner js-file-line\&quot;>\n</td>\n        </tr>\n        <tr>\n          <td id=\&quot;file-bengio-train-py-L24\&quot; class=\&quot;blob-num js-line-number js-blob-rnum\&quot; data-line-number=\&quot;24\&quot;></td>\n          <td id=\&quot;file-bengio-train-py-LC24\&quot; class=\&quot;blob-code blob-code-inner js-file-line\&quot;><span class=pl-k>for</span> <span class=pl-s1>i</span> <span class=pl-c1>in</span> <span class=pl-en>range</span>(<span class=pl-s1>max_step</span>):</td>\n        </tr>\n        <tr>\n          <td id=\&quot;file-bengio-train-py-L25\&quot; class=\&quot;blob-num js-line-number js-blob-rnum\&quot; data-line-number=\&quot;25\&quot;></td>\n          <td id=\&quot;file-bengio-train-py-LC25\&quot; class=\&quot;blob-code blob-code-inner 
js-file-line\&quot;>    <span class=pl-s1>ix</span> <span class=pl-c1>=</span> <span class=pl-s1>torch</span>.<span class=pl-c1>randint</span>(<span class=pl-c1>0</span>, <span class=pl-v>X_trn</span>.<span class=pl-c1>shape</span>[<span class=pl-c1>0</span>], (<span class=pl-s1>batch_size</span>,), <span class=pl-s1>generator</span><span class=pl-c1>=</span><span class=pl-s1>g</span>, <span class=pl-s1>device</span><span class=pl-c1>=</span><span class=pl-s1>device</span>)</td>\n        </tr>\n        <tr>\n          <td id=\&quot;file-bengio-train-py-L26\&quot; class=\&quot;blob-num js-line-number js-blob-rnum\&quot; data-line-number=\&quot;26\&quot;></td>\n          <td id=\&quot;file-bengio-train-py-LC26\&quot; class=\&quot;blob-code blob-code-inner js-file-line\&quot;>    <span class=pl-s1>x</span> <span class=pl-c1>=</span> <span class=pl-v>X_trn</span>[<span class=pl-s1>ix</span>]</td>\n        </tr>\n        <tr>\n          <td id=\&quot;file-bengio-train-py-L27\&quot; class=\&quot;blob-num js-line-number js-blob-rnum\&quot; data-line-number=\&quot;27\&quot;></td>\n          <td id=\&quot;file-bengio-train-py-LC27\&quot; class=\&quot;blob-code blob-code-inner js-file-line\&quot;>\n</td>\n        </tr>\n        <tr>\n          <td id=\&quot;file-bengio-train-py-L28\&quot; class=\&quot;blob-num js-line-number js-blob-rnum\&quot; data-line-number=\&quot;28\&quot;></td>\n          <td id=\&quot;file-bengio-train-py-LC28\&quot; class=\&quot;blob-code blob-code-inner js-file-line\&quot;>    <span class=pl-c># Forward pass</span></td>\n        </tr>\n        <tr>\n          <td id=\&quot;file-bengio-train-py-L29\&quot; class=\&quot;blob-num js-line-number js-blob-rnum\&quot; data-line-number=\&quot;29\&quot;></td>\n          <td id=\&quot;file-bengio-train-py-LC29\&quot; class=\&quot;blob-code blob-code-inner js-file-line\&quot;>    <span class=pl-k>for</span> <span class=pl-s1>layer</span> <span class=pl-c1>in</span> <span class=pl-s1>model</span>:</td>\n        
class=pl-c1>append</span>(<span class=pl-s1>loss</span>.<span class=pl-c1>item</span>())</td>\n        </tr>\n        <tr>\n          <td id=\&quot;file-bengio-train-py-L60\&quot; class=\&quot;blob-num js-line-number js-blob-rnum\&quot; data-line-number=\&quot;60\&quot;></td>\n          <td id=\&quot;file-bengio-train-py-LC60\&quot; class=\&quot;blob-code blob-code-inner js-file-line\&quot;>\n</td>\n        </tr>\n        <tr>\n          <td id=\&quot;file-bengio-train-py-L61\&quot; class=\&quot;blob-num js-line-number js-blob-rnum\&quot; data-line-number=\&quot;61\&quot;></td>\n          <td id=\&quot;file-bengio-train-py-LC61\&quot; class=\&quot;blob-code blob-code-inner js-file-line\&quot;><span class=pl-en>plot_loss</span>(<span class=pl-s1>trn_loss</span>, <span class=pl-s1>val_loss</span>)</td>\n        </tr>\n  </table>\n</div>\n\n\n    </div>\n\n  </div>\n</div>\n\n      </div>\n      <div class=\&quot;gist-meta\&quot;>\n        <a href=\&quot;https://gist.github.com/cjams/fb39affb5da54656f4587b0391f02cf3/raw/9effeab90e91989621a366d9889db2d865867a97/bengio-train.py\&quot; style=\&quot;float:right\&quot; class=\&quot;Link--inTextBlock\&quot;>view raw</a>\n        <a href=\&quot;https://gist.github.com/cjams/fb39affb5da54656f4587b0391f02cf3#file-bengio-train-py\&quot; class=\&quot;Link--inTextBlock\&quot;>\n          bengio-train.py\n        </a>\n        hosted with &amp;#10084; by <a class=\&quot;Link--inTextBlock\&quot; href=\&quot;https://github.com\&quot;>GitHub</a>\n      </div>\n    </div>\n</div>\n&quot;,&quot;stylesheet&quot;:&quot;https://github.githubassets.com/assets/gist-embed-ed91f9610ae6.css&quot;}" data-component-name="GitgistToDOM"><link rel="stylesheet" href="https://github.githubassets.com/assets/gist-embed-ed91f9610ae6.css"><div id="gist144082466" class="gist">
    <div class="gist-file" data-color-mode="light" data-light-theme="light">
      <div class="gist-data">
        <div class="js-gist-file-update-container js-task-list-container">
  <div id="file-bengio-train-py" class="file my-2">
    
    <div itemprop="text" class="Box-body p-0 blob-wrapper data type-python  " style="overflow:auto">

        
<div class="js-check-hidden-unicode js-blob-code-container blob-code-content">

  

  <table data-hpc="" class="highlight tab-size js-file-line-container" data-tab-size="4" data-paste-markdown-skip="" data-tagsearch-path="bengio-train.py">
        <tbody><tr>
          <td id="file-bengio-train-py-L1" class="blob-num js-line-number js-blob-rnum" data-line-number="1"></td>
          <td id="file-bengio-train-py-LC1" class="blob-code blob-code-inner js-file-line"><span class="pl-k">def</span> <span class="pl-en">plot_loss</span>(<span class="pl-s1">trn_loss</span>, <span class="pl-s1">val_loss</span><span class="pl-c1">=</span><span class="pl-c1">None</span>, <span class="pl-s1">title</span><span class="pl-c1">=</span><span class="pl-s">"Loss Curves"</span>):</td>
        </tr>
        <tr>
          <td id="file-bengio-train-py-L2" class="blob-num js-line-number js-blob-rnum" data-line-number="2"></td>
          <td id="file-bengio-train-py-LC2" class="blob-code blob-code-inner js-file-line">    <span class="pl-s1">plt</span>.<span class="pl-c1">figure</span>(<span class="pl-s1">figsize</span><span class="pl-c1">=</span>(<span class="pl-c1">10</span>, <span class="pl-c1">6</span>)) </td>
        </tr>
        <tr>
          <td id="file-bengio-train-py-L3" class="blob-num js-line-number js-blob-rnum" data-line-number="3"></td>
          <td id="file-bengio-train-py-LC3" class="blob-code blob-code-inner js-file-line">    <span class="pl-s1">plt</span>.<span class="pl-c1">xticks</span>(<span class="pl-s1">fontsize</span><span class="pl-c1">=</span><span class="pl-c1">12</span>)</td>
        </tr>
        <tr>
          <td id="file-bengio-train-py-L4" class="blob-num js-line-number js-blob-rnum" data-line-number="4"></td>
          <td id="file-bengio-train-py-LC4" class="blob-code blob-code-inner js-file-line">    <span class="pl-s1">plt</span>.<span class="pl-c1">yticks</span>(<span class="pl-s1">fontsize</span><span class="pl-c1">=</span><span class="pl-c1">12</span>)</td>
        </tr>
        <tr>
          <td id="file-bengio-train-py-L5" class="blob-num js-line-number js-blob-rnum" data-line-number="5"></td>
          <td id="file-bengio-train-py-LC5" class="blob-code blob-code-inner js-file-line">    <span class="pl-s1">plt</span>.<span class="pl-c1">title</span>(<span class="pl-s1">title</span>)</td>
        </tr>
        <tr>
          <td id="file-bengio-train-py-L6" class="blob-num js-line-number js-blob-rnum" data-line-number="6"></td>
          <td id="file-bengio-train-py-LC6" class="blob-code blob-code-inner js-file-line">
</td>
        </tr>
        <tr>
          <td id="file-bengio-train-py-L7" class="blob-num js-line-number js-blob-rnum" data-line-number="7"></td>
          <td id="file-bengio-train-py-LC7" class="blob-code blob-code-inner js-file-line">    <span class="pl-s1">legends</span> <span class="pl-c1">=</span> []</td>
        </tr>
        <tr>
          <td id="file-bengio-train-py-L8" class="blob-num js-line-number js-blob-rnum" data-line-number="8"></td>
          <td id="file-bengio-train-py-LC8" class="blob-code blob-code-inner js-file-line">
</td>
        </tr>
        <tr>
          <td id="file-bengio-train-py-L9" class="blob-num js-line-number js-blob-rnum" data-line-number="9"></td>
          <td id="file-bengio-train-py-LC9" class="blob-code blob-code-inner js-file-line">    <span class="pl-k">assert</span> <span class="pl-en">len</span>(<span class="pl-s1">trn_loss</span>) <span class="pl-c1">%</span> <span class="pl-c1">1000</span> <span class="pl-c1">==</span> <span class="pl-c1">0</span></td>
        </tr>
        <tr>
          <td id="file-bengio-train-py-L10" class="blob-num js-line-number js-blob-rnum" data-line-number="10"></td>
          <td id="file-bengio-train-py-LC10" class="blob-code blob-code-inner js-file-line">    <span class="pl-s1">plt</span>.<span class="pl-c1">plot</span>(<span class="pl-s1">torch</span>.<span class="pl-c1">tensor</span>(<span class="pl-s1">trn_loss</span>).<span class="pl-c1">view</span>(<span class="pl-c1">-</span><span class="pl-c1">1</span>, <span class="pl-c1">1000</span>).<span class="pl-c1">mean</span>(<span class="pl-s1">dim</span><span class="pl-c1">=</span><span class="pl-c1">1</span>))</td>
        </tr>
        <tr>
          <td id="file-bengio-train-py-L11" class="blob-num js-line-number js-blob-rnum" data-line-number="11"></td>
          <td id="file-bengio-train-py-LC11" class="blob-code blob-code-inner js-file-line">    <span class="pl-s1">legends</span>.<span class="pl-c1">append</span>(<span class="pl-s">"train loss"</span>)</td>
        </tr>
        <tr>
          <td id="file-bengio-train-py-L12" class="blob-num js-line-number js-blob-rnum" data-line-number="12"></td>
          <td id="file-bengio-train-py-LC12" class="blob-code blob-code-inner js-file-line">
</td>
        </tr>
        <tr>
          <td id="file-bengio-train-py-L13" class="blob-num js-line-number js-blob-rnum" data-line-number="13"></td>
          <td id="file-bengio-train-py-LC13" class="blob-code blob-code-inner js-file-line">    <span class="pl-k">if</span> <span class="pl-s1">val_loss</span>:</td>
        </tr>
        <tr>
          <td id="file-bengio-train-py-L14" class="blob-num js-line-number js-blob-rnum" data-line-number="14"></td>
          <td id="file-bengio-train-py-LC14" class="blob-code blob-code-inner js-file-line">        <span class="pl-s1">plt</span>.<span class="pl-c1">plot</span>(<span class="pl-s1">torch</span>.<span class="pl-c1">tensor</span>(<span class="pl-s1">val_loss</span>).<span class="pl-c1">view</span>(<span class="pl-c1">-</span><span class="pl-c1">1</span>, <span class="pl-c1">1000</span>).<span class="pl-c1">mean</span>(<span class="pl-s1">dim</span><span class="pl-c1">=</span><span class="pl-c1">1</span>))</td>
        </tr>
        <tr>
          <td id="file-bengio-train-py-L15" class="blob-num js-line-number js-blob-rnum" data-line-number="15"></td>
          <td id="file-bengio-train-py-LC15" class="blob-code blob-code-inner js-file-line">        <span class="pl-s1">legends</span>.<span class="pl-c1">append</span>(<span class="pl-s">"val loss"</span>)</td>
        </tr>
        <tr>
          <td id="file-bengio-train-py-L16" class="blob-num js-line-number js-blob-rnum" data-line-number="16"></td>
          <td id="file-bengio-train-py-LC16" class="blob-code blob-code-inner js-file-line">
</td>
        </tr>
        <tr>
          <td id="file-bengio-train-py-L17" class="blob-num js-line-number js-blob-rnum" data-line-number="17"></td>
          <td id="file-bengio-train-py-LC17" class="blob-code blob-code-inner js-file-line">    <span class="pl-s1">plt</span>.<span class="pl-c1">legend</span>(<span class="pl-s1">legends</span>)</td>
        </tr>
        <tr>
          <td id="file-bengio-train-py-L18" class="blob-num js-line-number js-blob-rnum" data-line-number="18"></td>
          <td id="file-bengio-train-py-LC18" class="blob-code blob-code-inner js-file-line">    <span class="pl-s1">plt</span>.<span class="pl-c1">show</span>()</td>
        </tr>
        <tr>
          <td id="file-bengio-train-py-L19" class="blob-num js-line-number js-blob-rnum" data-line-number="19"></td>
          <td id="file-bengio-train-py-LC19" class="blob-code blob-code-inner js-file-line">
</td>
        </tr>
        <tr>
          <td id="file-bengio-train-py-L20" class="blob-num js-line-number js-blob-rnum" data-line-number="20"></td>
          <td id="file-bengio-train-py-LC20" class="blob-code blob-code-inner js-file-line"><span class="pl-c"># Training loop</span></td>
        </tr>
        <tr>
          <td id="file-bengio-train-py-L21" class="blob-num js-line-number js-blob-rnum" data-line-number="21"></td>
          <td id="file-bengio-train-py-LC21" class="blob-code blob-code-inner js-file-line"><span class="pl-s1">trn_loss</span> <span class="pl-c1">=</span> []</td>
        </tr>
        <tr>
          <td id="file-bengio-train-py-L22" class="blob-num js-line-number js-blob-rnum" data-line-number="22"></td>
          <td id="file-bengio-train-py-LC22" class="blob-code blob-code-inner js-file-line"><span class="pl-s1">val_loss</span> <span class="pl-c1">=</span> []</td>
        </tr>
        <tr>
          <td id="file-bengio-train-py-L23" class="blob-num js-line-number js-blob-rnum" data-line-number="23"></td>
          <td id="file-bengio-train-py-LC23" class="blob-code blob-code-inner js-file-line">
</td>
        </tr>
        <tr>
          <td id="file-bengio-train-py-L24" class="blob-num js-line-number js-blob-rnum" data-line-number="24"></td>
          <td id="file-bengio-train-py-LC24" class="blob-code blob-code-inner js-file-line"><span class="pl-k">for</span> <span class="pl-s1">i</span> <span class="pl-c1">in</span> <span class="pl-en">range</span>(<span class="pl-s1">max_step</span>):</td>
        </tr>
        <tr>
          <td id="file-bengio-train-py-L25" class="blob-num js-line-number js-blob-rnum" data-line-number="25"></td>
          <td id="file-bengio-train-py-LC25" class="blob-code blob-code-inner js-file-line">    <span class="pl-s1">ix</span> <span class="pl-c1">=</span> <span class="pl-s1">torch</span>.<span class="pl-c1">randint</span>(<span class="pl-c1">0</span>, <span class="pl-v">X_trn</span>.<span class="pl-c1">shape</span>[<span class="pl-c1">0</span>], (<span class="pl-s1">batch_size</span>,), <span class="pl-s1">generator</span><span class="pl-c1">=</span><span class="pl-s1">g</span>, <span class="pl-s1">device</span><span class="pl-c1">=</span><span class="pl-s1">device</span>)</td>
        </tr>
        <tr>
          <td id="file-bengio-train-py-L26" class="blob-num js-line-number js-blob-rnum" data-line-number="26"></td>
          <td id="file-bengio-train-py-LC26" class="blob-code blob-code-inner js-file-line">    <span class="pl-s1">x</span> <span class="pl-c1">=</span> <span class="pl-v">X_trn</span>[<span class="pl-s1">ix</span>]</td>
        </tr>
        <tr>
          <td id="file-bengio-train-py-L27" class="blob-num js-line-number js-blob-rnum" data-line-number="27"></td>
          <td id="file-bengio-train-py-LC27" class="blob-code blob-code-inner js-file-line">
</td>
        </tr>
        <tr>
          <td id="file-bengio-train-py-L28" class="blob-num js-line-number js-blob-rnum" data-line-number="28"></td>
          <td id="file-bengio-train-py-LC28" class="blob-code blob-code-inner js-file-line">    <span class="pl-c"># Forward pass</span></td>
        </tr>
        <tr>
          <td id="file-bengio-train-py-L29" class="blob-num js-line-number js-blob-rnum" data-line-number="29"></td>
          <td id="file-bengio-train-py-LC29" class="blob-code blob-code-inner js-file-line">    <span class="pl-k">for</span> <span class="pl-s1">layer</span> <span class="pl-c1">in</span> <span class="pl-s1">model</span>:</td>
        </tr>
        <tr>
          <td id="file-bengio-train-py-L30" class="blob-num js-line-number js-blob-rnum" data-line-number="30"></td>
          <td id="file-bengio-train-py-LC30" class="blob-code blob-code-inner js-file-line">        <span class="pl-s1">x</span> <span class="pl-c1">=</span> <span class="pl-en">layer</span>(<span class="pl-s1">x</span>)</td>
        </tr>
        <tr>
          <td id="file-bengio-train-py-L31" class="blob-num js-line-number js-blob-rnum" data-line-number="31"></td>
          <td id="file-bengio-train-py-LC31" class="blob-code blob-code-inner js-file-line">
</td>
        </tr>
        <tr>
          <td id="file-bengio-train-py-L32" class="blob-num js-line-number js-blob-rnum" data-line-number="32"></td>
          <td id="file-bengio-train-py-LC32" class="blob-code blob-code-inner js-file-line">    <span class="pl-c"># Compute loss</span></td>
        </tr>
        <tr>
          <td id="file-bengio-train-py-L33" class="blob-num js-line-number js-blob-rnum" data-line-number="33"></td>
          <td id="file-bengio-train-py-LC33" class="blob-code blob-code-inner js-file-line">    <span class="pl-s1">loss</span> <span class="pl-c1">=</span> <span class="pl-c1">F</span>.<span class="pl-c1">cross_entropy</span>(<span class="pl-s1">x</span>, <span class="pl-v">Y_trn</span>[<span class="pl-s1">ix</span>])</td>
        </tr>
        <tr>
          <td id="file-bengio-train-py-L34" class="blob-num js-line-number js-blob-rnum" data-line-number="34"></td>
          <td id="file-bengio-train-py-LC34" class="blob-code blob-code-inner js-file-line">    <span class="pl-s1">trn_loss</span>.<span class="pl-c1">append</span>(<span class="pl-s1">loss</span>.<span class="pl-c1">item</span>())</td>
        </tr>
        <tr>
          <td id="file-bengio-train-py-L35" class="blob-num js-line-number js-blob-rnum" data-line-number="35"></td>
          <td id="file-bengio-train-py-LC35" class="blob-code blob-code-inner js-file-line">
</td>
        </tr>
        <tr>
          <td id="file-bengio-train-py-L36" class="blob-num js-line-number js-blob-rnum" data-line-number="36"></td>
          <td id="file-bengio-train-py-LC36" class="blob-code blob-code-inner js-file-line">    <span class="pl-c"># Zero gradients to prevent accumulation</span></td>
        </tr>
        <tr>
          <td id="file-bengio-train-py-L37" class="blob-num js-line-number js-blob-rnum" data-line-number="37"></td>
          <td id="file-bengio-train-py-LC37" class="blob-code blob-code-inner js-file-line">    <span class="pl-k">for</span> <span class="pl-s1">p</span> <span class="pl-c1">in</span> <span class="pl-s1">params</span>:</td>
        </tr>
        <tr>
          <td id="file-bengio-train-py-L38" class="blob-num js-line-number js-blob-rnum" data-line-number="38"></td>
          <td id="file-bengio-train-py-LC38" class="blob-code blob-code-inner js-file-line">        <span class="pl-s1">p</span>.<span class="pl-c1">grad</span> <span class="pl-c1">=</span> <span class="pl-c1">None</span></td>
        </tr>
        <tr>
          <td id="file-bengio-train-py-L39" class="blob-num js-line-number js-blob-rnum" data-line-number="39"></td>
          <td id="file-bengio-train-py-LC39" class="blob-code blob-code-inner js-file-line">
</td>
        </tr>
        <tr>
          <td id="file-bengio-train-py-L40" class="blob-num js-line-number js-blob-rnum" data-line-number="40"></td>
          <td id="file-bengio-train-py-LC40" class="blob-code blob-code-inner js-file-line">    <span class="pl-c"># Backpropagation</span></td>
        </tr>
        <tr>
          <td id="file-bengio-train-py-L41" class="blob-num js-line-number js-blob-rnum" data-line-number="41"></td>
          <td id="file-bengio-train-py-LC41" class="blob-code blob-code-inner js-file-line">    <span class="pl-s1">loss</span>.<span class="pl-c1">backward</span>()</td>
        </tr>
        <tr>
          <td id="file-bengio-train-py-L42" class="blob-num js-line-number js-blob-rnum" data-line-number="42"></td>
          <td id="file-bengio-train-py-LC42" class="blob-code blob-code-inner js-file-line">
</td>
        </tr>
        <tr>
          <td id="file-bengio-train-py-L43" class="blob-num js-line-number js-blob-rnum" data-line-number="43"></td>
          <td id="file-bengio-train-py-LC43" class="blob-code blob-code-inner js-file-line">    <span class="pl-k">if</span> <span class="pl-s1">i</span> <span class="pl-c1">&gt;</span> <span class="pl-c1">80000</span>:</td>
        </tr>
        <tr>
          <td id="file-bengio-train-py-L44" class="blob-num js-line-number js-blob-rnum" data-line-number="44"></td>
          <td id="file-bengio-train-py-LC44" class="blob-code blob-code-inner js-file-line">        <span class="pl-s1">lr</span> <span class="pl-c1">=</span> <span class="pl-c1">1e-4</span></td>
        </tr>
        <tr>
          <td id="file-bengio-train-py-L45" class="blob-num js-line-number js-blob-rnum" data-line-number="45"></td>
          <td id="file-bengio-train-py-LC45" class="blob-code blob-code-inner js-file-line">
</td>
        </tr>
        <tr>
          <td id="file-bengio-train-py-L46" class="blob-num js-line-number js-blob-rnum" data-line-number="46"></td>
          <td id="file-bengio-train-py-LC46" class="blob-code blob-code-inner js-file-line">    <span class="pl-c"># Update params</span></td>
        </tr>
        <tr>
          <td id="file-bengio-train-py-L47" class="blob-num js-line-number js-blob-rnum" data-line-number="47"></td>
          <td id="file-bengio-train-py-LC47" class="blob-code blob-code-inner js-file-line">    <span class="pl-k">for</span> <span class="pl-s1">p</span> <span class="pl-c1">in</span> <span class="pl-s1">params</span>:</td>
        </tr>
        <tr>
          <td id="file-bengio-train-py-L48" class="blob-num js-line-number js-blob-rnum" data-line-number="48"></td>
          <td id="file-bengio-train-py-LC48" class="blob-code blob-code-inner js-file-line">        <span class="pl-s1">p</span>.<span class="pl-c1">data</span> <span class="pl-c1">+=</span> <span class="pl-c1">-</span><span class="pl-s1">lr</span> <span class="pl-c1">*</span> <span class="pl-s1">p</span>.<span class="pl-c1">grad</span></td>
        </tr>
        <tr>
          <td id="file-bengio-train-py-L49" class="blob-num js-line-number js-blob-rnum" data-line-number="49"></td>
          <td id="file-bengio-train-py-LC49" class="blob-code blob-code-inner js-file-line">
</td>
        </tr>
        <tr>
          <td id="file-bengio-train-py-L50" class="blob-num js-line-number js-blob-rnum" data-line-number="50"></td>
          <td id="file-bengio-train-py-LC50" class="blob-code blob-code-inner js-file-line">    <span class="pl-c"># Validation</span></td>
        </tr>
        <tr>
          <td id="file-bengio-train-py-L51" class="blob-num js-line-number js-blob-rnum" data-line-number="51"></td>
          <td id="file-bengio-train-py-LC51" class="blob-code blob-code-inner js-file-line">    <span class="pl-k">with</span> <span class="pl-s1">torch</span>.<span class="pl-c1">no_grad</span>():</td>
        </tr>
        <tr>
          <td id="file-bengio-train-py-L52" class="blob-num js-line-number js-blob-rnum" data-line-number="52"></td>
          <td id="file-bengio-train-py-LC52" class="blob-code blob-code-inner js-file-line">        <span class="pl-s1">ix</span> <span class="pl-c1">=</span> <span class="pl-s1">torch</span>.<span class="pl-c1">randint</span>(<span class="pl-c1">0</span>, <span class="pl-v">X_val</span>.<span class="pl-c1">shape</span>[<span class="pl-c1">0</span>], (<span class="pl-s1">batch_size</span>,), <span class="pl-s1">generator</span><span class="pl-c1">=</span><span class="pl-s1">g</span>, <span class="pl-s1">device</span><span class="pl-c1">=</span><span class="pl-s1">device</span>)</td>
        </tr>
        <tr>
          <td id="file-bengio-train-py-L53" class="blob-num js-line-number js-blob-rnum" data-line-number="53"></td>
          <td id="file-bengio-train-py-LC53" class="blob-code blob-code-inner js-file-line">        <span class="pl-s1">x</span> <span class="pl-c1">=</span> <span class="pl-v">X_val</span>[<span class="pl-s1">ix</span>]</td>
        </tr>
        <tr>
          <td id="file-bengio-train-py-L54" class="blob-num js-line-number js-blob-rnum" data-line-number="54"></td>
          <td id="file-bengio-train-py-LC54" class="blob-code blob-code-inner js-file-line">
</td>
        </tr>
        <tr>
          <td id="file-bengio-train-py-L55" class="blob-num js-line-number js-blob-rnum" data-line-number="55"></td>
          <td id="file-bengio-train-py-LC55" class="blob-code blob-code-inner js-file-line">        <span class="pl-k">for</span> <span class="pl-s1">layer</span> <span class="pl-c1">in</span> <span class="pl-s1">model</span>:</td>
        </tr>
        <tr>
          <td id="file-bengio-train-py-L56" class="blob-num js-line-number js-blob-rnum" data-line-number="56"></td>
          <td id="file-bengio-train-py-LC56" class="blob-code blob-code-inner js-file-line">            <span class="pl-s1">x</span> <span class="pl-c1">=</span> <span class="pl-en">layer</span>(<span class="pl-s1">x</span>)</td>
        </tr>
        <tr>
          <td id="file-bengio-train-py-L57" class="blob-num js-line-number js-blob-rnum" data-line-number="57"></td>
          <td id="file-bengio-train-py-LC57" class="blob-code blob-code-inner js-file-line">
</td>
        </tr>
        <tr>
          <td id="file-bengio-train-py-L58" class="blob-num js-line-number js-blob-rnum" data-line-number="58"></td>
          <td id="file-bengio-train-py-LC58" class="blob-code blob-code-inner js-file-line">        <span class="pl-s1">loss</span> <span class="pl-c1">=</span> <span class="pl-c1">F</span>.<span class="pl-c1">cross_entropy</span>(<span class="pl-s1">x</span>, <span class="pl-v">Y_val</span>[<span class="pl-s1">ix</span>])</td>
        </tr>
        <tr>
          <td id="file-bengio-train-py-L59" class="blob-num js-line-number js-blob-rnum" data-line-number="59"></td>
          <td id="file-bengio-train-py-LC59" class="blob-code blob-code-inner js-file-line">        <span class="pl-s1">val_loss</span>.<span class="pl-c1">append</span>(<span class="pl-s1">loss</span>.<span class="pl-c1">item</span>())</td>
        </tr>
        <tr>
          <td id="file-bengio-train-py-L60" class="blob-num js-line-number js-blob-rnum" data-line-number="60"></td>
          <td id="file-bengio-train-py-LC60" class="blob-code blob-code-inner js-file-line">
</td>
        </tr>
        <tr>
          <td id="file-bengio-train-py-L61" class="blob-num js-line-number js-blob-rnum" data-line-number="61"></td>
          <td id="file-bengio-train-py-LC61" class="blob-code blob-code-inner js-file-line"><span class="pl-en">plot_loss</span>(<span class="pl-s1">trn_loss</span>, <span class="pl-s1">val_loss</span>)</td>
        </tr>
  </tbody></table>
</div>


    </div>

  </div>
</div>

      </div>
      <div class="gist-meta">
        <a href="https://gist.github.com/cjams/fb39affb5da54656f4587b0391f02cf3/raw/9effeab90e91989621a366d9889db2d865867a97/bengio-train.py" style="float:right" class="Link--inTextBlock">view raw</a>
        <a href="https://gist.github.com/cjams/fb39affb5da54656f4587b0391f02cf3#file-bengio-train-py" class="Link--inTextBlock">
          bengio-train.py
        </a>
        hosted with &#10084; by <a class="Link--inTextBlock" href="https://github.com">GitHub</a>
      </div>
    </div>
</div>
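<p>The loop above tracks the cross-entropy of the model&#8217;s softmax output on held-out data. A useful sanity check on those numbers: an untrained classifier spreads probability roughly uniformly over the vocabulary, so its expected cross-entropy is <code>ln(num_classes)</code>, and the corresponding perplexity is simply the vocabulary size. A minimal sketch (assuming the 175-character vocabulary used for this dataset):</p>

```python
import math

# An untrained softmax classifier assigns roughly uniform probability to
# every class, so its expected cross-entropy is ln(num_classes).
num_classes = 175  # character vocabulary size assumed for this dataset
initial_loss = math.log(num_classes)
print(f"expected initial loss: {initial_loss:.2f}")  # ~5.16

# Perplexity is the exponential of the cross-entropy: for a uniform
# model it is simply the number of classes.
perplexity = math.exp(initial_loss)
print(f"perplexity of uniform model: {perplexity:.0f}")
```

<p>If the observed initial loss sits far above this value, the output layer is starting out overconfident, which is an initialization issue rather than a modeling one.</p>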
</div><h3>Measuring Performance</h3><p>When we run this, it results in the following loss curve:</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!QmMg!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F745a63b2-102a-41eb-8830-b4851d246f57_1000x600.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!QmMg!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F745a63b2-102a-41eb-8830-b4851d246f57_1000x600.png 424w, https://substackcdn.com/image/fetch/$s_!QmMg!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F745a63b2-102a-41eb-8830-b4851d246f57_1000x600.png 848w, https://substackcdn.com/image/fetch/$s_!QmMg!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F745a63b2-102a-41eb-8830-b4851d246f57_1000x600.png 1272w, https://substackcdn.com/image/fetch/$s_!QmMg!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F745a63b2-102a-41eb-8830-b4851d246f57_1000x600.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!QmMg!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F745a63b2-102a-41eb-8830-b4851d246f57_1000x600.png" width="1000" height="600" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/745a63b2-102a-41eb-8830-b4851d246f57_1000x600.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:600,&quot;width&quot;:1000,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:30651,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://connorjdavis.substack.com/i/183082981?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F745a63b2-102a-41eb-8830-b4851d246f57_1000x600.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!QmMg!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F745a63b2-102a-41eb-8830-b4851d246f57_1000x600.png 424w, https://substackcdn.com/image/fetch/$s_!QmMg!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F745a63b2-102a-41eb-8830-b4851d246f57_1000x600.png 848w, https://substackcdn.com/image/fetch/$s_!QmMg!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F745a63b2-102a-41eb-8830-b4851d246f57_1000x600.png 1272w, https://substackcdn.com/image/fetch/$s_!QmMg!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F745a63b2-102a-41eb-8830-b4851d246f57_1000x600.png 1456w" sizes="100vw" loading="lazy"></picture></div></a></figure></div><p>We can see the initial loss is pretty high. Since we are doing softmax classification, an untrained model should predict roughly uniformly over the vocabulary, giving an initial loss of around <code>ln(num_classes) = ln(175) = 5.16</code>. We can address this in a later post with proper initialization. The final loss is a little over 5.</p><p>The key performance metric for this task is perplexity. Intuitively, perplexity measures how surprised the model is when faced with a given character, conditioned on the distribution it has learned from the training data. Higher perplexity implies higher confusion. Mathematically, perplexity is the exponential of the cross-entropy, and the cross-entropy in turn is the expected surprisal.
Here is some code for calculating perplexity: </p><div class="github-gist" data-attrs="{&quot;innerHTML&quot;:&quot;<div id=\&quot;gist144082505\&quot; class=\&quot;gist\&quot;>\n    <div class=\&quot;gist-file\&quot; translate=\&quot;no\&quot; data-color-mode=\&quot;light\&quot; data-light-theme=\&quot;light\&quot;>\n      <div class=\&quot;gist-data\&quot;>\n        <div class=\&quot;js-gist-file-update-container js-task-list-container\&quot;>\n  <div id=\&quot;file-bengio-perplexity-py\&quot; class=\&quot;file my-2\&quot;>\n    \n    <div itemprop=\&quot;text\&quot;\n      class=\&quot;Box-body p-0 blob-wrapper data type-python  \&quot;\n      style=\&quot;overflow: auto\&quot; tabindex=\&quot;0\&quot; role=\&quot;region\&quot;\n      aria-label=\&quot;bengio-perplexity.py content, created by cjams on 07:45PM today.\&quot;\n    >\n\n        \n<div class=\&quot;js-check-hidden-unicode js-blob-code-container blob-code-content\&quot;>\n\n  <template class=\&quot;js-file-alert-template\&quot;>\n  <div data-view-component=\&quot;true\&quot; class=\&quot;flash flash-warn flash-full d-flex flex-items-center\&quot;>\n  <svg aria-hidden=\&quot;true\&quot; height=\&quot;16\&quot; viewBox=\&quot;0 0 16 16\&quot; version=\&quot;1.1\&quot; width=\&quot;16\&quot; data-view-component=\&quot;true\&quot; class=\&quot;octicon octicon-alert\&quot;>\n    <path d=\&quot;M6.457 1.047c.659-1.234 2.427-1.234 3.086 0l6.082 11.378A1.75 1.75 0 0 1 14.082 15H1.918a1.75 1.75 0 0 1-1.543-2.575Zm1.763.707a.25.25 0 0 0-.44 0L1.698 13.132a.25.25 0 0 0 .22.368h12.164a.25.25 0 0 0 .22-.368Zm.53 3.996v2.5a.75.75 0 0 1-1.5 0v-2.5a.75.75 0 0 1 1.5 0ZM9 11a1 1 0 1 1-2 0 1 1 0 0 1 2 0Z\&quot;></path>\n</svg>\n    <span>\n      This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. 
To review, open the file in an editor that reveals hidden Unicode characters.\n      <a class=\&quot;Link--inTextBlock\&quot; href=\&quot;https://github.co/hiddenchars\&quot; target=\&quot;_blank\&quot;>Learn more about bidirectional Unicode characters</a>\n    </span>\n\n\n  <div data-view-component=\&quot;true\&quot; class=\&quot;flash-action\&quot;>        <a href=\&quot;{{ revealButtonHref }}\&quot; data-view-component=\&quot;true\&quot; class=\&quot;btn-sm btn\&quot;>    Show hidden characters\n</a>\n</div>\n</div></template>\n<template class=\&quot;js-line-alert-template\&quot;>\n  <span aria-label=\&quot;This line has hidden Unicode characters\&quot; data-view-component=\&quot;true\&quot; class=\&quot;line-alert tooltipped tooltipped-e\&quot;>\n    <svg aria-hidden=\&quot;true\&quot; height=\&quot;16\&quot; viewBox=\&quot;0 0 16 16\&quot; version=\&quot;1.1\&quot; width=\&quot;16\&quot; data-view-component=\&quot;true\&quot; class=\&quot;octicon octicon-alert\&quot;>\n    <path d=\&quot;M6.457 1.047c.659-1.234 2.427-1.234 3.086 0l6.082 11.378A1.75 1.75 0 0 1 14.082 15H1.918a1.75 1.75 0 0 1-1.543-2.575Zm1.763.707a.25.25 0 0 0-.44 0L1.698 13.132a.25.25 0 0 0 .22.368h12.164a.25.25 0 0 0 .22-.368Zm.53 3.996v2.5a.75.75 0 0 1-1.5 0v-2.5a.75.75 0 0 1 1.5 0ZM9 11a1 1 0 1 1-2 0 1 1 0 0 1 2 0Z\&quot;></path>\n</svg>\n</span></template>\n\n  <table data-hpc class=\&quot;highlight tab-size js-file-line-container\&quot; data-tab-size=\&quot;4\&quot; data-paste-markdown-skip data-tagsearch-path=\&quot;bengio-perplexity.py\&quot;>\n        <tr>\n          <td id=\&quot;file-bengio-perplexity-py-L1\&quot; class=\&quot;blob-num js-line-number js-blob-rnum\&quot; data-line-number=\&quot;1\&quot;></td>\n          <td id=\&quot;file-bengio-perplexity-py-LC1\&quot; class=\&quot;blob-code blob-code-inner js-file-line\&quot;><span class=pl-c># Computing perplexity</span></td>\n        </tr>\n        <tr>\n          <td id=\&quot;file-bengio-perplexity-py-L2\&quot; 
class=\&quot;blob-num js-line-number js-blob-rnum\&quot; data-line-number=\&quot;2\&quot;></td>\n          <td id=\&quot;file-bengio-perplexity-py-LC2\&quot; class=\&quot;blob-code blob-code-inner js-file-line\&quot;><span class=pl-k>with</span> <span class=pl-s1>torch</span>.<span class=pl-c1>no_grad</span>():</td>\n        </tr>\n        <tr>\n          <td id=\&quot;file-bengio-perplexity-py-L3\&quot; class=\&quot;blob-num js-line-number js-blob-rnum\&quot; data-line-number=\&quot;3\&quot;></td>\n          <td id=\&quot;file-bengio-perplexity-py-LC3\&quot; class=\&quot;blob-code blob-code-inner js-file-line\&quot;>    <span class=pl-s1>test_strs</span> <span class=pl-c1>=</span> <span class=pl-s1>data_val</span>[<span class=pl-c1>21800</span>:]</td>\n        </tr>\n        <tr>\n          <td id=\&quot;file-bengio-perplexity-py-L4\&quot; class=\&quot;blob-num js-line-number js-blob-rnum\&quot; data-line-number=\&quot;4\&quot;></td>\n          <td id=\&quot;file-bengio-perplexity-py-LC4\&quot; class=\&quot;blob-code blob-code-inner js-file-line\&quot;>    <span class=pl-s1>total_nll</span> <span class=pl-c1>=</span> <span class=pl-c1>0.0</span></td>\n        </tr>\n        <tr>\n          <td id=\&quot;file-bengio-perplexity-py-L5\&quot; class=\&quot;blob-num js-line-number js-blob-rnum\&quot; data-line-number=\&quot;5\&quot;></td>\n          <td id=\&quot;file-bengio-perplexity-py-LC5\&quot; class=\&quot;blob-code blob-code-inner js-file-line\&quot;>    <span class=pl-s1>total_tokens</span> <span class=pl-c1>=</span> <span class=pl-c1>0</span></td>\n        </tr>\n        <tr>\n          <td id=\&quot;file-bengio-perplexity-py-L6\&quot; class=\&quot;blob-num js-line-number js-blob-rnum\&quot; data-line-number=\&quot;6\&quot;></td>\n          <td id=\&quot;file-bengio-perplexity-py-LC6\&quot; class=\&quot;blob-code blob-code-inner js-file-line\&quot;>\n</td>\n        </tr>\n        <tr>\n          <td id=\&quot;file-bengio-perplexity-py-L7\&quot; 
class=\&quot;blob-num js-line-number js-blob-rnum\&quot; data-line-number=\&quot;7\&quot;></td>\n          <td id=\&quot;file-bengio-perplexity-py-LC7\&quot; class=\&quot;blob-code blob-code-inner js-file-line\&quot;>    <span class=pl-k>for</span> <span class=pl-s1>test_str</span> <span class=pl-c1>in</span> <span class=pl-s1>test_strs</span>:</td>\n        </tr>\n        <tr>\n          <td id=\&quot;file-bengio-perplexity-py-L8\&quot; class=\&quot;blob-num js-line-number js-blob-rnum\&quot; data-line-number=\&quot;8\&quot;></td>\n          <td id=\&quot;file-bengio-perplexity-py-LC8\&quot; class=\&quot;blob-code blob-code-inner js-file-line\&quot;>        <span class=pl-s1>seq_nll</span> <span class=pl-c1>=</span> <span class=pl-c1>0.0</span></td>\n        </tr>\n        <tr>\n          <td id=\&quot;file-bengio-perplexity-py-L9\&quot; class=\&quot;blob-num js-line-number js-blob-rnum\&quot; data-line-number=\&quot;9\&quot;></td>\n          <td id=\&quot;file-bengio-perplexity-py-LC9\&quot; class=\&quot;blob-code blob-code-inner js-file-line\&quot;>        <span class=pl-s1>ctx</span> <span class=pl-c1>=</span> [<span class=pl-c1>0</span>] <span class=pl-c1>*</span> <span class=pl-s1>ctx_window</span></td>\n        </tr>\n        <tr>\n          <td id=\&quot;file-bengio-perplexity-py-L10\&quot; class=\&quot;blob-num js-line-number js-blob-rnum\&quot; data-line-number=\&quot;10\&quot;></td>\n          <td id=\&quot;file-bengio-perplexity-py-LC10\&quot; class=\&quot;blob-code blob-code-inner js-file-line\&quot;>\n</td>\n        </tr>\n        <tr>\n          <td id=\&quot;file-bengio-perplexity-py-L11\&quot; class=\&quot;blob-num js-line-number js-blob-rnum\&quot; data-line-number=\&quot;11\&quot;></td>\n          <td id=\&quot;file-bengio-perplexity-py-LC11\&quot; class=\&quot;blob-code blob-code-inner js-file-line\&quot;>        <span class=pl-k>for</span> <span class=pl-s1>idx</span>, <span class=pl-s1>c</span> <span class=pl-c1>in</span> <span 
class=pl-en>enumerate</span>(<span class=pl-s1>test_str</span>):</td>\n        </tr>\n        <tr>\n          <td id=\&quot;file-bengio-perplexity-py-L12\&quot; class=\&quot;blob-num js-line-number js-blob-rnum\&quot; data-line-number=\&quot;12\&quot;></td>\n          <td id=\&quot;file-bengio-perplexity-py-LC12\&quot; class=\&quot;blob-code blob-code-inner js-file-line\&quot;>            <span class=pl-s1>x</span> <span class=pl-c1>=</span> <span class=pl-s1>torch</span>.<span class=pl-c1>tensor</span>([<span class=pl-s1>ctx</span>], <span class=pl-s1>device</span><span class=pl-c1>=</span><span class=pl-s1>device</span>)</td>\n        </tr>\n        <tr>\n          <td id=\&quot;file-bengio-perplexity-py-L13\&quot; class=\&quot;blob-num js-line-number js-blob-rnum\&quot; data-line-number=\&quot;13\&quot;></td>\n          <td id=\&quot;file-bengio-perplexity-py-LC13\&quot; class=\&quot;blob-code blob-code-inner js-file-line\&quot;>\n</td>\n        </tr>\n        <tr>\n          <td id=\&quot;file-bengio-perplexity-py-L14\&quot; class=\&quot;blob-num js-line-number js-blob-rnum\&quot; data-line-number=\&quot;14\&quot;></td>\n          <td id=\&quot;file-bengio-perplexity-py-LC14\&quot; class=\&quot;blob-code blob-code-inner js-file-line\&quot;>            <span class=pl-k>for</span> <span class=pl-s1>layer</span> <span class=pl-c1>in</span> <span class=pl-s1>model</span>:</td>\n        </tr>\n        <tr>\n          <td id=\&quot;file-bengio-perplexity-py-L15\&quot; class=\&quot;blob-num js-line-number js-blob-rnum\&quot; data-line-number=\&quot;15\&quot;></td>\n          <td id=\&quot;file-bengio-perplexity-py-LC15\&quot; class=\&quot;blob-code blob-code-inner js-file-line\&quot;>                <span class=pl-s1>x</span> <span class=pl-c1>=</span> <span class=pl-en>layer</span>(<span class=pl-s1>x</span>)</td>\n        </tr>\n        <tr>\n          <td id=\&quot;file-bengio-perplexity-py-L16\&quot; class=\&quot;blob-num js-line-number js-blob-rnum\&quot; 
data-line-number=\&quot;16\&quot;></td>\n          <td id=\&quot;file-bengio-perplexity-py-LC16\&quot; class=\&quot;blob-code blob-code-inner js-file-line\&quot;>\n</td>\n        </tr>\n        <tr>\n          <td id=\&quot;file-bengio-perplexity-py-L17\&quot; class=\&quot;blob-num js-line-number js-blob-rnum\&quot; data-line-number=\&quot;17\&quot;></td>\n          <td id=\&quot;file-bengio-perplexity-py-LC17\&quot; class=\&quot;blob-code blob-code-inner js-file-line\&quot;>            <span class=pl-s1>logits</span> <span class=pl-c1>=</span> <span class=pl-s1>x</span></td>\n        </tr>\n        <tr>\n          <td id=\&quot;file-bengio-perplexity-py-L18\&quot; class=\&quot;blob-num js-line-number js-blob-rnum\&quot; data-line-number=\&quot;18\&quot;></td>\n          <td id=\&quot;file-bengio-perplexity-py-LC18\&quot; class=\&quot;blob-code blob-code-inner js-file-line\&quot;>            <span class=pl-s1>counts</span> <span class=pl-c1>=</span> <span class=pl-s1>logits</span>.<span class=pl-c1>exp</span>()</td>\n        </tr>\n        <tr>\n          <td id=\&quot;file-bengio-perplexity-py-L19\&quot; class=\&quot;blob-num js-line-number js-blob-rnum\&quot; data-line-number=\&quot;19\&quot;></td>\n          <td id=\&quot;file-bengio-perplexity-py-LC19\&quot; class=\&quot;blob-code blob-code-inner js-file-line\&quot;>            <span class=pl-s1>probs</span> <span class=pl-c1>=</span> <span class=pl-s1>counts</span> <span class=pl-c1>/</span> <span class=pl-s1>counts</span>.<span class=pl-c1>sum</span>(<span class=pl-s1>dim</span><span class=pl-c1>=</span><span class=pl-c1>1</span>, <span class=pl-s1>keepdim</span><span class=pl-c1>=</span><span class=pl-c1>True</span>)</td>\n        </tr>\n        <tr>\n          <td id=\&quot;file-bengio-perplexity-py-L20\&quot; class=\&quot;blob-num js-line-number js-blob-rnum\&quot; data-line-number=\&quot;20\&quot;></td>\n          <td id=\&quot;file-bengio-perplexity-py-LC20\&quot; class=\&quot;blob-code blob-code-inner 
js-file-line\&quot;>            <span class=pl-s1>log_prob</span> <span class=pl-c1>=</span> <span class=pl-s1>probs</span>[<span class=pl-c1>0</span>][<span class=pl-s1>stoi</span>[<span class=pl-s1>c</span>]].<span class=pl-c1>log</span>()</td>\n        </tr>\n        <tr>\n          <td id=\&quot;file-bengio-perplexity-py-L21\&quot; class=\&quot;blob-num js-line-number js-blob-rnum\&quot; data-line-number=\&quot;21\&quot;></td>\n          <td id=\&quot;file-bengio-perplexity-py-LC21\&quot; class=\&quot;blob-code blob-code-inner js-file-line\&quot;>            <span class=pl-s1>seq_nll</span> <span class=pl-c1>-=</span> <span class=pl-s1>log_prob</span>.<span class=pl-c1>item</span>()</td>\n        </tr>\n        <tr>\n          <td id=\&quot;file-bengio-perplexity-py-L22\&quot; class=\&quot;blob-num js-line-number js-blob-rnum\&quot; data-line-number=\&quot;22\&quot;></td>\n          <td id=\&quot;file-bengio-perplexity-py-LC22\&quot; class=\&quot;blob-code blob-code-inner js-file-line\&quot;>            <span class=pl-s1>ctx</span> <span class=pl-c1>=</span> <span class=pl-s1>ctx</span>[<span class=pl-c1>1</span>:] <span class=pl-c1>+</span> [<span class=pl-s1>stoi</span>[<span class=pl-s1>c</span>]]</td>\n        </tr>\n        <tr>\n          <td id=\&quot;file-bengio-perplexity-py-L23\&quot; class=\&quot;blob-num js-line-number js-blob-rnum\&quot; data-line-number=\&quot;23\&quot;></td>\n          <td id=\&quot;file-bengio-perplexity-py-LC23\&quot; class=\&quot;blob-code blob-code-inner js-file-line\&quot;>\n</td>\n        </tr>\n        <tr>\n          <td id=\&quot;file-bengio-perplexity-py-L24\&quot; class=\&quot;blob-num js-line-number js-blob-rnum\&quot; data-line-number=\&quot;24\&quot;></td>\n          <td id=\&quot;file-bengio-perplexity-py-LC24\&quot; class=\&quot;blob-code blob-code-inner js-file-line\&quot;>        <span class=pl-s1>total_nll</span> <span class=pl-c1>+=</span> <span class=pl-s1>seq_nll</span></td>\n        </tr>\n        <tr>\n    
      <td id=\&quot;file-bengio-perplexity-py-L25\&quot; class=\&quot;blob-num js-line-number js-blob-rnum\&quot; data-line-number=\&quot;25\&quot;></td>\n          <td id=\&quot;file-bengio-perplexity-py-LC25\&quot; class=\&quot;blob-code blob-code-inner js-file-line\&quot;>        <span class=pl-s1>total_tokens</span> <span class=pl-c1>+=</span> <span class=pl-en>len</span>(<span class=pl-s1>test_str</span>)</td>\n        </tr>\n        <tr>\n          <td id=\&quot;file-bengio-perplexity-py-L26\&quot; class=\&quot;blob-num js-line-number js-blob-rnum\&quot; data-line-number=\&quot;26\&quot;></td>\n          <td id=\&quot;file-bengio-perplexity-py-LC26\&quot; class=\&quot;blob-code blob-code-inner js-file-line\&quot;>\n</td>\n        </tr>\n        <tr>\n          <td id=\&quot;file-bengio-perplexity-py-L27\&quot; class=\&quot;blob-num js-line-number js-blob-rnum\&quot; data-line-number=\&quot;27\&quot;></td>\n          <td id=\&quot;file-bengio-perplexity-py-LC27\&quot; class=\&quot;blob-code blob-code-inner js-file-line\&quot;>    <span class=pl-k>print</span>(<span class=pl-s1>f</span>&#8221;<span class=pl-v>Perplexity</span>: {<span class=pl-s1>torch</span>.<span class=pl-c1>exp</span>(<span class=pl-s1>torch</span>.<span class=pl-c1>tensor</span>([<span class=pl-s1>total_nll</span> <span class=pl-c1>/</span> <span class=pl-s1>total_tokens</span>]))}&#8221;)</td>\n        </tr>\n  </table>\n</div>\n\n\n    </div>\n\n  </div>\n</div>\n\n      </div>\n      <div class=\&quot;gist-meta\&quot;>\n        <a href=\&quot;https://gist.github.com/cjams/ae4e7ad59c1e98fd6e99587aa7781a97/raw/17ad45666664e67ff75223196e4e7e074dda5a80/bengio-perplexity.py\&quot; style=\&quot;float:right\&quot; class=\&quot;Link--inTextBlock\&quot;>view raw</a>\n        <a href=\&quot;https://gist.github.com/cjams/ae4e7ad59c1e98fd6e99587aa7781a97#file-bengio-perplexity-py\&quot; class=\&quot;Link--inTextBlock\&quot;>\n          bengio-perplexity.py\n        </a>\n        hosted with 
&amp;#10084; by <a class=\&quot;Link--inTextBlock\&quot; href=\&quot;https://github.com\&quot;>GitHub</a>\n      </div>\n    </div>\n</div>\n&quot;,&quot;stylesheet&quot;:&quot;https://github.githubassets.com/assets/gist-embed-ed91f9610ae6.css&quot;}" data-component-name="GitgistToDOM"><link rel="stylesheet" href="https://github.githubassets.com/assets/gist-embed-ed91f9610ae6.css"><div id="gist144082505" class="gist">
    <div class="gist-file" data-color-mode="light" data-light-theme="light">
      <div class="gist-data">
        <div class="js-gist-file-update-container js-task-list-container">
  <div id="file-bengio-perplexity-py" class="file my-2">
    
    <div itemprop="text" class="Box-body p-0 blob-wrapper data type-python  " style="overflow:auto">

        
<div class="js-check-hidden-unicode js-blob-code-container blob-code-content">

  


  <table data-hpc="" class="highlight tab-size js-file-line-container" data-tab-size="4" data-paste-markdown-skip="" data-tagsearch-path="bengio-perplexity.py">
        <tbody><tr>
          <td id="file-bengio-perplexity-py-L1" class="blob-num js-line-number js-blob-rnum" data-line-number="1"></td>
          <td id="file-bengio-perplexity-py-LC1" class="blob-code blob-code-inner js-file-line"><span class="pl-c"># Computing perplexity</span></td>
        </tr>
        <tr>
          <td id="file-bengio-perplexity-py-L2" class="blob-num js-line-number js-blob-rnum" data-line-number="2"></td>
          <td id="file-bengio-perplexity-py-LC2" class="blob-code blob-code-inner js-file-line"><span class="pl-k">with</span> <span class="pl-s1">torch</span>.<span class="pl-c1">no_grad</span>():</td>
        </tr>
        <tr>
          <td id="file-bengio-perplexity-py-L3" class="blob-num js-line-number js-blob-rnum" data-line-number="3"></td>
          <td id="file-bengio-perplexity-py-LC3" class="blob-code blob-code-inner js-file-line">    <span class="pl-s1">test_strs</span> <span class="pl-c1">=</span> <span class="pl-s1">data_val</span>[<span class="pl-c1">21800</span>:]</td>
        </tr>
        <tr>
          <td id="file-bengio-perplexity-py-L4" class="blob-num js-line-number js-blob-rnum" data-line-number="4"></td>
          <td id="file-bengio-perplexity-py-LC4" class="blob-code blob-code-inner js-file-line">    <span class="pl-s1">total_nll</span> <span class="pl-c1">=</span> <span class="pl-c1">0.0</span></td>
        </tr>
        <tr>
          <td id="file-bengio-perplexity-py-L5" class="blob-num js-line-number js-blob-rnum" data-line-number="5"></td>
          <td id="file-bengio-perplexity-py-LC5" class="blob-code blob-code-inner js-file-line">    <span class="pl-s1">total_tokens</span> <span class="pl-c1">=</span> <span class="pl-c1">0</span></td>
        </tr>
        <tr>
          <td id="file-bengio-perplexity-py-L6" class="blob-num js-line-number js-blob-rnum" data-line-number="6"></td>
          <td id="file-bengio-perplexity-py-LC6" class="blob-code blob-code-inner js-file-line">
</td>
        </tr>
        <tr>
          <td id="file-bengio-perplexity-py-L7" class="blob-num js-line-number js-blob-rnum" data-line-number="7"></td>
          <td id="file-bengio-perplexity-py-LC7" class="blob-code blob-code-inner js-file-line">    <span class="pl-k">for</span> <span class="pl-s1">test_str</span> <span class="pl-c1">in</span> <span class="pl-s1">test_strs</span>:</td>
        </tr>
        <tr>
          <td id="file-bengio-perplexity-py-L8" class="blob-num js-line-number js-blob-rnum" data-line-number="8"></td>
          <td id="file-bengio-perplexity-py-LC8" class="blob-code blob-code-inner js-file-line">        <span class="pl-s1">seq_nll</span> <span class="pl-c1">=</span> <span class="pl-c1">0.0</span></td>
        </tr>
        <tr>
          <td id="file-bengio-perplexity-py-L9" class="blob-num js-line-number js-blob-rnum" data-line-number="9"></td>
          <td id="file-bengio-perplexity-py-LC9" class="blob-code blob-code-inner js-file-line">        <span class="pl-s1">ctx</span> <span class="pl-c1">=</span> [<span class="pl-c1">0</span>] <span class="pl-c1">*</span> <span class="pl-s1">ctx_window</span></td>
        </tr>
        <tr>
          <td id="file-bengio-perplexity-py-L10" class="blob-num js-line-number js-blob-rnum" data-line-number="10"></td>
          <td id="file-bengio-perplexity-py-LC10" class="blob-code blob-code-inner js-file-line">
</td>
        </tr>
        <tr>
          <td id="file-bengio-perplexity-py-L11" class="blob-num js-line-number js-blob-rnum" data-line-number="11"></td>
          <td id="file-bengio-perplexity-py-LC11" class="blob-code blob-code-inner js-file-line">        <span class="pl-k">for</span> <span class="pl-s1">idx</span>, <span class="pl-s1">c</span> <span class="pl-c1">in</span> <span class="pl-en">enumerate</span>(<span class="pl-s1">test_str</span>):</td>
        </tr>
        <tr>
          <td id="file-bengio-perplexity-py-L12" class="blob-num js-line-number js-blob-rnum" data-line-number="12"></td>
          <td id="file-bengio-perplexity-py-LC12" class="blob-code blob-code-inner js-file-line">            <span class="pl-s1">x</span> <span class="pl-c1">=</span> <span class="pl-s1">torch</span>.<span class="pl-c1">tensor</span>([<span class="pl-s1">ctx</span>], <span class="pl-s1">device</span><span class="pl-c1">=</span><span class="pl-s1">device</span>)</td>
        </tr>
        <tr>
          <td id="file-bengio-perplexity-py-L13" class="blob-num js-line-number js-blob-rnum" data-line-number="13"></td>
          <td id="file-bengio-perplexity-py-LC13" class="blob-code blob-code-inner js-file-line">
</td>
        </tr>
        <tr>
          <td id="file-bengio-perplexity-py-L14" class="blob-num js-line-number js-blob-rnum" data-line-number="14"></td>
          <td id="file-bengio-perplexity-py-LC14" class="blob-code blob-code-inner js-file-line">            <span class="pl-k">for</span> <span class="pl-s1">layer</span> <span class="pl-c1">in</span> <span class="pl-s1">model</span>:</td>
        </tr>
        <tr>
          <td id="file-bengio-perplexity-py-L15" class="blob-num js-line-number js-blob-rnum" data-line-number="15"></td>
          <td id="file-bengio-perplexity-py-LC15" class="blob-code blob-code-inner js-file-line">                <span class="pl-s1">x</span> <span class="pl-c1">=</span> <span class="pl-en">layer</span>(<span class="pl-s1">x</span>)</td>
        </tr>
        <tr>
          <td id="file-bengio-perplexity-py-L16" class="blob-num js-line-number js-blob-rnum" data-line-number="16"></td>
          <td id="file-bengio-perplexity-py-LC16" class="blob-code blob-code-inner js-file-line">
</td>
        </tr>
        <tr>
          <td id="file-bengio-perplexity-py-L17" class="blob-num js-line-number js-blob-rnum" data-line-number="17"></td>
          <td id="file-bengio-perplexity-py-LC17" class="blob-code blob-code-inner js-file-line">            <span class="pl-s1">logits</span> <span class="pl-c1">=</span> <span class="pl-s1">x</span></td>
        </tr>
        <tr>
          <td id="file-bengio-perplexity-py-L18" class="blob-num js-line-number js-blob-rnum" data-line-number="18"></td>
          <td id="file-bengio-perplexity-py-LC18" class="blob-code blob-code-inner js-file-line">            <span class="pl-s1">counts</span> <span class="pl-c1">=</span> <span class="pl-s1">logits</span>.<span class="pl-c1">exp</span>()</td>
        </tr>
        <tr>
          <td id="file-bengio-perplexity-py-L19" class="blob-num js-line-number js-blob-rnum" data-line-number="19"></td>
          <td id="file-bengio-perplexity-py-LC19" class="blob-code blob-code-inner js-file-line">            <span class="pl-s1">probs</span> <span class="pl-c1">=</span> <span class="pl-s1">counts</span> <span class="pl-c1">/</span> <span class="pl-s1">counts</span>.<span class="pl-c1">sum</span>(<span class="pl-s1">dim</span><span class="pl-c1">=</span><span class="pl-c1">1</span>, <span class="pl-s1">keepdim</span><span class="pl-c1">=</span><span class="pl-c1">True</span>)</td>
        </tr>
        <tr>
          <td id="file-bengio-perplexity-py-L20" class="blob-num js-line-number js-blob-rnum" data-line-number="20"></td>
          <td id="file-bengio-perplexity-py-LC20" class="blob-code blob-code-inner js-file-line">            <span class="pl-s1">log_prob</span> <span class="pl-c1">=</span> <span class="pl-s1">probs</span>[<span class="pl-c1">0</span>][<span class="pl-s1">stoi</span>[<span class="pl-s1">c</span>]].<span class="pl-c1">log</span>()</td>
        </tr>
        <tr>
          <td id="file-bengio-perplexity-py-L21" class="blob-num js-line-number js-blob-rnum" data-line-number="21"></td>
          <td id="file-bengio-perplexity-py-LC21" class="blob-code blob-code-inner js-file-line">            <span class="pl-s1">seq_nll</span> <span class="pl-c1">-=</span> <span class="pl-s1">log_prob</span>.<span class="pl-c1">item</span>()</td>
        </tr>
        <tr>
          <td id="file-bengio-perplexity-py-L22" class="blob-num js-line-number js-blob-rnum" data-line-number="22"></td>
          <td id="file-bengio-perplexity-py-LC22" class="blob-code blob-code-inner js-file-line">            <span class="pl-s1">ctx</span> <span class="pl-c1">=</span> <span class="pl-s1">ctx</span>[<span class="pl-c1">1</span>:] <span class="pl-c1">+</span> [<span class="pl-s1">stoi</span>[<span class="pl-s1">c</span>]]</td>
        </tr>
        <tr>
          <td id="file-bengio-perplexity-py-L23" class="blob-num js-line-number js-blob-rnum" data-line-number="23"></td>
          <td id="file-bengio-perplexity-py-LC23" class="blob-code blob-code-inner js-file-line">
</td>
        </tr>
        <tr>
          <td id="file-bengio-perplexity-py-L24" class="blob-num js-line-number js-blob-rnum" data-line-number="24"></td>
          <td id="file-bengio-perplexity-py-LC24" class="blob-code blob-code-inner js-file-line">        <span class="pl-s1">total_nll</span> <span class="pl-c1">+=</span> <span class="pl-s1">seq_nll</span></td>
        </tr>
        <tr>
          <td id="file-bengio-perplexity-py-L25" class="blob-num js-line-number js-blob-rnum" data-line-number="25"></td>
          <td id="file-bengio-perplexity-py-LC25" class="blob-code blob-code-inner js-file-line">        <span class="pl-s1">total_tokens</span> <span class="pl-c1">+=</span> <span class="pl-en">len</span>(<span class="pl-s1">test_str</span>)</td>
        </tr>
        <tr>
          <td id="file-bengio-perplexity-py-L26" class="blob-num js-line-number js-blob-rnum" data-line-number="26"></td>
          <td id="file-bengio-perplexity-py-LC26" class="blob-code blob-code-inner js-file-line">
</td>
        </tr>
        <tr>
          <td id="file-bengio-perplexity-py-L27" class="blob-num js-line-number js-blob-rnum" data-line-number="27"></td>
          <td id="file-bengio-perplexity-py-LC27" class="blob-code blob-code-inner js-file-line">    <span class="pl-k">print</span>(<span class="pl-s1">f</span>&quot;<span class="pl-v">Perplexity</span>: {<span class="pl-s1">torch</span>.<span class="pl-c1">exp</span>(<span class="pl-s1">torch</span>.<span class="pl-c1">tensor</span>([<span class="pl-s1">total_nll</span> <span class="pl-c1">/</span> <span class="pl-s1">total_tokens</span>]))}&quot;)</td>
        </tr>
  </tbody></table>
</div>


    </div>

  </div>
</div>

      </div>
      <div class="gist-meta">
        <a href="https://gist.github.com/cjams/ae4e7ad59c1e98fd6e99587aa7781a97/raw/17ad45666664e67ff75223196e4e7e074dda5a80/bengio-perplexity.py" style="float:right" class="Link--inTextBlock">view raw</a>
        <a href="https://gist.github.com/cjams/ae4e7ad59c1e98fd6e99587aa7781a97#file-bengio-perplexity-py" class="Link--inTextBlock">
          bengio-perplexity.py
        </a>
        hosted with &#10084; by <a class="Link--inTextBlock" href="https://github.com">GitHub</a>
      </div>
    </div>
</div>
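<p>The perplexity printed above is just the exponential of the mean negative log-likelihood per character. As a sanity check on the arithmetic, here is a tiny standalone sketch; the probabilities are made-up illustrative values, not outputs of the model:</p>

```python
import math

# Perplexity = exp(mean negative log-likelihood per token).
# Each entry is the (made-up) probability the model assigned
# to the true next character at that step.
probs = [0.25, 0.1, 0.5, 0.05]
total_nll = -sum(math.log(p) for p in probs)   # accumulated NLL
perplexity = math.exp(total_nll / len(probs))  # exp of the mean
print(f"Perplexity: {perplexity:.2f}")         # → Perplexity: 6.32
```

<p>Equivalently, perplexity is the inverse geometric mean of the probabilities the model assigns to the true characters, which is why a lower number means a better model.</p>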
</div><p>This simple model we&#8217;ve built gives a perplexity of 434. In the follow-up posts we will see how we can get this number down with architectural improvements.</p><p>Finally, let&#8217;s look at the quality of the story that is generated by our model. For this we will sample from the model one character at a time according to the learned distribution:</p><div class="github-gist" data-attrs="{&quot;innerHTML&quot;:&quot;<div id=\&quot;gist144082521\&quot; class=\&quot;gist\&quot;>\n    <div class=\&quot;gist-file\&quot; translate=\&quot;no\&quot; data-color-mode=\&quot;light\&quot; data-light-theme=\&quot;light\&quot;>\n      <div class=\&quot;gist-data\&quot;>\n        <div class=\&quot;js-gist-file-update-container js-task-list-container\&quot;>\n  <div id=\&quot;file-bengio-sample-py\&quot; class=\&quot;file my-2\&quot;>\n    \n    <div itemprop=\&quot;text\&quot;\n      class=\&quot;Box-body p-0 blob-wrapper data type-python  \&quot;\n      style=\&quot;overflow: auto\&quot; tabindex=\&quot;0\&quot; role=\&quot;region\&quot;\n      aria-label=\&quot;bengio-sample.py content, created by cjams on 07:46PM today.\&quot;\n    >\n\n        \n<div class=\&quot;js-check-hidden-unicode js-blob-code-container blob-code-content\&quot;>\n\n  <template class=\&quot;js-file-alert-template\&quot;>\n  <div data-view-component=\&quot;true\&quot; class=\&quot;flash flash-warn flash-full d-flex flex-items-center\&quot;>\n  <svg aria-hidden=\&quot;true\&quot; height=\&quot;16\&quot; viewBox=\&quot;0 0 16 16\&quot; version=\&quot;1.1\&quot; width=\&quot;16\&quot; data-view-component=\&quot;true\&quot; class=\&quot;octicon octicon-alert\&quot;>\n    <path d=\&quot;M6.457 1.047c.659-1.234 2.427-1.234 3.086 0l6.082 11.378A1.75 1.75 0 0 1 14.082 15H1.918a1.75 1.75 0 0 1-1.543-2.575Zm1.763.707a.25.25 0 0 0-.44 0L1.698 13.132a.25.25 0 0 0 .22.368h12.164a.25.25 0 0 0 .22-.368Zm.53 3.996v2.5a.75.75 0 0 1-1.5 0v-2.5a.75.75 0 0 1 1.5 0ZM9 11a1 1 0 1 1-2 0 1 1 0 0 1 2 
0Z\&quot;></path>\n</svg>\n    <span>\n      This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.\n      <a class=\&quot;Link--inTextBlock\&quot; href=\&quot;https://github.co/hiddenchars\&quot; target=\&quot;_blank\&quot;>Learn more about bidirectional Unicode characters</a>\n    </span>\n\n\n  <div data-view-component=\&quot;true\&quot; class=\&quot;flash-action\&quot;>        <a href=\&quot;{{ revealButtonHref }}\&quot; data-view-component=\&quot;true\&quot; class=\&quot;btn-sm btn\&quot;>    Show hidden characters\n</a>\n</div>\n</div></template>\n<template class=\&quot;js-line-alert-template\&quot;>\n  <span aria-label=\&quot;This line has hidden Unicode characters\&quot; data-view-component=\&quot;true\&quot; class=\&quot;line-alert tooltipped tooltipped-e\&quot;>\n    <svg aria-hidden=\&quot;true\&quot; height=\&quot;16\&quot; viewBox=\&quot;0 0 16 16\&quot; version=\&quot;1.1\&quot; width=\&quot;16\&quot; data-view-component=\&quot;true\&quot; class=\&quot;octicon octicon-alert\&quot;>\n    <path d=\&quot;M6.457 1.047c.659-1.234 2.427-1.234 3.086 0l6.082 11.378A1.75 1.75 0 0 1 14.082 15H1.918a1.75 1.75 0 0 1-1.543-2.575Zm1.763.707a.25.25 0 0 0-.44 0L1.698 13.132a.25.25 0 0 0 .22.368h12.164a.25.25 0 0 0 .22-.368Zm.53 3.996v2.5a.75.75 0 0 1-1.5 0v-2.5a.75.75 0 0 1 1.5 0ZM9 11a1 1 0 1 1-2 0 1 1 0 0 1 2 0Z\&quot;></path>\n</svg>\n</span></template>\n\n  <table data-hpc class=\&quot;highlight tab-size js-file-line-container\&quot; data-tab-size=\&quot;4\&quot; data-paste-markdown-skip data-tagsearch-path=\&quot;bengio-sample.py\&quot;>\n        <tr>\n          <td id=\&quot;file-bengio-sample-py-L1\&quot; class=\&quot;blob-num js-line-number js-blob-rnum\&quot; data-line-number=\&quot;1\&quot;></td>\n          <td id=\&quot;file-bengio-sample-py-LC1\&quot; class=\&quot;blob-code blob-code-inner 
js-file-line\&quot;><span class=pl-s1>story</span> <span class=pl-c1>=</span> &#8216;&#8217;</td>\n        </tr>\n        <tr>\n          <td id=\&quot;file-bengio-sample-py-L2\&quot; class=\&quot;blob-num js-line-number js-blob-rnum\&quot; data-line-number=\&quot;2\&quot;></td>\n          <td id=\&quot;file-bengio-sample-py-LC2\&quot; class=\&quot;blob-code blob-code-inner js-file-line\&quot;><span class=pl-s1>ctx</span> <span class=pl-c1>=</span> [<span class=pl-c1>0</span>] <span class=pl-c1>*</span> <span class=pl-s1>ctx_window</span> <span class=pl-c># start with context full of &#8220;special&#8221; characters</span></td>\n        </tr>\n        <tr>\n          <td id=\&quot;file-bengio-sample-py-L3\&quot; class=\&quot;blob-num js-line-number js-blob-rnum\&quot; data-line-number=\&quot;3\&quot;></td>\n          <td id=\&quot;file-bengio-sample-py-LC3\&quot; class=\&quot;blob-code blob-code-inner js-file-line\&quot;>\n</td>\n        </tr>\n        <tr>\n          <td id=\&quot;file-bengio-sample-py-L4\&quot; class=\&quot;blob-num js-line-number js-blob-rnum\&quot; data-line-number=\&quot;4\&quot;></td>\n          <td id=\&quot;file-bengio-sample-py-LC4\&quot; class=\&quot;blob-code blob-code-inner js-file-line\&quot;><span class=pl-k>while</span> <span class=pl-c1>True</span>:</td>\n        </tr>\n        <tr>\n          <td id=\&quot;file-bengio-sample-py-L5\&quot; class=\&quot;blob-num js-line-number js-blob-rnum\&quot; data-line-number=\&quot;5\&quot;></td>\n          <td id=\&quot;file-bengio-sample-py-LC5\&quot; class=\&quot;blob-code blob-code-inner js-file-line\&quot;>    <span class=pl-s1>x</span> <span class=pl-c1>=</span> <span class=pl-s1>torch</span>.<span class=pl-c1>tensor</span>([<span class=pl-s1>ctx</span>], <span class=pl-s1>device</span><span class=pl-c1>=</span><span class=pl-s1>device</span>)</td>\n        </tr>\n        <tr>\n          <td id=\&quot;file-bengio-sample-py-L6\&quot; class=\&quot;blob-num js-line-number js-blob-rnum\&quot; 
data-line-number=\&quot;6\&quot;></td>\n          <td id=\&quot;file-bengio-sample-py-LC6\&quot; class=\&quot;blob-code blob-code-inner js-file-line\&quot;>\n</td>\n        </tr>\n        <tr>\n          <td id=\&quot;file-bengio-sample-py-L7\&quot; class=\&quot;blob-num js-line-number js-blob-rnum\&quot; data-line-number=\&quot;7\&quot;></td>\n          <td id=\&quot;file-bengio-sample-py-LC7\&quot; class=\&quot;blob-code blob-code-inner js-file-line\&quot;>    <span class=pl-k>for</span> <span class=pl-s1>layer</span> <span class=pl-c1>in</span> <span class=pl-s1>model</span>:</td>\n        </tr>\n        <tr>\n          <td id=\&quot;file-bengio-sample-py-L8\&quot; class=\&quot;blob-num js-line-number js-blob-rnum\&quot; data-line-number=\&quot;8\&quot;></td>\n          <td id=\&quot;file-bengio-sample-py-LC8\&quot; class=\&quot;blob-code blob-code-inner js-file-line\&quot;>        <span class=pl-s1>x</span> <span class=pl-c1>=</span> <span class=pl-en>layer</span>(<span class=pl-s1>x</span>)</td>\n        </tr>\n        <tr>\n          <td id=\&quot;file-bengio-sample-py-L9\&quot; class=\&quot;blob-num js-line-number js-blob-rnum\&quot; data-line-number=\&quot;9\&quot;></td>\n          <td id=\&quot;file-bengio-sample-py-LC9\&quot; class=\&quot;blob-code blob-code-inner js-file-line\&quot;>\n</td>\n        </tr>\n        <tr>\n          <td id=\&quot;file-bengio-sample-py-L10\&quot; class=\&quot;blob-num js-line-number js-blob-rnum\&quot; data-line-number=\&quot;10\&quot;></td>\n          <td id=\&quot;file-bengio-sample-py-LC10\&quot; class=\&quot;blob-code blob-code-inner js-file-line\&quot;>    <span class=pl-s1>counts</span> <span class=pl-c1>=</span> <span class=pl-s1>x</span>.<span class=pl-c1>exp</span>()</td>\n        </tr>\n        <tr>\n          <td id=\&quot;file-bengio-sample-py-L11\&quot; class=\&quot;blob-num js-line-number js-blob-rnum\&quot; data-line-number=\&quot;11\&quot;></td>\n          <td id=\&quot;file-bengio-sample-py-LC11\&quot; 
class=\&quot;blob-code blob-code-inner js-file-line\&quot;>    <span class=pl-s1>probs</span> <span class=pl-c1>=</span> <span class=pl-s1>counts</span> <span class=pl-c1>/</span> <span class=pl-s1>counts</span>.<span class=pl-c1>sum</span>(<span class=pl-s1>dim</span><span class=pl-c1>=</span><span class=pl-c1>1</span>, <span class=pl-s1>keepdim</span><span class=pl-c1>=</span><span class=pl-c1>True</span>)</td>\n        </tr>\n        <tr>\n          <td id=\&quot;file-bengio-sample-py-L12\&quot; class=\&quot;blob-num js-line-number js-blob-rnum\&quot; data-line-number=\&quot;12\&quot;></td>\n          <td id=\&quot;file-bengio-sample-py-LC12\&quot; class=\&quot;blob-code blob-code-inner js-file-line\&quot;>    <span class=pl-s1>i</span> <span class=pl-c1>=</span> <span class=pl-s1>torch</span>.<span class=pl-c1>multinomial</span>(<span class=pl-s1>probs</span>, <span class=pl-s1>num_samples</span><span class=pl-c1>=</span><span class=pl-c1>1</span>, <span class=pl-s1>replacement</span><span class=pl-c1>=</span><span class=pl-c1>True</span>, <span class=pl-s1>generator</span><span class=pl-c1>=</span><span class=pl-s1>g</span>).<span class=pl-c1>item</span>()</td>\n        </tr>\n        <tr>\n          <td id=\&quot;file-bengio-sample-py-L13\&quot; class=\&quot;blob-num js-line-number js-blob-rnum\&quot; data-line-number=\&quot;13\&quot;></td>\n          <td id=\&quot;file-bengio-sample-py-LC13\&quot; class=\&quot;blob-code blob-code-inner js-file-line\&quot;>    <span class=pl-s1>s</span> <span class=pl-c1>=</span> <span class=pl-s1>itos</span>[<span class=pl-s1>i</span>]</td>\n        </tr>\n        <tr>\n          <td id=\&quot;file-bengio-sample-py-L14\&quot; class=\&quot;blob-num js-line-number js-blob-rnum\&quot; data-line-number=\&quot;14\&quot;></td>\n          <td id=\&quot;file-bengio-sample-py-LC14\&quot; class=\&quot;blob-code blob-code-inner js-file-line\&quot;>    <span class=pl-s1>story</span> <span class=pl-c1>+=</span> <span 
class=pl-s1>s</span></td>\n        </tr>\n        <tr>\n          <td id=\&quot;file-bengio-sample-py-L15\&quot; class=\&quot;blob-num js-line-number js-blob-rnum\&quot; data-line-number=\&quot;15\&quot;></td>\n          <td id=\&quot;file-bengio-sample-py-LC15\&quot; class=\&quot;blob-code blob-code-inner js-file-line\&quot;>    <span class=pl-s1>ctx</span> <span class=pl-c1>=</span> <span class=pl-s1>ctx</span>[<span class=pl-c1>1</span>:] <span class=pl-c1>+</span> [<span class=pl-s1>i</span>]</td>\n        </tr>\n        <tr>\n          <td id=\&quot;file-bengio-sample-py-L16\&quot; class=\&quot;blob-num js-line-number js-blob-rnum\&quot; data-line-number=\&quot;16\&quot;></td>\n          <td id=\&quot;file-bengio-sample-py-LC16\&quot; class=\&quot;blob-code blob-code-inner js-file-line\&quot;>\n</td>\n        </tr>\n        <tr>\n          <td id=\&quot;file-bengio-sample-py-L17\&quot; class=\&quot;blob-num js-line-number js-blob-rnum\&quot; data-line-number=\&quot;17\&quot;></td>\n          <td id=\&quot;file-bengio-sample-py-LC17\&quot; class=\&quot;blob-code blob-code-inner js-file-line\&quot;>    <span class=pl-k>if</span> <span class=pl-s1>i</span> <span class=pl-c1>==</span> <span class=pl-c1>0</span>:</td>\n        </tr>\n        <tr>\n          <td id=\&quot;file-bengio-sample-py-L18\&quot; class=\&quot;blob-num js-line-number js-blob-rnum\&quot; data-line-number=\&quot;18\&quot;></td>\n          <td id=\&quot;file-bengio-sample-py-LC18\&quot; class=\&quot;blob-code blob-code-inner js-file-line\&quot;>        <span class=pl-k>break</span></td>\n        </tr>\n        <tr>\n          <td id=\&quot;file-bengio-sample-py-L19\&quot; class=\&quot;blob-num js-line-number js-blob-rnum\&quot; data-line-number=\&quot;19\&quot;></td>\n          <td id=\&quot;file-bengio-sample-py-LC19\&quot; class=\&quot;blob-code blob-code-inner js-file-line\&quot;>\n</td>\n        </tr>\n        <tr>\n          <td id=\&quot;file-bengio-sample-py-L20\&quot; 
class=\&quot;blob-num js-line-number js-blob-rnum\&quot; data-line-number=\&quot;20\&quot;></td>\n          <td id=\&quot;file-bengio-sample-py-LC20\&quot; class=\&quot;blob-code blob-code-inner js-file-line\&quot;><span class=pl-k>print</span>(<span class=pl-s1>f</span>&#8221;<span class=pl-v>Story</span> <span class=pl-s1>time</span>: {<span class=pl-s1>story</span>}&#8221;)</td>\n        </tr>\n  </table>\n</div>\n\n\n    </div>\n\n  </div>\n</div>\n\n      </div>\n      <div class=\&quot;gist-meta\&quot;>\n        <a href=\&quot;https://gist.github.com/cjams/e02cd8bf2aa86e56a7028c3811047b55/raw/a9b5efe85fe30e419b5fe9b849e26abc3e738eef/bengio-sample.py\&quot; style=\&quot;float:right\&quot; class=\&quot;Link--inTextBlock\&quot;>view raw</a>\n        <a href=\&quot;https://gist.github.com/cjams/e02cd8bf2aa86e56a7028c3811047b55#file-bengio-sample-py\&quot; class=\&quot;Link--inTextBlock\&quot;>\n          bengio-sample.py\n        </a>\n        hosted with &amp;#10084; by <a class=\&quot;Link--inTextBlock\&quot; href=\&quot;https://github.com\&quot;>GitHub</a>\n      </div>\n    </div>\n</div>\n&quot;,&quot;stylesheet&quot;:&quot;https://github.githubassets.com/assets/gist-embed-ed91f9610ae6.css&quot;}" data-component-name="GitgistToDOM"><link rel="stylesheet" href="https://github.githubassets.com/assets/gist-embed-ed91f9610ae6.css"><div id="gist144082521" class="gist">
    <div class="gist-file" data-color-mode="light" data-light-theme="light">
      <div class="gist-data">
        <div class="js-gist-file-update-container js-task-list-container">
  <div id="file-bengio-sample-py" class="file my-2">
    
    <div itemprop="text" class="Box-body p-0 blob-wrapper data type-python  " style="overflow:auto">

        
<div class="js-check-hidden-unicode js-blob-code-container blob-code-content">

  


  <table data-hpc="" class="highlight tab-size js-file-line-container" data-tab-size="4" data-paste-markdown-skip="" data-tagsearch-path="bengio-sample.py">
        <tbody><tr>
          <td id="file-bengio-sample-py-L1" class="blob-num js-line-number js-blob-rnum" data-line-number="1"></td>
          <td id="file-bengio-sample-py-LC1" class="blob-code blob-code-inner js-file-line"><span class="pl-s1">story</span> <span class="pl-c1">=</span> &#39;&#39;</td>
        </tr>
        <tr>
          <td id="file-bengio-sample-py-L2" class="blob-num js-line-number js-blob-rnum" data-line-number="2"></td>
          <td id="file-bengio-sample-py-LC2" class="blob-code blob-code-inner js-file-line"><span class="pl-s1">ctx</span> <span class="pl-c1">=</span> [<span class="pl-c1">0</span>] <span class="pl-c1">*</span> <span class="pl-s1">ctx_window</span> <span class="pl-c"># start with context full of &#8220;special&#8221; characters</span></td>
        </tr>
        <tr>
          <td id="file-bengio-sample-py-L3" class="blob-num js-line-number js-blob-rnum" data-line-number="3"></td>
          <td id="file-bengio-sample-py-LC3" class="blob-code blob-code-inner js-file-line">
</td>
        </tr>
        <tr>
          <td id="file-bengio-sample-py-L4" class="blob-num js-line-number js-blob-rnum" data-line-number="4"></td>
          <td id="file-bengio-sample-py-LC4" class="blob-code blob-code-inner js-file-line"><span class="pl-k">while</span> <span class="pl-c1">True</span>:</td>
        </tr>
        <tr>
          <td id="file-bengio-sample-py-L5" class="blob-num js-line-number js-blob-rnum" data-line-number="5"></td>
          <td id="file-bengio-sample-py-LC5" class="blob-code blob-code-inner js-file-line">    <span class="pl-s1">x</span> <span class="pl-c1">=</span> <span class="pl-s1">torch</span>.<span class="pl-c1">tensor</span>([<span class="pl-s1">ctx</span>], <span class="pl-s1">device</span><span class="pl-c1">=</span><span class="pl-s1">device</span>)</td>
        </tr>
        <tr>
          <td id="file-bengio-sample-py-L6" class="blob-num js-line-number js-blob-rnum" data-line-number="6"></td>
          <td id="file-bengio-sample-py-LC6" class="blob-code blob-code-inner js-file-line">
</td>
        </tr>
        <tr>
          <td id="file-bengio-sample-py-L7" class="blob-num js-line-number js-blob-rnum" data-line-number="7"></td>
          <td id="file-bengio-sample-py-LC7" class="blob-code blob-code-inner js-file-line">    <span class="pl-k">for</span> <span class="pl-s1">layer</span> <span class="pl-c1">in</span> <span class="pl-s1">model</span>:</td>
        </tr>
        <tr>
          <td id="file-bengio-sample-py-L8" class="blob-num js-line-number js-blob-rnum" data-line-number="8"></td>
          <td id="file-bengio-sample-py-LC8" class="blob-code blob-code-inner js-file-line">        <span class="pl-s1">x</span> <span class="pl-c1">=</span> <span class="pl-en">layer</span>(<span class="pl-s1">x</span>)</td>
        </tr>
        <tr>
          <td id="file-bengio-sample-py-L9" class="blob-num js-line-number js-blob-rnum" data-line-number="9"></td>
          <td id="file-bengio-sample-py-LC9" class="blob-code blob-code-inner js-file-line">
</td>
        </tr>
        <tr>
          <td id="file-bengio-sample-py-L10" class="blob-num js-line-number js-blob-rnum" data-line-number="10"></td>
          <td id="file-bengio-sample-py-LC10" class="blob-code blob-code-inner js-file-line">    <span class="pl-s1">counts</span> <span class="pl-c1">=</span> <span class="pl-s1">x</span>.<span class="pl-c1">exp</span>()</td>
        </tr>
        <tr>
          <td id="file-bengio-sample-py-L11" class="blob-num js-line-number js-blob-rnum" data-line-number="11"></td>
          <td id="file-bengio-sample-py-LC11" class="blob-code blob-code-inner js-file-line">    <span class="pl-s1">probs</span> <span class="pl-c1">=</span> <span class="pl-s1">counts</span> <span class="pl-c1">/</span> <span class="pl-s1">counts</span>.<span class="pl-c1">sum</span>(<span class="pl-s1">dim</span><span class="pl-c1">=</span><span class="pl-c1">1</span>, <span class="pl-s1">keepdim</span><span class="pl-c1">=</span><span class="pl-c1">True</span>)</td>
        </tr>
        <tr>
          <td id="file-bengio-sample-py-L12" class="blob-num js-line-number js-blob-rnum" data-line-number="12"></td>
          <td id="file-bengio-sample-py-LC12" class="blob-code blob-code-inner js-file-line">    <span class="pl-s1">i</span> <span class="pl-c1">=</span> <span class="pl-s1">torch</span>.<span class="pl-c1">multinomial</span>(<span class="pl-s1">probs</span>, <span class="pl-s1">num_samples</span><span class="pl-c1">=</span><span class="pl-c1">1</span>, <span class="pl-s1">replacement</span><span class="pl-c1">=</span><span class="pl-c1">True</span>, <span class="pl-s1">generator</span><span class="pl-c1">=</span><span class="pl-s1">g</span>).<span class="pl-c1">item</span>()</td>
        </tr>
        <tr>
          <td id="file-bengio-sample-py-L13" class="blob-num js-line-number js-blob-rnum" data-line-number="13"></td>
          <td id="file-bengio-sample-py-LC13" class="blob-code blob-code-inner js-file-line">    <span class="pl-s1">s</span> <span class="pl-c1">=</span> <span class="pl-s1">itos</span>[<span class="pl-s1">i</span>]</td>
        </tr>
        <tr>
          <td id="file-bengio-sample-py-L14" class="blob-num js-line-number js-blob-rnum" data-line-number="14"></td>
          <td id="file-bengio-sample-py-LC14" class="blob-code blob-code-inner js-file-line">    <span class="pl-s1">story</span> <span class="pl-c1">+=</span> <span class="pl-s1">s</span></td>
        </tr>
        <tr>
          <td id="file-bengio-sample-py-L15" class="blob-num js-line-number js-blob-rnum" data-line-number="15"></td>
          <td id="file-bengio-sample-py-LC15" class="blob-code blob-code-inner js-file-line">    <span class="pl-s1">ctx</span> <span class="pl-c1">=</span> <span class="pl-s1">ctx</span>[<span class="pl-c1">1</span>:] <span class="pl-c1">+</span> [<span class="pl-s1">i</span>]</td>
        </tr>
        <tr>
          <td id="file-bengio-sample-py-L16" class="blob-num js-line-number js-blob-rnum" data-line-number="16"></td>
          <td id="file-bengio-sample-py-LC16" class="blob-code blob-code-inner js-file-line">
</td>
        </tr>
        <tr>
          <td id="file-bengio-sample-py-L17" class="blob-num js-line-number js-blob-rnum" data-line-number="17"></td>
          <td id="file-bengio-sample-py-LC17" class="blob-code blob-code-inner js-file-line">    <span class="pl-k">if</span> <span class="pl-s1">i</span> <span class="pl-c1">==</span> <span class="pl-c1">0</span>:</td>
        </tr>
        <tr>
          <td id="file-bengio-sample-py-L18" class="blob-num js-line-number js-blob-rnum" data-line-number="18"></td>
          <td id="file-bengio-sample-py-LC18" class="blob-code blob-code-inner js-file-line">        <span class="pl-k">break</span></td>
        </tr>
        <tr>
          <td id="file-bengio-sample-py-L19" class="blob-num js-line-number js-blob-rnum" data-line-number="19"></td>
          <td id="file-bengio-sample-py-LC19" class="blob-code blob-code-inner js-file-line">
</td>
        </tr>
        <tr>
          <td id="file-bengio-sample-py-L20" class="blob-num js-line-number js-blob-rnum" data-line-number="20"></td>
          <td id="file-bengio-sample-py-LC20" class="blob-code blob-code-inner js-file-line"><span class="pl-k">print</span>(<span class="pl-s1">f</span>&quot;<span class="pl-v">Story</span> <span class="pl-s1">time</span>: {<span class="pl-s1">story</span>}&quot;)</td>
        </tr>
  </tbody></table>
</div>


    </div>

  </div>
</div>

      </div>
      <div class="gist-meta">
        <a href="https://gist.github.com/cjams/e02cd8bf2aa86e56a7028c3811047b55/raw/a9b5efe85fe30e419b5fe9b849e26abc3e738eef/bengio-sample.py" style="float:right" class="Link--inTextBlock">view raw</a>
        <a href="https://gist.github.com/cjams/e02cd8bf2aa86e56a7028c3811047b55#file-bengio-sample-py" class="Link--inTextBlock">
          bengio-sample.py
        </a>
        hosted with &#10084; by <a class="Link--inTextBlock" href="https://github.com">GitHub</a>
      </div>
    </div>
</div>
</div><p>Running this gives the following:</p><blockquote><p>Story time: Once upon a timk.e an ter pere eo hire sores the caed va boanshThr witlin. HtA. ThengerDpg, and ker ditgy und g nit, tayeur Vag anddbT&#201;dbvjogd isswarp!e wow,e. ouancs.&#8221;Tneyd-4%un6&#184;&#164;&#338;&#194;&#183;&#175;&#9;}&#9;Iy&#382;+&#8225;+&#8250;&#180;&#162;&#191;D&#187;&#225;jf&#201;&#381;&#176;&#233;G&#173;&#8482;yz&#8250;1&#338;&#194;&#353;&#175;&#187;{U9&#172;#&#179;&#8217;}&nbsp;%&gt;&#178;)&#184;&#8216;&#172;#&#339;j;&#202;q&gt;&#8216;&#230;&#201;&#181;Lb&#230;&#228;c&#174;&#232;.c&#381;39&#176;zc&#183;dxnomd.&#402;&gt;o&#166;t.mTe su&#338;lmvcyI&#162;&#8221;D&#225;&#339;&#8211;j&#339;&#179;;&#191;&#228;X&#233;cv&#8482;&#166;R&#402;&#184;2&#8217;F&#8249;&nbsp;@&#8250;&#8221;&#402;&#195;&#8250;6&#177;z&#353;&lt;&#176;b&#201;;&#174;&#174;&#210;`0&#9;?.&#196;#2&#187;&#225;B&#8221;&#183;&#8221;&#226;2&#180;F&#185;&#8230;&#165;&#174;@12&#167;9\&gt;&#710;&#167;&#163;V}&#229;4&#185;&#8364;F&#233;Q}&#166;&#169;&#161;&#168;&#177;&#188;&#175;&#8224;&#195;))`&#188;&#201;\Rz&#228;&#161;\&#172;#;&#179;Y&#376;&#176;vVL&#226;%&#196;&lt;Z&#230;&#179;&#175;&#233;O&#8218;&#195;M&#382;+`[&#8221;&#230;C&#226;j,C&#209;S&#352;\,&#185;&#9;]O&#226;&#172;&#732;&lt;!&#230;&#230;&#210;&#175;Y&#230;&#161;&#710;9&#202;&#239;4g$&#189;?&#196;b&#239;&#201;?oBH&#732;&#228;&#177;&#9;;&#227;R&gt;@)&#402;&#8240;&#710;=X&#240;&#165;&#185;P,?0=&gt;&#381;&#240;:&#8221;QW&#176;JFxQ(3\h&#8222;&#352;&#240;&#201;)X&#732;&#180;QD&#181;xj&#187;.&#162;&#201;?&#353;&#172;&#170;Rc&#179;&#352;&#239;&#352;&#172;&#173;qU&#162;E&#185;&#162;&#339;R0&#8240;2&#376;&#240;:&#381;+&#197;4&#161;&#186;^</p></blockquote><p>Started off pretty good! 
Then it went completely off the rails.</p><div class="subscription-widget-wrap-editor" data-attrs="{&quot;url&quot;:&quot;https://www.connorjdavis.com/subscribe?&quot;,&quot;text&quot;:&quot;Subscribe&quot;,&quot;language&quot;:&quot;en&quot;}" data-component-name="SubscribeWidgetToDOM"><div class="subscription-widget show-subscribe"><div class="preamble"><p class="cta-caption">Thanks for reading! Subscribe for free to receive new posts and support my work.</p></div><form class="subscription-widget-subscribe"><input type="email" class="email-input" name="email" placeholder="Type your email&#8230;" tabindex="-1"><input type="submit" class="button primary" value="Subscribe"><div class="fake-input-wrapper"><div class="fake-input"></div><div class="fake-button"></div></div></form></div></div><p></p>]]></content:encoded></item><item><title><![CDATA[hacker (n.)]]></title><description><![CDATA[What do you think of when you hear the word &#8220;hacker&#8221;?]]></description><link>https://www.connorjdavis.com/p/hacker-n</link><guid isPermaLink="false">https://www.connorjdavis.com/p/hacker-n</guid><dc:creator><![CDATA[Connor Davis]]></dc:creator><pubDate>Sun, 07 Sep 2025 19:26:00 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!ZbOo!,w_256,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F16ac54db-581c-4d87-897b-1a07019f089d_1280x1280.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>What do you think of when you hear the word &#8220;hacker&#8221;? Probably a person in a black room in a black hoodie staring at a screen that looks like <em>The Matrix</em>, with lime green text flowing not horizontally, but vertically.</p><p>The notion that hacking involves illicit activity with a computer is wrong. Or rather, it is incomplete. 
Our culture has hijacked this term and given it this negative connotation.</p><p>So what does it actually mean?</p><p>To answer this, we have to go back to the early 1970s. During this time, the creators and users of the early Internet at various universities created a file that contained certain terms related to the systems they were working on. This file became known as the <a href="http://www.catb.org/jargon/html/pt01.html">Jargon File</a>. The Jargon File contains many terms commonly used in hacker subcultures, and some, such as &#8220;bug&#8221; and &#8220;troll&#8221;, have seen widespread adoption in society today. The Jargon File offers several definitions of hacker:</p><blockquote><p>hacker: (n) A person who enjoys exploring the details of programmable systems and how to stretch their capabilities, as opposed to most users, who prefer to learn only the minimum necessary. RFC1392, the Internet Users&#8217; Glossary, usefully amplifies this as: A person who delights in having an intimate understanding of the internal workings of a system, computers and computer networks in particular.</p><p>&#8230;</p><p>An expert or enthusiast of any kind. One might be an astronomy hacker, for example.</p><p>&#8230;</p><p>One who enjoys the intellectual challenge of creatively overcoming or circumventing limitations.</p><p>&#8230;</p><p> [deprecated] A malicious meddler who tries to discover sensitive information by poking around. Hence password hacker, network hacker. The correct term for this sense is <a href="http://www.catb.org/jargon/html/C/cracker.html">cracker</a>.</p></blockquote><p>This definition does a much better job of capturing the true essence of hacking. It underscores the essential traits of the craft: deep understanding, competence, curiosity, and creativity. A hacker is someone who comes to <a href="http://www.catb.org/jargon/html/G/grok.html">grok</a> a system so deeply that they can discover novel ways of (ab)using it. 
They are especially skilled at combining systems in unexpected ways, beyond the original design of any one of them, to overcome the boundaries and limitations of each. </p><p>The only thing I would add to the definition above is that hacking applies to any complex system, not just computer systems. And in this sense I agree with society&#8217;s generalization of the word &#8220;hack&#8221; to domains outside of computers. There are biohackers, gym hackers, growth hackers, food hackers, work hackers, life hackers, etc. Hacking is a mindset that can be applied to any domain of life in which there are observable inputs and outputs. Hackers try a bunch of inputs and observe how the system behaves. Even if the internals of the system are a complete black box, you can learn a great deal from just observing what inputs lead to what outputs.</p><p>The difference between a hacker and a non-hacker is that the hacker</p><ol><li><p>becomes consciously aware of the inputs to the system</p></li><li><p>intentionally finds and feeds <em>unconventional</em> inputs to it.</p></li></ol><p>Cold showers are a good example of this: they were an unconventional input to the standard morning routine. Then some influential biohackers tried them out and found they provide many health benefits that you would miss out on if you stuck to the conventional input of a warm shower. Sometimes the unconventional input is no input at all. This is colloquially referred to as <a href="https://www.paulgraham.com/lesson.html">hacking the test</a>.</p><p>Hacking the test involves pruning the inputs that involve extra effort, because you&#8217;ve figured out that the system doesn&#8217;t reward you for them. Or that you can still achieve a favorable outcome without them. This is what business hacker Charlie Munger meant by &#8220;show me the incentives and I&#8217;ll show you the outcome&#8221;. The incentives are the feedback from the output to the input. 
When someone says they have no incentive to do something, they&#8217;ve figured out that they can safely prune that something from the input with no detrimental effect to the outcome. AI systems do this as well, especially in reinforcement learning. This &#8220;reward hacking&#8221; involves finding <a href="https://www.reddit.com/r/singularity/comments/1d1k1u4/example_of_reward_hacking_ai_learns_a_trick_in_a/">clever bypasses</a> around the rules of their environment. These workarounds are examples of another characteristic of hacking. Hackers play with the rules rather than within the rules of the system. These examples show that hacking itself is similar to a technology, a tool that can be used for good or bad.</p><p>Since hacking is a mindset, a way of questioning the world, it can be learned. It is a skill that can be developed. It is a skill that I intend to keep developing in myself and to help others develop it as well.</p>]]></content:encoded></item><item><title><![CDATA[Life Lately]]></title><description><![CDATA[This is my first post in a while.]]></description><link>https://www.connorjdavis.com/p/life-lately</link><guid isPermaLink="false">https://www.connorjdavis.com/p/life-lately</guid><dc:creator><![CDATA[Connor Davis]]></dc:creator><pubDate>Fri, 22 Aug 2025 19:14:00 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!ZbOo!,w_256,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F16ac54db-581c-4d87-897b-1a07019f089d_1280x1280.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>This is my first post in a while.</p><p>Lately I&#8217;ve been developing a <a href="https://github.com/mycomize/grow-app">mushroom growing app</a>. This is an open-source project that enables citizen science for mushroom growing. For a while now, I have wanted a tool that allows me to keep track of my grows and to run experiments with various controllable parameters. I&#8217;ve also wanted to be able to learn from others&#8217; techniques more effectively. There are plenty of forums online for sharing; however, it is difficult to rigorously repeat what others have done, as the sharing tends to happen in unstructured natural language and often lacks critical details needed for repeatability.</p><p>The app has three conceptual sections: Grows, Teks, and IoT Gateways. The Grows section provides grow management. 
You can add/remove grows, schedule tasks, and view grow progress and key characteristics like health status, cost, and yield. The app focuses on bulk grows for the initial version. Each grow has 5 stages. The user can add items, environmental conditions, tasks, and notes to each stage.</p><p>Teks (&#8220;tek&#8221; is short for &#8220;technique&#8221;; commonly found in online mushroom forums) are similar to grows, but they are a more &#8220;abstract&#8221; version, similar to a recipe vs doing the actual cooking. Teks are portable instructions that can be shared and instantiated into actual grows.</p><p>IoT Gateways are connections to Internet-of-Things (IoT) sensor hubs. <a href="https://www.home-assistant.io/">Home Assistant</a> is the one supported in the initial version. Home Assistant is nice because it is a widely used free and open-source home automation platform and is one I&#8217;m fairly familiar with. You can connect many different sensors, especially those relevant to providing optimal mushroom growing conditions (temperature, humidity, carbon dioxide, pH, light, etc). You can also write custom automations so that you can react to changes in the environment. For example, if your spawn incubator temperature falls too low, it can switch on a heat mat to bring the temp back into optimal range. Home Assistant also supports cameras and embedded controllers such as <a href="https://esphome.io/">ESPHome</a>, and it is extensible, so support can be added for just about anything if it doesn&#8217;t exist already. The IoT Gateway connects to the Home Assistant API and receives real-time updates over WebSockets. All the controls, sensors, and automations are accessible from within the app. They can be linked to specific grow stages for different tasks, like enforcing environmental conditions (e.g. temperature and humidity range) during spawn colonization, or detecting anomalies in sensor values (e.g. discoloration, pH drops) during bulk colonization. 
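</p><p>The heat-mat example above is essentially a hysteresis controller. Here is a minimal sketch of that decision logic in Python; this is generic pseudologic, not Home Assistant&#8217;s actual automation syntax, and the temperature thresholds are purely illustrative:</p>

```python
def heat_mat_control(temp_c: float, low: float = 24.0, high: float = 27.0,
                     mat_on: bool = False) -> bool:
    """Decide whether the heat mat should be on.

    Simple hysteresis: switch on below `low`, switch off above `high`,
    and otherwise keep the current state so the mat doesn't flap
    around a single threshold. Thresholds are made-up examples.
    """
    if temp_c < low:
        return True
    if temp_c > high:
        return False
    return mat_on

# Example: 22 °C reading with the mat currently off -> turn it on.
print(heat_mat_control(22.0, mat_on=False))  # True
```

<p>In the app, Home Assistant evaluates this kind of rule and actuates the switch; the sketch just shows the decision being made.</p><p>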
This Bring-Your-Own-Home-Assistant design provides a platform for endless customization and experimentation. </p><p>These three sections will be present in the first version of the app, as it is already getting pretty complex. Later I would like to add a &#8220;Lab&#8221; section that provides first-class support for running experiments. You should be able to have a grow and tweak some aspect of it, say a different substrate material, or a slightly higher temperature range during bulk colonization, or more light. The space of potential experimental inputs is massive here, so what I&#8217;m thinking of is a clean &#8220;diffing&#8221; mechanism that allows the user to easily compare and contrast the control vs experimental variables and track how they affect the outputs like yield. </p><p>---</p><p>Lately I&#8217;ve been thinking about work, my relationship to it, and what it means to do great work. Work is important to me, and I&#8217;ve learned over the past 2 years or so that doing <em>great</em> work is really important to my wellbeing. I feel like shit when I know inside that I haven&#8217;t lived up to my potential. I&#8217;ve read a lot of books lately related to work, some of which I&#8217;ve <a href="https://connorjdavis.substack.com/p/book-review-the-creative-act?r=1nb12u">written</a> <a href="https://connorjdavis.substack.com/p/book-review-mans-search-for-meaning?r=1nb12u">about</a> and <a href="https://www.amazon.com/Tiny-Experiments-Freely-Goal-Obsessed-World/dp/0593715136">others I haven&#8217;t</a> (yet), that have helped me reframe what it means to work and work well. Paul Graham&#8217;s <a href="https://paulgraham.com/greatwork.html">take on work</a> has been especially influential on me.</p><p>Strangely, work tends to get a bad rap in today&#8217;s society. Ambition has become somewhat of a dirty word. But I think a lot of this is because people conflate their job (or, if they are young, school) with work. 
And if your job sucks, or you&#8217;re learning things in school that you have zero interest in, then naturally you&#8217;re going to dread going to &#8220;work&#8221;. You are going to resort to language such as I <em>have</em> to go to work tomorrow, rather than I <em>get</em> to go to work tomorrow. The underlying <a href="https://connorjdavis.substack.com/p/metaphors-and-the-infinite-game?r=1nb12u">metaphor</a> is that WORK IS A NECESSARY EVIL. Something to put up with. Something that society forces us into. Lately I&#8217;ve come to change that metaphor. For me, WORK IS AN OPPORTUNITY FOR SPIRITUAL GROWTH. It is something that I am blessed to be able to do. Work to me means learning, creating great things, and sharing what I&#8217;ve learned and built with others. Great work probably looks different for you, but it is always guided by our natural curiosity and comes from our internal drive. And we have an obligation to ourselves and to each other to ensure that our curiosity and interests are fully expressed in the world. We must share our work, because each piece plays a small part in a much larger creative force that works in ways that we do not fully understand. You never know how your work could impact a person&#8217;s life.</p><p>The trick is to maintain balance between your work and the other important things in life. I have found that viewing your life&#8217;s work as an <a href="https://connorjdavis.substack.com/p/metaphors-and-the-infinite-game?r=1nb12u">infinite game</a> helps in maintaining balance. The alternative is that work is a finite game, which creates a frantic sense of urgency and anxiety. Work is important, but it shouldn&#8217;t interfere with sleep, relationships, diet, and exercise. 
The goal of any infinite game is to keep playing, and you have to take care of yourself in order to do that.</p>]]></content:encoded></item><item><title><![CDATA[Nth Order Effects]]></title><description><![CDATA[What do tariffs and the meteoric rise of Oliver Anthony have in common?]]></description><link>https://www.connorjdavis.com/p/nth-order-effects</link><guid isPermaLink="false">https://www.connorjdavis.com/p/nth-order-effects</guid><dc:creator><![CDATA[Connor Davis]]></dc:creator><pubDate>Wed, 25 Jun 2025 19:02:00 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!xpv8!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3ceb9432-80a7-4778-973f-5cc7f0f0c6b4_1024x584.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>What do tariffs and the meteoric rise of <a href="https://www.youtube.com/watch?v=sqSA-SY5Hro">Oliver Anthony</a> have in common? 
One is a sledgehammer that the Trump administration is hell-bent on swinging to reorder global trade; the other is a country music star with a message dripping in populism.</p><p>My aim is to show that both of these phenomena, while very different on the surface, are inextricable <em>nth order effects</em> of the same root cause. The root cause is something that the majority of Americans either take for granted or are completely ignorant of: the US dollar&#8217;s status as the global reserve currency and the debt-based fiat monetary system on which this status depends. An <a href="https://fs.blog/second-order-thinking/">nth order effect</a> is, simply put, an indirect consequence of some event. The set of these effects for a given event is constructed by repeatedly asking the question &#8220;And then what?&#8221;. If I eat candy every day, the first order effect is a rush of pleasure and dopamine. The second order effect is stomach distress and low energy. A third order effect is diabetes. Many people fail to look beyond the first order effects of decisions they make and of events they become aware of. This leads to a shallow understanding of events that occur within complex systems and an under-appreciation of potential future outcomes.</p><p>The Trump administration wants to balance the United States&#8217; long-standing trade deficit. To achieve this policy goal, Trump is levying tariffs on trade partners in order to encourage the re-shoring and revitalization of the US industrial base. However, in my view, tariffs alone will be ineffective in balancing trade in the long run, because the trade deficit is an nth order effect of the true root cause: the US dollar&#8217;s status as global reserve currency and the debt-based fiat monetary system. 
Tariffs do nothing to address this root cause, and in fact, <a href="https://youtu.be/VnajhDMAWVA">Trump has emphasized</a> that the administration&#8217;s official policy is to <em>preserve </em>the US dollar&#8217;s reserve currency status. However, just because tariffs alone may be ineffective doesn&#8217;t necessarily mean the underlying policy objective shouldn&#8217;t be pursued.</p><p>The aim of this post is not a political one, but rather to analyze the tariff policy within the reality of the economic system it is being deployed in so that we can make informed judgments of its efficacy and worth. To this end, we will start with a brief history of the modern dollar system and the elevation of the dollar to global reserve status. Then we will see how this system is linked to persistent US trade deficits. Once we understand this financial plumbing, we can make an informed judgment on whether tariffs alone are likely to be effective. In order to judge whether the underlying policy (i.e. balanced trade) is worthwhile, we will look to various other consequences of the current system and the impact it has had on the American public, including the recent wave of political populism and the general feeling that the American dream is out of reach for many people. With these effects correctly attributed to the root cause, which transcends political divisions, we can develop a more nuanced view of the underlying goal of balanced trade, even if we may disagree with the means to achieve it.</p><p>A quick aside: much of my view has been influenced by <a href="https://x.com/LynAldenContact">Lyn Alden</a> and her book <a href="https://www.amazon.com/Broken-Money-Financial-System-Failing/dp/B0CG8985FR">Broken Money</a>, as well as work from macroeconomists <a href="https://x.com/LukeGromen">Luke Gromen</a>, <a href="https://x.com/JohnFMauldin">John Mauldin</a>, and <a href="https://x.com/infraa_">@infraa_</a>. 
I recommend checking out their stuff for a further deep dive.</p><h2>The Rise of King Dollar</h2><p>At the peak of World War II, Allied financiers and central bankers convened in Bretton Woods, New Hampshire to create a new global monetary order to be enacted once the war was over. At this point in July of 1944, the United States had emerged as the country best positioned for imposing a new international financial system on the rest of the world. The US had built up an immense manufacturing capacity and the American homeland had avoided the devastating damage that plagued Europe, Asia, and the Soviet Union.</p><p>The so-called &#8220;Bretton Woods system&#8221; would confer the lofty status of global reserve currency to the United States dollar. Under Bretton Woods, member nations agreed to maintain a fixed exchange rate against the dollar, while the dollar itself was redeemable for gold at $35 per ounce. The intent of this system was to facilitate free international trade using the dollar as the common medium of exchange.</p><p>There was a problem though. In order for nations to settle trade in dollars with each other, they needed dollars. But dollars only come from the United States. Foreign nations and companies cannot create more dollars themselves, and Europe, Asia, and the Soviet Union had very little export capacity relative to the United States due to the physical destruction from the war, i.e., they had very little to sell in exchange for dollars.</p><p>At the very beginning of Bretton Woods, the United States had what is called a <em>balance of payments</em> surplus. You can think of a country&#8217;s balance of payments as a record of all the value flowing into and out of the country. A surplus means that more value is flowing in than is flowing out. The balance of payments includes things like the trade balance and the financial account. The US was running huge trade surpluses and had amassed sizeable dollar reserves. 
This surplus had to be reversed in order to supply the rest of the world with the dollars it needed. The US did this initially through grants and loans, since the European export capacity was severely hamstrung at the time.</p><p>This deliberate outflow of dollars was bootstrapped by massive aid packages such as the Marshall Plan, during which the US gave Western European countries approximately $17 billion to jumpstart re-industrialization. <a href="https://www.banque-france.fr/en/publications-and-statistics/publications/lessons-marshall-plan-european-recovery-plan">About 90% of the Marshall Plan aid was given in the form of grants</a> rather than loans, meaning Europe didn&#8217;t have to pay the United States back. This injection of dollars helped spur European industrial capacity, and by the end of the Marshall Plan in 1952, <a href="https://eprints.lse.ac.uk/22351/1/wp78.pdf">European exports had doubled</a>, and <a href="https://www.everycrsreport.com/reports/R45079.html#_Toc504143866">industrial output had increased 55%</a>.</p><p>These persistent balance of payments deficits, i.e., export of dollars, and the impetus to use dollars for global trade by non-US countries, created what is called the <em>Eurodollar</em> market. Despite the name, the Eurodollar market is the market for dollars held by any nation outside the US, not just those in Europe. The Eurodollar market experienced dramatic growth during the 1960s. It grew from <a href="https://www.stlouisfed.org/on-the-economy/2022/january/bretton-woods-growth-eurodollar-market">$75 billion to $264 billion</a> (in 2020 dollars) between 1964 and 1969. 
Keep in mind that each dollar deposit in a foreign country ultimately represents a liability of the United States, since every dollar was redeemable for gold by foreign governments and their central banks.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!1HiH!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8c9d89e4-04ca-4be9-83df-188d3f3d664c_1024x678.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!1HiH!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8c9d89e4-04ca-4be9-83df-188d3f3d664c_1024x678.png 424w, https://substackcdn.com/image/fetch/$s_!1HiH!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8c9d89e4-04ca-4be9-83df-188d3f3d664c_1024x678.png 848w, https://substackcdn.com/image/fetch/$s_!1HiH!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8c9d89e4-04ca-4be9-83df-188d3f3d664c_1024x678.png 1272w, https://substackcdn.com/image/fetch/$s_!1HiH!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8c9d89e4-04ca-4be9-83df-188d3f3d664c_1024x678.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!1HiH!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8c9d89e4-04ca-4be9-83df-188d3f3d664c_1024x678.png" width="1024" height="678" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/8c9d89e4-04ca-4be9-83df-188d3f3d664c_1024x678.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:678,&quot;width&quot;:1024,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:138059,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://connorjdavis.substack.com/i/182924778?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8c9d89e4-04ca-4be9-83df-188d3f3d664c_1024x678.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!1HiH!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8c9d89e4-04ca-4be9-83df-188d3f3d664c_1024x678.png 424w, https://substackcdn.com/image/fetch/$s_!1HiH!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8c9d89e4-04ca-4be9-83df-188d3f3d664c_1024x678.png 848w, https://substackcdn.com/image/fetch/$s_!1HiH!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8c9d89e4-04ca-4be9-83df-188d3f3d664c_1024x678.png 1272w, https://substackcdn.com/image/fetch/$s_!1HiH!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8c9d89e4-04ca-4be9-83df-188d3f3d664c_1024x678.png 1456w" sizes="100vw" loading="lazy"></picture></div></a><figcaption class="image-caption">Source: https://www.thegoldobserver.com/p/gold-wars-the-us-versus-europe-during</figcaption></figure></div><p>As seen above, eventually the level of external dollar liabilities exceeded the value of the US gold supply. This coincided with large inflationary deficit spending by the United States in the late 1960s in order to fund the Vietnam War and President Johnson&#8217;s Great Society. 
The deficit increased 5x during this period, from <a href="https://www.elibrary.imf.org/display/book/9781475506969/ch08.xml">$5.9 billion in 1964 to $25.2 billion in 1968</a>.</p><p>There are a couple of reasons why the foreign (Eurodollar) and domestic liabilities were able to grow so quickly under Bretton Woods.</p><p>The first dates back to three decades earlier, when FDR declared an emergency bank &#8220;holiday&#8221; and signed a series of executive orders during the 1933 banking crisis. <a href="https://www.presidency.ucsb.edu/documents/executive-order-6102-forbidding-the-hoarding-gold-coin-gold-bullion-and-gold-certificates">Executive Order 6102</a> made it illegal for US citizens to own more than $100 worth of monetary gold. <a href="https://www.presidency.ucsb.edu/documents/executive-order-6073-reopening-banks">Executive Order 6073</a> ended the domestic convertibility of dollars for gold at US banks. Later that year in June, <a href="https://www.ebsco.com/research-starters/history/gold-clause-repealed">Congress revoked the &#8220;gold clause&#8221;</a>, which nullified any obligations for settling private and federal contracts in gold. Finally, in 1934, the <a href="https://www.cato.org/blog/new-deal-recovery-part-7-fdr-gold">Gold Reserve Act</a> enabled the President to perform a <em>de jure</em> devaluation of the dollar by adjusting the gold peg from $20.67 to $35 per ounce. These policies amounted to confiscation of gold from the American public into the Federal Reserve system and marked the end of the gold standard domestically.</p><p>On top of the gold that flowed into the Fed from private citizens in the 1930s, the required gold reserve ratio decreased throughout the Bretton Woods period. From the establishment of the Fed in 1913 through 1945, the Fed and its member banks were legally required to maintain a 40% gold reserve. 
<a href="https://ypfsresourcelibrary.blob.core.windows.net/fcic/YPFS/65585_1965-1969.pdf">This was reduced to 25% in 1945, and to 0% in 1968</a>. The net effect was that the US banking system was no longer constrained by its gold levels and could create loans based on fiat dollar reserves rather than physical gold reserves. This leads us to the second reason for the huge expansion of dollar liabilities under Bretton Woods: fractional reserve lending.</p><p>Fractional reserve lending allows banks to create new money in the form of loans while only keeping a small percentage of the initial deposit as physical reserves<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-1" href="#footnote-1" target="_self">1</a>. Assuming all banks have a reserve ratio of 10%, if you deposit $100 into the bank, the bank will loan out $90 of that deposit to a borrower while earning the spread between the interest it pays you and the interest it charges the borrower. That $90 then gets loaned again by another bank with a 10% reserve ratio, resulting in $81 for the new loan. This process tends to continue via the profit motive for banks to earn the interest spread between deposits and loans. If you do the math, the amount of broad money created from this process works out to be equal to: initial deposit * (1 / reserve ratio). So a $100 deposit with a 10% reserve ratio results in $1000 of total broad money supported by the $100 of base money<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-2" href="#footnote-2" target="_self">2</a>. This is why base money, like actual physical currency and checking accounts, is referred to as &#8220;high powered money&#8221;; it has a multiplicative effect on the overall supply of broad money in the system.</p><p>So you can see how the end of domestic gold convertibility combined with fractional reserve lending led to enormous sums of new broad money creation. 
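</p><p>Under the stated assumptions (every dollar loaned out gets redeposited, and every bank keeps the same reserve ratio), the multiplier arithmetic above can be checked with a short Python sketch; the deposit amount and reserve ratio are just the illustrative numbers from the example:</p>

```python
def broad_money_closed_form(initial_deposit: float, reserve_ratio: float) -> float:
    # Sum of the geometric series deposit * (1 + (1-r) + (1-r)^2 + ...)
    # collapses to deposit / r.
    return initial_deposit / reserve_ratio

def broad_money_iterative(initial_deposit: float, reserve_ratio: float,
                          rounds: int = 1000) -> float:
    # Simulate the chain of redeposits: each bank keeps reserve_ratio
    # of the deposit and lends out the rest, which becomes the next deposit.
    total, deposit = 0.0, initial_deposit
    for _ in range(rounds):
        total += deposit                  # each deposit counts toward broad money
        deposit *= (1.0 - reserve_ratio)  # portion loaned out and redeposited
    return total

print(broad_money_closed_form(100, 0.10))       # 1000.0
print(round(broad_money_iterative(100, 0.10)))  # 1000
```

<p>The $100 &#8594; $90 &#8594; $81 chain in the example is exactly the first few terms of this series.</p><p>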
This money creation process accelerated under Bretton Woods as the United States began exporting dollars to the rest of the world, and the Eurodollar market began to metastasize. These offshore US dollars were also expanded using fractional reserves as industrial capacity came online in Europe.</p><p>Eventually Western European nations sniffed out the unsustainability of this system. There were simply way too many dollars and not enough gold in the US. The suspension of the gold standard in the 1930s only applied to citizens and banks within the country. Foreign governments and their central banks were still able to redeem their dollars for gold. So that is exactly what they did. Nations like France and Germany began draining gold from the US in order to evade the monetary debasement that was occurring in the dollar:</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!zuIf!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6431a377-c17a-4147-a851-f15564f526b3_1024x521.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!zuIf!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6431a377-c17a-4147-a851-f15564f526b3_1024x521.png 424w, https://substackcdn.com/image/fetch/$s_!zuIf!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6431a377-c17a-4147-a851-f15564f526b3_1024x521.png 848w, https://substackcdn.com/image/fetch/$s_!zuIf!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6431a377-c17a-4147-a851-f15564f526b3_1024x521.png 1272w, 
https://substackcdn.com/image/fetch/$s_!zuIf!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6431a377-c17a-4147-a851-f15564f526b3_1024x521.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!zuIf!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6431a377-c17a-4147-a851-f15564f526b3_1024x521.png" width="1024" height="521" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/6431a377-c17a-4147-a851-f15564f526b3_1024x521.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:521,&quot;width&quot;:1024,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:29367,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://connorjdavis.substack.com/i/182924778?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6431a377-c17a-4147-a851-f15564f526b3_1024x521.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!zuIf!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6431a377-c17a-4147-a851-f15564f526b3_1024x521.png 424w, https://substackcdn.com/image/fetch/$s_!zuIf!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6431a377-c17a-4147-a851-f15564f526b3_1024x521.png 848w, 
https://substackcdn.com/image/fetch/$s_!zuIf!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6431a377-c17a-4147-a851-f15564f526b3_1024x521.png 1272w, https://substackcdn.com/image/fetch/$s_!zuIf!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6431a377-c17a-4147-a851-f15564f526b3_1024x521.png 1456w" sizes="100vw" loading="lazy"></picture></div></a></figure></div><p>This run on the United States&#8217; gold stock was so acute in the late 1960s that it prompted the so-called &#8220;Nixon 
shock&#8221;. In 1971, President Nixon announced that the US would no longer allow foreign governments to redeem dollars for gold, ending the gold standard completely. The Nixon shock was effectively a default by the United States and marked the end of the Bretton Woods regime.</p><h2>The Dollar and Trade(offs)</h2><p>The collapse of Bretton Woods was predicted in 1959 by an economist named Robert Triffin. Triffin identified a contradiction, called Triffin&#8217;s dilemma, inherent to the system: in order for the rest of the world to use the dollar in international trade, foreign countries must have dollars. These countries can&#8217;t print dollars, so they have to get them through continuous US trade deficits (or more generally, balance of payments deficits, which include things like foreign aid as in the Marshall Plan). Trade deficits provide foreign countries with the surplus dollars needed to settle trade.</p><p>Now the problem isn&#8217;t really with the occasional trade deficit. In a normal scenario, the excess supply of the currency from the deficit will put downward valuation pressure on it relative to other currencies. This devaluation causes the country&#8217;s exports to become more competitive on a relative basis. As exports become more competitive, the trade deficit turns into a balance or even a surplus.</p><p>However, when the country (the US) maintains the world reserve currency (the dollar), this natural equilibrium is never reached. The excess supply is soaked up by global demand, keeping the relative value of the dollar propped up. The key word here is relative. On a relative basis, the dollar is strong; in absolute, real terms, however, the dollar gets much weaker over time. The one-two punch of abandoning the gold standard and the proliferation of debt through fractional reserve lending has led to a massive loss of purchasing power for holders of US dollars over the course of decades.
This weakness, as starkly demonstrated in the graph below, undermines the collective belief that the dollar is worthy of being the global reserve currency.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!xpv8!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3ceb9432-80a7-4778-973f-5cc7f0f0c6b4_1024x584.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!xpv8!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3ceb9432-80a7-4778-973f-5cc7f0f0c6b4_1024x584.png 424w, https://substackcdn.com/image/fetch/$s_!xpv8!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3ceb9432-80a7-4778-973f-5cc7f0f0c6b4_1024x584.png 848w, https://substackcdn.com/image/fetch/$s_!xpv8!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3ceb9432-80a7-4778-973f-5cc7f0f0c6b4_1024x584.png 1272w, https://substackcdn.com/image/fetch/$s_!xpv8!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3ceb9432-80a7-4778-973f-5cc7f0f0c6b4_1024x584.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!xpv8!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3ceb9432-80a7-4778-973f-5cc7f0f0c6b4_1024x584.png" width="1024" height="584"
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/3ceb9432-80a7-4778-973f-5cc7f0f0c6b4_1024x584.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:584,&quot;width&quot;:1024,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:64788,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://connorjdavis.substack.com/i/182924778?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3ceb9432-80a7-4778-973f-5cc7f0f0c6b4_1024x584.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!xpv8!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3ceb9432-80a7-4778-973f-5cc7f0f0c6b4_1024x584.png 424w, https://substackcdn.com/image/fetch/$s_!xpv8!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3ceb9432-80a7-4778-973f-5cc7f0f0c6b4_1024x584.png 848w, https://substackcdn.com/image/fetch/$s_!xpv8!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3ceb9432-80a7-4778-973f-5cc7f0f0c6b4_1024x584.png 1272w, https://substackcdn.com/image/fetch/$s_!xpv8!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3ceb9432-80a7-4778-973f-5cc7f0f0c6b4_1024x584.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" 
height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>Triffin&#8217;s dilemma captures the dynamic between the global reserve currency and trade deficits that I think Trump doesn&#8217;t fully understand or at least doesn&#8217;t appreciate. The deficits are a consequence of the US dollar&#8217;s global reserve currency status. They don&#8217;t just happen randomly or because of mercantilist policies from China. Sure China&#8217;s policies such as the intentional devaluation of the yuan may exacerbate the deficit, but these effects are marginal relative to the structural flows that result from the dollar reserve status.</p><p>Any meaningful tariff increases have little staying power, because they do nothing to address the root cause of the deficits in the first place. The bottom line is that dollars must continue to flow to the rest of the world. 
The reason is that trillions of dollars&#8217; worth of debt has accumulated outside the US over the past several decades, where neither the creditor nor the debtor is a US-based entity. These so-called &#8220;offshore&#8221; debts are between foreign corporations, households, banks, and governments. The charts below from the <a href="https://www.atlantafed.org/-/media/documents/research/publications/policy-hub/2024/05/15/02--offshore-dollar-and-us-policy.pdf">Atlanta Fed</a> show the magnitude of these liabilities to be in the tens of trillions:</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!exJJ!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F62b88ec0-5fd6-4e8f-80be-38048c29a49a_813x373.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!exJJ!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F62b88ec0-5fd6-4e8f-80be-38048c29a49a_813x373.png 424w, https://substackcdn.com/image/fetch/$s_!exJJ!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F62b88ec0-5fd6-4e8f-80be-38048c29a49a_813x373.png 848w, https://substackcdn.com/image/fetch/$s_!exJJ!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F62b88ec0-5fd6-4e8f-80be-38048c29a49a_813x373.png 1272w, https://substackcdn.com/image/fetch/$s_!exJJ!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F62b88ec0-5fd6-4e8f-80be-38048c29a49a_813x373.png 1456w" sizes="100vw"><img
src="https://substackcdn.com/image/fetch/$s_!exJJ!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F62b88ec0-5fd6-4e8f-80be-38048c29a49a_813x373.png" width="813" height="373" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/62b88ec0-5fd6-4e8f-80be-38048c29a49a_813x373.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:373,&quot;width&quot;:813,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:211062,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://connorjdavis.substack.com/i/182924778?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F62b88ec0-5fd6-4e8f-80be-38048c29a49a_813x373.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!exJJ!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F62b88ec0-5fd6-4e8f-80be-38048c29a49a_813x373.png 424w, https://substackcdn.com/image/fetch/$s_!exJJ!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F62b88ec0-5fd6-4e8f-80be-38048c29a49a_813x373.png 848w, https://substackcdn.com/image/fetch/$s_!exJJ!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F62b88ec0-5fd6-4e8f-80be-38048c29a49a_813x373.png 1272w, https://substackcdn.com/image/fetch/$s_!exJJ!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F62b88ec0-5fd6-4e8f-80be-38048c29a49a_813x373.png 1456w" sizes="100vw" 
loading="lazy"></picture></div></a><figcaption class="image-caption">US Dollar Credit to Nonbanks outside the United States. Source: <a href="https://www.atlantafed.org/-/media/documents/research/publications/policy-hub/2024/05/15/02--offshore-dollar-and-us-policy.pdf">Atlanta Fed</a></figcaption></figure></div><p>Except for the roughly $3T in the bottom left graph, all of these liabilities are held outside the United States. The interest payments on these debts create inelastic demand for US dollars. When trade is free-flowing, the dollar income can be used to make the interest payments, no problem.
But if trade income falls due to massive tariffs (or other shocks like COVID-era supply chain disruptions), dollars have to be sourced from somewhere else.</p><p>Where do the dollars come from in the absence of trade? Liquid US assets, mainly stocks and bonds<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-3" href="#footnote-3" target="_self">3</a>. When countries run a trade surplus with the US, they reinvest much of their dollar profits back into the shares of US companies and US government debt. The degree of this foreign ownership of US assets is captured in the United States&#8217; net international investment position (NIIP). A negative NIIP means that foreign countries own more American assets than America owns of foreign assets. We can also look directly at foreign investment in US equity and debt markets. The graphs below illustrate growth of this foreign ownership of US assets over the past few decades:</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!4Ph3!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F72aaf164-80c4-409d-8e79-8558beeb5fa1_1024x637.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!4Ph3!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F72aaf164-80c4-409d-8e79-8558beeb5fa1_1024x637.png 424w, https://substackcdn.com/image/fetch/$s_!4Ph3!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F72aaf164-80c4-409d-8e79-8558beeb5fa1_1024x637.png 848w, 
https://substackcdn.com/image/fetch/$s_!4Ph3!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F72aaf164-80c4-409d-8e79-8558beeb5fa1_1024x637.png 1272w, https://substackcdn.com/image/fetch/$s_!4Ph3!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F72aaf164-80c4-409d-8e79-8558beeb5fa1_1024x637.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!4Ph3!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F72aaf164-80c4-409d-8e79-8558beeb5fa1_1024x637.png" width="1024" height="637" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/72aaf164-80c4-409d-8e79-8558beeb5fa1_1024x637.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:637,&quot;width&quot;:1024,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:36096,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://connorjdavis.substack.com/i/182924778?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F72aaf164-80c4-409d-8e79-8558beeb5fa1_1024x637.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!4Ph3!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F72aaf164-80c4-409d-8e79-8558beeb5fa1_1024x637.png 424w, 
https://substackcdn.com/image/fetch/$s_!4Ph3!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F72aaf164-80c4-409d-8e79-8558beeb5fa1_1024x637.png 848w, https://substackcdn.com/image/fetch/$s_!4Ph3!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F72aaf164-80c4-409d-8e79-8558beeb5fa1_1024x637.png 1272w, https://substackcdn.com/image/fetch/$s_!4Ph3!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F72aaf164-80c4-409d-8e79-8558beeb5fa1_1024x637.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line 
x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>So whenever countries experience a dollar shortage due to severely reduced trade, they can sell their US assets in order to raise cash. This cash can then be used to service their dollar-denominated debts. This is one of the reasons why US stock and bond markets sold off after Trump&#8217;s Liberation Day in April. Typically US Treasuries are considered safe-haven assets, resulting in a bid (yields going down) whenever there is risk-off sentiment in the market. But yields didn&#8217;t go down after Liberation Day. They went up, which means bonds sold off hard.</p><p>The bond market is another reason why meaningful tariffs have no real staying power. Foreigners own upwards of $9T worth of our debt, which they can and will sell if they are forced to raise dollars to meet their interest obligations<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-4" href="#footnote-4" target="_self">4</a>. They can also sell Treasuries to create upward pressure on interest rates, which can have serious consequences for the global monetary system. It is not a coincidence that the Trump administration quickly walked back from the reciprocal tariffs and the 145% China tariff after the sharp bond market sell-off. Bessent <a href="https://www.reuters.com/markets/treasurys-bessent-market-drop-mag-7-problem-not-maga-one-tucker-carlson-2025-04-04/">made it clear</a>: they don&#8217;t really give a shit about stocks. But they absolutely care about the bond market.</p><p>The US Treasury market is the foundation of the global financial system, so any prolonged disorderliness in that market has catastrophic implications. Think of it this way. The entire dollar system is like a massive game of musical chairs, with 10,000 people to every one chair. The ever-increasing debt that flows from the Treasury market is the music.
If the music ever stops, the game is over because the system is completely insolvent. The thing is, the rest of the world knows this. For this reason, their ownership of US debt offers them immense leverage at the tariff negotiating table, much more than the Trump administration is probably willing to admit publicly.</p><p>My view is that tariffs are more likely being used as negotiating leverage than as sustainable long-term policy. I think when the dust settles we will see some tariffs on the margin for specific strategic goods (steel, aluminum) and industries (artificial intelligence, semiconductors, pharmaceuticals), along with some sort of &#8220;Mar-a-Lago Accord&#8221; that results in a globally coordinated weakening of the dollar relative to its peers, similar to the Plaza Accord in the 1980s. <a href="https://x.com/stevemiran?lang=en">Steve Miran</a>, Trump&#8217;s chair of the Council of Economic Advisers, proposed this as a possibility in his recent paper <a href="https://www.hudsonbaycapital.com/documents/FG/hudsonbay/research/638199_A_Users_Guide_to_Restructuring_the_Global_Trading_System.pdf">A User&#8217;s Guide to Restructuring the Global Trading System</a>. That paper is long but worth it if you want deeper insight into the ideas behind the administration&#8217;s policy. To that end, the dollar has already lost about 10% since Inauguration Day.
This is positive for the Trump administration&#8217;s goals, as it makes US exports more competitive.</p><h2>A Damn Shame</h2><p>Despite my view that broad reciprocal tariffs are basically not possible for any prolonged period of time due to the structural demand for dollars, I do think the underlying goal of restoring relative prosperity to the American middle class is vitally important. The current system has left many Americans disillusioned with the social contract they were raised to believe in: that hard work pays off.</p><p>In the last 50 years, the United States has exported dollars and its manufacturing base overseas. As discussed above, this has created a strong network effect of dollar-based trade, leading to a persistently strong dollar and massive inflows of capital back into US debt and equity markets. The question is whether this is good for America or not. The answer is it depends on which American you ask. If you ask someone who owns stocks or real estate, they are probably OK with the status quo. According to the latest FRED data from the St. Louis Fed, the <a href="https://fred.stlouisfed.org/series/WFRBST01112">top 1% wealthiest Americans own 34%</a> of all financial assets, whereas the <a href="https://fred.stlouisfed.org/series/WFRBSB50193">bottom 50% owns just 2.6%</a>.
The majority of Americans who live purely off of wages have missed out on the enormous nominal wealth that has accrued to asset owners.</p><p>The sense of falling behind is reflected in the actual data for major life milestones such as buying a home and having kids. The following chart from Apollo Chief Economist <a href="https://www.apolloacademy.com/median-age-of-homebuyers-56/">Torsten Slok</a> shows that the median age of all homebuyers is at an all-time high of 56 years old:</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!2eqj!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb6c64062-1c24-49c6-8411-5ab3c52d5cc6_1024x521.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!2eqj!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb6c64062-1c24-49c6-8411-5ab3c52d5cc6_1024x521.png 424w, https://substackcdn.com/image/fetch/$s_!2eqj!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb6c64062-1c24-49c6-8411-5ab3c52d5cc6_1024x521.png 848w, https://substackcdn.com/image/fetch/$s_!2eqj!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb6c64062-1c24-49c6-8411-5ab3c52d5cc6_1024x521.png 1272w, https://substackcdn.com/image/fetch/$s_!2eqj!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb6c64062-1c24-49c6-8411-5ab3c52d5cc6_1024x521.png 1456w" sizes="100vw"><img
src="https://substackcdn.com/image/fetch/$s_!2eqj!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb6c64062-1c24-49c6-8411-5ab3c52d5cc6_1024x521.png" width="1024" height="521" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/b6c64062-1c24-49c6-8411-5ab3c52d5cc6_1024x521.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:521,&quot;width&quot;:1024,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:30106,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://connorjdavis.substack.com/i/182924778?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb6c64062-1c24-49c6-8411-5ab3c52d5cc6_1024x521.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!2eqj!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb6c64062-1c24-49c6-8411-5ab3c52d5cc6_1024x521.png 424w, https://substackcdn.com/image/fetch/$s_!2eqj!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb6c64062-1c24-49c6-8411-5ab3c52d5cc6_1024x521.png 848w, https://substackcdn.com/image/fetch/$s_!2eqj!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb6c64062-1c24-49c6-8411-5ab3c52d5cc6_1024x521.png 1272w, https://substackcdn.com/image/fetch/$s_!2eqj!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb6c64062-1c24-49c6-8411-5ab3c52d5cc6_1024x521.png 1456w" 
sizes="100vw" loading="lazy"></picture></div></a></figure></div><p>This tracks with the increase in the median sales price of houses sold over a similar period:</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!dppN!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5b1fb118-5fb3-4299-9647-0d8716dda8c8_1024x523.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp"
srcset="https://substackcdn.com/image/fetch/$s_!dppN!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5b1fb118-5fb3-4299-9647-0d8716dda8c8_1024x523.png 424w, https://substackcdn.com/image/fetch/$s_!dppN!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5b1fb118-5fb3-4299-9647-0d8716dda8c8_1024x523.png 848w, https://substackcdn.com/image/fetch/$s_!dppN!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5b1fb118-5fb3-4299-9647-0d8716dda8c8_1024x523.png 1272w, https://substackcdn.com/image/fetch/$s_!dppN!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5b1fb118-5fb3-4299-9647-0d8716dda8c8_1024x523.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!dppN!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5b1fb118-5fb3-4299-9647-0d8716dda8c8_1024x523.png" width="1024" height="523" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/5b1fb118-5fb3-4299-9647-0d8716dda8c8_1024x523.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:523,&quot;width&quot;:1024,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:26396,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://connorjdavis.substack.com/i/182924778?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5b1fb118-5fb3-4299-9647-0d8716dda8c8_1024x523.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" 
class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!dppN!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5b1fb118-5fb3-4299-9647-0d8716dda8c8_1024x523.png 424w, https://substackcdn.com/image/fetch/$s_!dppN!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5b1fb118-5fb3-4299-9647-0d8716dda8c8_1024x523.png 848w, https://substackcdn.com/image/fetch/$s_!dppN!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5b1fb118-5fb3-4299-9647-0d8716dda8c8_1024x523.png 1272w, https://substackcdn.com/image/fetch/$s_!dppN!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5b1fb118-5fb3-4299-9647-0d8716dda8c8_1024x523.png 1456w" sizes="100vw" loading="lazy"></picture></div></a></figure></div><p>Have you ever wondered why housing prices have increased so much? A 3-bed 2-bath today is not that physically different from what it was 40 years ago, let alone 5 years ago. Yet home prices have increased 33% <em>since COVID</em>. One reason is that homes these days are used for more than just shelter. They are investments and a means to store value, i.e., they carry a &#8220;monetary premium&#8221; on top of their value for shelter. Most mortgages these days are bundled up into a fixed income instrument called a mortgage-backed security (MBS). Each MBS is sold by Wall Street to various institutional investment firms like pensions and insurance companies. These are the same kind of financial instrument that caused the Global Financial Crisis in 2008. In the time since that crisis, the Federal Reserve has <a href="https://www.newyorkfed.org/markets/programs-archive/large-scale-asset-purchases">printed</a> <a href="https://www.richmondfed.org/publications/research/economic_brief/2020/eb_20-08">trillions</a> of dollars to directly buy MBS as part of its quantitative easing programs, in effect subsidizing the mortgage market with money created out of thin air. This &#8220;financialization&#8221; of the housing market has kept rates low and allowed large institutional buyers and private equity firms like Blackstone to acquire massive portfolios of single-family homes, pricing out the average American family.</p><p>People are having fewer kids as well. 
The <a href="https://www.perplexity.ai/search/can-you-research-the-relations-JaMDHBsSTmOQcU9j9ciyPQ">birth rate in the US has fallen</a> from over three births per woman in the 1960s to around 1.6 births per woman today. Statista <a href="https://www.statista.com/chart/34607/most-common-reason-for-not-having-children/">recently reported</a> a new survey that showed the most common reason cited for not having kids was financial limitations:</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!Tnx3!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F43ad4436-3bcf-45b8-b3b6-959a3db4412f_819x1024.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!Tnx3!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F43ad4436-3bcf-45b8-b3b6-959a3db4412f_819x1024.png 424w, https://substackcdn.com/image/fetch/$s_!Tnx3!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F43ad4436-3bcf-45b8-b3b6-959a3db4412f_819x1024.png 848w, https://substackcdn.com/image/fetch/$s_!Tnx3!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F43ad4436-3bcf-45b8-b3b6-959a3db4412f_819x1024.png 1272w, https://substackcdn.com/image/fetch/$s_!Tnx3!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F43ad4436-3bcf-45b8-b3b6-959a3db4412f_819x1024.png 1456w" sizes="100vw"><img 
src="https://substackcdn.com/image/fetch/$s_!Tnx3!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F43ad4436-3bcf-45b8-b3b6-959a3db4412f_819x1024.png" width="470" height="587.6434676434676" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/43ad4436-3bcf-45b8-b3b6-959a3db4412f_819x1024.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1024,&quot;width&quot;:819,&quot;resizeWidth&quot;:470,&quot;bytes&quot;:284756,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://connorjdavis.substack.com/i/182924778?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F43ad4436-3bcf-45b8-b3b6-959a3db4412f_819x1024.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!Tnx3!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F43ad4436-3bcf-45b8-b3b6-959a3db4412f_819x1024.png 424w, https://substackcdn.com/image/fetch/$s_!Tnx3!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F43ad4436-3bcf-45b8-b3b6-959a3db4412f_819x1024.png 848w, https://substackcdn.com/image/fetch/$s_!Tnx3!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F43ad4436-3bcf-45b8-b3b6-959a3db4412f_819x1024.png 1272w, https://substackcdn.com/image/fetch/$s_!Tnx3!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F43ad4436-3bcf-45b8-b3b6-959a3db4412f_819x1024.png 
1456w" sizes="100vw" loading="lazy"></picture></div></a></figure></div><p>This game of hyper-financialization has been playing out for decades now. You can either choose to play or not. If you don&#8217;t, you aren&#8217;t going to keep up in nominal (monetary) wealth terms unless you have an inheritance. If you do decide to play, you have to invest time and money just to keep up with inflation, let alone to grow your wealth in real terms. The system of endless debt turns every dollar into an ice cube on hot pavement. Or, in the words of Oliver Anthony, &#8220;your dollar ain&#8217;t shit&#8221;. 
Case in point, the dollar has lost 25% of its value since 2020 due to the massive fiscal and monetary stimulus response to COVID.</p><p>What is the impact of this financial game that is so clearly rigged against the majority of the American public?</p><blockquote><p>Show me the incentives and I&#8217;ll show you the outcome. </p><p>- Charlie Munger</p></blockquote><p>Populist songs like <em>Rich Men North of Richmond</em>. OnlyFans. <a href="https://www.cnn.com/2024/12/04/us/brian-thompson-united-healthcare-death">Assassination of elites</a>. The <a href="https://www.perplexity.ai/search/can-you-tell-me-about-the-red-68mND1_NTBSdSPfKkvgpsA">Red Wave</a>. <a href="https://www.grandviewresearch.com/industry-analysis/online-gambling-market">Online gambling</a> and <a href="https://www.numerix.com/resources/blog/zero-day-options-0dte-start-2025-bang">0-DTE options</a>. When people struggle to <a href="https://www.foxbusiness.com/lifestyle/buy-now-pay-later-usage-groceries-nearly-doubles-consumers-struggle-food-costs">afford basic necessities like food</a>, the social fabric starts to unwind<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-5" href="#footnote-5" target="_self">5</a>. Violent protests and riots start to become everyday news. Many people blame politics (specifically the sitting president) for these things, and I agree that they play an important role. But the problem with our money is bipartisan; the graphs above span decades and multiple administrations, both Democrat and Republican. Tariffs may have some effect at the margin to bring specific industries back to American soil, but they will do nothing to reverse the profligate debt expansion that the current system depends on and the devastating consequences this expansion has on the value of the dollar and the American people.</p><p>What is the solution, if not tariffs? It&#8217;s a really tough question; one that I don&#8217;t have a full answer to. 
I think any solution has to address the root cause: the dollar as global reserve currency and the fiat financial system we have today. Replacing the dollar with a neutral reserve currency would allow the US trade balance to reach a natural equilibrium. This transition back to a net-positive or net-zero trade regime would be painful initially, as it would require the US to consume less than it produces for some time. It would demand a massive cultural shift in behavior, values, and norms, but it is possible to do. If we were to replace the debt-based system with one based on credit and hard money (i.e. money that cannot be debased), it would flip the underlying price dynamic from an inflationary one to a deflationary one. Imagine the price of eggs <em>falling</em> each year rather than rising. Mainstream economic thought would say this situation is bad because if people expect prices to fall, they will tend to hoard the currency rather than spend it, causing the economy to grind to a halt<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-6" href="#footnote-6" target="_self">6</a>. I agree that massive deflation probably is not good, but why not 1% per year? It would align with the naturally deflationary force of technology, which is only going to accelerate in the coming years due to AI and robotics. Call me crazy, but gradually falling prices and robots doing my laundry and getting me groceries sounds like peak human civilization to me.</p><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-1" href="#footnote-anchor-1" class="footnote-number" contenteditable="false" target="_self">1</a><div class="footnote-content"><p>Like cash in a vault or deposits at the Federal Reserve.</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-2" href="#footnote-anchor-2" class="footnote-number" contenteditable="false" target="_self">2</a><div class="footnote-content"><p>The required reserve ratio has been&#8230;drumroll&#8230;0% since the COVID crash in 2020. This means the money multiplier is 1 / 0, which is a very big number.</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-3" href="#footnote-anchor-3" class="footnote-number" contenteditable="false" target="_self">3</a><div class="footnote-content"><p>This isn&#8217;t the only alternative source. In periods of severe market distress, the Fed steps in to supply dollars via <a href="https://www.perplexity.ai/search/can-you-explain-fed-dollar-swa-DarOzUKETzm244SduHfZ4w">swap lines</a>, which are short-term dollar loans made at the Secured Overnight Financing Rate (SOFR).</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-4" href="#footnote-anchor-4" class="footnote-number" contenteditable="false" target="_self">4</a><div class="footnote-content"><p>There is likely much more nuance than described briefly here. 
The elephant in the US Treasury room is the insane multi-trillion dollar deficits that the US government is running, which are only set to grow with the passage of things like the Big Beautiful Bill, <a href="https://www.perplexity.ai/search/what-is-the-projected-impact-o-MlaYpIXORjOvhf0FrISJag">which is estimated to increase the deficit by $2.4 trillion</a> over the next ten years (excluding interest cost!). The bond market is demanding higher rates to compensate for the inflationary nature of DC&#8217;s spending addiction and the potential inflationary impact of increased tariffs.</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-5" href="#footnote-anchor-5" class="footnote-number" contenteditable="false" target="_self">5</a><div class="footnote-content"><p>See the Fall of the Roman and Weimar Republics.</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-6" href="#footnote-anchor-6" class="footnote-number" contenteditable="false" target="_self">6</a><div class="footnote-content"><p>This is what made the Great Depression so nasty.</p></div></div>]]></content:encoded></item><item><title><![CDATA[Dopamine Whack-a-mole]]></title><description><![CDATA[Have you ever told yourself that you&#8217;re going to cut back on social media use or some other compulsive behavior?]]></description><link>https://www.connorjdavis.com/p/dopamine-whack-a-mole</link><guid isPermaLink="false">https://www.connorjdavis.com/p/dopamine-whack-a-mole</guid><dc:creator><![CDATA[Connor Davis]]></dc:creator><pubDate>Tue, 22 Apr 2025 14:57:00 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!ZbOo!,w_256,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F16ac54db-581c-4d87-897b-1a07019f089d_1280x1280.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>Have you ever told yourself that you&#8217;re going to cut back on social media use or 
some other compulsive behavior?</p><p>And you actually succeed in cutting back, only to discover that your brain sneakily found some <em>other</em> thing to do compulsively instead. Something you otherwise rarely do steps in to replace the original bad habit.</p><p>I call this the <em>dopamine whack-a-mole</em>.</p><p>It&#8217;s a game that I find myself playing quite a bit. One thing that helps me with it is being aware of when a new habitual distraction pops up. Then at least I can label it and bring it into conscious awareness. This gives me greater ability to pause the next time I reach for the new distraction.</p><p>The other thing that I&#8217;ve found to be helpful is to start each day with no phone, and instead give my attention to two things that I want to do or work on. I keep my phone in another room at night. Only after doing those two things, whatever they may be, do I let myself go on my phone if I need to. On the best days, I forget about my phone completely and get loads done. This is especially true whenever I&#8217;m trying to create something, as I&#8217;ve written about <a href="https://connorjdavis.substack.com/p/book-review-the-creative-act?r=1nb12u">before</a>. Keeping the phone away helps create space for new ideas.</p><p>Have you ever experienced this dopamine whack-a-mole? What are your favorite moles? How do you overcome them?</p>]]></content:encoded></item><item><title><![CDATA[Book Review: The Creative Act]]></title><description><![CDATA[The Creative Act by Rick Rubin is a book that has been on my radar for a while.]]></description><link>https://www.connorjdavis.com/p/book-review-the-creative-act</link><guid isPermaLink="false">https://www.connorjdavis.com/p/book-review-the-creative-act</guid><dc:creator><![CDATA[Connor Davis]]></dc:creator><pubDate>Fri, 11 Apr 2025 14:53:00 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!ZbOo!,w_256,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F16ac54db-581c-4d87-897b-1a07019f089d_1280x1280.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p><a href="https://www.amazon.com/Creative-Act-Way-Being/dp/0593652886">The Creative Act</a> by Rick Rubin is a book that has been on my radar for a while. Since I&#8217;ve been scratching my creative itch lately, I decided to read it to see what it was all about. It ended up being one of the best books I&#8217;ve ever read. It spoke directly to my soul. I made a highlight or left a note on almost every page.</p><p>The book opens by dispelling the narrative that art is restricted to what we traditionally consider to be art, such as making music or painting. Rather, it argues that creativity is a universal property of human nature. Creativity is certainly expressed through paint and song, but it could just as well be expressed through a home-cooked meal, a gift, or a new business. 
Creativity is the act of bringing something into existence that wasn&#8217;t there before. Even if we don&#8217;t <em>do</em> anything at all and just <em>be</em>, we are still creating our version of reality through our unique filter of sensory experience. Being creative is more than an activity. It is a fundamental aspect of the human spirit. Just as birds fly and fish swim, humans create.</p><p>With this spiritual foundation in place, the book proceeds to examine how to cultivate a creative way of being in our lives.</p><p><em>Tuning in</em> is a critical component for cultivating creativity. The universe is buzzing with energetic potential that searches for a path through which it can be expressed. Humans are conduits that facilitate the manifestation of this potential into the physical world. We may think that ideas are our own, but they are really downloads from the larger creative force that surrounds us; initial seeds that have sprouted into our consciousness. This is why it is not uncommon for an idea of yours to be expressed by someone else if you choose not to bring it to life. The other person didn&#8217;t steal your idea; the idea&#8217;s time has come. The idea re-routed around you and found another path to manifestation.</p><p>How can we better tune in? Two ways: leveling up our taste and creating space.</p><p>We need to be selective about what information we let into our lives. If our bodies are the food we ingest, our minds are the information we consume. Rather than doom-scroll social media, read a book. Better yet, read classic literature. And don&#8217;t just read to read, read to understand and to form critical opinions.</p><p>To let creative inspiration in, we need to create space for it. There are two high-level ways to do this, both involving filters. The first is to use a highly restrictive external filter. 
The second is to use a minimally restrictive internal filter, or perhaps even no internal filter at all.</p><p>Our external filter controls the information that we receive <em>from</em> the outside world. Our high-volume, low-quality information culture fills our stream of consciousness with low-fidelity noise that drowns out any signal that would otherwise be picked up by our creative antennae. Filtering out sources of external noise allows for creative signal to be received. Sometimes the best option is to just sit in silent contemplation. Or dedicate time each day with no phone or any other external sources of information. This creates a vacuum in our minds which can be filled with creative insight. Discipline helps here, but I&#8217;ve personally found that grounding the restrictions against a set of values or higher purpose is more effective and motivating, especially at the beginning when the siren song of the external digital world is so loud.</p><p>Our internal filter controls the information that we transmit <em>into</em> the outside world. This filter may be overly restrictive due to fear, ego, or the self-constructed narratives we tell ourselves. As adults, we carry with us narratives and labels that situate our self in relation to others and society. The stories we tell about ourselves can act as a filter that restricts our creative potential. When our identity is based on a preconceived set of labels, it becomes more difficult to act in ways that contradict that identity. We create roles for our self. These roles implicitly confine us within the boundary of <a href="https://connorjdavis.substack.com/p/metaphors-and-the-infinite-game?r=1nb12u">finite games</a>. Roles force us to play within the rules rather than play <em>with</em> the rules. How do we go about deconstructing our narratives and labels? The book recommends adopting a more childlike attitude. Children just <em>do</em> things. 
They live in the moment without worrying about the future or trying to conform to some identity.</p><p>Being more childlike as an adult is much easier said than done. Unfortunately, the book doesn&#8217;t really expand on how adults can be more childlike in practice, and in fact recommends not to over-analyze or force this state. However, in the spirit of playing with the rules rather than within them, I want to offer an idea for doing this. You need to seek out activities that suppress your <em>default mode network</em> (DMN). This brain region is widely believed to be the physical &#8220;location&#8221; of the self, and incidentally, is not yet fully developed in the brains of children. Suppressing this region in adults achieves a childlike state of mind via increased communication between different brain regions that don&#8217;t normally talk to one another. The reason they don&#8217;t normally talk to one another is that the DMN acts as a reducing valve that prunes connections between different brain regions. You can think of the DMN as the stern teacher that walks into a classroom. Before the teacher is in the room, there are conversations everywhere, even between people on opposite sides of the room. But as soon as the teacher walks in, the room goes silent and the more predictable patterns of conversation start to emerge (e.g., roll call). How do you suppress the DMN? <a href="https://pmc.ncbi.nlm.nih.gov/articles/PMC4529365/#:~:text=Abstract,in%20meditators%20compared%20to%20controls">Meditation</a>, <a href="https://www.nature.com/articles/s41586-024-07624-5">psychedelics</a>, and perhaps(?) 
<a href="https://connorjdavis.substack.com/p/ouray-50ish-ultra-marathon?r=1nb12u">prolonged aerobic exercise</a>.</p><p>In addition to looking at ways to tune in and what that means for creativity, the book implicitly offers a set of <a href="https://connorjdavis.substack.com/p/metaphors-and-the-infinite-game?r=1nb12u">metaphors</a> that provide psychological structure to the creative act, which really resonated with me.</p><p>CREATIVITY IS ABUNDANT OPPORTUNITY</p><p>Creativity is about tapping into the flow of ideas and manifesting them in the world. When you tune in to this flow and surrender yourself to it, it never runs dry. Ideas are constantly flowing. On the other hand, if we view creativity or ideas as being a competition over scarce resources, we hold onto our work, keep it in our heads, and are less prepared to share it with the world, because we think that one work will define us for the rest of our lives. A symptom of a scarcity mindset is perfectionism, which is based in fear. Fear of looking weird or bad, or otherwise not living up to the expectations that we or other people have for us. This is one form of our internal filter that we have to learn to let go of. An abundant mindset, by contrast, allows us to see each work as just one small piece within a much larger work of our lives. 
When we create with an abundant mindset, we are more free to start, finish, and share our work so that we may let it go and move on to the next piece.</p><p>CREATIVITY IS AN INFINITE GAME</p><p>The book describes creation as a game we play to <em>play</em> rather than to <em>win</em>. It is positive sum &#8211; great work begets more great work, both in ourselves and others. The process of tuning in, of restricting our external filter and loosening our internal filter, allows us to engage in the creative act as a forward-looking player in an infinite game. It allows us to transcend the titles we may have acquired in other finite games we play, and create freely based on our intuitive sense of what the universe is telling us to work on rather than being restricted in scope by the roles we have created for ourselves. When we create as part of an infinite game, we honor our own curiosity by expressing what is inside of us, rather than trying to predict what will be most socially accepted at the time. Success is defined not by the level of public perception or praise, but by consistent engagement with the process &#8211; start, finish, and share, then move on to the next one.</p><p>I&#8217;ve highlighted the parts of the book that have resonated with me the most. I&#8217;ll close with what I see as the key takeaway from the book: In order to create effectively, you need to develop self-awareness and tune in to your self. The answer is within you, not in the outside world.</p>]]></content:encoded></item><item><title><![CDATA[Metaphors and the Infinite Game]]></title><description><![CDATA[There&#8217;s a book called Metaphors We Live By by George Lakoff and Mark Johnson that I recommend to anyone who asks for a good book to read.]]></description><link>https://www.connorjdavis.com/p/metaphors-and-the-infinite-game</link><guid isPermaLink="false">https://www.connorjdavis.com/p/metaphors-and-the-infinite-game</guid><dc:creator><![CDATA[Connor Davis]]></dc:creator><pubDate>Tue, 04 Feb 2025 15:34:00 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!ZbOo!,w_256,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F16ac54db-581c-4d87-897b-1a07019f089d_1280x1280.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>There&#8217;s a book called <a href="https://www.amazon.com/dp/0226468011?psc=1&amp;language=en_US">Metaphors We Live By</a> by George Lakoff and Mark Johnson that I recommend to anyone who asks for a good book to read. I buy extra copies and gift them to friends of mine. The premise is that metaphors are more than linguistic constructs we learn about in grade school &#8211; they are powerful tools that structure our everyday experience. This structure is not only reflected in our language, but also in our thoughts and actions.</p><p>The tricky part with the metaphors we live by is that they tend to be subliminal. As soon as we start to acquire language as a child, we subconsciously download metaphors from our parents, friends, and culture. 
This download occurs through media and language: the songs we listen to, the books we read, the movies we watch, and the words we hear in the conversations around us. The choice of words reflects the underlying metaphor being used.</p><p>Here are some common metaphors from American culture, along with language that reflects them:</p><p>ARGUMENT IS WAR</p><p>I <em>demolished</em> his argument. His claims were <em>indefensible</em>. She <em>shot down</em> all of my arguments. I&#8217;ve never <em>won</em> an argument <em>against</em> him. His rebuttal was a great <em>counterattack</em>. I <em>attacked</em> his position. <a href="https://time.com/5318965/how-to-win-an-argument/">How to </a><em><a href="https://time.com/5318965/how-to-win-an-argument/">Win</a></em><a href="https://time.com/5318965/how-to-win-an-argument/"> Every Argument</a>.</p><p>POLITICS IS WAR</p><p><a href="https://www.theguardian.com/us-news/2024/oct/04/us-election-barack-obama-kamala-harris-campaign-pennsylvania-rally">Barack Obama to campaign for Harris across </a><em><a href="https://www.theguardian.com/us-news/2024/oct/04/us-election-barack-obama-kamala-harris-campaign-pennsylvania-rally">battleground</a></em><a href="https://www.theguardian.com/us-news/2024/oct/04/us-election-barack-obama-kamala-harris-campaign-pennsylvania-rally"> states next week</a>. <a href="https://www.cbsnews.com/news/cbs-news-vp-debate-poll-2024/">Who </a><em><a href="https://www.cbsnews.com/news/cbs-news-vp-debate-poll-2024/">won</a></em><a href="https://www.cbsnews.com/news/cbs-news-vp-debate-poll-2024/"> the VP debate</a>? <a href="https://www.npr.org/2024/11/08/g-s1-33274/2024-election-how-trump-won-takeaways">Why Trump </a><em><a href="https://www.npr.org/2024/11/08/g-s1-33274/2024-election-how-trump-won-takeaways">won</a></em><a href="https://www.npr.org/2024/11/08/g-s1-33274/2024-election-how-trump-won-takeaways"> &#8212; 9 takeaways from the 2024 election</a>.</p><p>LOVE IS WAR</p><p>She left me <em>defenseless</em>. 
He has a high <em>body count</em>. <a href="https://www.youtube.com/watch?v=IGVZOLV9SPo">Love is a </a><em><a href="https://www.youtube.com/watch?v=IGVZOLV9SPo">battlefield</a>. </em>They had a bad <em>fight</em> last night. She <em>killed</em> his tendency to be vulnerable.</p><p>FOOD IS A REWARD</p><p><a href="https://www.youtube.com/watch?v=5IpYOF4Hi6Q&amp;list=RD5IpYOF4Hi6Q&amp;start_radio=1">How can you have any pudding if you don&#8217;t eat your meat?</a> I&#8217;ve been good, so I&#8217;m going to <em>treat</em> myself; I <em>deserve</em> it. It&#8217;s been a long week, so I&#8217;m going to <em>splurge</em> on some chocolate. Finish your dinner <em>then</em> you can have dessert. <a href="https://www.coca-colacompany.com/about-us/history/history-of-coca-cola-advertising-slogans">Open Happiness</a>.</p><h2>Changing Metaphors</h2><p>The most impactful takeaway I had from <em>Metaphors We Live By</em> is that we get to <em>choose</em> the metaphors that define our lives. This choice comes after we bring the metaphors that we are using into our conscious awareness. Then we can look at each one and ask the question, &#8220;is this the metaphor I really want to be framing my experience with, or is there one that resonates with me more?&#8221;. If we don&#8217;t like the answer, we are free to change it to anything we want. Changing our metaphors does more than change the words we speak; it fundamentally alters how we experience day-to-day life in our thoughts and behaviors.</p><p>After I read the book, I wrote down the metaphors that had been implicitly chosen for me by society in various parts of my life. At the time, I was searching for a new job and was getting ready to start interviewing. In the past, I had always been nervous during interviews, to the point where I had trouble clearly communicating my experience. Interviews terrified me. 
I began to wonder, &#8220;what might the metaphor be that was causing this fear?&#8221; This is what I came up with:</p><p>INTERVIEWS ARE A COMBATIVE DEFENSE OF SELF</p><p>This metaphor was framing my experience of interviews. Where did this come from? At least in my experience, it was common to receive questions like &#8220;are you getting <em>nervous</em> for your interview?&#8221;, or even simple words of encouragement like &#8220;good luck&#8221;. The underlying assumption was that an interview is something I <em>should</em> be nervous and fearful about, or one in which luck plays an outsized role in success.</p><p>If you think about the goal of an interview, though, it doesn&#8217;t make sense to frame it as something that requires combat or self-defense. Both parties share a common goal: to figure out if there is a good fit for the role through the sharing of information. A better metaphor is:</p><p>INTERVIEWS ARE A COLLABORATIVE SHARING OF INFORMATION</p><p>When you look at an interview from this point of view, it immediately changes how you perceive the actual event, as well as how you prepare for it. It wasn&#8217;t until I switched to this metaphor that I understood the importance of learning how to be a better interviewer, as well as the importance of thoroughly rehearsing my past experience ahead of time. The reason is that the new metaphor aligned my mindset and preparation with the actual purpose of the interview. With this new metaphor in hand, I managed to successfully get through not one, but six interviews and ended up getting an offer. What&#8217;s even more fascinating is that my physical stress response during the interviews was significantly reduced compared to prior ones.</p><p>There are a couple other metaphors that I have changed which have been especially impactful.</p><p><s>LOVE IS A WAR</s> LOVE IS A COLLABORATIVE WORK OF ART</p><p>I have had a tendency to structure my relationships as something that I should win or defend myself against. 
This has manifested as me getting defensive when my partner shares a grievance with me, especially if their interpretation of what happened isn&#8217;t what I intended or doesn&#8217;t match my version of reality. People have varying frames of reference (influenced, at least in part I imagine, by the metaphors they are explicitly or implicitly using), and so it is completely reasonable for someone to interpret an event in such a way that would make them upset, and simultaneously for me to feel that they may be overreacting because my intentions were pure. When you view your partner as an adversary on a battlefield, you dig in and try to prove to them how their version of reality is not valid. This tends to make them defensive, as it signals to them that their emotions are invalid. Now you&#8217;ve got a full-blown &#8220;fight&#8221;.</p><p>However, if you view your relationship as a work of art, and your partner as a fellow artist on that work of art, the dynamic changes completely. True art is not something to be won. It exists outside the realm of competition. The best music contains tension, but it also resolves that tension with a series of notes that underscore the harmony inherent to the song. Of course relationships can develop tension as well. The goal is not to add more tension by trying to &#8220;win&#8221;, but to resolve it with the notes of listening and understanding. The goal is to keep the music playing. There is nuance here. I&#8217;m not saying it&#8217;s a good idea to just roll over for the sake of resolving tension. The best relationships have strong boundaries and healthy emotions expressed openly. But this expression is done not for victory, but for continuation of the relationship in the long term.</p><p><s>FOOD IS A REWARD</s> FOOD IS INFORMATION</p><p>Have you ever wondered why the &#8220;Food and Drug Administration&#8221; separates &#8220;Food&#8221; and &#8220;Drug&#8221; as if they are two separate things? 
The distinction only exists in language. To our cells, they are the same thing: chemical information. Our cells don&#8217;t use language to distinguish things the same way that our minds do. When you pop a pill or eat a steak, your cells just see various chemicals swimming around. And critically, the chemical information isn&#8217;t just discarded. It is actively <em>used</em> to create outputs, whether that be to rebuild new cells, or to express various parts of our genome.</p><p>Viewing food as information instead of a reward completely changed how I relate to eating. Before, when food was a reward, I remember justifying a dessert to myself because of something that I had done or even just based on what day of the week it was. I did this with alcohol too, especially in college. I worked hard in college during the week, which meant the weekend was for getting slammed. Now, with food as information, my diet is aligned with the needs of my body instead of my emotions. I try to give my cells what they need, which is good information. Doing that consistently leads to more energy and feeling good, both in my body <em>and</em> my mind. It has also led to more flexibility in some ways. We&#8217;ve been conditioned to eat certain meals at certain times of day, but to our bodies it is all the same. Steak for breakfast? Absolutely. One closing thought on this: I&#8217;m not saying that eating isn&#8217;t associated with reward at all. Eating a good meal with friends and family is certainly rewarding and healthy. But in this case it is the social interaction that is the reward, not the food per se.</p><p>Why is changing metaphors so effective at changing behavior? I think it relates to a mental model called the <a href="https://fs.blog/map-and-territory/">map versus territory</a>. Metaphors are &#8220;maps&#8221; of our experience that we use to simplify our navigation of the world. 
They project the high entropy of reality into a lower dimensional form that is easier for us to deal with. These projections are saved into our library of mental subroutines that are triggered by our subconscious when certain environmental patterns are detected. The thing about maps is that some are better than others. When we upgrade our maps to versions that are closer to reality, they allow us to be more effective in our navigation of the world.</p><p>If changing metaphors can be beneficial, how do you change them? The first step is to be aware of what your current metaphors are. This is easier said than done, because metaphors we live by are ingrained in us and reside underneath our conscious awareness. Try reading <em>Metaphors We Live By</em>. The book gives a lot of examples that may trigger ideas for your life. After you read it take some time to be introspective and list out the different aspects of your life. See if any of these are particularly stressful or fearful. Then ask &#8220;what might the metaphor be related to this aspect of my life that is causing me to feel this way?&#8221; Other things that have helped me are meditation and psychedelics. Both of these help bring the subconscious to the conscious, including metaphors. Once you are aware of your metaphors, all you have to do is decide if they actually resonate with you. 
If they don&#8217;t, change them.</p><h2>Finite versus Infinite Games</h2><p>I&#8217;ve noticed an underlying theme in the metaphors presented in <em>Metaphors We Live By</em> as well as in the more ill-fitting ones for my own life, including those I&#8217;ve changed as discussed above. Or rather than a theme, perhaps a better characterization would be an <em>&#252;ber</em>-metaphor: the root metaphor from which all the other metaphors descend. The common thread is they involve <em>winning</em> something. They characterize the major aspects of life &#8211; school, work, relationships, money, etc &#8211; as arenas of play in which there can only be one winner.</p><p>This characterization of life as different forms of play is captured in another book called <a href="https://en.wikipedia.org/wiki/Finite_and_Infinite_Games">Finite and Infinite Games</a> by James Carse that I highly recommend. The book defines two types of games: finite games and infinite games.</p><p>Finite games have strict rules and a fixed boundary of play. The goal of the game is to <em>win</em>. And for one player to win, the other players must lose. In this way, finite games are <em>zero-sum</em>. Finite games are also defined by <em>roles</em>. When we play a finite game, we assume a role that carries a title, which has some meaning within the context of the game itself, like &#8220;quarterback&#8221; or &#8220;senior vice president&#8221;. 
Titles are abstractions that indicate completed victories and therefore orient players towards the past. The orientation towards the past keeps finite play <em>within</em> the boundary of the game. In finite play, life is scarce; something to be won, possessed, and acquired.</p><p>The infinite game is different. It is chaotic and non-stationary: the rules and players continuously change. There may be a boundary temporarily, but it is generally amorphous and dynamic with respect to time. The goal is not to win, but to <em>keep playing the game</em>. The infinite game is <em>positive-sum</em>, where players&#8217; contributions enable the game to continue, both for themselves and for others. The infinite player does not completely avoid finite games, but rather recognizes them as small pockets of play subsumed by the infinite game. Infinite players embrace the abstractness of finite games they willingly choose to compete in, and therefore take them up playfully, not seriously. Surprisingly (to me), infinite players do <em>not</em> eschew the roles of finite games, but rather assume them with the full acknowledgement that they are yet another abstraction, completely separate from who they really are. Where the finite player is oriented towards the past and fully constrained by and within the rules of the game, the infinite player is oriented towards the future and therefore is free to play <em>with</em> the rules of the game, because infinite players understand that there is a bigger game being played in which any particular finite game is but a small part. The infinite game admits ample degrees of freedom for manipulating existing rules or creating new rules entirely. In infinite play, life is abundant; something to be experienced and created.</p><p><s>LIFE IS A FINITE GAME</s> LIFE IS AN INFINITE GAME</p><p>Love is a battlefield. Argument is war. Food is a reward. All of these metaphors descend from the root metaphor LIFE IS A FINITE GAME. 
When we view every aspect of our life as an opportunity to win or acquire something, we are playing the game of life as a finite player. Finite play orients our existence towards the acquisition of abstract titles, colors our relationships with adversarial undertones, and instills our mindset with a sense of scarcity.</p><p>On the other hand, if we choose to live by LIFE IS AN INFINITE GAME, we are choosing to approach life with a certain playfulness rather than seriousness. Choosing infinite play doesn&#8217;t mean we opt out of all finite games. It means the way we play them changes. Instead of conflating our true sense of self with the roles required for any particular game, we maintain separation between them and fully exercise our ability to assume and unassume our role within the finite games we play. This separation allows us to maintain perspective on what truly matters to us, which can only be determined outside the confines of bounded games and their attendant rules and titles.</p><p>Love is a collaborative work of art. Argument is dance. Food is information. All of these metaphors descend from LIFE IS AN INFINITE GAME. The goal is not to win, but to continue playing. Infinite play orients us towards future possibility rather than the past. It emphasizes healthy boundaries and cooperation and instills in us a sense of abundance.</p><p>Now that you know metaphors can be much more than a literary device, can you think of any that may need an update for your own life? Raising awareness of existing metaphors and proposing new ones has been of great benefit to my own life. Perhaps it will benefit you too.</p><div class="subscription-widget-wrap-editor" data-attrs="{&quot;url&quot;:&quot;https://www.connorjdavis.com/subscribe?&quot;,&quot;text&quot;:&quot;Subscribe&quot;,&quot;language&quot;:&quot;en&quot;}" data-component-name="SubscribeWidgetToDOM"><div class="subscription-widget show-subscribe"><div class="preamble"><p class="cta-caption">Thanks for reading! 
Subscribe for free to receive new posts and support my work.</p></div><form class="subscription-widget-subscribe"><input type="email" class="email-input" name="email" placeholder="Type your email&#8230;" tabindex="-1"><input type="submit" class="button primary" value="Subscribe"><div class="fake-input-wrapper"><div class="fake-input"></div><div class="fake-button"></div></div></form></div></div><p></p><p></p>]]></content:encoded></item><item><title><![CDATA[Ouray 50ish Ultra Marathon]]></title><description><![CDATA[I had the opportunity to participate in the Ouray 50 ultramarathon in September.]]></description><link>https://www.connorjdavis.com/p/ouray-50ish-ultra-marathon</link><guid isPermaLink="false">https://www.connorjdavis.com/p/ouray-50ish-ultra-marathon</guid><dc:creator><![CDATA[Connor Davis]]></dc:creator><pubDate>Sat, 09 Nov 2024 15:01:00 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!PFgy!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffbbf6bc8-897d-4c99-9cc0-525bf8976d8c_768x1024.jpeg" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>I had the opportunity to participate in the <a href="https://www.ouray100.com/map-elevation-50-miler">Ouray 50 ultramarathon</a> in September. The Ouray 50 is kind of insane. Runners have 24 hours to run 50 miles<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-1" href="#footnote-1" target="_self">1</a> with over 20,000 feet of elevation gain around the historic mining trails of Ouray in southwest Colorado<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-2" href="#footnote-2" target="_self">2</a>.</p><p>This was the second ultra I&#8217;ve done in my life. I can&#8217;t really describe why I decided to do arguably the most difficult 50-miler in the ultramarathon world as my second-ever race (the first I did was a 50k). 
It was more of a gut feeling that I should do it rather than the result of calculated decision making. A call to adventure, if you will.</p><h2>Training</h2><p>I started training for the race in March of this year. I focused on building out my aerobic base from March until about mid-July. This consisted of high-volume, low-intensity runs, in addition to maintaining a fairly regular schedule at my gym for strength and mobility. My goal for each run during this period was to stay under (within 10 beats of) my aerobic threshold for the entire run. About two months out from the race, I shifted focus to elevation and strength training. I trained on the trails around Ouray, doing 6 to 9 hour days on the weekend.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!1UF0!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7b7916c1-9a45-4a45-9340-7ee704fb00ac_1007x574.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!1UF0!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7b7916c1-9a45-4a45-9340-7ee704fb00ac_1007x574.png 424w, https://substackcdn.com/image/fetch/$s_!1UF0!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7b7916c1-9a45-4a45-9340-7ee704fb00ac_1007x574.png 848w, https://substackcdn.com/image/fetch/$s_!1UF0!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7b7916c1-9a45-4a45-9340-7ee704fb00ac_1007x574.png 1272w, 
https://substackcdn.com/image/fetch/$s_!1UF0!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7b7916c1-9a45-4a45-9340-7ee704fb00ac_1007x574.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!1UF0!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7b7916c1-9a45-4a45-9340-7ee704fb00ac_1007x574.png" width="725" height="413.25719960278053" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/7b7916c1-9a45-4a45-9340-7ee704fb00ac_1007x574.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:574,&quot;width&quot;:1007,&quot;resizeWidth&quot;:725,&quot;bytes&quot;:47660,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:&quot;https://connorjdavis.substack.com/i/182745484?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7b7916c1-9a45-4a45-9340-7ee704fb00ac_1007x574.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!1UF0!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7b7916c1-9a45-4a45-9340-7ee704fb00ac_1007x574.png 424w, https://substackcdn.com/image/fetch/$s_!1UF0!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7b7916c1-9a45-4a45-9340-7ee704fb00ac_1007x574.png 848w, 
https://substackcdn.com/image/fetch/$s_!1UF0!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7b7916c1-9a45-4a45-9340-7ee704fb00ac_1007x574.png 1272w, https://substackcdn.com/image/fetch/$s_!1UF0!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7b7916c1-9a45-4a45-9340-7ee704fb00ac_1007x574.png 1456w" sizes="100vw" fetchpriority="high"></picture></div></a><figcaption class="image-caption"><em>Figure 1: My Garmin data from this year. 
Note September (the month of the race) is low because I got a Coros at that time which has my race data on it.</em></figcaption></figure></div><p>As you can see from my Garmin data above, my highest average distance during training was around 25 miles per week. There were a few runs that I didn&#8217;t have my watch on, but not enough to have put me much over the 25 miles per week mark, if at all. Since I&#8217;m not a professional runner by any means, my goal was just to finish before the 24-hour cutoff. I didn&#8217;t have much more time to train given my job, social responsibilities, and other hobbies. I didn&#8217;t follow any formal training plan or coaching.</p><p>Despite the lack of a formal training regimen, I did follow the recommendations from Mark Sisson&#8217;s book, <a href="https://www.amazon.com/Primal-Endurance-chronic-carbohydrate-dependency/dp/1939563089/ref=tmm_pap_swatch_0">Primal Endurance</a>. One key argument of the book is that endurance athletes tend to overtrain and underrecover. This is especially easy to do when you already have good fitness and you are following a strict training schedule that only focuses on volume. The book shows that subjective feelings of rest aren&#8217;t necessarily sufficient to determine your recovery level. Instead, you need to measure <a href="https://en.wikipedia.org/wiki/Heart_rate_variability">heart rate variability</a> (HRV) and combine that with how you feel. A lower HRV means you should probably rest more before the next training session. During my own training, there were many mornings where my body felt good, but my HRV was in the ditch, so I ended up taking extra rest days until my HRV recovered to baseline or better.</p><p>The other major lever I used to train had nothing to do with exercise: nutrition.</p><h2>Nutrition</h2><p>I&#8217;m an ardent reader of <a href="https://www.lynalden.com/">Lyn Alden&#8217;s investment research</a>. 
Her macro takes are probably the most clear and well written economic analysis I&#8217;ve ever read. She also has a fantastic book called <a href="https://www.amazon.com/Broken-Money-Financial-System-Failing/dp/B0CG8985FR">Broken Money</a>. Read it. Take notes. Everyone should read that book.</p><p>You&#8217;re probably wondering, what the hell do Lyn Alden and investment research have to do with my nutrition for the Ouray 50? Well, she has a non-financial article called <a href="https://www.lynalden.com/increase-energy/">12 In-Depth Tactics to Seriously Boost your Energy</a> that I happened to read in May, espousing the benefits of low-carb ketogenic diets for mental and physical performance. In that article, she also recommends the book <a href="https://www.amazon.com/Art-Science-Low-Carbohydrate-Performance/dp/0983490716">The Art and Science of Low Carbohydrate Performance</a>.</p><p>I had experimented with keto a couple years prior, but wasn&#8217;t able to stick with it for more than a week. Reading Lyn&#8217;s article and book recommendation is what made me want to try again, so I did. I transitioned to a strict, well-formulated keto diet in May, emphasizing whole food from quality sources. No ultra-processed keto breads or desserts. Every week I went to my local farmer&#8217;s market to buy grass-fed, grass-finished beef from a local ranch, as well as fresh greens, vegetables, and herbs from regenerative farms.</p><p>My goal here is to share my subjective, &#8220;n of 1&#8221; experience with keto, especially in the context of training for and running a long, difficult race. The goal is not to convince you to start eating a low-carb diet. It bugs me whenever I hear people claiming a particular diet is clearly &#8220;the best&#8221; for <em>everyone</em>. Food is not a finite game to be won and lorded over other people. Food is information. And besides, nutrition is unsettled science. 
In my opinion, the human body is an irreducibly complex system<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-3" href="#footnote-3" target="_self">3</a> in the mathematical sense, so nutrition will remain unsettled science for a long time. Luckily, I don&#8217;t need science to tell me how food makes me <em>feel</em>. I can use my body and intuition as a guide, no science required.</p><p>Now, my subjective experience. After the first few days of keto, I felt increased mental clarity and general cognitive &#8220;wired-ness&#8221; throughout the day. And I started feeling <em>really good</em> at the 6-week mark. At that point, just about every aspect of my life was notably better &#8211; sleep, cognition, energy levels, mood. I had significantly less gas and significantly better erections. I had multiple people comment on how good my skin looked and that I looked healthier. When I first experimented with keto a couple years ago, I sensed a difference in how my brain felt after only a couple of days. However, back then I didn&#8217;t stay strict with the diet for more than a week, and I didn&#8217;t have any of the other benefits that I have experienced this time around at the 6-week mark.</p><p>I was also running a lot during this time (see Figure 1 above), so you may think that all these benefits were from the running instead of the diet. But I didn&#8217;t have these benefits when I trained for the 50k last year on a high-carb diet. And it&#8217;s been about 7 weeks since the race, and I&#8217;ve run maybe 10 miles total, but I still feel the same benefits and am still doing keto.</p><p>I started many of my training runs fasted. On the trail I drank water and sugar-free electrolytes and ate high-fat homemade trail-mix containing nuts, MCT oil, basil seeds, and 78% dark chocolate. I added in the dark chocolate after a particularly hot, steep training day when I was squarely in Zone 3 and got dizzy on the uphill climb. 
I suspect that the rate of gluconeogenesis was not fast enough to prevent hypoglycemia, so my brain started shutting things down. Of course that&#8217;s speculation; I didn&#8217;t have a way of proving it. Adding in a couple grams of sugar seemed to help on the subsequent training days that were particularly hot.</p><p>Now that my subjective experience is out of the way, I&#8217;m going to share some of the things I&#8217;ve learned while researching the ketogenic diet and some of its potential benefits for endurance sports. Fair warning: the next section is technical and contains a lot of jargon from nutritional biochemistry. If you don&#8217;t care about this, feel free to jump to <em>Race Day</em> below where I discuss how the actual race went. If you do care, well, down the rabbit hole we go&#8230;</p><h2>Fundamentals of ATP Production</h2><p>ATP is the currency of life. If our mitochondria stop making ATP<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-4" href="#footnote-4" target="_self">4</a>, we die within seconds. ATP is made via two major metabolic pathways: glycolysis, which burns glucose, and lipolysis followed by &#946;-oxidation, which burns fat. How do our mitochondria decide which pathway to use? Insulin, mostly<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-5" href="#footnote-5" target="_self">5</a>. There are other hormones at play as well, but insulin is the one we can most directly affect through the food we eat. Insulin rises when we eat food with a moderate to high amount of net carbohydrates. Insulin drops when we fast or consistently eat food with a low amount of net carbohydrates. Whenever insulin is consistently low, the first thing our body does is release glucagon, which is a hormone that tells the liver to release its glycogen stores as glucose back into the bloodstream. 
The problem is the liver can only store about 12 to 24 hours&#8217; worth of energy in the form of glycogen, depending on your activity level.</p><p>So what happens when we are not able to replenish glucose and glycogen through diet? Maybe it&#8217;s winter and the berry stock has long been exhausted. Or we live in an Arctic zone where no plant life exists in the first place.</p><p>Thankfully we don&#8217;t just keel over. Instead, our body starts to leverage the abundant energy stored in fat cells (even a lean person can store ~20x more energy in fat than they can in their glycogen stores<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-6" href="#footnote-6" target="_self">6</a>) through a process called lipolysis. Lipolysis hydrolyzes the triglycerides in our adipose tissue into glycerol and free fatty acids (FFAs). FFAs are carried into the bloodstream and used by most of the cells and organs in our body. The FFAs are converted by &#946;-oxidation to Acetyl-CoA, which then enters the Krebs cycle.</p><p>There is one major exception to using FFAs for fuel: the brain. The brain chooses not to oxidize FFAs for ATP<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-7" href="#footnote-7" target="_self">7</a>, which is unfortunate, since the brain accounts for around 20% of our basal metabolic demand. As far as I can tell, the reason behind the brain eschewing FFA oxidation is currently unsettled<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-8" href="#footnote-8" target="_self">8</a>. So what does the brain use when there isn&#8217;t enough glucose and it doesn&#8217;t use FFAs? 
In this case, ATP levels are low, so the ratio of AMP to ATP increases, which activates an enzyme called <a href="https://en.wikipedia.org/wiki/AMP-activated_protein_kinase">AMP kinase</a> (AMPK<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-9" href="#footnote-9" target="_self">9</a>). In turn, AMPK initiates the process of ketogenesis, the production of ketone bodies in the liver. Ketones enter the bloodstream, eventually reaching the mitochondria in our brain (and other tissues). The receiving mitochondria convert the ketone body &#946;-hydroxybutyrate (&#946;-OHB) to Acetyl-CoA, giving it another entry point into the Krebs Cycle. Since &#946;-OHB<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-10" href="#footnote-10" target="_self">10</a> is derived from triglycerides, this means our brain and other cells can generate ATP from fat, allowing us to survive for days to weeks to months, depending on how much fat we have stored up.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!1rc9!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F17d551ce-1e4b-41f1-9e47-c15606d3be12_674x592.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!1rc9!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F17d551ce-1e4b-41f1-9e47-c15606d3be12_674x592.png 424w, https://substackcdn.com/image/fetch/$s_!1rc9!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F17d551ce-1e4b-41f1-9e47-c15606d3be12_674x592.png 848w, 
https://substackcdn.com/image/fetch/$s_!1rc9!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F17d551ce-1e4b-41f1-9e47-c15606d3be12_674x592.png 1272w, https://substackcdn.com/image/fetch/$s_!1rc9!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F17d551ce-1e4b-41f1-9e47-c15606d3be12_674x592.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!1rc9!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F17d551ce-1e4b-41f1-9e47-c15606d3be12_674x592.png" width="674" height="592" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/17d551ce-1e4b-41f1-9e47-c15606d3be12_674x592.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:592,&quot;width&quot;:674,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:88334,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://connorjdavis.substack.com/i/182745484?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F17d551ce-1e4b-41f1-9e47-c15606d3be12_674x592.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!1rc9!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F17d551ce-1e4b-41f1-9e47-c15606d3be12_674x592.png 424w, https://substackcdn.com/image/fetch/$s_!1rc9!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F17d551ce-1e4b-41f1-9e47-c15606d3be12_674x592.png 
848w, https://substackcdn.com/image/fetch/$s_!1rc9!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F17d551ce-1e4b-41f1-9e47-c15606d3be12_674x592.png 1272w, https://substackcdn.com/image/fetch/$s_!1rc9!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F17d551ce-1e4b-41f1-9e47-c15606d3be12_674x592.png 1456w" sizes="100vw" loading="lazy"></picture></div></a><figcaption class="image-caption">Figure 2: Glucose and &#946;-OHB pathways into the Krebs Cycle. 
Source: <a href="https://www.researchgate.net/profile/Denis-Barry/publication/324687999_The_ketogenic_diet_in_disease_and_development/links/5da5a21e299bf116fea9154a/The-ketogenic-diet-in-disease-and-development.pdf">Barry et al., 2018</a></figcaption></figure></div><p>However, the brain still demands some amount of glucose in the bloodstream, otherwise we become hypoglycemic. That is why, if you check your blood while fasted, your glucose will still be on the low end of the normal range even when your &#946;-OHB levels are in the 1-3 millimolar range. How could this be if you haven&#8217;t eaten a single carbohydrate in days? The glucose is created from other molecules in the body through <em>gluconeogenesis</em>. Remember the glycerol backbone from the hydrolyzed triglycerides? Gluconeogenesis recycles this glycerol back into glucose. Lactate, pyruvate, and amino acids are other substrates for gluconeogenesis.</p><p>A key point: when you first start inducing ketosis, it takes time for cells to adapt to the new fuel source. Ketones enter cells via monocarboxylate transporters (MCTs), which are upregulated via AMPK<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-11" href="#footnote-11" target="_self">11</a>. 
This adaptation can take 4 to 6 weeks to be fully expressed in cell membranes.<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-12" href="#footnote-12" target="_self">12</a></p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!cgN8!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6fdab56e-725a-4ebe-a5cb-b122cadb54e0_1024x559.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!cgN8!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6fdab56e-725a-4ebe-a5cb-b122cadb54e0_1024x559.png 424w, https://substackcdn.com/image/fetch/$s_!cgN8!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6fdab56e-725a-4ebe-a5cb-b122cadb54e0_1024x559.png 848w, https://substackcdn.com/image/fetch/$s_!cgN8!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6fdab56e-725a-4ebe-a5cb-b122cadb54e0_1024x559.png 1272w, https://substackcdn.com/image/fetch/$s_!cgN8!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6fdab56e-725a-4ebe-a5cb-b122cadb54e0_1024x559.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!cgN8!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6fdab56e-725a-4ebe-a5cb-b122cadb54e0_1024x559.png" width="1024" height="559" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/6fdab56e-725a-4ebe-a5cb-b122cadb54e0_1024x559.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:559,&quot;width&quot;:1024,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:454637,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://connorjdavis.substack.com/i/182745484?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6fdab56e-725a-4ebe-a5cb-b122cadb54e0_1024x559.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!cgN8!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6fdab56e-725a-4ebe-a5cb-b122cadb54e0_1024x559.png 424w, https://substackcdn.com/image/fetch/$s_!cgN8!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6fdab56e-725a-4ebe-a5cb-b122cadb54e0_1024x559.png 848w, https://substackcdn.com/image/fetch/$s_!cgN8!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6fdab56e-725a-4ebe-a5cb-b122cadb54e0_1024x559.png 1272w, https://substackcdn.com/image/fetch/$s_!cgN8!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6fdab56e-725a-4ebe-a5cb-b122cadb54e0_1024x559.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" 
height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">Figure 3: Ketogenesis Pathways. Source: <a href="https://bmcmedicine.biomedcentral.com/articles/10.1186/s12916-021-02185-0">Kolb H. et al., 2021</a></figcaption></figure></div><p>So, we&#8217;ve refreshed our memories on ATP production. We know the two primary paths for producing ATP. And we know the basics of ketogenesis. Here&#8217;s the question: does ketogenic metabolism provide benefits to endurance athletes relative to glycolytic metabolism?</p><p>It&#8217;s a tough question. There are many dimensions to ultra endurance running for which benefit could be provided or not. I haven&#8217;t attempted to go through all of them. 
Instead I&#8217;ll focus on one area in particular: oxidative stress.</p><h3>Managing Oxidative Stress</h3><p>Whenever we exercise, our mitochondria produce large amounts of <a href="https://en.wikipedia.org/wiki/Reactive_oxygen_species">reactive oxygen species</a> (ROS). ROS are highly reactive oxygen-containing molecules and radicals that increase inflammation, denature surrounding proteins, and cause cellular damage. Among the primary victims of ROS are <a href="https://en.wikipedia.org/wiki/List_of_unsaturated_fatty_acids#Arachidonic_acid">polyunsaturated fatty acids</a> (PUFA)<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-13" href="#footnote-13" target="_self">13</a>. PUFA are ubiquitous in human tissues and are found in cellular phospholipid membranes<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-14" href="#footnote-14" target="_self">14</a>.</p><p><a href="https://www.amazon.com/Art-Science-Low-Carbohydrate-Performance/dp/0983490716">The Art and Science of Low Carbohydrate Performance</a> proposes that degradation of membrane PUFA by unquenched reactive oxygen species may be a key driver in long recovery times; the more muscle cells that are damaged, the longer it takes to heal. The book estimates that a runner in a 100-mile race would consume roughly 7 lbs (about 3,175 g) of oxygen. Assuming a ROS production rate of 2%, that works out to around 63 grams of ROS produced over the course of the race. That is enough ROS to degrade 3x the PUFA content in a runner&#8217;s legs. The authors also hypothesize that PUFA degradation in the gut could be a cause of the gastrointestinal distress commonly encountered in ultra marathons.</p><p>ROS can also inhibit blood flow. Specifically, the enzyme myeloperoxidase (MPO) is abundant in immune cells and generates ROS from H2O2<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-15" href="#footnote-15" target="_self">15</a>. 
The MPO in activated immune cells binds with nitric oxide, reducing the bioavailability of nitric oxide to cells in the endothelium<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-16" href="#footnote-16" target="_self">16</a>. Nitric oxide is a potent vasodilator, so reducing it leads to endothelial dysfunction, vasoconstriction, and poor blood flow (which is probably not something you want during prolonged exercise). Unfortunately, exercise is not the only source of oxidative stress; ROS are a byproduct of metabolism in general. And there is evidence that the particular pathway taken to make ATP can greatly influence the amount of oxidative stress.</p><h3>Glycolysis and Oxidative Stress</h3><p>Excessive carbohydrate consumption leads to oxidative stress. Figure 4 below is from a study<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-17" href="#footnote-17" target="_self">17</a> that reviewed various ROS-generating pathways in animal models of hyperglycemia in the context of diabetic neuropathy:</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!2hb8!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe1922437-217e-4bda-b0e8-f12fd5e2f4ef_1024x717.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!2hb8!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe1922437-217e-4bda-b0e8-f12fd5e2f4ef_1024x717.png 424w, 
https://substackcdn.com/image/fetch/$s_!2hb8!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe1922437-217e-4bda-b0e8-f12fd5e2f4ef_1024x717.png 848w, https://substackcdn.com/image/fetch/$s_!2hb8!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe1922437-217e-4bda-b0e8-f12fd5e2f4ef_1024x717.png 1272w, https://substackcdn.com/image/fetch/$s_!2hb8!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe1922437-217e-4bda-b0e8-f12fd5e2f4ef_1024x717.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!2hb8!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe1922437-217e-4bda-b0e8-f12fd5e2f4ef_1024x717.png" width="1024" height="717" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/e1922437-217e-4bda-b0e8-f12fd5e2f4ef_1024x717.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:717,&quot;width&quot;:1024,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:331583,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://connorjdavis.substack.com/i/182745484?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe1922437-217e-4bda-b0e8-f12fd5e2f4ef_1024x717.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" 
srcset="https://substackcdn.com/image/fetch/$s_!2hb8!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe1922437-217e-4bda-b0e8-f12fd5e2f4ef_1024x717.png 424w, https://substackcdn.com/image/fetch/$s_!2hb8!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe1922437-217e-4bda-b0e8-f12fd5e2f4ef_1024x717.png 848w, https://substackcdn.com/image/fetch/$s_!2hb8!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe1922437-217e-4bda-b0e8-f12fd5e2f4ef_1024x717.png 1272w, https://substackcdn.com/image/fetch/$s_!2hb8!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe1922437-217e-4bda-b0e8-f12fd5e2f4ef_1024x717.png 1456w" sizes="100vw" loading="lazy"></picture></div></a><figcaption class="image-caption">Figure 4: Glycolytic pathways to oxidative stress during hyperglycemia. Source: <a href="https://pmc.ncbi.nlm.nih.gov/articles/PMC4239697/">Figueroa-Romero C. et al.</a>, 2008</figcaption></figure></div><p>The advanced glycation end-products (AGE) pathway is what you get when you combine high levels of dietary sugar with the proteins and fats in cells<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-18" href="#footnote-18" target="_self">18</a>. Glucose and fructose metabolism creates reactive metabolites such as glyoxal and methylglyoxal. These metabolites react with proteins and lipids, forming AGEs, which disrupt the normal function of cells. Methylglyoxal specifically has been shown in animal models to increase oxidative stress through increased production of superoxide and hydrogen peroxide, as well as reduced antioxidant capacity via NADPH depletion in the endothelium, kidneys, and brain<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-19" href="#footnote-19" target="_self">19</a>.</p><p>AGEs also lead to the expression of a receptor for AGEs (called RAGE) on cell surfaces. RAGE activation is <a href="https://youtu.be/VM8TY_FCm-Y">implicated in cardiovascular disease and obesity</a> and also up-regulates nuclear factor kappa B (NF-&#954;B)<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-20" href="#footnote-20" target="_self">20</a>. 
Chronic NF-&#954;B activation has been shown to drive a positive feedback loop between pro-inflammatory response and production of ROS, leading to blood vessel and neuronal damage<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-21" href="#footnote-21" target="_self">21</a>.</p><p>Over-stimulation of the polyol pathway is another source of increased ROS in hyperglycemic conditions. This pathway reduces glucose to sorbitol, which is then oxidized to fructose (<a href="https://youtu.be/dBnniua6-oM">fructose itself produces 7x more AGEs than glucos</a>e; <a href="https://youtu.be/VM8TY_FCm-Y">methylglyoxal produces 250x more AGEs than glucose</a>), depleting NADPH in the process. NADPH provides the reducing power required for regeneration of glutathione, a key antioxidant, so depleting it leads to increased levels of unquenched ROS<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-22" href="#footnote-22" target="_self">22</a>.</p><h3>Ketosis and Oxidative Stress</h3><p>Many keto pundits claim that ketones are a &#8220;cleaner burning fuel&#8221; than sugar without going into detail on what &#8220;cleaner burning&#8221; actually means from a metabolic perspective. I did some digging to try to answer this question, and based on what I&#8217;ve learned, I think it is reasonable to interpret a &#8220;cleaner burning fuel&#8221; as one that produces fewer reactive oxygen species and less inflammation.</p><p><a href="https://bmcmedicine.biomedcentral.com/articles/10.1186/s12916-021-02185-0">This review</a> by Kolb et al. 
discusses the interesting possibility of ketone metabolism being initially <em>pro</em>-oxidative stress, followed by a long-term adaptation that is <em>anti</em>-oxidative stress.</p><p>It looked at animal studies<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-23" href="#footnote-23" target="_self">23</a> and in-vitro human studies<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-24" href="#footnote-24" target="_self">24</a> that showed increasing acetoacetate, one of the three ketone bodies, increased mitochondrial ROS. And increasing both acetoacetate and &#946;-OHB increased NADPH oxidase activity, resulting in less NADPH available for glutathione production. The review then highlights seemingly contradictory studies that show up-regulation of anti-oxidant and anti-inflammatory defenses by ketone bodies. For example<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-25" href="#footnote-25" target="_self">25</a>, an (in-vitro human / in-vivo mouse) study showed that increased &#946;-OHB led to increased FOXO3a activity. 
FOXO3a is a genetic transcription factor that leads to increased expression of antioxidant enzymes: superoxide dismutase 2 (SOD2) and catalase<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-26" href="#footnote-26" target="_self">26</a>.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!eZ6t!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F00090a8b-e3a0-422f-8a2b-902bd1f533c4_864x1024.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!eZ6t!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F00090a8b-e3a0-422f-8a2b-902bd1f533c4_864x1024.png 424w, https://substackcdn.com/image/fetch/$s_!eZ6t!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F00090a8b-e3a0-422f-8a2b-902bd1f533c4_864x1024.png 848w, https://substackcdn.com/image/fetch/$s_!eZ6t!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F00090a8b-e3a0-422f-8a2b-902bd1f533c4_864x1024.png 1272w, https://substackcdn.com/image/fetch/$s_!eZ6t!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F00090a8b-e3a0-422f-8a2b-902bd1f533c4_864x1024.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!eZ6t!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F00090a8b-e3a0-422f-8a2b-902bd1f533c4_864x1024.png" width="864" height="1024" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/00090a8b-e3a0-422f-8a2b-902bd1f533c4_864x1024.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1024,&quot;width&quot;:864,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:544710,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://connorjdavis.substack.com/i/182745484?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F00090a8b-e3a0-422f-8a2b-902bd1f533c4_864x1024.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!eZ6t!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F00090a8b-e3a0-422f-8a2b-902bd1f533c4_864x1024.png 424w, https://substackcdn.com/image/fetch/$s_!eZ6t!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F00090a8b-e3a0-422f-8a2b-902bd1f533c4_864x1024.png 848w, https://substackcdn.com/image/fetch/$s_!eZ6t!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F00090a8b-e3a0-422f-8a2b-902bd1f533c4_864x1024.png 1272w, https://substackcdn.com/image/fetch/$s_!eZ6t!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F00090a8b-e3a0-422f-8a2b-902bd1f533c4_864x1024.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" 
height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">Figure 5: Hormetic response time to ketogenic diet. Source: <a href="https://pmc.ncbi.nlm.nih.gov/articles/PMC3322307/">Milder J. and Patel M., 2013</a></figcaption></figure></div><p>The review resolved this contradiction by analyzing the time allowed for adaptation in each study and found it takes 48 hours to several weeks for the complete expression of anti-oxidant defenses. These defenses include up-regulation of Nrf2 and SIRT3. Nrf2 decreases NF-&#954;B expression, enhances SOD2 activity, and decreases MPO activity, while SIRT3 increases NADPH availability. 
In simpler terms, after several weeks, inflammation decreased, ROS decreased, and anti-oxidant capacity increased<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-27" href="#footnote-27" target="_self">27</a>.</p><p>What about those AGEs from glucose and fructose metabolism? We know hyperglycemia contributes to excessive AGE formation, but does this mean that ketogenic metabolism produces fewer AGEs overall? One place to look is HbA1c, aka <a href="https://en.wikipedia.org/wiki/Glycated_hemoglobin">glycated hemoglobin</a>. Technically, HbA1c is an early glycation end-product, but Turk et al. showed a significant positive correlation between HbA1c and hemoglobin-related AGEs (Hb-AGE)<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-28" href="#footnote-28" target="_self">28</a> in diabetics. A meta-analysis from 2020<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-29" href="#footnote-29" target="_self">29</a> analyzed the effect of the ketogenic diet on HbA1c in humans with Type-II diabetes. The analysis spanned 13 studies with a total sample size of n=567 people, and eight of the studies (n=422) measured HbA1c before and after treatment with the ketogenic diet.
The plot below shows the average change in HbA1c across the eight studies:</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!UANB!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb4d66559-fc12-44bf-95f4-0217489afcdf_640x480.png"><div class="image2-inset"><picture><img src="https://substackcdn.com/image/fetch/$s_!UANB!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb4d66559-fc12-44bf-95f4-0217489afcdf_640x480.png" width="640" height="480" class="sizing-normal" alt="" loading="lazy"></picture></div></a><figcaption class="image-caption">Figure 6: HbA1c change vs Ketogenic Diet Duration. Data source: <a href="https://www.nature.com/articles/s41387-020-00142-z.pdf">Yuan X. et al, 2020</a>. Code for making the plot is on <a href="https://github.com/cjams/research-plots/blob/main/make-plot.py">my Github</a>.</figcaption></figure></div><p>As you can see, each study found a decrease in HbA1c. The average decrease across the eight studies was -1.07%, which is significant, given that a 1% change can mean the difference between normal and diabetic levels <a href="https://diabetes.org/about-diabetes/a1c">according to the ADA</a>. Now there are some limitations to applying these studies to determine if ketogenic metabolism produces fewer glycated end-products <em>relative to</em> glycolytic metabolism.
The first is that three of the studies (the blue bars above) had no control group with an alternative diet, so we can&#8217;t make a comparison for those cases. The studies also had varying methods to measure adherence to the diet. One had daily &#946;-OHB measurements, another had weekly &#946;-OHB measurements, and one relied on self-reporting of what the participants had eaten. Caloric intake also wasn&#8217;t controlled for; one study was hypocaloric, while others had no caloric restriction at all. Additionally, the type and number of diabetic medications varied across studies and participants. Finally, the sample sizes were small and limited only to Type-II diabetics, with the largest study having only 238 participants. Given these limitations, we can&#8217;t use these studies to logically conclude that ketogenic diets cause a lower HbA1c relative to high-carb diets. However, there does appear to be a correlation, at least in diabetics.</p><p>So, in review, we have some evidence that the ketogenic diet may reduce oxidative stress through various mechanisms, including down-regulation of inflammation, up-regulation of anti-oxidant defenses, and decreased AGEs. Does this really prove that the ketogenic diet produces less oxidative stress than a high-carb diet? No. First of all, my review here covered only a small subset of the research. It was by no means an exhaustive literature review. Second, many of the studies that I did review had uncontrolled, potentially confounding variables that make it impossible to establish causation.</p><p>That said, the data at least show some interesting correlations. Enough to make me curious whether I would perceive any of these potential benefits throughout my training and the race itself. Subjectively, my recovery times on longer (15+ mile) runs were probably around a couple of days, mostly limited by HRV instead of muscle soreness or fatigue.
Unfortunately, I don&#8217;t have a &#8220;control&#8221; to compare with, since the training was different from the 50k training I did the year before.</p><p>Alright, enough of the ketogenic tangent. This post is about the Ouray 50, remember? Next, I&#8217;ll describe the race itself and some of the things I learned.</p><h2>Race Day</h2><p>My nerves started ramping up about a week out from the race. I didn&#8217;t sleep well the entire week. By the time race day finally came, I was both excited and ready to get the damn thing over with. I ate my normal breakfast of bacon and eggs with some coffee and got to Fellin Park around 11AM to check in. The weather was clear with a <a href="https://www.wunderground.com/history/daily/KTEX/date/2024-9-14">high of around 73 degrees</a>. This doesn&#8217;t seem that hot on paper, but at elevation it was, especially given that the first two climbs were on southerly aspects with sparse tree cover.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!-GDc!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0fd7b47c-2aba-4055-a7c9-5e86e7f27cbf_718x1024.png"><div class="image2-inset"><picture><img src="https://substackcdn.com/image/fetch/$s_!-GDc!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0fd7b47c-2aba-4055-a7c9-5e86e7f27cbf_718x1024.png" width="598" class="sizing-normal" alt="" loading="lazy"></picture></div></a><figcaption class="image-caption">Figure 7: Ouray 50 Course. Source: https://www.ouray100.com/map-elevation-50-miler</figcaption></figure></div><h3>Fellin Park to Alpine Mine Overlook</h3><p>The race started at 12PM at Fellin Park in Ouray. The course proceeded south and counterclockwise along the Ouray Perimeter Trail until it reached the intersection with Camp Bird Rd on the south rim. It then ascended a few miles to Weehawken trail, which leads to Alpine Mine Overlook.</p><p>I came out way too fast due to a combination of pent-up anxiety and inexperience. I didn&#8217;t have a strict game plan for heart rate or pacing. As you can see from the heart rate data below, I was anaerobic for over <em>three hours</em> at the beginning of the race. Despite this, I felt pretty good going up and down Weehawken. However, my body and my mental state took a nosedive as I started the third climb up Hayden Mountain.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!Ib4C!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F51d6ec43-dadf-4e00-acb1-05eb91f9a9fc_1007x905.png"><div class="image2-inset"><picture><img src="https://substackcdn.com/image/fetch/$s_!Ib4C!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F51d6ec43-dadf-4e00-acb1-05eb91f9a9fc_1007x905.png" width="534" class="sizing-normal" alt="" loading="lazy"></picture></div></a><figcaption class="image-caption">Figure 8: My heart rate data from the race. Source: my heart (and my Coros)</figcaption></figure></div><h3>Hayden to Crystal Lake</h3><p>The crux of the race for me was the Hayden ascent. I started up Hayden trail around 2:45PM, the hottest part of the day. I felt dizzy and lightheaded. Luckily, Hayden offered a bit more tree cover than Weehawken, especially at the beginning; otherwise I think I would have passed out. I started eating blueberries and my homemade trail mix, but it was too little, too late. At this point, I was struggling to keep my heart rate below 170, and I had to completely stop roughly 10 times on the way up. The symptoms were similar to what I assume was hypoglycemia on particularly hot training days, but more intense and persistent. About halfway up Hayden, around mile 10, my body refused to swallow any more of the high-fat trail mix I had made. It felt as if my brain was shutting down my digestive tract in order to preserve vital organs or something. I didn&#8217;t need to vomit; I just literally couldn&#8217;t swallow the food.</p><blockquote><p>Everyone has a plan until they get punched in the mouth &#8211; Mike Tyson</p></blockquote><p>Suffice it to say, I had been punched squarely, unceremoniously, right in the mouth. And it was only mile 10! I hit my mental low. My stream of consciousness was filled with negative self-talk: &#8220;How could I have been so stupid to come out so fast? I&#8217;ve ruined months of training. I look like a fool! I&#8217;m trying to run 50 miles on the keto diet? I can hear the mockery now. My family came all this way, just to see me fail.&#8221;</p><p>Slowly, I recovered. As I stumbled up the loose, ball-bearing-filled switchbacks of Hayden trail, I remembered what I had (serendipitously?)
read a few weeks before the race from <a href="https://connorjdavis.substack.com/p/book-review-mans-search-for-meaning">Man&#8217;s Search for Meaning</a>. <em>The meaning of life is not what you get out of life, but what life gets out of you.</em> It&#8217;s in how you show up and do the work, especially when it gets hard and you feel like quitting. My mental milieu became a weird mixture of positivity and self-loathing. I decided in that moment that there would be only two outcomes: either I miss a cutoff and am required to drop, or I finish. There would be no quitting.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!2uPS!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa98171d6-3022-489c-94a4-06ee2ea3a439_768x1024.jpeg"><div class="image2-inset"><picture><img src="https://substackcdn.com/image/fetch/$s_!2uPS!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa98171d6-3022-489c-94a4-06ee2ea3a439_768x1024.jpeg" width="454" class="sizing-normal" alt="" loading="lazy"></picture></div></a><figcaption class="image-caption">Figure 9: My dog and I ascending Hayden during a training run, right before ball-bearing alley.</figcaption></figure></div><p>I reached Crystal Lake with 20 minutes to spare before cutoff. Crystal Lake was the first aid station that allowed crews and drop bags. My family and friends were there, giving me encouragement to keep pushing.
Their presence was energizing. I felt so grateful that they were there, and that they would take time out of their busy lives to see me voluntarily struggle up and down mountains. One of them got me some bacon to eat, but my body wouldn&#8217;t have it. The only things my brain allowed down my throat were fruit, water, and electrolyte pills. And this remained true for the rest of the race. My plan to consume high-fat nutrition had completely dissolved against the reality of my body. Thankfully, I had plenty of fat stores and was four months into fat and ketone adaptation. Now I just needed to get and stay aerobic so I could start mobilizing the fat in volume and spare the residual glucose for the brain.</p><h3>Crystal Lake to Fellin Park</h3><p>With about 10 minutes to spare before cutoff, I started back up Hayden trail, this time from the Crystal Lake side towards Fellin Park. I was still moving slowly and was near the back of the pack, but I felt much better after the short rest at the aid station. It was dusk by the time I got to the black scree at the top of Hayden ridge. When I stopped to put my headlamp on, the temperature had dropped about 20 degrees and my heart rate was in zone 2. I started the long descent down Hayden to Fellin Park. My mind was clear, and I could finally keep a steady pace without blowing up my heart rate.</p><p>About a mile away from Fellin Park aid station, there was a small tributary to the <a href="https://en.wikipedia.org/wiki/Uncompahgre_River">Uncompahgre River</a> that crossed the course. I took off my running vest and did a cold plunge for a couple of minutes. I could slowly feel my legs going numb as the icy-cold water flowed over them. In that moment, sitting in the freezing water, I was overcome with an intense rush of euphoria and heightened awareness of my connection with Nature.
The water on the mountain stripped away my egoic concerns about &#8220;racing well&#8221; or how others might perceive my sloppy start to the race and replaced them with a quiet, authoritative awareness of the present moment. An appreciation for the harmony of life, for health, for family and friends.</p><p>When I finally got to Fellin aid station around 9PM, my crew was waiting there for me. They brought me fruit and warm broth, which I promptly devoured as I sat next to one of the space heaters. I shared my Hayden descent and cold plunge experience with them, and they offered more encouragement. Again, I felt so connected with each of them and grateful for their presence. There was a strange tinge to that moment. Despite having known the family and friends around me for so long, it was as if I were <em>seeing</em> them with a new, vibrant clarity. A raw presence that was revealed when the veil of preconceived labels and conditioned mental representations was lifted away from my perception. I&#8217;m still curious about what this sensation was and what caused it. It felt as if everything was salient, similar to moving to a new place or the afterglow of a psychedelic trip.</p><p>After this commune with fruit and family, I was riding high. The race was halfway over, and I was allowed a pacer for the remaining four sections. Despite there still being 25 miles to go, I felt the worst was behind me.</p><h3>The Back Half</h3><p>Under the cool night sky, the running on the back half was rather steady and uneventful. One foot after another, the miles passed by. Up Twin Peaks via Old Twin Peaks trail, and down to Silvershield with only 10 minutes to spare before cutoff. I arrived back at Fellin Park early Sunday morning, around 1am, and most of my crew were asleep. I ate some more fruit and broth, then picked up another pacer to check off Chief Ouray Mine. Once again back at Fellin, more fruit, more broth.
As I was preparing to head out to the final leg up Bridge of Heaven, a few racers crossed the finish line. I was happy for them but also jealous. I wanted to be done. My body felt good, but I was starting to get sleepy and lose mental acuity. However, I had done 40 miles and wasn&#8217;t about to drop out on the last leg. Cutoff or finish, remember?</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!PFgy!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffbbf6bc8-897d-4c99-9cc0-525bf8976d8c_768x1024.jpeg"><div class="image2-inset"><picture><img src="https://substackcdn.com/image/fetch/$s_!PFgy!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffbbf6bc8-897d-4c99-9cc0-525bf8976d8c_768x1024.jpeg" width="482" class="sizing-normal" alt="" loading="lazy"></picture></div></a><figcaption class="image-caption">Figure 10: Me metronoming at the top of Bridge of Heaven.</figcaption></figure></div><p>The race organizers really saved the best for last, when you&#8217;re most tired, just to make sure they squeeze every last drop out of you. Bridge of Heaven ascends around 5000 feet over 5 miles, for 10 miles total on the out and back<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-30" href="#footnote-30" target="_self">30</a>. My girlfriend paced me. We set out around 5am up the trail. This one was weird because the sun came up about an hour in. The realization that I had been running all night and most of the previous day seemed to place a dense layer of grogginess over my brain.
My girlfriend tried to kindle conversation, but the most I could manage in return was short, unthoughtful responses.</p><p>Eventually I turned into a metronome. It felt like my body was on autopilot, one foot after the other, driven by some low frequency clock. My mind was essentially absent of all thought for several hours, which is an <em>exceedingly</em> rare state for my psyche to be in. The mind had shed all cares and responsibilities save for the orchestration of my vitals and leg muscles.</p><p>Finally, we made it to the top. The last climb was over. After a few minutes of rest and some pictures, we started the last descent towards the finish at Fellin. The descent involved higher frequency metronome, but was otherwise uneventful. I was very sleepy and remember longing for one of life&#8217;s simple pleasures: sitting in a chair. After about an hour, we were down and headed back towards the finish line at Fellin.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!PtiX!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe9c24325-ca68-4747-9ba4-e59f36de8332_464x1024.jpeg" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!PtiX!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe9c24325-ca68-4747-9ba4-e59f36de8332_464x1024.jpeg 424w, https://substackcdn.com/image/fetch/$s_!PtiX!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe9c24325-ca68-4747-9ba4-e59f36de8332_464x1024.jpeg 848w, 
https://substackcdn.com/image/fetch/$s_!PtiX!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe9c24325-ca68-4747-9ba4-e59f36de8332_464x1024.jpeg 1272w, https://substackcdn.com/image/fetch/$s_!PtiX!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe9c24325-ca68-4747-9ba4-e59f36de8332_464x1024.jpeg 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!PtiX!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe9c24325-ca68-4747-9ba4-e59f36de8332_464x1024.jpeg" width="384" height="847.448275862069" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/e9c24325-ca68-4747-9ba4-e59f36de8332_464x1024.jpeg&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1024,&quot;width&quot;:464,&quot;resizeWidth&quot;:384,&quot;bytes&quot;:91006,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/jpeg&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://connorjdavis.substack.com/i/182745484?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe9c24325-ca68-4747-9ba4-e59f36de8332_464x1024.jpeg&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!PtiX!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe9c24325-ca68-4747-9ba4-e59f36de8332_464x1024.jpeg 424w, 
https://substackcdn.com/image/fetch/$s_!PtiX!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe9c24325-ca68-4747-9ba4-e59f36de8332_464x1024.jpeg 848w, https://substackcdn.com/image/fetch/$s_!PtiX!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe9c24325-ca68-4747-9ba4-e59f36de8332_464x1024.jpeg 1272w, https://substackcdn.com/image/fetch/$s_!PtiX!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe9c24325-ca68-4747-9ba4-e59f36de8332_464x1024.jpeg 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" 
y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">Figure 11: My final race data</figcaption></figure></div><p>I crossed the finish line with about an hour to spare before cutoff. My family and friends were there to give me hugs and congratulations. After I took some pictures with my new finisher buckle, I finally found a chair in the late morning sun. Sitting in that chair, knowing that I was done, my body and mind began to relax into a state of authoritative fatigue. The kind of fatigue that makes rest non-negotiable. The kind that demands <em>in</em>attention from your environment. I was nodding off in between conversations as the last few runners trickled in through the finish line. Once the podium ceremony was finished, we made our way back to the car. I slept the whole way home.</p><h2>Reflections</h2><p>Peaks and valleys. Tension and resolution. Moments of suffering that required me to go within myself and find a way to just keep going. The Ouray 50 was like a small fractal of life itself. It made me thankful for a healthy, resilient body and for a mind that is able to persevere through hard times and to enjoy the good times. The ups and downs of the race gave me a newfound appreciation of the inherent harmony of life. This harmony is ever present in my life, if only I would zoom out from the individual notes and listen to the whole song.</p><p>Family and friends. I&#8217;m still trying to understand what that &#8220;strange tinge&#8221; was that I felt when I saw them at Fellin aid station. The feeling that I was seeing them for the first time, and yet loved and cared for them deeply. As if a switch had flipped from cognitive priors to the senses. From predictions based on learned mental representations to a keen awareness of them in that moment. 
In <a href="https://www.amazon.com/Change-Your-Mind-Consciousness-Transcendence/dp/1594204225">How to Change Your Mind</a>, Michael Pollan describes a similar phenomenon commonly reported in psychedelic therapy sessions. In Chapter 5, he highlights the theory put forth by Robert Carhart-Harris et al. in their paper, <a href="https://www.frontiersin.org/journals/human-neuroscience/articles/10.3389/fnhum.2014.00020/full">The entropic brain: a theory of conscious states informed by neuroimaging research with psychedelic drugs</a>, that different states of consciousness can be placed on a continuum of entropy. States such as anxiety, addiction, depression, and OCD are low entropy states that form deep grooves of cognition, creating a positive feedback loop with themselves. High entropy states, such as the adult brain on psychedelics or the brain of a young child, exhibit increased connectivity and creativity, as well as a diminished sense of separateness from other people and nature. The proposed mechanism behind this theory is that attenuation of the <a href="https://en.wikipedia.org/wiki/Default_mode_network">default mode network</a> (or the lack of a fully developed one, in the case of children) leads to high entropy states. The DMN is believed to be the physical &#8220;location&#8221; of our sense of self as well as the library of prior experiences, narratives, and labels we use to explain the world. It is believed that reducing DMN activity reduces the sense of self and lifts the veil imposed by the stories we tell about ourselves and others. If this theory is true, could it be that prolonged exercise such as ultra-marathons provides another way to reduce DMN activity? Does ultra-endurance increase the entropy level of the brain? What about other forms of exercise?</p><p>Ketosis. Is it optimal for ultra endurance? I don&#8217;t know. Maybe if you can stay aerobic the entire time. 
You probably don&#8217;t need to buy those ultra-processed sugary goos and gels that people peddle around.</p><p>It is safe to say that I didn&#8217;t eat 10,573 calories worth of fruit, so the idea that fat can be used as the primary fuel source for ultra events seems to hold up.</p><p>I am curious about how exactly that happens. The common narrative is that whenever you eat sugar, you spike your insulin, which locks you out of your own fat stores.</p><p>This must change during exercise in ways that I don&#8217;t fully understand. Perhaps the metabolic demand was so great that the residual glucose was immediately taken up by cells and any insulin spike was extremely short-lived.</p><p>I definitely learned the importance of staying aerobic if you&#8217;re going to use fat for fuel. I ran the first half of the course a couple weeks before the race for a training run and ate the exact same high-fat trail mix that my body simply refused on race day.</p><p>During the training run, the trail mix went down fine, but I also stayed aerobic the entire time. Clearly, going anaerobic for a few hours did not work out so well for me.</p><p>If I ever run a race like this again, my heart rate will be the key variable I watch, especially at the beginning. </p><p>Well, this was a long post. If you&#8217;re interested in some further information on low-carb and ketogenic diets, check out the following sources:</p><ul><li><p>Dr. Peter Attia&#8217;s <a href="https://peterattiamd.com/">blog</a><a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-31" href="#footnote-31" target="_self">31</a></p></li><li><p>Levels <a href="https://www.levels.com/">blog</a></p></li><li><p>Dr. Casey Means&#8217;s <a href="https://www.caseymeans.com/goodenergy">Good Energy</a></p></li><li><p>Dr. 
Robert Lustig&#8217;s <a href="https://www.amazon.com/Metabolical-Processed-Nutrition-Modern-Medicine/dp/0063027712/ref=asc_df_0063027712?tag=bngsmtphsnus-20&amp;hvqmt=e&amp;hvlocint=&amp;psc=1">Metabolical</a></p></li></ul><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-1" href="#footnote-anchor-1" class="footnote-number" contenteditable="false" target="_self">1</a><div class="footnote-content"><p>It&#8217;s actually slightly over 50, but who&#8217;s counting really? </p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-2" href="#footnote-anchor-2" class="footnote-number" contenteditable="false" target="_self">2</a><div class="footnote-content"><p>For some perspective, the <a href="https://www.leadvilleraceseries.com/run/leadvilletrail100run-2/">Leadville 100</a> has &#8220;only&#8221; ~15,700 feet of gain. 
</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-3" href="#footnote-anchor-3" class="footnote-number" contenteditable="false" target="_self">3</a><div class="footnote-content"><p>Here&#8217;s a <a href="https://www.complexity-explorables.org/explorables/echo-chambers">fun example </a>of complexity and echo chambers </p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-4" href="#footnote-anchor-4" class="footnote-number" contenteditable="false" target="_self">4</a><div class="footnote-content"><p>Because we ate a cyanide-laced apple, for example </p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-5" href="#footnote-anchor-5" class="footnote-number" contenteditable="false" target="_self">5</a><div class="footnote-content"><p>Reducing metabolism down to one hormone is obviously an <em>immense</em> simplification. </p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-6" href="#footnote-anchor-6" class="footnote-number" contenteditable="false" target="_self">6</a><div class="footnote-content"><p><a href="https://www.amazon.com/Art-Science-Low-Carbohydrate-Performance/dp/0983490716">The Art and Science of Low Carbohydrate Performance</a></p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-7" href="#footnote-anchor-7" class="footnote-number" contenteditable="false" target="_self">7</a><div class="footnote-content"><p><a href="https://onlinelibrary.wiley.com/doi/10.1002/jnr.490180407">This in-vitro study</a> showed that astrocytes from rats were able to oxidize free fatty acids </p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-8" href="#footnote-anchor-8" class="footnote-number" contenteditable="false" target="_self">8</a><div class="footnote-content"><p><a href="https://www.sciencedirect.com/science/article/abs/pii/S1357272516304058">Romano A. 
et al, Fats for thoughts: An update on brain fatty acid metabolism, 2017</a></p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-9" href="#footnote-anchor-9" class="footnote-number" contenteditable="false" target="_self">9</a><div class="footnote-content"><p>AMPK is also activated by metformin, a widely prescribed Type II diabetes drug that is <a href="https://www.foundmyfitness.com/topics/metformin">thought to have longevity benefits</a>. Perhaps metformin is mimicking a low-carb ketogenic diet? </p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-10" href="#footnote-anchor-10" class="footnote-number" contenteditable="false" target="_self">10</a><div class="footnote-content"><p>&#946;-OHB is also a signaling molecule that stimulates the production of <a href="https://en.wikipedia.org/wiki/Sirtuin_2">SIRT2</a>, which in turn stimulates mitochondrial biogenesis.</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-11" href="#footnote-anchor-11" class="footnote-number" contenteditable="false" target="_self">11</a><div class="footnote-content"><p><a href="https://pmc.ncbi.nlm.nih.gov/articles/PMC7062045/">M Felmlee et al., Monocarboxylate Transporters (SLC16): Function, Regulation, and Role in Health and Disease</a>, 2020</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-12" href="#footnote-anchor-12" class="footnote-number" contenteditable="false" target="_self">12</a><div class="footnote-content"><p><a href="https://www.researchgate.net/profile/Denis-Barry/publication/324687999_The_ketogenic_diet_in_disease_and_development/links/5da5a21e299bf116fea9154a/The-ketogenic-diet-in-disease-and-development.pdf">Barry D. 
et al., The ketogenic diet in disease and development</a>, 2018</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-13" href="#footnote-anchor-13" class="footnote-number" contenteditable="false" target="_self">13</a><div class="footnote-content"><p><a href="https://www.sciencedirect.com/science/article/pii/S221323171300058X">Ho E. et al. Biological markers of oxidative stress: Applications to cardiovascular research and practice</a>, 2013</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-14" href="#footnote-anchor-14" class="footnote-number" contenteditable="false" target="_self">14</a><div class="footnote-content"><p><a href="https://www.sciencedirect.com/science/article/abs/pii/S0163782723000322?via%3Dihub">Kumar S.D. Kothapalli, Hui Gyu Park, Niharika S.L. Kothapalli, J. Thomas Brenna, FADS2 function at the major cancer hotspot 11q13 locus alters fatty acid metabolism in cancer, Progress in Lipid Research</a>, Volume 92, 2023</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-15" href="#footnote-anchor-15" class="footnote-number" contenteditable="false" target="_self">15</a><div class="footnote-content"><p><a href="https://www.sciencedirect.com/science/article/pii/S221323171300058X">Ho E. et al. Biological markers of oxidative stress: Applications to cardiovascular research and practice</a>, 2013</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-16" href="#footnote-anchor-16" class="footnote-number" contenteditable="false" target="_self">16</a><div class="footnote-content"><p><a href="https://www.jbc.org/article/S0021-9258(18)30564-7/pdf">Abu-Soud H. 
and Hazen S., Nitric Oxide Modulates the Catalytic Activity of Myeloperoxidase</a>, 2000</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-17" href="#footnote-anchor-17" class="footnote-number" contenteditable="false" target="_self">17</a><div class="footnote-content"><p><a href="https://pmc.ncbi.nlm.nih.gov/articles/PMC4239697/">Figueroa-Romero C. et al., Mechanisms of disease: the oxidative stress theory of diabetic neuropathy</a></p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-18" href="#footnote-anchor-18" class="footnote-number" contenteditable="false" target="_self">18</a><div class="footnote-content"><p>See also: <a href="https://en.wikipedia.org/wiki/Maillard_reaction">Maillard Reaction</a></p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-19" href="#footnote-anchor-19" class="footnote-number" contenteditable="false" target="_self">19</a><div class="footnote-content"><p><a href="https://sci-hub.ru/10.1002/med.21410">Matafome, P., Rodrigues, T., Sena, C., &amp; Sei&#231;a, R. (2016). Methylglyoxal in Metabolic Disorders: Facts, Myths, and Promises. Medicinal Research Reviews, 37(2), 368&#8211;403. doi:10.1002/med.2141</a></p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-20" href="#footnote-anchor-20" class="footnote-number" contenteditable="false" target="_self">20</a><div class="footnote-content"><p><a href="https://pmc.ncbi.nlm.nih.gov/articles/PMC4239697/">Figueroa-Romero C. et al., Mechanisms of disease: the oxidative stress theory of diabetic neuropathy</a></p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-21" href="#footnote-anchor-21" class="footnote-number" contenteditable="false" target="_self">21</a><div class="footnote-content"><p><a href="https://link.springer.com/article/10.1007/s00401-007-0326-2">Kawamura, N. 
Inflammatory mediators in diabetic and non-diabetic lumbosacral radiculoplexus neuropathy</a>, 2007</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-22" href="#footnote-anchor-22" class="footnote-number" contenteditable="false" target="_self">22</a><div class="footnote-content"><p><a href="https://pmc.ncbi.nlm.nih.gov/articles/PMC4239697/">Figueroa-Romero C. et al., Mechanisms of disease: the oxidative stress theory of diabetic neuropathy</a></p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-23" href="#footnote-anchor-23" class="footnote-number" contenteditable="false" target="_self">23</a><div class="footnote-content"><p><a href="https://jpet.aspetjournals.org/content/310/2/728.short">Abdelmegeed M. et al., Acetoacetate Activation of Extracellular Signal-Regulated Kinase 1/2 and p38 Mitogen-Activated Protein Kinase in Primary Cultured Rat Hepatocytes: Role of Oxidative Stress</a>, 2004</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-24" href="#footnote-anchor-24" class="footnote-number" contenteditable="false" target="_self">24</a><div class="footnote-content"><p><a href="https://d1wqtxts1xzle7.cloudfront.net/48020922/Ketosis_acetoacetate_can_generate_oxyg20160813-20050-t8gjbb-libre.pdf?1471092667=&amp;response-content-disposition=inline%3B+filename%3DKetosis_acetoacetate_can_generate_oxygen.pdf&amp;Expires=1729982276&amp;Signature=Aueg7lX5sVBNhHJ9VwhFadUnDGb2RFi-UOzU3mjC8NO68ecJU6fYVFkxe3sPSL8HlmI1ab2emiEvUL5DUHl4DNU9g-Ur3RmSo8~0CXh3DD2d~qHerdtTCb4hu11rPYmTaXMaQDD7jq2QxQkhBpfWbJ~UwOPP8AZ3REiVwK6nrn5k5P~7kxAA8wtpagTO7ENlHanJm9gPNylv72ES~mGynTWWZlrB6tHMK2ceLGeUz457aBRNMOac7yFOVioEiF3WhVG2ZrPhMFUPBGdCCRl265Ntucc4NX-S2wIVfojQkYzln3tSGmca~LHMbGmTAARUeX7oBoxI7A4F0UOPmN~~tw__&amp;Key-Pair-Id=APKAJLOHF5GGSLRBV4ZA">Jain, S K. 
et al., Ketosis (acetoacetate) can generate oxygen  radicals and cause increased lipid peroxidation and growth inhibition in human endothelial cells</a>, 1998</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-25" href="#footnote-anchor-25" class="footnote-number" contenteditable="false" target="_self">25</a><div class="footnote-content"><p>This is an example of &#946;-OHB acting as a <em>signaling molecule</em> for the epigenome. It is more than just an alternative fuel source. </p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-26" href="#footnote-anchor-26" class="footnote-number" contenteditable="false" target="_self">26</a><div class="footnote-content"><p><a href="https://pmc.ncbi.nlm.nih.gov/articles/PMC3735349/">Shimazu T. et al., Suppression of Oxidative Stress by &#946;-Hydroxybutyrate, an Endogenous Histone Deacetylase Inhibitor</a>, 2012</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-27" href="#footnote-anchor-27" class="footnote-number" contenteditable="false" target="_self">27</a><div class="footnote-content"><p><a href="https://bmcmedicine.biomedcentral.com/articles/10.1186/s12916-021-02185-0#ref-CR23">Kolb, H. et al., Ketone bodies: from enemy to friend and guardian angel</a>, 2021</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-28" href="#footnote-anchor-28" class="footnote-number" contenteditable="false" target="_self">28</a><div class="footnote-content"><p><a href="https://pubmed.ncbi.nlm.nih.gov/9853699/">Turk Z. 
et al., Comparison of advanced glycation endproducts on haemoglobin (Hb-AGE) and haemoglobin A1c for the assessment of diabetic control</a>, 1998</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-29" href="#footnote-anchor-29" class="footnote-number" contenteditable="false" target="_self">29</a><div class="footnote-content"><p><a href="https://www.nature.com/articles/s41387-020-00142-z.pdf">Yuan X. et al., Effect of the ketogenic diet on glycemic control, insulin resistance, and lipid metabolism in patients with T2DM: a systematic review and meta-analysis</a>, 2020</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-30" href="#footnote-anchor-30" class="footnote-number" contenteditable="false" target="_self">30</a><div class="footnote-content"><p>Fun fact: the summit of Bridge of Heaven links up to the north end of <a href="https://www.alltrails.com/trail/us/colorado/engineer-pass-via-bear-creek-trail">Bear Creek National Recreation Trail</a>. 
The Bear Creek to Bridge of Heaven loop was one of my favorite and most difficult training runs </p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-31" href="#footnote-anchor-31" class="footnote-number" contenteditable="false" target="_self">31</a><div class="footnote-content"><p>For a deep dive on the organic chemistry of ketone bodies and ketosis, start with <a href="https://peterattiamd.com/ketosis-advantaged-or-misunderstood-state-part-i/">this series of posts</a></p><p></p></div></div>]]></content:encoded></item><item><title><![CDATA[Book Review: Man's Search for Meaning]]></title><description><![CDATA[Viktor Frankl&#8217;s Man&#8217;s Search for Meaning is a profound exploration of meaning in human life.]]></description><link>https://www.connorjdavis.com/p/book-review-mans-search-for-meaning</link><guid isPermaLink="false">https://www.connorjdavis.com/p/book-review-mans-search-for-meaning</guid><dc:creator><![CDATA[Connor Davis]]></dc:creator><pubDate>Tue, 24 Sep 2024 03:15:00 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!ZbOo!,w_256,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F16ac54db-581c-4d87-897b-1a07019f089d_1280x1280.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>Viktor Frankl&#8217;s <a href="https://www.amazon.com/Mans-Search-Meaning-Viktor-Frankl/dp/0807014273/ref=sr_1_1?dib=eyJ2IjoiMSJ9.phUc2adLzv8edJKcJfYGiYmYQA_wfSFAPr6oNqdjFnYmt7SWN9z7L8MkQHHV_2cQhNp7QM_F6ZT3gYG42c39tLLjzIuf9J8-f3ju0BLqXbV-h-60PPQkwnS0DOSnmzucSxlQzs-d7qj-aNLnNBOIjTVro0BGkj-GaW_2AXdspw-uplditX8KE_0QYVArjUYHEmxwaRx2jimUjDdDnqclVR9rMuvi4QEzkVcawsOWNZA.zV4oyNGi2efHmJhNcH8YEERh0-Ujejx-pxHuzLyx07o&amp;dib_tag=se&amp;hvexpln=67&amp;hvocijid=308217410111159415--&amp;hvqmt=e">Man&#8217;s Search for Meaning</a> is a profound exploration of meaning in human life. 
The book begins with a harrowing account of the author&#8217;s personal experience in Nazi concentration camps. He uses this account to illustrate the distribution of prisoner personas that manifested as a result of the damaging influences of camp. The typical prisoner used apathy to defend themselves. Some used humor. Others were fatally nihilistic. A few were status-seeking people-pleasers trying to win favor with other prisoners and guards.</p><p>Despite this wide variance in personas, every prisoner was in the same environment. What was different was how each prisoner related to their environment internally. The book argues that these differences were caused by different <em>meanings</em> that each prisoner held for their life. Those who were still able to find meaning in their life were the most likely to endure the daily stress of camp. Those who had no meaning, or who lost it, were the most likely to give up. They would refuse to eat, refuse to move, and would lie in their own excrement until they died. This suggests that the answer to the age-old question &#8220;<em>what is the meaning of life?</em>&#8221; is more than just a thought exercise; it is critical to our survival and well-being. Frankl believes that the pursuit of meaning is a fundamental human drive (the &#8220;will to meaning&#8221;), and suggests that meaning, not pleasure (a la Freud), is the primary motivating factor in life.</p><p>Frankl eloquently describes how we can answer &#8220;<em>what is the meaning of life?&#8221;</em> in our own lives. One thing we have to understand is that meaning is personal. It is unique to each individual. Meaning isn&#8217;t something that we can outsource to our culture or other people. It comes from deep within us, from our gut and our heart. Frankl suggests that often when we ask &#8220;<em>what is the meaning of life?</em>&#8220;, we tend to answer in such a way that involves us acquiring or extracting things from life. 
The answer tends to underscore a cultural metaphor that life is a finite resource, ready to be exploited for our own gain and pleasure. To find our meaning, he recommends flipping that perspective around. What matters is not our expectation of life, but <em>what life expects of us</em>.</p><h2>What does life expect of you?</h2><p>The meaning of <em>your</em> life comes from authentically answering that question and orienting your life in pursuit of actualizing the answer. The process of confronting one&#8217;s meaning and orienting one&#8217;s life towards it forms the basis of Frankl&#8217;s <a href="https://en.wikipedia.org/wiki/Logotherapy">logotherapy</a> approach. Logotherapy requires listening to and honoring your gut and heart, which may be at odds with your mind. This inner tension is actually a good sign that you&#8217;ve identified meaning for your life (and reminds me of Resistance acting as a gradient for self-actualization, described by Steven Pressfield in the <a href="https://www.amazon.com/War-Art-Through-Creative-Battles/dp/1936891026">War of Art</a>). Frankl argues that true wholeness and human thriving are dependent upon creating this tension (by confronting one&#8217;s meaning) and then working to resolve it via &#8220;right action&#8221;, i.e., pursuing your meaning through your daily actions.</p><blockquote><p>&#8220;What man needs is not a tensionless state but rather the striving and struggling for a worthwhile goal, a freely chosen task. A call of a potential meaning waiting to be fulfilled by him.&#8221; &#8212; Viktor Frankl</p></blockquote><p>Despite meaning being personal, Frankl suggests there are three general sources of meaning. The first is creation. Humans are especially gifted in that we can create things: solutions to hard problems, art, <a href="https://paulgraham.com/wealth.html">wealth</a>, or other humans. Heeding the instinctual call to create something of lasting value can bring immense meaning to one&#8217;s life. 
</p><p>The second is love. Creating strong relationships with a spouse, friends, and community aligns with our fundamental need for social connection. The meaning in love is about seeing the essential traits of a person and the unrealized potential (their own meaning) which ought to be actualized, and then enabling them to actualize it. </p><p>The third is suffering. Frankl suggests that life can have meaning despite ineffable suffering, as he experienced firsthand in Nazi concentration camps.</p><blockquote><p>&#8220;He who has a why to live for can bear almost any how.&#8221; &#8212; Friedrich Nietzsche</p></blockquote><p>Maintaining your positive character and persevering through adversity are two things that provide meaning during times of suffering. Suffering challenges us to change ourselves; there is no growth without suffering and pain.</p><p>Frankl believes that many of our collective neuroses, e.g., nihilism, anxiety, depression, can be explained in terms of meaning, or rather a lack of meaning in our lives. Whenever we lose meaning orientation in our life, it creates a so-called &#8220;existential vacuum&#8221; that we tend to fill with pleasure- or power<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-1" href="#footnote-1" target="_self">1</a>-seeking behavior. On the other hand, having a strong meaning for your life provides a foundation for your behavior and mental health. When we have a foundational &#8220;why&#8221; for our life, it makes things that require discipline (e.g., regular exercise, consistently eating real food, writing) easier to stick to. And since our meaning comes from within us, it will be intrinsically aligned with our <a href="https://connorjdavis.substack.com/p/core-values">core values</a>. 
This alignment will foster the free expression of our core values through our behavior and thought patterns, a necessary condition for mental health.</p><h2>Personal Thoughts</h2><p>If meaning originates from our true self, then we need a strong connection to our whole self (including our gut instinct) as a prerequisite for finding meaning in our lives. From a <a href="https://connorjdavis.substack.com/p/book-review-the-myth-of-normal">previous post</a>, we know that trauma causes disconnection from our whole self, including our gut instinct. Therefore, finding meaning may require us to first heal trauma and reconnect with our self so that we may hear the full message that our gut tries to send us. </p><p>Once we&#8217;ve identified some potential meaning in our life, we have to act on it. This is the source of the tension that Frankl describes. When our potential meaning requires us to switch careers or end unhealthy relationships (with people or bad information<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-2" href="#footnote-2" target="_self">2</a>), it can demand fundamental changes in our life that are difficult to follow through on. Frankl argues that working to resolve this tension through our actions, i.e., pursuing our life&#8217;s meaning, is a necessary condition for mental health.</p><p>The question &#8220;what does life expect of you?&#8221; is a great frame for thinking about meaning in our lives; however, I think we should be careful not to take it too far. No matter how much you think life &#8220;expects&#8221; of you, you still have to defend your own boundaries and ensure your needs are met. Sacrificing your health or relationships for goals, no matter how ambitious or noble they may be, misses the forest for the trees. You are no good to anyone if you burn yourself out. One thing that can help here is to have a low time preference. Don&#8217;t try to solve every problem in a week. 
Identify opportunities for long-term value creation, and consistently work toward them in a sustainable way.</p><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-1" href="#footnote-anchor-1" class="footnote-number" contenteditable="false" target="_self">1</a><div class="footnote-content"><p>I take Frankl to mean &#8220;power&#8221; in the sense of title acquisition via finite game play. I will be writing about power and title in the context of <a href="https://www.amazon.com/Finite-Infinite-Games-James-Carse/dp/1476731713">finite vs. infinite games</a> in a later post </p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-2" href="#footnote-anchor-2" class="footnote-number" contenteditable="false" target="_self">2</a><div class="footnote-content"><p>Here I mean information in the broadest sense: food, media (social and otherwise), lack of exercise, social interactions, subjects of addiction, etc.</p></div></div>]]></content:encoded></item><item><title><![CDATA[Book Review: The Myth of Normal]]></title><description><![CDATA[I recently finished The Myth of Normal by Dr. 
Gabor Mate]]></description><link>https://www.connorjdavis.com/p/book-review-the-myth-of-normal</link><guid isPermaLink="false">https://www.connorjdavis.com/p/book-review-the-myth-of-normal</guid><dc:creator><![CDATA[Connor Davis]]></dc:creator><pubDate>Wed, 14 Aug 2024 02:22:00 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!tgWd!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fea99c165-2b76-41b5-8f2d-1c10dd47e5f7_1008x642.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>I recently finished <a href="https://www.amazon.com/Myth-Normal-Illness-Healing-Culture/dp/0593083881">The Myth of Normal</a> by Dr. Gabor Mate. The book is a treatise on trauma; it describes at length what trauma is, the causes and effects of trauma, and how to heal from it. The lessons and insights from this book have provided me with a set of mental models and practices that I&#8217;ve already integrated into my life. I recommend it to anyone curious about trauma as a psychological phenomenon, how it relates to the self and culture, and novel approaches to healing. A summary of the key arguments from the book is below, followed by some of my own thoughts and personal experiences.</p><h2>What is Trauma?</h2><p>Trauma is somewhat of a nebulous thing that is hard to concretely define, kind of like love. The book does a great job of breaking it down, first discussing what trauma is not, and then providing a definition and mental model that help the reader understand its nature.</p><p>Trauma is often conflated with stress in everyday language. They share some characteristics, but they aren&#8217;t the same thing. Both stress and trauma may lead to similar physiological effects such as autoimmune disorders and insulin resistance, as well as mental conditions such as anxiety and depression. However, they differ in locus of affliction and duration. 
Stress is a <em>physiological </em>response of the body to what is happening <em>to you</em> in your environment; it has an external locus of affliction and an environment-dependent duration. Fix the environment, and the stress response dissipates.</p><p>On the other hand, trauma is what happens <em>inside you</em>; it is a <em>psychological</em> response to adverse experiences and therefore has an internal locus of affliction. Its duration is environment-independent in the sense that improving the environment does nothing to heal the wound.</p><p>The essence of trauma is a <em>fracturing of the self</em>: a psychological dismemberment into parts that are expressed, which comprise the conditioned personality everyone sees, and parts that are suppressed to mitigate the perceived threat they pose to the <a href="https://en.wikipedia.org/wiki/Attachment_theory">social attachment</a> required for survival. The fracturing occurs subconsciously, altering the architecture of our <a href="https://en.wikipedia.org/wiki/Figure%E2%80%93ground_(perception)">psychological ground</a>.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!tgWd!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fea99c165-2b76-41b5-8f2d-1c10dd47e5f7_1008x642.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!tgWd!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fea99c165-2b76-41b5-8f2d-1c10dd47e5f7_1008x642.png 424w, https://substackcdn.com/image/fetch/$s_!tgWd!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fea99c165-2b76-41b5-8f2d-1c10dd47e5f7_1008x642.png 848w, 
https://substackcdn.com/image/fetch/$s_!tgWd!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fea99c165-2b76-41b5-8f2d-1c10dd47e5f7_1008x642.png 1272w, https://substackcdn.com/image/fetch/$s_!tgWd!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fea99c165-2b76-41b5-8f2d-1c10dd47e5f7_1008x642.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!tgWd!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fea99c165-2b76-41b5-8f2d-1c10dd47e5f7_1008x642.png" width="1008" height="642" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/ea99c165-2b76-41b5-8f2d-1c10dd47e5f7_1008x642.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:642,&quot;width&quot;:1008,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:662690,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:&quot;https://connorjdavis.substack.com/i/182742424?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fea99c165-2b76-41b5-8f2d-1c10dd47e5f7_1008x642.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!tgWd!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fea99c165-2b76-41b5-8f2d-1c10dd47e5f7_1008x642.png 424w, 
https://substackcdn.com/image/fetch/$s_!tgWd!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fea99c165-2b76-41b5-8f2d-1c10dd47e5f7_1008x642.png 848w, https://substackcdn.com/image/fetch/$s_!tgWd!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fea99c165-2b76-41b5-8f2d-1c10dd47e5f7_1008x642.png 1272w, https://substackcdn.com/image/fetch/$s_!tgWd!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fea99c165-2b76-41b5-8f2d-1c10dd47e5f7_1008x642.png 1456w" sizes="100vw" fetchpriority="high"></picture></div></a><figcaption class="image-caption">Figure 1. An example of visual figure-ground by MC Escher. The figure-ground phenomenon also exists in our psyche.</figcaption></figure></div><p>This alteration leaves an imprint on our psyche that is triggered whenever we subconsciously perceive similar adverse situations as adults. Even if we are consciously aware that our survival is not at risk later on, the subconscious imprint may perceive a similar threat to our survival and try to protect us by shutting parts of ourselves down or activating maladaptive behaviors or beliefs.</p><blockquote><p>&#8220;Until you make the unconscious conscious, it will direct your life, and you will call it fate&#8221; &#8212; Carl Jung</p></blockquote><p>Finally, trauma is persistent. Once our psyche is fractured, it remains that way unless and until it is made whole again. In other words, time heals all wounds, except for trauma. Why? Because humans prioritize survival above all else, and trauma is a potent, persistent expression of that survival instinct.</p><h2>What Causes Trauma?</h2><p>We tend to constrain our conception of trauma to things like natural disasters, war, severe neglect, and physical abuse. But restricting the definition to catastrophes would imply that trauma consists only of infrequent external events. Dr. Mate argues that overt tragedies cause only one form of trauma, called &#8220;big-T&#8221; trauma. Big-T trauma occurs when <em>bad things happen</em> to people, things like sexual abuse or the loss of a parent. The other form, called &#8220;small-t&#8221; trauma, occurs when <em>good things don&#8217;t happen</em>, specifically when the core psycho-social needs of a child go unmet. 
These needs of early childhood are: emotional and physical attunement from parents, a sense of unconditional worthiness of that attunement, a sense of safety in expressing authentic emotions, and agenda-free, person-to-person interactive play.</p><p>Dr. Mate argues that the inclusion of small-t trauma alongside big-T trauma suggests that we should not view trauma as a rigid yes/no, but rather as a spectrum that everyone is on. This does not discount the fact that some people endure more trauma than others; rather, it means everyone should honestly look at their own experience and expect to find some evidence of trauma, and that many of our maladaptive behaviors and destructive relationship patterns in adulthood can be traced back to unresolved trauma.</p><h2>What does Trauma Cause?</h2><p>Dr. Mate presents many potential downstream effects of trauma, supported by research and anecdotes from his clinical experience. Some of the effects, if proven to be causal, would be quite alarming and would necessitate a shift in our approach to treating many afflictions that pervade modern life. He provides evidence that many cases of chronic illnesses such as cancer and autoimmunity, as well as mental disorders such as anxiety, depression, and addiction, may have unresolved trauma as a root cause. This seems reasonable: since unresolved trauma persists through time, all of its downstream destructive habits and self-abnegation cause persistent damage to our bodies and minds.</p><h2>How to Heal Trauma?</h2><p>So far this book may seem pretty bleak. Humans have an instinctual response to environments we have zero control over, one that leaves a persistent imprint of subconscious behaviors and thought patterns that not only causes chronic disease, but also passes down to our own children.</p><p>The good news is that we can do something about it.</p><p>We know what trauma is &#8211; a fracturing of the self. 
Healing is a process to restore wholeness of the self: to integrate the repressed emotions and parts of our selves with our conditioned personality so that we may finally live according to our authentic essence.</p><p>So how do we heal? What does it look like? The book describes various things within our control that we can do in order to restore ourselves. Before I review those, though, there are two things described in the book that should be avoided: blame and comparison.</p><p>Blaming someone for our trauma, even if what they did undeniably caused big-capital-T-all-caps-TRAUMA!!, is antithetical to healing, because it implicitly assumes the problem is the <em>person</em> instead of what the person caused to happen <em>inside of us</em>. Healing is not about being right and using that as justification for perpetual victimhood. Placing blame on one person or event ignores the possibility that trauma is systemic in our collective subconscious, and that antisocial behaviors levied upon us by others are themselves the shadow of an accumulation of unresolved trauma in the other person&#8217;s own life. We are not responsible for what other people do to us, but we are responsible for how we relate to their actions.</p><p>Similarly, comparing your trauma to others&#8217; isn&#8217;t helpful for healing either. Just because someone may have had a harder life than you according to some cultural measuring stick doesn&#8217;t invalidate your own harmful experiences. Telling yourself &#8220;I should be happy, I had it pretty good, especially compared to this person over here&#8221; only induces guilt and shame. It is a further rejection of your emotions rather than a reintegration of them. Comparing in the other direction doesn&#8217;t help either. If we think our traumatic experience is so much worse than others&#8217;, it can give rise to jealousy, resentment, self-pity, and even self-aggrandizement. 
None of these mindsets provide any true healing at all.</p><p>After highlighting some things to avoid in the healing process, the book goes on to describe some concepts and techniques that you can use to regain your whole self. One of these concepts is the &#8220;4 A&#8217;s&#8221;. Healing involves the development of each of these 4 A&#8217;s in our life:</p><p>1. Authenticity. In order to heal, we have to start listening to our self and shed our self-abnegating social character. Some social attachments may not survive the authentic expression of ourselves, but new attachments will be made based on who we truly are.</p><p>2. Agency. When we reframe trauma as something that happened inside us, it allows us to let go of the particular events that led to it. This means that we no longer have to be controlled by the subconscious impulses that trauma causes; we can have control over our lives again.</p><p>3. Anger. Healthy anger is expressed in the moment, not repressed down inside us. Healthy anger represents a decisive &#8220;no&#8221;; a boundary that we have to learn to honor.</p><p>4. Acceptance. This means being present in the moment, and not resisting the emotions that may come with it, uncomfortable as they may be.</p><p>Another key concept is <em>awareness</em>. Since trauma takes place subconsciously, you have to bring the wound into conscious awareness. This requires intention, hard work, and a serious exploration of the self. You have to ask yourself hard questions; questions you initially may not want to know the answer to. Dr. Mate provides a few examples that you can use (from his <a href="https://compassionateinquiry.com/">compassionate inquiry</a> process):</p><ul><li><p>What am I not saying &#8220;no&#8221; to that I should be? 
When did I sense a &#8220;no&#8221; but repressed it?</p></li><li><p>How does my inability to say no impact my life?</p></li><li><p>What bodily signals have I been overlooking?</p></li><li><p>What is the hidden story behind my inability to say no?</p></li><li><p>Where did I learn those stories?</p></li><li><p>What am I not saying &#8220;yes&#8221; to that I should be? What have I wanted to do, create, or express, but haven&#8217;t out of fear?</p></li></ul><p>Getting truthful answers to these questions is difficult because of how deeply buried some of these wounds can be in our psyche and how much pain and fear they trigger &#8211; some memories can be so utterly suppressed that we have no conscious recollection of them ever happening. Due to this inherent difficulty, the book recommends both therapy and psychedelics (in a therapeutic setting) to help peel the psychological onion and to guide the exploration and integration of emotions that bubble up into our conscious awareness.</p><p>Despite the systemic nature of trauma in our lives and culture, and the myriad adverse consequences it has on us, the message of healing is hopeful: it&#8217;s not about what happened in the past, but how we relate to our past. If we are willing to do the work, we can resolve our trauma and live the life of wholeness that we are meant to live.</p><h2>Personal Thoughts</h2><p>This book is long, but well worth the time if you are interested in learning about trauma and ways to heal. 
One of the main contributions of the book, in my view, is its conceptualization of what trauma is. I only had a vague notion of what trauma was before, but now I understand it in a way that allows me to identify and label it, which helps bring awareness to trauma and create separation between it and myself.</p><p>Also, while this post tends to focus on the childhood causes of trauma, the book explains that trauma is not restricted to children. The same fracturing can and does happen in adults as well.</p><p>The section on psychedelics resonated with me as well. I have experienced firsthand the &#8220;emulsifying&#8221; effect that psilocybin mushrooms have between the conscious and subconscious. Daily life consists of the same patterns playing out over and over, like traffic that creates deep ruts on a dirt road. These ruts become so deep that turning onto other roads becomes impossible. Psilocybin is like a great flood that erodes the ruts down into a level surface, providing smooth passage to the other roads once again.</p><p>I do disagree with some of the book&#8217;s claims that capitalism causes (or at least perpetuates) trauma. My main contention is that the author doesn&#8217;t actually define what he <em>means</em> by &#8220;capitalism&#8221;. Based on the arguments he presents, <em>I think</em> his definition of capitalism means the &#8220;modern credit-based financial system led by the hegemonic power of the Federal Reserve and the United States&#8221;. If that is the case, then I disagree with him on <em>his definition of capitalism</em> and not so much on <em>his notion that capitalism perpetuates trauma</em>. Indeed, I think our fiat credit-based financial system, propped up by <a href="https://www.tftc.io/recession-in-us/">exponentially increasing debt</a> and debasement of our currency, is the root cause of the inequality, division, and populism in our culture today. 
If we take capitalism to mean &#8220;free markets&#8221;, then the system we have isn&#8217;t really capitalism. The system we have today bails out banks that are too big to fail (privatized gains, socialized losses). The system we have today features a cabal of unelected central planners who distort the price of money. I have a lot more to say on this topic, but this post is already long enough, and it really deserves a dedicated post (or series of posts).</p><p>Despite these disagreements, the book is well worth the read. The depth of discussion regarding the various facets of trauma, as well as what we can do about it in our own lives, leaves the reader empowered to start living according to their authentic self.</p>]]></content:encoded></item><item><title><![CDATA[Core Values]]></title><description><![CDATA[I&#8217;ve decided to start a blog.]]></description><link>https://www.connorjdavis.com/p/core-values</link><guid isPermaLink="false">https://www.connorjdavis.com/p/core-values</guid><dc:creator><![CDATA[Connor Davis]]></dc:creator><pubDate>Tue, 23 Jul 2024 02:58:00 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!ZbOo!,w_256,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F16ac54db-581c-4d87-897b-1a07019f089d_1280x1280.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>I&#8217;ve decided to start a blog.</p><p>Lately I&#8217;ve had this gut feeling that I should start writing. I think this comes from a need to integrate and express the ideas I&#8217;m exposed to through media like books and podcasts.</p><p>Where does this need come from? I believe it springs from one of my core values: authenticity. To me, living an authentic life means sincerely and openly expressing one&#8217;s ideas and beliefs, even if it means damaging (or losing) interpersonal relationships with loved ones due to differences in opinion.</p><p>For most of my life, this fear of loss has led me to suppress authentic expression of my ideas and interests. I&#8217;ve learned that the root cause of this fear lies in the psychology of (insecure) attachment. 
<a href="https://en.wikipedia.org/wiki/Attachment_theory">Attachment</a> refers to the set of emotional and social patterns present in our interpersonal relationships.</p><p>As children, when any of our core social, emotional, and safety needs are not met, we modify our outward behavior to preserve our attachment with caregivers. This behavior change stems from deep within the survival instinct and leads to the suppression of authentic emotions. These behavior patterns are imprinted into our subconscious and manifest later in adolescent and adult relationships as either excessive neediness (anxious attachment), dismissiveness (avoidant attachment), or a mix of both (disorganized attachment), along with people-pleasing behavior.</p><p>Writing this blog is one way for me to overcome the fear of loss of attachment and express my authentic ideas and interests. My ultimate goal is to use this blog to connect with myself and others in an authentic way. Writing aligns with my other core values as well:</p><ul><li><p>Health &#8211; you can&#8217;t really do anything worthwhile in life if you aren&#8217;t healthy. Consistent exercise and eating right are huge components of this.</p></li><li><p>Authenticity &#8211; expressing what is on your mind, especially when it feels like it will be unpopular</p></li><li><p>Discipline &#8211; I&#8217;m committing to a regular writing schedule. At least one post a month, for now.</p></li><li><p>Growth mindedness &#8211; I&#8217;m not a great writer, but I know that I will get better over time with consistent effort. Writing about a topic will also force me to understand it more deeply.</p></li><li><p>Connection &#8211; my hope is that someone will find the ideas discussed in this blog helpful, and that they walk away having learned something that can make a positive impact on their life.</p></li></ul><p>This year is the first in my life where I&#8217;ve actually thought through and written down what my core values are. 
To find them, I thought about scenarios in the past when I felt awful over something I did or didn&#8217;t do; the &#8220;inverse&#8221; of that (mis)behavior pointed to the underlying value that needs to be honored in order not to feel that way in the future. I&#8217;m sure there are other methods for finding your core values, but this one has worked for me; maybe it will work for others too. The easy part is taking the time to think through them and write them down. The hard part is living them.</p><p>What are your core values?</p>]]></content:encoded></item></channel></rss>