### Animals

All protocols and animal handling procedures were carried out in strict accordance with a protocol (no. 19–190) approved by the Janelia Institutional Animal Care and Use Committee and in compliance with the standards set forth by the Association for Assessment and Accreditation of Laboratory Animal Care.

For behaviour and juxtacellular recordings, we used 24 adult male DAT-Cre::ai32 mice (3–9 months old) resulting from the cross of DAT^{IREScre} (The Jackson Laboratory stock 006660) and Ai32 (The Jackson Laboratory stock 012569) lines of mice, such that a ChR2–EYFP fusion protein was expressed under control of the endogenous dopamine transporter *Slc6a3* locus to specifically label dopaminergic neurons. Mice were maintained under specific-pathogen-free conditions. Mice were housed on a free-standing, individually ventilated (about 60 air changes per hour) rack (Allentown Inc.). The holding room was ventilated with 100% outside filtered air with >15 air changes per hour. Each ventilated cage (Allentown) was provided with corncob bedding (Shepard Specialty Papers), at least 8 g of nesting material (Bed-r'Nest, The Andersons) and a red mouse tunnel (Bio-Serv). Mice were maintained on a 12:12-h (8 am–8 pm) light/dark cycle, and recordings were made between 9 am and 3 pm. The holding room temperature was maintained at 21 ± 1 °C with a relative humidity of 30% to 70%. Irradiated rodent laboratory chow (LabDiet 5053) was provided ad libitum. Following at least 4 days of recovery from headcap implantation surgery, animals' water consumption was restricted to 1.2 ml per day for at least 3 days before training. Mice underwent daily health checks, and water restriction was eased if mice fell below 75% of their original body weight.

### Behavioural training

Mice were habituated to head fixation in a separate area from the recording rig in several sessions of increasing length over ≥3 days. During this time they received some manual water administration via a syringe. Mice were then habituated to head fixation while resting in a spring-suspended basket in the recording rig for at least two sessions of 30+ min each before training commenced. No liquid rewards were administered during this recording rig acclimation; thus, trial 1 in the data represents the first time naive mice received the liquid water reward in the training environment. The reward consisted of 3 μl of water sweetened with the non-caloric sweetener saccharin delivered via a lick port under control of a solenoid. A 0.5-s, 10-kHz tone preceded reward delivery by 1.5 s on 'cued' trials, and 10% of randomly chosen rewards were 'uncued'. Matching our previous training schedule^{9}, after three sessions, mice also experienced 'omission' probe trials, in which the cue was delivered but not followed by reward, on 10% of randomly chosen trials. Intertrial intervals were chosen from a randomly permuted exponential distribution with a mean of about 25 s. Ambient room noise was 50–55 dB; an audible click of about 53 dB accompanied solenoid opening on water delivery, and the predictive tone was about 65 dB loud. Mice experienced 100 trials per session and one session per day for 8–10 days. In earlier pilot experiments, it was observed that at comparable intertrial intervals, behavioural responses to cues and rewards began to decrease in some mice at 150–200 trials. Thus, the 100-trial-per-session limit was chosen to ensure homogeneity in motivated engagement across the dataset.

Some animals received optogenetic stimulation of VTA–DA neurons concurrent with reward delivery, contingent on their behaviour during the delay period (see technical details below). Mice were randomly assigned to stimulation group (control, stimLick−, stimLick+) before training. The experimenter was not blinded to group identity during data collection. Following trace conditioning with or without exogenous dopamine stimulation, five mice experienced an extra session during which VTA–DA neurons were optogenetically stimulated concurrently with cue presentation (Extended Data Fig. 4). Mice were then randomly assigned to groups for a new experiment in which a light cue predicted VTA–DA stimulation with no concurrent liquid water reward (5–7 days, 150–200 trials per day). The light cue consisted of a 500-ms flash of a blue light-emitting diode (LED) directed at the wall in front of the point of head fixation. Intertrial intervals were chosen from randomly permuted exponential distributions with a mean of about 13 s. Supplementary Table 1 lists the experimental groups each mouse was assigned to, in the order in which the experiments were experienced.

### Video and behavioural measurement

Face video was captured at 100 Hz continuously during each session with a single camera (Flea 3, FLIR) positioned level with the point of head fixation, at approximately a 30° angle from horizontal, and compressed and streamed to disk with custom code written by J. Keller (available at https://github.com/neurojak/pySpinCapture). Dim visible light was maintained in the rig so that pupils were not overly dilated, and an infrared LED (model #) trained on the face provided illumination for video capture. Video was post-processed with custom MATLAB code available on request.

Briefly, for each session, a rectangular region of interest (ROI) for each measurement was defined from the mean of 500 randomly drawn frames. Pupil diameter was estimated as the mean of the major and minor axes of the object detected with the MATLAB regionprops function, following noise removal by thresholding the image to separate light and dark pixels, then applying a circular averaging filter and then dilating and eroding the image. This noise removal process accounted for frames distorted by passage of whiskers in front of the eye, and for slight variations in face illumination between mice. For each session, appropriateness of fit was verified by overlaying the estimated pupil on the actual image for about 20–50 randomly drawn frames. A single variable, the dark/light pixel thresholding value, could be modified to ensure optimal fitting for each session. Nose motion was extracted as the mean of pixel displacement along the ROI *y* axis, estimated using an image registration algorithm (MATLAB imregdemons). Whisker pad motion was estimated as the absolute difference in the whisker pad ROI between frames (MATLAB imabsdiff; this was sufficiently accurate to define whisking periods, and required much less computing time than imregdemons). Whisking was determined as the crossing of pad motion above a threshold, and whisking bouts were made continuous by convolving pad motion with a smoothing kernel. Licks were timestamped as the moment pixel intensity in the ROI between the face and the lick port crossed a threshold.
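As an illustration of the simplest of these steps, lick timestamping, here is a minimal NumPy sketch (the actual pipeline used custom MATLAB code; `detect_licks`, the threshold value and the toy trace are hypothetical):

```python
import numpy as np

def detect_licks(roi_intensity, threshold):
    """Timestamp licks as upward crossings of ROI pixel intensity over a threshold.

    roi_intensity: 1-D array of mean pixel intensity in the ROI between the
    face and the lick port, one sample per video frame (100 Hz).
    Returns the frame indices at which intensity first exceeds the threshold.
    """
    above = roi_intensity > threshold
    # a lick is timestamped at the frame where the signal crosses upward
    return np.flatnonzero(above[1:] & ~above[:-1]) + 1

# toy trace: intensity rises above threshold twice (frames 2 and 6)
trace = np.array([0.1, 0.2, 0.9, 0.8, 0.2, 0.1, 0.95, 0.3])
licks = detect_licks(trace, threshold=0.5)
```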

Body movement was summarized as basket movements recorded by a triple-axis accelerometer (Adafruit ADXL335) attached to the bottom of a custom-designed 3D-printed basket suspended from springs (Century Spring Corp, ZZ3-36). Relative basket position was tracked by low-pass filtering accelerometer data at 2.5 Hz. Stimulations and cue deliveries were coordinated with custom-written software using Arduino Mega hardware (https://www.arduino.cc). All measurement and control signals were synchronously recorded and digitized (at 1 kHz for behavioural data, 10 kHz for fibre photometry data) with a Cerebus Signal Processor (Blackrock Microsystems). Data were analysed using MATLAB software (MathWorks).
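A first-order IIR filter is one simple way to realize the 2.5-Hz low-pass used to track relative basket position; this Python sketch is an illustrative stand-in for the actual MATLAB filtering (the function name and filter order are assumptions):

```python
import numpy as np

def lowpass(x, cutoff_hz, fs):
    """First-order IIR low-pass filter (illustrative stand-in for the 2.5-Hz
    filter applied to 1-kHz accelerometer data to track basket position)."""
    dt = 1.0 / fs
    rc = 1.0 / (2.0 * np.pi * cutoff_hz)   # RC time constant for the cutoff
    alpha = dt / (rc + dt)                 # per-sample smoothing factor
    y = np.empty_like(x, dtype=float)
    y[0] = x[0]
    for i in range(1, len(x)):
        y[i] = y[i - 1] + alpha * (x[i] - y[i - 1])
    return y
```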

### Preparatory and reactive measures and abstract learning trajectories

To describe the relationship between behavioural adaptations and reward collection performance, for each mouse in the control group a GLM was created to predict reward collection latency from preparatory and reactive predictor variables on each trial. Preparatory changes in licking, whisking, body movement and pupil diameter were quantified by measuring the average of each of those signals during the 1-s delay period preceding cued rewards. The nose motion signal was not included as it did not show consistent preparatory changes. Reactive responses in whisking, nose motion and body movement were measured as the latency to the first response following reward delivery. For whisking, this was simply the first moment of whisking following reward delivery. For nose motion, the raw signal was convolved with a smoothing kernel and then the first response was detected as a threshold crossing of the cumulative sum of the signal. For body movement, the response was detected as the first peak in the data following reward delivery. On occasional trials no event was detected within the analysis window. In addition, discrete blocks of trials were lost owing to data collection error for mouse 3, session 7; mouse 4, session 5; and mouse 9, session 4. To fit learning curves through these absent data points, missing trials were filled in using nearest-neighbour interpolation.

Trial-by-trial reward collection latencies and predictor variables (preparatory licking, whisking, body movement and pupil diameter; and reactive nose motion, whisking and body movement) were median filtered (MATLAB medfilt1(signal,10)) to minimize trial-to-trial variance in favour of variance due to learning across training. Collection latency was predicted from *z*-scored predictor variables using MATLAB glmfit to fit *β*-values for each predictor. The unique explained variance of each predictor was calculated as the difference in explained variance between the full model and a partial model in which *β*-values were fitted without using that predictor.
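The unique-explained-variance computation can be sketched in Python, with ordinary least squares standing in for MATLAB glmfit (an identity-link GLM); the helper names and toy data are illustrative:

```python
import numpy as np

def explained_variance(X, y, coef):
    """Fraction of variance in y explained by the linear prediction X @ coef."""
    return 1.0 - np.var(y - X @ coef) / np.var(y)

def unique_ev(X, y):
    """Unique explained variance of each predictor: explained variance of the
    full model minus that of a partial model refitted without the predictor."""
    Xz = (X - X.mean(axis=0)) / X.std(axis=0)      # z-score predictors
    Xz = np.column_stack([np.ones(len(y)), Xz])    # add intercept column
    full_coef, *_ = np.linalg.lstsq(Xz, y, rcond=None)
    full_ev = explained_variance(Xz, y, full_coef)
    uev = []
    for j in range(1, Xz.shape[1]):                # skip the intercept
        Xp = np.delete(Xz, j, axis=1)
        part_coef, *_ = np.linalg.lstsq(Xp, y, rcond=None)
        uev.append(full_ev - explained_variance(Xp, y, part_coef))
    return full_ev, np.array(uev)

# toy example: latency driven by the first predictor only
rng = np.random.default_rng(0)
x1, x2 = rng.normal(size=200), rng.normal(size=200)
latency = 2.0 * x1 + 0.1 * rng.normal(size=200)
full_ev, uev = unique_ev(np.column_stack([x1, x2]), latency)
```

In the toy example, nearly all of the unique explained variance is attributed to the first predictor.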

Preparatory and reactive predictor variables were used to define abstract learning trajectories, which were plots of collection latency against the inferred reactive and preparatory variables for each of the first 800 cue–reward trials of training. Reactive and preparatory variables were calculated as the first principal component of the individual reactive and preparatory variables used in the GLM fits. For visualization, we fitted a parametric model to all three variables (a single exponential for preparatory; double exponentials for reactive and latency; using the MATLAB fit function). Quality of fits and choice of model were verified by visual inspection of all data for all mice. An individual mouse's trajectory was then visualized by plotting downsampled versions of the fitted functions for latency, reactive and preparatory. Arrowheads were placed at logarithmically spaced trials.

To quantify the total amount of preparatory behaviour in each mouse at a given point in training (final prep. behav., Extended Data Fig. 3f), each preparatory measure (pupil, licking, whisking and body movement) was *z*-scored and combined across mice into a single data matrix. The first principal component of this matrix was calculated, and loading onto PC1 was defined as a measure of an inferred underlying 'preparatory' component of the behavioural policy. This created an equally weighted, variance-normalized combination of all preparatory measures to permit comparisons between individual mice. An identical method was used to reduce the dimensionality of the reactive variables down to a single 'reactive' dimension that captures maximal variance in reactive behavioural variables across animals (final reactive behav., Extended Data Fig. 3g). Initial NAc–DA signals were predicted from trained behaviour at trials 700–800 by multiple regression (specifically, the pseudoinverse of the data matrix of reactive and preparatory variables at the end of training multiplied by the data matrix of physiological signals for all animals).
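A minimal Python sketch of the two operations described above, PC1 loading via SVD and prediction via the pseudoinverse (function names are hypothetical; the original analysis was done in MATLAB):

```python
import numpy as np

def first_pc_loading(M):
    """Loading of each row (animal) onto the first principal component of a
    z-scored measure matrix, as used for the summary 'preparatory' and
    'reactive' scores."""
    Mc = M - M.mean(axis=0)
    _, _, vt = np.linalg.svd(Mc, full_matrices=False)
    return Mc @ vt[0]          # projection onto PC1

def predict_signals(B, D):
    """Multiple regression via pseudoinverse: predict physiological signals D
    from the end-of-training behaviour matrix B (rows = animals)."""
    W = np.linalg.pinv(B) @ D
    return B @ W
```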

### Combined fibre photometry and optogenetic stimulation

In the course of a single surgery session, DAT-Cre::ai32 mice received: bilateral injections of AAV2/1-CAG-FLEX-jRCaMP1b in the VTA (150 nl at the coordinates −3.1 mm anterior–posterior (A–P), 1.3 mm medial–lateral (M–L) from bregma, at depths of 4.6 and 4.3 mm) and in the substantia nigra pars compacta (100 nl at the coordinates −3.2 mm A–P, 0.5 mm M–L, depth of 4.1 mm); custom 0.39-NA, 200-μm fibre cannulas implanted bilaterally above the VTA (−3.2 mm A–P, 0.5 mm M–L, depth of 4.1 mm); and fibre cannulas implanted unilaterally in the DS (0.9 mm A–P, 1.5 mm M–L, depth of 2.5 mm) and NAc (1.2 mm A–P, 0.85 mm M–L, depth of 4.3 mm). Hemisphere choice was counterbalanced across individuals. A detailed description of the methods has been published previously^{56}.

Imaging began >20 days post-injection using custom-built fibre photometry systems (Fig. 2a)^{56}. Two parallel excitation–emission channels through a five-port filter cube (FMC5, Doric Lenses) allowed simultaneous measurement of RCaMP1b and eYFP fluorescence, the latter channel serving to control for the presence of movement artefacts. Fibre-coupled LEDs of 470 nm and 565 nm (M470F3, M565F3, Thorlabs) were connected to excitation ports with acceptance bandwidths of 465–490 nm and 555–570 nm, respectively, with 200-μm, 0.22-NA fibres (Doric Lenses). Light was conveyed between the sample port of the cube and the animal by a 200-μm-core, 0.39-NA fibre (Doric Lenses) terminating in a ceramic ferrule that was connected to the implanted fibre cannula by a ceramic mating sleeve (ADAL1, Thorlabs) using index-matching gel to improve coupling efficiency (G608N3, Thorlabs). Light collected from the sample fibre was measured at separate output ports (emission bandwidths 500–540 nm and 600–680 nm) by 600-μm-core, 0.48-NA fibres (Doric Lenses) connected to silicon photoreceivers (2151, Newport).

A time-division multiplexing strategy was used in which LEDs were controlled at a frequency of 100 Hz (1 ms on, 10 ms off), offset from one another to avoid crosstalk between channels. A Y-cable split each LED output between the filter cube and a photodetector to measure output power. LED output power was 50–80 μW. This low power, combined with the 10% duty cycle used for multiplexing, prevented local ChR2 excitation^{56} by the 473-nm eYFP excitation. Excitation-specific signals were recovered in post-processing by keeping data from each channel only when its LED output power was high. Data were downsampled to 100 Hz and then band-pass filtered between 0.01 and 40 Hz with a second-order Butterworth filter. Although movement artefacts were negligible when mice were head-fixed in the rig (the movable basket was designed to minimize brain movement with respect to the skull^{9}), following standard procedure the least-squares fit of the eYFP movement-artefact signal was subtracted from the jRCaMP1b signal. d*F*/*F* was calculated by dividing the raw signal by a baseline defined as the polynomial trend (MATLAB detrend) across the entire session. This preserved local slow signal changes while correcting for photobleaching. Comparisons between mice were carried out using the *z*-scored d*F*/*F*.
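The demultiplexing and normalization steps can be sketched as follows (a Python stand-in for the MATLAB pipeline; `demux` and `dff` are hypothetical helpers, and the (raw − baseline)/baseline convention shown is one common choice):

```python
import numpy as np

def demux(signal, led_power, power_thresh):
    """Keep samples of an excitation channel only when its LED's measured
    output power is high (time-division multiplexing recovery)."""
    return signal[led_power > power_thresh]

def dff(raw, baseline):
    """dF/F relative to a slowly varying baseline (the paper derived the
    baseline from the polynomial trend via MATLAB detrend, correcting for
    photobleaching while preserving slow signal changes)."""
    return (raw - baseline) / baseline
```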

Experimenters were blind to group identity during the initial stages of analysis, when analysis windows were determined and custom code was established to quantify fibre photometry signals and behavioural measurements. Analysis windows were chosen to capture the extent of mean phasic activations following each kind of stimulus. For NAc–DA and VTA–DA, reward responses were quantified from 0 to 2 s after reward delivery, and cue responses were quantified from 0 to 1 s after cue delivery. DS–DA exhibited much faster kinetics, and thus reward and cue responses were quantified from 0 to 0.75 s after delivery.

Somatic ChR2 excitation was carried out with a 473-nm laser (50 mW, OEM Laser Systems) coupled by a branching fibre patch cord (200 μm, Doric Lenses) to the VTA-implanted fibres using ceramic mating sleeves. Burst activations at 30 Hz (10 ms on, 23 ms off) were delivered with durations of either 150 ms for calibrated stimulation or 500 ms for large stimulations. For calibrated stimulation, laser power was set between 1 and 3 mW (steady-state output) to produce a NAc–DA transient of comparable amplitude to the largest transients observed during the first several trials of the session. This was confirmed post hoc to have roughly doubled the size of reward-related NAc–DA transients (Figs. 3a and 5b). For large stimulations, steady-state laser output was set to 10 mW.

### ACTR computational learning model

#### Behavioural plant

An essential aspect of this modelling work was to create a generative agent model that would produce core aspects of reward-seeking behaviour in mice. To this end, we focused on licking, which in the context of this task is the unique aspect of behaviour essential for reward collection. A reader may look at the function dlRNN_Pcheck_transfer.m within the software repository to understand the structure of the plant model. We describe the function of the plant briefly here. It is well known that during consumptive, repetitive licking, mice exhibit periods of about 7 Hz licking. We modelled a simple fixed-rate plant with an active 'lick' state that emitted observed licks at a fixed time interval of 150 ms. The onset of this lick pattern relative to entry into the lick state started at a variable phase of the interval (average latency to lick initiation from transition into the lick state about 100 ms). Stochastic transitions between 'rest' and 'lick' states were governed by forward and backward transition rates. The backward transition rate was a constant that depended on the presence of reward (5 × 10^{−3} ms without reward, 5 × 10^{−1} ms with reward). This change in the backward rate captured the average duration of consumptive licking bouts. The forward rate was governed by the scaled policy network output and a background tendency to transition to licking as a function of trial time (analogous to an exponentially rising hazard function; 𝜏 = 100 ms). The output unit of the policy network was the sum of the RNN output unit (constrained to {−1,1} by the tanh activation function) and a large reactive transient proportional to the sensory weight ({0,max_scale}), in which max_scale was a free parameter typically bounded from 5 to 10 during initialization. This net output was scaled by *S* = 0.02 ms^{−1} to convert to a scaled transition rate in the policy output.
Behaviour of the plant for a range of policies is illustrated in Extended Data Fig. 2. A variety of parameterizations were explored with qualitatively similar results. Chosen parameters were arrived at by scanning many different simulations and matching average initial and final latencies for cue–reward pairings across the population of animals. More complicated versions (high-pass filtered, nonlinear scaling) of the transformation from RNN output to transition rate can be explored in the provided code. However, all transformations were found to produce qualitatively similar results, and thus the simplest (scalar) transformation was chosen for reported simulations for clarity of presentation.
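A stripped-down Python sketch of the two-state plant may clarify its logic (the authoritative version is dlRNN_Pcheck_transfer.m; the rising-hazard term is omitted here and all constants are simplified assumptions):

```python
import numpy as np

def simulate_plant(policy, forward_scale=0.02, back_rate=5e-3,
                   lick_interval=150, seed=0):
    """Two-state ('rest'/'lick') stochastic plant, simplified.

    policy: per-ms policy output (RNN output plus any reactive transient).
    While in the lick state, licks are emitted every lick_interval ms,
    starting at a random phase; the forward (rest -> lick) rate is the
    scaled policy output, and the backward rate is a constant.
    Returns lick times in ms."""
    rng = np.random.default_rng(seed)
    state, next_lick, licks = 'rest', None, []
    for t, p in enumerate(policy):
        if state == 'rest':
            if rng.random() < max(p, 0.0) * forward_scale:
                state = 'lick'
                next_lick = t + rng.integers(0, lick_interval)  # variable phase
        else:
            if t == next_lick:
                licks.append(t)
                next_lick = t + lick_interval
            if rng.random() < back_rate:        # constant exit rate
                state, next_lick = 'rest', None
    return licks
```

A sustained positive policy produces licking bouts; a zero policy produces none.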

#### RNN

As noted in the main text, the RNN component of the model and the learning rules used for training drew inspiration from ref. ^{36}, which itself drew on variants of node perturbation methods^{61} and the classic policy optimization methods known as REINFORCE rules^{3,21}. Briefly, ref. ^{36} demonstrated that a relatively simple learning rule, which computed a nonlinear function of the correlation between a change in input and a change in output multiplied by the change in performance on the objective, was sufficiently correlated with the analytic gradient to allow efficient training of the RNN. We made several modifications relative to this prior work. Below we describe the learning rule as implemented here; a reader may also examine the commented open-source code for further clarification. First, we describe the structure of the RNN and some core aspects of its function in the context of the model. The RNN was constructed largely as described in ref. ^{36}, and was very similar in structure to a re-implementation of that model in ref. ^{62}.

Although we explored a range of parameters governing RNN construction, many examples of which are shown in Extended Data Fig. 2, the simulations shown in the main results come from a network with 50 units (*N*_{u} = 50; chosen for simulation efficiency; larger networks were explored extensively as well), densely connected (*P*_{c} = 0.9), spectral scaling to produce preparatory dynamics (*g* = 1.3), a characteristic time constant (𝜏 = 25 ms) and a standard tanh activation function for individual units. Initial internal weights of the network (*W*_{ij}) were assigned according to the equation (in RNN-dudlab-master-LearnDA.m)

$$W_{ij}=g\times \mathcal{N}(0,1)\times (P_{\mathrm{c}}\times N_{\mathrm{u}})^{-1/2}$$

(1)

The RNN had a single primary output unit whose activity constituted the continuous-time policy (that is, *π*(*t*)) input to the behaviour plant (see above), and a 'feedback' unit that did not project back into the network, as would be customary, but rather was used to produce adaptive changes in the learning rate (described in more detail in the section below entitled Learning rules).
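Equation (1), together with the connection probability, can be sketched in Python (an illustrative re-implementation; the reference implementation is in RNN-dudlab-master-LearnDA.m):

```python
import numpy as np

def init_weights(n_units=50, p_connect=0.9, g=1.3, seed=1):
    """Initial internal RNN weights per equation (1),
    W_ij = g x N(0,1) x (P_c x N_u)^(-1/2),
    applied on top of a random connection mask with probability P_c."""
    rng = np.random.default_rng(seed)
    W = g * rng.standard_normal((n_units, n_units)) / np.sqrt(p_connect * n_units)
    mask = rng.random((n_units, n_units)) < p_connect
    return W * mask
```

With these defaults the spectral radius lands near *g* = 1.3, the regime the text identifies as necessary for preparatory dynamics.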

#### Objective function

Evaluation of model performance was calculated according to an objective function that defines the cost as the performance cost (equation (2), cost_{P}) and an optional network stability cost (equation (3), cost_{N}) (for example, lines 269 and 387 in dlRNN-train_learnDA.m, for equations (4) and (5), respectively)

$$\mathrm{cost}_{\mathrm{P}}=1-\mathrm{e}^{-\Delta t/500}$$

(2)

$$\mathrm{cost}_{\mathrm{N}}=\sum \left|\delta \pi (t)/\delta t\right|$$

(3)

$$R_{\mathrm{obj}}=(1-\mathrm{cost}_{\mathrm{P}})-W_{\mathrm{N}}\times \mathrm{cost}_{\mathrm{N}}$$

(4)

$$\langle R(T)\rangle =\alpha_{\mathrm{R}}\times R_{\mathrm{obj}}(T)+(1-\alpha_{\mathrm{R}})\times R_{\mathrm{obj}}(T-1)$$

(5)

in which *T* is the trial index. In all presented simulations, *W*_{N} = 0.25. A filtered average cost, ⟨*R*⟩, was computed as before^{36} with *α*_{R} = 0.75 and used in the update equation for changing network weights via the learning rule described below. For all constants, a range of values was tried with qualitatively similar results. The performance objective was defined by cost_{P}, for which ∆*t* is the latency to collect reward after it is available. The network stability cost (cost_{N}) penalizes high-frequency oscillatory dynamics that can emerge in some (but not all) simulations. Such oscillations are inconsistent with observed dynamics of neural activity to date.
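The cost terms in equations (2), (3) and (5) can be sketched directly (Python; latency in ms as above, and function names are illustrative):

```python
import numpy as np

def performance_cost(latency_ms):
    """Equation (2): cost_P = 1 - exp(-dt/500), dt = latency to collect reward (ms)."""
    return 1.0 - np.exp(-latency_ms / 500.0)

def stability_cost(pi):
    """Equation (3): summed absolute policy derivative, penalizing
    high-frequency oscillations in the policy output."""
    return np.sum(np.abs(np.diff(pi)))

def filtered_cost(r_obj_now, r_obj_prev, alpha_r=0.75):
    """Equation (5): trial-filtered average of the objective."""
    return alpha_r * r_obj_now + (1.0 - alpha_r) * r_obj_prev
```

Note how a smooth policy ramp incurs a much smaller stability cost than an oscillating one of the same range.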

#### Identifying properties of the RNN required for optimal performance

To examine what properties of the RNN were required for optimal performance, we scanned through thousands of simulated network configurations (random initializations of *W*_{ij}) and ranked these networks according to their mean cost (*R*_{obj}) when run through the behaviour plant for 50 trials (an illustrative group of such simulations is shown in Extended Data Fig. 2). This analysis revealed several key aspects of the RNN required for optimality. First, a preparatory policy that spans time from the detection of the cue through the delivery of water reward minimizes latency cost. Second, although optimal RNNs are relatively indifferent to some parameters (for example, *P*_{c}), they tend to require a coupling coefficient (*g*) ≥ 1.2. This range of values for the coupling coefficient is known to determine the capacity of an RNN to develop preparatory dynamics^{63}. Consistent with this interpretation, our findings showed that optimal policies were observed uniquely in RNNs with large leading eigenvalues (Extended Data Fig. 2; that is, long-time-constant dynamics^{64}). These analyses define the optimal policy as one that requires preparatory dynamics of output unit activity spanning the interval between cue offset and reward delivery, and further reveal that an RNN with long-timescale dynamics is required to realize such a policy. Intuitively: preparatory anticipatory behaviour, or 'conditioned responding', optimizes reward collection latency. If an agent is already licking when reward is delivered, the latency to collect that reward is minimized.

#### RNN initialization for simulations

All mice examined in our experiments began training with no preparatory licking to cues and a long latency (about 1 s or more) to collect water rewards. This suggests that animal behaviour is consistent with an RNN initialization that has a policy *π*(*t*) ≈ 0 for the entire trial. As noted above, there are many random initializations of the RNN that can produce clear preparatory behaviour and even optimal performance. Thus, we performed large searches of RNN initializations (random matrices *W*_{ij}) and used only those that had approximately zero average activity in the output unit. We used a variety of different initializations across the simulations reported (Fig. 1 and Extended Data Fig. 2), and indeed there can be substantial differences in the observed rate of convergence depending on initial conditions (as there are across mice as well). For simulations of individual differences (Fig. 1j and Extended Data Fig. 2), distinct network initializations were chosen (as described above), and paired comparisons were made between the control initialization and an initialization in which the weights of the inputs from the reward to the internal RNN units were tripled.

#### Learning rules

Below we articulate how each aspect of the model acronym, ACTR (adaptive rate cost of performance to REINFORCE), is reflected in the learning rule that governs updates to the RNN. The connection between the variant of node perturbation used here and REINFORCE^{21} has been discussed in detail previously^{36}. There are two key classes of weight changes governed by distinct learning rules within the ACTR model. First, we discuss the learning that governs changes in the 'internal' weights of the RNN (*W*_{ij}). The idea of the rule is to use perturbations (1–10 Hz rate of perturbations in each unit; reported simulations used 3 Hz) to drive fluctuations in activity, and corresponding changes in the output unit, that might improve or degrade performance. To solve the temporal credit assignment problem, we used eligibility traces similar to those described previously^{36}. One difference here was that the eligibility trace decayed exponentially with a time constant of 500 ms; it is unclear whether decay was a feature of the prior work. The eligibility trace (*e*) for a given connection *i*,*j* could be modified at any time point by computing a nonlinear function (*ϕ*) of the product of the derivative of the input from the *i*th unit (*x*_{i}) and the output rate of the *j*th unit (*r*_{j}) in the RNN according to the equation (in dlRNN_engine.m)

$$e_{i,j}(t)=e_{i,j}(t-1)+\phi [r_{j}(t-1)\times (x_{i}(t)-\langle x_{i}\rangle )]$$

(6)

As noted in ref. ^{36}, the function *ϕ* need only be a signed, nonlinear function. Similarly, in our simulations we found that a range of functions could all be used. Typically, we used either *ϕ*(*y*) = *y*^{3} or *ϕ*(*y*) = |*y*| × *y*; presented simulations generally used the latter, which runs more rapidly.
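Equation (6), with the 500-ms exponential decay described in the text and the |*y*| × *y* nonlinearity, can be sketched as follows (Python; exactly how the decay enters the update is an assumption of this sketch):

```python
import numpy as np

def phi(y):
    """Signed nonlinearity for the eligibility update; the text reports using
    y**3 or |y| * y (the latter shown here)."""
    return np.abs(y) * y

def update_eligibility(e, r_prev, x_now, x_mean, decay=np.exp(-1.0 / 500.0)):
    """Equation (6) plus the 500-ms exponential decay described in the text:
    e <- decay * e + phi(r_j(t-1) * (x_i(t) - <x_i>))."""
    return decay * e + phi(r_prev * (x_now - x_mean))
```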

The change in a connection weight (*W*_{ij}) in the RNN in the original formulation^{36} is then computed as the product of the eligibility trace and the change in PE, scaled by a learning rate parameter. Our implementation kept this core aspect of the computation, but several essential updates were made and will be described. First, because the eligibility trace is believed to be 'read out' into a plastic change in the synapse by a phasic burst of dopamine firing^{58}, we chose to evaluate the eligibility at the time of the computed burst of dopamine activity estimated from the activity of the parallel feedback unit (see below for further details). Again, models that do not use this convention can also converge, but generally converge worse than, and less similarly to, observed mice. The update equation is thus (for example, line 330 in dlRNN-train_learnDA.m)

$$W_{i,j}(T)=W_{i,j}(T-1)+\beta_{\mathrm{DA}}\times \eta_{\mathrm{W}}\times e_{i,j}(t_{\mathrm{DA}})\times (R_{\mathrm{obj}}(T)-\langle R(T)\rangle )$$

(7)

in which *η*_{W} is the baseline learning rate parameter, generally used in the range 5 × 10^{−4} ± 1 × 10^{−3}, and *β*_{DA} is the 'adaptive rate' parameter: a nonlinear function (sigmoid) of the sum of the derivative of the policy at the time of reward, plus the magnitude of the reactive response component, plus a tonic activity component, *T* (*T* = 1 except in Extended Data Fig. 2 where noted, and *ϕ* is a sigmoid function mapping inputs from {0,10} to {0,3} with parameters *σ* = 1.25, *μ* = 7) (for example, line 259 in dlRNN-train_learnDA.m):

$$\beta_{\mathrm{DA}}=T+\phi (\Delta \pi (t_{\mathrm{reward}})+S_{i,\mathrm{reward}})$$

(8)

As noted in the description of the behavioural data in Fig. 1, it is clear that animal behaviour reveals learning of both preparatory behavioural responses to the cue and reactive learning that reduces response times between sensory input (either cues or rewards) and motor outputs. This is particularly prominent in early training, during which a marked decrease in reward collection latency occurs even in the absence of particularly large changes in the preparatory component of behaviour. We interpreted this reactive component as a 'direct' sensorimotor transformation, consistent with the treatment of response times in the literature^{65}; thus reactive learning updates the weights between sensory inputs and the output unit (one specific element of the RNN, indexed as 'o' below). This reactive learning was also updated according to PEs, specifically the difference between *R*_{obj}(*T*) and the activity of the output unit at the time of reward delivery. For the cue, updates were proportional to the difference between the derivative of the output unit activity at the cue and the PE at reward delivery. These rates were also scaled by the same *β*_{DA} adaptive learning rate parameter (for example, line 346 in dlRNN-train_learnDA.m):

$${W}_{{\rm{trans}},{\rm{o}}}(T)={W}_{{\rm{trans}},{\rm{o}}}(T-1)+{{\boldsymbol{\beta }}}_{{\rm{DA}}}\times {\eta }_{{\rm{I}}}\times ({R}_{{\rm{obj}}}(T)-{\boldsymbol{\pi }}({t}_{{\rm{reward}}}))$$

(9)

in which *η*_{I} is the baseline reactive learning rate; typical values were about 0.02 in the presented simulations (again, a range of different initializations was examined).
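A minimal sketch of the per-trial reactive weight update in equation (9), assuming scalar weights and illustrative values (the function and variable names are hypothetical):

```python
def update_reactive_weight(w_prev, beta_da, eta_i, r_obj, pi_reward):
    """Equation (9): move the sensory-to-output weight by the
    (R_obj - pi) error term, scaled by eta_I and beta_DA."""
    return w_prev + beta_da * eta_i * (r_obj - pi_reward)

# With eta_I = 0.02 (typical value from the text), the weight grows
# while the objective exceeds the policy output at reward time
w = 0.0
for _ in range(10):
    w = update_reactive_weight(w, beta_da=1.0, eta_i=0.02,
                               r_obj=0.5, pi_reward=0.0)
print(round(w, 3))  # → 0.1
```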

We compared acquisition learning in the full ACTR model to observed mouse behaviour using a variety of approaches. We scanned approximately two orders of magnitude for the two main parameters *η*_{I} and *η*_{W}. We also aimed to sample the model across a range of initializations that roughly covered the range of learning curves exhibited by control mice. To scan this space, we used the following procedure. We initialized 500–1,000 networks with random internal weights and initial sensory input weights (as described above). As no mice that we observed initially exhibited sustained licking, we selected six network initializations with preparatory policies approximately constant and zero. For these six network initializations, we ran 24 simulations with four conditions for each initialization. Specifically, we simulated input vectors with initial weights 𝒮 = [0.1, 0.125, 0.15, 0.175] and baseline learning rates *η*_{I} = [2, 2.25, 2.5, 2.75] × 8 × 10^{−3}. Representative curves from these simulations are shown in Fig. 1j.
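The 6 × 4 simulation grid described above can be enumerated as follows (a sketch; pairing each initial weight with the corresponding learning rate is our reading of the four conditions):

```python
# Four paired conditions: initial sensory weight and baseline learning rate
s_init = [0.1, 0.125, 0.15, 0.175]
eta_i = [x * 8e-3 for x in (2, 2.25, 2.5, 2.75)]
conditions = list(zip(s_init, eta_i))

# Six selected network initializations x four conditions = 24 simulations
runs = [(net, s, eta) for net in range(6) for (s, eta) in conditions]
print(len(runs))  # → 24
```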

#### Visualizing the objective surface

To visualize the objective surface that governs learning, we scanned a range of policies (combinations of reactive and preparatory components) passed through the behaviour plant. The range of reactive components covered was [0:1.1] and of preparatory components [−0.25:1]. This range corresponded to the space of all possible policy outputs realizable by the ACTR network. For each pair of values, a policy was computed and passed through the behaviour plant 50 times to obtain an estimate of the mean performance cost. These simulations were then fitted with a third-order, two-dimensional polynomial (analogous to the procedure used for experimental data) and visualized as a three-dimensional surface.
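The scanning logic can be sketched as below. The plant here is a toy stand-in (in the model it is the full ACTR behaviour plant), so only the grid scan and 50-repeat averaging reflect the text:

```python
import random

def plant_cost(reactive, preparatory):
    """Toy stand-in for the behaviour plant's performance cost:
    a noisy bowl, purely for illustration."""
    base = (reactive - 0.6) ** 2 + (preparatory - 0.5) ** 2
    return base + random.gauss(0.0, 0.01)

def mean_cost(reactive, preparatory, n=50):
    # 50 evaluations per policy pair, as in the text
    return sum(plant_cost(reactive, preparatory) for _ in range(n)) / n

# Scan the stated policy ranges: reactive [0, 1.1], preparatory [-0.25, 1]
reactive_grid = [round(i * 0.1, 2) for i in range(12)]
prep_grid = [round(-0.25 + i * 0.125, 3) for i in range(11)]
surface = {(r, p): mean_cost(r, p) for r in reactive_grid for p in prep_grid}
print(len(surface))  # → 132 grid points
```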

In the case of experimental data, the full distribution of individual trial data points across all mice (*N* = 7,200 observations) was used to fit a third-order, two-dimensional polynomial (MATLAB fit). Observed trajectories of preparatory versus reactive components were superimposed on this surface by finding the nearest corresponding point on the fitted two-dimensional surface for the parametric preparatory and reactive trajectories. These data are presented in Fig. 1j.

#### Simulating closed-loop stimulation of mDA experiments

We sought to develop an experimental test of the model that was tractable (as opposed to, for example, inferring the unobserved policy). The experimenter in principle has access to real-time detection of licking during the cue–reward interval. In simulations, this can likewise be observed by monitoring the output of the behavioural plant. Thus, in the model we kept track of individual trials and the number of licks produced in the cue–reward interval. For analysis experiments (Fig. 5e), we tracked these trials and separately calculated the expected dopamine responses depending on trial type classification (lick– versus lick+). For simulations in Fig. 5e, we ran simulations from the same initialization in nine replicates (matched to the number of control mice) and error bars reflect the standard error.

To simulate calibrated stimulation of mDA neurons, we multiplied the adaptive rate parameter, *β*_{DA}, by 2 on the appropriate trials. For simulations reported in Fig. 5e, we used three conditions: control, stimLick– and stimLick+. For each of these three conditions, we ran nine simulations (three different initializations, three replicates) for 27 total learning simulations (800 trials each). This choice was an attempt to estimate the expected experimental variance, as the trial classification scheme is an imperfect estimate of the underlying policy.
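A sketch of the closed-loop stimulation logic (names such as `classify_trial` are hypothetical; the doubling of *β*_{DA} on targeted trials is from the text):

```python
def classify_trial(n_cue_reward_licks):
    """Classify a trial by anticipatory licking in the cue-reward interval."""
    return "lick+" if n_cue_reward_licks > 0 else "lick-"

def stimulated_beta_da(beta_da, trial_type, condition):
    """Calibrated mDA stimulation: double beta_DA on targeted trials only.
    condition is 'control', 'stimLick-' or 'stimLick+'."""
    if condition == "stimLick-" and trial_type == "lick-":
        return 2.0 * beta_da
    if condition == "stimLick+" and trial_type == "lick+":
        return 2.0 * beta_da
    return beta_da

print(stimulated_beta_da(1.2, classify_trial(0), "stimLick-"))  # → 2.4
```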

#### Pseudocode summary of the model

Here we provide an overview of how the model functions in pseudocode, to complement the graphical diagrams in the main figures and the discursive descriptions of the individual components that are used below.

Initialize trial to *T* = 0

Initialize ACTR with *W*(0), 𝒮_{rew}(*T*), 𝒮_{cue}(*T*)

repeat

Run RNN simulation engine for trial *T*

Compute plant input *π*(*T*) = *O*(*T*) + 𝒮(*T*)

Compute lick output *L*(*t*) = Plant(*π*(*T*))

Compute latency to collect reward *t*_{collect} ← find *L*(*t*) > *t*_{reward}

Compute cost(*T*) = 1 − exp(−∆*t*/500)

Evaluate eligibility trace at collection *e* ← *e*_{i,j}(*t*_{collect})

Compute *β*_{DA} = 1 + *ϕ*(∆*π*(*t*_{reward}) + 𝒮_{rew})

Compute *R*_{obj}(*T*) = 1 − (1 − exp(−∆*t*/500)) − *O*(*T*, *t*_{reward} − 1)

Estimate objective gradient PE = *R*_{obj}(*T*) − ⟨*R*(*T*)⟩

Compute update ∆*W* = −*η*_{J} × *e* × PE × *β*_{DA}

Update *W*(*T* + 1) ← *W*(*T*) + ∆*W*

Update 𝒮_{rew}(*T* + 1) ← 𝒮_{rew}(*T*) + *η*_{𝒮} × *R*_{obj}(*T*) × *β*_{DA}

Update 𝒮_{cue}(*T* + 1) ← 𝒮_{cue}(*T*) + *η*_{𝒮} × *R*_{obj}(*T*) × *β*_{DA}

until *T* == 800

in which *T* is the current trial and *t* is time within a trial, *W* is the RNN connection weight matrix, 𝒮 is the sensory input strength, *O* is the RNN output, *π* is the behavioural policy, ∆*t* = *t*_{collect} − *t*_{reward}, *ϕ* is the nonlinear (sigmoid) transform, ⟨*R*(*T*)⟩ is the running mean PE, *η*_{J} is the baseline learning rate for *W* and *η*_{𝒮} is the baseline learning rate for the input 𝒮.
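For orientation, the trial loop above can be condensed into a runnable Python sketch. The RNN, plant and eligibility trace are replaced with toy stand-ins (assumptions, marked in comments), so this illustrates only the update logic, not the actual ACTR dynamics:

```python
import math
import random

def run_actr_sketch(n_trials=800, eta_j=0.02, eta_s=5e-4, seed=0):
    """Toy version of the pseudocode loop; all dynamics are stand-ins."""
    random.seed(seed)
    W, s_rew, s_cue = 0.0, 0.1, 0.1
    r_mean = 0.0                                  # running mean PE baseline
    for _ in range(n_trials):
        pi = W + s_rew + s_cue                    # plant input (toy policy)
        dt = 500.0 * math.exp(-max(pi, 0.0))      # toy collection latency
        r_obj = math.exp(-dt / 500.0)             # 1 - cost; O term omitted
        pe = r_obj - r_mean
        r_mean += 0.1 * (r_obj - r_mean)          # update running mean <R(T)>
        beta_da = 1.0 + 3.0 / (1.0 + math.exp(-(pi - 7.0) / 1.25))
        elig = random.uniform(0.0, 1.0)           # stand-in eligibility trace
        W += -eta_j * elig * pe * beta_da         # dW = -eta_J x e x PE x beta_DA
        s_rew += eta_s * r_obj * beta_da          # sensory input updates
        s_cue += eta_s * r_obj * beta_da
    return W, s_rew, s_cue

W, s_rew, s_cue = run_actr_sketch()
print(s_rew > 0.1)  # → True (sensory input strength grows over training)
```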

#### ACTR model variants

In Fig. 1k, we consider three model variants corresponding to dopamine signalling PEs, dopamine depletion and loss of phasic dopamine activity, all manipulations that have been published in the literature. To accomplish these simulations, we respectively: changed *β*_{DA} to equal the PE; changed the *β*_{DA} offset from 1 to 0.1; and changed *β*_{DA} to equal 1 and removed the adaptive term.

In Figs. 3 and 5, calibrated stimulation was modelled as setting *β*_{DA} to double the maximal possible magnitude of *β*_{DA} under normal learning. In Figs. 3c–e and 5i, we modelled uncalibrated dopamine stimulation as setting PE = +1 in addition to the calibrated stimulation effect.

#### TD learning model

To model a standard TD value learning model, we reimplemented a previously published model that spanned a range of model parameterizations from ref. ^{66}.

#### Policy learning model equivalent to the low-parameter TD learning model

The ACTR model that we articulate seeks to provide a plausible mechanistic account of naive trace conditioning learning using: RNNs; a biologically plausible synaptic plasticity rule; conceptually accurate circuit organization of mDA neurons; a ‘plant’ to control realistic behaviour; and multiple components of processing of sensory cues and rewards. However, to facilitate formal comparison between value learning and direct policy learning models, we sought to develop a simplified model that captures a key aspect of ACTR (the specific gradient it uses) and allows explicit comparison against existing value learning models with the same number of free parameters. To model a low-parameter (as compared with ACTR) policy learning equivalent of the TD value learning model from ref. ^{67}, we used the same core structure, basis function representation and free parameters. However, rather than using an RPE (value gradient) for updating, we follow previous work^{32} and consider a direct policy learning version in which a policy gradient is used for updates, as originally described in ref. ^{21} and equivalent in terms of the effective gradient to the ACTR implementation. First, we consider the latency to collect reward, rather than the reward value per se as used in TD models. The latency to collect reward is a monotonic function of the underlying policy, such that an increased policy leads to increased anticipatory licking and thus a reduction in the collection latency (Fig. 1). Typically one uses a nonlinearity that saturates towards the limits 0,1. For simplicity, we chose a soft nonlinearity (half-Gaussian) for convenience of the simple policy gradient that results.

Regardless of the scaling parameter of the Gaussian (sigma), the derivative of the log of the policy is then proportional to 1 − *p*_{t}, in which *p*_{t} is the policy on trial *t* (subject to scaling by a constant proportional to sigma that is subsumed into a learning rate term in the update equation). Consistent with the REINFORCE algorithm family^{21}, we have an update function proportional to (*r*_{curr} − *b*) × (1 − *p*_{t}), in which *r*_{curr} is the current trial reward collection latency and *b* is a local average of the latency calculated by *b* = *υ* × *r*_{curr} + (1 − *υ*) × *b*. Typical values for *υ* were 0.25 (although a range of different calculations for *b*, including *b* = 0, yielded consistent results, as noted previously^{21}).
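The resulting REINFORCE-style update can be sketched as follows (the learning rate `lr` is a hypothetical constant subsuming the Gaussian scale; the sign convention depends on whether *r* is coded as reward or latency):

```python
def policy_update(p, r_curr, b, lr=0.05, upsilon=0.25):
    """One trial of the low-parameter policy model: update proportional to
    (r_curr - b) * (1 - p), then refresh the running baseline b."""
    p_new = p + lr * (r_curr - b) * (1.0 - p)
    b_new = upsilon * r_curr + (1.0 - upsilon) * b
    return p_new, b_new

p, b = policy_update(0.2, r_curr=1.0, b=0.5)
print(round(p, 3), round(b, 3))  # → 0.22 0.625
```

Note that as *p*_{t} approaches 1, the (1 − *p*_{t}) factor shrinks the update, giving the saturation behaviour described above.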

#### Formal model comparison

As in previous work^{32}, we sought to compare the relative likelihood of the observed data under the optimal parameterization of either the value learning (TD) model or the direct policy learning model. The data we aimed to evaluate were the frequency of anticipatory licking during the delay period over the first approximately 1,000 trials of naive learning for each mouse. We used a recent model formalization proposed to describe naive learning^{67} and used grid search to find optimal values of the parameters **λ**, **α** and **γ**. To compute the probability of observing a given amount of anticipatory licking as a function of the value function or policy, respectively, we used a normal probability density (sigma = 1) centred on the predicted lick frequency (7 Hz × value or policy). Initial examination revealed that sigma = 1 minimized the −LL for all models, but the trends were the same across a range of sigma. The −LL of a given parameterization of the model was computed as the negative sum of log probabilities over trials for all combinations of free parameters. We also computed the Akaike information criterion^{68} (sum of ln(sum(residuals^{2}))), as preferred in some previous work^{69}. The results were consistent and the number of free parameters was equal; thus, we primarily report −LL in the manuscript. For direct comparison, we took the minimum of the −LL for each model (that is, its optimal parameterization) and compared these minima across all animals. To examine the ‘brittleness’ of the model fit, we compared the median −LL across the entire grid search parameter space for each model.
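The likelihood computation can be sketched as below (a simplified illustration of the normal-density scoring; the 7 Hz scaling and sigma = 1 are from the text, the function name is hypothetical):

```python
import math

def neg_log_likelihood(model_outputs, obs_lick_rates, sigma=1.0, scale=7.0):
    """-LL of observed anticipatory lick rates under normal densities
    centred on the predicted rate (7 Hz x value or policy)."""
    nll = 0.0
    for v, x in zip(model_outputs, obs_lick_rates):
        mu = scale * v
        log_p = (-0.5 * math.log(2.0 * math.pi * sigma ** 2)
                 - 0.5 * ((x - mu) / sigma) ** 2)
        nll -= log_p
    return nll

# A model whose predictions match observation exactly attains the floor value
print(round(neg_log_likelihood([0.5, 1.0], [3.5, 7.0]), 3))  # → 1.838
```

Grid search then amounts to evaluating this quantity for every parameter combination and taking the minimum per model.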

#### Estimating PEs from behavioural data

First, we assume that on average the number of anticipatory licks is an unbiased estimate of the underlying policy (the core assumption of the low-parameter models described above). The latency to collect reward can be converted into a performance cost using the same equation (2) described for ACTR. The PE was then computed as in equation (4). A smoothed baseline estimate was calculated by smoothing the PE with a third-order, 41-trial-wide Savitzky–Golay filter, and the baseline-subtracted PE was calculated analogously to equations (4) and (5).

### Histology

Mice were killed by anaesthetic overdose (isoflurane, >3%) and perfused with ice-cold phosphate-buffered saline, followed by paraformaldehyde (4% wt/vol in phosphate-buffered saline). Brains were post-fixed for 2 h at 4 °C and then rinsed in saline. Whole brains were then sectioned (100 μm thickness) using a vibrating microtome (VT-1200, Leica Microsystems). Fibre tip positions were estimated by referencing standard mouse brain coordinates^{70}.

### Statistical analysis

Two-sample, unpaired comparisons were made using Wilcoxon's rank-sum test (MATLAB ranksum); paired comparisons using Wilcoxon's signed-rank test (MATLAB signrank). Multiple comparisons with repeated measures were made using Friedman's test (MATLAB friedman). Comparisons between groups across training were made using two-way ANOVA (MATLAB anova2). Correlations were quantified using Pearson's correlation coefficient (MATLAB corr). Linear regression to estimate the contribution of fibre position to variance in mDA reward signals was fitted using MATLAB fitlm. Polynomial regressions used to fit objective surfaces were third order (MATLAB fit). Errors are reported as s.e.m. Sample sizes (*n*) refer to biological, not technical, replicates. No statistical methods were used to predetermine sample size. Data visualizations were created in MATLAB or GraphPad Prism.

### Reporting summary

Further information on research design is available in the Nature Portfolio Reporting Summary linked to this article.