Optimizing Quantum Circuit Synthesis Through Reinforcement Learning

In the second part of our series, we delve into our research team's efforts to apply reinforcement learning (RL) for state preparation in quantum circuits. In the first part, we laid the groundwork by discussing fundamental concepts crucial to quantum computation.

Introduction

We find ourselves in a period where quantum computing is restricted to noisy intermediate-scale quantum computers (NISQ). This emerging technology, akin to the initial discovery that striking stones produces sparks, holds the promise of transformative innovations.

Presently, these devices are hindered by a limited number of qubits and short coherence times, making it essential to maximize the potential of NISQ technology. The goal is to find the optimal configuration of quantum gate operations to achieve specific outcomes. However, identifying this configuration poses significant challenges, particularly in executing valuable quantum computations.

Leveraging advancements in machine learning offers a viable path to capitalize on these initial sparks of quantum technology and realize practical benefits.

Reinforcement Learning

Before diving deeper, let's briefly explore the concept of reinforcement learning (RL). It gained popularity in the AI community through notable achievements such as OpenAI Five, which triumphed over professional Dota 2 players, and DeepMind's AlphaGo, which defeated Go champions Lee Sedol and Ke Jie.

While RL is often perceived as a technique primarily for mastering video games, we aspire to demonstrate its applicability in quantum computing, a field seemingly outside the gaming realm.

The core principle of reinforcement learning involves training an agent (represented by a policy) to achieve a defined goal. This process can be understood through two key components: the agent and the environment.

The environment in RL represents the simulated world where the agent operates. The agent interacts with this environment, learning similarly to how humans acquire knowledge through experiences.

At each time step of an episode, the agent observes its surroundings, either partially or fully, to ascertain its current state. It then selects an action from a predefined set; the action alters the environment's state and yields a reward based on the change from the previous state.
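
As a minimal sketch of this observe-act-reward loop (assuming a Gym-style environment with `reset` and `step` methods, which is an interface choice for illustration rather than anything specific to our project):

```python
# Minimal sketch of the observe-act-reward loop for one episode.
# `env` is assumed to expose a Gym-style reset()/step() interface and
# `policy` to map an observed state to a valid action.
def run_episode(env, policy):
    state = env.reset()                 # initial observation
    done, total_reward = False, 0.0
    while not done:
        action = policy(state)          # choose an action from the current state
        state, reward, done, _ = env.step(action)  # environment transitions and rewards
        total_reward += reward
    return total_reward
```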

The agent’s decision-making process is guided by its policy — a probabilistic model that outputs a valid action based on the observed state. In deep reinforcement learning, this policy is often realized through a neural network, which takes the current state as input and generates an action as output.
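
In a deep-learning framework such as TensorFlow (the one used later in this post), a toy policy network might look like the following; the layer sizes and state encoding are illustrative placeholders, not the architecture used in our study:

```python
import tensorflow as tf

# Illustrative policy network: a flattened state vector goes in, a probability
# distribution over the discrete action set comes out.
def build_policy(state_dim: int, num_actions: int) -> tf.keras.Model:
    return tf.keras.Sequential([
        tf.keras.Input(shape=(state_dim,)),
        tf.keras.layers.Dense(64, activation="relu"),
        tf.keras.layers.Dense(64, activation="relu"),
        tf.keras.layers.Dense(num_actions, activation="softmax"),
    ])

policy_net = build_policy(state_dim=16, num_actions=9)
probs = policy_net(tf.random.uniform((1, 16)))   # action probabilities for one state
```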

A simplistic approach might be to opt for the action yielding the highest immediate reward, but this can lead the agent into undesirable situations or local optima. Instead, considering future rewards alongside immediate gains can help the agent learn how to reach its ultimate objective.
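
One standard way to fold future rewards into the learning signal is a discounted return, sketched below (the discount factor is a conventional example value):

```python
# Discounted returns: each step's target includes the rewards that follow it,
# weighted by a discount factor, so the agent values long-term outcomes and
# not just the immediate reward.
def discounted_returns(rewards, gamma=0.99):
    returns, running = [], 0.0
    for r in reversed(rewards):
        running = r + gamma * running
        returns.append(running)
    return list(reversed(returns))

# A small immediate reward followed by a large terminal reward:
print(discounted_returns([0.1, 0.0, 1.0]))   # approximately [1.08, 0.99, 1.0]
```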

After completing an episode, the policy is updated with information about states, actions, and rewards, helping the agent discern which actions are preferable and which to avoid.

One of the major challenges in creating effective RL agents is designing a reward function that accurately aligns with the intended use case. This function acts as the agent's sole supervisor within its environment, providing rewards that guide the agent toward its goals while steering clear of pitfalls.

The RL research landscape is rife with amusing examples of misaligned agents — scenarios that seem humorous from our vantage point but make perfect sense to the agent. For instance:

  • “I connected a neural network to my Roomba to help it navigate without bumping into objects, but it learned to drive in reverse since there were no bumpers there.”
  • “Agent sacrifices itself at level 1 to avoid losing at level 2.”
  • “Agent pauses the game indefinitely to prevent losing.”
  • In a boat racing game, “the RL agent discovers a secluded lagoon where it can continuously knock over three targets, instead of actually finishing the race.”

For a growing collection of literature on reinforcement learning applied in games, including some amusing cases of flawed reward functions, check here. With the foundational understanding of quantum computing and reinforcement learning established, we are now prepared to detail our approach to preparing quantum states using RL.

Our Approach

Having settled on reinforcement learning for optimizing quantum circuit construction, we began investigating action sets, state definitions, candidate reward functions, and which RL algorithms to employ.

We utilized IBM's Qiskit Python library for constructing and visualizing quantum circuits and executing the relevant quantum operations, in conjunction with the QuTiP library. TensorFlow was employed to implement the neural networks that formed the basis for the agents' policies.
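
As a flavor of that toolchain (API details depend on the Qiskit version; this is a minimal sketch rather than our actual environment code):

```python
from qiskit import QuantumCircuit
from qiskit.quantum_info import Statevector, state_fidelity

# Build a small circuit, simulate its statevector, and compare it to a target
# state via fidelity - the basic operations the environment relies on.
qc = QuantumCircuit(2)
qc.h(0)
qc.cx(0, 1)                                  # prepares a Bell state

current = Statevector.from_instruction(qc)   # state produced by the circuit
target = Statevector.from_label("00")        # example target: |00>
print(state_fidelity(current, target))       # 0.5 for this pair
```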

After reviewing current literature on RL algorithms and assessing their performances, we decided to implement Proximal Policy Optimization (PPO) due to its widespread success and applicability. Other options we considered included Projective Simulation, Deep Q Networks, and Asynchronous Actor-Critic Agents (A3C).

Our initial strategy was to use the quantum state of specific qubits as the environmental state and define an action space containing all permissible qubit-gate pairs (e.g., applying an X gate to qubit one as a distinct action from applying it to qubit three).
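
A sketch of that action space, using an example gate set (the list of gates below is illustrative, not the exact set we used):

```python
from itertools import product

from qiskit import QuantumCircuit

# Every allowed (gate, qubit) pairing is its own discrete action, so
# "X on qubit 0" and "X on qubit 2" have different action indices.
GATES = ["x", "h", "t"]
NUM_QUBITS = 3
ACTIONS = list(product(GATES, range(NUM_QUBITS)))   # 9 actions in this example

def apply_action(circuit: QuantumCircuit, action_index: int) -> QuantumCircuit:
    gate, qubit = ACTIONS[action_index]
    getattr(circuit, gate)(qubit)    # e.g. circuit.x(0), circuit.h(2)
    return circuit
```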

The first reward function was based solely on fidelity, a measure of how closely two quantum states overlap (and thus how hard they are to distinguish). Essentially, the greater the overlap between the current state and the target state, the higher the reward. Additionally, to promote the synthesis of smaller circuits, the reward was reduced for each gate employed.
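
A sketch of that first reward function (the penalty weight is an illustrative value, not the one we tuned):

```python
from qiskit.quantum_info import Statevector, state_fidelity

GATE_PENALTY = 0.01   # illustrative weight for the per-gate cost

# Reward grows with the overlap between the circuit's current state and the
# target state, minus a small penalty for every gate already placed.
def fidelity_reward(circuit, target_state):
    current = Statevector.from_instruction(circuit)
    return state_fidelity(current, target_state) - GATE_PENALTY * len(circuit.data)
```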

Although fidelity was effective in determining goal achievement, we quickly realized it was a flawed reward function. This flaw arose from the existence of numerous local maxima within the fidelity function across the discrete gate space. Particularly for simpler circuits, a short-term increase in fidelity often resulted in a higher overall gate count, sometimes leading the circuit to a dead-end.

Our second attempt at a reward function aimed to include a measure of the remaining gates needed to complete the circuit. Two papers stood out in addressing this issue. One (Nielsen et al.) utilized geometric principles to portray the space of quantum operations as a surface, where the optimal circuit represented the shortest path across this surface. The other (Girolami) derived a lower bound for the number of gates required to transition qubits from one state to another.

Unfortunately, both approaches had limitations that hindered their application in our framework. Generating and solving the geodesic equations proved computationally intensive, while Girolami's lower bound was only validated for commuting gates, restricting the universality of the actions.

We tentatively addressed the reward function challenge by introducing a sparse reward scheme, which only provided a positive reward (inversely related to the number of gates utilized) when the final circuit met the goal. At this experimentation stage, we modified the action space to include only distinct gate operations and configured the agent to loop over the qubits, rather than allowing independent selections.

As a result of this adjustment, we included the identity gate in the action space so the agent could 'skip' a qubit if deemed necessary, ensuring consistent operations across different qubits.
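
Putting those changes together, the sparse scheme looks roughly like this (the success threshold is an assumed value, and the helper names are ours):

```python
from qiskit.quantum_info import Statevector, state_fidelity

FIDELITY_THRESHOLD = 0.99   # assumed success criterion

# No reward until the episode ends; on success, the reward shrinks with the
# number of gates used, so shorter circuits score higher.
def sparse_reward(circuit, target_state, done):
    if not done:
        return 0.0
    current = Statevector.from_instruction(circuit)
    if state_fidelity(current, target_state) < FIDELITY_THRESHOLD:
        return 0.0
    return 1.0 / max(len(circuit.data), 1)

# The agent sweeps over the qubits in a fixed order; choosing the identity
# gate lets it effectively skip the qubit it is currently visiting.
def next_qubit(step: int, num_qubits: int) -> int:
    return step % num_qubits
```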

The outcomes of these modifications were encouraging, with the model consistently able to construct optimal simple three-qubit circuits within a reasonable timeframe (e.g., one such circuit was synthesized in approximately 40 training epochs, or about 40 seconds on the testing laptop).

Buoyed by our agent's performance, we sought to assess the model's generalizability. We developed a curriculum of various goal states, ranging from simple one-gate unitary goals to more intricate four-gate unitary targets. Evidence of improvement was clear as the agent trained across different goal sets multiple times, indicating its ability to learn a policy that accounts for multiple goals simultaneously.
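
In code, the curriculum amounts to little more than a loop over goal sets of increasing difficulty; `make_goal` and `train` below are hypothetical stand-ins for project-specific routines:

```python
# Train the same policy on progressively harder targets, carrying the learned
# weights from one stage to the next. `make_goal` and `train` are hypothetical
# placeholders, not functions from our codebase.
curriculum = [make_goal(num_gates=n) for n in (1, 2, 3, 4)]

for goal in curriculum:
    policy = train(policy, goal_state=goal)   # resume from the previous stage's weights
```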

The effectiveness of PPO even without a guiding (dense) reward function was promising, particularly after hyper-parameter optimization, which significantly shortened convergence times (within five seconds for the simplest circuits and within minutes for those later in the curriculum). This improvement in convergence when starting from pre-trained models strongly suggested that our neural network could effectively navigate the complex quantum circuit environment.

Consequently, we decided to experiment with other reward functions, replacing our non-commuting gate set with the commuting IQP gate set. We then utilized Girolami’s lower bound to formulate our new reward function. This model built upon the success of our sparse reward while guiding the agent toward the goal, hopefully avoiding local maxima.
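
Schematically, the shaping idea can be written as a difference of estimated remaining work before and after each gate; `remaining_gates_lower_bound` is a placeholder for the Girolami-style bound, and this potential-difference form is one plausible reading of "guiding the agent toward the goal", not necessarily the exact function we used:

```python
# Step-by-step shaping reward built on an estimated lower bound of the gates
# still needed to reach the target. `remaining_gates_lower_bound` is a
# hypothetical placeholder for the Girolami-style bound described above.
def shaped_reward(prev_state, new_state, target_state):
    before = remaining_gates_lower_bound(prev_state, target_state)
    after = remaining_gates_lower_bound(new_state, target_state)
    return before - after   # positive when the step moves the circuit closer to the goal
```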

The implementation of a step-by-step reward function demonstrated relatively rapid convergence, even for circuits more complex than those used to generate the previous results. The success of this commuting-gate reward function in directing the agent toward goals with swift convergence times represents a robust proof of concept. With the function adapted for non-commuting gates, a broader policy could potentially be developed.

Our journey in training a reinforcement learning agent to synthesize optimal quantum circuits hinged on creating a custom environment for the agent to explore. This environment, which incorporates features like support for various gate sets, consideration of qubit connectivity, and generation of circuits of different sizes, can be found here. All experiments conducted throughout this series utilized this environment.

We believe these results bolster ongoing efforts to discover the optimal quantum circuit construction, often referred to as the holy grail in this field. Our findings suggest that the reward function, serving as a reliable metric for the distance between the current state and the target state, is a crucial component in applying machine learning to the efficient design of quantum circuits. Further enhancements to the cost function could significantly improve our agent’s performance.

We are currently exploring exciting avenues related to this research, including extending this reinforcement learning approach to a continuous control space — shaping the precise pulses used to implement quantum gates in circuits. Stay tuned for more updates!
