Architecture document

Overview

The GLADOS backend is written in Python and has direct access to the Kubernetes (k8s) infrastructure. Its primary function is to manage experiments through its Flask endpoints.

The components that run users' experiments are called Runners. Each Runner is spawned as a k8s Job, a resource that creates a pod to run a specific command within its container(s); the pod terminates on its own once the command finishes execution.
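
To make the Job concept concrete, here is a minimal sketch of a Job manifest built as a plain Python dictionary. Only the `batch/v1` Job structure is standard Kubernetes; the function name `build_runner_job`, the image, the command, the naming pattern, and the `backoffLimit` choice are hypothetical and are not taken from `spawn_runner.py`.

```python
def build_runner_job(experiment_id: str) -> dict:
    """Return a Kubernetes Job manifest (as a dict) for one Runner.

    All concrete values (image, command, names) are illustrative assumptions.
    """
    return {
        "apiVersion": "batch/v1",
        "kind": "Job",
        # Hypothetical naming scheme; real Job names must be valid DNS labels.
        "metadata": {"name": f"runner-{experiment_id}"},
        "spec": {
            # Assumption: a failed experiment is not retried automatically.
            "backoffLimit": 0,
            "template": {
                "spec": {
                    # Job pods must use "Never" or "OnFailure".
                    "restartPolicy": "Never",
                    "containers": [
                        {
                            "name": "runner",
                            "image": "example/runner:latest",  # hypothetical image
                            # Hypothetical entry point receiving the experiment id.
                            "command": ["python", "runner.py", experiment_id],
                        }
                    ],
                }
            },
        },
    }
```

Because the pod's lifetime is tied to the single command, the Runner needs no separate shutdown logic: when the command exits, the pod completes.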

Components map

graph TD

    subgraph external["Externals"]
        mail["Google Mail"]
        auth_g["Google Auth"]
        auth_git["Github OAuth"]
    end

    subgraph k8s["Kubernetes (Internal)"]
        runner["Runner"]

        subgraph db["Database"]
            mongo["MongoDB"]
        end

        subgraph frontend["Frontend"]
            web["Browser Interface"] --> nextJs["Next.js server"]
            cli["Command Line Interface"] --> nextJs
            nextJs <-->|"Auth + Experiment data"| mongo
        end

        subgraph backend["Backend"]
            routes["Flask server (app.py)"] <-->|"Experiment data"| mongo
            routes --> spawner["Job spawner (spawn_runner.py)"]
        end
        nextJs -->|"Exp start request"| routes
        runner <-->|"Experiment data"| routes
        runner -->|"spawned by"| spawner
    end
    style k8s fill:#0000ff,fill-opacity:0.1

    runner -->|"send mail"| mail
    nextJs -->|"Auth"| auth_g
    nextJs -->|"Auth"| auth_git

Data flow narrative: Running a successful experiment

0. Naming Convention

Because the architecture should not depend on implementation details, the narrative uses the following names for the components within the system:

- Frontend -> Next.js server, handles general application logic (auth, uploads, etc.)
- Backend -> Flask server, handles core business logic
- Database -> MongoDB, handles all data storage
- Runner -> Virtual container, runs users' code and collects results

1. Initial Artifact Upload

An authenticated user wishes to start an experiment. From the browser or command line interface, the user first uploads the code they want to run by making a request to the Frontend server.

The Frontend returns the file id of the uploaded artifact, to be used in the experiment declaration. Before actually uploading, the Frontend performs a hash check against the Database to avoid keeping multiple copies of the same code artifact. If the artifact already exists and is owned by the user, the server reuses the existing file id instead of storing another copy.

sequenceDiagram
    actor User as User (Browser / CLI)
    participant Frontend
    participant Database

    User->>Frontend: Upload artifact (code)

    Frontend->>Database: Hash check (artifact hash + user ID)

    alt Artifact exists and is owned by user
        Database-->>Frontend: Return existing file ID
    else Artifact not found or not owned by user
        Database-->>Frontend: No match
        Frontend->>Database: Store new artifact
        Database-->>Frontend: Return new file ID
    end

    Frontend-->>User: Return file ID
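
The hash check above can be sketched with an in-memory stand-in for the Database. This is an illustration under stated assumptions, not the Frontend's actual code: the class `ArtifactStore`, the use of SHA-256, and the UUID-based file ids are all hypothetical choices.

```python
import hashlib
import uuid


class ArtifactStore:
    """In-memory stand-in for the Database's artifact storage (hypothetical)."""

    def __init__(self) -> None:
        # Maps (owner id, content hash) to the file id of the stored artifact.
        self._by_owner_and_hash: dict[tuple[str, str], str] = {}
        # Maps file id to the raw artifact bytes.
        self._blobs: dict[str, bytes] = {}

    def upload(self, user_id: str, content: bytes) -> str:
        """Store the artifact unless this user already owns an identical one."""
        digest = hashlib.sha256(content).hexdigest()  # assumption: SHA-256
        key = (user_id, digest)
        existing = self._by_owner_and_hash.get(key)
        if existing is not None:
            # Same content, same owner: reuse the existing file id.
            return existing
        file_id = uuid.uuid4().hex  # assumption: random ids
        self._blobs[file_id] = content
        self._by_owner_and_hash[key] = file_id
        return file_id
```

Keying the lookup on both the hash and the owner mirrors the "exists and is owned by the user" branch: two users uploading identical code each get their own file id.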

2. Dispatching Experiment

After the user declares the experiment and proceeds with dispatch, the Frontend server uploads the experiment declaration to the Database, then calls on the Backend to start the experiment.

sequenceDiagram
    actor User as User (Browser / CLI)
    participant Frontend
    participant Database
    participant Backend

    User->>Frontend: Dispatch experiment (declaration + file ID)
    Frontend->>Database: Store experiment declaration
    Database-->>Frontend: Acknowledge
    Frontend->>Backend: Start experiment
    Backend-->>Frontend: Acknowledge
    Frontend-->>User: Experiment dispatched
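
The ordering in this step (store first, then start) can be sketched as follows. The function `dispatch_experiment` and its two callables are hypothetical placeholders for the Database write and the Backend request. The sketch also assumes that the Frontend passes only the experiment id to the Backend, which the source does not state explicitly.

```python
from typing import Callable


def dispatch_experiment(
    declaration: dict,
    store_declaration: Callable[[dict], str],
    start_experiment: Callable[[str], bool],
) -> str:
    """Persist the declaration, then ask the Backend to start the experiment.

    store_declaration: writes to the Database and returns the experiment id.
    start_experiment: calls the Backend and returns True when acknowledged.
    """
    # The declaration must be stored before the Backend is contacted, since
    # the experiment data is later looked up by experiment id.
    experiment_id = store_declaration(declaration)
    if not start_experiment(experiment_id):
        raise RuntimeError(
            f"Backend did not acknowledge experiment {experiment_id}"
        )
    return experiment_id
```

Injecting the two operations as callables keeps the sketch independent of any particular database driver or HTTP client.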

3. Running the Experiment and Collecting Results

The Backend, upon receiving the request to start the experiment, will spawn a child process to create the Runner and give it the experiment id to be run.

The Runner, upon starting, uses the experiment id to request the full experiment data (code artifacts and experiment declaration) from the Backend. Next, the Runner performs the necessary setup steps and runs the code on every permutation of the hyperparameters that the user specified. Throughout its lifespan, the Runner periodically sends updates to the Backend, which forwards them to the Database. The Frontend actively watches for these changes in order to deliver live experiment updates.

When the Runner finishes running all hyperparameter permutations, it packages the results and sends them to the Backend along with a final update indicating that the experiment has concluded. The Backend writes the data to the Database, from which the Frontend automatically pulls it. Finally, the Runner emails the user via the Google Mail API to notify them that the experiment has completed.

sequenceDiagram
    actor Mail as User's Inbox
    participant MailApi as Google Mail API
    participant Runner
    participant Backend
    participant Database
    participant Frontend
    actor User as User (Browser / CLI)

    Backend->>Runner: Spawn with experiment ID

    Runner->>Backend: Request experiment data (experiment ID)
    Backend->>Database: Pull experiment data
    Database-->>Backend: Return artifacts + declaration
    Backend-->>Runner: Return artifacts + declaration

    Note over Runner: Setup

    loop For each hyperparameter permutation
        Note over Runner: Execute code
        Runner->>Backend: Periodic update
        Backend->>Database: Forward update
        User->>Frontend: Request update
        Frontend->>Database: Poll for changes
        Database-->>Frontend: Return update
        Frontend-->>User: Return update 
    end


    Runner->>Backend: Final result + experiment concluded
    Backend->>Database: Write final result
    Runner->>MailApi: Send Mail Request
    MailApi->>Mail: Send Mail
    Note over Runner: Terminate


    User->>Frontend: Request update
    Frontend->>Database: Poll for changes
    Database-->>Frontend: Return final result
    Frontend-->>User: Return update
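
The Runner's main loop can be sketched as below. This is a simplified illustration, not the Runner's actual code: it assumes that "all possible permutations" means the Cartesian product of the declared values, it sends one update per completed trial rather than on a timer, and the names `permutations_of`, `run_experiment`, `run_trial`, and `send_update` are hypothetical.

```python
import itertools
from typing import Any, Callable, Iterator


def permutations_of(
    hyperparameters: dict[str, list[Any]],
) -> Iterator[dict[str, Any]]:
    """Yield one dict per combination of the declared hyperparameter values."""
    names = list(hyperparameters)
    value_lists = (hyperparameters[name] for name in names)
    for values in itertools.product(*value_lists):
        yield dict(zip(names, values))


def run_experiment(
    hyperparameters: dict[str, list[Any]],
    run_trial: Callable[[dict[str, Any]], Any],
    send_update: Callable[[dict[str, Any]], None],
) -> list[dict[str, Any]]:
    """Run every permutation, reporting progress and a final result."""
    results = []
    for index, params in enumerate(permutations_of(hyperparameters), start=1):
        results.append({"params": params, "output": run_trial(params)})
        # Stand-in for the periodic update sent to the Backend.
        send_update({"status": "running", "completed": index})
    # Final update: packaged results plus the "experiment concluded" signal.
    send_update({"status": "done", "results": results})
    return results
```

For example, `run_experiment({"lr": [0.1, 0.01], "batch": [16, 32]}, train, post_update)` would call the hypothetical `train` four times and `post_update` five times (four progress updates and one final update).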