Skip to main content

Pair programming with AI

AI tools such as ChatGPT and GitHub Copilot are revolutionizing the way we write code.

Josh Puetz

Software Engineer - Donut

Schedule Entry

Attendees

  • Reed
RelevancyInteresting
93

Notes

ChatGPT

Large Language Model (LLM) Tokens

  • Cat
  • Dog
  • Awake
  • asleep Relationship between tokens
  • Cat
  • Objects: (cat) -> (dog)
  • Language: (cat) -> (katze)
  • Concepts: (cat is awake) -> (cat is sleeping)
  • More complicated concepts: (cat is immature) -> (cat is mature)

Atom model is like a heatmap of relationship between position of electron and proton LLM is like the atom model for probability space of relationships between tokens Generative Pre-trained Transformer (GPT) A generative model is a function that can take a structured relationship of tokens and output a structured relationship of tokens. Exchanges with GhatGPT:

Me: I just got a new dog!

Me: I just got a new dog! ChatGPT: Congratulations! Me: what’s a good name?

Entire conversation is appended to each interaction.

What is it good for?

  • Planning
  • Code generation
  • Code explanation

Explaining code: FizzBuzz

  • Gave line-by-line explanation

Refactoring code: FizzBuzz in one line

  • Wrote a long line that went off the screen

Writing tests: wrote a passable RSpec test. Limitations:

  • Incorrect code - no glaring errors
  • Unoptimized code - not really efficient
  • Expensive

Github Copilot

Tokens are language syntax. Makes code suggestions. It can suggest an entire class if you prompt it with class Post. Doesn’t write great tests on its own. It needs some prompting. Writing comments is hit-or-miss. Limitations are similar to ChatGPT.

  • Smaller suggestion scope than ChatGPT.
  • Copilot generally only tries to complete the line you are writing.
  • Limited IDEs
  • Limited languages
  • GitHub lock-in
  • Users accepted on average 26% of all completions shown by Copilot

Where do they get training data?

  • Commoncrawl.org data
  • WebText2
  • Books1 and Books2 book data sets
  • Wikipedia
  • 159 GB of Python code from 54 million GitHub repos
  • You (and your input)

Training data sources are opaque. StackOverflow banned submissions. Bias. Duty to disclose. Liability. IP. Licensing. Best Practices

  • Review code for errors
  • Understand how the code works
  • Be aware of legal and ethical concerns
  • Disclose your usage
  • Opt out of usage

References: