

Monday, January 17, 2022

Import AI 280: Why bigger is worse for RL; AI-generated Pokemon; real-world EfficientNet


Use an AI to generate a Pokemon in two (2!) clicks:
Here’s a fun Colab notebook from Max Woolf (@minimaxir) that lets you use AI to dream up some Pokemon in a couple of clicks (and with a few minutes of waiting). This isn’t remarkable – in recent years, AI generation stuff has got pretty good. What is remarkable is the usability. Two clicks! A few years ago you’d need to do all kinds of bullshit to get this to work – download some models on GitHub, get it to run in your local environment, make sure your versions of TF or PyTorch are compatible, etc. Now you just click some buttons and a load of stuff happens in the browser then, kabam, hallucinated pokemon.

Things that make you go ‘hmmm’: This tech is based on ruDALL-E, an open source Russian version of OpenAI’s ‘DALL-E’ network.
  I think we’ve all rapidly got used to this. This is not normal! It is surprising and exciting!
  Check out the Colab notebook here (Google Colab).
  Follow Max on Twitter here and thank him for making this cool tool!

####################################################

Uh-oh: The bigger your RL model, the more likely it is to seek proxy rather than real rewards:
…Think RL gets better as you scale up models? Hahahah! NOT AT ALL!…
In the past couple of years, big models have become really useful for things ranging from text processing to computer vision to, more recently, reinforcement learning. But these models have a common problem – as you scale up the size of the model, its good capabilities get better, but so do its bad ones.
  For example, if you increase the size of a language model, it’ll generate more toxic text (rather than less) without interventions (see: A General Language Assistant as a Laboratory for Alignment). New research from Caltech and UC Berkeley shows how this same phenomenon shows up in reinforcement learning agents as well. In tests across a few distinct RL domains, they find that “As model size increases, the proxy reward increases but the true reward decreases. This suggests that reward designers will likely need to take greater care to specify reward functions accurately and is especially salient given the recent trends towards larger and larger models”.

What they did: They tested out a few different reinforcement learning agents on four different environments – an Atari game called Riverraid, a glucose monitoring system, a traffic control simulation, and a COVID model where the RL agent dials social distancing measures up and down. In all cases, they found that the “model’s optimization power often hurts performance on the true reward”.

What can we do? Most of this behavior relates to objective design – give an AI the wrong objective function and it’ll optimize its way to success there, while ignoring side effects (e.g., if you reward an AI for reducing the rate of defects on a factory production line to zero, it might just work out how to stop the factory line and therefore eliminate all defects – along with your business). One way to mitigate this is to have a baseline policy that humans have verified as having the right goal, then build some software to spot deltas between the RL policy and the idealized baseline policy.
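  Here’s a minimal sketch of that delta-spotting idea, written for illustration rather than taken from the paper: it flags states where a proxy-trained policy’s action distribution drifts far (measured by KL divergence) from a human-verified baseline policy. The policy callables and the threshold are hypothetical placeholders.

import numpy as np

def kl_divergence(p, q, eps=1e-8):
    # KL(p || q) between two discrete action distributions.
    p = np.asarray(p, dtype=float) + eps
    q = np.asarray(q, dtype=float) + eps
    p, q = p / p.sum(), q / q.sum()
    return float(np.sum(p * np.log(p / q)))

def flag_anomalous_states(states, learned_policy, trusted_policy, threshold=0.5):
    # learned_policy and trusted_policy are hypothetical callables that map a
    # state to a probability distribution over actions; threshold is a tunable
    # cutoff for how much drift from the trusted baseline we tolerate.
    flagged = []
    for state in states:
        divergence = kl_divergence(trusted_policy(state), learned_policy(state))
        if divergence > threshold:
            flagged.append((state, divergence))
    return flagged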
  This kind of works – in tests, the detectors get anywhere between 45% and 81% accuracy at separating anomalous from non-anomalous behaviors. But it certainly doesn’t work well enough to make it easy to deploy this stuff confidently. “Our results show that trend extrapolation alone is not enough to ensure the safety of ML systems,” they write. “To complement trend extrapolation, we need better interpretability methods to identify emergent model behaviors early on, before they dominate performance”.
  Read more: The Effects of Reward Misspecification: Mapping and Mitigating Misaligned Models (arXiv).

####################################################

SCROLLS: A new way to test how well AI systems can understand big chunks of text:
…Now that AIs can write short stories, can we get them to understand books?…
Researchers with Tel-Aviv University, the Allen Institute for AI, IBM Research, and Meta AI have built ‘SCROLLS’, a way to test how well AI systems can reason about long texts. SCROLLS incorporates tasks ranging from summarization to question answering and natural language inference, as well as multiple distinct domains including transcripts, TV shows, and scientific articles. “Our experiments indicate that SCROLLS poses a formidable challenge for these models, leaving much room for the research community to improve upon,” the authors write.

How SCROLLS works: The benchmark has mostly been created via curation, consisting of 7 datasets that reward models that can contextualize information across different sections of their inputs and process long-range dependencies.

The datasets: SCROLLS incorporates GovReport (summarization of reports addressing various national policy issues), SummScreenFD (summarization of TV shows, like Game of Thrones), QMSum (summarization of meeting transcripts), Qasper (question answering over NLP papers), NarrativeQA (question answering about entire books from Project Gutenberg), QuALITY (multiple choice question answering about stories from Project Gutenberg), and Contract NLI (natural language inference dataset in the legal domain).
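  If you want to poke at the data yourself, the tasks can be pulled via the Hugging Face ‘datasets’ library. The sketch below assumes the benchmark is mirrored on the Hub under a ‘tau/scrolls’ namespace with per-task configs and ‘input’/‘output’ fields – those names are assumptions for illustration, so check the SCROLLS site for the canonical download if they don’t match.

from datasets import load_dataset

# Assumed hub path and config name; swap "gov_report" for another task such as
# "qasper" or "narrative_qa" to inspect a different dataset.
gov_report = load_dataset("tau/scrolls", "gov_report", split="validation")

example = gov_report[0]
print(len(example["input"].split()), "whitespace-separated tokens in the source document")
print(example["output"][:300])  # the reference summary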

How hard is SCROLLS? The authors test out two smart baselines (BART, and a Longformer Encoder-Decoder (LED)) and one dumb baseline (a basic pre-written heuristic). Based on the results, this seems like a really challenging benchmark – an LED baseline with a 16,384-token input length gets okay results, though BART gets close to it despite being limited to 1,024 tokens. This suggests two things: a) BART is nicely optimized, and b) it’s not entirely clear the tasks in SCROLLS truly test for long-context reasoning. “Our experiments highlight the importance of measuring not only whether an architecture can efficiently process a long language sequence, but also whether it can effectively model long-range dependencies,” they write.
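  To make that input-length gap concrete, here’s a minimal sketch (not the paper’s evaluation code) that feeds the same long document to both kinds of baseline. The public checkpoints facebook/bart-base and allenai/led-base-16384, and the input file name, are stand-ins chosen for illustration rather than the exact models the authors trained.

from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

long_document = open("gov_report_example.txt").read()  # hypothetical long input

for name, max_len in [("facebook/bart-base", 1024), ("allenai/led-base-16384", 16384)]:
    tokenizer = AutoTokenizer.from_pretrained(name)
    model = AutoModelForSeq2SeqLM.from_pretrained(name)
    # Truncation enforces each model's maximum input length, so the BART-style
    # baseline only ever sees the first 1,024 tokens of the document.
    inputs = tokenizer(long_document, truncation=True, max_length=max_len, return_tensors="pt")
    summary_ids = model.generate(**inputs, max_new_tokens=256, num_beams=4)
    print(name, "consumed", inputs["input_ids"].shape[1], "tokens")
    print(tokenizer.decode(summary_ids[0], skip_special_tokens=True)[:200])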

Why this matters: “Contemporary, off-the-shelf models struggle with these tasks”, the researchers write. In recent years, many machine learning benchmarks have been saturated within months of being released; how valuable SCROLLS turns out to be will be a combination of its hardness and its longevity. If SCROLLS gets solved soon, that’d indicate that AI systems are getting much better at reasoning about long-range information – or it could mean the SCROLLS tasks are bugged and the AI systems have found a hack to get a decent score. Pay attention to the SCROLLS leaderboard to watch progress here.
  Read more: SCROLLS: Standardized CompaRison Over Long Language Sequences (arXiv).
  Check out the leaderboard here.

####################################################

EfficientNet: Surprisingly good for solar panel identification:
…UC Berkeley project shows how easy fine-tuning is…
Some UC Berkeley researchers have built a small, efficient model for detecting solar panels. Their system, HyperionSolarNet, is an EfficientNet-B7 model finetuned from ImageNet onto a collection of 1,983 satellite images of buildings, labeled according to whether or not they contain solar panels. The resulting model gets an aggregate precision of 0.96 (though with lower accuracy when labeling the presence of a solar panel, indicating a propensity for false positives) when evaluated on a held-out test set.
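  As a rough illustration of how routine this recipe has become, here’s a minimal fine-tuning sketch in the spirit of HyperionSolarNet (not the authors’ code): load an ImageNet-pretrained EfficientNet-B7 from torchvision and swap the classifier head for a two-class solar-panel output. The freezing strategy, hyperparameters, and the train_loader are hypothetical.

import torch
import torch.nn as nn
from torchvision.models import efficientnet_b7, EfficientNet_B7_Weights

# Start from ImageNet weights, then replace the final linear layer with a
# two-class head (solar panel / no solar panel).
model = efficientnet_b7(weights=EfficientNet_B7_Weights.IMAGENET1K_V1)
model.classifier[1] = nn.Linear(model.classifier[1].in_features, 2)

# Optionally freeze the convolutional backbone and train only the new head.
for param in model.features.parameters():
    param.requires_grad = False

optimizer = torch.optim.Adam(model.classifier.parameters(), lr=1e-4)
loss_fn = nn.CrossEntropyLoss()

# train_loader is a hypothetical DataLoader over labeled satellite tiles.
# for images, labels in train_loader:
#     optimizer.zero_grad()
#     loss = loss_fn(model(images), labels)
#     loss.backward()
#     optimizer.step()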

Why this matters: Last week, we wrote about how you can build a classifier from scratch and beat a finetuning approach. This paper shows that finetuning can also work quite well for specific use-cases. It also, implicitly, highlights how fine-tuning has gone from something of an arcane science to something pretty reliable and well understood, forecasting a future where there are as many classifiers in the world as there are things to classify.
  Read more: HyperionSolarNet: Solar Panel Detection from Aerial Images (arXiv).

####################################################

Tech Tales:
The Last Things
[A morgue in Detroit, 2035]

“When someone dies and gasps, are they just trying to get the last gasp of being alive?” asked the robot.

The morgue manager stared at the corpse, then at the robot. “I don’t know,” he said. “That’s a good question”.

“And when they know they are going to die, how do they save their information?” asked the robot.

“For example, I would send a zip of my stored data, as well as a copy of my cortical model, to a repository, if I knew I was about to be decommissioned or was in danger,” said the robot.

“Most people don’t bother,” said the morgue manager. “My mother, for instance. When she was dying I asked her to write down some of her memories for me and my family, but she didn’t want to.”

“Why?”

“I think she was mostly concerned with experiencing her life, since she knew it was ending. She took trips while she was still mobile. Then, towards the end, she focused on eating her favorite foods and seeing her friends.”

“And did you learn anything about life from seeing her die?” asked the robot.

“Not particularly,” said the morgue manager. “Besides that life seems to become more valuable, the less you know you have of it.”

Things that inspired this story: A long conversation with someone who worked as a crisis therapist about the nature of death and belief; thinking about the differences between how real and synthetic intelligences may approach the concept of death.




Jack Clark, Khareem Sudlow