October 12, 2022

Marvik Digest #5

By Natalia Cohn

Last month we covered some interesting stories involving multimodal transformers, stable diffusion, multilingual language models and more.

Hugging Face’s new multimodal Transformer model

Great news: the #TF version of the #LayoutLMv3 multimodal #Transformer model is now available on Hugging Face! 🚀

Its simple yet revolutionary architecture improves on its predecessors across many benchmarks. It is the first multimodal Document AI model that does not rely on a pre-trained CNN or Faster R-CNN backbone to extract visual features.

🟢 Main highlights:

📌 One of its biggest advantages is that it is a general-purpose model for both text-centric and image-centric Document AI tasks

📌 It unifies text-centric Transformers with the OCR and visual-centric models used for Document AI tasks

At Marvik, we have used this model for object-detection-related tasks and it has yielded amazing results 🤩

➡️ To access the model: https://bit.ly/3CXm7B8
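
For the curious, here is a toy sketch of what a CNN-free visual front end can look like: the image is cut into fixed-size patches and each flattened patch goes through a single linear projection. This is only an illustration in plain Python, not the Hugging Face implementation; the function names, the 8×8 toy image and the random weights are all made up for the example.

```python
import random


def split_into_patches(image, patch_size):
    """Cut an H x W image (rows of pixels, each pixel a list of channel
    values) into non-overlapping square patches, each flattened to a list."""
    height, width = len(image), len(image[0])
    if height % patch_size or width % patch_size:
        raise ValueError("image sides must be multiples of patch_size")
    patches = []
    for top in range(0, height, patch_size):
        for left in range(0, width, patch_size):
            flat = [value
                    for row in image[top:top + patch_size]
                    for pixel in row[left:left + patch_size]
                    for value in pixel]
            patches.append(flat)
    return patches


def linear_project(patches, weights):
    """Multiply every flattened patch by a (patch_dim x embed_dim) matrix.
    No convolutional backbone is involved, only one matrix product."""
    columns = list(zip(*weights))
    return [[sum(p * w for p, w in zip(flat, column)) for column in columns]
            for flat in patches]


# Toy data (hypothetical): an 8x8 RGB image and random projection weights.
random.seed(0)
toy_image = [[[random.random() for _ in range(3)] for _ in range(8)]
             for _ in range(8)]
patches = split_into_patches(toy_image, patch_size=4)
weights = [[random.uniform(-1, 1) for _ in range(6)]
           for _ in range(len(patches[0]))]
tokens = linear_project(patches, weights)
print(len(tokens), len(tokens[0]))  # 4 visual tokens, 6 dimensions each
```

Assuming the 224×224 input and 16×16 patches of the published base configuration, the same split would give (224 / 16)² = 196 visual tokens of 16 × 16 × 3 = 768 values each.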

Stability AI’s Stable Diffusion

Give me “A corgi with sunglasses driving a Tesla” and you get… 🤔

Generative AI has come a long way. The introduction of #GANs took the #ML space to new heights, but a new development is set to power the next generation of #AI image generation.

🚀 We are talking about Stability AI’s Stable Diffusion 🚀

How does it differ from other diffusion models such as #GLIDE, #DALL·E 2 (OpenAI) and #Imagen (Google)?

📌 Truly free and open source, both models and code

📌 Thanks to latent diffusion, the model can run on a consumer #GPU or even on an Apple M1 chip

This means we can all finally use this powerful technique in our projects and play as much as we want with the amazing capabilities it offers, such as:

📌 Text-to-image generation (similar to #DALL·E)

📌 Super resolution (#Denoising)

📌 Image in-painting (removes items from images)

📌 Image out-painting (extends an image beyond its original borders)

📌 Layout/Segmentation (image generation from a layout or segmentation map)

📌 Class-conditional image generation (generates images of a given class, for example a car)

All this sounds nice, but why is it relevant?

Even though it’s really early in the life of diffusion models, they already perform on par with or better than GANs, one of the strongest options for image generation. Imagine all the possibilities that open up 🤩 🤩

Some ideas that come to mind:

📌 Infinite stock images

📌 Texture generation for games

📌 Artist inspiration for creating art

📌 Logo creation

📌 Fashion and clothing inspiration

📌 Image colorization

At Marvik we have extensive experience using #GAN models and have some very exciting ideas on how to leverage this new era of generative AI 🙌🏻

➡️ To learn more about Stability AI: https://bit.ly/3Br5fBJ

➡️ To access the full paper: https://bit.ly/3QpeV3T

➡️ To access the code: https://bit.ly/3QorcG1
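
To get a feel for why latent diffusion fits on consumer hardware, a quick back-of-the-envelope calculation helps. The numbers below are assumptions taken from the publicly released Stable Diffusion v1 configuration, not from this digest: 512×512 RGB images, an autoencoder that downsamples each side by 8, and 4 latent channels.

```python
def tensor_size(height, width, channels):
    """Number of values in a height x width x channels tensor."""
    return height * width * channels


# Assumed Stable Diffusion v1 settings (not stated in the digest itself).
IMAGE_SIDE = 512        # output image is 512 x 512 pixels
IMAGE_CHANNELS = 3      # RGB
DOWNSAMPLE_FACTOR = 8   # the autoencoder shrinks each side by 8
LATENT_CHANNELS = 4     # channels of the latent representation

pixel_values = tensor_size(IMAGE_SIDE, IMAGE_SIDE, IMAGE_CHANNELS)
latent_side = IMAGE_SIDE // DOWNSAMPLE_FACTOR
latent_values = tensor_size(latent_side, latent_side, LATENT_CHANNELS)

print(pixel_values)                   # 786432 values in pixel space
print(latent_values)                  # 16384 values in latent space
print(pixel_values // latent_values)  # 48
```

Under these assumptions, the denoising loop works on roughly 48 times fewer values than it would in pixel space. This is only a rough illustration: it ignores the cost of the autoencoder and the text encoder.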

Amazon’s new AlexaTM 20B

Another breakthrough in the field of #NLP (#naturallanguageprocessing) 🚀

Amazon’s new multilingual sequence-to-sequence language model (AlexaTM 20B) beats GPT-3 and other decoder-only language models on several NLP tasks 🤩

🟢 Highlights

📌 Achieves state-of-the-art performance on 1-shot summarisation tasks and outperforms the much larger 540-billion-parameter #PaLM decoder model

📌 In the zero-shot setting, it even outperforms GPT-3 on the #SuperGLUE and #SQuADv2 datasets

📌 It also offers state-of-the-art performance on multilingual tasks like #XNLI, #XCOPA, #PAWS-X and #XWinograd

➡️ GitHub repository: https://bit.ly/3QDOuHV

➡️ More on AlexaTM 20B: https://bit.ly/3RY7qSP

OpenAI’s Whisper

🚀 Another milestone in the realm of speech recognition 🚀

OpenAI is open-sourcing #Whisper, an automatic speech recognition (#ASR) system that approaches human-level robustness and accuracy on English speech recognition.

📌 Trained on 680,000 hours of multilingual and multitask supervised data collected from the web

📌 Enables transcription in multiple languages and translation from those languages into English

📌 The use of such a large and diverse dataset leads to improved robustness to accents, background noise and technical language

📌 About ⅓ of the dataset is non-English

📌 ASR shows strong results for about 10 languages

📌 Models & inference code are open-sourced

➡️ More on Whisper here: https://bit.ly/3R9tvgm
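
If you want to try it, here is a minimal sketch using the open-sourced Python package. It assumes the openai-whisper package and ffmpeg are installed; the helper names and the file name in the usage note are ours, purely for illustration. The import happens inside the function so the snippet still loads on a machine without the package.

```python
def whisper_task(to_english: bool) -> str:
    """Pick the Whisper task: plain transcription, or translation into English."""
    return "translate" if to_english else "transcribe"


def speech_to_text(audio_path: str, model_name: str = "base",
                   to_english: bool = False) -> str:
    """Transcribe an audio file with Whisper, optionally translating to English."""
    import whisper  # lazy import; assumes `pip install -U openai-whisper`

    model = whisper.load_model(model_name)  # downloads the checkpoint on first use
    result = model.transcribe(audio_path, task=whisper_task(to_english))
    return result["text"].strip()
```

For example, speech_to_text("interview.mp3", to_english=True) would return an English translation of a (hypothetical) non-English recording. Larger checkpoints such as "small" or "medium" trade speed for accuracy.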

Size recommendation for e-commerce fashion

To all online shoppers out there, have you ever struggled to find your perfect fit? 🤔

In the global fashion market, the sizing of garments tends to vary from brand to brand and even within a single brand’s collection. Shoppers must rely on sizing charts, product descriptions and images 👚👖👔. For users, this is a great challenge, since the human body, with its diversity of shapes and dimensions, does not follow a standard pattern 🧍‍♂️🧍. This often leads to over-ordering, returns and purchases that don’t meet consumers’ needs.

💡 As e-commerce becomes the predominant form of fashion retail, there is an urgent need for fashion brands to solve this challenge, creating experiences that remove customer friction and make shopping fast and seamless.

🟢 At Marvik we are working with #deeplearning and #computervision techniques to build a size recommendation system that lets e-commerce buyers learn their body measurements and their recommended clothing size simply by uploading a pair of pictures 👩🏻🧔🏽‍♂️

We are reaching out to our community to ask for your support on this exciting project 🙏🏻

➡️ To participate in this initiative, please fill out this form: https://bit.ly/3dNWBo1

