As AI use cases expand across the social sector, robust evaluation systems are critical to safeguard vulnerable users, improve system performance, and ensure meaningful positive impact. This session addresses the growing need to support development leaders in designing cost-effective AI evaluation pipelines that maximize benefits while minimizing potential harm. It will draw on culturally responsive, context-specific evaluation approaches aimed at improving the safety and reliability of LLMs for diverse, multilingual communities.