import java.net.URL
import scala.xml.XML
val site = new URL("http://michel-daviot.blogspot.fr/")
XML.load(site)
// Exception in thread "main" org.xml.sax.SAXParseException; lineNumber: 7; columnNumber: 265;
// The entity name must immediately follow the '&' in the entity reference.
However, it is quite easy to overcome this limitation with the TagSoup library, which "fixes" the HTML markup so that it can be parsed like XML.
Add this dependency to your pom.xml:
<dependency>
  <groupId>org.ccil.cowan.tagsoup</groupId>
  <artifactId>tagsoup</artifactId>
  <version>1.2.1</version>
</dependency>
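If you build with sbt rather than Maven (an assumption, since only pom.xml is mentioned here), the same coordinates would translate to something like this line in build.sbt:

```scala
// build.sbt: same groupId, artifactId and version as the Maven dependency above
libraryDependencies += "org.ccil.cowan.tagsoup" % "tagsoup" % "1.2.1"
```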
This simple object can then be used to load a URL containing HTML markup as if it were XML.
import java.net.URL
import scala.xml.XML
import org.xml.sax.InputSource
import scala.xml.parsing.NoBindingFactoryAdapter
import org.ccil.cowan.tagsoup.jaxp.SAXFactoryImpl
import java.net.HttpURLConnection
import scala.xml.Node

object HTML {
  // Builds scala.xml nodes from the SAX events emitted by the parser
  lazy val adapter = new NoBindingFactoryAdapter
  // TagSoup's JAXP parser, which tolerates malformed HTML
  lazy val parser = (new SAXFactoryImpl).newSAXParser

  def load(url: URL, headers: Map[String, String] = Map.empty): Node = {
    val conn = url.openConnection().asInstanceOf[HttpURLConnection]
    // Request headers have to be set before the response stream is opened
    for ((k, v) <- headers)
      conn.setRequestProperty(k, v)
    val source = new InputSource(conn.getInputStream)
    adapter.loadXML(source, parser)
  }
}
Note that it also lets you set HTTP headers on the request, for instance to send a cookie or a session ID in order to be logged in.
Example calling code, which lists all HTML links from the loaded page:
import java.net.URL
val site = new URL("http://michel-daviot.blogspot.fr/")
val content = HTML.load(site)
for (
  a <- content \\ "a";
  href = a.attribute("href");
  if href.isDefined
) println(href.get)
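To try the repair step without any network access, the same adapter and parser can be fed a string instead of a connection. The following is only a sketch: the `HtmlString` object and its `parse` method are hypothetical names introduced for illustration, the sample markup is made up, and it assumes TagSoup and scala-xml are on the classpath as above.

```scala
import java.io.StringReader
import org.ccil.cowan.tagsoup.jaxp.SAXFactoryImpl
import org.xml.sax.InputSource
import scala.xml.Node
import scala.xml.parsing.NoBindingFactoryAdapter

// Hypothetical helper (not part of the HTML object above)
object HtmlString {
  // Parse an HTML string through TagSoup, using the same calls as HTML.load
  def parse(html: String): Node = {
    val adapter = new NoBindingFactoryAdapter
    val parser = (new SAXFactoryImpl).newSAXParser
    adapter.loadXML(new InputSource(new StringReader(html)), parser)
  }

  def main(args: Array[String]): Unit = {
    // A bare '&' and unclosed tags: the kind of markup a strict XML parser rejects
    val doc = parse("""<p>Tom & Jerry<br><a href="http://example.com/">a link</a>""")
    // Print the href of every link that has one
    for (a <- doc \\ "a"; href <- a.attribute("href")) println(href.mkString)
  }
}
```

A fresh adapter and parser are created on each call here to keep the sketch free of shared state; the lazy vals used in `HTML` avoid that cost when many pages are loaded from a single thread.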
And as a bonus, here is an object to force proxy settings from the code (of course, you could also set them from the command line).
object SetProxy {
  def apply(proxyConfig: (String, Int)): Unit = {
    val (host, port) = proxyConfig
    // Apply the same proxy to both HTTP and HTTPS connections
    for (protocol <- Seq("http", "https")) {
      System.setProperty(s"$protocol.proxyPort", port.toString)
      System.setProperty(s"$protocol.proxyHost", host)
    }
  }
}
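A short usage sketch, with the object repeated so that the snippet runs on its own. The proxy host `proxy.example.com` and port `8080` are placeholders, and `SetProxyDemo` is a hypothetical name used only for this illustration.

```scala
object SetProxy {
  def apply(proxyConfig: (String, Int)): Unit = {
    val (host, port) = proxyConfig
    for (protocol <- Seq("http", "https")) {
      System.setProperty(s"$protocol.proxyPort", port.toString)
      System.setProperty(s"$protocol.proxyHost", host)
    }
  }
}

object SetProxyDemo {
  def main(args: Array[String]): Unit = {
    // Placeholder proxy address, for illustration only
    SetProxy("proxy.example.com" -> 8080)
    // Read the system properties back to check what was set
    println(System.getProperty("http.proxyHost"))
    println(System.getProperty("https.proxyPort"))
  }
}
```

It is probably safest to call `SetProxy` before opening the first connection. The command-line equivalent is to pass the same properties to the JVM, for example `-Dhttp.proxyHost=proxy.example.com -Dhttp.proxyPort=8080`, plus the `https.` variants.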